How can we best use crowdsourcing to perform a subjective labeling task with low inter-rater agreement? We have developed a framework for debugging this type of subjective judgment task and for improving label quality before the crowdsourcing task is run at scale. Our framework iteratively varies characteristics of the work, assesses the reliability of the workers, and refines the task design by disaggregating labels into components that workers find less subjective, thereby potentially improving inter-rater agreement. A second contribution of this work is the introduction of a technique, Human Intelligence Data-Driven Enquiries (HIDDEN), that uses Captcha-inspired subtasks to evaluate worker effectiveness and reliability while also producing useful results and enhancing task performance. HIDDEN subtasks pivot around the same data as the main task, but ask workers to perform less subjective judgment subtasks that yield higher inter-rater agreement. To illustrate our framework and techniques, we discuss our efforts to label high-quality social media content, with the ultimate aim of identifying meaningful signal within complex results.
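To make the agreement statistic concrete, the sketch below computes Cohen's kappa, a standard chance-corrected measure of inter-rater agreement between two raters. The rater labels and the `cohens_kappa` helper are illustrative assumptions, not drawn from the paper; the example simply contrasts a subjective "high quality?" judgment (near-chance agreement) with a less subjective subtask (perfect agreement).

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: (observed - expected) / (1 - expected) agreement."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed: fraction of items where the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected: agreement by chance, from each rater's label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b.get(label, 0)
                   for label in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical subjective judgment: "is this post high quality?"
subjective = cohens_kappa(
    ["yes", "no", "yes", "no", "yes", "no"],
    ["no", "no", "yes", "yes", "yes", "yes"])
print(round(subjective, 2))  # 0.0: agreement no better than chance

# Hypothetical objective subtask: "what language is this post in?"
objective = cohens_kappa(
    ["en", "en", "es", "en", "fr", "es"],
    ["en", "en", "es", "en", "fr", "es"])
print(round(objective, 2))  # 1.0: perfect agreement
```

In the framework's terms, a disaggregated or HIDDEN subtask is one engineered to move kappa from the first regime toward the second.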