IMS-Microsoft Research Workshop: Foundations of Data Science – False Discovery Rates – a new deal

Session Chair Intro: Statistical and Computational Challenges in Biology
Rafael Irizarry, Harvard University

Matthew Stephens, University of Chicago
False Discovery Rates – a new deal

False Discovery Rate (FDR) methodology, first put forward by Benjamini and Hochberg and further developed by many authors, including Storey, Tibshirani, and Efron, is now one of the most widely used statistical methods in large-scale scientific data analysis, particularly in genomics. A typical genomics workflow consists of (i) estimating thousands of effects and their associated p values, and (ii) feeding these p values to software (e.g. the widely used qvalue package) to estimate the FDR for any given significance threshold. In this talk we take a fresh look at this problem and highlight two deficiencies of this standard pipeline that we believe could be improved. First, current methods, being based directly on p values (or z scores), fail to fully account for the fact that some measurements are more precise than others. Second, current methods assume that the least significant p values (those near 1) are all null, an assumption that initially appears intuitive but will not necessarily hold in practice. We suggest simple approaches to address both issues, and demonstrate the potential for these methods to increase the number of discoveries at a given FDR threshold. We also discuss the connection between this problem and shrinkage estimation, and problems involving sparsity more generally.
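
For readers unfamiliar with the standard pipeline the abstract refers to, the following is a minimal sketch of step (ii) using the classic Benjamini-Hochberg step-up adjustment. It is not the qvalue package and not the new approach proposed in the talk. The function name `bh_adjust` and the example p values are made up for illustration.

```python
import numpy as np

def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p values (hypothetical helper)."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)                      # indices that sort p ascending
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity: each adjusted value is the minimum of the
    # scaled values at its own rank and all larger ranks.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)
    return adjusted

# Illustrative p values only (not data from the talk).
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
adj = bh_adjust(pvals)
discoveries = adj <= 0.05                      # tests declared significant at FDR 0.05
print(int(discoveries.sum()), "discoveries")
```

Note that this procedure sees only the p values: two effects with the same p value are treated identically even if one was measured far more precisely than the other, which is the first deficiency the abstract describes.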

Rafael Irizarry and Matthew Stephens
Harvard University, University of Chicago