Statistical inference on reproducibility in high-throughput experiments
Abstract
Results in high-throughput genomics are known to vary widely across independent replicate studies. For this reason, the formal assessment of the agreement of results for many hypotheses across replicate studies has been a burgeoning area of research in statistical genomics. Hypotheses with consistent results are called reproducible, while those without consistency are called irreproducible. Reproducibility is critical in experimental research, as it supports the validity of findings. In this dissertation, we devise three methods for assessing the reproducibility of results from high-throughput genomic studies, each with advantages under certain settings.
First, we observe that many existing approaches to assessing the reproducibility of results from two replicate high-throughput genomics studies either depend on strict parametric assumptions about the available summary statistics or fail to consider the consistency of a reproducible signal across experiments in addition to its strength. Motivated by \cite{philtron2018maximum}, we introduce a function based on the rankings of summary statistics from each experiment to define a notion of reproducibility and identify reproducible hypotheses. The proposed nonparametric statistic takes into account both the signal strength and the consistency of results. By examining the geometry of the space of ranks of summary statistics and exploiting the negative association dependence structure of the ranks, we introduce a novel procedure for identifying reproducible findings while controlling the false discovery rate (FDR) under relatively mild assumptions. The theoretical FDR results are validated through simulations, which also show the method to be more powerful than existing procedures. Finally, the procedure is applied to two large-scale TWAS datasets, uncovering reproducible features.
Second, we observe that existing methods for assessing the reproducibility of high-throughput studies ignore the known group structures of genetic features, such as transcripts belonging to the same gene or genes belonging to the same pathway. Motivated by \cite{li2011measuring} and \cite{Liu2016ANATMTOGH}, we present an empirical Bayesian framework for reproducibility that incorporates this group structure. We also introduce algorithms for testing reproducibility at the hypothesis and group levels that maintain control of the posterior FDR. A data-driven estimation procedure based on the EM algorithm is then proposed to enable the application of these algorithms when the parameters they rely on are unknown. In simulation, we show that including the group structure in the hypothesis-level procedure leads to superior power and FDR control compared to more naive methods, and that the group-level procedure outperforms methods that rely on aggregation prior to analysis. The proposed procedures enable researchers to integrate known group structure information into the reproducibility problem, yielding higher-quality results.
Third, while there is a substantial body of literature on analyzing reproducibility across two replicate studies, there are strikingly few methods that consider cases with more than two studies, and those that exist generally assume that the distributions of irreproducible summary statistics are known. Leveraging Kendall's coefficient of concordance, we introduce a rank-based statistic that quantifies the agreement of results for a particular hypothesis without imposing such strict assumptions.
Because real high-throughput genomic settings include many ``housekeeping'' genes that are unrelated to the disease of interest and can therefore serve as a control set, we use conformal inference and bootstrapping techniques to devise three procedures for calculating approximate $p$-values from the proposed statistics; these $p$-values can be used to discover reproducible hypotheses at a nominal FDR level. Simulation studies show that the three methods outperform existing methods in terms of power and FDR control. Applying the methods to single-cell expression data from five COVID-19 studies, we show that the proposed statistic and its procedures can identify genes and gene pathways associated with COVID-19.
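The general idea of calibrating a statistic against a control set can be illustrated with a minimal sketch. It is not the dissertation's procedure: it assumes a generic per-hypothesis score for which larger values indicate stronger agreement, uses the standard conformal $p$-value (the rank of a test score among the control scores), and applies the standard Benjamini-Hochberg step-up rule. The function names, the toy scores, and the level `alpha=0.25` are all hypothetical.

```python
from typing import List, Sequence


def conformal_p_values(test_scores: Sequence[float],
                       control_scores: Sequence[float]) -> List[float]:
    """Standard conformal p-value of each test score against a control set.

    Assumes larger scores mean stronger evidence of reproducibility.
    """
    n = len(control_scores)
    return [
        (1 + sum(1 for c in control_scores if c >= t)) / (n + 1)
        for t in test_scores
    ]


def benjamini_hochberg(p_values: Sequence[float], alpha: float) -> List[int]:
    """Indices rejected by the Benjamini-Hochberg step-up rule at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0  # largest rank whose p-value falls under its BH threshold
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            k = rank
    return sorted(order[:k])


# Toy data (hypothetical): scores for control ("housekeeping") genes
# and for four genes under test.
control = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
test = [1.5, 0.05, 2.0, 0.45]

p = conformal_p_values(test, control)
discoveries = benjamini_hochberg(p, alpha=0.25)
print(p)            # [0.1, 1.0, 0.1, 0.6]
print(discoveries)  # [0, 2]
```

With only nine control scores the smallest attainable $p$-value is $1/10$, so the size of the control set limits how small a nominal FDR level can yield any discoveries in this simple scheme.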