New developments on linear regression with random design and high-dimensional mediation analysis
Abstract
Linear regression is arguably the most widely used statistical method. In this thesis, we study the robustness of the least squares estimator when the regressors are random and the errors are correlated with an unknown correlation structure. We further investigate the small-sample robustness of the least squares estimator and offer a new geometric perspective on the F-test. Motivated by the Baron-Kenny approach, we also apply linear models to high-dimensional mediation analysis with treatment-by-mediator interaction.

In linear regression with fixed regressors and correlated errors, the conventional wisdom is to modify the variance-covariance estimator to accommodate the known correlation structure of the errors. We depart from the literature by showing that with random regressors, linear regression inference is robust to correlated errors with an unknown correlation structure. The existing theoretical analyses for linear regression are no longer valid because even the asymptotic normality of the least squares coefficients breaks down in this regime. We first prove the asymptotic normality of the t statistics by establishing their Berry–Esseen bounds, based on a novel probabilistic analysis of self-normalized statistics. We then study the local power of the corresponding t-tests and show that, perhaps surprisingly, error correlation can even enhance power in the regime of weak signals. Overall, our results show that linear regression is applicable more broadly than the conventional theory suggests, and they further demonstrate the value of randomization in ensuring robust inference.

Next, we explore the small-sample robustness of least squares inference based on the F-test and t-test. The F distribution is one of the most widely applied statistical tools in small-sample inference, and it has been recognized that its definition does not require normality, but merely a spherical distribution.
While the existing literature has touched upon the relationship between the F and spherical distributions, these discussions remain either incomplete or lacking in rigor. We provide a geometric perspective that clearly delineates the relationship between the F and spherical distributions, and we introduce a novel definition of the F distribution. Perhaps surprisingly, based on this new definition, the validity of the ordinary least squares F-test and t-test in the linear model is preserved under spherical symmetry of the design matrix, even if the error terms have non-zero means, heteroscedasticity, strong correlations, or heavy tails.

Finally, we apply the linear model (the Baron-Kenny approach) to mediation analysis in a high-dimensional setting with interactions. Mediation analysis is commonly applied in various fields, including economics, finance, and genomic and genetic research. A key challenge in this domain is inference on the natural direct and indirect effects in the presence of potential interactions between the treatment and high-dimensional mediators. These interactions often give rise to moderator effects, which are further complicated by the intricate dependencies among the mediators. Here, we introduce a new inference procedure that addresses this challenge. By incorporating a non-convex penalty into the outcome model, our method identifies important mediators while accounting for their interactions with the treatment, and it provably enjoys the oracle property. Leveraging the oracle property, we exploit a projection onto the mediator model, guided by the estimated important direction in the mediator space. We establish the asymptotic normality of the estimators of both the natural indirect and direct effects for inference. Additionally, we develop an algorithm that uses the overlapping group SCAD penalty to promote a heredity structure among the main effects and interactions, and that comes with provable guarantees.
Our extensive numerical studies, comparing our method with other existing approaches across various scenarios, demonstrate its effectiveness. To illustrate the practical application of our methods, we conduct a study investigating the impact of childhood trauma on cortisol stress reactivity. Using DNA methylation loci as mediators, we uncover several new loci that remain undetected when interactions are ignored.
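
The claim that the ordinary least squares F-test stays valid under a spherically symmetric design, whatever the error distribution, can be probed with a small simulation. The sketch below is only an illustration and is not taken from the thesis: the sample size, the Gaussian design, the particular error process (cumulative sums of Cauchy draws shifted by a constant), and the function name `f_test_rejection_rate` are all assumptions chosen for convenience. It uses two regressors without an intercept so that the F(2, m) tail probability has the closed form (1 + 2F/m)^(-m/2) and no statistics library is needed.

```python
import numpy as np


def f_test_rejection_rate(n=20, reps=20000, alpha=0.05, seed=0):
    """Monte Carlo size of the global OLS F-test (H0: beta = 0, no intercept).

    Hypothetical illustration: the design is spherically symmetric, while the
    errors are deliberately badly behaved but independent of the design.
    """
    rng = np.random.default_rng(seed)
    p = 2        # two regressors, so the F(2, m) tail has a closed form
    m = n - p    # residual degrees of freedom
    rejections = 0
    for _ in range(reps):
        # Spherically symmetric design: i.i.d. standard normal entries.
        X = rng.standard_normal((n, p))
        # Under H0 the response is pure error: heavy-tailed (Cauchy),
        # strongly correlated (cumulative sums), shifted away from zero.
        y = np.cumsum(rng.standard_cauchy(n)) + 5.0
        # Orthonormal basis of the column space of X.
        Q, _ = np.linalg.qr(X)
        fitted = Q @ (Q.T @ y)
        resid = y - fitted
        F = (fitted @ fitted / p) / (resid @ resid / m)
        # Exact survival function of the F(2, m) distribution.
        p_value = (1.0 + 2.0 * F / m) ** (-m / 2.0)
        rejections += int(p_value < alpha)
    return rejections / reps


rate = f_test_rejection_rate()
print(f"empirical rejection rate at nominal 5%: {rate:.3f}")
```

If the claim holds, the printed rejection rate should be close to the nominal 0.05 even though the errors violate every classical assumption. Replacing the Gaussian design with a non-spherical one (for example, strongly skewed columns) is a simple way to see how much the result depends on spherical symmetry.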