Statistical modeling of high-dimensional categorical data with applications to mutation fitness and sparse text topic analysis
Abstract
The growing availability of large-scale categorical data has created a strong need for statistical methods capable of modeling high-dimensional discrete structures. Such data are common in fields such as biological sequence analysis, natural language processing, and social network modeling, where observations often involve thousands of categorical or count-valued variables that exhibit complex dependencies and high sparsity. Conventional statistical models, designed for continuous or low-dimensional settings, often fail to capture the latent structure and combinatorial complexity of such data. This dissertation introduces new statistical modeling frameworks and estimation techniques tailored to high-dimensional categorical data, supported by theoretical guarantees and validated through applications in protein sequence analysis and topic modeling.
The first part of the dissertation focuses on modeling mutational fitness in proteins, where predicting the effects of amino acid mutations is challenging because of the vast number of combinations of sites and amino acid types. We propose a new framework for analyzing protein sequences using the Potts model with node-wise high-dimensional multinomial regression. Our method identifies key site interactions and important amino acids, and quantifies mutation effects through an evolutionary energy derived from the model parameters. It encourages sparsity in both site-wise and amino acid-wise dependencies by combining element-wise and group sparsity penalties. We establish, for the first time to our knowledge, the ℓ2 convergence rate for the estimated parameters in the high-dimensional Potts model under the sparse group Lasso. This rate matches the existing minimax lower bound for high-dimensional linear models with a sparse group structure, up to a factor depending only on the multinomial nature of the Potts model. This theoretical guarantee enables accurate quantification of estimated energy changes. Additionally, we incorporate structural data into our model by applying penalty weights across site pairs. Our method outperforms competing methods in predicting mutation fitness, as demonstrated by comparisons with high-throughput mutagenesis experiments across 12 protein families.
The second part focuses on topic modeling, a fundamental technique for uncovering latent semantic structures in large text corpora. While traditional probabilistic models such as Latent Dirichlet Allocation and probabilistic Latent Semantic Indexing have been widely adopted, they often rely on assumptions that do not align well with the properties of real-world text data, particularly the pervasive presence of zero counts. Such zeros, especially in short documents, often reflect more than random sampling variability and can indicate meaningful absence. To address these limitations, we propose a novel Zero-Inflated Poisson model that incorporates three essential components: a zero-inflation mechanism that explicitly accounts for excess zeros arising from structural rather than sampling sources; a functional link connecting the zero-inflation probability to the Poisson intensity, which captures informative missingness related to topic prevalence; and document-level random effects that account for unobserved heterogeneity across documents. We develop an efficient alternating optimization algorithm for estimating the intensity parameters under a low-rank structure, and we establish finite-sample error bounds for topic-word matrix recovery via a vertex hunting procedure. Empirical studies on synthetic datasets show that the model outperforms existing methods in sparse and heterogeneous settings. Application to a real-world corpus of statistical publications further confirms the model's ability to recover meaningful topics and track their evolution over time.
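As a rough illustration of the zero-inflation mechanism and the functional link described above, the following minimal Python sketch evaluates a zero-inflated Poisson probability for a single word count. The abstract does not state the form of the link, so the logistic link on the log-intensity used here, the function names, and the parameters `a` and `b` are all assumptions made for illustration. The sketch also leaves out the document-level random effects and the low-rank structure, and it should not be read as the dissertation's actual model or estimation procedure.

```python
import math


def zero_inflation_prob(lam, a=0.0, b=-1.0):
    """Hypothetical link: pi = sigmoid(a + b * log(lam)).

    With b < 0, words with a low Poisson intensity are more likely
    to be structural zeros. The logistic form is an assumption.
    """
    return 1.0 / (1.0 + math.exp(-(a + b * math.log(lam))))


def zip_pmf(k, lam, pi):
    """P(X = k) for a zero-inflated Poisson with structural-zero probability pi."""
    # Poisson probability computed on the log scale for numerical stability.
    poisson = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    # A zero can come from the structural component or from the Poisson itself.
    return pi * (k == 0) + (1.0 - pi) * poisson


def zip_log_likelihood(counts, lams, a=0.0, b=-1.0):
    """Log-likelihood of independent counts, each with its own intensity."""
    total = 0.0
    for k, lam in zip(counts, lams):
        pi = zero_inflation_prob(lam, a, b)
        total += math.log(zip_pmf(k, lam, pi))
    return total
```

Under these assumed defaults, `zero_inflation_prob(lam)` reduces to `1 / (1 + lam)`, so the probability of a structural zero shrinks as the intensity grows. That is one simple way to encode the idea that absence is more informative for words with low topic prevalence.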