Given a random text over a finite alphabet, we study the frequencies at which fixed-length words occur as subsequences. As the data size grows, the joint distribution of word counts exhibits a rich asymptotic structure. We investigate all linear combinations of subword statistics, and fully characterize their different orders of magnitude. Moreover, we establish the spectral decomposition of the space of word statistics of each order. We provide explicit formulas for the eigenvectors and eigenvalues of the covariance matrix of the multivariate distribution of these statistics. This framework includes as special cases several well-studied random variables from the combinatorial and statistical literature. Our techniques include algebraic tools such as representations and operators in the words algebra.
Joint work with Tsviqa Lakrec and Ran Tessler.
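To make the basic statistic in the abstract concrete, here is a minimal sketch of counting how often a fixed word occurs in a text as a subsequence, meaning its letters appear in order but not necessarily adjacent. The function name `count_subsequence_occurrences` and the dynamic-programming approach are illustrative assumptions, not taken from the talk.

```python
def count_subsequence_occurrences(text: str, word: str) -> int:
    """Count the index tuples i_1 < ... < i_k with text[i_j] == word[j].

    Hypothetical helper for illustration; the name and method are not
    from the talk. Runs in O(len(text) * len(word)) time.
    """
    k = len(word)
    # counts[j] = number of ways to embed the prefix word[:j] in the
    # part of the text read so far; the empty prefix embeds in one way.
    counts = [1] + [0] * k
    for ch in text:
        # Go through j in decreasing order so that each text letter is
        # used at most once within a single embedding.
        for j in range(k, 0, -1):
            if word[j - 1] == ch:
                counts[j] += counts[j - 1]
    return counts[k]


if __name__ == "__main__":
    # "ab" occurs in "abab" at positions (0,1), (0,3) and (2,3).
    print(count_subsequence_occurrences("abab", "ab"))
```

Applying such a count to every word of a fixed length over the alphabet gives the vector of subword statistics whose joint distribution the talk studies for a random text.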
Chaim Even-Zohar received his PhD in Mathematics from the Hebrew University of Jerusalem in 2016 under the supervision of Nati Linial. He then spent one semester as a Research Fellow at the Institute for Computational and Experimental Research in Mathematics in Providence, Rhode Island. From 2017 to 2019 he was a Krener Assistant Professor in the Mathematics Department at the University of California, Davis. He is now a Research Associate at the Alan Turing Institute, the UK national institute for data science and artificial intelligence. His research interests include Combinatorics, Probability, and Low-dimensional Topology, as well as topics in Statistics and Computer Science. He was awarded the Rudin Scholarship, the Klein Prize, the Wolf Foundation Scholarship, and the Rothschild Postdoctoral Fellowship.