Alon Kipnis - Two-sample problem for large, sparse, high-dimensional distributions under rare/weak perturbations

Consider two samples, each obtained by independent draws from one of two possibly different distributions over the same large, finite alphabet of features. We would like to test whether the two distributions are identical. We propose a two-sample test that computes a p-value for each feature under a binomial allocation model and combines these p-values using Higher Criticism. Performance on real-world data (e.g. authorship attribution challenges) shows the method to be an effective unsupervised, untrained discriminator even when the binomial allocation model is violated.
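As a rough illustration of the pipeline described above, the following Python sketch computes a two-sided exact binomial p-value for each feature and combines the p-values with a Higher Criticism statistic. It is a minimal sketch under stated assumptions, not the speaker's implementation: the function names, the null allocation probability (sample 1's share of all counts), the particular normalization of the Higher Criticism statistic, and the default fraction gamma = 0.4 are all choices made here for illustration.

```python
import math


def binom_pmf(k, n, p):
    """Binomial(n, p) probability of k, computed in log space for stability."""
    if p <= 0.0:
        return 1.0 if k == 0 else 0.0
    if p >= 1.0:
        return 1.0 if k == n else 0.0
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log1p(-p))
    return math.exp(log_pmf)


def binom_test_two_sided(k, n, p):
    """Two-sided exact p-value: total probability of all outcomes that are
    no more likely than the observed one."""
    observed = binom_pmf(k, n, p)
    total = 0.0
    for j in range(n + 1):
        q = binom_pmf(j, n, p)
        if q <= observed * (1.0 + 1e-9):  # small tolerance for float ties
            total += q
    return min(1.0, total)


def feature_pvalues(counts1, counts2):
    """One p-value per feature. Assumption: given a feature's total count,
    its count in sample 1 is Binomial(total, p_null) under the null, with
    p_null equal to sample 1's share of all draws."""
    total1, total2 = sum(counts1), sum(counts2)
    p_null = total1 / (total1 + total2)
    pvals = []
    for x, y in zip(counts1, counts2):
        if x + y == 0:
            continue  # a feature absent from both samples carries no evidence
        pvals.append(binom_test_two_sided(x, x + y, p_null))
    return pvals


def higher_criticism(pvals, gamma=0.4):
    """One common variant of the Higher Criticism statistic:
    max over i <= gamma * n of sqrt(n) * (i/n - p_(i)) / sqrt(i/n * (1 - i/n)),
    where p_(1) <= ... <= p_(n) are the sorted p-values."""
    n = len(pvals)
    if n < 2:
        raise ValueError("need at least two p-values")
    sorted_p = sorted(pvals)
    upper = max(1, int(gamma * n))
    best = -math.inf
    for i in range(1, upper + 1):
        frac = i / n
        if frac >= 1.0:
            break  # the denominator vanishes at i = n
        z = math.sqrt(n) * (frac - sorted_p[i - 1]) / math.sqrt(frac * (1.0 - frac))
        best = max(best, z)
    return best


def two_sample_hc(counts1, counts2, gamma=0.4):
    """Higher Criticism score of two aligned count vectors (larger = more different)."""
    return higher_criticism(feature_pvalues(counts1, counts2), gamma)
```

For example, with aligned word-count vectors from two documents, two_sample_hc(counts1, counts2) returns a score that tends to be larger when a few features differ between the samples. Turning the score into a decision requires a threshold, which this sketch does not supply.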

We analyze the method in a 'rare/weak departures' setting where, if the two distributions are actually different, they differ only in relatively few features and only by relatively subtle amounts. We perform a phase diagram analysis in which the phase space quantifies how rare and how weak such departures are. Our proposal requires no formal specification of an alternative hypothesis, nor of a baseline or null hypothesis. Nevertheless, in the limit where counts are high, the method delivers the optimal phase diagram in the rare/weak setting: it is asymptotically fully powerful inside the region of phase space where a formally specified test would have been fully powerful. In the limit where counts are low, we also derive the phase diagram, although the optimality of the resulting diagram remains an open question.

Date and Time: 
Monday, January 18, 2021 - 19:00 to Tuesday, January 19, 2021 - 19:45
Speaker: 
Alon Kipnis
Location: 
Zoom
Speaker Bio: 

Alon Kipnis is a lecturer and a postdoctoral research scholar in the Department of Statistics at Stanford University, hosted by David Donoho. He received his Ph.D. degree in electrical engineering from Stanford University under the supervision of Andrea Goldsmith. His research combines mathematical statistics, information theory, and ambitious data science.