ARCC Workshop: Stein's method and applications in high-dimensional statistics

Concentration of measure inequalities are among the most valuable tools in the study of high-dimensional statistics, and are most often employed under the assumption that observations are Gaussian or have light-tailed distributions. Some form of independence is usually assumed as well. Typically, though, there is no reason to believe that real-world data sets can be modeled by such mathematically convenient distributions, as heavy-tailed models exhibiting dependence offer better approximations to reality. We will explore how Stein's method may be used to weaken assumptions such as independence, and how it may inform recent promising advances that produce performance guarantees for heavy-tailed distributions comparable to those for the Gaussian.

High-dimensional data are often represented through empirical measures, as they provide a flexible view that allows focus to be placed on different aspects of the data. Stein's method can be used to describe the asymptotic behavior of empirical measures even when the observations are heterogeneous and not independent of each other. In particular, the error in the projection of empirical measures onto subspaces can be bounded. For high-dimensional data one often seeks informative low-dimensional summaries. We shall investigate how Stein's method can be used to quantify the accuracy of such dimension reduction techniques.

Sequential hypothesis testing, estimation, and changepoint detection are also fertile ground for Stein techniques. One open problem is to obtain explicit distributional bounds between a stopped sequential test statistic and its limiting distribution, a problem connected to the excess over the boundary of a stopped random walk. A related problem is to explore the distributional effect of early stopping rules in Markov chain Monte Carlo methods for the analysis of high-dimensional data sets. Here the interest lies in stopping the Markov chain Monte Carlo run when it deviates too much from the target, requiring quantification of the distributional distance from the stopped chain to the limit, a main strength of Stein's method.
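As a toy illustration of the "excess over the boundary" mentioned above, the following Python sketch simulates a random walk with positive drift and records how far it overshoots a fixed boundary when it first crosses it. The function name `overshoot`, the Gaussian increments, the drift 0.5, and the boundary 10 are all assumptions chosen for illustration, not choices made by the workshop.

```python
import random

def overshoot(boundary, rng, drift=0.5):
    """Run a Gaussian random walk until it first exceeds `boundary`.

    Returns the stopping time and the excess over the boundary.
    The positive drift (an illustrative assumption) makes the walk
    cross the boundary with probability one.
    """
    position, steps = 0.0, 0
    while position <= boundary:
        position += rng.gauss(drift, 1.0)
        steps += 1
    return steps, position - boundary

rng = random.Random(1)  # fixed seed so the run is reproducible
samples = [overshoot(10.0, rng) for _ in range(2000)]

# Empirical summaries of the stopped walk.
mean_steps = sum(s for s, _ in samples) / len(samples)
mean_excess = sum(e for _, e in samples) / len(samples)
print(f"mean stopping time: {mean_steps:.2f}, mean excess: {mean_excess:.3f}")
```

The simulation only gives an empirical picture of the overshoot; the open problem described above asks for explicit bounds on the distance between the law of the stopped statistic and its limit.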

Large data sets are often processed in distributed systems consisting of several nodes, each of which can access only a different sub-sample of the data. Because communication between nodes is expensive or time-consuming, each node functions independently, and the results are merged only at the final step to produce the output. We want to understand how to design "optimal" merging strategies, and to study connections between divide-and-conquer algorithms and the rates of convergence in normal approximation.
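The divide-and-conquer setup described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the helper names (`split_into_nodes`, `local_estimate`, `merge`), the choice of the sample mean as the estimator, and the size-weighted average as the merging rule are all hypothetical, and nothing here is claimed to be an "optimal" strategy.

```python
import random
import statistics

def split_into_nodes(data, num_nodes):
    """Partition the data into disjoint sub-samples, one per node (round-robin)."""
    return [data[i::num_nodes] for i in range(num_nodes)]

def local_estimate(sub_sample):
    """Work done on a single node: it sees only its own sub-sample."""
    return statistics.fmean(sub_sample), len(sub_sample)

def merge(estimates):
    """Single communication step: size-weighted average of the local means."""
    total = sum(n for _, n in estimates)
    return sum(m * n for m, n in estimates) / total

rng = random.Random(0)  # fixed seed so the run is reproducible
data = [rng.gauss(1.0, 2.0) for _ in range(10_000)]

nodes = split_into_nodes(data, num_nodes=8)
merged = merge([local_estimate(s) for s in nodes])
full = statistics.fmean(data)  # estimate with access to all data at once
print(f"merged: {merged:.6f}, full-sample: {full:.6f}")
```

For the sample mean, the size-weighted merge agrees with the full-sample estimate up to floating-point rounding. For nonlinear estimators this is generally no longer true, which is where the choice of merging strategy starts to matter.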

The overarching theme of the workshop will be the development of new methods in high-dimensional data analysis by applying recent advances in probabilistic methods.

Material from the workshop

Normal Approximation and Fourth Moment Theorems for Monochromatic Triangles

by Bhaswar B. Bhattacharya, Xiao Fang, Han Yan

Relaxing the Gaussian assumption in Shrinkage and SURE in high dimension

by Max Fathi, Larry Goldstein, Gesine Reinert, and Adrien Saumard

Arcsine laws for random walks generated from random permutations with applications to genomics

by Xiao Fang, Han Liang Gan, Susan Holmes, Haiyan Huang, Erol Peköz, Adrian Röllin, Wenpin Tang

A note on three-fold branched covers of $S^4$

by Ryan Blair, Patricia Cahn, Alexandra Kjuchukova, Jeffrey Meier

Normal approximation for stochastic gradient descent via non-asymptotic rates of martingale CLT

by Andreas Anastasiou, Krishnakumar Balasubramanian, Murat A. Erdogdu
