Stein's method and applications in high-dimensional statistics

August 6 to August 10, 2018

at the

American Institute of Mathematics, San Jose, California

organized by

Jay Bartroff, Larry Goldstein, Stanislav Minsker, and Gesine Reinert

Original Announcement

This workshop is devoted to applications of Stein's method to modern, practical questions in high-dimensional mathematical statistics.

The main topics of the workshop are the following.

  1. Concentration of measure inequalities and sparse recovery problems
  2. Empirical measures and dimension reduction
  3. Sequential analysis and change-point detection
  4. Connections between distributed statistical estimation and rates of convergence to normal approximation

Concentration of measure inequalities are among the most valuable tools in the study of high-dimensional statistics, and are most often employed under the assumption that observations are Gaussian or have light-tailed distributions. Some form of independence is usually assumed as well. Typically, though, there is no reason to believe that real-world data sets can be modeled by such mathematically convenient distributions, as heavy-tailed models exhibiting dependence offer better approximations to reality. We will explore how Stein's method may be used to weaken assumptions such as independence, and how it may inform recent promising advances that produce performance guarantees for heavy-tailed distributions comparable to those for the Gaussian.

High-dimensional data are often represented through empirical measures, which provide a flexible view that allows focus to be placed on different aspects of the data. Stein's method can be used to describe the asymptotic behavior of empirical measures even when the observations are heterogeneous and not independent of each other. In particular, the error in the projection of empirical measures onto subspaces can be bounded. For high-dimensional data one often seeks informative low-dimensional summaries. We shall investigate how Stein's method can be used to quantify the accuracy of such dimension reduction techniques.

Sequential hypothesis testing, estimation, and change-point detection are also fertile ground for Stein techniques. One open problem is to obtain explicit bounds on the distance between the distribution of a stopped sequential test statistic and its limiting distribution, a problem connected to the excess over the boundary of a stopped random walk. A related problem is to explore the distributional effect of early stopping rules in Markov chain Monte Carlo methods for the analysis of high-dimensional data sets. Here the interest lies in stopping the Markov chain Monte Carlo run when it deviates too much from the target, which requires quantifying the distributional distance between the stopped chain and its limit, a main strength of Stein's method.

Large data sets are often processed in distributed systems that consist of several nodes, each of which is able to access only a different sub-sample of the data. Because communication between nodes is expensive or time-consuming, each node functions independently, and the results are merged to obtain the output at the final step. We want to understand how to design "optimal" merging strategies, and to study connections between divide-and-conquer algorithms and the rates of convergence in normal approximation.

The overarching theme of the workshop will be the development of new methods in high-dimensional data analysis by applying recent advances in probabilistic methods.

Material from the workshop

A list of participants.

The workshop schedule.

A report on the workshop activities.

A list of open problems.

Papers arising from the workshop:

Normal Approximation and Fourth Moment Theorems for Monochromatic Triangles
by Bhaswar B. Bhattacharya, Xiao Fang, and Han Yan

Relaxing the Gaussian assumption in Shrinkage and SURE in high dimension
by Max Fathi, Larry Goldstein, Gesine Reinert, and Adrien Saumard

Arcsine laws for random walks generated from random permutations with applications to genomics
by Xiao Fang, Han Liang Gan, Susan Holmes, Haiyan Huang, Erol Peköz, Adrian Röllin, and Wenpin Tang

A note on three-fold branched covers of $S^4$
by Ryan Blair, Patricia Cahn, Alexandra Kjuchukova, and Jeffrey Meier

Normal approximation for stochastic gradient descent via non-asymptotic rates of martingale CLT
by Andreas Anastasiou, Krishnakumar Balasubramanian, and Murat A. Erdogdu