American Institute of Mathematics, San Jose, California
Jay Bartroff, Larry Goldstein, Stanislav Minsker, and Gesine Reinert
The main topics of the workshop are the following.
Concentration of measure inequalities are among the most valuable tools in the study of high-dimensional statistics, and are most often employed under the assumption that observations are Gaussian or have light-tailed distributions; some form of independence is also usually assumed to hold. Typically, though, there is no reason to believe that real-world data sets follow such mathematically convenient distributions, as heavy-tailed models exhibiting dependence offer better approximations to reality. We will explore how Stein's method may be used to weaken assumptions such as independence, and may inform recent promising advances that produce performance guarantees for heavy-tailed distributions comparable to those available in the Gaussian case.
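One well-known example of such an advance is the median-of-means estimator, which attains sub-Gaussian-type deviation guarantees for heavy-tailed data where the plain sample mean does not. The following Python sketch (using NumPy; the block count and the Student-t test distribution are illustrative choices, not specifics from the workshop) shows the construction.

```python
import numpy as np

def median_of_means(x, k):
    """Median-of-means estimate of the mean of x using k blocks.

    Split the sample into k equal-size blocks, average each block,
    and return the median of the block means.  The median step makes
    the estimate robust to the occasional huge observation that a
    heavy-tailed distribution produces.
    """
    x = np.asarray(x, dtype=float)
    n = (len(x) // k) * k           # drop the remainder so blocks are equal
    blocks = x[:n].reshape(k, -1)
    return float(np.median(blocks.mean(axis=1)))

rng = np.random.default_rng(0)
# Student-t with 2.5 degrees of freedom: mean 0, but heavy tails
# (the variance barely exists and higher moments do not).
sample = rng.standard_t(df=2.5, size=10_000)
est = median_of_means(sample, k=20)
```

With `k` blocks the deviation bound holds with probability roughly `1 - exp(-k/8)`, so the block count trades off confidence against the variance of each block mean.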
High-dimensional data are often represented through empirical measures, which provide a flexible view that allows focus to be placed on different aspects of the data. Stein's method can be used to describe the asymptotic behavior of empirical measures even when the observations are heterogeneous and dependent. In particular, the error in the projection of empirical measures onto subspaces can be bounded. For high-dimensional data one often seeks informative low-dimensional summaries; we shall investigate how Stein's method can be used to quantify the accuracy of such dimension reduction techniques.
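To make the notion of projection error concrete, the sketch below (a simple numerical illustration, not a method discussed at the workshop) projects an empirical measure in R^50 onto its top principal subspace and records the relative mass of the measure lost to the projection; the dimensions, ranks, and noise level are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic data: 500 points in R^50 lying near a 3-dimensional subspace.
n, d, r = 500, 50, 3
latent = rng.normal(size=(n, r))
basis = rng.normal(size=(r, d))
X = latent @ basis + 0.1 * rng.normal(size=(n, d))

# Project the (centered) empirical measure onto the top-r principal subspace.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
proj = Xc @ Vt[:r].T @ Vt[:r]

# Relative squared error of the projected empirical measure: the fraction
# of total variance not captured by the r-dimensional summary.
rel_err = np.linalg.norm(Xc - proj) ** 2 / np.linalg.norm(Xc) ** 2
```

Here `rel_err` is the quantity one would like to bound theoretically: how much of the empirical measure is distorted by passing to the low-dimensional summary.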
Sequential hypothesis testing, estimation, and changepoint detection are also fertile ground for Stein techniques. One open problem is to obtain explicit distributional bounds between a stopped sequential test statistic and its limiting distribution, a problem connected to the excess over the boundary of a stopped random walk. A related problem is to explore the distributional effect of early stopping rules in Markov chain Monte Carlo methods for the analysis of high-dimensional data sets. Here the interest lies in stopping the Markov chain Monte Carlo run when it deviates too much from the target, which requires quantifying the distributional distance from the stopped chain to its limit, a main strength of Stein's method.
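The excess-over-the-boundary quantity can be simulated directly. The sketch below (an illustrative simulation; the exponential increments and boundary level are our own choices) records the overshoot of a positive random walk at the first time it crosses a level. Exponential increments are a convenient check because, by memorylessness, the overshoot is then exactly exponential with mean 1; heavier-tailed increments would give a different overshoot distribution.

```python
import numpy as np

rng = np.random.default_rng(2)

def overshoot(b, rng):
    """Excess over level b of a random walk with i.i.d. Exp(1) increments,
    stopped at the first time it exceeds b."""
    s = 0.0
    while s <= b:
        s += rng.exponential(1.0)
    return s - b

excesses = np.array([overshoot(50.0, rng) for _ in range(2000)])
mean_excess = excesses.mean()   # should be close to 1 for Exp(1) increments
```

Explicit distributional bounds for such stopped quantities, rather than simulated summaries, are exactly what the open problem above asks for.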
Large data sets are often processed in distributed systems consisting of several nodes, each of which can access only its own data sub-sample. Because communication between nodes is expensive or time-consuming, each node operates independently and the results are merged to obtain the output in a final step. We want to understand how to design ``optimal'' merging strategies, and to study connections between divide-and-conquer algorithms and rates of convergence in normal approximation.
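The simplest merging strategy is to average the node-level estimates. The sketch below (a toy setup; the node count and Gaussian data are illustrative) does this for the sample mean, where averaging equal-size sub-sample means reproduces the full-sample estimator exactly; for nonlinear statistics such as medians or M-estimators the merged estimate differs from the full-sample one, and its normal-approximation error is the object of study.

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.normal(loc=5.0, scale=2.0, size=100_000)

# Split the data across 10 "nodes"; each node sees only its sub-sample.
nodes = np.array_split(data, 10)
local_means = [chunk.mean() for chunk in nodes]

# Merge by averaging the node-level estimates.
merged = float(np.mean(local_means))
full = float(data.mean())   # what a single machine with all data would compute
```

For the mean, `merged` and `full` agree up to floating-point roundoff; the interesting questions begin once the per-node statistic is nonlinear.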
The overarching theme of the workshop will be the development of new methods in high dimensional data analysis by applying recent advances in probabilistic methods.
The workshop schedule.
A report on the workshop activities.