This AIM workshop is a follow-up to a semester-long event at the Statistical and Applied Mathematical Sciences Institute (SAMSI) located in the Research Triangle Park in North Carolina. The workshop was dedicated to the study of random matrices and their applications in a variety of real-world problems. Particular emphasis was put on problems that give rise to data sets of large sample sizes, high-dimensional vectors, and commensurately large random matrices.

Among the working groups at SAMSI, and at this workshop, were

- Climate and Weather
- Wireless Communication
- Universality
- Regularization and Covariance
- Geometric Methods
- Multivariate Distributions
- Graphical Models/Bayesian Methods
- Estimating Functionals of High-Dimensional Sparse Vectors

The examples that we shall describe in detail below draw on ideas from several of these groups.

When confronted with a large random matrix constructed from high-dimensional data, statisticians often perform a "principal component analysis" (PCA) of the data. In principal component analysis, the eigenvalues and eigenvectors of the random matrix are used to develop low-dimensional approximations to the high-dimensional data. Principal component analysis is widely used in applications such as meteorology, the analysis of income tax returns, and the design of Internet search engines. The Climate and Weather working group focused primarily on principal component analysis and its applications to the detection and attribution of climate change.
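As a concrete illustration of the procedure just described (the data and dimensions here are made up), a minimal principal component analysis can be carried out with NumPy's eigendecomposition:

```python
import numpy as np

rng = np.random.default_rng(0)
# 200 observations of a 5-dimensional vector with correlated entries
data = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

centered = data - data.mean(axis=0)
cov = centered.T @ centered / (len(data) - 1)

# Eigenvalues and eigenvectors of the symmetric covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]          # largest eigenvalue first
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Project onto the top 2 principal components: a low-dimensional
# approximation to the high-dimensional data
scores = centered @ eigenvectors[:, :2]
print(scores.shape)  # (200, 2)
```

The leading eigenvectors capture the directions of greatest variance, which is precisely the low-dimensional approximation referred to above.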

Graphical models are statistical models designed to analyze complex high-dimensional data. Graphical models are often characterized by the nature of their covariance matrices or the inverses of their covariance matrices. The working group on Graphical Models/Bayesian Methods developed methods for reducing the dimension and the number of parameters of high-dimensional data sets.

The Internal Revenue Service (IRS) of the United States Government is one
of the largest users of linear algebra. The IRS may view an individual
taxpayer as a large vector with *d* entries, where *d* can be a fairly
large positive integer (at least 100). The entries in this vector
may include zip code, income, number of income streams, and other
pieces of data that are relevant to the collection of taxes. Call this
vector *X*.

If *X_{1}, X_{2}, ..., X_{N}* are *N* such taxpayer vectors, then their *sample mean* is

*
X¯ = (1/N) ∑_{j=1}^{N} X_{j} .
*

Then we can define the *sample covariance matrix*

*
M = ∑_{j=1}^{N} (X_{j} - X¯)(X_{j} - X¯)^{T} .
*

We use here the exponent *T* to denote the transpose of the vector.
Thus **M** is a *d × d* matrix. It is known that, if *N*
is larger than *d*, then **M** is positive definite almost surely. Here
"almost surely" is probabilistic jargon that means that the event occurs
with probability 1. In other words, it is a sure bet that this event will occur.
A positive definite matrix *A* is one which satisfies **x**^{T} A **x** > 0 for every nonzero vector **x**. Equivalently, a symmetric *A* is positive definite if the determinant of each square upper-left submatrix
is positive (Sylvester's criterion). The study of the covariance matrix **M** gives rise to the so-called *Wishart distribution*.
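The definitions above can be checked numerically; the following sketch (with made-up data and small *d*, *N*) builds the sample covariance matrix and verifies both characterizations of positive definiteness:

```python
import numpy as np

rng = np.random.default_rng(1)
d, N = 4, 50                       # N larger than d, so M is positive definite a.s.
X = rng.normal(size=(N, d))        # N taxpayer vectors, each with d entries

Xbar = X.mean(axis=0)              # the sample mean
M = (X - Xbar).T @ (X - Xbar)      # the sample covariance matrix, d x d

# Positive definiteness: all eigenvalues are positive ...
assert np.all(np.linalg.eigvalsh(M) > 0)
# ... equivalently, every leading principal minor is positive (Sylvester)
assert all(np.linalg.det(M[:k, :k]) > 0 for k in range(1, d + 1))
```

With continuous data and *N* > *d*, the assertions hold with probability 1, in line with the "almost surely" statement above.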

If the IRS determines that your vector *X* of taxpayer
information deviates markedly from the average or mean
*X¯* of other taxpayers in your zip code, then it
concludes that there is something unusual about your income
tax status. Thus you are more likely to be the subject of an audit.

The IRS has various means of calculating your deviation from
the mean. One of these, the previously mentioned technique of
principal component analysis, is to calculate the eigenvalues
*λ_{j}* and eigenvectors *v_{j}* of the covariance matrix **M**.

If your *X* is well-approximated by the eigenvectors *v_{j}* then
your data fits in well with that of the chosen population (in your
zip code). So you fit the profile of an "average taxpayer" and it
is not likely that you will get an audit. If instead your *X* deviates
markedly from the directions spanned by the leading eigenvectors, then
you stand out from the population and are more likely to be audited.
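One simple way to quantify this fit (a sketch, not the IRS's actual procedure) is to measure how much of a vector's deviation from the mean lies outside the span of the leading eigenvectors:

```python
import numpy as np

rng = np.random.default_rng(2)
d, N = 6, 500
X = rng.normal(size=(N, d)) @ rng.normal(size=(d, d))   # correlated population

Xbar = X.mean(axis=0)
M = (X - Xbar).T @ (X - Xbar)
eigenvalues, V = np.linalg.eigh(M)
order = np.argsort(eigenvalues)[::-1]
V_top = V[:, order[:2]]        # leading eigenvectors: the "average" profile
v_small = V[:, order[-1]]      # least-significant direction

def residual(x):
    """Fraction of x - Xbar left unexplained by the leading eigenvectors."""
    c = x - Xbar
    projection = V_top @ (V_top.T @ c)
    return np.linalg.norm(c - projection) / np.linalg.norm(c)

# A deviation along a leading eigenvector fits the profile exactly;
# one along the least-significant direction does not.
print(residual(Xbar + V_top[:, 0]))   # ~0
print(residual(Xbar + v_small))       # ~1
```

A large residual plays the role of "deviating markedly from the mean" in the discussion above.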

At the SAMSI event, and at this AIM workshop, participants studied
random matrices arising from high-dimensional data.
For example, the data vectors *X* that would come from a question
in the human genome project would typically have 10,000 pieces of
information. This would give rise to a very large covariance matrix.
The corresponding calculations and analysis are orders of magnitude
more difficult than those that correspond to small *X* and small **M**.

One of the matters of interest is *discriminant analysis*. This is
a device for determining the probability of misclassification of an *X*.
For example, imagine that there are two populations of citizens in a
neighborhood---group **A** consisting of those
who work as executives for big corporations and group **B** consisting
of those who are freelance consultants.
The first group has regular, fairly large salaries. The second group
consists of people with irregular income streams of widely varying magnitude.
Obviously these two different types of citizens will have different tax characteristics.
Given a citizen's tax vector *X*, we want to be able to determine analytically
whether *X* belongs in **A** or in **B**. Thus one needs a *metric*,
or a notion of distance, to determine which of **A** or **B** the vector
*X* is closest to. And this metric cannot be the standard isotropic Euclidean
metric which treats each coordinate in the same way. Different tax data will
count more or less than other tax data, so one requires a non-isotropic metric
that weights the different pieces of data differently. The positive
definite matrix **M** gives a device for constructing such a metric.
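A standard instance of such a covariance-weighted metric is the Mahalanobis distance. The following sketch (the populations and tax-vector entries are hypothetical) classifies a vector by comparing its distance to each group:

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical 3-entry tax vectors: (salary, income streams, volatility).
# Group A: steady salaries; group B: irregular, widely varying income.
A = rng.normal(loc=[100.0, 1.0, 0.0], scale=[10.0, 0.5, 1.0], size=(200, 3))
B = rng.normal(loc=[90.0, 4.0, 5.0], scale=[40.0, 2.0, 4.0], size=(200, 3))

def mahalanobis(x, sample):
    """Non-isotropic distance from x to a population, weighted by its covariance."""
    mean = sample.mean(axis=0)
    M = np.cov(sample, rowvar=False)          # positive definite covariance matrix
    diff = x - mean
    return float(np.sqrt(diff @ np.linalg.solve(M, diff)))

def classify(x):
    """Assign x to the group it is closest to in the Mahalanobis metric."""
    return "A" if mahalanobis(x, A) < mahalanobis(x, B) else "B"

print(classify(np.array([102.0, 1.2, 0.3])))   # a steady-salary profile
```

Because the covariance matrix weights each coordinate (and each correlation) differently, this metric treats a modest salary fluctuation in group **B** as unremarkable while flagging the same fluctuation in group **A**, exactly the non-isotropy called for above.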

The analytical questions give rise to subtle considerations in Riemannian geometry. One wishes to know how to calculate distances, and geodesics, in a space of positive definite matrices. The resulting analyses draw on many parts of modern mathematics.
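One concrete computation behind these geometric questions is the affine-invariant geodesic distance on the space of positive definite matrices, d(A, B) = ||log(A^{-1/2} B A^{-1/2})||_F. The sketch below (illustrative, not the working group's code) evaluates it with SciPy's matrix functions:

```python
import numpy as np
from scipy.linalg import logm, sqrtm

def spd_geodesic_distance(A, B):
    """Affine-invariant geodesic distance between positive definite A and B."""
    A_inv_sqrt = np.linalg.inv(sqrtm(A))
    # Congruence by A^{-1/2} keeps the argument symmetric positive definite
    return float(np.linalg.norm(logm(A_inv_sqrt @ B @ A_inv_sqrt), "fro"))

A = np.array([[2.0, 0.5], [0.5, 1.0]])
B = np.array([[1.0, 0.0], [0.0, 3.0]])
print(spd_geodesic_distance(A, A))   # ~0: a point is at distance 0 from itself
print(spd_geodesic_distance(A, B))
```

Unlike the Euclidean distance between the matrix entries, this distance is intrinsic to the manifold of positive definite matrices, which is why the resulting analysis is genuinely Riemannian.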

The Wireless Communications group at this workshop is concerned with using Bessel functions and other sophisticated notions from classical analysis to design multiple-input, multiple-output (MIMO) channels for cell phones. It uses covariance analysis to carry out these studies.

The Multivariate Distribution group wants to apply these techniques in medical
studies. For example, in the testing of a new drug each *X* represents
a patient. Many of the people who volunteer for a drug study are
unreliable, so the data sets that arise from the study have gaps in
them. Random matrix techniques provide methods for interpolating across
these gaps while still drawing meaningful conclusions.
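One covariance-based way to fill such gaps (a sketch under a multivariate normal model, with made-up patient data) is to impute each missing entry by its conditional mean given the observed entries:

```python
import numpy as np

rng = np.random.default_rng(4)
d, N = 4, 300
L = rng.normal(size=(d, d))
X = rng.normal(size=(N, d)) @ L.T          # correlated "patient" vectors

mu = X.mean(axis=0)
M = np.cov(X, rowvar=False)                # sample covariance matrix

def impute(x, observed):
    """Fill the unobserved entries of x with their conditional expectation."""
    obs = np.asarray(observed)
    mis = ~obs
    # E[x_mis | x_obs] = mu_mis + M_mo M_oo^{-1} (x_obs - mu_obs)
    filled = x.copy()
    filled[mis] = mu[mis] + M[np.ix_(mis, obs)] @ np.linalg.solve(
        M[np.ix_(obs, obs)], x[obs] - mu[obs])
    return filled

x = X[0].copy()
x[2] = np.nan                              # a gap in one patient's record
observed = ~np.isnan(x)
print(impute(x, observed))
```

Because the covariance matrix encodes how the entries co-vary, the observed entries carry real information about the missing ones, which is what makes conclusions drawn from the gappy data meaningful.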

An important message here is that random matrix theory is a burgeoning and developing part of modern mathematics. It is used in many different disciplines, ranging from number theory to mathematical physics to probability. Moreover, it is used decisively in statistical studies as indicated in the present discussion. The subject is a source of new problems and new research directions, and should serve as an attractive venue for young researchers.