Tuesday Discussion on Statistical Issues
This discussion was moderated by Ruth Charney.
Q. Diaconis suggested making a list: what kinds of noise, or sources of
variation, are implicit in estimating trees?
Participant discussion led to the following list:
- alignment
- errors in sequencing (though Epstein pointed out that the
quality is steadily improving)
- misspecification of the model (process of going from the data to
a tree):
mutation rates,
independence between sites,
change of nucleotide composition
- variation within species (1 in 1000 genes)
- bias in corrections for distances (for distance based models)
- variation between fragments of same DNA
(bias created by choice of fragments)
- selection varies in different parts of the genome
- gene identification (problems created by gene duplication and gene loss)
- hybridization (branching comes back together, structure is not a tree)
and horizontal gene transfer
- optimization
- not enough data-- length of sequences, number of taxa
Q. Are these problems worth working on incrementally or all at once?
Which ones are most important? (Diaconis)
- Penny: differences in nucleotide composition?
Q. Can we define the geometry (distance) on tree space to behave well
with respect to a particular model?
- Vert: what are the implications of the noise in the definition of the
tree space? Instead of defining the geometry of tree space
beforehand, and then asking what properties it satisfies, would it be
possible to define a tree space in such a way that it has
good properties (is stable, averages are at least as good as the trees
themselves, etc.)?
- Diaconis: this is perhaps related to statistical geometry or
information geometry, which uses the Fisher information to define a
Riemannian metric on the space of distributions.
- Epstein: an average is a summary statistic, but doesn't summarize
everything we might want.
- Evans: information geometry example: using the upper half plane to
parametrize normal distributions on a line, the geometry that best
reflects closeness of normals is hyperbolic geometry.
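Evans's example can be made concrete with a small sketch. It assumes the
standard Fisher information metric for the family N(mu, sigma^2),
ds^2 = (d mu^2 + 2 d sigma^2) / sigma^2, which after rescaling the mean by
1/sqrt(2) is twice the usual upper-half-plane metric. The function name
fisher_rao_distance is ours, chosen for illustration, and is not taken from
the discussion.

```python
import math

def fisher_rao_distance(mu1, sigma1, mu2, sigma2):
    """Fisher-Rao distance between N(mu1, sigma1^2) and N(mu2, sigma2^2)."""
    if sigma1 <= 0 or sigma2 <= 0:
        raise ValueError("standard deviations must be positive")
    # Assumed metric: ds^2 = (d mu^2 + 2 d sigma^2) / sigma^2.
    # With x = mu / sqrt(2) and y = sigma this is 2 * (dx^2 + dy^2) / y^2,
    # so the distance is sqrt(2) times the hyperbolic distance in (x, y).
    num = (mu1 - mu2) ** 2 / 2.0 + (sigma1 - sigma2) ** 2
    return math.sqrt(2.0) * math.acosh(1.0 + num / (2.0 * sigma1 * sigma2))

# Two pairs of normals whose means differ by 1: the pair with the larger
# spread overlaps more, and the hyperbolic distance reflects that.
narrow = fisher_rao_distance(0.0, 1.0, 1.0, 1.0)
wide = fisher_rao_distance(0.0, 10.0, 1.0, 10.0)
```

Here wide comes out smaller than narrow, even though the Euclidean distance
between the (mean, standard deviation) pairs is the same in both cases. This
is the sense in which hyperbolic geometry tracks closeness of normals better.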
Q. What are residuals for trees, and how do we estimate the residuals?
(Residuals measure how far each data point is from the fitted tree.)
Are there graphical methods for detecting departure from the model?
(Penny)
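One possible reading of this question, for distance-based methods only, is
sketched below: take the residual for a pair of taxa to be the observed
pairwise distance minus the path length between them on the fitted tree.
This is an illustrative assumption, not a definition agreed in the
discussion. The edge-list representation and the names path_lengths and
residuals are hypothetical.

```python
def path_lengths(edges, leaves):
    """Leaf-to-leaf path lengths on a weighted tree given as (u, v, length)."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    fitted = {}
    for src in leaves:
        # Depth-first traversal from src; in a tree each node is reached
        # by exactly one path, so the first distance found is the path length.
        dist = {src: 0.0}
        stack = [src]
        while stack:
            node = stack.pop()
            for nbr, w in adj[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + w
                    stack.append(nbr)
        for dst in leaves:
            if src < dst:
                fitted[(src, dst)] = dist[dst]
    return fitted

def residuals(observed, edges, leaves):
    """Observed distance minus fitted path length, for each pair of leaves."""
    fitted = path_lengths(edges, leaves)
    return {pair: observed[pair] - fitted[pair] for pair in fitted}

# Hypothetical four-taxon tree ((a,b),(c,d)) with internal nodes x and y.
edges = [("a", "x", 1.0), ("b", "x", 1.0), ("x", "y", 1.0),
         ("c", "y", 1.0), ("d", "y", 1.0)]
observed = {("a", "b"): 2.1, ("a", "c"): 3.0, ("a", "d"): 2.8,
            ("b", "c"): 3.2, ("b", "d"): 3.0, ("c", "d"): 2.0}
res = residuals(observed, edges, ["a", "b", "c", "d"])
```

Plotting such residuals against the fitted path lengths would be one simple
graphical check, in the spirit of residual plots for regression.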
Q. Random walks on sets of trees (using tree rotations): how fast do they
converge?
- Once again, the question of how to measure distance
in the space of trees emerged.
Q. What are appropriate measures on tree space?
- Glenn asked if there is a need for a non-parametric likelihood on trees.
Diaconis said that empirical likelihood is related to the bootstrap, so it
could contribute.
Q. If exact distances aren't easy to compute,
can we obtain any upper and lower bounds for distances (such as the BHV
metric)? (Su)
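As a partial illustration of Su's question, two elementary bounds on the BHV
distance can be written down directly. The sketch assumes each tree is given
as a dict from interior splits to positive edge lengths, with splits already
in a canonical form so that equal splits compare equal, and it ignores
pendant edges. The lower bound is the Euclidean distance between the two
edge-length vectors (a path inside tree space cannot be shorter than the
straight line in the ambient space). The upper bound is the length of the
cone path that shrinks every interior edge of one tree to zero and then
grows the edges of the other. The name bhv_bounds is hypothetical.

```python
import math

def bhv_bounds(tree1, tree2):
    """Return (lower, upper) bounds on the BHV distance between two trees.

    Each tree is a dict {split: length} over interior edges only.
    """
    splits = set(tree1) | set(tree2)
    # Lower bound: straight-line distance, treating a missing split as length 0.
    lower = math.sqrt(sum((tree1.get(s, 0.0) - tree2.get(s, 0.0)) ** 2
                          for s in splits))
    # Upper bound: cone path through the star tree (all interior edges zero).
    norm1 = math.sqrt(sum(v * v for v in tree1.values()))
    norm2 = math.sqrt(sum(v * v for v in tree2.values()))
    return lower, norm1 + norm2

# Two four-leaf trees with incompatible splits ab|cd and ac|bd.
t1 = {frozenset("ab"): 1.0}
t2 = {frozenset("ac"): 1.0}
lower, upper = bhv_bounds(t1, t2)
```

For this pair the geodesic must pass through the star tree, so the upper
bound is attained. In general both bounds can be far from the true distance,
and sharper ones would need the combinatorics of the shared and conflicting
splits.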