# Tuesday Discussion on Statistical Issues

This discussion was moderated by Ruth Charney.

Q. Diaconis suggested making a list of the kinds of noise, or sources of variation, that are implicit in estimating trees. Participant discussion led to the following list:

• alignment
• errors in sequencing (though Epstein pointed out that the quality is steadily improving)
• misspecification of the model (process of going from the data to a tree): mutation rates, independence between sites, change of nucleotide composition
• variation within species (1 in 1000 genes)
• bias in corrections for distances (for distance based models)
• variation between fragments of same DNA (bias created by choice of fragments)
• selection varies in different parts of the genome
• gene identification (problems created by gene duplication and gene loss)
• hybridization (branching comes back together, structure is not a tree) and horizontal gene transfer
• optimization
• not enough data: length of sequences, number of taxa

Q. Are these problems worth working on incrementally or all at once? Which ones are most important? (Diaconis)

• Penny: differences in nucleotide composition?

Q. Can we define the geometry (distance) on tree space to behave well with respect to a particular model?

• Vert: what are the implications of the noise in the definition of the tree space? Instead of defining the geometry of tree space beforehand and then asking what properties it satisfies, would it be possible to define a tree space in such a way that it has good properties (is stable, averages are at least as good as the trees themselves, etc.)?

• Diaconis: this is perhaps related to statistical geometry or information geometry: using the Fisher information to define a Riemannian metric on the space of distributions.

• Epstein: an average is a summary statistic, but doesn't summarize everything we might want.

• Evans: information geometry example: using the upper half plane to parametrize normals on a line, the geometry that best reflects closeness of normals is hyperbolic geometry.
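Evans's example can be made concrete with a small numerical sketch. It assumes the standard Fisher information metric on the family of normal distributions, parametrized by mean and standard deviation (the upper half plane), for which the distance has a closed form in terms of hyperbolic distance. The function name `fisher_rao_normal` is chosen here for illustration and does not come from the discussion.

```python
import math


def fisher_rao_normal(mu1, sigma1, mu2, sigma2):
    """Fisher-Rao distance between N(mu1, sigma1^2) and N(mu2, sigma2^2).

    In (mu, sigma) coordinates the Fisher metric is
        ds^2 = (dmu^2 + 2 dsigma^2) / sigma^2,
    which is twice the hyperbolic (Poincare half-plane) metric in the
    rescaled coordinates (mu / sqrt(2), sigma). The distance is therefore
    sqrt(2) times the hyperbolic distance between the rescaled points.
    """
    if sigma1 <= 0 or sigma2 <= 0:
        raise ValueError("standard deviations must be positive")
    # Squared Euclidean separation in the rescaled half-plane coordinates.
    num = (mu1 - mu2) ** 2 / 2.0 + (sigma1 - sigma2) ** 2
    return math.sqrt(2.0) * math.acosh(1.0 + num / (2.0 * sigma1 * sigma2))


if __name__ == "__main__":
    # Same shift in the mean, very different distances: narrow normals are
    # easy to tell apart, wide ones are not.
    print(fisher_rao_normal(0.0, 0.1, 1.0, 0.1))
    print(fisher_rao_normal(0.0, 10.0, 1.0, 10.0))
```

Two normals whose means differ by 1 come out far apart when both standard deviations are 0.1 and close together when both are 10, which is the hyperbolic behaviour the example refers to: distances shrink as one moves up the half plane.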

Q. What are residuals for trees, and how do we estimate the residuals? (Residuals measure how far each data point is from the fitted tree.) Are there graphical methods for detecting departure from the model? (Penny)

• Penny also asked why there are so many definitions of maximum likelihood.

• Diaconis suggested that one topic that probabilists and biologists may both benefit from is the work of Aldous and his students on analyzing rates of convergence of random walks on phylogenetic trees. Aldous responded by noting that the case they can analyze is not necessarily so useful to biologists (one where a leaf is taken off and pushed somewhere else). For an n-leaf tree, the random walk needs random steps. A more realistic example is to cut somewhere and attach the whole subtree in a different place. It is believed that about steps is needed to mix this kind of chain. This could be useful in MCMC algorithms.
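The leaf move mentioned above (take a leaf off and push it somewhere else) can be sketched in a few lines. This is only an illustration under stated assumptions, not the exact chain analyzed by Aldous and his students: trees are taken to be unrooted and binary, leaves are labelled 0..n-1, internal nodes n..2n-3, and the leaf and the target edge are both chosen uniformly at random. All function names are hypothetical.

```python
import random


def edges(adj):
    """List each undirected edge once as a pair (u, v) with u < v."""
    return [(u, v) for u in adj for v in adj[u] if u < v]


def attach_leaf(adj, leaf, edge, hub):
    """Subdivide `edge` with the internal node `hub` and hang `leaf` from it."""
    u, v = edge
    adj[u].remove(v)
    adj[v].remove(u)
    adj[hub] = {u, v, leaf}
    adj[u].add(hub)
    adj[v].add(hub)
    adj[leaf] = {hub}


def detach_leaf(adj, leaf):
    """Remove `leaf` and its hub, rejoining the hub's two other neighbours.

    Returns the freed hub label so that it can be reused on reattachment.
    """
    (hub,) = adj.pop(leaf)
    u, v = adj.pop(hub) - {leaf}
    adj[u].remove(hub)
    adj[v].remove(hub)
    adj[u].add(v)
    adj[v].add(u)
    return hub


def random_tree(n, rng):
    """Grow an unrooted binary tree on n >= 3 leaves by random attachment."""
    adj = {0: {n}, 1: {n}, 2: {n}, n: {0, 1, 2}}
    for leaf in range(3, n):
        attach_leaf(adj, leaf, rng.choice(edges(adj)), n + leaf - 2)
    return adj


def leaf_move_step(adj, n, rng):
    """One step of the walk: detach a random leaf, reattach on a random edge."""
    leaf = rng.randrange(n)
    hub = detach_leaf(adj, leaf)
    attach_leaf(adj, leaf, rng.choice(edges(adj)), hub)


if __name__ == "__main__":
    rng = random.Random(0)
    n = 6
    tree = random_tree(n, rng)
    for _ in range(100):
        leaf_move_step(tree, n, rng)
    print(sorted(edges(tree)))
```

The more realistic move described in the discussion, cutting an edge and reattaching a whole subtree, would replace the single leaf by the subtree on one side of a randomly chosen edge; the adjacency-dictionary representation used here would still apply.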

Q. Random walks on sets of trees (using tree rotations): how fast do they converge?

• Once again, the question of how to measure distance in the space of trees emerged.

Q. What are appropriate measures on tree space?

• Glenn asked if there is a need for a non-parametric likelihood on trees. Diaconis said that empirical likelihood is related to the bootstrap, so it could contribute.

Q. If exact distances aren't easy to compute, can we obtain any upper and lower bounds for distances (such as the BHV metric)? (Su)
