Sunday Discussion on Biological Issues

The discussion was moderated by John Luecke. Susan Holmes opened the discussion by suggesting that we might ask the biologists: what mathematical or biological questions related to phylogenetic trees are most important to biologists? She also invited clarifications about things that have been addressed in talks earlier in the day.

Q. What properties of phylogenetic trees make them different from random trees? Is there some kind of structure that makes them different? (Epstein)

Phylogenetic trees don't look like trees generated by random branching processes. (Huelsenbeck)

Q. What are the characteristics that a ``distance between trees'' should have that biologists would want? (Luecke)

Luecke: For instance, the Billera-Holmes-Vogtmann (BHV) metric measures distances along geodesics in tree space, and the trees change as you move along the geodesic, but does this have a reasonable biological interpretation?
Sober: that does not have biological meaning. Biologists just want to know how to tell if trees are similar or not.
Felsenstein: you are assuming we (biologists) want a notion of distance between trees?
(Scribe: this appears to have been in contrast to just a notion of similarity. This elicited remarks from mathematicians about the fruitfulness of using a metric to tell how similar two trees are.)
Huelsenbeck pointed out several ways to measure how different trees are, e.g., involving the contraction/expansion of edges to get from one to the other, or differences in length of edges, or squared branch length, etc. Some discussion of the Robinson-Foulds distance ensued.
Luecke: so computational simplicity would be one important consideration for a distance.
Huelsenbeck: Any distance notion on phylogenetic trees should have a good theoretical foundation.
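As a concrete reference point for the Robinson-Foulds distance mentioned in this exchange, here is a minimal sketch. It assumes trees are written as nested tuples of taxon names and are compared as unrooted topologies (branch lengths ignored). The function names and the example taxa are illustrative and do not come from the discussion.

```python
def leaves(tree):
    """Return the set of taxon names below a node (a leaf is a plain string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def splits(tree):
    """Collect the non-trivial bipartitions induced by the internal edges.

    Each bipartition is stored as the side that does NOT contain a fixed
    reference taxon, so the same split is always written the same way.
    """
    all_leaves = leaves(tree)
    ref = min(all_leaves)  # arbitrary but consistent reference taxon
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - clade if ref in clade else clade
        # Skip trivial splits (a single leaf against everything else).
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        return clade

    visit(tree)
    return found


def robinson_foulds(t1, t2):
    """Number of splits present in exactly one of the two trees."""
    if leaves(t1) != leaves(t2):
        raise ValueError("trees must be on the same set of taxa")
    return len(splits(t1) ^ splits(t2))


t1 = ((("human", "chimp"), "gorilla"), ("orangutan", "gibbon"))
t2 = ((("chimp", "gorilla"), "human"), ("orangutan", "gibbon"))
print(robinson_foulds(t1, t2))
```

The two example trees disagree only on which pair among human, chimp and gorilla is grouped first, so they differ in one split each. The computation is a set comparison, which is the kind of computational simplicity Luecke refers to.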

Q. Is there a difference between some real distribution of trees on tree space and some random distribution? (Holmes)

Evans: this will help us to construct meaningful confidence sets on trees.
Felsenstein: Tree spaces are weird. For a cloud of trees, we want to characterize where in this space those trees are.

Q. Problem: make notions of distances accessible to biologists, so they might use them. (Huelsenbeck)

Some comments were made about the use in biology of methods such as Kuhner-Felsenstein and weighted Robinson-Foulds.
Diaconis pointed out that in his work on non-standard data structures, the notions of distances are very useful and basic. In doing exploratory analyses, they can be useful in statistical tasks. So, what do biologists think of these distances?
Holmes pointed out that how fast MCMC methods converge depends on the geometry of the space, and the metric chosen.
Huelsenbeck suggested that a useful task for mathematicians would be to explain the uses of these distances to biologists (for example, in an article pitched to biologists).

More comments about tree space topology and branch lengths:

Huelsenbeck: are there versions of tree space that biologists might not be interested in, due to technical conditions? For instance, if you give a program an alignment, the branches have to be short enough so that you can align them.
Felsenstein: In many of the cases in biology that we are interested in grappling with, such as the origins of mammals, many of the orders seem to have popped up rather quickly. We could be interested in (a) the branch length and (b) the topology.
When the orders diverged, important things were happening very quickly, such as morphological changes. But the molecules you are studying may not have been involved. If you are interested in those morphological changes, it is important to know what order the branching occurred.
On the other hand, in some other questions, topology may not be as important as branch length. It really depends on the question you are asking.
Sober: A third issue is important: the character states of the interior nodes of trees. You might want to know the sequence of changes that occurred on some branch. For instance, in some calculations you may want to integrate over all possible states of the interior nodes.
Felsenstein pointed out that these states on interior nodes are sometimes overinterpreted-- they are often viewed as actual states, rather than estimates.
Penny: in biology, knowing the ``true'' tree may only be the starting point of the real investigation, which concerns some other aspect that the biologist is truly interested in. Huelsenbeck gave an example of an evolutionary biologist interested in sexual selection. Although she made a phylogeny (what she thought was the best tree), she wasn't primarily interested in the phylogeny, but in further questions about sexual selection.

Q. How to pick a tree (or tree average) from a set of trees resulting from data?

Epstein: with bacterial DNA fragments, each fragment gives a tree. Which one to use? One needs a tree distance to analyze the resulting 22 different trees. The distance he used was the BHV metric, but he suspects one obtains much the same answer with many different tree metrics.
Evans: this is similar to the Mallows model in statistics, where you have a distribution on permutations, centered on a permutation, dropping off in some radial sense. Here, you have a central tree, and then the probability that you observe some other tree dies off as you move farther away.
Penny: one approach to identifying ``bad'' portions of a tree is to try to ``identify the guilty taxon'', by successively removing taxa and then seeing whether this stabilizes the tree significantly.
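The Mallows model that Evans refers to assigns a permutation a probability proportional to exp(-theta * d), where d is its distance from a central permutation. The sketch below illustrates this on a small set of permutations, using the Kendall tau distance (number of discordant pairs) as one common choice of d. The function names and the value of theta are illustrative assumptions; the tree analogue Evans describes would replace the permutations by trees and d by a tree distance.

```python
from itertools import combinations, permutations
from math import exp


def kendall_tau(p, q):
    """Number of pairs of items that p and q place in opposite order."""
    position_in_q = {item: i for i, item in enumerate(q)}
    ranks = [position_in_q[item] for item in p]
    return sum(
        1
        for i, j in combinations(range(len(ranks)), 2)
        if ranks[i] > ranks[j]
    )


def mallows_pmf(center, theta):
    """Probability of every permutation under a Mallows model.

    Brute-force normalization over all permutations, so this is only
    practical for a handful of items.
    """
    weights = {
        perm: exp(-theta * kendall_tau(perm, center))
        for perm in permutations(center)
    }
    total = sum(weights.values())
    return {perm: w / total for perm, w in weights.items()}


center = ("a", "b", "c")
pmf = mallows_pmf(center, theta=1.0)  # theta = 1.0 is an arbitrary choice
for perm, prob in sorted(pmf.items(), key=lambda kv: -kv[1]):
    print(perm, round(prob, 3))
```

The central permutation receives the largest probability, and permutations with more discordant pairs receive less, which is the radial drop-off described above.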

Q. How to understand or deal with residuals, such as non-tree like data?

Holmes gave an example in which a friend has a reference tree and 8 different plumage trees. In tree space, if the plumage tree differs from the reference DNA tree in some direction, is that reflected in the way these things actually evolved?
In statistics, when doing regression, you compare data points to a fitted line, and ignore the ones that are way off. Here, in tree space, there's a similar question for non-tree like data: how much did I have to bend the data to make it into a tree?
Sober: how do you decide how much off is too much, before you are worried?
(Scribe's note: this seems to require a notion of distance not just on tree space, but some larger space-- such as on the space of DNA sequences-- in which tree space is embedded. We discussed embedding questions on Wednesday.)

More comments on distances:

Diaconis told a story where distances fit data remarkably well. Perception psychologist Roger Shepard studied the visual system by showing subjects a configuration of blocks, then some other configuration, and asking them: are they the same? He found that the time it took people to decide tracked the geodesic distance in the three-dimensional rotation group. The data are quite remarkable, and gave straight-line plots. We know what it means to measure distance in the rotation group (but why the brain should know about this is a fascinating subject in itself).
Similarly, does a distance between trees have to have some interpretation in terms of evolution, in order to be the ``right'' notion of distance?
Epstein: For instance, given DNA or amino acid sequences, one for each taxon, this data produces a tree. As sequences change or evolve one nucleotide at a time, the trees will change. What path does this trace out in tree space?
(Scribe's note: this appears to require a space of trees that includes trees with many different numbers of leaves, so that one can speak of how a tree evolves as species split off from one another.)
In response to a question, Felsenstein mentioned a few sources of variation in trees: (a) statistical error. (b) coalescence: take a gene copy in three species, and think of copies ancestral to these. The copies do not come together instantly, but have some stochastic chance of mixing, and they may come in some random order that conflicts with a species tree. (c) horizontal gene transfer.
(Scribe's note: a more detailed discussion of this topic can be found in Tuesday's discussion.)

Q. Which is better: concatenating DNA sequences first or averaging trees later?

Holmes: empirical studies show that if you take all the data and compute one tree, you generally do less well at estimating the tree than if you take the fragments, build a tree from each, and then average the trees. This comes from the CAT(0) property. The intuition behind it is that if you average in a negatively curved space, you converge much faster than you should.
St. John pointed out an example with 20 simulated DNA sequences and two different parts of the genome: the trees obtained from the two parts combined do not overlap with, and lie somewhat in between, the trees obtained from each part alone. Felsenstein commented that the stochastic effects pushed the two parts in different directions in tree space.
She also noted that in studying hybridization (e.g., sunflowers), she was surprised that she didn't get more overlap, and often the right answer was with part of the data, not all the data.
Felsenstein: Biologists are in disagreement about whether concatenating or averaging is better. Statistically, which is better?
It was pointed out that a paper by Cunningham (1997) does comparisons. Also, work by Amit, and statistical literature on boosting and bagging.
Billera: there's a notion of equivalent metrics in topology, and any of these will do. (In doing geometry, where issues like curvature come into play, the choice of metrics is important.) It may be the case that even though certain metrics are good enough for some purposes, it doesn't mean that other metrics aren't valid.

Q. What is the right notion of an ``average'' of trees?

The biologists present agreed that this was a very interesting question for biologists.
Vert: Two important ideas emerge when working with decision trees: (1) averaging decreases variance, and (2) averaging can leave the space. Boosting will leave the space; if we don't leave the space, it's really not boosting. The averaging we are discussing here doesn't leave the space of trees.
More discussion on averaging took place here. Holmes made a comment about the ``non-associativity'' of trees (related to considering trees of trees, when building trees one character at a time). Billera made some remarks about how averaging may be better than concatenation, but we have no guarantee that it is any good.
Penny discussed the notion of a median tree: the tree that is closest on average to all the others.
Felsenstein: other examples are the consensus tree, and majority rule consensus tree.
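The median tree and the majority rule consensus tree mentioned above can both be sketched in a few lines, under simplifying assumptions. Each tree is represented only by its set of splits (each split stored as one side of the bipartition, written consistently across trees), the median is searched for only among the sampled trees themselves, and the distance is the number of splits found in exactly one of two trees. The names and the example data are illustrative.

```python
from collections import Counter


def majority_rule(trees):
    """Splits that occur in strictly more than half of the trees."""
    counts = Counter(split for tree in trees for split in tree)
    return {split for split, n in counts.items() if n > len(trees) / 2}


def median_of_sample(trees):
    """The sampled tree with the smallest total distance to all the others.

    Distance here is the size of the symmetric difference of split sets.
    Restricting the search to the sample is a simplification; a median
    over all of tree space could be a tree that was never observed.
    """
    def total_distance(tree):
        return sum(len(tree ^ other) for other in trees)

    return min(trees, key=total_distance)


# Hypothetical splits on five taxa, each written as one side of the split.
human_chimp = frozenset({"human", "chimp"})
chimp_gorilla = frozenset({"chimp", "gorilla"})
orang_gibbon = frozenset({"orangutan", "gibbon"})

gene_trees = [
    {human_chimp, orang_gibbon},
    {human_chimp, orang_gibbon},
    {chimp_gorilla, orang_gibbon},
]

print(majority_rule(gene_trees))
print(median_of_sample(gene_trees))
```

In this small example both summaries keep the human-chimp grouping, since two of the three gene trees support it.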

Q. What are good properties of averaging? (Luecke)

Penny: would like a fully resolved binary tree, though he points out that some others might sacrifice that in favor of other features.
Felsenstein posed a problem about averaging properties: given a bunch of trees from different genes, suppose we study a question like: are chimps closer to humans than gorillas? Say 2/3 of the trees show humans with chimps, and the others show chimps with gorillas. Suppose that in the trees that group humans with chimps, the average length of that branch is 1, but the other trees don't have the branch. When you average, do you want it to be length 0.66 or length 1?
There was a question about how different the trees are that result from different averaging methods. It was pointed out (Huelsenbeck) that the differences between methods for a single gene are much smaller than the differences across genes.
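The two candidate answers in Felsenstein's branch-length question correspond to two conventions for trees that lack the branch: count it as having length zero, or leave those trees out of the average. A minimal sketch, using a hypothetical list with one entry per gene tree and None marking a tree in which the branch is absent:

```python
def average_branch_length(lengths, absent_as_zero):
    """Average the length of one branch across a collection of trees.

    lengths: one entry per tree; None means the branch is absent.
    absent_as_zero: if True, absent branches count as length 0;
                    if False, only trees containing the branch are averaged.
    """
    present = [x for x in lengths if x is not None]
    if absent_as_zero:
        return sum(present) / len(lengths)
    if not present:
        return None  # no tree contains the branch
    return sum(present) / len(present)


# Two of three gene trees group humans with chimps (branch length 1).
lengths = [1.0, 1.0, None]
print(average_branch_length(lengths, absent_as_zero=True))
print(average_branch_length(lengths, absent_as_zero=False))
```

The first convention gives 2/3 and the second gives 1. Which one is wanted is exactly the open question posed in the discussion.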

Q. What would be the distribution on trees that gives majority rule consensus as its average? (Holmes)

Holmes: would like any tree average to be an expected value with respect to some distribution on tree space.
