Paul Lewis Lab research projects

Model Selection

[Figure: plot of distributions along the path from prior to posterior]

In collaboration with Ming-Hui Chen and Lynn Kuo in the UConn Statistics department, my lab has worked on the problem of estimating the marginal likelihood, a quantity used to compare models. The marginal likelihood measures the average fit of a model to a data set; the model with the best average fit is deemed best. Estimating the marginal likelihood accurately is not easy, however. The method we came up with is called the stepping-stone method.

The figure on the left shows a simple example involving only two sequences and a model with 2 parameters: the edge length (ν) and the transition/transversion rate ratio (κ). The stepping-stone method uses a series of probability distributions ranging from the posterior distribution to a reference distribution (here the prior distribution). The prior distribution in a Bayesian analysis represents our belief in the values of the model parameters without reference to the sequences we’ve observed, while the posterior represents our belief after the sequences have been taken into account. The marginal likelihood is related to both. It is a weighted average of the likelihood (the probability of the data given the model), with the prior distribution providing the weights, and it is also the constant used to normalize the posterior distribution (i.e., to scale it so that it represents a proper probability distribution). The stepping-stone method uses samples from the prior, the posterior, and several distributions that lie between them to provide an accurate estimate of the marginal likelihood. See Xie et al. (2011) and Fan et al. (2011) for a more complete explanation.
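The idea can be illustrated with a minimal Python sketch. This is not the phylogenetic implementation described in the papers: it assumes a toy model (data drawn from a Normal(mu, 1) distribution with a Normal(0, tau^2) prior on mu) chosen because every intermediate distribution can be sampled exactly and the true marginal likelihood is known in closed form. Each intermediate distribution is proportional to likelihood^beta times the prior, with beta running from 0 (prior) to 1 (posterior). The function names, the number of steps, and the spacing of the beta values (skewed toward the prior end) are illustrative assumptions.

```python
import numpy as np

def log_likelihood(mu, y):
    """Log-likelihood of i.i.d. Normal(mu, 1) data, one value per entry of mu."""
    mu = np.asarray(mu, dtype=float)[:, None]
    sq = np.sum((y[None, :] - mu) ** 2, axis=1)
    return -0.5 * y.size * np.log(2.0 * np.pi) - 0.5 * sq

def sample_power_posterior(beta, y, tau, n, rng):
    """Draw n values of mu from the density proportional to likelihood**beta * prior.

    For this toy model the density is normal, so it can be sampled exactly
    (a real analysis would use MCMC here instead).
    """
    precision = beta * y.size + 1.0 / tau**2
    mean = beta * y.sum() / precision
    return rng.normal(mean, np.sqrt(1.0 / precision), size=n)

def stepping_stone_log_ml(y, tau, betas, n, rng):
    """Estimate the log marginal likelihood as a sum of log ratios.

    Each ratio of normalizing constants c(beta_k) / c(beta_{k-1}) is estimated
    by averaging likelihood**(beta_k - beta_{k-1}) over samples drawn from the
    distribution at beta_{k-1}.
    """
    log_ml = 0.0
    for b_prev, b_next in zip(betas[:-1], betas[1:]):
        mu = sample_power_posterior(b_prev, y, tau, n, rng)
        log_w = (b_next - b_prev) * log_likelihood(mu, y)
        shift = log_w.max()  # subtract the maximum for numerical stability
        log_ml += shift + np.log(np.mean(np.exp(log_w - shift)))
    return log_ml

def exact_log_ml(y, tau):
    """Closed-form log marginal likelihood for the toy model."""
    n = y.size
    quad = np.sum(y**2) - tau**2 * y.sum() ** 2 / (1.0 + n * tau**2)
    return -0.5 * (n * np.log(2.0 * np.pi) + np.log(1.0 + n * tau**2) + quad)

# Illustrative settings (all assumed): 20 simulated observations, 32 steps.
rng = np.random.default_rng(42)
y = rng.normal(1.0, 1.0, size=20)
tau = 2.0
K = 32
betas = (np.arange(K + 1) / K) ** (1.0 / 0.3)  # beta_0 = 0 (prior), beta_K = 1 (posterior)

print("stepping-stone estimate:", stepping_stone_log_ml(y, tau, betas, 2000, rng))
print("exact value:            ", exact_log_ml(y, tau))
```

In this toy setting the two printed values should agree closely. In a phylogenetic analysis the exact value is unavailable and each intermediate distribution must be sampled by MCMC.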


Xie, W., P. O. Lewis, Y. Fan, L. Kuo, and M.-H. Chen. 2011. Improving marginal likelihood estimation for Bayesian phylogenetic model selection. Systematic Biology 60:150-160.

Fan, Y., R. Wu, M.-H. Chen, L. Kuo, and P. O. Lewis. 2011. Choosing among partition models in Bayesian phylogenetics. Molecular Biology and Evolution 28(1):523-532.

Bayesian Star Tree Paradox

[Figure: star tree]

If sequence data are simulated using a 4-taxon star tree (such as the one shown on the left) and evaluated with standard software tools for Bayesian phylogenetic inference, one of the three possible fully-resolved trees is often supported very strongly. This is paradoxical in that most people expect the three possible resolutions to be equally supported in this case, but such an outcome is only seen when the sequence length is tiny (e.g. 1 site). It appears that uncertainty in this case is manifested in the inability to predict, from dataset to dataset, which of the three possible fully-resolved tree topologies will be favored. This behavior is troubling, and possible examples have been pointed out by several researchers. Many more potential examples can be found in the literature by looking for high posterior probabilities but low bootstrap support, combined with tiny internal edges.

We argue that the central problem here is the non-identifiability of the tree topology, and propose a solution using reversible-jump MCMC (rjMCMC). Our rjMCMC sampler visits not only fully-resolved tree topologies but also topologies containing hard polytomies. This effectively places a point-mass prior probability on polytomies, providing an alternative in situations in which a fully-resolved topology is not a reasonable option. The analysis can be made as conservative as desired by modifying the prior distribution assumed for topologies, but in our (albeit limited) experience it does not appear easy to destroy support for real edges by using a prior that strongly supports polytomous topologies.
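The role of the topology prior can be illustrated with simple Bayes-rule arithmetic. The sketch below is not our rjMCMC sampler: it only combines a prior over the four topologies of a 4-taxon problem (the star tree plus the three resolutions) with a log marginal likelihood for each. The log marginal likelihood values are invented purely for illustration, and the function name and the prior values are likewise assumptions.

```python
import math

def topology_posterior(log_marginal_likelihoods, prior_probs):
    """Posterior probability of each topology: prior times marginal likelihood, normalized."""
    log_post = {
        t: log_marginal_likelihoods[t] + math.log(prior_probs[t])
        for t in prior_probs
    }
    shift = max(log_post.values())  # subtract the maximum for numerical stability
    unnorm = {t: math.exp(v - shift) for t, v in log_post.items()}
    total = sum(unnorm.values())
    return {t: u / total for t, u in unnorm.items()}

# Hypothetical log marginal likelihoods (made-up values, not results).
log_ml = {
    "star": -1000.0,
    "((A,B),C,D)": -999.0,
    "((A,C),B,D)": -1000.2,
    "((A,D),B,C)": -1000.3,
}

# Shift prior mass toward the polytomy and see how the posterior responds.
for star_prior in (0.01, 0.5, 0.9):
    resolved_prior = (1.0 - star_prior) / 3.0
    prior = {t: (star_prior if t == "star" else resolved_prior) for t in log_ml}
    post = topology_posterior(log_ml, prior)
    print(star_prior, {t: round(p, 3) for t, p in post.items()})
```

Raising the prior probability of the star tree makes the analysis more conservative: a resolved topology keeps high posterior probability only if the data favor it strongly enough to overcome the prior.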

Reference: Lewis, P. O., M. T. Holder, and K. E. Holsinger. 2005. Polytomies and Bayesian phylogenetic inference. Systematic Biology 54(2):241-253.

Phylodiversity in Desert Green Algae (the other land plants)

A major thrust in the laboratory of Louise Lewis is the diversity and systematics of green algae (Phylum Chlorophyta) living in the soils of North American deserts. These unicellular green algae are capable of tolerating the harsh conditions posed by desert soil environments, and represent an important (yet poorly understood) component of desert microbiotic crust communities. The 18S rDNA sequences of a number of green algal isolates have been determined, and these data suggest that several lineages of green algae have diversified within deserts. One might be tempted to think that the green algal cells isolated from desert soils are simply the result of spores dispersed into deserts from distant aquatic sources. This study shows that the 18S sequences of these desert isolates are more divergent from their nearest aquatic relatives than would be predicted if they were merely incidental visitors. We characterize the molecular phylodiversity of desert green algae and demonstrate with a Bayesian analysis of 150 green algal 18S sequences that all freshwater classes of green algae have yielded desert lineages. The numerous transitions from aquatic to desert existence apparent from the phylogeny argue that it is no longer accurate to portray land plants as resulting from a single origin. The highly celebrated origin leading to the embryophytes is but one of many transitions to terrestriality.

Reference: Lewis, L. A., and P. O. Lewis. 2005. Unearthing the molecular phylodiversity of desert soil green algae (Chlorophyta). Systematic Biology 54(6):936-947.