The model

Given an alignment of DNA sequences, we are very tempted to infer the phylogenetic tree relating those sequences. But in the presence of recombination, different regions of this alignment will support distinct topologies.

This alignment can be divided into three non-recombinant regions. The recombination took place between the ancestor of F.002 and the ancestor of B.002, with the

product of this recombination giving rise to the extant sequences Q1, Q2, Q3 and Q4. We can infer this scenario since a recombination can be described

by a subtree prune-and-regraft branch swap.

biomc2 tries to detect these regions and the possible topologies for them. There are many programs devised for this, but the distinguishing features of biomc2 are:

  • it is a Bayesian procedure, which means the "answer" given by the program is actually a distribution of possible answers. So we don't have only a point estimate for the locations of recombination break-points, but the posterior frequency of all possible locations. We don't have the "best" topology for each non-recombinant region, but the distribution of topologies for each segment.
  • it is in fact a hierarchical Bayesian procedure, so that the priors do not influence the outcome so much. This is accomplished since the parameters of the prior distributions themselves follow some (so-called hyperprior) distribution.
  • it is independent of a definition a priori of recombinant and non-recombinant sequences. Some methods cleverly assume that the topology is known and fixed (=no recombination) for some of the sequences, thus decreasing the complexity of the problem. If they are even more clever they do this for all possible combinations. biomc2 does not attempt to fix the topology and do the full exploitation at once.
  • the model works with large alignments theoretically, despite in practice the MCMC will be too slow with more than 30 or 40 sequences. If we have too many sequences we must split the alignment into smaller datasets.
  • a distance between neighboring topologies is used as a lower bound on the number of recombinations. As shown by the figure above we can, in theory, count the minimum number of recombinations between two trees. In practice we don't have an algorithm for that but biomc2 has a good approximation, which is modelled according to the assumption that fewer recombinations are more likely than many.
  • the posterior distribution of the topology distances have information about possible hotspots. If we only know the recombination mosaic structure (the breakpoint locations, like the colors in the figure above) it is hard to distinguish an ancestral recombination event from a recombination hotspot (region where recombination is more likely to occur, be it for physical-chemical reasons or selective pressure). The distance between trees implemented in biomc2 have a close relation with this: if the distance is larger than one then most certainly we need more than one recombination to explain the difference, which is indicative of a hotspot.

More background information can be found in my Ph.D. thesis, which focused on recombination detection on HIV sequences. You can also download a pdf version of the poster presented at SMBE2010 (available under a Creative Commons License), and read this blog post where I comment about it.