The configuration is handled by autoconf/automake, and general installation instructions can be found in the file INSTALL. I've tested only on Debian Linux machines (Intel and AMD64) and Mac OS X (using Fink), but it may work on Windows using Cygwin.
$ tar zxvf biomc2-1.9.tgz
$ cd biomc2-1.9/build
$ ../configure --enable-static-binary --prefix=$HOME
$ make install
The "build" directory is where the compilation will be done. It is not necessary, but it is good practice to build the program in a separate directory. You may choose some other directory at will, adapting the instructions as necessary.
The option "--enable-static-binary" will link the compiled files with "-static", so that you can copy the binaries to other machines with the same architecture/OS but that lack the libraries. I had problems compiling static binaries myself because of the GTK libraries (libcairo.a).
Another option specific to these programs is "--enable-optimization", which adds optimization flags when compiling. The program may become faster, but I don't know if it works for other architectures or older compilers.
An important parameter is "--prefix=SOMEWHERE", since the compiled programs will go to "SOMEWHERE/bin/" and so forth. The default is '/usr/local', but it may be better to install it locally. In the example above, it will go to your home directory ($HOME).
$ apt-get install libgtk2.0-dev
This is the main program which, given a DNA alignment and a tree file, will try to estimate the recombination break-points and the number of recombinations for each break-point. The alignment file should be in INTERLEAVED nexus format, or at least contain the word "INTERLEAVE", since as far as I understand the non-interleaved format is a special case (the whole sequence in one line).
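If you want to check an alignment before starting a long run, a minimal sketch like the one below looks for the word INTERLEAVE inside the FORMAT command of a nexus file. The function name declares_interleave is hypothetical, and how biomc2 itself parses the file is not described here, so treat this only as a quick sanity check.

```python
import re


def declares_interleave(nexus_text: str) -> bool:
    """Return True if any FORMAT command in the nexus text mentions INTERLEAVE.

    Assumption: the keyword sits inside a 'format ... ;' command and is not
    hidden inside a [comment]. Nexus keywords are case-insensitive.
    """
    flags = re.IGNORECASE | re.DOTALL
    for command in re.finditer(r"\bformat\b(.*?);", nexus_text, flags=flags):
        if re.search(r"\binterleave\b", command.group(1), flags=re.IGNORECASE):
            return True
    return False


# A tiny interleaved alignment, made up for illustration only.
EXAMPLE_NEXUS = """#NEXUS
begin data;
  dimensions ntax=2 nchar=8;
  format datatype=dna interleave;
  matrix
  a ACGT
  b ACGA

  a TTGA
  b TTGC
  ;
end;
"""
```

Calling declares_interleave(open("my_alignment.nex").read()) on your own file (the file name is just a placeholder) tells you whether the keyword is present.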
The tree file is just an initial guess and should be in nexus format. If you feel uncomfortable about the arbitrariness of this initial state you can give a tree file with many trees, and the program will choose one at random. All trees should be bifurcating (no polytomies), with the exception of the root node, since unrooted trees are represented by a trifurcation in parenthetic format. The program can read both rooted and unrooted trees, since in the end it removes the root node if there is one.
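The bifurcation requirement can be checked on a parenthetic (Newick) string before running the sampler. The sketch below is not part of biomc2; the function names are hypothetical, and it assumes taxon labels are unquoted and contain no commas or parentheses.

```python
def child_counts(newick: str) -> list[int]:
    """Number of children of every internal node, in closing order (root last)."""
    counts = []
    open_nodes = []  # one running child count per currently open parenthesis
    for ch in newick:
        if ch == "(":
            open_nodes.append(1)  # an internal node starts with its first child
        elif ch == ",":
            if not open_nodes:
                raise ValueError("comma outside any parenthesis")
            open_nodes[-1] += 1  # each comma adds one more child
        elif ch == ")":
            if not open_nodes:
                raise ValueError("unbalanced parentheses")
            counts.append(open_nodes.pop())
        elif ch == ";":
            break
    if open_nodes:
        raise ValueError("unbalanced parentheses")
    return counts


def is_acceptable_tree(newick: str) -> bool:
    """True if all internal nodes are bifurcating; the root may have 2 or 3 children."""
    counts = child_counts(newick)
    if not counts:
        return False
    *inner, root = counts
    return all(c == 2 for c in inner) and root in (2, 3)
```

For instance, is_acceptable_tree("((A,B),C,D);") accepts an unrooted four-taxon tree, while a tree with a polytomy below the root is rejected.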
The only input parameter to the program is the control file name. In this file there will be information about the DNA and tree files, along with other model parameters. A template file can be found at example/ctrl.biomc2.
The program prints progress information to the screen, consisting of the acceptance rates for each move and a point estimate of the number of recombinations nSPR and the number of break-points nCOP in the form [nSPR nCOP]. It also outputs the sampled values of each parameter to the files post.tre and post.dist. If you are recovering the prior distribution (in other words, if you set "prior=1" in the control file) then the output files will be named prior.tre and prior.dist.
The file post.dist has a variable number of columns and should be processed by the program biomc2.summarise. The first six columns (after the third row) are respectively
The file post.tre is a standard nexus tree file with trees mapped to segments and samples. You need biomc2.summarise to make sense of them, though.
This program reads the post.tre and post.dist output from the MCMC sampler and summarises the posterior distribution of break-points. It has changed a lot from the previous version, and is focused on determining a likely mosaic structure. The caveat is that it can no longer summarise the posterior distribution of average rates per segment, but this was not our main purpose in the first place. If necessary these and other values can be output with a little programming (in the function sample_to_output_file() of the file run_sampler.c).
The program uses two main approaches to estimate a break-point mosaic structure: the piece-wise median and the centroid sample. The piece-wise median is based on the idea that the posterior distribution of break-points will have several modes, one for each break-point, along the alignment. Thus we partition the distribution based on the CDF where the number of partitions equals the median number of break-points, and for each partition we find its median value. The centroid sample is in fact the sample whose mosaic structure minimizes the total distance to all other samples. This distance algorithm will be published in the Annals of the Institute of Statistical Mathematics. For each estimated mosaic structure we infer the topology as the MAP topology over all segments composing the non-recombinant regions.
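To make the two ideas concrete, here is a minimal sketch of both estimators; it is not the code used by biomc2.summarise. It assumes that each posterior sample is simply a list of break-point positions, that the CDF partitions hold equal shares of the pooled positions, and it uses the size of the symmetric difference between break-point sets as a stand-in distance, since the actual distance algorithm is not described here. All function names are hypothetical.

```python
import statistics


def piecewise_median(samples):
    """Split the pooled break-point positions into k equal-mass parts and
    return the median of each part, where k is the median number of
    break-points per sample."""
    k = int(statistics.median(len(s) for s in samples))
    pooled = sorted(pos for s in samples for pos in s)
    if k == 0 or not pooled:
        return []
    n = len(pooled)
    # Part i covers the pooled positions between quantiles i/k and (i+1)/k.
    return [statistics.median(pooled[i * n // k:(i + 1) * n // k])
            for i in range(k)]


def symdiff_distance(a, b):
    """Placeholder distance: number of break-points present in only one sample."""
    return len(set(a) ^ set(b))


def centroid_sample(samples, distance=symdiff_distance):
    """Return the sample with the smallest total distance to all samples."""
    totals = [sum(distance(s, t) for t in samples) for s in samples]
    return samples[totals.index(min(totals))]


# Made-up posterior samples of break-point positions, for illustration only.
EXAMPLE_SAMPLES = [[100, 300], [102, 298], [100, 300], [98, 305], [101]]
```

The piece-wise median returns one location per inferred break-point even when no single sample contains exactly those locations, whereas the centroid is always one of the sampled mosaic structures.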
The program output is composed of three files: mosaicTreeFreq.txt, mosaicTrees.tre and recomb_freq.pdf. The file mosaicTrees.tre contains a list of all relevant topologies, each named by the character "t" followed by a numerical ID (used by mosaicTreeFreq.txt). The file mosaicTreeFreq.txt contains a table with 8 columns, where each row represents one segment:
If you are curious: if a third argument is given to biomc2.summarise with a tree file it will compare the trees in this file with the posterior trees and output the corresponding index (tree ID) in the posterior samples. I used this obscure option when comparing different tree reconstruction procedures using simulated data, where we know the true tree and want to see if the sampler can find it. Don't worry about this option.
Given an initial tree, it simulates new topologies whose recombination distance from the previously simulated tree is given by the user. The initial tree is sampled from 'tree_file', and the program will create a chain of topologies such that all have the same recombination distance from their neighbors. The program may not simulate the given recombination distance perfectly, since it applies the specified number of SPRs to the current tree to generate the next tree, and for large numbers we may have cycles. The program tries to minimize this by choosing the prune and regraft edges without replacement. The output will contain the tree with uniformly sampled branch lengths, preceded by the simulation number and the real and estimated recombination distances, respectively. The program will also print to stderr the mean and standard deviation of a few statistics: the overall error, given by the overestimated distance minus the underestimated distance; the modular error, given by the overestimated distance plus the underestimated distance; the overestimation error; and the underestimation error.
This program is not necessary for recombination detection and can be run independently from the analysis.
Calculates d_SPR between a pair of tree files and outputs the histogram of distances. A special case is when each file has only one tree, where the program will output the approximate SPR distance. A file recomb_leaves.txt will contain the frequency of recombination for each leaf. Remember that this frequency is an approximation. (This program is not described in the paper and we don't use it yet.)
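The four error statistics can be reproduced from the real and estimated distances in the output. The sketch below shows one reading of the definitions above and is not the program's own code: it assumes that, for each simulation, the overestimation is max(estimated - real, 0) and the underestimation is max(real - estimated, 0), and it uses the population standard deviation (whether the program uses the population or the sample version is not stated). The function name is hypothetical.

```python
import statistics


def error_summary(real, estimated):
    """Return {statistic name: (mean, standard deviation)} over all simulations."""
    over = [max(e - r, 0) for r, e in zip(real, estimated)]
    under = [max(r - e, 0) for r, e in zip(real, estimated)]
    series = {
        # over minus under: signed error, positive when overestimating
        "overall": [o - u for o, u in zip(over, under)],
        # over plus under: absolute error, regardless of direction
        "modular": [o + u for o, u in zip(over, under)],
        "overestimation": over,
        "underestimation": under,
    }
    return {name: (statistics.mean(values), statistics.pstdev(values))
            for name, values in series.items()}
```

Under this reading the overall error can be close to zero even when the estimates are poor, because over- and underestimates cancel out; the modular error does not have this problem.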