From Lab to the World

Course materials for Michael Dunn's course at the 2018 LOT Winter School

Lecture outline and readings

Software requirements

Exercise 1: Bayesian Phylogenetic Inference with MrBayes
Datafile: Aslian28.nex

Exercise 2: Constraints and multiple partitions
Datafiles: new-britain-swadesh.nex, new-britain-structural.nex

Exercise 3: Ancestral State Reconstruction and the DISCRETE test
Datafiles: Aslian28+structure1.nex, Aslian28.nex.trees, Aslian-complexity.txt, Aslian-semantics.txt

1. Inferring phylogenies with MrBayes

The MrBayes software package is operated from the application console (a special terminal just for MrBayes provided by the package itself); you can also use it with the unix/linux terminal if you are already familiar with that.

Bayesian Phylogenetic inference using MrBayes consists o the following steps:

1.1 Prepare and load the data (e.g. in a Nexus format file

1.2 Set the evolutionary model and priors

Minimally you have to set the model (using lset) and the priors (using prset).

Model

We’ll set two things, the substitution model and the rates model. The substition model is set by nst, and has three possible values (1, 2 or 6).

code name
1 F81
2 HKY
6 GTR

Most of these models are designed for nucleotides (A, C, T, and G); only the F81 model is suitable for binary (‘restriction’) data. F81 is parameterised as a single rate for q01, while the rate q10 is derived from the relative frequencies of 0 versus 1 in the data.

lset nst=1

The gamma rates model assumes that features can be classified into a certain number of rates classes.::

lset rates=gamma

By default MrBayes uses four rate categories, but if you want a different number you can set it (note that processing time increases proportionally to the number of categories). Here I speed things up by reducing the number of classes to two::

lset ngammacat=2

You can set all these things in one go if you want (I’m going to put ngammacat back to four, because it’s a good compromise)::

lset nst=1 rates=gamma ngammacat=4 

Priors

For better for worse MrBayes chooses quite workable priors all by itself. But you should really aim to work out what MrBayes is doing, and make informed decisions of your own.

Relevant priors include information about

To review the model settings before the analysis you can enter showmodel.

1.3 Run the analysis

You set the parameters of the Monte Carlo Markov Chain search with mcmcp. The arguments include:

For example, mcmcp filename=mytest tells MrBayes to write the output to files with the prefix mytest.

Once we’ve got some output to look at we’ll probably want to change ngen and samplefreq, but you can’t decide this ahead of time. You can also set these parameters on the mcmc command, which launches the Monte Carlo Markov Chain search::

mcmc ngen=1000000 samplefreq=1000

Take a look at how fast the data is scrolling past (rows are numbered) and estimate how long the run will take to complete. You probably have time for a cup of coffee.

1.4 Check convergence

When the chain comes to an end (and you are running MrBayes interactively rather than controlling it by a script) the programme gives you the opportunity to continue the search for longer::

     Average standard deviation of split frequencies: 0.030479

     Continue with analysis? (yes/no):

The rule of thumb here is that if the average standard deviation of split frequencies is less than 0.01 then you’ve probably run the chain long enough. Otherwise enter yes and let it carry on.

The other way to check convergence is to look at the likelihood trace. You can do this by loading the log file into Tracer http://tree.bio.ed.ac.uk/software/tracer/, or if you are an R user, you read it into R and plot the values::

# Read in the the logfile, skipping lines starting with "["
R> logfile <- read.delim("Aslian28.nex.run1.p", comment.char="[") 
R> plot(logfile$LnL, type="l")   # plot the trace as a line
# get rid of the first e.g. 500 samples to see how stationary
# the run is after the burn-in
R> plot(logfile$LnL[-c(1:500)], type="l")

1.5 Summarize the samples

You can then repeat the analysis using other evolutionary models, and compare how they fit in order to select the best model to describe your data.

Checking the data

Did you notice the warning when you read in the data to MrBayes?:

.. WARNING:: There are 69 characters incompatible with the specified coding
             bias. These characters will be excluded.

Let’s use R to try to work out what’s going on. We’ll read in the data, and look at the values::

R> library(ape)
R> d <- read.nexus.data("Aslian28.nex")
# convert the data to a data frame
R> d <- as.data.frame(d)
# transpose rows and columns
R> d <- t(d)
# get the row names
R> lx <- rownames(d)
# apply the as.numeric function to the columns (option 2) of d
R> d <- apply(d, 2, as.numeric)
# put the row names back (I don't know why they disappeared)
R> rownames(d) <- lx
# apply the sum.numeric function to the columns (option 2) of d
# in order to count how many ones there are in each column
R> ones <- apply(d, 2, sum)
# there are 68 columns with all zero
R> table(ones)
ones
 0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15 ...
68  13 147  78  45  39  24  19  18  13   8  12  12  12   6  10 

So there are 68 columns with all zero in the database. The warning message refered to 69::

# check how many languages
R> nrow(d)
[1] 28
# there is one column with all ones!
R> table(ones)[c("28")]
28 
 1 

And 68 + 1 = 69. This must be it.

From the MrBayes manual:

A problem with some binary data sets, notably restriction sites, is that there is an ascertainment (coding) bias such that certain characters will always be missing from the observed data. It is impossible, for instance, to detect restriction sites that are absent in all of the studied taxa. MrBayes corrects for this bias by calculating likelihoods conditional on the unobservable characters being absent (Felsenstein, 1992). The ascertainment (coding) bias is selected using lset coding. There are five options: (1) there is no bias, all types of characters could, in principle, be observed (lset coding=all); (2) characters that are absent (state 0) in all taxa cannot be observed (lset coding=noabsencesites); (3) characters that are present (state 1) in all taxa cannot be observed (lset coding=nopresencesites); (4) characters that are constant (either state 0 or 1) in all taxa cannot be observed (lset coding=variable); and (5) only characters that are parsimony-informative have been scored (lset coding=informative). For restriction sites it is typically true that all-absence sites cannot be observed, so the correct coding bias option is noabsencesites.

Interpreting your results

There are lots of ways to look at your results

How to interpret a splitsgraph

If you do choose to reduce your sample to a single tree, use R or FigTree to make a nice picture out of it.

Reducing your tree sample to a single tree is a convenience for visualization, not a part of the analysis. You should do any calculations that you're interested in on the basis of the complete sample.  

Running MrBayes from a control file

Rather than typing in all the MrBayes commands each time you tweak an analysis, it is much more reliable to use a “MrBayes” block in your nexus file. This block starts with begin mrbayes; and (like any other nexus block) ends with end;. The MrBayes block can contain any command that you type into the MrBayes console; but don’t forget to end them with a semicolon (the nexus standard — this also allows for more readable, multiline commands. It’s good practice to include comments to yourself explaining what the control file is supposed to do. Nexus comments go in [ square brackets ]. Here is an example of somewhat elaborate MrBayes block::

[ topology constrained to include South and Central Aslian ]

begin mrbayes;
  log start filename=test.log replace;	
  constraint southaslian = Semaq_Beri_Brw Semaq_Beri_Jaboy
      Semelai Mah_Meri;
  constraint centralaslian = Semnam_Bal Semnam_Malau Lanoh_Kertei
      Temiar_Kelantan Temiar_Perak Semai_Ringlet Semai_Kampar;
  prset statefreqpr=fixed(empirical) 
      brlenspr=unconstrained:exponential(0.05);
  prset topologypr = constraints(southaslian,centralaslian);
  lset nst=1 rates=gamma ngammacat=4 coding=noabsencesites;
  mcmc ngen=1500000			
      printfreq=1000
      samplefreq=1500				
      nchains=4 					
      savebrlens=yes				
      nruns = 2
  ;
  sumt burnin=1000;					
  sump;
  plot filename=Aslian28.nex.run1.p match=all;
  plot filename=Aslian28.nex.run2.p match=all;
  log stop;
end;

You can then run this file in MrBayes using the command line mb myfile.nex (assuming that the file with the data in it is called myfile.nex. If you want to be left with an interactive console once everything has been run, use the -i flag, i.e. mb -i myfile.nex. Another neat way to do it is to keep the data in one nexus file (e.g. like the Aslian28.nex file used in the exercises above, and then make a control file with just the mrbayes block. Include execute Aslian28.nex; as the first line, after begin mrbayes;. This way you can have multiple control files with different settings, but all using the same data file.