Course materials for Michael Dunn's course at the 2018 LOT Winter School
Exercise 1: Bayesian Phylogenetic Inference with MrBayes
Datafile: Aslian28.nex
Exercise 2: Constraints and multiple partitions
Datafiles: new-britain-swadesh.nex, new-britain-structural.nex
Exercise 3: Ancestral State Reconstruction and the DISCRETE test
Datafiles: Aslian28+structure1.nex, Aslian28.nex.trees, Aslian-complexity.txt, Aslian-semantics.txt
In this exercise we practise two things which are prerequisites for carrying out Ancestral State Reconstrution (see exercise 3). Constraining tree inferenc to include particular nodes is useful for incorporating prior phylogenetic knowledge into your analyeses, and is crucial if you want to reconstruct values at these specific nodes. Analysis with multiple partitions lets you use multiple kinds of data at once, and is also used in ancestral state reconstruction.
Exclude a language from the analysis:
delete Temiar_Perak
Note that MrBayes automatically redoes the ascertainment bias check when you change the included taxa.
Set the outgroup:
outgroup Kammu
Constrain the search so that it only visits regions of the tree space containing the South Aslian clade:
constraint southaslian = Semaq_Beri_Brw Semaq_Beri_Jaboy
Semelai Mah_Meri
prset topologypr = constraints(southaslian)
In the Ancestral States Reconstruction exercise (exercise 3), MrBayes will estimate the values for all clades defined by the constraints
command.
Output of the taxastat command:
MrBayes > taxastat
Showing taxon status:
Number of taxa = 28 (of which 27 are included)
Number of constraints = 1
1 -- Trees with 'hard' constraint "southaslian" are infinitely
more probable than those without
Constraints
Taxon Inclusion 1
------------------------------------------------
1 (Tenen_Palian) -- Included .
2 (Tenen_Paborn) -- Included .
3 (Kensiw_Perak) -- Included .
4 (Kensiw_Kedah) -- Included .
5 (Kintaq) -- Included .
6 (Jahai_Banun) -- Included .
7 (Jahai_Rual) -- Included .
8 (Menriq_Lah) -- Included .
9 (Menriq_Rual) -- Included .
10 (Batek_Teh_Taku) -- Included .
11 (Batek_Teh_Lebir) -- Included .
12 (Batek_T) -- Included .
13 (Batek_Deq_Terengganu) -- Included .
14 (Batek_Deq_Koh) -- Included .
15 (Ceq_Wong) -- Included .
16 (Semnam_Bal) -- Included .
17 (Semnam_Malau) -- Included .
18 (Lanoh_Kertei) -- Included .
19 (Temiar_Kelantan) -- Included .
20 (Temiar_Perak) -- Deleted .
21 (Semai_Ringlet) -- Included .
22 (Semai_Kampar) -- Included .
23 (Jah_Hut) -- Included .
24 (Mah_Meri) -- Included *
25 (Semaq_Beri_Brw) -- Included *
26 (Semaq_Beri_Jaboy) -- Included *
27 (Semelai) -- Included *
-> 28 (Kammu) -- Included .
'.' indicate that the taxon is not present in the constraint.
'*' indicate that the taxon is present in the 'hard' constraint.
'+' indicate that the taxon is present in the first group of
a 'partial' constraint.
'-' indicate that the taxon is present in the second group of
a 'partial' constraint.
'#' indicate that the taxon is present in the 'negative' constraint.
Arrow indicates current outgroup.
The Bayes Factor test measures the ratio L(H₁D) (the likelihood of Hypothesis 1 given the data compared to the likelihood of Hypothesis 2 given the data. This presumes that ‘the data’ D is the same for each likelihood calculation. You can’t compare one hypothesis with one set of data to another hypothesis with another set of data.
To do this manually you need to calculate the harmonic mean of the log likelihood scores of your parameter settings. This is surprisingly easy:
# Write a function to calculate the harmonic mean of a vector
R> harmonic.mean <- function(v) length(v) / sum(1/v)
# Read in the parameter values file from the MrBayes output
R> model1.p <- read.delim("model1.nex.run1.p", skip=1)
# The log-likelihood scores are in the column LnL:
R> harmonic.mean(model1.p$LnL)
[1] -4634
To compare how much we can prefer model 1 to model 2, take twice the difference between the harmonic means:
R> bayes.factor <- 2 * (harmonic.mean(model1.p$LnL) -
harmonic.mean(model2.p$LnL))
The interpretation of Bayes Factors are:
BF | interpretation |
---|---|
0-2 | Barely worth mentioning |
2-6 | Positive evidence in favour of model 1 |
6-10 | Strong |
10 | Very strong |
Source: Kass and Rafferty (1995)
Cognate presence/absence is coded using the binary restriction data Type. If you have e.g. structural data with more than two states you should use the standard data type. The exercise on Ancestral State Reconsruction contains an example.
You can include two different kinds of data evolving under two different models in a single analysis. It is defined as follows in the NEXUS file:
#NEXUS
begin data;
[ ... other stuff ... ]
format mixed(Restriction:1-100,Standard:101-200)
interleave=yes gap=- missing=?;
The numbers 1-100 and 101-201 specify the columns to be included in each partition. Then in the MrBayes block you name the character sets, define a partition (a combination of character sets), and set this partition as the active one:
begin mrbayes;
charset swadesh100 = 1-100;
charset structuralfeatures = 101-200;
partition complete = 2: swadesh100, structuralfeatures;
set partition = complete; [ use this partitioning ]
If you’re doing this by the console you can enter showmodel
to see what the current setup is for each partition. The parameters can be set for partitions individually, or in groups. The following command sets both partitions to use gamma variation for rates estimate:
lset applyto=(all) rates=gamma;
Other things to think about include the coding
setting (for ascertainment bias, which might be different in the different kinds of data you’re using). If you enter showmodel
again you can see that many parameters are shared across partitions. You probably want these to be estimated separately, which you do with unlink
:
unlink statefreq=(all) revmat=(all) shape=(all) pinvar=(all);
Check the difference with showmodel
.
Finally, this analysis assumes that all partitions evolve at the same rate. To set MrBayes to estimate rates independently for each partition, use:
prset applyto=(all) ratepr=variable
new-britain-swadesh.nex
and new-britain-structural.nex
.
restriction
type). Do you want the same parameters for each one? If you are scrupulous about this (i.e. have lots of time) you will test both gamma
and invgamma
, as well as various branch length priors, etc.interleave=yes
format option you can simply paste one matrix below the other (but both within a single matrix .... ;
command. Define a partition for each, and make sure the parameter estimates are independent for each partition (using unlink
)