Large tree likelihood fail

Before we add rescaling, let me prove to you that our code as it now stands will not work on a large tree. Download the file rbcl738.zip and move the rbcl738.nex and rbcl738nj.tre files within it to the same directory where you have rbcL.nex.

Modify your main.cpp file

To compute the log-likelihood of this 738 taxon tree, modify your main.cpp as follows:

#include <iostream>
#include "node.hpp"
#include "tree_summary.hpp"
#include "likelihood.hpp"

using namespace strom;

int main(int argc, const char * argv[])
    {
    std::cout << "Starting..." << std::endl;

    try
        {
        // Read and store data
        Data::SharedPtr d(new Data());
        d->getDataFromFile("rbcl738.nex");

        // Create a substitution model
        GTRModel::SharedPtr gtr = GTRModel::SharedPtr(new GTRModel());
        gtr->setExchangeabilitiesAndStateFreqs(
            {0.08394222, 0.34116704, 0.03603322, 0.15737940, 0.30297095, 0.07850717}, 
            {0.309769, 0.163380, 0.121023, 0.405828});
        gtr->setGammaShape(0.517281);
        gtr->setGammaNCateg(4);
        std::cout << gtr->describeModel() << std::endl;

        // Create a likelihood object that will compute log-likelihoods
        Likelihood<Node>::SharedPtr likelihood(new Likelihood<Node>());
        likelihood->setData(d);
        likelihood->setModel(gtr);

        // Read in a tree
        TreeSummary<Node>::SharedPtr tree_summary(new TreeSummary<Node>());
        tree_summary->readTreefile("rbcl738nj.tre", 0);
        Tree<Node>::SharedPtr tree = tree_summary->getTree(0);

        // Calculate the log-likelihood for the tree
        double lnL = likelihood->calcLogLikelihood(tree);
        std::cout << boost::str(boost::format("log likelihood = %.5f") % lnL) << std::endl;
        std::cout << "      (expecting -144730.8)" << std::endl;
        }
    catch (XStrom x)
        {
        std::cerr << "Strom encountered a problem:\n  " << x.what() << std::endl;
        }

    std::cout << "\nFinished!" << std::endl;

    return 0;
    }

Run your software

Execute your software within the debugger. You should find that the log likelihood computed is shown as -inf (negative infinity), which means the likelihood is zero, which in turn means that underflow has occurred in at least some of the site likelihood calculations. The next page will show you how to set up BeagleLib to rescale the likelihood so that this underflow does not happen.

Read this if you are interested in checking the likelihood in PAUP*…

There is a file named paup.nex bundled with rbcl738.nex that can be used to verify the log-likelihood in the program PAUP*. If you look at the format command in the rbcl738.nex file, you will notice this equate statement:

equate="S=? Y=? R=? M=? W=? K=? V=? D=? B=?"

The sequences in this file contain several ambiguity codes. The full set of ambiguity codes for DNA sequence data is:

R,r ==> {AG}
Y,y ==> {CT}
M,m ==> {AC}
K,k ==> {GT}
S,s ==> {CG}
W,w ==> {AT}
H,h ==> {ACT}
B,b ==> {CGT}
V,v ==> {ACG}
D,d ==> {AGT}
N,n ==> {ACGT}

PAUP* recognizes all of these ambiguity codes and handles them accordingly, but BeagleLib treats any ambiguity as complete ambiguity, so the equate statement has been added to the file to effectively turn all of the ambiguity codes present in the file into ?, which represents complete ambiguity and is equivalent to N. That will allow our program to obtain the same log-likelihood as PAUP*.