The TreeSummary class

Create a new class named TreeSummary and store it in a header file named tree_summary.hpp. This class will read a tree file and enumerate all distinct tree topologies and count the number of trees having each unique topology. This ability is essential for summarizing a posterior sample of trees. It allows one to construct the 95% credible set of tree topologies, for example, and sets the stage for constructing a majority rule consensus tree.

Here is the TreeSummary class declaration. The class has 2 data members. The vector _newicks stores the tree descriptions of all trees read from a tree file. The vector _treeIDs stores a map that associates a key in the form of a tree ID (the set of all splits in a tree) with a value in the form of a vector of integers, where each integer in the vector is the index of a tree description stored in _newicks. Each separate tree ID corresponds to a unique topology, and the associated vector allows us to count and, if desired, print out all trees having that topology.

#pragma once

#include <set>
#include <map>
#include <vector>
#include <fstream>
#include <cassert>
#include <algorithm>
#include <boost/shared_ptr.hpp>
#include <boost/format.hpp>
#include <boost/regex.hpp>
#include "split.hpp"
#include "tree_manip.hpp"
#include "xstrom.hpp"

namespace strom
    {

    template <class T>
    class TreeSummary
        {
        public:
                                        TreeSummary();
                                        ~TreeSummary();

            void                        readTreefile(const std::string filename, unsigned skip);
            void                        showSummary() const;
            typename Tree<T>::SharedPtr getTree(unsigned index);
            std::string                 getNewick(unsigned index);
            void                        clear();

        private:

            Split::treemap_t            _treeIDs;
            std::vector<std::string>    _newicks;

        public:
            typedef boost::shared_ptr< TreeSummary<T> > SharedPtr;
        };

    // insert member function bodies here

    }

The constructor and destructor

The constructor and destructor function do nothing except report that a TreeSummary object was created or destroyed, respectively.

template <class T>
inline TreeSummary<T>::TreeSummary()
    {
    std::cout << "Constructing a TreeSummary" << std::endl;
    }

template <class T>
inline TreeSummary<T>::~TreeSummary()
    {
    std::cout << "Destroying a TreeSummary" << std::endl;
    }

The getTree member function

This function returns a shared pointer to a tree built from the tree description stored in _newicks[index]. If index is too large, an XStrom exception is thrown.

template <class T>
inline typename Tree<T>::SharedPtr TreeSummary<T>::getTree(unsigned index)
    {
    if (index >= _newicks.size())
        throw XStrom("getTree called with index >= number of stored trees");

    TreeManip<T> tm;

    // build the tree
    tm.buildFromNewick(_newicks[index]);

    return tm.getTree();
    }

The getNewick accessor

template <class T>
inline std::string TreeSummary<T>::getNewick(unsigned index)
    {
    if (index >= _newicks.size())
        throw XStrom("getNewick called with index >= number of stored trees");

        return _newicks[index];
    }

The clear member function

The clear function simply empties both the _treeIDs and _newicks vectors.

template <class T>
inline void TreeSummary<T>::clear()
    {
    _treeIDs.clear();
    _newicks.clear();
    }

The readTreefile member function

This function reads the tree file specified by filename, skipping the first skip trees. It uses a boost::sregex_iterator to find all tree descriptions. For each tree description found, it uses boost::regex_replace to strip out any nexus comments, and uses boost::regex_search to split out the actual tree description from the tree name and equals sign. It then builds the tree using the TreeManip::buildFromNewick function and uses TreeManip::storeSplits to create a tree ID (called splitset) for the tree. If the tree ID for this tree has never been seen, it creates a new entry in the map for this tree ID and associates it with a vector of length 1 containing the index of this tree. If the tree ID has already been seen, it simply adds the current tree index to the vector already associated with that tree ID.

template <class T>
inline void TreeSummary<T>::readTreefile(const std::string filename, unsigned skip)
    {
    TreeManip<T> tm;
    Split::treeid_t splitset;
    
    // Open the tree file
    std::ifstream inf(filename.c_str());
    if (!inf)
        throw XStrom(boost::str(boost::format("Could not open file named \"%s\"") % filename));
        
    // Read contents of file and store in string
    std::string file_contents((std::istreambuf_iterator<char>(inf)), std::istreambuf_iterator<char>());

    // Commit to storing new data
    clear();
    
    // extract tree descriptions and store in vector _newicks
    boost::regex re("^\\s*[Tt]ree(.+?);");
    boost::sregex_iterator m1(file_contents.begin(), file_contents.end(), re);
    boost::sregex_iterator m2;  // empty iterator used only to detect when we are done
    unsigned n = 0;
    for (; m1 != m2; ++m1)
        {
        const boost::match_results<std::string::const_iterator>& what = *m1;
        if (n >= skip)
            {
            // strip nexus comments from tree description
            boost::regex nexus_comment_re("\\[.+?\\]");
            std::string stripped = boost::regex_replace(what[1].str(), nexus_comment_re, std::string());

            // strip tree name and equals sign, leaving only the tree description
            std::string newick_only;
            boost::smatch what;
            boost::regex pattern("^.+?=\\s*(.+)");
            bool regex_ok = boost::regex_search(stripped, what, pattern);
            if (regex_ok)
                {
                // what[0] contains the whole string
                // what[1] contains only what comes after the first =
                // Construct a string using characters from what[1].first to what[1].second
                newick_only.insert(newick_only.begin(), what[1].first, what[1].second);
                }
            else
                {
                throw XStrom("regex failed to find = separating tree name from tree description in tree definition");
                }

            // store the newick tree description
            _newicks.push_back(newick_only);
            unsigned tree_index = (unsigned)_newicks.size() - 1;

            // build the tree
            tm.buildFromNewick(newick_only);

            // store set of splits
            splitset.clear();
            tm.storeSplits(splitset);

            // iterator iter will point to the value corresponding to key splitset
            // or to end (if splitset is not already a key in the map)
            Split::treemap_t::iterator iter = _treeIDs.lower_bound(splitset);

            if (iter == _treeIDs.end() || iter->first != splitset)
                {
                // splitset key not found in map, need to create an entry
                std::vector<unsigned> v(1, tree_index);  // vector of length 1 with only element set to tree_index
                _treeIDs.insert(iter, Split::treemap_t::value_type(splitset, v));
                }
            else
                {
                // splitset key was found in map, need to add this tree's index to vector
                iter->second.push_back(tree_index);
                }
            }
        n += 1;
        }
    }

The showSummary member function

Finally, add the body of the showSummary function, which reports the information stored in _treeIDs in a couple of different ways. It first outputs the number of trees it read from the file. Next, it uses BOOST_FOREACH to iterate through all distinct tree topologies stored in _treeIDs, reporting the index of each tree having that topology. It finished by producing a table showing the number of trees found for each distinct tree topology, sorted from most frequent topology to the least frequent topology.

template <class T>
inline void TreeSummary<T>::showSummary() const
    {
    // Produce some output to show that it works
    std::cout << boost::str(boost::format("Read %d trees from file") % _newicks.size()) << std::endl;

    // Show all unique topologies with a list of the trees that have that topology
    // Also create a map that can be used to sort topologies by their sample frequency
    typedef std::pair<unsigned, unsigned> sorted_pair_t;
    std::vector< sorted_pair_t > sorted;
    int t = 0;
    BOOST_FOREACH(const Split::treemap_t::value_type & key_value_pair, _treeIDs)
        {
        unsigned topology = ++t;
        unsigned ntrees = (unsigned)key_value_pair.second.size();
        sorted.push_back(std::pair<unsigned, unsigned>(ntrees,topology));
        std::cout << "Topology " << topology << " seen in these " << ntrees << " trees:" << std::endl << "  ";
        std::copy(key_value_pair.second.begin(), key_value_pair.second.end(), std::ostream_iterator<unsigned>(std::cout, " "));
        std::cout << std::endl;
        }

    // Show sorted histogram data
    std::sort(sorted.begin(), sorted.end());
    unsigned npairs = (unsigned)sorted.size();
    std::cout << "\nTopologies sorted by sample frequency:" << std::endl;
    std::cout << boost::str(boost::format("%20s %20s") % "topology" % "frequency") << std::endl;
    BOOST_REVERSE_FOREACH(const sorted_pair_t & ntrees_topol_pair, sorted)
        {
        unsigned n = ntrees_topol_pair.first;
        unsigned t = ntrees_topol_pair.second;
        std::cout << boost::str(boost::format("%20d %20d") % t % n) << std::endl;
        }
    }