Build a tree from a Newick description

We not only need the ability to create descriptions of existing trees, we need to build trees from descriptions. In this section of the tutorial, we will ultimately create the function buildFromNewick, but first we need to create several helper functions that will be called from within buildFromNewick.

Add declarations to TreeManip class

Start by adding declarations for all the methods we will need to the TreeManip class declaration in the file tree_manip.hpp. New additions are in bold, blue text:

template <class T>
class TreeManip
{
public:
                                TreeManip();
                                ~TreeManip();

    typename Tree<T>::TreeShPtr getTree();
    void                        createTestTree();
    std::string                 makeNewick(unsigned precision) const;
    void                        buildFromNewick(const std::string newick, bool rooted = false);
    void                        rerootAt(int node_index);
    void                        clear();

private:

    void                        refreshPreorder(T * root_node);
    void                        rerootHelper(T * m, T * t);
    void                        extractNodeNumberFromName(T * nd, std::set<unsigned> & used);
    unsigned                    countNewickLeaves(const std::string newick);
    void                        stripOutNexusComments(std::string & newick);
    bool                        canHaveSibling(T * nd, bool rooted);

    typename Tree<T>::TreeShPtr _tree;


public:
    typedef boost::shared_ptr< TreeManip<T> > TreeManipShPtr;
};

The extractNodeNumberFromName member function

The program we are creating will only ever read newick tree descriptions in which taxa are represented by positive integers (normally representing their position in the data matrix, with the first taxon being 1). This private function converts the node name (a string representation of an integer greater than zero) into an unsigned integer. If the node name cannot be converted to a positive integer, an XStrom exception is thrown.

template <class T>
inline void TreeManip<T>::extractNodeNumberFromName(T * nd, std::set<unsigned> & used)
    {
    assert(nd);
    bool success = true;
    unsigned x = 0;
    try
        {
        x = boost::lexical_cast<unsigned>(nd->_name);
        }
    catch(boost::bad_lexical_cast &)
        {
        // node name could not be converted to an integer value
        success = false;
        }

    if (success)
        {
        // conversion succeeded
        // attempt to insert x into the set of node numbers altready used
        std::pair<std::set<unsigned>::iterator, bool> insert_result = used.insert(x);
        if (insert_result.second)
            {
            // insertion was made, so x has NOT already been used
            nd->_number = x - 1;
            }
        else
            {
            // insertion was not made, so set already contained x
            throw XStrom(boost::str(boost::format("leaf number %d used more than once") % x));
            }
        }
    else
        throw XStrom(boost::str(boost::format("node name (%s) not interpretable as a positive integer") % nd->_name));
    }

The countNewickLeaves member function

This private function counts the number of leaves in the tree defined by a newick string. This makes use of the Boost regular expression (regex) library. The regular expression pattern (taxonexpr) identifies leaf nodes as integer values preceded by either a comma or a left parenthesis and followed by a comma, right parenthesis, or colon. It also allows non-integer taxon names, and even taxon names containing embedded spaces if they are surrounded by single quotes, even though such newick tree descriptions will ultimately be rejected (see section above on the extractNodeNumberFromName function).

The way this function works is concise but less than transparent. The variable m1 iterates through all taxon names in the supplied newick string, while m2 is never initialized and thus represents the value that m1 would take if it was iterated past the end of the string. The function std::distance(m1, m2) counts the number of times m1 needs to be incremented until it equals m2, which is the number of taxa in the newick string.

template <class T>
inline unsigned TreeManip<T>::countNewickLeaves(const std::string newick)
    {
    boost::regex taxonexpr("[(,]\\s*(\\d+|\\S+?|['].+?['])\\s*(?=[,):])");
    boost::sregex_iterator m1(newick.begin(), newick.end(), taxonexpr);
    boost::sregex_iterator m2;
    return (unsigned)std::distance(m1, m2);
    }

The stripOutNexusComments member function

This private function strips out comments from the newick string. Many programs place special comments inside newick strings that only they understand (e.g. FigTree stores color information about nodes in such comments). Comments in newick strings are delimited by square brackets, so this function uses the Boost regex_replace function to replace all such comments with the empty string. Note that the ampersand (&) in the parameter list means that the newick string is supplied as a reference. As a result, it is modified in place, which explains why this function does not return the modified string.

template <class T>
inline void TreeManip<T>::stripOutNexusComments(std::string & newick)
    {
    boost::regex commentexpr("\\[.*?\\]");
    newick = boost::regex_replace(newick, commentexpr, std::string(""));
    }

The refreshPreorder member function

This private function must be called whenever the structure of the tree changes. It rebuilds the _preorder vector by visiting nodes in preorder sequence and storing pointers to these nodes as it goes.

template <class T>
inline void TreeManip<T>::refreshPreorder(T * root_node)
    {
    // Create vector of node pointers in preorder sequence
    _tree->_preorder.clear();
    _tree->_preorder.reserve(_tree->_nodes.size() - 1); // _preorder does not include root node

    if (!root_node)
        return;
        
    T * first_preorder = root_node->_left_child;
        
    // sanity check: first preorder node should be the only child of the root node
    assert(first_preorder->_right_sib == 0);
        
    T * nd = first_preorder;
    _tree->_preorder.push_back(nd);
    
    while (true)
        {
        if (!nd->_left_child && !nd->_right_sib)
            {
            // nd has no children and no siblings, so next preorder is the right sibling of 
            // the first ancestral node that has a right sibling.
            T * anc = nd->_parent;
            while (anc && !anc->_right_sib)
                anc = anc->_parent;
            if (anc)
                {
                // We found an ancestor with a right sibling
                _tree->_preorder.push_back(anc->_right_sib);
                nd = anc->_right_sib;
                }
            else
                {
                // nd is last preorder node in the tree
                break;
                }
            }
        else if (nd->_right_sib && !nd->_left_child)
            {
            // nd has no children (it is a tip), but does have a sibling on its right
            _tree->_preorder.push_back(nd->_right_sib);
            nd = nd->_right_sib;
            }
        else if (nd->_left_child && !nd->_right_sib)
            {
            // nd has children (it is an internal node) but no siblings on its right
            _tree->_preorder.push_back(nd->_left_child);
            nd = nd->_left_child;
            }
        else
            {
            // nd has both children and siblings on its right
            _tree->_preorder.push_back(nd->_left_child);
            nd = nd->_left_child;
            }
            
        }   // end while loop
        
    // renumber internal nodes in postorder sequence
    int curr_internal = _tree->_nleaves;
    for (typename std::vector<T *>::reverse_iterator rit = _tree->_preorder.rbegin(); rit != _tree->_preorder.rend(); ++rit)
        {
        T * nd = *rit;
        
        if (nd->_left_child)
            {
            // nd is an internal node
            
            // node numbers for internal nodes are negative integers
            nd->_number = curr_internal++;
            }
        }
            
    if (_tree->_is_rooted)
        root_node->_number = curr_internal;
    }

The canHaveSibling member function

This function is used to check for polytomies (internal nodes with degree > 3, or with degree > 2 for the root node in a rooted tree). It returns true if and only if adding a right sibling to nd is allowed (i.e. would not generate a polytomy), otherwise returning false.

template <class T>
inline bool TreeManip<T>::canHaveSibling(T * nd, bool rooted)
    {
    assert(nd);
    bool nd_can_have_sibling = true;
    if (!nd->_parent)
        {
        // trying to give root node a sibling
        nd_can_have_sibling = false;
        }
    else if (nd != nd->_parent->_left_child)
        {
        if (nd->_parent->_parent)
            {
            // trying to give a sibling to a sibling of nd, and nd's parent is not the root
            nd_can_have_sibling = false;
            }
        else
            {
            if (rooted)
                {
                // root node has exactly 2 children in rooted trees
                nd_can_have_sibling = false;
                }
            else if (nd != nd->_parent->_left_child->_right_sib)
                {
                // trying to give root node more than 3 children
                nd_can_have_sibling = false;
                }
            }
        }

    return nd_can_have_sibling;
    }

The rerootAt member function

This private member function reroots the tree at the node indicated, which is assumed to exist in the tree and is a leaf node (it throws an XStrom exception if either of these assumptions turn out to be false). The way this function works is to move the target node (the one having _number equal to node_index) down the tree one step at a time until it is at the root position. At each step, the nodes below it are moved up to maintain the branching structure of the tree. This is best explained using a figure. In the figure below, node 5 (labeled nd) is to be the new root node, and the nodes t and m are indicated at each step as they are defined at the moment the rerootHelper function is called. Read the figure starting at the upper left panel and reading left to right across rows. Each row does exactly the same thing, but with a different assignment of t and m. There are three steps (three rows) because the new root node starts out three segments away from its destination at the root of the tree. You can download a PDF version of this figure if the one below is too hard to read.
Figure illustrating algorithm for rerooting a tree at a tip

template <class T>
inline void TreeManip<T>::rerootAt(int node_index)
    {
    typename std::vector<T>::iterator it = _tree->_nodes.begin();
    for (; it != _tree->_nodes.end(); ++it)
        {
        if (it->_number == node_index)
            break;
        }
    if (it == _tree->_nodes.end())
        throw XStrom(boost::str(boost::format("no node found with index equal to %d") % node_index));
        
    T & nd = *it;
    if (nd._left_child)
        throw XStrom(boost::str(boost::format("cannot currently root trees at internal nodes (e.g. node %d)") % nd._number));
    
    T * t = &nd;
    T * m = nd._parent;
    while (nd._parent)
        {
        // Begin by swapping the mover's edge length with nd's edge length
        double tmp_edgelen = m->_edge_length;
        m->_edge_length = nd._edge_length;
        nd._edge_length = tmp_edgelen;

        // Make the "mover" node m (along with all of its children except nd) the rightmost child of the "target" node t
        rerootHelper(m, t);

        // The next target node is always the previous mover, and the next mover node is always nd's parent
        t = m;
        m = nd._parent;
        }
    refreshPreorder(&nd);
    }

The rerootHelper member function

This private member function simplifies the rerootAt member function described above by handling the node rearrangements needed to move the new root node down one step. See the figure in the description of rerootAt for an illustration of what this function does. At each step, there is a target node (t) and a mover node (m). Node m is moved (along with all of its children except for the child that is along the path to t) up the tree so that it becomes the rightmost child of t. Afterwards, t is assigned to m, and m is assigned to be the parent of the new root node. This function ends up being rather tedious because it has to handle all possible relative positions of m and t.

template <class T>
inline void TreeManip<T>::rerootHelper(T * m, T * t)
    {
    assert(m);
    assert(t);
    
    // Save nodes to which m attaches
    T * m_left_child    = m->_left_child;
    T * m_right_sib     = m->_right_sib;
    T * m_parent        = m->_parent;

    // Starting with t, walk down tree to identify x, the child of m that is on the path from m to t
    T * x = t;
    while (x->_parent != m)
        {
        x = x->_parent;
        assert(x);
        }
    T * x_right_sib = x->_right_sib;

    // Identify x_left_sib, the immediate left sibling of x (will be NULL if x is _left_child of m)
    T * x_left_sib = 0;
    if (x != m_left_child)
        {
        x_left_sib = m_left_child;
        while (x_left_sib->_right_sib != x)
            {
            x_left_sib = x_left_sib->_right_sib;
            assert(x_left_sib);
            }
        }

    // identify m_left_sib, the immediate left sibling of m (will be NULL if m is root node or is _left_child of its parent)
    T * m_left_sib = 0;
    if (m_parent && m != m_parent->_left_child)
        {
        m_left_sib = m_parent->_left_child;
        while (m_left_sib->_right_sib != m)
            {
            m_left_sib = m_left_sib->_right_sib;
            assert(m_left_sib);
            }
        }

    // Put x where m is now
    if (!m_parent)
        {
        // m is the root node
        assert(!m_right_sib);
        assert(!m_left_sib);
        x->_right_sib = 0;
        x->_parent = 0;
        if (x == m_left_child)
            m->_left_child = x_right_sib;
        else
            x_left_sib->_right_sib = x_right_sib;
        }
    else if (m == m_parent->_left_child)
        {
        // m is leftmost child of its parent
        x->_right_sib = m_right_sib;
        x->_parent = m_parent;
        m->_right_sib = 0;
        m->_parent = 0;
        m_parent->_left_child = x;
        if (x == m_left_child)
            m->_left_child = x_right_sib;
        else
            x_left_sib->_right_sib = x_right_sib;
        }
    else
        {
        // m is not leftmost child of its parent
        m_left_sib->_right_sib = x;
        x->_right_sib = m_right_sib;
        x->_parent = m_parent;
        m->_right_sib = 0;
        m->_parent = 0;
        if (x == m_left_child)
            m->_left_child = x_right_sib;
        else
            x_left_sib->_right_sib = x_right_sib;
        }
    
    // Make m the new rightmost child of t
    m->_parent = t;
    if (!t->_left_child)
        t->_left_child = m;
    else
        {
        // Find rightmost child of t
        m_left_sib = t->_left_child;
        while (m_left_sib->_right_sib)
            m_left_sib = m_left_sib->_right_sib;

        // Make rightmost child of t the left sib of m
        m_left_sib->_right_sib = m;
        }
    }

The buildFromNewick member function

Now that all the helper functions are in place, it is time to define the body of the buildFromNewick function.

template <class T>
inline void TreeManip<T>::buildFromNewick(const std::string newick, bool rooted)
{
    clear();
    _tree.reset(new Tree<T>());
    _tree->_is_rooted = rooted;
    
    std::set<unsigned> used; // used to ensure that two tips do not have the same number
    unsigned curr_leaf = 0;

    std::string commentless_newick = newick;
    stripOutNexusComments(commentless_newick);

    _tree->_nleaves = countNewickLeaves(commentless_newick);
    if (_tree->_nleaves == 0)
        throw XStrom("Expecting newick tree description to have at least 4 leaves");
    unsigned max_nodes = 2*_tree->_nleaves - (_tree->_is_rooted ? 0 : 2);
    _tree->_nodes.resize(max_nodes);

    // Assign all nodes a default node number (leaves will replace this number with the number equivalent of their name)
    for (unsigned i = 0; i < max_nodes; ++i)
        _tree->_nodes[i]._number = _tree->_nleaves;
    
    // This will point to the first tip node encountered so that we can reroot at this node before returning
    T * first_tip = 0;

    unsigned num_edge_lengths = 0;
    unsigned curr_node_index = 0;

    try
        {
        // Root node
        T * nd = &_tree->_nodes[curr_node_index];
    
        if (_tree->_is_rooted)
            {
            nd = &_tree->_nodes[++curr_node_index];
            nd->_parent = &_tree->_nodes[curr_node_index - 1];
            nd->_parent->_left_child = nd;
            }

        // Some flags to keep track of what we did last
        enum {
            Prev_Tok_LParen		= 0x01,	// previous token was a left parenthesis ('(') 
            Prev_Tok_RParen		= 0x02,	// previous token was a right parenthesis (')') 
            Prev_Tok_Colon		= 0x04,	// previous token was a colon (':') 
            Prev_Tok_Comma		= 0x08,	// previous token was a comma (',') 
            Prev_Tok_Name		= 0x10,	// previous token was a node name (e.g. '2', 'P._articulata') 
            Prev_Tok_EdgeLen	= 0x20	// previous token was an edge length (e.g. '0.1', '1.7e-3') 
            };
        unsigned previous = Prev_Tok_LParen;

        // Some useful flag combinations 
        unsigned LParen_Valid = (Prev_Tok_LParen | Prev_Tok_Comma);
        unsigned RParen_Valid = (Prev_Tok_RParen | Prev_Tok_Name | Prev_Tok_EdgeLen);
        unsigned Comma_Valid  = (Prev_Tok_RParen | Prev_Tok_Name | Prev_Tok_EdgeLen);
        unsigned Colon_Valid  = (Prev_Tok_RParen | Prev_Tok_Name);
        unsigned Name_Valid   = (Prev_Tok_RParen | Prev_Tok_LParen | Prev_Tok_Comma);

        // loop through the characters in newick, building up tree as we go
        std::string::const_iterator newick_start = commentless_newick.begin();
        std::string::const_iterator i = commentless_newick.begin();
        for (; i != commentless_newick.end(); ++i)
            {
            char ch = *i;
            if (iswspace(ch))
                continue;
            switch(ch)
                {
                case ';':
                    //throw XStrom("premature semicolon");
                    break;  

                case ')':
                    // If nd is bottommost node, expecting left paren or semicolon, but not right paren
                    if (!nd->_parent)
                        throw XStrom(boost::str(boost::format("Too many right parentheses at position %d in tree description") % (unsigned)std::distance(newick_start, i)));
                    
                    // Expect right paren only after an edge length, a node name, or another right paren
                    if (!(previous & RParen_Valid))
                        throw XStrom(boost::str(boost::format("Unexpected right parenthesisat position %d in tree description") % (unsigned)std::distance(newick_start, i)));

                    // Go down a level
                    nd = nd->_parent;
                    if (!nd->_left_child->_right_sib)
                        throw XStrom(boost::str(boost::format("Internal node has only one child at position %d in tree description") % (unsigned)std::distance(newick_start, i)));
                    previous = Prev_Tok_RParen;
                    break;

                case ':':
                    // Expect colon only after a node name or another right paren
                    if (!(previous & Colon_Valid))
                        throw XStrom(boost::str(boost::format("Unexpected colon at position %d in tree description") % (unsigned)std::distance(newick_start, i)));
                    previous = Prev_Tok_Colon;
                    break;

                case ',':
                    // Expect comma only after an edge length, a node name, or a right paren
                    if (!nd->_parent || !(previous & Comma_Valid))
                        throw XStrom(boost::str(boost::format("Unexpected comma at position %d in tree description") % (unsigned)std::distance(newick_start, i)));
                    
                    // Check for polytomies
                    if (!canHaveSibling(nd, rooted))
                        throw XStrom(boost::str(boost::format("Polytomy found in the following tree description:\n%s") % newick));

                    // Create the sibling
                    curr_node_index++;
                    if (curr_node_index == _tree->_nodes.size())
                        throw XStrom(boost::str(boost::format("Too many nodes specified by tree description (%d nodes allocated for %d leaves)") % _tree->_nodes.size() % _tree->_nleaves));
                    nd->_right_sib = &_tree->_nodes[curr_node_index];
                    nd->_right_sib->_parent = nd->_parent;
                    nd = nd->_right_sib;
                    previous = Prev_Tok_Comma;
                    break;

                case '(':
                    // Expect left paren only after a comma or another left paren
                    if (!(previous & LParen_Valid))
                        throw XStrom(boost::str(boost::format("Not expecting left parenthesis at position %d in tree description") % (unsigned)std::distance(newick_start, i)));
                    
                    // Create new node above and to the left of the current node
                    assert(!nd->_left_child);
                    curr_node_index++;
                    if (curr_node_index == _tree->_nodes.size())
                        throw XStrom(boost::str(boost::format("malformed tree description (more than %d nodes specified)") % _tree->_nodes.size()));
                    nd->_left_child = &_tree->_nodes[curr_node_index];
                    nd->_left_child->_parent = nd;
                    nd = nd->_left_child;
                    previous = Prev_Tok_LParen;
                    break;

                case '\'':
                    // Encountered an apostrophe, which always indicates the start of a 
                    // node name (but note that node names do not have to be quoted)

                    // Expect node name only after a left paren (child's name), a comma (sib's name)
                    // or a right paren (parent's name)
                    if (!(previous & Name_Valid))
                        throw XStrom(boost::str(boost::format("Not expecting node name at position %d in tree description") % (unsigned)std::distance(newick_start, i)));

                    // Get the rest of the name
                    nd->_name.clear();
                    for (++i; i != commentless_newick.end(); ++i)
                        {
                        ch = *i;
                        if (ch == '\'')
                            break;
                        else if (iswspace(ch))
                            nd->_name += ' ';
                        else
                            nd->_name += ch;
                        }
                    if (ch != '\'')
                        throw XStrom(boost::str(boost::format("Expecting single quote to mark the end of node name at position %d in tree description") % (unsigned)std::distance(newick_start, i)));
                
                    if (!nd->_left_child)
                        {
                        extractNodeNumberFromName(nd, used);
                        curr_leaf++;
                        if (!first_tip)
                            first_tip = nd;
                        }

                    previous = Prev_Tok_Name;
                    break;

                default:
                    // Expecting either an edge length or an unquoted node name
                    if (previous == Prev_Tok_Colon)
                        {
                        // Edge length expected (e.g. "235", "0.12345", "1.7e-3")
                        std::string::const_iterator j = i;
                        for (; i != commentless_newick.end(); ++i)
                            {
                            ch = *i;
                            if (ch == ',' || ch == ')' || iswspace(ch))
                                {
                                --i;
                                break;
                                }
                            bool valid = (ch =='e' || ch == 'E' || ch =='.' || ch == '-' || ch == '+' || isdigit(ch));
                            if (!valid)
                                throw XStrom(boost::str(boost::format("Invalid branch length character (%c) at position %d in tree description") % ch % (unsigned)std::distance(newick_start, i)));
                            }
                        std::string edge_length_str = std::string(j,i+1);
                        nd->_edge_length = atof(edge_length_str.c_str());
                        if (nd->_edge_length < 1.e-10)
                            nd->_edge_length = 1.e-10;

                        ++num_edge_lengths;
                        previous = Prev_Tok_EdgeLen;
                        }
                    else
                        {
                        // Get the node name
                        nd->_name.clear();
                        for (; i != commentless_newick.end(); ++i)
                            {
                            ch = *i;
                            if (ch == '(')
                                throw XStrom(boost::str(boost::format("Unexpected left parenthesis inside node name at position %d in tree description") % (unsigned)std::distance(newick_start, i)));
                            if (iswspace(ch) || ch == ':' || ch == ',' || ch == ')')
                                {
                                --i;
                                break;
                                }
                            nd->_name += ch;
                            }

                        // Expect node name only after a left paren (child's name), a comma (sib's name) or a right paren (parent's name)
                        if (!(previous & Name_Valid))
                            throw XStrom(boost::str(boost::format("Unexpected node name (%s) at position %d in tree description") % nd->_name % (unsigned)std::distance(newick_start, i)));
                        
                        if (!nd->_left_child)
                            {
                            extractNodeNumberFromName(nd, used);
                            curr_leaf++;
                            if (!first_tip)
                                first_tip = nd;
                            }

                        previous = Prev_Tok_Name;
                        }
                }
                if (i == commentless_newick.end())
                    {
                    break;
                    }
            }   // loop over characters in newick string

        if (!_tree->_is_rooted)
            {
            // Root at leaf whose _number = 0
            rerootAt(0);
            }
        
        }
    catch(XStrom x)
        {
        clear();
        throw x;
        }

}