Create a Newick tree description

Newick's Lobster House, Dover, New Hampshire
Newick’s Restaurant

Newick’s Lobster House was the site of an historic 1986 meeting (see Joe Felsenstein’s account) at which a standard was devised for storing descriptions of phylogenetic trees as strings. These newick tree descriptions (also called the "New Hampshire" format) became part of the NEXUS file format (described in Maddison, Swofford, and Maddison, 1997), which is a standard file format for storing systematic data that is still widely used despite the advent of NexML.

Our next goal is to create a newick string for the tree currently stored in a TreeManip object. Add the function declaration in the public area of the TreeManip class declaration:

std::string  makeNewick(unsigned precision) const;

Now add the function body just above the terminating curly bracket of the namespace (i.e. just before the end of the file):

inline std::string TreeManip::makeNewick(unsigned precision) const
	{
    std::string newick;
    const boost::format tip_node_format( boost::str(boost::format("%%d:%%.%df") % precision) );
    const boost::format internal_node_format( boost::str(boost::format("):%%.%df") % precision) );
    std::stack<Node *> node_stack;
    
    Node * root_tip = (_tree->_is_rooted ? 0 : _tree->_root);
    for (auto nd : _tree->_preorder)
        {
        if (nd->_left_child)
            {
            newick += "(";
            node_stack.push(nd);
            if (root_tip)
                {
                newick += boost::str(boost::format(tip_node_format) % (root_tip->_number + 1) % nd->_edge_length);
                newick += ",";
                root_tip = 0;
                }
            }
        else
            {
            newick += boost::str(boost::format(tip_node_format) % (nd->_number + 1) % nd->_edge_length);
            if (nd->_right_sib)
                newick += ",";
            else
                {
                Node * popped = (node_stack.empty() ? 0 : node_stack.top());
                while (popped && !popped->_right_sib)
                    {
                    node_stack.pop();
                    if (node_stack.empty())
                        {
                        newick += ")";
                        popped = 0;
                        }
                    else
                        {
                        newick += boost::str(boost::format(internal_node_format) % popped->_edge_length);
                        popped = node_stack.top();
                        }
                    }
                if (popped && popped->_right_sib)
                    {
                    node_stack.pop();
                    newick += boost::str(boost::format(internal_node_format) % popped->_edge_length);
                    newick += ",";
                    }
                }
            }
        }
    
    return newick;
    }

Add necessary header file includes

Before the new function will compile, we need to add a couple of lines to the beginning of the file to include headers containing the code allowing us to use the standard stack and the Boost format function. Add these lines after the #pragma once and before the other include lines.

#include <stack>
#include <boost/format.hpp>
4-taxon rooted tree showing newick string
4-taxon rooted tree showing newick string. Node numbers represent index into the tree’s _preorder array, not the node’s _number.

Explanation of the makeNewick function

.
The goal of this function is to turn a tree in memory into a "newick" string representation using nested parentheses to indicate clades and subclades. Each left parenthesis encountered in the newick description signifies the start of a new clade (we are moving to the left child of the current node), each comma means we’re moving laterally to a sibling node, and a right parenthesis means we have reached the upper right node in a clade and are dropping back down to the parent of that node. The figure at right shows a newick description for 4-taxon rooted tree. The numbers at the nodes are the index of the node in the _preorder vector. Note that the root node is not included in _preorder, so there is no number shown in the figure for the root node.

The main loop in this function visits all nodes in the tree in preorder sequence:

for (auto nd : _tree->_preorder)
    {
    ...
    }

The auto nd : _tree->_preorder code sets nd to each element of _tree->_preorder, in turn. The C++11 keyword auto makes our life easy by determining the type of nd automatically.

Internal nodes, when visited, result in the addition of a left parenthesis to the growing newick string. We will not add information about the internal node until we are finished with the clade defined by the node and have added the matching right parenthesis. Internal nodes are stored in the node_stack. A stack is a container in which items can only be removed in the reverse order in which they were added to the stack, like a stack of documents (the last one added to the top is also the first one removed). It is necessary to store each internal node in the stack so that we can determine how many right parentheses to add to the newick string.

newick += "(";
node_stack.push(nd);

When a tip node is visited, the tip_node_format string is used to add a number (one greater than the node’s _number) followed by a colon and then the edge length for that tip node.

const boost::format tip_node_format( boost::str(boost::format("%%d:%%.%df") % precision) );

The Boost library’s format class is used to format the string. Here we are creating an object of type boost::format (the boost part is the Boost namespace, while format is the class). The variable’s name is tip_node_format. This works much like Python or sprintf in C: placeholders (single percent symbols followed by a letter such as d, f, or s) are substituted for the supplied integer, floating point value, or string, respectively. Each doubled percent (%%) ends up being a single literal % in the string after replacements, so the above statement will be equal to the following after replacement of the %d by the supplied precision (assume that precision = 5 for this example):

const boost::format tip_node_format("%d:%.5f");

You might wonder why the boost::str(...) is necessary. The boost::format constructor takes a string as its sole argument, not a boost::format object, and the Boost format library does not provide an implicit conversion of format objects to string objects, so the boost::str function provides an explicit way to do this conversion. This is done intentionally in order to make it easier to report errors accurately when compiling formats. You could also call the str function this way:

const boost::format tip_node_format( (boost::format("%%d:%%.%df") % precision).str() );

The Boost format documentation provides some examples of using boost::format.

If the tip node has a sibling to its right, a comma is output to the growing newick string. If the tip node does not have a right sibling, then things get interesting! Take a look at the figure above of a rooted tree and it associated newick description. Note that after the tip node C is output, two right parentheses (each followed by edge lengths) are output before the next node in the preorder sequence (tip node D) is handled. Why is this? It is because we must close up the definitions of two clades before moving on to tip node D. The first clade is the "cherry" comprising tip nodes B and C along with the internal node labeled 3. The second clade that must be finished comprises nodes labeled 1, 2, 3, 4 and 5. How do we know which edge lengths to spit out? The answer lies in our node_stack. The last two nodes added to node_stack are the internal nodes labeled 1 and 3 in the figure. When we reach the far corner of a clade (we know we are at the far corner when a node has no children and no siblings), we must pop nodes off the node_stack (outputting a right parenthesis and edge length for each) until we reach an internal node with a right sibling: this sibling is necessarily the next node in the preorder sequence (in this case, tip node D, labeled 6 in the figure). That is what is happening in the large else clause paired with if (nd->_right_sib) in the code for the makeNewick function. This else clause is further complicated by the need to check whether we are done, which happens if we happen to pop the last node off the node_stack.

The stack function top() returns the next object on the stack without actually removing it. The function pop() deletes the object that is at the top of the stack. Note that we always have called top() to save the node at the top of the stack before we call pop() to delete it: it would not do to lose track of a node by pop()ing it with out first saving a pointer to the node pop()ed!

Finally, note that we add a comma to the newick string if the last internal node popped off the stack has a right sibling. The comma always indicates a sibling relationship for a node whether the node is a tip node or an internal node.

The special case of the root node in unrooted trees

5-taxon unrooted tree showing newick string
5-taxon unrooted tree showing newick string. Node numbers represent index into the tree’s _preorder array, not the node’s _number.

The figure of a 5-taxon unrooted tree on the right is subtly different from the figure above of a 4-taxon rooted tree. The root node in this case represents a tip (A), and this particular tip along with its edge length (which is really the _edge_length belonging to its only child (the node labeled 0 in the figure) is saved to the newick string as if it were the leftmost child of the first node in the _preorder vector. The part of the makeNewick function that accomplished this is contained in the if (root_tip) {...} block of code. The variable root_tip is 0 if the tree is rooted, but points to the root node if the tree is unrooted, so the code inside the if (root_node) {...} conditional is only executed if root_tip actually points to the root node. This code adds the number of the root node and a comma immediately after the opening left parenthesis. It also sets root_tip to 0 so that this block of code is not executed again.

While it is not essential to store unrooted trees by "rooting" them at a tip node, that is the convention that is followed throughout this tutorial. This points out a basic ambiguity inherent in newick tree descriptions. If you saw the newick string

(A:0.13,(B:0.2,(C:0.1,D:0.1):0.11):0.15,E:0.3)

how would you know this represents an unrooted binary tree and not a rooted tree with a polytomy (multifurcation) at the base? The answer is you could not know this unless the user who supplied you with the newick string told you.