The Split class

The Split class provides an object that can store the set of taxa that represent descendants of a given Node object. We will provide a Split object for every Node.

Create a new class named Split in a header file named split.hpp.

#pragma once

#include <vector>
#include <set>
#include <map>
#include <limits>
#include <boost/shared_ptr.hpp>
#include <boost/foreach.hpp>

namespace strom 
    {

    class Split
        {
        public:
                                    Split();
                                    ~Split();

            bool                    operator==(const Split & other) const;
            bool                    operator<(const Split & other) const;

            void                    clear();
            void                    resize(unsigned nleaves);

            typedef unsigned long                               split_unit_t;
            typedef std::vector<split_unit_t>                   split_t;
            typedef std::set<Split>                             treeid_t;
            typedef std::map< treeid_t, std::vector<unsigned> > treemap_t;

            void           setBitAt(unsigned leaf_index);
            void           addSplit(const Split & other);

            std::string    createPatternRepresentation() const;

        private:

            split_t        _bits;
            unsigned       _bits_per_unit;
            unsigned       _nleaves;

        public:
            typedef boost::shared_ptr< Split > SharedPtr;
    };

    // member function bodies go here

    }

Constructor and destructor

The constructor and destructor simply report that a Split object is being created or destroyed, respectively. In addition, the constructor sets the value of the _bits_per_unit data member to the number of bits in the data type specified by split_unit_t. The calculation involves multiplying the constant CHAR_BIT (number of bits in a byte) by the number of bytes in the data type specified by split_unit_t. This value will determine how many units of that data type will be needed to store a 0 or 1 for each taxon. These units are stored in the vector data member _bits, which has type split_t. (Besides split_unit_t and split_t, two other types are defined here, namely treeid_t and treemap_t, but I will postpone explaining these until later when they are used in the TreeSummary class)

For example, suppose split_unit_t is defined to be an integer of type char, which is 8 bits in length. If there are 10 taxa, then 2 char units will be required to store each split. A split that specifies that taxa 0, 2, 6, and 9 are descendants of a node would be encoded as follows (note that bits are counted starting from the right, and because there are 10 taxa and 16 bits total, the leftmost 6 bits of unit 1 will always be set to 0):

|01000101|00000010| <-- _bits vector
| unit 0 | unit 1 |

In our case, split_unit_t is defined to be an unsigned long integer rather than a char. On my computer, a char comprises only 8 bits and is the smallest integer type, whereas an unsigned long has 64 bits. Thus, we can get by with fewer units if we use unsigned long rather than char.

Here are the bodies of the constructor and destructor. Copy these into the split.hpp header file just above the closing curly bracket defining the end of the strom namespace.

inline Split::Split() 
    {
    _nleaves = 0;
    _bits_per_unit = (CHAR_BIT)*sizeof(Split::split_unit_t);
    clear();
    std::cout << "Constructing a Split" << std::endl;
    }

inline Split::~Split()
    {
    std::cout << "Destroying a Split" << std::endl;
    }

The clear member function

The clear function simply sets all bits in all units of the _bits vector to 0. We make use of the Boost BOOST_FOREACH macro here (this is why we included the header file foreach.hpp at the top of split.hpp). The 2 arguments provided to the BOOST_FOREACH macro are u (which is a variable that will store a reference to the current unit being considered) and _bits (the vector of units to loop over). BOOST_FOREACH walks through each unit in _bits, and we set all bits to zero for each unit by assigning it the value 0L. The "L" in "0L" specifies that we are assigning a long integer (long integers typically comprise twice as many bits as an ordinary integer) with value 0 to u: there is little consequence if we forget the "L" because the compiler would automatically convert 0 to 0L for us, but it is always best to be as explicit as possible.

inline void Split::clear()
    {
    BOOST_FOREACH(split_unit_t & u, _bits)
        {
        u = 0L;
        }
    }

The equals and less than operators

We will eventually need to sort splits, and sorting algorithms need to be able to ask whether one Split object is less than another Split object, and also need to know if two Split objects are identical. Special functions called operators are used to provide this functionality. Once these operators are defined, you can treat Split objects like ordinary numbers with respect to the operators involved: for example, this code would work:

(note: do not add this to split.hpp)
Split a;
Split b;
if (a < b)
  std::cout << "Split a is less than Split b";
else if (a == b)
  std::cout << "Split a equals Split b";
else
  std::cout << "Split a is greater than Split b";

The equals and less than operators both compare the current object to another Split object for which the reference other is provided. Note that we are assuming that two Split objects are equal if their _bits vectors are equal, and one Split object is less than another Split object if its _bits vector is less than other._bits. For the less than operator, we assert that the two _bits vectors being compared have the same number of elements (the operator< defined for standard containers such as std::vector do not provide any warning if the containers being compared are of different lengths).

inline bool Split::operator==(const Split & other) const
    {
    return (_bits == other._bits);
    }

inline bool Split::operator<(const Split & other) const
    {
    assert(_bits.size() == other._bits.size());
    return (_bits < other._bits);
    }

The resize member function

Given the number of taxa (nleaves below), we can use _bits_per_unit to determine how many units the _bits vector should have. For example, if _bits_per_unit was 8, then up to 8 taxa could be represented by a _bits vector of length 1; however 9 taxa would require the _bits vector to have length 2. Dividing nleaves - 1 by _bits_per_unit and adding 1 provides the correct number of units. If nleaves = 8, (8-1)/8 = 0.875, which is truncated to 0 when converted to an integer, so adding 1 to (8-1)/8 yields the correct number of units (1). If nleaves = 9, (9-1)/8 = 1.0, which is truncated to 1 when converted to an integer, and adding 1 to (9-1)/8 yields the correct number of units (2). Once nunits is computed, the function simply resizes _bits to have length nunits and then calls the clear function to ensure that all bits in all units are set to 0.

inline void Split::resize(unsigned nleaves)
    {
    _nleaves = nleaves;
    unsigned nunits = 1 + ((nleaves - 1)/_bits_per_unit);
    _bits.resize(nunits);
    clear();
    }

The setBitAt member function

The way to set the 6th bit involves shifting the value 1 to the left 6 places using the bit shift operator. The bit shift operator is made up of two less-than symbols, and is the same operator that you have used to output text to an output stream such as std::cout. Here are some examples assuming that the data type char (8 bits) is used ():

(note: do not add this to split.hpp)
char zero = (char)0;
char one  = (char)1;
zero looks like this: |00000000|
one looks like this:  |00000001|
one << 0 produces |00000001|
one << 1 produces |00000010|
one << 2 produces |00000100|
one << 3 produces |00001000|
one << 4 produces |00010000|
one << 5 produces |00100000|
one << 6 produces |01000000|
one << 7 produces |10000000|

To set a specific bit in a value that may already have some bits set, use the bitwise OR operator:

(note: do not add this to split.hpp)
char x = 13;
x looks like this:     |00001101|
char y = 1 << 4 
y looks like this:     |00010000|
char z = x | y;
z looks like this:     |00011101|
x |= y;
x now looks like z:    |00011101|

Note that x | y simply sets all bits that are set in either x or y, and x |= y does the same but assigns the result back to x. Given that background, the function setBitAt should now be understandable. Given a leaf_index (where leaf_index ranges from 0 to the number of taxa minus 1), the function first determines which unit to modify (unit_index), then the specific bit within that unit to set (bit_index), and finally uses the bit shift and bitwise OR operators to set the bit.

inline void Split::setBitAt(unsigned leaf_index)
    {
    unsigned unit_index = leaf_index/_bits_per_unit;
    unsigned bit_index = leaf_index - unit_index*_bits_per_unit;
    split_unit_t bit_to_set = 1 << bit_index;
    _bits[unit_index] |= bit_to_set;
    }

The addSplit member function

This function creates a union of the taxon set defined by this Split object and the taxon set defined by other, then assigns that union to this Split object. The for loop iterates through all elements of the _bits vector and bitwise ORs that unit with the corresponding unit from other. This function will be useful when constructing splits for interior nodes of a tree: the bits set in an interior node are the union of all sets corresponding to its descendant nodes.

inline void Split::addSplit(const Split & other)
    {
    unsigned nunits = (unsigned)_bits.size();
    assert(nunits == other._bits.size());
    for (unsigned i = 0; i < nunits; ++i)
        {
        _bits[i] |= other._bits[i];
        }
    }

The createPatternRepresentation member function

This function creates a coded representation of a split in which ‘*’ represents bits that are "on" (i.e. set to 1) and ‘-‘ represents "off" bits (i.e. set to 0).

Here is an example involving 10 taxa where bits_per_unit = 8 and taxa 1, 7 and 9 are "on" and taxa 0, 2-6, and 8 and "off":

+--------+--------+
|10000010|00000010|
+--------+--------+
  unit[0]  unit[1]

Note the nasty tangle we have to navigate to write out a sensible split representation! First we read unit 0 from right to left: the 1 on the far left of unit 0 corresponds to the "on" bit for taxon 7. After unit 0 has been interpreted, move to unit 1: the rightmost bit in unit 1 (0) corresponds to the "off" bit for taxon 8. This split would be output as follows:

        -*-----*-*
taxon 0 ^        ^ taxon 9

Note that we must be careful to avoid looking at the extra bits: the two units together contain 16 bits, yet there are only 10 taxa, so whatever is stored in the last 6 bits of unit[1] should not be considered. The algorithm works by letting i iterate over units (in the example above, i=0,1 corresponding to units 0 and 1, respectively), letting j iterate over the bits within each unit, and testing whether the number 1 shifted over j bits, when bitwise-ANDed with unit[i], yields 1, which would mean that bit j is set in unit[i].

For example, suppose i=0 and j=1. Shifting 1 to the left j places yields, as j takes on values from 0 to 7:

 1 << j     j
-------------
00000001    0
00000010    1
00000100    2
00001000    3
00010000    4
00100000    5
01000000    6
10000000    7

Now consider the result of a bitwise-AND between 1 << j (when j=7) and the value of unit[0]:

10000000    1 << j, where j=7
10000010    unit[0]
-----------------------------------
10000000    (1 << 7) && unit[0]

The bitwise-AND operator (&&) yields 1 for a bit position whenever the two numbers being compared both have a 1 in that position and produces 0 otherwise. The bitwise-AND operator can thus be used to determine whether any particular bit position contains a 1 or 0.

inline std::string Split::createPatternRepresentation() const
    {
    std::string s;
    unsigned ntax_added = 0;
    for (unsigned i = 0; i < _bits.size(); ++i)
        {
        for (unsigned j = 0; j < _bits_per_unit; ++j)
            {
            split_unit_t bitmask = ((split_unit_t)1 << j);
            bool bit_is_set = ((_bits[i] & bitmask) > (split_unit_t)0);
            if (bit_is_set)
                s += '*';
            else
                s += '-';
            if (++ntax_added == _nleaves)
                break;
            }
        }
    return s;
    }