**Not yet finished:** please avoid until this notice goes away

The `Split`

class provides an object that can store the set of taxa that represent descendants of a given `Node`

object. We will provide a Split object for every `Node`

.

Create a new class named `Split`

in a header file named *split.hpp*.

#pragma once #include <vector> #include <set> #include <map> #include <limits> #include <cassert> namespace strom { class Split { public: Split(); ~Split(); bool operator==(const Split & other) const; bool operator<(const Split & other) const; void clear(); void resize(unsigned nleaves); typedef unsigned long split_unit_t; typedef std::vector<split_unit_t> split_t; typedef std::set<Split> treeid_t; typedef std::map< treeid_t, std::vector<unsigned> > treemap_t; typedef std::tuple<unsigned,unsigned,unsigned> split_metrics_t; void setBitAt(unsigned leaf_index); void addSplit(const Split & other); std::string createPatternRepresentation() const; split_metrics_t getSplitMetrics() const; private: split_t _bits; unsigned _bits_per_unit; unsigned _nleaves; public: typedef std::shared_ptr< Split > SharedPtr; }; // member function bodies go here }

## Constructor and destructor

The constructor and destructor simply report that a `Split`

object is being created or destroyed, respectively. In addition, the constructor sets the value of the `_bits_per_unit`

data member to the number of bits in the data type specified by `split_unit_t`

. The calculation involves multiplying the constant `CHAR_BIT`

(number of bits in a byte) by the number of bytes in the data type specified by `split_unit_t`

. This value will determine how many units of that data type will be needed to store a 0 or 1 for each taxon. These units are stored in the vector data member `_bits`

, which has type `split_t`

. (Besides `split_unit_t`

and `split_t`

, two other types are defined here, namely `treeid_t`

and `treemap_t`

, but I will postpone explaining these until later when they are used in the `TreeSummary`

class)

For example, suppose `split_unit_t`

is defined to be an integer of type `char`

, which is 8 bits in length. If there are 10 taxa, then 2 `char`

units will be required to store each split. A split that specifies that taxa having indices 0, 2, 6, and 9 are descendants of a node would be encoded as follows (note that bits within each unit are counted starting from the right, and because there are 10 taxa and 16 bits total, the leftmost 6 bits of unit 1 will always be set to 0):

6 2 0 9 <-- taxon index |01000101|00000010| <-- _bits vector | unit 0 | unit 1 |

In our case, `split_unit_t`

is defined to be an `unsigned long`

integer rather than a `char`

. On my computer, a `char`

comprises only 8 bits and is the smallest integer type, whereas an `unsigned long`

has 64 bits. Thus, we can get by with fewer units if we use `unsigned long`

rather than `char`

.

Here are the bodies of the constructor and destructor. Copy these into the *split.hpp* header file just above the closing curly bracket defining the end of the `strom`

namespace.

inline Split::Split() { _nleaves = 0; _bits_per_unit = (CHAR_BIT)*sizeof(Split::split_unit_t); clear(); std::cout << "Constructing a Split" << std::endl; } inline Split::~Split() { std::cout << "Destroying a Split" << std::endl; }

## The clear member function

The clear function simply sets all bits in all units of the `_bits`

vector to 0. The for loop assigns the reference `u`

, in turn, to each unit in `_bits`

. It is important that the variable `u`

is a reference: if we used `for (auto u : _bits)`

instead, then each unit would be *copied* to `u`

and any changes made to `u`

in the body of the loop would affect the copy but not the original.

The "L" in "0L" specifies that we are assigning a long integer (long integers typically comprise twice as many bits as an ordinary integer) with value 0 to u: there is little consequence if we forget the "L" because the compiler would automatically convert 0 to 0L for us, but it is always best to be as explicit as possible.

inline void Split::clear() { for (auto & u : _bits) { u = 0L; } }

## The equals and less than operators

We will eventually need to sort splits, and sorting algorithms need to be able to ask whether one `Split`

object is less than another `Split`

object, and also need to know if two `Split`

objects are identical. Special functions called operators are used to provide this functionality. Once these operators are defined, you can treat `Split`

objects like ordinary numbers with respect to the operators involved: for example, this code would work:

(note: do not add this to split.hpp) Split a; Split b; if (a < b) std::cout << "Split a is less than Split b"; else if (a == b) std::cout << "Split a equals Split b"; else std::cout << "Split a is greater than Split b";

The equals and less than operators both compare the current object to another Split object for which the reference `other`

is provided. Note that we are assuming that two `Split`

objects are equal if their `_bits`

vectors are equal, and one `Split`

object is less than another `Split`

object if its `_bits`

vector is less than `other._bits`

. For the less than operator, we assert that the two `_bits`

vectors being compared have the same number of elements (the `operator<`

defined for standard containers such as `std::vector`

do not provide any warning if the containers being compared are of different lengths).

inline bool Split::operator==(const Split & other) const { return (_bits == other._bits); } inline bool Split::operator<(const Split & other) const { assert(_bits.size() == other._bits.size()); return (_bits < other._bits); }

## The resize member function

Given the number of taxa (`nleaves`

below), we can use `_bits_per_unit`

to determine how many units the `_bits`

vector should have. For example, if `_bits_per_unit`

was 8, then up to 8 taxa could be represented by a `_bits`

vector of length 1; however 9 taxa would require the `_bits`

vector to have length 2. Dividing `nleaves - 1`

by `_bits_per_unit`

and adding 1 provides the correct number of units. If nleaves = 8, (8-1)/8 = 0.875, which is truncated to 0 when converted to an integer, so adding 1 to (8-1)/8 yields the correct number of units (1). If nleaves = 9, (9-1)/8 = 1.0, which is truncated to 1 when converted to an integer, and adding 1 to (9-1)/8 yields the correct number of units (2). Once `nunits`

is computed, the function simply resizes `_bits`

to have length `nunits`

and then calls the `clear`

function to ensure that all bits in all units are set to 0.

inline void Split::resize(unsigned nleaves) { _nleaves = nleaves; unsigned nunits = 1 + ((nleaves - 1)/_bits_per_unit); _bits.resize(nunits); clear(); }

## The `setBitAt`

member function

The way to set the 6th bit involves shifting the value 1 to the left 6 places using the bit shift operator. The bit shift operator is made up of two less-than symbols, and is the same operator that you have used to output text to an output stream such as `std::cout`

. Here are some examples assuming that the data type `char`

(8 bits) is used:

(note: do not add this to split.hpp) char zero = (char)0; char one = (char)1; zero looks like this: |00000000| one looks like this: |00000001| one << 0 produces |00000001| one << 1 produces |00000010| one << 2 produces |00000100| one << 3 produces |00001000| one << 4 produces |00010000| one << 5 produces |00100000| one << 6 produces |01000000| one << 7 produces |10000000|

To set a specific bit in a value that may already have some bits set, use the bitwise OR operator:

(note: do not add this to split.hpp) char x = 13; x looks like this: |00001101| char y = 1 << 4 y looks like this: |00010000| char z = x | y; z looks like this: |00011101| x |= y; x now looks like z: |00011101|

Note that `x | y`

simply sets all bits that are set in either `x`

or `y`

, and `x |= y`

does the same but assigns the result back to `x`

. Given that background, the function `setBitAt`

should now be understandable. Given a `leaf_index`

(where `leaf_index`

ranges from 0 to the number of taxa minus 1), the function first determines which unit to modify (`unit_index`

), then the specific bit within that unit to set (`bit_index`

), and finally uses the bit shift and bitwise OR operators to set the bit.

inline void Split::setBitAt(unsigned leaf_index) { unsigned unit_index = leaf_index/_bits_per_unit; unsigned bit_index = leaf_index - unit_index*_bits_per_unit; split_unit_t bit_to_set = 1 << bit_index; _bits[unit_index] |= bit_to_set; }

## The addSplit member function

This function creates a union of the taxon set defined by this `Split`

object and the taxon set defined by `other`

, then assigns that union to this `Split`

object. The `for`

loop iterates through all elements of the `_bits`

vector and bitwise ORs that unit with the corresponding unit from `other`

. This function will be useful when constructing splits for interior nodes of a tree: the bits set in an interior node are the union of all sets corresponding to its descendant nodes.

inline void Split::addSplit(const Split & other) { unsigned nunits = (unsigned)_bits.size(); assert(nunits == other._bits.size()); for (unsigned i = 0; i < nunits; ++i) { _bits[i] |= other._bits[i]; } }

## The createPatternRepresentation member function

This function creates a coded representation of a split in which ‘*’ represents bits that are "on" (i.e. set to 1) and ‘-‘ represents "off" bits (i.e. set to 0).

Here is an example involving 10 taxa where `bits_per_unit`

= 8 and taxa 1, 7 and 9 are "on" and taxa 0, 2-6, and 8 and "off":

+--------+--------+ |10000010|00000010| +--------+--------+ unit[0] unit[1]

Note the nasty tangle we have to navigate to write out a sensible split representation! First we read unit 0 from right to left: the 1 on the far left of unit 0 corresponds to the "on" bit for taxon 7. After unit 0 has been interpreted, move to unit 1: the rightmost bit in unit 1 (0) corresponds to the "off" bit for taxon 8. This split would be output as follows:

-*-----*-* taxon 0 ^ ^ taxon 9

Note that we must be careful to avoid looking at the extra bits: the two units together contain 16 bits, yet there are only 10 taxa, so whatever is stored in the last 6 bits of `unit[1]`

should not be considered. The algorithm works by letting `i`

iterate over units (in the example above, `i`

=0,1 corresponding to units 0 and 1, respectively), letting `j`

iterate over the bits within each unit, and testing whether the number 1 shifted over `j`

bits, when bitwise-ANDed with `unit[i]`

, yields 1, which would mean that bit `j`

is set in `unit[i]`

.

For example, suppose `i=0`

and `j=1`

. Shifting 1 to the left `j`

places yields, as `j`

takes on values from 0 to 7:

1 << j j ------------- 00000001 0 00000010 1 00000100 2 00001000 3 00010000 4 00100000 5 01000000 6 10000000 7

Now consider the result of a bitwise-AND between `1 << j`

(when `j=7`

) and the value of `unit[0]`

:

10000000 1 << j, where j=7 10000010 unit[0] ----------------------------------- 10000000 (1 << 7) && unit[0]

The bitwise-AND operator (`&&`

) yields 1 for a bit position whenever the two numbers being compared both have a 1 in that position and produces 0 otherwise. The bitwise-AND operator can thus be used to determine whether any particular bit position contains a 1 or 0.

inline std::string Split::createPatternRepresentation() const { std::string s; unsigned ntax_added = 0; for (unsigned i = 0; i < _bits.size(); ++i) { for (unsigned j = 0; j < _bits_per_unit; ++j) { split_unit_t bitmask = ((split_unit_t)1 << j); bool bit_is_set = ((_bits[i] & bitmask) > (split_unit_t)0); if (bit_is_set) s += '*'; else s += '-'; if (++ntax_added == _nleaves) break; } } return s; }