Create the Data class

Create a new C++ class named Data and add it to your project as the header file data.hpp. Below is the class declaration. The body of each member function will be described separately (you should add each of these member function bodies just above the right curly bracket that terminates the namespace block).

#pragma once

#include <fstream>
#include <string>
#include <vector>
#include <numeric>
#include <boost/format.hpp>
#include <boost/regex.hpp>
#include <boost/shared_ptr.hpp>
#include "xstrom.hpp"

namespace strom {

class Data
        typedef std::vector<double>             pattern_counts_t;
        typedef std::vector<std::string>        taxon_names_t;
        typedef std::vector< std::vector<int> > data_matrix_t;
	typedef boost::shared_ptr< Data >       SharedPtr;

        void                                    getDataFromFile(const std::string filename);

        const pattern_counts_t &                getPatternCounts() const;
        const taxon_names_t &                   getTaxonNames() const;
        const data_matrix_t &                   getDataMatrix() const;

        void                                    clear();
        unsigned                                getNumPatterns() const;
        unsigned                                getSeqLen() const;
        unsigned                                getNumTaxa() const;
        void                                    compressPatterns();
        std::string                             extractMatrix(const std::string s);
        std::string                             stripNexusComments(const std::string s);
        void                                    storeDataMatrix(const std::string & s);

        pattern_counts_t                        _pattern_counts;
        taxon_names_t                           _taxon_names;
        data_matrix_t                           _data_matrix;

// Member function bodies go below here but above the right curly bracket that ends the namespace block


The class declaration above defines four types:

  • pattern_counts_t is a vector of doubles (real numbers) that stores the number of sites having each data pattern
  • taxon_names_t is a vector of strings that stores the taxon names
  • data_matrix_t is a vector the ith element of which is a vector of integers representing the ith data pattern
  • DataShPtr is a Boost shared pointer to an object of type Data

These type definitions simply make it simpler to define variables of these types and to pass such variables into functions. Three of the four are used at the bottom of the class declaration to declare three variables that will store the information read from a data file (_pattern_counts, _taxon_namesm and _data_matrix).

Constructor and destructor

Here are the bodies of the constructor and destructor. As before, the only thing that we have these functions do is to report when an object of the Data class is created or destroyed.

inline Data::Data()
    std::cout << "Creating Data object" << std::endl;

inline Data::~Data()
    std::cout << "Destroying Data object" << std::endl;


Here are the bodies of three member functions that can be described as accessors. Accessor member functions provide (read only) access to private or protected data members. They allow code outside of the class (and not belonging to a friend class) to gain access to the information stored in private data members as long as they do not change the information stored there. The const keyword before the return value of these functions ensures that calling code will not be able to change the returned values, and the second const keyword after the name of each function says that this function does not modify any data itself. It is always a good idea to use the const keyword where ever you can use it because this allows the compiler to catch cases in which you (or someone using your code) unintentionally modifies data stored by an object when it was not supposed to be modified.

inline const Data::pattern_counts_t & Data::getPatternCounts() const
	return _pattern_counts;

inline const Data::taxon_names_t & Data::getTaxonNames() const
	return _taxon_names;

inline const Data::data_matrix_t & Data::getDataMatrix() const
	return _data_matrix;

The clear member function

Here is the body of the clear function, which simply empties the three vectors that store the data patterns (_data_matrix), pattern counts (_pattern_counts), and taxon names (_taxon_names).

inline void Data::clear()

The getNumPatterns and getNumTaxa member functions

The two functions below, getNumPatterns and getNumTaxa, return the number of elements in the _pattern_counts and _taxon_names vectors, respectively. These public functions are needed because _pattern_counts and _taxon_names are declared private in the class declaration.

inline unsigned Data::getNumPatterns() const
    return (unsigned)_pattern_counts.size();

inline unsigned Data::getNumTaxa() const
    return (unsigned)_taxon_names.size();

The getSeqLen member function

The function getSeqLen cannot simply return the number of elements in the _data_matrix vector because that would be the number of unique patterns, not the number of sites. The accumulate function in the std namespace sums the values in a vector between two elements delimited by the iterators provided in the first two arguments. The third argument supplies the initial value (zero in this case).

inline unsigned Data::getSeqLen() const
    return (unsigned)std::accumulate(_pattern_counts.begin(), _pattern_counts.end(), 0);

The extractMatrix member function

The member function extractMatrix uses the Boost regular expression engine (specifically, the boost::regex_match function) to find and return the data matrix from a Nexus-formatted data file. This amounts to everything between the keyword "matrix" and the next semicolon, which serves to terminate the matrix command. One thing that might be puzzling to those who have used regular expressions in Python: backslash characters must be "escaped" to avoid being misinterpreted. Thus, where you might use simply \b in Python, you must use \\b here. This is necessary because the C++ compiler gets to interpret these strings before the Boost regex engine does, and a single backslash in C++ indicates that the following character has special meaning. A double backslash tells the C++ compiler that you really intended that character to just be a backslash and not an escape sequence.

inline std::string Data::extractMatrix(const std::string s)
    boost::smatch what;
    std::string matrix;

    boost::regex matrix_expr(
       // file must begin with "#nexus":
       // any number of characters:
       // the word "matrix" denotes the beginning of the data matrix:
       // capture everything up to the terminating semicolon:
       // regex_match requires us to match the entire string
    bool regex_ok = boost::regex_match(s, what, matrix_expr);
    if (regex_ok)
        // what[0] contains the whole string 
        // what[1] contains the contents of the matrix 
        // Construct a string using characters in contents
        // from what[1].first to what[1].second
        matrix = std::string(what[1].first, what[1].second);

    return matrix;

The stripNexusComments member function

The member function stripNexusComments removes Nexus comments within the supplied string using the Boost regular expression engine (specifically, the boost::regex_replace function). Nexus comments are delimited by square brackets, and this function finds all such sequences of characters delimited by square brackets and replaces them with empty strings.

inline std::string Data::stripNexusComments(const std::string s)
    boost::regex comment_expr("\\[.*?\\]");
    std::string ss = boost::regex_replace(s, comment_expr, std::string(""));
    return ss;

The storeDataMatrix member function

The member function storeDataMatrix finds (using the Boost regular expression engine) all lines in the supplied string containing a taxon name followed by whitespace followed by a DNA sequence. It appends the taxon name onto the end of the _taxon_names data member and stores the DNA sequence in the corresponding element of _data_matrix as a vector of integers. The function translates A, C, G and T (or their lower-case counterparts) into the integers 0, 1, 2 and 3, and uses the integer value 4 for any other character. Thus, any ambiguity code in the data matrix is treated by this program as completely missing data. The _pattern_counts vector is filled with the value 1 because at this point no data compression has occurred.

inline void Data::storeDataMatrix(const std::string & s)
    boost::regex taxonseq_expr(
       // possibly leading whitespace:
       // taxon name:
       // whitespace:   
       // DNA sequence:
       // possibly trailing whitespace:
    boost::sregex_iterator m1(s.begin(), s.end(), taxonseq_expr);
    boost::sregex_iterator m2;
    unsigned ntaxa = (unsigned)std::distance(m1, m2);
    unsigned i = 0;
    for (boost::sregex_iterator m = m1; m != m2; ++m, ++i)
        const boost::smatch & what = *m;
        if (what.size() != 3)
            throw XStrom(boost::str(boost::format("Could not extract %dth taxon name/sequence") % (i+1) ) );

        // what[0] contains the entire match
        // what[1] contains the taxon name
        // what[2] contains the sequence
        // Construct a string using characters in contents from what[0].first to what[0].second
        std::string taxon_name = std::string(what[1].first, what[1].second);
        std::vector<int> & v = _data_matrix[i];
        unsigned seqlen = (unsigned)what[2].length();
        v.assign(seqlen, 4);
        unsigned j = 0;
        for (std::string::const_iterator sit = what[2].first; sit != what[2].second; ++sit, ++j)
            const char c = *sit;
            if (c == 'a' || c == 'A')
                v[j] = 0;
            else if (c == 'c' || c == 'C')
                v[j] = 1;
            else if (c == 'g' || c == 'G')
                v[j] = 2;
            else if (c == 't' || c == 'T')
                v[j] = 3;
    _pattern_counts.assign(_data_matrix[0].size(), 1);

The compressPatterns member function

The member function compressPatterns assumes that all data has been stored. It uses a map (similar to a dictionary in Python) to store all unique data patterns (the keys are integer vectors) along with the number of sites that exhibit those data patterns (the values are unsigned integers). After it stores everything in _data_matrix in this map, it replaces the contents of _data_matrix with the unique patterns and stores the number of sites exhibiting each pattern in the _pattern_counts data member.

inline void Data::compressPatterns()
    // sanity checks
    if (_data_matrix.empty())
        throw XStrom("Attempted to compress an empty data matrix");
    if (_data_matrix[0].size() < getSeqLen())
        throw XStrom("Attempted to compress an already compressed data matrix");

    // create map with keys equal to patterns and values equal to site counts
    typedef std::map< std::vector<int>, unsigned > pattern_map_t;
    pattern_map_t pattern_counts;

    std::vector<int> pattern;
    unsigned ntaxa = (unsigned)_data_matrix.size();
    unsigned seqlen = (unsigned)_data_matrix[0].size();
    for (unsigned i = 0; i < seqlen; ++i)
        for (unsigned j = 0; j < ntaxa; ++j)

   		// If pattern is not already in pattern_counts map, insert it and set value to 1.
        // If it does exist, increment its current value.
        // (see item 24, p. 110, in Meyers' Efficient STL for more info on the technique used here)
		pattern_map_t::iterator lowb = pattern_counts.lower_bound(pattern);
		if (lowb != pattern_counts.end() && !(pattern_counts.key_comp()(pattern, lowb->first)))
			// this pattern has already been seen
            lowb->second += 1;
			// this pattern has not yet been seen
			pattern_counts.insert(lowb, pattern_map_t::value_type(pattern, 1));
    // resize _pattern_counts
    unsigned npatterns = (unsigned)pattern_counts.size();
    // resize _data_matrix so that we can use operator[] to assign values
    typedef std::vector< std::vector<int> > data_matrix_t;
    for (data_matrix_t::iterator dit = _data_matrix.begin(); dit != _data_matrix.end(); ++dit)
    unsigned j = 0;
    for (pattern_map_t::iterator pit = pattern_counts.begin(); pit != pattern_counts.end(); ++pit, ++j)
        _pattern_counts[j] = pit->second;
        unsigned i = 0;
        for (std::vector<int>::const_iterator vit = pit->first.begin(); vit != pit->first.end(); ++vit, ++i)
            _data_matrix[i][j] = *vit;
    unsigned total_num_sites = std::accumulate(_pattern_counts.begin(), _pattern_counts.end(), 0);
    if (seqlen != total_num_sites)
        throw XStrom(boost::str(boost::format("Total number of sites before compaction (%d) not equal to toal number of sites after (%d)") % seqlen % total_num_sites));

The getDataFromFile member function

The member function getDataFromFile is now easy to write because all the hard work is now done by other functions (extractMatrix, stripNexusComments, storeDataMatrix, and compressPatterns).

inline void Data::getDataFromFile(const std::string filename)
    // Open the data file
    std::ifstream inf(filename.c_str());
    if (!inf)
        throw XStrom(boost::str(boost::format("Could not open file named \"%s\"") % filename));
    // Read contents of file
    std::string contents((std::istreambuf_iterator<char>(inf)), std::istreambuf_iterator<char>());
    // Extract everything between "matrix" and terminating semicolon using regular expression
    std::string matrix = extractMatrix(contents);
    if (matrix.length() == 0)
        throw XStrom(boost::str(boost::format("Could not extract matrix from file named \"%s\" (expecting file to begin \"#nexus\" and contain \"matrix...;\", where \"...\" is the data matrix)") % filename));

    // Remove any [nexus comments inside square brackets]
    std::string stripped = stripNexusComments(matrix);

    // Commit to storing new data
    // Extract taxon names and sequence data using regular expressions

    // Compress _data_matrix so that it holds only unique patterns (counts stored in _pattern_counts)