Create the Data class

Create a new C++ class named Data and add it to your project as the header file data.hpp. Below is the class declaration. The body of each member function will be described separately (you should add each of these member function bodies just above the right curly bracket that terminates the namespace block).

#pragma once

#include <fstream>
#include <string>
#include <vector>
#include <numeric>
#include <map>
#include <boost/format.hpp>
#include "xstrom.hpp"

#include "ncl/nxsmultiformat.h"

namespace strom {

class Data
    {
	public:
        typedef std::vector<double>             pattern_counts_t;
        typedef std::vector<std::string>        taxon_names_t;
        typedef std::vector<int>                pattern_t;
        typedef std::map< pattern_t, unsigned > pattern_map_t;
        typedef std::vector< pattern_t >        data_matrix_t;
        typedef std::shared_ptr< Data >         SharedPtr;

                                                Data();
                                                ~Data();

        void                                    getDataFromFile(const std::string filename);

        const pattern_counts_t &                getPatternCounts() const;
        const taxon_names_t &                   getTaxonNames() const;
        const data_matrix_t &                   getDataMatrix() const;

        void                                    clear();
        unsigned                                getNumPatterns() const;
        unsigned                                getSeqLen() const;
        unsigned                                getNumTaxa() const;

    private:

        void                                    updatePatternMap(Data::pattern_t & pattern);
        void                                    compressPatterns();

        pattern_map_t                           _pattern_map;       // used as workspace
        pattern_counts_t                        _pattern_counts;
        taxon_names_t                           _taxon_names;
        data_matrix_t                           _data_matrix;
    };

// Member function bodies go below here but above the right curly bracket that ends the namespace block

}

The class declaration above defines several types:

  • pattern_counts_t is used for variables that store the number of sites having each data pattern
  • taxon_names_t is used for variables that store a vector of taxon names
  • pattern_t is used for variables that store site patterns
  • pattern_map_t is used for variables that map pattern counts to site patterns
  • data_matrix_t is used for variables that store vectors of site patterns
  • SharedPtr is a shared pointer to an object of type Data

These type definitions simply make it simpler to define variables of these types and to pass such variables into functions. Four of these are used at the bottom of the class declaration to declare variables that will store information read from a data file (_pattern_map, _pattern_counts, _taxon_names, and _data_matrix).

Constructor and destructor

Here are the bodies of the constructor and destructor. As usual, the only thing that we have these functions do is to report when an object of the Data class is created or destroyed. These output statements may be commented out at any time they get to be annoying (but don’t delete them; you will find them useful for debugging in the future).

inline Data::Data()
    {
    std::cout << "Creating Data object" << std::endl;
    }

inline Data::~Data()
    {
    std::cout << "Destroying Data object" << std::endl;
    }

Accessors

Here are the bodies of three member functions that can be described as accessors. Accessor member functions provide (read only) access to private or protected data members. They allow code outside of the class (and not belonging to a friend class) to gain access to the information stored in private data members as long as they do not change the information stored there. The const keyword before the return value of these functions ensures that calling code will not be able to change the returned values, and the second const keyword after the name of each function says that this function does not modify any data itself. It is always a good idea to use the const keyword where ever you can because this allows the compiler to catch cases in which you (or someone using your code) unintentionally modifies data stored by an object when it was not supposed to be modified.

inline const Data::pattern_counts_t & Data::getPatternCounts() const
	{
	return _pattern_counts;
	}

inline const Data::taxon_names_t & Data::getTaxonNames() const
	{
	return _taxon_names;
	}

inline const Data::data_matrix_t & Data::getDataMatrix() const
	{
	return _data_matrix;
	}

The clear member function

Here is the body of the clear function, which simply empties the three vectors (and one map) that store the data patterns (_data_matrix), pattern counts (_pattern_counts), and taxon names (_taxon_names).

inline void Data::clear()
    {
    _pattern_map.clear();
    _pattern_counts.clear();
    _taxon_names.clear();
    _data_matrix.clear();
    }

The getNumPatterns and getNumTaxa member functions

The two functions below, getNumPatterns and getNumTaxa, return the number of elements in the _pattern_counts and _taxon_names vectors, respectively. These public functions are needed because _pattern_counts and _taxon_names are declared private in the class declaration.

inline unsigned Data::getNumPatterns() const
    {
    return (unsigned)_pattern_counts.size();
    }

inline unsigned Data::getNumTaxa() const
    {
    return (unsigned)_taxon_names.size();
    }

The getSeqLen member function

The function getSeqLen cannot simply return the number of elements in the _data_matrix vector because that would be the number of unique patterns, not the number of sites. The accumulate function in the std namespace sums the values in a vector between two elements delimited by the iterators provided in the first two arguments. The third argument supplies the initial value (zero in this case).

inline unsigned Data::getSeqLen() const
    {
    return (unsigned)std::accumulate(_pattern_counts.begin(), _pattern_counts.end(), 0);
    }

The compressPatterns member function

The member function compressPatterns assumes that all data has been stored. It uses a map (similar to a dictionary in Python) to store all unique data patterns (the keys are integer vectors) along with the number of sites that exhibit those data patterns (the values are unsigned integers). After it stores everything in _data_matrix in this map, it replaces the contents of _data_matrix with the unique patterns and stores the number of sites exhibiting each pattern in the _pattern_counts data member. Note that the updatePatternMap function used in compressPatterns will be explained next.

inline void Data::compressPatterns()
    {
    // sanity checks
    if (_data_matrix.empty())
        throw XStrom("Attempted to compress an empty data matrix");
    if (_data_matrix[0].size() < getSeqLen())
        throw XStrom("Attempted to compress an already compressed data matrix");

    // create map with keys equal to patterns and values equal to site counts
    _pattern_map.clear();

    std::vector<int> pattern;
    unsigned ntaxa = (unsigned)_data_matrix.size();
    unsigned seqlen = (unsigned)_data_matrix[0].size();
    for (unsigned i = 0; i < seqlen; ++i)
        {
        // Create vector representing pattern at site i
        pattern.clear();
        for (unsigned j = 0; j < ntaxa; ++j)
            {
            pattern.push_back(_data_matrix[j][i]);
            }

        // Add this pattern to pattern_counts
        updatePatternMap(pattern);
        }
        
    // resize _pattern_counts
    unsigned npatterns = (unsigned)_pattern_map.size();
    _pattern_counts.resize(npatterns);
    
    // resize _data_matrix so that we can use operator[] to assign values
    _data_matrix.resize(ntaxa);
    for (auto & row : _data_matrix)
        {
        row.resize(npatterns);
        }

    unsigned j = 0;
    for (auto & pc : _pattern_map)
        {
        _pattern_counts[j] = pc.second;

        unsigned i = 0;
        for (auto sc : pc.first)
            {
            _data_matrix[i][j] = sc;
            ++i;
            }

        ++j;
        }

    // Everything has been transferred to _data_matrix and _pattern_counts, so can now free this memory
    _pattern_map.clear();
        
    unsigned total_num_sites = std::accumulate(_pattern_counts.begin(), _pattern_counts.end(), 0);
    if (seqlen != total_num_sites)
        throw XStrom(boost::str(boost::format("Total number of sites before compaction (%d) not equal to toal number of sites after (%d)") % seqlen % total_num_sites));

The updatePatternMap member function

This private member function adds 1 to the count stored for the supplied pattern in the _pattern_map data member. The function first asks (using the lower_bound function of the standard map class) what is the first key that equals pattern. If there is no entry for the supplied pattern, the lower_bound function will return _pattern_map.end(), in which case pattern will be added as a key to _pattern_map with the value initially set to 1 (because this is the first time this pattern has been seen). If the lower_bound function returns an iterator not equal to _pattern_map.end(), then the count associated with this key is incremented by 1.

inline void Data::updatePatternMap(Data::pattern_t & pattern)
    {
    // If pattern is not already in pattern_map, insert it and set value to 1.
    // If it does exist, increment its current value.
    // (see item 24, p. 110, in Meyers' Efficient STL for more info on the technique used here)
    pattern_map_t::iterator lowb = _pattern_map.lower_bound(pattern);
    if (lowb != _pattern_map.end() && !(_pattern_map.key_comp()(pattern, lowb->first)))
        {
        // this pattern has already been seen
        lowb->second += 1;
        }
    else
        {
        // this pattern has not yet been seen
        _pattern_map.insert(lowb, pattern_map_t::value_type(pattern, 1));
        }
    }

The getDataFromFile member function

This member function makes use of the NCL (Nexus Class Library) to read and parse the data file. The advantage of using the NCL is that we do not have to deal with all the vagaries in format that people (and programs) use when creating NEXUS-formatted data file. The comment at the beginning of the function provides the web address of the primary documentation for the NCL. Please refer to this web site if you want a more complete explanation of the NCL functions used here. I will explain the salient features below the code block.

inline void Data::getDataFromFile(const std::string filename)
    {
    // See http://phylo.bio.ku.edu/ncldocs/v2.1/funcdocs/index.html for documentation
    //
    // -1 means "process all blocks found" (this is a bit field and -1 fills the bit field with 1s)
    // Here are the bits (and nexus blocks) that are defined:
    //     enum NexusBlocksToRead
    //     {
    //         NEXUS_TAXA_BLOCK_BIT = 0x01,
    //         NEXUS_TREES_BLOCK_BIT = 0x02,
    //         NEXUS_CHARACTERS_BLOCK_BIT = 0x04,
    //         NEXUS_ASSUMPTIONS_BLOCK_BIT = 0x08,
    //         NEXUS_SETS_BLOCK_BIT = 0x10,
    //         NEXUS_UNALIGNED_BLOCK_BIT = 0x20,
    //         NEXUS_DISTANCES_BLOCK_BIT = 0x40,
    //         NEXUS_UNKNOWN_BLOCK_BIT = 0x80
    //     };
    MultiFormatReader nexusReader(-1, NxsReader::WARNINGS_TO_STDERR);
    try {
        nexusReader.ReadFilepath(filename.c_str(), MultiFormatReader::NEXUS_FORMAT);
        }
    catch(...)
        {
        nexusReader.DeleteBlocksFromFactories();
        throw;
        }

    // Commit to storing new data
    clear();

    int numTaxaBlocks = nexusReader.GetNumTaxaBlocks();
    for (int i = 0; i < numTaxaBlocks; ++i)
        {
        NxsTaxaBlock * taxaBlock = nexusReader.GetTaxaBlock(i);
        _taxon_names.clear();
        for (auto s : taxaBlock->GetAllLabels()) {
            _taxon_names.push_back(s);
            }
        unsigned ntax = (unsigned)_taxon_names.size();

        const unsigned numCharBlocks = nexusReader.GetNumCharactersBlocks(taxaBlock);
        for (unsigned j = 0; j < numCharBlocks; ++j)
            {
            const NxsCharactersBlock * charBlock = nexusReader.GetCharactersBlock(taxaBlock, j);
            std::string charBlockTitle = taxaBlock->GetTitle();

            _data_matrix.resize(ntax);
            for (unsigned t = 0; t < ntax; ++t)
                {
                const NxsDiscreteStateRow & row = charBlock->GetDiscreteMatrixRow(t);
                unsigned seqlen = (unsigned)row.size();
                _data_matrix[t].resize(seqlen);
                unsigned k = 0;
                for (auto state_code : row) {
                    if (state_code < 0)
                        _data_matrix[t][k++] = 4;
                    else
                        _data_matrix[t][k++] = state_code;
                    }
                }

            }
        }

    // No longer any need to store raw data from nexus file
    nexusReader.DeleteBlocksFromFactories();

    // Compress _data_matrix so that it holds only unique patterns (counts stored in _pattern_counts)
    compressPatterns();
    }
}

We use the NCL’s MultiFormatReader class to create an object that can parse a NEXUS-formatted data file. MultiFormatReader is a class defined by the Nexus Class Library (NCL). Your program knows about it because of the #include "ncl/nxsmultiformat.h" at the top of the file. The actual code for the functions and classes provided by the NCL will not be stored in your program’s executable file; it will be loaded on demand from a separate Dynamic Link Library (provided by a file with a name that ends in .dll on Windows, .dylib on Mac, or .so in Unix) after your program begins to execute.

The ReadFilepath function is used to read in the data file whose name is specified in the argument filename. Note that we’ve supplied the argument MultiFormatReader::NEXUS_FORMAT to the ReadFilepath function: the NCL can read other common data file formats besides NEXUS, so you could modify your program to read (for example) FASTA formatted data files by supplying a different argument (i.e. MultiFormatReader::FASTA_DNA_FORMAT). (A listing of supported formats may be found in the nxsmultiformat.h header file.)

The catch block…

    catch(...)
        {
        nexusReader.DeleteBlocksFromFactories();
        throw;
        }

…is executed only if the NCL encountered a problem reading the file. The ellipsis (…) indicates that any exception thrown by ReadFilepath will be caught here. The catch code deletes all data stored thus far by calling the DeleteBlocksFromFactories() function and then re-throws the exception so that your program can catch and report the problem.

When the NCL reads a NEXUS data file, it looks for a DATA or CHARACTERS block and stores the data therein. If the creator of the data file has added a TAXA block, the NCL will associate the sequence data from the DATA or CHARACTERS block with the taxon names from that TAXA block. If no TAXA block was found in the file, the NCL will create a virtual TAXA block using the taxon names in the DATA or CHARACTERS block. In any case, our function first enumerates all TAXA blocks (real or vitual) found and then, for each TAXA block, we can enumerate all DATA or CHARACTER blocks stored under that TAXA block.

The relevant loops are shown below with some code left out in order to show the big picture:

    int numTaxaBlocks = nexusReader.GetNumTaxaBlocks();
    for (int i = 0; i < numTaxaBlocks; ++i)
        {
        NxsTaxaBlock * taxaBlock = nexusReader.GetTaxaBlock(i);
        ...
        const unsigned numCharBlocks = nexusReader.GetNumCharactersBlocks(taxaBlock);
        for (unsigned j = 0; j < numCharBlocks; ++j)
            {
            const NxsCharactersBlock * charBlock = nexusReader.GetCharactersBlock(taxaBlock, j);
            ...
            }
        }

For each TAXA block read, we store the taxon names in the data member _taxon_names. For each CHARACTER block read (DATA blocks will also be returned by the GetNumCharactersBlocks function), the data in the matrix is stored in the data member _data_matrix. The NCL stores DNA data not as A, C, G, and T, but instead as state codes: 0 for A, 1 for C, 2 for G, and 3 for T. An entry in the NEXUS data file that specifies complete ambiguity (e.g. N or {ACGT}) would be stored as 4. Missing data and gaps are stored as -1 and -2, respectively. In our program, we will simply treat missing data and gaps as equivalent to {ACGT}, which explains this conditional:

if (state_code < 0)
    _data_matrix[t][k++] = 4;
else
    _data_matrix[t][k++] = state_code;

After the main loop through TAXA blocks and their associated CHARACTER blocks, we have taxon names stored in _taxon_names and the sequence (vector of state codes) for taxon t stored in _data_matrix[t]. Because all data from the file has been captured by these two data members, we can safely call

nexusReader.DeleteBlocksFromFactories();

to delete all data stored inside NCL’s data structures.

The final statement in this function calls the compressPatterns() member function, which eliminates duplicate patterns from _data_matrix and stores the counts of each pattern in _pattern_counts. If each sequence has 1000 nucleotides in the data file, then, before calling compressPatterns(), each row of _data_matrix would be a vector containing 1000 state codes and _pattern_counts would be a vector containing 1000 elements each equalling 1. Suppose the there are only 180 distinct site patterns. After calling compressPatterns(), each row of _data_matrix would now have only 180 state codes and _pattern_counts would be 180 elements long (and now many of these numbers are greater than 1).

The _pattern_map data member is used by compressPatterns() but is cleared before that function returns. It serves as a temporary work space, and does not hold onto its contents once _data_matrix and _pattern_counts are rebuilt.