Introduction

Read mapping is a common task in bioinformatics and is often the first step of an in-depth analysis of Next Generation Sequencing data. Its aim is to identify the positions at which a query sequence (read) matches a reference sequence with up to e errors.
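
To make this definition concrete, here is a minimal sketch of approximate matching that uses only the standard library and no index. It is restricted to substitution errors for brevity, whereas the read mapper built in this tutorial also allows insertions and deletions. The name find_with_mismatches is a hypothetical helper and not part of SeqAn3.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper (not a SeqAn3 function): returns every start position at
// which `read` matches `reference` with at most `max_errors` substitutions.
std::vector<std::size_t> find_with_mismatches(std::string_view reference,
                                              std::string_view read,
                                              std::size_t max_errors)
{
    assert(!read.empty());

    std::vector<std::size_t> positions;
    if (read.size() > reference.size())
        return positions; // the read cannot fit anywhere

    for (std::size_t start = 0; start + read.size() <= reference.size(); ++start)
    {
        // Count mismatches and stop early once the error budget is exceeded.
        std::size_t errors = 0;
        for (std::size_t i = 0; i < read.size() && errors <= max_errors; ++i)
            errors += (reference[start + i] != read[i]);

        if (errors <= max_errors)
            positions.push_back(start);
    }
    return positions;
}
```

For example, searching the read ACGA in the reference ACGTACGTTT with one allowed error reports the start positions 0 and 4. Scanning every position like this costs time proportional to the reference length for each read, which is the motivation for building an index first.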

In this example we will implement a read mapper step by step and make use of what we have learned in the previous tutorials. As is common practice with read mappers, we will first create an indexer that builds an index from the reference and stores it on disk. After this, we will implement the actual read mapper, which loads the stored index and maps the reads.

Agenda

Implementing an indexer
- Parse arguments
- Read input files
- Create and store index
Implementing a read mapper
- Parse arguments
- Read and load input, search for approximate matches
- Align the search results
- Write final results into a SAM file

The data

We provide an example reference and an example query file.

The indexer

Step 1 - Parsing arguments

As a first step, we want to parse command line arguments for our indexer. If you get into trouble, you can take a peek at the Argument Parser Tutorial or the API documentation of the seqan3::argument_parser for help.

Assignment 1: Parsing arguments

Let's start our application by setting up the argument parser with the following options:

- The path to the reference file
- An output path for the index

Follow the best practice and create:

- A function run_program that prints the parsed arguments
- A struct cmd_arguments that stores the arguments
- A function initialise_argument_parser to add meta data and options to the parser
- A main function that parses the arguments and calls run_program

Use validators where applicable!

Your main may look like this:

Hint

int main(int argc, char const ** argv)
{
    seqan3::argument_parser parser("Indexer", argc, argv);
    cmd_arguments args{};
 
    initialise_argument_parser(parser, args);
 
    try
    {
        parser.parse();
    }
    catch (seqan3::argument_parser_error const & ext)
    {
        std::cerr << "[PARSER ERROR] " << ext.what() << '\n';
        return -1;
    }
 
    run_program(args.reference_path, args.index_path);
 
    return 0;
}

Solution

#include <seqan3/argument_parser/all.hpp>
#include <seqan3/core/debug_stream.hpp>
 
void run_program(std::filesystem::path const & reference_path,
                 std::filesystem::path const & index_path)
{
    seqan3::debug_stream << "reference_file_path: " << reference_path << '\n';
    seqan3::debug_stream << "index_path:          " << index_path << '\n';
}
 
struct cmd_arguments
{
    std::filesystem::path reference_path{};
    std::filesystem::path index_path{"out.index"};
};
 
void initialise_argument_parser(seqan3::argument_parser & parser, cmd_arguments & args)
{
    parser.info.author = "E. coli";
    parser.info.short_description = "Creates an index over a reference.";
    parser.info.version = "1.0.0";
    parser.add_option(args.reference_path, 'r', "reference", "The path to the reference.",
                      seqan3::option_spec::required,
                      seqan3::input_file_validator{{"fa","fasta"}});
    parser.add_option(args.index_path, 'o', "output", "The output index file path.",
                      seqan3::option_spec::standard,
                      seqan3::output_file_validator{seqan3::output_file_open_options::create_new, {"index"}});
}
 
int main(int argc, char const ** argv)
{
    seqan3::argument_parser parser("Indexer", argc, argv);
    cmd_arguments args{};
 
    initialise_argument_parser(parser, args);
 
    try
    {
        parser.parse();
    }
    catch (seqan3::argument_parser_error const & ext)
    {
        std::cerr << "[PARSER ERROR] " << ext.what() << '\n';
        return -1;
    }
 
    run_program(args.reference_path, args.index_path);
 
    return 0;
}
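
The extension lists passed to the file validators above can be pictured as a simple check on the parsed path. The following sketch uses only std::filesystem. The name has_allowed_extension is a hypothetical helper, and it covers just the extension test, not whether the file exists or may be created.

```cpp
#include <cassert>
#include <filesystem>
#include <string>
#include <vector>

// Hypothetical helper (not a SeqAn3 function): true if `path` ends in one of
// the allowed extensions, given without the leading dot, e.g. "fa" or "fasta".
bool has_allowed_extension(std::filesystem::path const & path,
                           std::vector<std::string> const & allowed)
{
    assert(!allowed.empty());

    std::string extension = path.extension().string(); // e.g. ".fasta"
    if (extension.empty())
        return false;
    extension.erase(0, 1); // drop the leading dot

    for (std::string const & candidate : allowed)
        if (extension == candidate)
            return true;

    return false;
}
```

With the list {"fa", "fasta"}, the path ref.fasta is accepted, while ref.fastq and a path without any extension are rejected. Rejecting bad input while parsing keeps run_program free of such checks.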

Step 2 - Reading the input

As a next step, we want to use the parsed file name to read in our reference data. This will be done using seqan3::sequence_file_input class. As a guide, you can take a look at the Sequence I/O Tutorial.
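
If you are curious what reading a FASTA file involves, the following sketch collects IDs and sequences with the standard library only. It is a simplified stand-in, not how seqan3::sequence_file_input is implemented: it stores sequences as plain strings, does not validate the alphabet, and the names fasta_storage, parse_fasta and parse_fasta_text are hypothetical.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical counterpart of a reference storage, using plain strings.
struct fasta_storage
{
    std::vector<std::string> ids;
    std::vector<std::string> seqs;
};

// Appends every record found in `input` to `storage`.
void parse_fasta(std::istream & input, fasta_storage & storage)
{
    std::string line;
    while (std::getline(input, line))
    {
        if (!line.empty() && line.back() == '\r')
            line.pop_back(); // tolerate Windows line endings
        if (line.empty())
            continue;

        if (line.front() == '>')
        {
            storage.ids.push_back(line.substr(1)); // header without '>'
            storage.seqs.emplace_back();           // start a new sequence
        }
        else
        {
            assert(!storage.seqs.empty() && "sequence data must follow a header");
            storage.seqs.back() += line; // sequences may span several lines
        }
    }
}

// Convenience wrapper for in-memory data, handy for quick experiments.
fasta_storage parse_fasta_text(std::string const & text)
{
    std::istringstream input{text};
    fasta_storage storage;
    parse_fasta(input, storage);
    return storage;
}
```

Keeping the IDs and the sequences in two parallel vectors means that the same position in both vectors refers to the same reference sequence.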

Assignment 2: Reading the input

Extend your program to store the sequence information contained in the reference file into a struct.

To do this, you should create:

- A struct reference_storage_t that stores the IDs and sequences of the reference in member variables
- A function read_reference that fills a reference_storage_t object with the contents of the reference file and prints the reference IDs

You should also perform the following changes in run_program:

- Construct an object storage of type reference_storage_t
- Add a call to read_reference

This is the signature of read_reference:

Hint

void read_reference(std::filesystem::path const & reference_path,
                    reference_storage_t & storage)

This is the reference_storage_t:

Hint

struct reference_storage_t
{
    std::vector<std::string> ids;
    std::vector<std::vector<seqan3::dna5>> seqs;
};

Solution

struct reference_storage_t
{
    std::vector<std::string> ids;
    std::vector<std::vector<seqan3::dna5>> seqs;
};
 
void read_reference(std::filesystem::path const & reference_path,
                    reference_storage_t & storage)
{
    seqan3::sequence_file_input reference_in{reference_path};
    for (auto & [seq, id, qual] : reference_in)
    {
        storage.ids.push_back(std::move(id));
        storage.seqs.push_back(std::move(seq));
    }
    seqan3::debug_stream << "Reference IDs: " << storage.ids << '\n';
}
 
void run_program(std::filesystem::path const & reference_path,
                 std::filesystem::path const & index_path)
{
    seqan3::debug_stream << "reference_file_path: " << reference_path << '\n';
    seqan3::debug_stream << "index_path:          " << index_path << '\n';
    reference_storage_t storage{};
    read_reference(reference_path, storage);
}

Here is the complete program:

Hint

#include <seqan3/argument_parser/all.hpp>
#include <seqan3/core/debug_stream.hpp>
#include <seqan3/io/sequence_file/input.hpp>
 
struct reference_storage_t
{
    std::vector<std::string> ids;
    std::vector<std::vector<seqan3::dna5>> seqs;
};
 
void read_reference(std::filesystem::path const & reference_path,
                    reference_storage_t & storage)
{
    seqan3::sequence_file_input reference_in{reference_path};
    for (auto & [seq, id, qual] : reference_in)
    {
        storage.ids.push_back(std::move(id));
        storage.seqs.push_back(std::move(seq));
    }
    seqan3::debug_stream << "Reference IDs: " << storage.ids << '\n';
}
 
void run_program(std::filesystem::path const & reference_path,
                 std::filesystem::path const & index_path)
{
    seqan3::debug_stream << "reference_file_path: " << reference_path << '\n';
    seqan3::debug_stream << "index_path:          " << index_path << '\n';
    reference_storage_t storage{};
    read_reference(reference_path, storage);
}
 
struct cmd_arguments
{
    std::filesystem::path reference_path{};
    std::filesystem::path index_path{"out.index"};
};
 
void initialise_argument_parser(seqan3::argument_parser & parser, cmd_arguments & args)
{
    parser.info.author = "E. coli";
    parser.info.short_description = "Creates an index over a reference.";
    parser.info.version = "1.0.0";
    parser.add_option(args.reference_path, 'r', "reference", "The path to the reference.",
                      seqan3::option_spec::required,
                      seqan3::input_file_validator{{"fa","fasta"}});
    parser.add_option(args.index_path, 'o', "output", "The output index file path.",
                      seqan3::option_spec::standard,
                      seqan3::output_file_validator{seqan3::output_file_open_options::create_new, {"index"}});
}
 
int main(int argc, char const ** argv)
{
    seqan3::argument_parser parser("Indexer", argc, argv);
    cmd_arguments args{};
 
    initialise_argument_parser(parser, args);
 
    try
    {
        parser.parse();
    }
    catch (seqan3::argument_parser_error const & ext)
    {
        std::cerr << "[PARSER ERROR] " << ext.what() << '\n';
        return -1;
    }
 
    run_program(args.reference_path, args.index_path);
 
    return 0;
}

Step 3 - Index

Now that we have the necessary sequence information, we can create an index and store it. Read up on the Index Tutorial if you have any questions.
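
The idea of building an index once and storing it for later use can be sketched without any library. Note the substitutions: the example below uses a plain suffix array instead of a bidirectional FM index, and raw binary stream writes instead of cereal. All function names are hypothetical, and the file layout (native byte order) is an assumption that makes the result non-portable between machines.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <istream>
#include <limits>
#include <numeric>
#include <ostream>
#include <sstream>
#include <string_view>
#include <vector>

// Builds a suffix array by sorting all suffix start positions.
// Simple but slow; good enough to illustrate the concept.
std::vector<std::uint32_t> build_suffix_array(std::string_view text)
{
    assert(text.size() <= std::numeric_limits<std::uint32_t>::max());

    std::vector<std::uint32_t> sa(text.size());
    std::iota(sa.begin(), sa.end(), 0u);
    std::sort(sa.begin(), sa.end(), [text](std::uint32_t a, std::uint32_t b)
              { return text.substr(a) < text.substr(b); });
    return sa;
}

// Writes the number of entries followed by the raw entries.
void store_index(std::vector<std::uint32_t> const & sa, std::ostream & out)
{
    std::uint64_t const size = sa.size();
    out.write(reinterpret_cast<char const *>(&size), sizeof(size));
    out.write(reinterpret_cast<char const *>(sa.data()), size * sizeof(std::uint32_t));
}

// Reads back what store_index has written.
std::vector<std::uint32_t> load_index(std::istream & in)
{
    std::uint64_t size = 0;
    in.read(reinterpret_cast<char *>(&size), sizeof(size));
    std::vector<std::uint32_t> sa(size);
    in.read(reinterpret_cast<char *>(sa.data()), size * sizeof(std::uint32_t));
    assert(in && "index stream was truncated");
    return sa;
}

// Round trip through an in-memory stream; a file stream opened with
// std::ios::binary can be used in the same way.
bool survives_round_trip(std::vector<std::uint32_t> const & sa)
{
    std::stringstream buffer{std::ios::in | std::ios::out | std::ios::binary};
    store_index(sa, buffer);
    return load_index(buffer) == sa;
}
```

For the text banana the suffix array is 5, 3, 1, 0, 4, 2. The expensive construction runs once in the indexer, and the mapper only pays for loading the stored result.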

Assignment 3: Index

We want to create a new function create_index:

- Takes index_path and storage as parameters
- Creates a seqan3::bi_fm_index over the reference sequences
- Stores the index at index_path

We also need to change:

- run_program to now also call create_index
- run_program and read_reference to no longer print the debug output

This is the signature of create_index:

Hint

void create_index(std::filesystem::path const & index_path,
                  reference_storage_t & storage)

Solution

void read_reference(std::filesystem::path const & reference_path,
                    reference_storage_t & storage)
{
    seqan3::sequence_file_input reference_in{reference_path};
    for (auto & [seq, id, qual] : reference_in)
    {
        storage.ids.push_back(std::move(id));
        storage.seqs.push_back(std::move(seq));
    }
}
 
 
void create_index(std::filesystem::path const & index_path,
                  reference_storage_t & storage)
{
    seqan3::bi_fm_index index{storage.seqs};
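    {
        // The archive and the stream go out of scope at the closing brace,
        // which flushes the index to disk and closes the file.
        std::ofstream os{index_path, std::ios::binary};
        cereal::BinaryOutputArchive oarchive{os};
        oarchive(index);
    }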
}
 
void run_program(std::filesystem::path const & reference_path,
                 std::filesystem::path const & index_path)
{
    reference_storage_t storage{};
    read_reference(reference_path, storage);
    create_index(index_path, storage);
}

Here is the complete program:

Hint

#include <fstream>
 
#include <cereal/archives/binary.hpp>
 
#include <seqan3/argument_parser/all.hpp>
#include <seqan3/io/sequence_file/input.hpp>
#include <seqan3/search/fm_index/bi_fm_index.hpp>
 
struct reference_storage_t
{
    std::vector<std::string> ids;
    std::vector<std::vector<seqan3::dna5>> seqs;
};
 
void read_reference(std::filesystem::path const & reference_path,
                    reference_storage_t & storage)
{
    seqan3::sequence_file_input reference_in{reference_path};
    for (auto & [seq, id, qual] : reference_in)
    {
        storage.ids.push_back(std::move(id));
        storage.seqs.push_back(std::move(seq));
    }
}
 
 
void create_index(std::filesystem::path const & index_path,
                  reference_storage_t & storage)
{
    seqan3::bi_fm_index index{storage.seqs};
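    {
        // The archive and the stream go out of scope at the closing brace,
        // which flushes the index to disk and closes the file.
        std::ofstream os{index_path, std::ios::binary};
        cereal::BinaryOutputArchive oarchive{os};
        oarchive(index);
    }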
}
 
void run_program(std::filesystem::path const & reference_path,
                 std::filesystem::path const & index_path)
{
    reference_storage_t storage{};
    read_reference(reference_path, storage);
    create_index(index_path, storage);
}
 
struct cmd_arguments
{
    std::filesystem::path reference_path{};
    std::filesystem::path index_path{"out.index"};
};
 
void initialise_argument_parser(seqan3::argument_parser & parser, cmd_arguments & args)
{
    parser.info.author = "E. coli";
    parser.info.short_description = "Creates an index over a reference.";
    parser.info.version = "1.0.0";
    parser.add_option(args.reference_path, 'r', "reference", "The path to the reference.",
                      seqan3::option_spec::required,
                      seqan3::input_file_validator{{"fa","fasta"}});
    parser.add_option(args.index_path, 'o', "output", "The output index file path.",
                      seqan3::option_spec::standard,
                      seqan3::output_file_validator{seqan3::output_file_open_options::create_new, {"index"}});
}
 
int main(int argc, char const ** argv)
{
    seqan3::argument_parser parser("Indexer", argc, argv);
    cmd_arguments args{};
 
    initialise_argument_parser(parser, args);
 
    try
    {
        parser.parse();
    }
    catch (seqan3::argument_parser_error const & ext)
    {
        std::cerr << "[PARSER ERROR] " << ext.what() << '\n';
        return -1;
    }
 
    run_program(args.reference_path, args.index_path);
 
    return 0;
}

The read mapper

Step 1 - Parsing arguments

Again, we want to parse command line arguments for our read mapper as a first step. If you get into trouble, you can take a peek at the Argument Parser Tutorial or the API documentation of the seqan3::argument_parser for help.

Assignment 4: Parsing arguments

Let's start our application by setting up the argument parser with the following options:

- The path to the reference file
- The path to the query file
- The path to the index file
- An output path
- The maximum number of errors we want to allow (between 0 and 4)

Follow the best practice and create:

- A function run_program that prints the parsed arguments
- A struct cmd_arguments that stores the arguments
- A function initialise_argument_parser to add meta data and options to the parser
- A main function that parses the arguments and calls run_program

Use validators where applicable!

Your main may look like this:

Hint

int main(int argc, char const ** argv)
{
    seqan3::argument_parser parser("Mapper", argc, argv);
    cmd_arguments args{};
 
    initialise_argument_parser(parser, args);
 
    try
    {
        parser.parse();
    }
    catch (seqan3::argument_parser_error const & ext)
    {
        std::cerr << "[PARSER ERROR] " << ext.what() << '\n';
        return -1;
    }
 
    run_program(args.reference_path, args.query_path, args.index_path, args.sam_path, args.errors);
 
    return 0;
}

Solution

#include <seqan3/argument_parser/all.hpp>
#include <seqan3/core/debug_stream.hpp>
 
void run_program(std::filesystem::path const & reference_path,
                 std::filesystem::path const & query_path,
                 std::filesystem::path const & index_path,
                 std::filesystem::path const & sam_path,
                 uint8_t const errors)
{
    seqan3::debug_stream << "reference_path: " << reference_path << '\n';
    seqan3::debug_stream << "query_path:     " << query_path << '\n';
    seqan3::debug_stream << "index_path:     " << index_path << '\n';
    seqan3::debug_stream << "sam_path:       " << sam_path << '\n';
    seqan3::debug_stream << "errors:         " << errors << '\n';
}
 
struct cmd_arguments
{
    std::filesystem::path reference_path{};
    std::filesystem::path query_path{};
    std::filesystem::path index_path{};
    std::filesystem::path sam_path{"out.sam"};
    uint8_t errors{0};
};
 
void initialise_argument_parser(seqan3::argument_parser & parser, cmd_arguments & args)
{
    parser.info.author = "E. coli";
    parser.info.short_description = "Map reads against a reference.";
    parser.info.version = "1.0.0";
    parser.add_option(args.reference_path, 'r', "reference", "The path to the reference.",
                      seqan3::option_spec::required,
                      seqan3::input_file_validator{{"fa","fasta"}});
    parser.add_option(args.query_path, 'q', "query", "The path to the query.",
                      seqan3::option_spec::required,
                      seqan3::input_file_validator{{"fq","fastq"}});
    parser.add_option(args.index_path, 'i', "index", "The path to the index.",
                      seqan3::option_spec::required,
                      seqan3::input_file_validator{{"index"}});
    parser.add_option(args.sam_path, 'o', "output", "The output SAM file path.",
                      seqan3::option_spec::standard,
                      seqan3::output_file_validator{seqan3::output_file_open_options::create_new, {"sam"}});
    parser.add_option(args.errors, 'e', "error", "Maximum allowed errors.",
                      seqan3::option_spec::standard,
                      seqan3::arithmetic_range_validator{0, 4});
}
 
int main(int argc, char const ** argv)
{
    seqan3::argument_parser parser("Mapper", argc, argv);
    cmd_arguments args{};
 
    initialise_argument_parser(parser, args);
 
    try
    {
        parser.parse();
    }
    catch (seqan3::argument_parser_error const & ext)
    {
        std::cerr << "[PARSER ERROR] " << ext.what() << '\n';
        return -1;
    }
 
    run_program(args.reference_path, args.query_path, args.index_path, args.sam_path, args.errors);
 
    return 0;
}
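
The range check on the error count can be pictured as follows. This is a sketch with the standard library only, assuming the inclusive range 0 to 4 from the assignment. The name parse_error_count is a hypothetical helper and does not describe how the SeqAn3 validator is implemented.

```cpp
#include <cassert>
#include <charconv>
#include <cstdint>
#include <optional>
#include <string_view>
#include <system_error>

// Hypothetical helper (not a SeqAn3 function): parses `argument` as an integer
// and accepts it only if it lies within the inclusive range [min, max].
std::optional<std::uint8_t> parse_error_count(std::string_view argument,
                                              int min = 0,
                                              int max = 4)
{
    assert(0 <= min && min <= max && max <= 255);

    int value = 0;
    char const * const last = argument.data() + argument.size();
    auto const [end, ec] = std::from_chars(argument.data(), last, value);

    // Reject non-numbers, trailing characters and values outside the range.
    bool const fully_parsed = (ec == std::errc{}) && (end == last);
    if (!fully_parsed || value < min || value > max)
        return std::nullopt;

    return static_cast<std::uint8_t>(value);
}
```

Here "3" is accepted, while "5", "-1", "2x" and an empty string are rejected. Parsing into an int first and narrowing afterwards avoids treating the argument as a single character.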

Step 2 - Reading the input and searching

We also want to read the reference in the read mapper. This is done the same way as for the indexer. We can now load the index and conduct a search. Read up on the Search Tutorial if you have any questions.

Assignment 5: Reading the input

Extend your program to read the reference file the same way the indexer does. After this you can load the index and print results of a search.

To do this, you should:

- Carry over the read_reference function and the reference_storage_t struct from the indexer
- Create a function map_reads that loads the index and prints the results of the search (allowing all error types) for the first 20 queries

You should also perform the following changes in run_program:

- Remove the debug output
- Construct an object storage of type reference_storage_t
- Add a call to read_reference and map_reads

This is the signature of map_reads:

Hint

void map_reads(std::filesystem::path const & query_path,
               std::filesystem::path const & index_path,
               std::filesystem::path const & sam_path,
               reference_storage_t & storage,
               uint8_t const errors)

Solution

struct reference_storage_t
{
    std::vector<std::string> ids;
    std::vector<std::vector<seqan3::dna5>> seqs;
};
 
void read_reference(std::filesystem::path const & reference_path,
                    reference_storage_t & storage)
{
    seqan3::sequence_file_input reference_in{reference_path};
    for (auto & [seq, id, qual] : reference_in)
    {
        storage.ids.push_back(std::move(id));
        storage.seqs.push_back(std::move(seq));
    }
}
 
void map_reads(std::filesystem::path const & query_path,
               std::filesystem::path const & index_path,
               std::filesystem::path const & sam_path,
               reference_storage_t & storage,
               uint8_t const errors)
{
    // we need the alphabet and text layout before loading
    seqan3::bi_fm_index<seqan3::dna5, seqan3::text_layout::collection> index;
    {
        std::ifstream is{index_path, std::ios::binary};
        cereal::BinaryInputArchive iarchive{is};
        iarchive(index);
    }
 
    seqan3::sequence_file_input query_file_in{query_path};
 
    seqan3::configuration const search_config = seqan3::search_cfg::max_error_total{
                                                    seqan3::search_cfg::error_count{errors}} |
                                                seqan3::search_cfg::hit_all_best{};
 
#if SEQAN3_WORKAROUND_GCC_93983
    for (auto && record : query_file_in /*| seqan3::views::take(20)*/)
#else // ^^^ workaround / no workaround vvv
    for (auto && record : query_file_in | seqan3::views::take(20))
#endif // SEQAN3_WORKAROUND_GCC_93983
    {
        seqan3::debug_stream << "Hits:" << '\n';
        for (auto && result : search(record.sequence(), index, search_config))
            seqan3::debug_stream << result << '\n';
        seqan3::debug_stream << "======================" << '\n';
    }
    (void) sam_path; // prevent unused parameter warning
    (void) storage; // prevent unused parameter warning
}
 
void run_program(std::filesystem::path const & reference_path,
                 std::filesystem::path const & query_path,
                 std::filesystem::path const & index_path,
                 std::filesystem::path const & sam_path,
                 uint8_t const errors)
{
    reference_storage_t storage{};
    read_reference(reference_path, storage);
    map_reads(query_path, index_path, sam_path, storage, errors);
}
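
The solution configures the search with seqan3::search_cfg::hit_all_best, which reports all hits that share the lowest error count found for a query. The selection rule itself can be sketched as a filter over a list of candidate hits. The struct hit and the function keep_best_hits are hypothetical, and the actual search does not need to enumerate all hits before choosing the best ones.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical hit record: where the query matched and with how many errors.
struct hit
{
    std::size_t reference_id;
    std::size_t position;
    std::size_t errors;
};

// Keeps every hit whose error count equals the smallest error count present.
std::vector<hit> keep_best_hits(std::vector<hit> const & hits)
{
    std::vector<hit> best;
    if (hits.empty())
        return best;

    std::size_t const fewest_errors =
        std::min_element(hits.begin(), hits.end(),
                         [](hit const & a, hit const & b) { return a.errors < b.errors; })
            ->errors;

    std::copy_if(hits.begin(), hits.end(), std::back_inserter(best),
                 [fewest_errors](hit const & h) { return h.errors == fewest_errors; });

    assert(!best.empty()); // at least the minimum itself is kept
    return best;
}
```

Given hits with 2, 1 and 1 errors, the two hits with one error are kept in their original order. A read that occurs equally well at several positions therefore still yields several results.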

Here is the complete program:

Hint

#include <fstream>
 
#include <cereal/archives/binary.hpp>
 
#include <seqan3/argument_parser/all.hpp>
#include <seqan3/core/debug_stream.hpp>
#include <seqan3/io/sequence_file/input.hpp>
#include <seqan3/search/all.hpp>
#include <seqan3/search/fm_index/bi_fm_index.hpp>
 
struct reference_storage_t
{
    std::vector<std::string> ids;
    std::vector<std::vector<seqan3::dna5>> seqs;
};
 
void read_reference(std::filesystem::path const & reference_path,
                    reference_storage_t & storage)
{
    seqan3::sequence_file_input reference_in{reference_path};
    for (auto & [seq, id, qual] : reference_in)
    {
        storage.ids.push_back(std::move(id));
        storage.seqs.push_back(std::move(seq));
    }
}
 
void map_reads(std::filesystem::path const & query_path,
               std::filesystem::path const & index_path,
               std::filesystem::path const & sam_path,
               reference_storage_t & storage,
               uint8_t const errors)
{
    // we need the alphabet and text layout before loading
    seqan3::bi_fm_index<seqan3::dna5, seqan3::text_layout::collection> index;
    {
        std::ifstream is{index_path, std::ios::binary};
        cereal::BinaryInputArchive iarchive{is};
        iarchive(index);
    }
 
    seqan3::sequence_file_input query_file_in{query_path};
 
    seqan3::configuration const search_config = seqan3::search_cfg::max_error_total{
                                                    seqan3::search_cfg::error_count{errors}} |
                                                seqan3::search_cfg::hit_all_best{};
 
#if SEQAN3_WORKAROUND_GCC_93983
    for (auto && record : query_file_in /*| seqan3::views::take(20)*/)
#else // ^^^ workaround / no workaround vvv
    for (auto && record : query_file_in | seqan3::views::take(20))
#endif // SEQAN3_WORKAROUND_GCC_93983
    {
        seqan3::debug_stream << "Hits:" << '\n';
        for (auto && result : search(record.sequence(), index, search_config))
            seqan3::debug_stream << result << '\n';
        seqan3::debug_stream << "======================" << '\n';
    }
    (void) sam_path; // prevent unused parameter warning
    (void) storage; // prevent unused parameter warning
}
 
void run_program(std::filesystem::path const & reference_path,
                 std::filesystem::path const & query_path,
                 std::filesystem::path const & index_path,
                 std::filesystem::path const & sam_path,
                 uint8_t const errors)
{
    reference_storage_t storage{};
    read_reference(reference_path, storage);
    map_reads(query_path, index_path, sam_path, storage, errors);
}
 
struct cmd_arguments
{
    std::filesystem::path reference_path{};
    std::filesystem::path query_path{};
    std::filesystem::path index_path{};
    std::filesystem::path sam_path{"out.sam"};
    uint8_t errors{0};
};
 
void initialise_argument_parser(seqan3::argument_parser & parser, cmd_arguments & args)
{
    parser.info.author = "E. coli";
    parser.info.short_description = "Map reads against a reference.";
    parser.info.version = "1.0.0";
    parser.add_option(args.reference_path, 'r', "reference", "The path to the reference.",
                      seqan3::option_spec::required,
                      seqan3::input_file_validator{{"fa","fasta"}});
    parser.add_option(args.query_path, 'q', "query", "The path to the query.",
                      seqan3::option_spec::required,
                      seqan3::input_file_validator{{"fq","fastq"}});
    parser.add_option(args.index_path, 'i', "index", "The path to the index.",
                      seqan3::option_spec::required,
                      seqan3::input_file_validator{{"index"}});
    parser.add_option(args.sam_path, 'o', "output", "The output SAM file path.",
                      seqan3::option_spec::standard,
                      seqan3::output_file_validator{seqan3::output_file_open_options::create_new, {"sam"}});
    parser.add_option(args.errors, 'e', "error", "Maximum allowed errors.",
                      seqan3::option_spec::standard,
                      seqan3::arithmetic_range_validator{0, 4});
}
 
int main(int argc, char const ** argv)
{
    seqan3::argument_parser parser("Mapper", argc, argv);
    cmd_arguments args{};
 
    initialise_argument_parser(parser, args);
 
    try
    {
        parser.parse();
    }
    catch (seqan3::argument_parser_error const & ext)
    {
        std::cerr << "[PARSER ERROR] " << ext.what() << '\n';
        return -1;
    }
 
    run_program(args.reference_path, args.query_path, args.index_path, args.sam_path, args.errors);
 
    return 0;
}

Step 3 - Alignment

We can now use the obtained positions to align each query against the reference. Refer to the Alignment Tutorial if you have any questions.
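
The following sketch shows the two ingredients of this step with the standard library only: cutting a window out of the reference around a hit, and scoring the query against that window. It assumes that the query has to be aligned end to end while reference characters before and after it are free, and that the score is the negative edit distance. The names cut_window and semi_global_score are hypothetical, and the sketch computes only the score, not the alignment itself.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string_view>
#include <utility>
#include <vector>

// Returns the part of `reference` that starts one character before the hit
// and spans query_length + 1 characters, clamped to the end of the reference.
std::string_view cut_window(std::string_view reference,
                            std::size_t hit_position,
                            std::size_t query_length)
{
    assert(hit_position <= reference.size());
    std::size_t const start = hit_position ? hit_position - 1 : 0;
    return reference.substr(start, query_length + 1); // substr clamps the length
}

// Negative edit distance between `query` and its best matching part of `window`.
int semi_global_score(std::string_view window, std::string_view query)
{
    // previous[j]: fewest edits for the query prefix processed so far when the
    // alignment ends at window position j. Row 0 is all zeros because skipping
    // leading window characters is free.
    std::vector<int> previous(window.size() + 1, 0);
    std::vector<int> current(window.size() + 1, 0);

    for (std::size_t i = 1; i <= query.size(); ++i)
    {
        current[0] = static_cast<int>(i); // query characters without a partner
        for (std::size_t j = 1; j <= window.size(); ++j)
        {
            int const substitution = previous[j - 1] + (query[i - 1] != window[j - 1]);
            int const gap_in_window = previous[j] + 1;
            int const gap_in_query = current[j - 1] + 1;
            current[j] = std::min({substitution, gap_in_window, gap_in_query});
        }
        std::swap(previous, current);
    }

    // Trailing window characters are free, so take the best value of the last row.
    return -*std::min_element(previous.begin(), previous.end());
}
```

For the window TACGTA and the query ACGT the score is 0, and for the window TACCTA it is -1 because one substitution is needed. Restricting the alignment to a small window keeps it cheap compared to aligning against the whole reference.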

Assignment 6: Alignment

We want to extend map_reads to:

- Use the output of the search to align the query against the reference
- Print the query ID, alignment score, subrange of the reference sequence and the query (for the first 20 queries)

This is the alignment config:

Hint

    seqan3::configuration const align_config = seqan3::align_cfg::method_global{
                                                   seqan3::align_cfg::free_end_gaps_sequence1_leading{true},
                                                   seqan3::align_cfg::free_end_gaps_sequence2_leading{false},
                                                   seqan3::align_cfg::free_end_gaps_sequence1_trailing{true},
                                                   seqan3::align_cfg::free_end_gaps_sequence2_trailing{false}} |
                                               seqan3::align_cfg::edit_scheme |
                                               seqan3::align_cfg::output_alignment{} |
                                               seqan3::align_cfg::output_score{};

Solution

void map_reads(std::filesystem::path const & query_path,
               std::filesystem::path const & index_path,
               std::filesystem::path const & sam_path,
               reference_storage_t & storage,
               uint8_t const errors)
{
    // we need the alphabet and text layout before loading
    seqan3::bi_fm_index<seqan3::dna5, seqan3::text_layout::collection> index;
    {
        std::ifstream is{index_path, std::ios::binary};
        cereal::BinaryInputArchive iarchive{is};
        iarchive(index);
    }
 
    seqan3::sequence_file_input query_file_in{query_path};
 
    seqan3::configuration const search_config = seqan3::search_cfg::max_error_total{
                                                    seqan3::search_cfg::error_count{errors}} |
                                                seqan3::search_cfg::hit_all_best{};
 
    seqan3::configuration const align_config = seqan3::align_cfg::method_global{
                                                   seqan3::align_cfg::free_end_gaps_sequence1_leading{true},
                                                   seqan3::align_cfg::free_end_gaps_sequence2_leading{false},
                                                   seqan3::align_cfg::free_end_gaps_sequence1_trailing{true},
                                                   seqan3::align_cfg::free_end_gaps_sequence2_trailing{false}} |
                                               seqan3::align_cfg::edit_scheme |
                                               seqan3::align_cfg::output_alignment{} |
                                               seqan3::align_cfg::output_score{};
 
#if SEQAN3_WORKAROUND_GCC_93983
    for (auto && record : query_file_in /*| seqan3::views::take(20)*/)
#else // ^^^ workaround / no workaround vvv
    for (auto && record : query_file_in | seqan3::views::take(20))
#endif // SEQAN3_WORKAROUND_GCC_93983
    {
        auto & query = record.sequence();
        for (auto && result : search(query, index, search_config))
        {
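            // Window into the reference around the hit, clamped to the end of the reference.
            auto & reference = storage.seqs[result.reference_id()];
            size_t const start = result.reference_begin_position() ? result.reference_begin_position() - 1 : 0;
            size_t const available = reference.size() - start;
            size_t const length = query.size() + 1 < available ? query.size() + 1 : available;
            std::span text_view{std::data(reference) + start, length};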
 
            for (auto && alignment : seqan3::align_pairwise(std::tie(text_view, query), align_config))
            {
                auto && [aligned_database, aligned_query] = alignment.alignment();
                seqan3::debug_stream << "id:       " << record.id() << '\n';
                seqan3::debug_stream << "score:    " << alignment.score() << '\n';
                seqan3::debug_stream << "database: " << aligned_database << '\n';
                seqan3::debug_stream << "query:    "  << aligned_query << '\n';
                seqan3::debug_stream << "=============\n";
            }
        }
    }
    (void) sam_path; // prevent unused parameter warning
}

Here is the complete program:

Hint

#include <fstream>
 
#include <cereal/archives/binary.hpp>
 
#include <seqan3/alignment/configuration/all.hpp>
#include <seqan3/alignment/pairwise/align_pairwise.hpp>
#include <seqan3/argument_parser/all.hpp>
#include <seqan3/core/debug_stream.hpp>
#include <seqan3/io/sequence_file/input.hpp>
#include <seqan3/search/all.hpp>
#include <seqan3/search/fm_index/bi_fm_index.hpp>
#include <seqan3/std/span>
 
struct reference_storage_t
{
    std::vector<std::string> ids;
    std::vector<std::vector<seqan3::dna5>> seqs;
};
 
void read_reference(std::filesystem::path const & reference_path,
                    reference_storage_t & storage)
{
    seqan3::sequence_file_input reference_in{reference_path};
    for (auto & [seq, id, qual] : reference_in)
    {
        storage.ids.push_back(std::move(id));
        storage.seqs.push_back(std::move(seq));
    }
}
 
void map_reads(std::filesystem::path const & query_path,
               std::filesystem::path const & index_path,
               std::filesystem::path const & sam_path,
               reference_storage_t & storage,
               uint8_t const errors)
{
    // we need the alphabet and text layout before loading
    seqan3::bi_fm_index<seqan3::dna5, seqan3::text_layout::collection> index;
    {
        std::ifstream is{index_path, std::ios::binary};
        cereal::BinaryInputArchive iarchive{is};
        iarchive(index);
    }
 
    seqan3::sequence_file_input query_file_in{query_path};
 
    seqan3::configuration const search_config = seqan3::search_cfg::max_error_total{
                                                    seqan3::search_cfg::error_count{errors}} |
                                                seqan3::search_cfg::hit_all_best{};
 
    seqan3::configuration const align_config = seqan3::align_cfg::method_global{
                                                   seqan3::align_cfg::free_end_gaps_sequence1_leading{true},
                                                   seqan3::align_cfg::free_end_gaps_sequence2_leading{false},
                                                   seqan3::align_cfg::free_end_gaps_sequence1_trailing{true},
                                                   seqan3::align_cfg::free_end_gaps_sequence2_trailing{false}} |
                                               seqan3::align_cfg::edit_scheme |
                                               seqan3::align_cfg::output_alignment{} |
                                               seqan3::align_cfg::output_score{};
 
#if SEQAN3_WORKAROUND_GCC_93983
    for (auto && record : query_file_in /*| seqan3::views::take(20)*/)
#else // ^^^ workaround / no workaround vvv
    for (auto && record : query_file_in | seqan3::views::take(20))
#endif // SEQAN3_WORKAROUND_GCC_93983
    {
        auto & query = record.sequence();
        for (auto && result : search(query, index, search_config))
        {
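            // Window into the reference around the hit, clamped to the end of the reference.
            auto & reference = storage.seqs[result.reference_id()];
            size_t const start = result.reference_begin_position() ? result.reference_begin_position() - 1 : 0;
            size_t const available = reference.size() - start;
            size_t const length = query.size() + 1 < available ? query.size() + 1 : available;
            std::span text_view{std::data(reference) + start, length};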
 
            for (auto && alignment : seqan3::align_pairwise(std::tie(text_view, query), align_config))
            {
                auto && [aligned_database, aligned_query] = alignment.alignment();
                seqan3::debug_stream << "id:       " << record.id() << '\n';
                seqan3::debug_stream << "score:    " << alignment.score() << '\n';
                seqan3::debug_stream << "database: " << aligned_database << '\n';
                seqan3::debug_stream << "query:    " << aligned_query << '\n';
                seqan3::debug_stream << "=============\n";
            }
        }
    }
    (void) sam_path; // prevent unused parameter warning
}
 
void run_program(std::filesystem::path const & reference_path,
                 std::filesystem::path const & query_path,
                 std::filesystem::path const & index_path,
                 std::filesystem::path const & sam_path,
                 uint8_t const errors)
{
    reference_storage_t storage{};
    read_reference(reference_path, storage);
    map_reads(query_path, index_path, sam_path, storage, errors);
}
 
struct cmd_arguments
{
    std::filesystem::path reference_path{};
    std::filesystem::path query_path{};
    std::filesystem::path index_path{};
    std::filesystem::path sam_path{"out.sam"};
    uint8_t errors{0};
};
 
void initialise_argument_parser(seqan3::argument_parser & parser, cmd_arguments & args)
{
    parser.info.author = "E. coli";
    parser.info.short_description = "Map reads against a reference.";
    parser.info.version = "1.0.0";
    parser.add_option(args.reference_path, 'r', "reference", "The path to the reference.",
                      seqan3::option_spec::required,
                      seqan3::input_file_validator{{"fa","fasta"}});
    parser.add_option(args.query_path, 'q', "query", "The path to the query.",
                      seqan3::option_spec::required,
                      seqan3::input_file_validator{{"fq","fastq"}});
    parser.add_option(args.index_path, 'i', "index", "The path to the index.",
                      seqan3::option_spec::required,
                      seqan3::input_file_validator{{"index"}});
    parser.add_option(args.sam_path, 'o', "output", "The output SAM file path.",
                      seqan3::option_spec::standard,
                      seqan3::output_file_validator{seqan3::output_file_open_options::create_new, {"sam"}});
    parser.add_option(args.errors, 'e', "error", "Maximum allowed errors.",
                      seqan3::option_spec::standard,
                      seqan3::arithmetic_range_validator{0, 4});
}
 
int main(int argc, char const ** argv)
{
    seqan3::argument_parser parser("Mapper", argc, argv);
    cmd_arguments args{};
 
    initialise_argument_parser(parser, args);
 
    try
    {
        parser.parse();
    }
    catch (seqan3::argument_parser_error const & ext)
    {
        std::cerr << "[PARSER ERROR] " << ext.what() << '\n';
        return -1;
    }
 
    run_program(args.reference_path, args.query_path, args.index_path, args.sam_path, args.errors);
 
    return 0;
}
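
The alignment configuration in the program above treats the ends of the first sequence (the reference window) as free, while the query has to be aligned completely, and it scores with the edit scheme. To make that scoring concrete, here is a minimal stand-alone sketch in plain C++. It is not SeqAn3's implementation; the function name semi_global_edit_score is hypothetical, and it assumes unit costs (match 0, mismatch -1, gap -1).

```cpp
#include <algorithm>
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper (not part of SeqAn3): best edit score of aligning the
// whole `query` against any substring of `text`.
// 0 means an exact occurrence, -k means k errors in the best alignment.
int semi_global_edit_score(std::string_view text, std::string_view query)
{
    // column[j] holds the best score for query[0..j) at the current text position.
    std::vector<int> column(query.size() + 1);
    for (std::size_t j = 0; j <= query.size(); ++j)
        column[j] = -static_cast<int>(j); // no text consumed yet: j gaps

    // The alignment may end before any text character is consumed.
    int best = column[query.size()];

    for (char const t : text)
    {
        int diagonal = column[0]; // score of the cell up and to the left
        column[0] = 0;            // skipping leading text characters is free
        for (std::size_t j = 1; j <= query.size(); ++j)
        {
            int const up = column[j]; // same query prefix, previous text position
            int const substitution = diagonal + (t == query[j - 1] ? 0 : -1);
            int const gap_in_query = up - 1;            // text character against a gap
            int const gap_in_text = column[j - 1] - 1;  // query character against a gap
            diagonal = up;
            column[j] = std::max({substitution, gap_in_query, gap_in_text});
        }
        // Skipping trailing text characters is free, so every position may end the alignment.
        best = std::max(best, column[query.size()]);
    }
    return best;
}
```

For example, semi_global_edit_score("TTACGTTT", "ACGT") yields 0 because the query occurs exactly, whereas a single mismatch in the text lowers the result to -1.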

Step 4 - Alignment output

Finally, we can write our results into a SAM file.

Assignment 7: SAM out

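Extend map_reads so that it writes the alignment results into a SAM file instead of printing them; the debug output from the previous step should be removed.

Try to write all available information into the SAM file. As a naive mapping quality, use mapping quality = 60 + alignment score. Because the edit scheme scores a perfect match with 0 and subtracts 1 for every error, this yields 60 for an exact match and lower values for alignments with errors.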
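
The naive mapping quality can be sketched as a small helper. The name naive_mapping_quality is hypothetical, and the clamp to the range 0 to 60 is an added safety net that the tutorial formula itself does not include.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper: turns an edit-distance alignment score
// (0 = exact match, -k = k errors) into a naive mapping quality of 60 + score.
// Assumption: clamping to [0, 60] is our addition, not part of the tutorial formula.
std::uint8_t naive_mapping_quality(int const alignment_score)
{
    int const quality = 60 + alignment_score;
    return static_cast<std::uint8_t>(std::clamp(quality, 0, 60));
}
```

With at most four allowed errors, as enforced by the argument parser, the clamp should rarely matter; it only guards against unexpected scores.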

The sam_file_output can be constructed like this:

Hint

    seqan3::sam_file_output sam_out{sam_path, seqan3::fields<seqan3::field::seq,
                                                             seqan3::field::id,
                                                             seqan3::field::ref_id,
                                                             seqan3::field::ref_offset,
                                                             seqan3::field::alignment,
                                                             seqan3::field::qual,
                                                             seqan3::field::mapq>{}};
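
To see what these fields turn into, it can help to look at the target format. The following stand-alone sketch formats one alignment line by hand, using the eleven mandatory columns of the SAM specification. The struct sam_record_sketch and the function to_sam_line are hypothetical and only illustrate the file format; this is not how seqan3::sam_file_output works internally. Columns that are not among the selected fields (FLAG, RNEXT, PNEXT, TLEN) are filled with placeholder values here.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical record type holding the information selected in the hint above.
struct sam_record_sketch
{
    std::string id;     // QNAME, the read name
    std::string ref_id; // RNAME, the reference sequence name
    std::uint32_t pos;  // POS, 1-based leftmost mapping position
    std::uint8_t mapq;  // MAPQ, mapping quality
    std::string cigar;  // CIGAR, derived from the alignment
    std::string seq;    // SEQ, the read sequence
    std::string qual;   // QUAL, the base qualities
};

// Formats one tab-separated SAM alignment line.
std::string to_sam_line(sam_record_sketch const & record)
{
    std::ostringstream out;
    out << record.id << '\t'
        << 0 << '\t'                                  // FLAG: placeholder
        << record.ref_id << '\t'
        << record.pos << '\t'
        << static_cast<unsigned>(record.mapq) << '\t' // print as a number, not a character
        << record.cigar << '\t'
        << '*' << '\t'                                // RNEXT: not available
        << 0 << '\t'                                  // PNEXT: not available
        << 0 << '\t'                                  // TLEN: not available
        << record.seq << '\t'
        << record.qual;
    return out.str();
}
```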

Solution

void map_reads(std::filesystem::path const & query_path,
               std::filesystem::path const & index_path,
               std::filesystem::path const & sam_path,
               reference_storage_t & storage,
               uint8_t const errors)
{
    // we need the alphabet and text layout before loading
    seqan3::bi_fm_index<seqan3::dna5, seqan3::text_layout::collection> index;
    {
        std::ifstream is{index_path, std::ios::binary};
        cereal::BinaryInputArchive iarchive{is};
        iarchive(index);
    }
 
    seqan3::sequence_file_input query_file_in{query_path};
 
    seqan3::sam_file_output sam_out{sam_path, seqan3::fields<seqan3::field::seq,
                                                             seqan3::field::id,
                                                             seqan3::field::ref_id,
                                                             seqan3::field::ref_offset,
                                                             seqan3::field::alignment,
                                                             seqan3::field::qual,
                                                             seqan3::field::mapq>{}};
 
    seqan3::configuration const search_config = seqan3::search_cfg::max_error_total{
                                                    seqan3::search_cfg::error_count{errors}} |
                                                seqan3::search_cfg::hit_all_best{};
 
    seqan3::configuration const align_config = seqan3::align_cfg::method_global{
                                                   seqan3::align_cfg::free_end_gaps_sequence1_leading{true},
                                                   seqan3::align_cfg::free_end_gaps_sequence2_leading{false},
                                                   seqan3::align_cfg::free_end_gaps_sequence1_trailing{true},
                                                   seqan3::align_cfg::free_end_gaps_sequence2_trailing{false}} |
                                               seqan3::align_cfg::edit_scheme |
                                               seqan3::align_cfg::output_alignment{} |
                                               seqan3::align_cfg::output_begin_position{} |
                                               seqan3::align_cfg::output_score{};
 
    for (auto && record : query_file_in)
    {
        auto & query = record.sequence();
        for (auto && result : search(query, index, search_config))
        {
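            // Cut out a reference window that starts one position before the hit (if possible)
            // and is one character longer than the query, leaving room for a gap in the alignment.
            size_t start = result.reference_begin_position() ? result.reference_begin_position() - 1 : 0;
            std::span text_view{std::data(storage.seqs[result.reference_id()]) + start, query.size() + 1};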
 
            for (auto && alignment : seqan3::align_pairwise(std::tie(text_view, query), align_config))
            {
                auto aligned_seq = alignment.alignment();
                size_t ref_offset = alignment.sequence1_begin_position() + 2 + start;
                size_t map_qual = 60u + alignment.score();
 
                sam_out.emplace_back(query,
                                     record.id(),
                                     storage.ids[result.reference_id()],
                                     ref_offset,
                                     aligned_seq,
                                     record.base_qualities(),
                                     map_qual);
            }
        }
    }
}
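
The solution builds text_view from start and a fixed length of query.size() + 1. For hits close to the end of a reference sequence, such a window may extend past the end of the sequence. A defensive variant of the window computation could look like the following sketch; window_sketch and clamped_window are hypothetical names, and the clamping is an addition that the solution above does not perform.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical result type: where the reference window starts and how long it is.
struct window_sketch
{
    std::size_t start;
    std::size_t length;
};

// Computes the window around a hit, clipped to the reference length.
window_sketch clamped_window(std::size_t const hit_begin,
                             std::size_t const query_length,
                             std::size_t const reference_length)
{
    // Same rule as in the solution: start one position before the hit if possible.
    std::size_t const start = hit_begin ? hit_begin - 1 : 0;
    // Number of reference characters that are actually available from `start`.
    std::size_t const available = reference_length > start ? reference_length - start : 0;
    // The solution uses query_length + 1 unconditionally; the std::min is the added clamp.
    std::size_t const length = std::min(query_length + 1, available);
    return {start, length};
}
```

The returned values could then be passed to the std::span constructor in place of start and query.size() + 1.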

Here is the complete program:

Hint

#include <filesystem>
#include <fstream>
#include <iostream>
#include <string>
#include <tuple>
#include <vector>
 
#include <cereal/archives/binary.hpp>
 
#include <seqan3/alignment/configuration/all.hpp>
#include <seqan3/alignment/pairwise/align_pairwise.hpp>
#include <seqan3/argument_parser/all.hpp>
#include <seqan3/io/sam_file/output.hpp>
#include <seqan3/io/sequence_file/input.hpp>
#include <seqan3/search/all.hpp>
#include <seqan3/search/fm_index/bi_fm_index.hpp>
#include <seqan3/std/span>
 
struct reference_storage_t
{
    std::vector<std::string> ids;
    std::vector<std::vector<seqan3::dna5>> seqs;
};
 
void read_reference(std::filesystem::path const & reference_path,
                    reference_storage_t & storage)
{
    seqan3::sequence_file_input reference_in{reference_path};
    for (auto & [seq, id, qual] : reference_in)
    {
        storage.ids.push_back(std::move(id));
        storage.seqs.push_back(std::move(seq));
    }
}
 
void map_reads(std::filesystem::path const & query_path,
               std::filesystem::path const & index_path,
               std::filesystem::path const & sam_path,
               reference_storage_t & storage,
               uint8_t const errors)
{
    // we need the alphabet and text layout before loading
    seqan3::bi_fm_index<seqan3::dna5, seqan3::text_layout::collection> index;
    {
        std::ifstream is{index_path, std::ios::binary};
        cereal::BinaryInputArchive iarchive{is};
        iarchive(index);
    }
 
    seqan3::sequence_file_input query_file_in{query_path};
 
    seqan3::sam_file_output sam_out{sam_path, seqan3::fields<seqan3::field::seq,
                                                             seqan3::field::id,
                                                             seqan3::field::ref_id,
                                                             seqan3::field::ref_offset,
                                                             seqan3::field::alignment,
                                                             seqan3::field::qual,
                                                             seqan3::field::mapq>{}};
 
    seqan3::configuration const search_config = seqan3::search_cfg::max_error_total{
                                                    seqan3::search_cfg::error_count{errors}} |
                                                seqan3::search_cfg::hit_all_best{};
 
    seqan3::configuration const align_config = seqan3::align_cfg::method_global{
                                                   seqan3::align_cfg::free_end_gaps_sequence1_leading{true},
                                                   seqan3::align_cfg::free_end_gaps_sequence2_leading{false},
                                                   seqan3::align_cfg::free_end_gaps_sequence1_trailing{true},
                                                   seqan3::align_cfg::free_end_gaps_sequence2_trailing{false}} |
                                               seqan3::align_cfg::edit_scheme |
                                               seqan3::align_cfg::output_alignment{} |
                                               seqan3::align_cfg::output_begin_position{} |
                                               seqan3::align_cfg::output_score{};
 
    for (auto && record : query_file_in)
    {
        auto & query = record.sequence();
        for (auto && result : search(query, index, search_config))
        {
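            // Cut out a reference window that starts one position before the hit (if possible)
            // and is one character longer than the query, leaving room for a gap in the alignment.
            size_t start = result.reference_begin_position() ? result.reference_begin_position() - 1 : 0;
            std::span text_view{std::data(storage.seqs[result.reference_id()]) + start, query.size() + 1};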
 
            for (auto && alignment : seqan3::align_pairwise(std::tie(text_view, query), align_config))
            {
                auto aligned_seq = alignment.alignment();
                size_t ref_offset = alignment.sequence1_begin_position() + 2 + start;
                size_t map_qual = 60u + alignment.score();
 
                sam_out.emplace_back(query,
                                     record.id(),
                                     storage.ids[result.reference_id()],
                                     ref_offset,
                                     aligned_seq,
                                     record.base_qualities(),
                                     map_qual);
            }
        }
    }
}
 
void run_program(std::filesystem::path const & reference_path,
                 std::filesystem::path const & query_path,
                 std::filesystem::path const & index_path,
                 std::filesystem::path const & sam_path,
                 uint8_t const errors)
{
    reference_storage_t storage{};
    read_reference(reference_path, storage);
    map_reads(query_path, index_path, sam_path, storage, errors);
}
 
struct cmd_arguments
{
    std::filesystem::path reference_path{};
    std::filesystem::path query_path{};
    std::filesystem::path index_path{};
    std::filesystem::path sam_path{"out.sam"};
    uint8_t errors{0};
};
 
void initialise_argument_parser(seqan3::argument_parser & parser, cmd_arguments & args)
{
    parser.info.author = "E. coli";
    parser.info.short_description = "Map reads against a reference.";
    parser.info.version = "1.0.0";
    parser.add_option(args.reference_path, 'r', "reference", "The path to the reference.",
                      seqan3::option_spec::required,
                      seqan3::input_file_validator{{"fa","fasta"}});
    parser.add_option(args.query_path, 'q', "query", "The path to the query.",
                      seqan3::option_spec::required,
                      seqan3::input_file_validator{{"fq","fastq"}});
    parser.add_option(args.index_path, 'i', "index", "The path to the index.",
                      seqan3::option_spec::required,
                      seqan3::input_file_validator{{"index"}});
    parser.add_option(args.sam_path, 'o', "output", "The output SAM file path.",
                      seqan3::option_spec::standard,
                      seqan3::output_file_validator{seqan3::output_file_open_options::create_new, {"sam"}});
    parser.add_option(args.errors, 'e', "error", "Maximum allowed errors.",
                      seqan3::option_spec::standard,
                      seqan3::arithmetic_range_validator{0, 4});
}
 
int main(int argc, char const ** argv)
{
    seqan3::argument_parser parser("Mapper", argc, argv);
    cmd_arguments args{};
 
    initialise_argument_parser(parser, args);
 
    try
    {
        parser.parse();
    }
    catch (seqan3::argument_parser_error const & ext)
    {
        std::cerr << "[PARSER ERROR] " << ext.what() << '\n';
        return -1;
    }
 
    run_program(args.reference_path, args.query_path, args.index_path, args.sam_path, args.errors);
 
    return 0;
}
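
In a SAM file, an alignment is stored as a CIGAR string. To illustrate what that string encodes, the following stand-alone sketch derives it from two gapped sequences given as plain strings with '-' as the gap character. The function cigar_from_gapped is hypothetical and independent of SeqAn3; it uses M for aligned columns (match or mismatch), I for query characters opposite a reference gap, and D for reference characters opposite a query gap.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <string_view>

// Hypothetical helper: builds a CIGAR string from two gapped sequences of equal length.
std::string cigar_from_gapped(std::string_view reference, std::string_view query)
{
    if (reference.size() != query.size())
        throw std::invalid_argument{"aligned sequences must have equal length"};

    std::string cigar;
    char current_op = '\0';
    std::size_t run_length = 0;

    // Appends the current run, e.g. "4M", to the result.
    auto flush = [&]()
    {
        if (run_length > 0)
            cigar += std::to_string(run_length) + current_op;
    };

    for (std::size_t i = 0; i < reference.size(); ++i)
    {
        bool const reference_gap = reference[i] == '-';
        bool const query_gap = query[i] == '-';
        if (reference_gap && query_gap)
            continue; // a column consisting of two gaps carries no information

        char const op = reference_gap ? 'I' : (query_gap ? 'D' : 'M');
        if (op != current_op)
        {
            flush(); // the operation changed, so the previous run is complete
            current_op = op;
            run_length = 0;
        }
        ++run_length;
    }
    flush();
    return cigar;
}
```

For the gapped pair "ACG-T" (reference) and "AC-TT" (query), the sketch produces 2M1D1I1M.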

Difficulty	High
Duration	90 Minutes
Prerequisite tutorials	All
Recommended reading
