SeqAn3  3.0.3
The Modern C++ library for sequence analysis.
SAM File

Provides files and formats for handling alignment data. More...

Classes

class  seqan3::format_bam
 The BAM format. More...
 
class  seqan3::format_sam
 The SAM format (tag). More...
 
class  seqan3::detail::format_sam_base
 The alignment base format. More...
 
class  seqan3::sam_file_header< ref_ids_type >
 Stores the header information of alignment files. More...
 
class  seqan3::sam_file_input< traits_type_, selected_field_ids_, valid_formats_ >
 A class for reading alignment files, e.g. SAM, BAM, BLAST ... More...
 
struct  seqan3::sam_file_input_default_traits< ref_sequences_t, ref_ids_t >
 The default traits for seqan3::sam_file_input. More...
 
interface  sam_file_input_format
 The generic concept for alignment file input formats. More...
 
struct  seqan3::detail::sam_file_input_format_exposer< format_type >
 Internal class used to expose the actual format interface to read alignment records from the file. More...
 
struct  seqan3::sam_file_input_options< sequence_legal_alphabet >
 The options type defines various option members that influence the behaviour of all or some formats. More...
 
interface  sam_file_input_traits
 The requirements a traits_type for seqan3::sam_file_input must meet. More...
 
class  seqan3::sam_file_output< selected_field_ids_, valid_formats_, ref_ids_type >
 A class for writing alignment files, e.g. SAM, BAM, BLAST, ... More...
 
interface  sam_file_output_format
 The generic concept for alignment file output formats. More...
 
struct  seqan3::detail::sam_file_output_format_exposer< format_type >
 Internal class used to expose the actual format interface to write alignment records into the file. More...
 
struct  seqan3::sam_file_output_options
 The options type defines various option members that influence the behavior of all or some formats. More...
 
class  seqan3::sam_tag_dictionary
 The SAM tag dictionary class that stores all optional SAM fields. More...
 
struct  seqan3::sam_tag_type< tag_value >
 The generic base class. More...
 

Typedefs

template<uint16_t tag_value>
using seqan3::sam_tag_type_t = typename sam_tag_type< tag_value >::type
 Shortcut helper for seqan3::sam_tag_type::type.
 
using seqan3::detail::sam_tag_variant = std::variant< char, int32_t, float, std::string, std::vector< std::byte >, std::vector< int8_t >, std::vector< uint8_t >, std::vector< int16_t >, std::vector< uint16_t >, std::vector< int32_t >, std::vector< uint32_t >, std::vector< float > >
 std::variant of allowed types for optional tag fields of the SAM format.
 

Enumerations

enum class  seqan3::sam_flag : uint16_t {
  seqan3::none = 0 , seqan3::paired = 0x1 , seqan3::proper_pair = 0x2 , seqan3::unmapped = 0x4 ,
  seqan3::mate_unmapped = 0x8 , seqan3::on_reverse_strand = 0x10 , seqan3::mate_on_reverse_strand = 0x20 , seqan3::first_in_pair = 0x40 ,
  seqan3::second_in_pair = 0x80 , seqan3::secondary_alignment = 0x100 , seqan3::failed_filter = 0x200 , seqan3::duplicate = 0x400 ,
  seqan3::supplementary_alignment = 0x800
}
 An enum flag that describes the properties of an aligned read (given as a SAM record). More...
 

Functions

template<seqan3::detail::writable_pairwise_alignment alignment_type>
void seqan3::detail::alignment_from_cigar (alignment_type &alignment, std::vector< cigar > const &cigar_vector)
 Transforms a std::vector of operation-count pairs (representing the cigar string) into an alignment. More...
 
template<seqan3::detail::pairwise_alignment alignment_type>
std::string seqan3::detail::get_cigar_string (alignment_type &&alignment, uint32_t const query_start_pos=0, uint32_t const query_end_pos=0, bool const extended_cigar=false)
 Creates a cigar string (SAM format) given a seqan3::detail::pairwise_alignment. More...
 
template<seqan3::aligned_sequence ref_seq_type, seqan3::aligned_sequence query_seq_type>
std::string seqan3::detail::get_cigar_string (ref_seq_type &&ref_seq, query_seq_type &&query_seq, uint32_t const query_start_pos=0, uint32_t const query_end_pos=0, bool const extended_cigar=false)
 Transforms an alignment represented by two seqan3::aligned_sequence's into the corresponding cigar string. More...
 
std::string seqan3::detail::get_cigar_string (std::vector< cigar > const &cigar_vector)
 Transforms a vector of cigar elements into a string representation. More...
 
template<seqan3::detail::pairwise_alignment alignment_type>
std::vector< cigar > seqan3::detail::get_cigar_vector (alignment_type &&alignment, uint32_t const query_start_pos=0, uint32_t const query_end_pos=0, bool const extended_cigar=false)
 Creates a cigar vector (SAM format) given a seqan3::detail::pairwise_alignment represented by two seqan3::aligned_sequence's. More...
 
template<typename reference_char_type , typename query_char_type >
constexpr cigar::operation seqan3::detail::map_aligned_values_to_cigar_op (reference_char_type const reference_char, query_char_type const query_char, bool const extended_cigar)
 Compares two seqan3::aligned_sequence values and returns their cigar operation. More...
 
template<typename char_t , char_t ... s>
constexpr uint16_t seqan3::operator""_tag ()
 The SAM tag literal, such that tags can be used in constant expressions. More...
 

Detailed Description

Provides files and formats for handling alignment data.

Introduction

Alignment files are primarily used to store pairwise alignments of two biological sequences and often come with a lot of additional information. Well-known formats include the SAM/BAM format, used to store read mapping data, and the BLAST format, which stores the results of a query search against a database.

Note
For a step-by-step guide take a look at our tutorial: SAM Input and Output in SeqAn.

The Alignment file abstraction supports reading 15 different fields:

  1. seqan3::field::seq
  2. seqan3::field::id
  3. seqan3::field::offset
  4. seqan3::field::ref_seq
  5. seqan3::field::ref_id
  6. seqan3::field::ref_offset
  7. seqan3::field::alignment
  8. seqan3::field::cigar
  9. seqan3::field::mapq
  10. seqan3::field::qual
  11. seqan3::field::flag
  12. seqan3::field::mate
  13. seqan3::field::tags
  14. seqan3::field::evalue
  15. seqan3::field::bit_score

There exists one more field for SAM files, seqan3::field::header_ptr, but this field is mostly used internally. Please see the seqan3::sam_file_output::header member function for details on how to access the seqan3::sam_file_header of the file.

All of these fields are retrieved by default (and in that order). Note that some of the fields are specific to the SAM format (e.g. seqan3::field::flag) while others are specific to BLAST format (e.g. seqan3::field::bit_score). Please see the corresponding formats for more details.

Enumeration Type Documentation

◆ sam_flag

enum seqan3::sam_flag : uint16_t
strong

An enum flag that describes the properties of an aligned read (given as a SAM record).

See also
seqan3::enum_bitwise_operators enables combining enum values.

SAM flags are bitwise flags: each value corresponds to a specific bit, so values can be combined and tested using bitwise operations. See this tutorial for an introduction on bitwise operations on enum flags.

Example:

#include <iostream>
#include <sstream>
#include <seqan3/io/sam_file/all.hpp>
// Note: in a real SAM file the columns are separated by tabs.
auto sam_file_raw = R"(@HD VN:1.6 SO:coordinate GO:none
@SQ SN:ref LN:45
r001 99 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG !!!!!!!!!!!!!!!!!
r003 0 ref 29 30 5S6M * 0 0 GCCTAAGCTAA !!!!!!!!!!! SA:Z:ref,29,-,6H5M,17,0;
r003 4 * 29 17 * * 0 0 TAGGC @@@@@ SA:Z:ref,9,+,5S6M,30,1;
r001 147 ref 237 30 9M = 7 -39 CAGCGGCAT !!!!!!!!! NM:i:1
)";
int main()
{
    // Read the records from the in-memory SAM data above:
    seqan3::sam_file_input fin{std::istringstream{sam_file_raw}, seqan3::format_sam{}};
    for (auto & rec : fin)
    {
        // Check if a certain flag value (bit) is set:
        if (static_cast<bool>(rec.flag() & seqan3::sam_flag::unmapped))
            std::cout << "Read " << rec.id() << " is unmapped\n";
        if (rec.base_qualities()[0] < seqan3::assign_char_to('@', seqan3::phred42{})) // low quality
        {
            // Set a flag value (bit):
            rec.flag() |= seqan3::sam_flag::failed_filter;
            // Note that this does not affect other flag values (bits),
            // e.g. `rec.flag() & seqan3::sam_flag::unmapped` may still be true
        }
        // Unset a flag value (bit):
        rec.flag() &= ~seqan3::sam_flag::duplicate; // not marked as a duplicate anymore
    }
}
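
The bit arithmetic behind these flags can also be tried without SeqAn3. The following is a minimal sketch under stated assumptions: the enum `flag` and the helper `is_set` are hypothetical stand-ins written for this illustration (SeqAn3 itself enables the operators through seqan3::enum_bitwise_operators), and only a subset of the values listed on this page is copied. It decodes the FLAG values 99 and 147 of the two r001 records above.

```cpp
#include <cstdint>

// Stand-in for seqan3::sam_flag so that the sketch compiles without SeqAn3;
// names and values are copied from the enumeration on this page (subset only).
enum class flag : std::uint16_t
{
    none = 0,
    paired = 0x1,
    proper_pair = 0x2,
    unmapped = 0x4,
    on_reverse_strand = 0x10,
    mate_on_reverse_strand = 0x20,
    first_in_pair = 0x40,
    second_in_pair = 0x80
};

// Hypothetical operators for the stand-in enum.
constexpr flag operator|(flag lhs, flag rhs)
{
    return static_cast<flag>(static_cast<std::uint16_t>(lhs) | static_cast<std::uint16_t>(rhs));
}

constexpr flag operator&(flag lhs, flag rhs)
{
    return static_cast<flag>(static_cast<std::uint16_t>(lhs) & static_cast<std::uint16_t>(rhs));
}

constexpr flag operator~(flag value)
{
    return static_cast<flag>(static_cast<std::uint16_t>(~static_cast<std::uint16_t>(value)));
}

// True if the given bit is set in value.
constexpr bool is_set(flag value, flag bit)
{
    return (value & bit) != flag::none;
}

// FLAG 99 = 0x1 + 0x2 + 0x20 + 0x40
static_assert(static_cast<flag>(99)
              == (flag::paired | flag::proper_pair | flag::mate_on_reverse_strand | flag::first_in_pair));
// FLAG 147 = 0x1 + 0x2 + 0x10 + 0x80
static_assert(static_cast<flag>(147)
              == (flag::paired | flag::proper_pair | flag::on_reverse_strand | flag::second_in_pair));
// Setting and then unsetting one bit leaves all other bits untouched.
static_assert(((static_cast<flag>(99) | flag::unmapped) & ~flag::unmapped) == static_cast<flag>(99));
```

Because every enumerator occupies its own bit, setting or clearing one property never changes another.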

The following additional information on some flag values is adapted from the SAM specification:

See also
https://broadinstitute.github.io/picard/explain-flags.html
Enumerator
none 

None of the flags below are set.

paired 

The aligned read is paired (paired-end sequencing).

proper_pair 

The two aligned reads in a pair have a proper distance between each other.

unmapped 

The read is not mapped to a reference (unaligned).

mate_unmapped 

The mate of this read is not mapped to a reference (unaligned).

on_reverse_strand 

The read sequence has been reverse complemented before being mapped (aligned).

mate_on_reverse_strand 

The mate sequence has been reverse complemented before being mapped (aligned).

first_in_pair 

Indicates the ordering (see details in the seqan3::sam_flag description).

second_in_pair 

Indicates the ordering (see details in the seqan3::sam_flag description).

secondary_alignment 

This read alignment is an alternative (possibly suboptimal) to the primary.

failed_filter 

The read alignment failed a filter, e.g. quality controls.

duplicate 

The read is marked as a PCR duplicate or optical duplicate.

supplementary_alignment 

This sequence is part of a split alignment and is not the primary alignment.

Function Documentation

◆ alignment_from_cigar()

template<seqan3::detail::writable_pairwise_alignment alignment_type>
void seqan3::detail::alignment_from_cigar ( alignment_type &  alignment,
std::vector< cigar > const &  cigar_vector 
)
inline

Transforms a std::vector of operation-count pairs (representing the cigar string) into an alignment.

Template Parameters
alignment_type  The type of alignment; must model seqan3::detail::writable_pairwise_alignment.
Parameters
[in,out]  alignment  The alignment to fill with gaps according to the cigar information.
[in]  cigar_vector  The cigar information given as a std::vector over seqan3::cigar.

Example:

Given the cigar string "4M2I5M2D1M", the cigar information extracted by seqan3::detail::parse_cigar would be "[(M,4), (I,2), (M,5), (D,2), (M,1)]". Given this cigar information and an alignment variable containing the two unaligned sequences "(ATGGCGTAGAGC, ATGCCCCGTTGC)", the alignment will be filled with the following gaps:

ATGG--CGTAGAGC
|||   ||| |  |
ATGCCCCGTTG--C
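
How the gaps follow from the cigar information can be sketched with plain strings. This is only an illustration under stated assumptions, not the SeqAn3 implementation: `cigar_ops` and `apply_cigar` are hypothetical names, the sequences are std::string instead of aligned sequences, and only the operations M, =, X, I and D are handled.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// (operation, count) pairs, e.g. {'M', 4}; a plain stand-in for std::vector<seqan3::cigar>.
using cigar_ops = std::vector<std::pair<char, std::size_t>>;

// Returns gapped copies of the two unaligned sequences according to the cigar operations.
// Assumes a cigar that consumes both sequences completely; other operations are skipped.
inline std::pair<std::string, std::string>
apply_cigar(std::string const & ref, std::string const & query, cigar_ops const & ops)
{
    std::string aligned_ref;
    std::string aligned_query;
    std::size_t ref_pos = 0;
    std::size_t query_pos = 0;

    for (auto const & [op, count] : ops)
    {
        switch (op)
        {
            case 'M': case '=': case 'X': // consumes reference and query
                aligned_ref.append(ref, ref_pos, count);
                aligned_query.append(query, query_pos, count);
                ref_pos += count;
                query_pos += count;
                break;
            case 'I': // insertion: gaps in the reference
                aligned_ref.append(count, '-');
                aligned_query.append(query, query_pos, count);
                query_pos += count;
                break;
            case 'D': // deletion: gaps in the query
                aligned_ref.append(ref, ref_pos, count);
                aligned_query.append(count, '-');
                ref_pos += count;
                break;
            default: // not handled in this sketch
                break;
        }
    }

    assert(ref_pos == ref.size() && query_pos == query.size());
    return {aligned_ref, aligned_query};
}
```

Calling `apply_cigar("ATGGCGTAGAGC", "ATGCCCCGTTGC", ...)` with the five operations from the example yields the two gapped rows shown above.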

◆ get_cigar_string() [1/3]

template<seqan3::detail::pairwise_alignment alignment_type>
std::string seqan3::detail::get_cigar_string ( alignment_type &&  alignment,
uint32_t const  query_start_pos = 0,
uint32_t const  query_end_pos = 0,
bool const  extended_cigar = false 
)
inline

Creates a cigar string (SAM format) given a seqan3::detail::pairwise_alignment.

Template Parameters
alignment_type  Must model seqan3::detail::pairwise_alignment.
Parameters
alignment  The alignment, represented by a seqan3::pair_like of seqan3::aligned_sequence's, to be transformed into a cigar string based on the second (query) sequence.
query_start_pos  The start position of the alignment in the query sequence indicating soft-clipping.
query_end_pos  The end position of the alignment in the query sequence indicating soft-clipping.
extended_cigar  Whether to print the extended cigar alphabet or not. See cigar operation.
Returns
A std::string representing the alignment as a cigar string.
Note
The resulting cigar string is based on the query sequence, which is the second sequence in the alignment pair.

Example:

Consider the following alignment, with the reference sequence on top and the query sequence at the bottom:

ATGG--CGTAGAGC
|||X  |||X|  |
ATGCCCCGTTG--C

In this case, the function seqan3::detail::get_cigar_string will return the following cigar string when printed: "4M2I5M2D1M". The extended cigar string would look like this: "3=1X2I3=1X1=2D1=".
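
The step from the alignment columns to the two strings quoted above is a run-length encoding of per-column operations. The following is a minimal sketch, assuming plain std::string rows with '-' as the gap character; `to_cigar_vector` and `to_cigar_string` are hypothetical names for this illustration, and soft-clipping (query_start_pos, query_end_pos) is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// (operation, count) pairs; a plain stand-in for std::vector<seqan3::cigar>.
using cigar_vec = std::vector<std::pair<char, std::size_t>>;

// Run-length encodes the per-column operations of two equally long gapped rows.
inline cigar_vec to_cigar_vector(std::string const & ref, std::string const & query, bool extended = false)
{
    assert(ref.size() == query.size());

    // Columns in which both rows hold a gap are not expected here and would count as 'I'.
    auto column_op = [extended](char r, char q) -> char
    {
        if (r == '-')
            return 'I'; // gap in the reference: insertion
        if (q == '-')
            return 'D'; // gap in the query: deletion
        if (!extended)
            return 'M'; // aligned, match or mismatch
        return r == q ? '=' : 'X';
    };

    cigar_vec result;
    for (std::size_t i = 0; i < ref.size(); ++i)
    {
        char const op = column_op(ref[i], query[i]);
        if (!result.empty() && result.back().first == op)
            ++result.back().second;     // extend the current run
        else
            result.emplace_back(op, 1); // start a new run
    }
    return result;
}

// Concatenates the pairs into the textual form used in the SAM CIGAR column.
inline std::string to_cigar_string(cigar_vec const & cigar)
{
    std::string result;
    for (auto const & [op, count] : cigar)
    {
        result += std::to_string(count);
        result += op;
    }
    return result;
}
```

Splitting the work into a vector step and a string step mirrors the separate get_cigar_vector and get_cigar_string functions documented on this page.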

See also
seqan3::aligned_sequence

◆ get_cigar_string() [2/3]

template<seqan3::aligned_sequence ref_seq_type, seqan3::aligned_sequence query_seq_type>
std::string seqan3::detail::get_cigar_string ( ref_seq_type &&  ref_seq,
query_seq_type &&  query_seq,
uint32_t const  query_start_pos = 0,
uint32_t const  query_end_pos = 0,
bool const  extended_cigar = false 
)
inline

Transforms an alignment represented by two seqan3::aligned_sequence's into the corresponding cigar string.

Template Parameters
ref_seq_type  Must model seqan3::aligned_sequence.
query_seq_type  Must model seqan3::aligned_sequence.
Parameters
ref_seq  The reference sequence to compare against the query sequence.
query_seq  The query sequence to build the cigar string for.
query_start_pos  The start position of the alignment in the query sequence indicating soft-clipping.
query_end_pos  The end position of the alignment in the query sequence indicating soft-clipping.
extended_cigar  Whether to print the extended cigar alphabet or not. See cigar operation.
Returns
A std::string representing the alignment as a cigar string.
Note
The resulting cigar string is based on the query sequence (query_seq).

Example:

Consider the following alignment, with the reference sequence on top and the query sequence at the bottom:

ATGG--CGTAGAGC
|||X  |||X|  |
ATGCCCCGTTG--C

In this case, the function seqan3::detail::get_cigar_string will return the following cigar string when printed: "4M2I5M2D1M". The extended cigar string would look like this: "3=1X2I3=1X1=2D1=".

See also
seqan3::aligned_sequence

◆ get_cigar_string() [3/3]

std::string seqan3::detail::get_cigar_string ( std::vector< cigar > const &  cigar_vector)
inline

Transforms a vector of cigar elements into a string representation.

Parameters
cigar_vector  The std::vector of seqan3::cigar elements to be transformed into a std::string.
Returns
The cigar string (std::string).

◆ get_cigar_vector()

template<seqan3::detail::pairwise_alignment alignment_type>
std::vector<cigar> seqan3::detail::get_cigar_vector ( alignment_type &&  alignment,
uint32_t const  query_start_pos = 0,
uint32_t const  query_end_pos = 0,
bool const  extended_cigar = false 
)
inline

Creates a cigar vector (SAM format) given a seqan3::detail::pairwise_alignment represented by two seqan3::aligned_sequence's.

Template Parameters
alignment_type  Must model seqan3::detail::pairwise_alignment.
Parameters
alignment  The alignment, represented by a pair of aligned sequences, to be transformed into a cigar vector based on the second (query) sequence.
query_start_pos  The start position of the alignment in the query sequence indicating soft-clipping.
query_end_pos  The end position of the alignment in the query sequence indicating soft-clipping.
extended_cigar  Whether to print the extended cigar alphabet or not. See cigar operation.
Returns
A std::vector<seqan3::cigar> representing the alignment.
Note
The resulting cigar_vector is based on the query sequence, which is the second sequence in the alignment pair.

Example:

Consider the following alignment, with the reference sequence on top and the query sequence at the bottom:

ATGG--CGTAGAGC
|||X  |||X|  |
ATGCCCCGTTG--C

In this case, the function seqan3::detail::get_cigar_vector will return the following cigar vector: "[('M',4),('I',2),('M',5),('D',2),('M',1)]". The extended cigar vector would look like this: "[('=',3),('X',1),('I',2),('=',3),('X',1),('=',1),('D',2),('=',1)]".

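#include <tuple>
#include <vector>
#include <seqan3/alphabet/gap/gapped.hpp>
#include <seqan3/alphabet/nucleotide/dna4.hpp>
#include <seqan3/core/debug_stream.hpp>
#include <seqan3/io/sam_file/detail/cigar.hpp> // assumed path of cigar.hpp in SeqAn 3.0.3
int main()
{
    using aligned_t = std::vector<seqan3::gapped<seqan3::dna4>>;
    using seqan3::operator""_dna4;
    aligned_t ref{'A'_dna4, 'T'_dna4, 'G'_dna4, 'G'_dna4, seqan3::gap{}, seqan3::gap{},
                  'C'_dna4, 'G'_dna4, 'T'_dna4, 'A'_dna4, 'G'_dna4, 'A'_dna4, 'G'_dna4, 'C'_dna4};
    aligned_t query{'A'_dna4, 'T'_dna4, 'G'_dna4, 'C'_dna4, 'C'_dna4, 'C'_dna4, 'C'_dna4,
                    'G'_dna4, 'T'_dna4, 'T'_dna4, 'G'_dna4, seqan3::gap{}, seqan3::gap{}, 'C'_dna4};
    seqan3::debug_stream << seqan3::detail::get_cigar_vector(std::tie(ref, query)) << '\n'; // print the cigar vector
}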
See also
seqan3::aligned_sequence

◆ map_aligned_values_to_cigar_op()

template<typename reference_char_type , typename query_char_type >
constexpr cigar::operation seqan3::detail::map_aligned_values_to_cigar_op ( reference_char_type const  reference_char,
query_char_type const  query_char,
bool const  extended_cigar 
)
constexpr

Compares two seqan3::aligned_sequence values and returns their cigar operation.

Template Parameters
reference_char_type  Must be equality comparable to seqan3::gap.
query_char_type  Must be equality comparable to seqan3::gap.
Parameters
reference_char  The aligned character of the reference to compare.
query_char  The aligned character of the query to compare.
extended_cigar  Whether to print the extended cigar alphabet or not. See cigar operation.
Returns
A seqan3::cigar::operation representing the alignment operation between the two values.
Note
The resulting cigar operation is based on the query character (query_char).

Example:

The following alignment column shows the reference char ('C') on top and a gap for the query char at the bottom.

... C ...
    |
... - ...

In this case, the function seqan3::detail::map_aligned_values_to_cigar_op will return 'D' since the query char is "deleted".

The next alignment column shows the reference char ('C') on top and a query char ('G') at the bottom.

... C ...
    |
... G ...

In this case, the function seqan3::detail::map_aligned_values_to_cigar_op will return 'M' for the basic cigar alphabet, since the two bases are aligned. With the extended cigar alphabet (extended_cigar = true), the function will return 'X', since the bases are aligned but not equal.
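
The two cases above, together with the remaining column types, can be written down as a small decision function. This is a sketch under stated assumptions rather than the library code: `cigar_op` is a hypothetical name, plain chars with '-' as the gap replace the alphabet types, and returning 'P' for a column with two gaps follows the padding operation of the SAM format.

```cpp
// Returns the cigar operation character for one alignment column; '-' stands for a gap.
constexpr char cigar_op(char reference_char, char query_char, bool extended_cigar)
{
    bool const reference_is_gap = reference_char == '-';
    bool const query_is_gap = query_char == '-';

    if (reference_is_gap && query_is_gap)
        return 'P'; // assumption: padding, as defined by the SAM format
    if (reference_is_gap)
        return 'I'; // the query has a character the reference lacks
    if (query_is_gap)
        return 'D'; // the reference character is "deleted" in the query
    if (!extended_cigar)
        return 'M'; // aligned, no distinction between match and mismatch
    return reference_char == query_char ? '=' : 'X';
}

// The two columns from the examples above:
static_assert(cigar_op('C', '-', false) == 'D');
static_assert(cigar_op('C', 'G', false) == 'M');
static_assert(cigar_op('C', 'G', true) == 'X');
```

Since the function is constexpr, the expected results can be checked at compile time with static_assert.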

See also
seqan3::aligned_sequence

◆ operator""_tag()

template<typename char_t , char_t ... s>
constexpr uint16_t seqan3::operator""_tag ( )
constexpr

The SAM tag literal, such that tags can be used in constant expressions.

Template Parameters
char_t  The char type, usually char. The parameter pack ...s must be of length 2, since SAM tags consist of two letters (char0 and char1).
Returns
The unique identifier of the SAM tag computed by char0 * 128 + char1.

A SAM tag consists of two letters and is initialized via the string literal ""_tag, which maps the tag to its unique id, e.g.:

using seqan3::operator""_tag;
// ...
uint16_t tag_id = "NM"_tag; // tag_id = 10061

The purpose of those tags is to fill or query the seqan3::sam_tag_dictionary for a specific key (tag_id) and retrieve the corresponding value.
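
The id arithmetic stated under "Returns" can be checked by hand. The sketch below only mirrors the documented formula char0 * 128 + char1; `tag_id`, `first_letter` and `second_letter` are hypothetical helpers for this illustration and not part of SeqAn3.

```cpp
#include <cstdint>

// Computes the id of a two-letter tag with the formula documented above.
constexpr std::uint16_t tag_id(char char0, char char1)
{
    return static_cast<std::uint16_t>(char0 * 128 + char1);
}

// Recover the two letters from an id (valid for 7-bit ASCII letters).
constexpr char first_letter(std::uint16_t id)
{
    return static_cast<char>(id / 128);
}

constexpr char second_letter(std::uint16_t id)
{
    return static_cast<char>(id % 128);
}

// 'N' is 78 and 'M' is 77 in ASCII: 78 * 128 + 77 = 10061
static_assert(tag_id('N', 'M') == 10061);
static_assert(first_letter(10061) == 'N');
static_assert(second_letter(10061) == 'M');
```

Because both letters are below 128, the mapping from tag to id is unique and can be inverted.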

See also
seqan3::sam_tag_dictionary