Hi! How are you? Well, I hope. I have decided to delight you with a new article for the column "Files in bioinformatics", in which I talk about the most common and important file types used by bioinformaticians. Today we talk about the SAM file, or Sequence Alignment Map file, and its cousin the BAM file, or Binary Alignment Map file. These files are extremely useful because they are produced by aligning (or mapping) reads to a reference genome. This process is also called re-sequencing, although in my opinion the term is a bit misleading. In any case, SAM and BAM files are equally useful for identifying the polymorphisms between a sequenced genome and a reference genome, and also between the reference and a third genome aligned to it, or even a fourth, a fifth and so on. In short, these files are essential for the so-called "calling" of the variants that exist between the genomes being compared.
But let's go in order. The SAM file is the first product of the alignment process, while the BAM file is derived from it: it contains the same alignment information as the SAM file, but in a compressed binary form that is more accessible to the programs used for variant calling or for simply displaying the mapping of the reads on the reference graphically.
So to be as clear as possible I thought of presenting these two types of files schematically.
The SAM file
- File type: Text file in which the alignment information is reported in ASCII characters.
- File extension: name_of_file.sam
- File structure: A SAM file consists of two main parts:
- The header, in which every line begins with the symbol "@". It contains general information about the file and its format version, about how the file is sorted, and about the reference genome.
- The body, or alignment section, which stores all the data produced by the alignment software, such as BWA. It has at least one line for each read produced by sequencing (a read that aligns in more than one place gets more than one line) and eleven mandatory columns, separated by tabs. Each column is a field that contains specific information about the mapping of the read on the reference.
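To make the two-part structure concrete, here is a minimal Python sketch that separates header lines from alignment lines and names the eleven mandatory columns. The SAM content in `SAM_TEXT` is made up for illustration, and `split_sam` is a hypothetical helper, not part of any library; real files are better read with a dedicated tool.

```python
# The eleven mandatory SAM columns, in order.
FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
          "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
# Columns that hold integers.
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}

# Tiny made-up SAM content, for illustration only.
SAM_TEXT = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:chr1\tLN:1000\n"
    "read1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\n"
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tGGGTTTAAAC\tIIIIIIIIII\n"
)


def split_sam(text):
    """Return (header_lines, records) from SAM-formatted text."""
    header, records = [], []
    for line in text.splitlines():
        if not line:
            continue
        if line.startswith("@"):
            # Header lines always start with "@".
            header.append(line)
            continue
        columns = line.split("\t")
        # Keep only the eleven mandatory fields; optional tags may follow.
        record = dict(zip(FIELDS, columns[:11]))
        for name in INT_FIELDS:
            record[name] = int(record[name])
        records.append(record)
    return header, records


header, records = split_sam(SAM_TEXT)
print(len(header), "header lines,", len(records), "alignment lines")
for rec in records:
    print(rec["QNAME"], rec["FLAG"], rec["RNAME"], rec["POS"])
```

Any columns after the eleventh are optional tags added by the aligner, which is why the sketch only keeps the first eleven.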
Let's see what the individual columns say in detail.
Col 1, QNAME: Indicates the name of the read. But be careful: in some cases a read can be chimeric, and therefore able to align at different points of the reference genome, so we may see the same name repeated along column 1.
Col 2, FLAG: Indicates a numerical code that tells us how the read on this line was aligned to the reference genome by the alignment software. The number is a sum of bit values, each of which switches on one property of the alignment. This column is essential for later obtaining statistics on the quality of the alignment with appropriate software, such as samtools flagstat. Let's take an example to understand how the numbers in the second column give us important information. Suppose the read under consideration has a FLAG value of 4. What does it mean? Well, through a special table (present in the image below) we know that this value indicates that the read has not been mapped, because no point on the genome was found for it to align with. Very interesting, right? From the point of view of studying structural variants these unmapped reads can be very useful: they are often the ones that contain the greatest number of differences from the reference, which is precisely what prevents alignment.
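Since the FLAG is a sum of bit values, it can be decoded with a few bitwise tests. The sketch below uses the bit meanings defined in the SAM specification; the short labels and the function name `decode_flag` are my own wording, chosen for illustration.

```python
# Bit values defined by the SAM specification; labels are informal.
FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read reverse strand"),
    (0x20, "mate reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "secondary alignment"),
    (0x200, "fails quality checks"),
    (0x400, "duplicate"),
    (0x800, "supplementary alignment"),
]


def decode_flag(flag):
    """Return the labels of all bits that are set in a FLAG value."""
    return [label for bit, label in FLAG_BITS if flag & bit]


print(decode_flag(4))   # the unmapped read from the example above
print(decode_flag(99))  # 99 = 1 + 2 + 32 + 64
```

A FLAG of 0 sets no bits at all, which for a single-end read simply means it mapped to the forward strand.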
Col 3, RNAME: Indicates the name of the reference sequence to which the read has been aligned. Usually this field holds the name of the chromosome of the reference on which the read is aligned.
Col 4, POS: Indicates the position on the reference where the alignment of the read starts, that is, the position of the leftmost aligned nucleotide, counting from 1. If we find the value zero in this column, the read has not been mapped, in agreement with the value 4 in the FLAG column (col 2).
Col 5, MAPQ: Indicates the quality value of the alignment, computed as -10 log10 of the probability that the mapping position is wrong, so higher values mean a more reliable placement (read here to learn more about this parameter).
Col 6, CIGAR: In this column we find a string made up of one or more pairs of a whole number and a letter. Each letter refers to an operation (OP) and the number says how many bases that operation covers; together they summarize the alignment. This is very useful because it allows programs such as Tablet to display the alignment of the reads on the reference graphically. Below you can find a table describing the meaning of each letter that can appear in the string. Let's take an example: say we have the string 76H130M in column 6. This means that the first 76 bases of the read were hard-clipped (H), that is, they are not part of the alignment and are not reported in the SEQ column, while the following 130 bases were aligned to the reference (M, which covers both matches and mismatches).
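The CIGAR example above can be taken apart with a short regular expression. This is only a sketch: `parse_cigar` and `cigar_lengths` are hypothetical helper names, while the set of operation letters and which of them consume read or reference bases follow the SAM specification.

```python
import re

# One CIGAR element: a whole number followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations that use up bases of the read as stored in SEQ.
CONSUMES_READ = set("MIS=X")
# Operations that use up bases of the reference.
CONSUMES_REF = set("MDN=X")


def parse_cigar(cigar):
    """Split a CIGAR string into a list of (length, operation) pairs."""
    if cigar == "*":
        # "*" means no CIGAR is available, e.g. for an unmapped read.
        return []
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Make sure nothing in the string was silently skipped.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops


def cigar_lengths(cigar):
    """Return (bases in SEQ, bases spanned on the reference)."""
    ops = parse_cigar(cigar)
    read_len = sum(n for n, op in ops if op in CONSUMES_READ)
    ref_len = sum(n for n, op in ops if op in CONSUMES_REF)
    return read_len, ref_len


print(parse_cigar("76H130M"))
print(cigar_lengths("76H130M"))
```

For 76H130M the hard-clipped bases count toward neither total, so both lengths are 130.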
Col 7, RNEXT: Indicates the name of the reference sequence on which the mate, the read that is paired-end with the read under consideration, is aligned. Attention: the symbol "*" indicates that no information is available, while the symbol "=" indicates that the mate is aligned on the same reference sequence as the read of that line (the one given in RNAME).
Col 8, PNEXT: Indicates the starting position of the alignment of the mate, the read that is paired-end with the read under consideration.
Col 9, TLEN: Represents the observed length of the template, that is, the length of the reference segment spanned by the two paired-end reads.
Col 10, SEQ: It shows the sequence of the read taken into consideration.
Col 11, QUAL or PHRED: Expresses the quality values of the sequencing of the read, one character per base. Each value expresses the probability of an error when "calling" that base during sequencing.
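As a small illustration of the QUAL column, the sketch below converts a quality string into numbers. It assumes the encoding defined by the SAM specification, in which each character is the Phred score plus 33 in ASCII, and the usual Phred relation Q = -10 log10(P). The function names are hypothetical.

```python
def phred_scores(qual):
    """Convert a QUAL string (Phred+33 ASCII) into integer Phred scores."""
    return [ord(char) - 33 for char in qual]


def error_probability(score):
    """Probability that a base call with this Phred score is wrong."""
    return 10 ** (-score / 10)


scores = phred_scores("I!5")
print(scores)
print([error_probability(q) for q in scores])
```

In this encoding the character "I" stands for a Phred score of 40, which corresponds to one expected error in 10,000 base calls.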
What I have described so far can be summarized with these two images:
The BAM file
Now that the SAM file has been presented, understanding the role of the BAM file is much simpler, so I will not lay it out schematically as I did above. For simplicity we could say that a SAM file is readable by us humans while a BAM file is readable only by the computer, but both contain the same information on the alignment of reads on the reference genome. A BAM file is easy for the computer to work with because it is compressed and, once sorted by position, indexable (using samtools index). The index lets programs jump straight to the region they need, which makes the file easier to use both for programs that look for the polymorphisms between the aligned genomes and the reference, and for programs dedicated to building graphical representations of the alignment of reads on the reference genome.
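As a sketch of how a SAM file typically becomes an indexed BAM file, the Python snippet below builds the two samtools commands involved (sort, then index) and runs them only if asked to. It assumes samtools is installed and on the PATH; the file names and the helper functions are hypothetical.

```python
import shutil
import subprocess


def bam_commands(sam_path, bam_path):
    """Build the samtools commands that turn a SAM into an indexed BAM."""
    return [
        # Sort by coordinate and write the result in BAM format.
        ["samtools", "sort", "-O", "bam", "-o", bam_path, sam_path],
        # Create the index file next to the sorted BAM.
        ["samtools", "index", bam_path],
    ]


def sam_to_indexed_bam(sam_path, bam_path):
    """Run the commands above; requires samtools to be installed."""
    if shutil.which("samtools") is None:
        raise RuntimeError("samtools was not found on the PATH")
    for command in bam_commands(sam_path, bam_path):
        subprocess.run(command, check=True)


# Hypothetical file names, printed rather than executed.
for command in bam_commands("sample.sam", "sample.sorted.bam"):
    print(" ".join(command))
```

Sorting comes first because samtools index expects a coordinate-sorted file.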
Okay, maybe today too I have gone on a little too long, but I hope that with this article you now have a clear idea of the role of these two extremely useful file formats in bioinformatics. As always, I urge you to subscribe to the blog, to leave a like and/or to leave a comment, including constructive criticism. I also remind you that if you care about this outreach project you can support us through a (even very small) donation on PayPal in the appropriate section, "Help us grow".
Bye-bye and see you soon.
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, and 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map (SAM) format and SAMtools. Bioinformatics (2009) 25(16): 2078-9. https://doi.org/10.1093/bioinformatics/btp352
- Hosseini M, Pratas D, Pinho AJ. A Survey on Data Compression Methods for Biological Sequences. Information (2016) 7: 56. https://doi.org/10.3390/info7040056
- Milne I, Stephen G, Bayer M, Cock PJA, Pritchard L, Cardle L, Shaw PD, Marshall D. Using Tablet for visual exploration of second-generation sequencing data. Briefings in Bioinformatics (2013) 14(2): 193-202.