Hi. I haven't blogged in a long time. Unfortunately, I am in the middle of about six very hectic months. Between the thesis and the last exams of the master's course, finding time to study and bring material here to the blog has become difficult. But don't worry. After July we at Bioinformaticamente will try to publish new material more frequently, so have faith and keep following us: the best is yet to come.

With that brief introduction out of the way, today I found a moment to tell you something. I was undecided whether to talk about how to design primers for the amplification of a given sequence or about how to study the promoters of expressed genes. But then I remembered that I recently covered FASTA files in some depth, and it didn't seem fair to let so much time go by before talking about another very important file format in bioinformatics. Obviously I'm talking about FASTQ files, that is, text files that contain both the sequence, usually nucleotide, and the quality information for each element of it. In a certain sense, a FASTQ file looks like the union of a FASTA file and its corresponding QUAL file. Not surprisingly, the FASTQ format was born at the Wellcome Trust Sanger Institute precisely with the aim of bundling a FASTA sequence with its quality data, and it has since become the standard format for storing the sequences produced by next-generation sequencing instruments such as Illumina sequencers.

Let us now dissect the structure of FASTQ files. First of all, they can be recognized by the extension .fastq or .fq. Each record in the file consists of four lines, where:

  • The first line begins with an "@" character followed by a sequence identifier (ID) and, optionally, further information about the sequence, so it resembles the first line of a FASTA file.
  • The second line contains the nucleotide sequence obtained as the output of the sequencing.
  • The third line begins with a "+" character, optionally followed by the same sequence identifier used in the first line.
  • The fourth line contains the quality values for the bases of the sequence in the second line, so the number of quality values is equal to the number of bases in the sequence.
Figure 1. Structure of a FASTQ file. Source: "Theoretical Biology and Bioinformatics" course, Utrecht University, Bas E. Dutilh & Can Keșmir
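To make the four-line structure concrete, here is a minimal Python sketch of a reader. It assumes that every record occupies exactly four lines (no wrapped sequences), and the names `FastqRecord` and `read_fastq` are my own, chosen only for illustration; they do not come from any library.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str  # text after the leading "@"
    sequence: str    # bases from the second line
    quality: str     # one ASCII character per base, from the fourth line


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records, assuming exactly four lines per record."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@"):
            raise ValueError(f"expected '@' at start of record, got: {header!r}")
        if not separator.startswith("+"):
            raise ValueError(f"expected '+' on third line, got: {separator!r}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lines differ in length")
        yield FastqRecord(header[1:], sequence, quality)


# A tiny in-memory file with a single made-up record.
example = io.StringIO("@read1 sample\nAATCG\n+\n-?E8;\n")
records = list(read_fastq(example))
print(records[0])
```

The length check mirrors the rule in the fourth bullet: one quality character for every base.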

However, it is important to know that, unlike those in QUAL files, the quality values in the fourth line of a FASTQ record are not written as numbers: each one is represented by a single ASCII character. The purpose is always the same: the software associated with the sequencing machine estimates the probability of making an error in the identification (or, as they say in the jargon, "in the call") of a nucleotide. As I mentioned in the article "Fasta and Furious 2!", this probability is calculated for each base in the sequence and is expressed as a Phred score. Usually this score lies between 0, which means an error probability of 100%, and 41, which means a probability of 10^-4.1 that the nucleotide is wrong, i.e. an error rate of roughly 0.01%. So in the fourth line of a FASTQ record we find the Phred value of each nucleotide, translated into an ASCII character.
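The relationship between a Phred score Q and the error probability P is P = 10^(-Q/10), which is where the 100% for Q = 0 and the 10^-4.1 for Q = 41 come from. A short Python sketch of the conversion in both directions (the function names are hypothetical, chosen just for this example):

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Phred score corresponding to an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


# Print the error probability for a few scores in the usual 0-41 range.
for q in (0, 10, 20, 30, 41):
    print(f"Q{q}: {phred_to_error_probability(q):.6f}")
```

Every 10 points of Phred score divide the error probability by ten, so Q20 means one wrong call in 100 and Q30 one in 1000.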

Ok, ok, I think I lost you. Let's try to clarify with an example.

Let us consider a sequence of about fifty nucleotides. Looking at the two images below, and at the table in particular, you can see that a nucleotide with a Phred value of 25 corresponds to an ASCII value of 58 (25 plus an offset of 33, which we will come back to later), written in the fourth line of the FASTQ record as the ASCII character ":".

Figure 2. Detailed structure of a FASTQ file.
Source: https://www.drive5.com/usearch/manual/fastq_files.html
Figure 3. ASCII characters. Source: “Theoretical Biology and Bioinformatics” course, Utrecht University, Bas E. Dutilh & Can Keșmir
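The table lookup described above can also be done in a couple of lines of Python. This is only a sketch: it assumes the offset of 33 implied by the example (25 becomes 58), and `decode_quality` is a name I made up for illustration.

```python
from typing import List

PHRED_OFFSET = 33  # assumption: the offset implied by the example above


def decode_quality(quality: str, offset: int = PHRED_OFFSET) -> List[int]:
    """Turn a quality string into a list of Phred scores."""
    return [ord(character) - offset for character in quality]


print(ord(":"))             # ASCII value of the character ":"
print(decode_quality(":"))  # Phred score it stands for
```

Python's built-in `ord` returns the ASCII value of a character, so subtracting the offset gives the Phred score back.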

Clear? Good. I think at this point your question is: “What the hell is all this for? Can't we just put the Phred value of each nucleotide in the fourth line of the FASTQ file, just like in QUAL files?”

Figure 4. Structure of FASTA file and QUAL file. Source: https://awbrooks19.github.io/vmi_microbiome_bootcamp/rst/3_sequences_to_composition.html

This is certainly a fair observation, but take a good look at the table mentioned above. ASCII characters are much more compact than numeric Phred values: each score takes up exactly one character. Remember that a FASTQ file is a text file, and the more characters we write, the larger the file becomes. Let's take another example:

Imagine you have a sequence like this: AATCG. Now imagine writing the Phred values on the line below it, within the same file: 12 (A), 30 (A), 36 (T), 23 (C) and 26 (G). These numbers have two digits each, so two characters each, plus a separator between them. Using ASCII characters instead, we get the same information with fewer characters, that is: "-" (A), "?" (A), "E" (T), "8" (C) and ";" (G). Obviously, with a sequence of five nucleotides the reasoning may seem superfluous, but imagine a file holding thousands or millions of sequences: in that case saving characters is certainly an advantage, because "lighter" files can be downloaded more quickly from the databases where they are stored and opened faster without needing a monster of a computer.
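The AATCG example can be checked with a few lines of Python. This is a sketch that assumes the Phred+33 encoding and space-separated numbers for the numeric version; `encode_quality` is a hypothetical helper name.

```python
from typing import Iterable


def encode_quality(scores: Iterable[int], offset: int = 33) -> str:
    """Turn Phred scores into one ASCII character each."""
    return "".join(chr(score + offset) for score in scores)


scores = [12, 30, 36, 23, 26]  # Phred values for A, A, T, C, G
as_ascii = encode_quality(scores)
as_numbers = " ".join(str(score) for score in scores)

print(as_ascii, len(as_ascii))      # one character per base
print(as_numbers, len(as_numbers))  # two digits plus a separator per base
```

The ASCII version always has exactly as many characters as the sequence itself, which also keeps the two lines aligned base by base.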

Before concluding, I must make a clarification that could complicate things a little further. Earlier I said that the Phred values usually range from 0 to 41. In reality this depends on the quality criterion (the encoding) being used, and there are several of them depending on the sequencing technology.

Figure 5. Phred values and their respective ASCII values and characters. The various quality criteria cover different ranges of Phred values. Source: https://bioinformatics.uconn.edu/resources-and-events/tutorials-2/file-formats-tutorial/#

So for Sanger sequencing the reference quality criterion is Phred+33, with a Phred range of 0-40, while for sequences obtained with the first versions of Illumina technology the criterion is Phred+64, with a Phred range of 0-40 or 3-40 depending on the version. To give a better idea, I have included an excellent summary image below.

Figure 6. Several quality criteria that can be used to express nucleotide quality values as ASCII characters.
Source: https://bioinformatics.uconn.edu/resources-and-events/tutorials-2/file-formats-tutorial/#

I know, I know. This further complication wasn't really necessary, but you can rest easy: today the quality criterion introduced with Illumina 1.8+, Phred+33 with a value range of 0-41 (indicated by the red arrow in the image above), is the one almost universally used. We find it in the FASTQ files produced by NGS platforms such as Illumina, Ion Torrent and PacBio, and also in those derived from Sanger sequencing.

Figure 7. Universally used ASCII characters, Phred + 33.
Source: https://www.drive5.com/usearch/manual/quality_score.html
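If you ever come across an old file and are not sure which criterion it uses, the characters themselves often give it away. The Python sketch below is only a heuristic of my own, not a standard tool: it assumes that Phred+33 scores stay within the 0-41 range described above, and the function name `guess_phred_offset` is hypothetical.

```python
from typing import Iterable, Optional


def guess_phred_offset(quality_strings: Iterable[str]) -> Optional[int]:
    """Return 33 or 64, or None when the characters seen do not settle it."""
    codes = [ord(ch) for quality in quality_strings for ch in quality]
    if not codes:
        return None
    lowest, highest = min(codes), max(codes)
    if lowest < 33 or highest > 126:
        raise ValueError("quality characters outside the printable ASCII range")
    if lowest < 59:
        # Characters below ';' (ASCII 59) are too low for any +64 encoding.
        return 33
    if highest > 74:
        # Characters above 'J' (ASCII 74) would exceed Phred 41 at offset 33.
        return 64
    return None  # ambiguous: only characters valid in both encodings


print(guess_phred_offset(["-?E8;"]))  # contains characters below ';'
print(guess_phred_offset(["hhhgf"]))  # contains characters above 'J'
```

A guess like this is more reliable when it is based on many reads rather than on a single short one, since a few high-quality bases alone can look the same in both encodings.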

Here we are at the end of this article. I hope I was able to present this other file format, extremely important in bioinformatics, in a clear way. As always, I ask you to leave a "like" or a comment, or to ask for clarification, why not? Let us hear your feedback.

Bye-bye and see you soon.
