Here we are with the first article of 2021. I hope it will be a better year for everyone. As for me, I hope that this year the number of articles on this blog will grow. We have several projects in mind, such as opening a YouTube channel, creating a podcast and much more, and who knows, maybe one day we will be able to turn this outreach work into a real job. Yes, you read that correctly: I'm speaking in the plural. Why? Simply because in a few months bioinformatici.com will be managed by two people: me, Omar Almolla, the "voice" of this blog, and Luciana Gaccione, who manages the structure and the social side of the blog and who, having the same educational background as I do, keeps me from writing too much nonsense. You could say that we have an internal peer review system.
In any case, I am still the one speaking to you, but it is only right for you to know that behind this outreach project a team is growing, one that I hope can help Bioinformatics grow, reach more and more people and land on new platforms. There is a long way to go, but the tenacity and passion of those who follow it are there. If you want to help us grow this project, you can share the blog on your social networks or leave a donation in the "Help me grow!" section.
OK, this is not what I wanted to talk about, but as usual I got carried away by the desire to write.
The reason I decided to write today is to talk in more depth about a file type widely used in bioinformatics: FASTA files. I covered them briefly in the article titled "Fasta and Furious!", but to be honest I'm not fully satisfied with what I wrote there. That is understandable: the blog had only just been born, and I did not yet have a clear idea of who my audience was and therefore how to write for you. That's why I'm determined to tell you more about FASTA files, so sit back and read what I have to tell you.
A FASTA file is a text file that contains either a nucleotide sequence (DNA or RNA) or the amino acid sequence of a protein, together with information about it. The structure of a FASTA file is very simple. In it we find two lines:
- The first line begins with the greater-than symbol ">" and is called the header. It provides information about the nucleotide or amino acid sequence placed in the second and last line, such as an identification code for the sequence, its length and much more. In addition, NCBI (National Center for Biotechnology Information) has defined a list of codes that uniquely label the database from which the sequence was taken. Note that putting information in the header is not mandatory: all that is required to define a FASTA file is the greater-than symbol ">" at the beginning of the first line.
- The second line provides the sequence of nitrogenous bases, in the case of DNA or RNA, or of amino acids in the case of proteins. Regarding the second line of the FASTA file, an important consideration must be made
The extension of a FASTA file can be of different types depending on the type of sequences it contains, as can be seen from the table below.
When talking about FASTA files it is necessary to talk about some file types related to them. In particular these are:
- MULTI-FASTA files, that is, files that contain multiple sequences in FASTA format, each with its own header followed by a nucleotide or amino acid sequence. They can be obtained by concatenating individual FASTA files, for example with the cat command. If you are interested, watch the video below to see how the cat command works:
- QUAL files. These files provide information about the quality of the individual nitrogenous bases that make up the sequence in the second line of the FASTA file, in the form of non-negative integers that define a quality score called the Phred score or Q value. This quality information is extremely important because, as mentioned in previous articles, when working with a sequence, in particular DNA or RNA, it is necessary to constantly question its reliability. In particular, there are 3 levels of quality control for a sequence under examination:
- Quality control in the sequencing phase, where the choice of library, coverage and sequencing technique is crucial for obtaining reads that faithfully represent the DNA whose sequence we want to reconstruct.
- Quality control in the assembly phase of the reads obtained from sequencing. Here, choosing the correct assembly algorithm and evaluating statistical parameters helps to assess the sequence obtained from assembling the reads (read here to find out more).
- Quality control of the sequence itself, thanks to the quality information stored in QUAL files, in the specific case of FASTA files.
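To make the MULTI-FASTA idea above a little more concrete, here is a minimal Python sketch. The record names (seq1, seq2), the sequences and the function name parse_fasta are all made up for illustration, and the sketch assumes that a sequence may be wrapped over more than one line, which is common in real files. It is only one possible way to read such a file, not a reference parser.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {header: sequence} for every record in a (multi-)FASTA text."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            header = line[1:].strip()  # drop the leading ">"
            records[header] = ""
        elif header is not None:
            # Assumption: sequence text may span several lines; join them.
            records[header] += line
    return records


# Two hypothetical single-record FASTA files, held as strings for the demo.
fasta_a = ">seq1 example DNA\nAATA\n"
fasta_b = ">seq2 example protein\nMKV\nLAA\n"

# Plain concatenation, the same effect as `cat a.fasta b.fasta > multi.fasta`.
multi_fasta = fasta_a + fasta_b

records = parse_fasta(multi_fasta.splitlines())
for name, seq in records.items():
    print(name, seq, len(seq))
```

Keeping the whole header as the dictionary key is just a convenience here; in practice you may prefer to keep only the identifier before the first space.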
But how does the QUAL file provide this information in practice? As mentioned, it contains values called Phred scores or Q values, non-negative integers associated with each letter of the sequence. QUAL and FASTA files are generally extracted from SCF files, i.e. files produced by the digital processing of Sanger sequencing. An SCF file contains a chromatogram in which it is possible to observe the fluorescence peaks for the four different nitrogenous bases found in the sequenced DNA or cDNA (read here if you feel a little lost).
In particular, the Phred algorithm evaluates the trend, the shape and other features of the individual peaks (and therefore nucleotides) against reference values, and from this evaluation it calculates a value for each nucleotide, the Q value, which describes the probability that there is an error in the "call" of the nucleotide under consideration. Believe me, it is more complicated to write than to understand. Let's take an example:
- I have a DNA sequence like this => AATA
- The Phred algorithm applies the following equation base by base:
$$Q = -10 \log_{10} P$$
where P is the probability of an error at that precise point of the sequence, and therefore in the "call" of the nitrogenous base under consideration, and Q is that same probability expressed on a logarithmic scale.
- Let's say we have a Q value of 30 for the first A of the AATA sequence. What does this mean? It means there is a 0.1% probability (1 in 1,000) that this A was identified (or "called") incorrectly at that precise point when the sequence was determined during sequencing.
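To check that figure by hand, we can invert the standard Phred relation between the Q value and the error probability (the symbol P for the error probability is introduced here for illustration). A short worked sketch:

$$
Q = -10 \log_{10} P \;\Longrightarrow\; P = 10^{-Q/10}, \qquad Q = 30 \;\Rightarrow\; P = 10^{-3} = 0.001 = 0.1\%
$$

By the same relation, Q = 20 corresponds to P = 0.01 (1%) and Q = 10 to P = 0.1 (10%), so every 10 points of Q make the error ten times less likely.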
Usually the Q value ranges from 0 to 60, and a value greater than 20 is considered acceptable. It is often very useful to know the Q value of each nucleotide in a sequence. Imagine we want to design primers for a given DNA sequence: if in a certain region we find nucleotides with a Q value below 20, we can decide to discard that region or select another point in the sequence on which to design the primers. If we used primers designed on a sequence with a low quality score, they might not pair correctly when we use them, for example, to amplify that same sequence by PCR.
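The check described above can be sketched in a few lines of Python. Everything here is illustrative: the read name read1, the Q values and the helper names are invented, and the sketch assumes a QUAL record shaped like a FASTA record (a ">" header followed by space-separated integers, one per base). The threshold of 20 comes from the text; the rest is only one way of doing it.

```python
from typing import List


def phred_to_error_probability(q: int) -> float:
    """Invert Q = -10 * log10(P) to get the error probability P."""
    return 10 ** (-q / 10)


def low_quality_positions(scores: List[int], threshold: int = 20) -> List[int]:
    """Return 0-based positions whose Q value is below the threshold."""
    return [i for i, q in enumerate(scores) if q < threshold]


# Hypothetical data: a FASTA sequence and its matching QUAL record.
sequence = "AATA"
qual_record = ">read1\n30 12 40 25\n"

# Assumption: the header comes first, then the scores on one or more lines.
header, *score_lines = qual_record.strip().splitlines()
scores = [int(token) for token in " ".join(score_lines).split()]
assert len(scores) == len(sequence)  # one Q value per base

for base, q in zip(sequence, scores):
    print(f"{base}  Q={q}  P(error)={phred_to_error_probability(q):.4f}")

# Positions to avoid when choosing where to design a primer.
bad = low_quality_positions(scores)
print("positions below Q20:", bad)
```

With this made-up record, only the second base falls below the threshold, so a primer overlapping that position would be a candidate for redesign.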
So, as mentioned, in the article "Fasta and Furious!" I omitted a lot of useful information. Always remember that FASTA files are nothing without the QUAL files associated with them. I like to think that every time you find yourself working with one of these you will remember us, the authors of Bioinformatica.
Now I have to go; I think I've written too much. But I remind you to leave a “like”, to share the article with your friends and/or colleagues, and to subscribe to the blog or follow us on Instagram. Remember, we would love to keep growing, and with your little help this can become possible.
Goodbye and see you soon.