Here we are, this is the first article of 2021. I hope it will be a better year for everyone. As for me, I hope that in this new year the number of articles on this blog can increase. We have several projects in mind: opening a YouTube channel, creating a podcast and much more, and who knows, maybe one day we will be able to turn this outreach activity into a real job. Yes, you noticed correctly, I'm speaking in the plural. Why? Simply because Bioinformaticamente is managed by two people: me, Omar Almolla, the "voice" of this blog, and Luciana Gaccione, who manages the structure and the social soul of the blog and who, having the same kind of education as me, keeps me from writing too much nonsense. I can say that we have an internal peer review system.
In any case, it is always me who speaks to you, but it is right that you know that behind this project a team is growing that I hope can help bioinformatics to grow, to reach more and more people and to land on new platforms. There is a long way to go, but the tenacity and the passion are there. If you want to help us grow this project, you can share the blog on your social networks or leave a donation in the "Help me grow!" section.
Ok, I didn't want to talk to you about this, but I got carried away.
The reason I decided to write today is to talk to you in more detail about a file type widely used in bioinformatics: the FASTA file. I covered it quickly in the article called "Fasta and Furious!", but to be honest I'm not super satisfied with what I wrote there. After all, it is understandable: the blog had only just been born and I still didn't have a clear idea of the audience to address, and therefore of how to write. That's why I'm determined to tell you more about FASTA files, so sit back and read what I have to tell you.
The FASTA file is a text file that contains a nucleotide sequence (DNA or RNA) or the amino acid sequence of a protein, together with related information. The structure of a FASTA file is very simple. In it we find two lines:
- The first line begins with the greater-than symbol ">" and is called the header. It provides information about the nucleotide or amino acid sequence placed in the second and last line: for example, an identification code for the sequence, its length and much more. In addition, NCBI (National Center for Biotechnology Information) has outlined a list of codes that uniquely label the database from which the sequence was taken. It should also be noted that putting information in the header is not mandatory: all that is strictly required to define a FASTA file is the ">" symbol at the beginning of the first line.
- The second line provides the sequence of nitrogenous bases, in the case of DNA or RNA, or of amino acids in the case of proteins. Regarding the second line of the FASTA file, an important consideration must be made
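To make the header-plus-sequence structure described above concrete, here is a minimal Python sketch that splits one FASTA record into its two parts. The function name `parse_fasta_record` and the example record are made up for illustration. As an extra assumption, the sketch also joins any further lines after the first sequence line, in case the sequence is wrapped over several lines.

```python
def parse_fasta_record(text):
    """Split one FASTA record into (header, sequence).

    Hypothetical helper for illustration: the only hard requirement
    it enforces is the ">" at the start of the first line.
    """
    # Drop blank lines and surrounding whitespace.
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("a FASTA record must start with '>'")
    # Everything after ">" is the header; it is allowed to be empty.
    header = lines[0][1:].strip()
    # Assumption: join all remaining lines, in case the sequence is wrapped.
    sequence = "".join(lines[1:])
    return header, sequence


example = ">seq1 example DNA sequence\nAATA\n"
header, sequence = parse_fasta_record(example)
print(header)
print(sequence)
```

Note that a record whose header is just ">" is still accepted, in line with the point that the header information itself is optional.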
The extension of a FASTA file can be of different types depending on the type of sequences it contains, as can be seen from the table below.
When we are talking about FASTA files we need to talk about some file types related to them. In particular these are:
- MULTI-FASTA files, i.e. FASTA files that contain multiple sequences in FASTA format, each with its own header plus nucleotide or amino acid sequence. These can be obtained by concatenating individual FASTA files, for example with the cat command. If you are interested, watch the video below to see how the cat command works:
- QUAL files. These files provide information about the quality of the individual bases that make up the sequence in the second line of the FASTA file, in the form of positive integers that define a quality score called the Phred score or Q value. This quality information is extremely important because, as mentioned in previous articles, when working with a sequence, in particular DNA or RNA, it is necessary to constantly question how reliable it is. In particular, there are 3 levels of quality control for a sequence under examination:
- Quality control in the sequencing phase, where the choice of the library, the coverage and the sequencing technique is crucial for obtaining reads that represent well the DNA whose sequence we want to reconstruct.
- Quality control in the assembly phase of the reads obtained from sequencing. In this case, choosing the correct assembly algorithm and evaluating statistical parameters helps us judge the sequence obtained from the assembly of the reads (read here to find out more).
- Quality control of the sequence itself, using the quality information stored in the QUAL file that accompanies the FASTA file.
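Going back to the MULTI-FASTA files mentioned in the list above, here is a minimal Python sketch of the same idea as concatenating with cat, followed by reading the individual records back. The function names and the two tiny example records are made up for illustration. Unlike cat, the sketch also adds a missing final newline, so that the next header always starts on its own line.

```python
def concatenate_fasta(texts):
    """Join FASTA texts one after another, similar to `cat a.fasta b.fasta`.

    Difference from cat (an assumption of this sketch): a missing
    trailing newline is added so records never run into each other.
    """
    return "".join(t if t.endswith("\n") else t + "\n" for t in texts)


def parse_multi_fasta(text):
    """Return a list of (header, sequence) tuples from MULTI-FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is None:
            raise ValueError("sequence data found before any '>' header")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


multi = concatenate_fasta([">a\nAATA", ">b\nGG\nCC\n"])
for header, sequence in parse_multi_fasta(multi):
    print(header, sequence)
```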
But how does the QUAL file provide this information in practice? As mentioned, it contains values called Phred scores or Q values, i.e. positive integers associated with each letter of the sequence. The QUAL and FASTA files are generally extracted from SCF files, i.e. files produced by the digital processing of Sanger sequencing. In fact, the SCF file contains a chromatogram in which it is possible to observe the fluorescence peaks for the four different nitrogenous bases found in the sequenced DNA or cDNA (read here if you feel a little lost).
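As a rough illustration of "one Q value per letter", here is a minimal Python sketch that reads a QUAL record and pairs each score with the corresponding base. It assumes the common layout in which a QUAL record looks like a FASTA record, with a ">" header followed by whitespace-separated integers. The function names and the scores in the example are invented for illustration.

```python
def parse_qual_record(text):
    """Split one QUAL record into (header, list of integer Q values).

    Assumed layout: a ">" header line, then integers separated by
    whitespace, possibly wrapped over several lines.
    """
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("a QUAL record must start with '>'")
    header = lines[0][1:].strip()
    scores = [int(token) for line in lines[1:] for token in line.split()]
    return header, scores


def pair_bases_with_scores(sequence, scores):
    """Pair each base with its Q value; both must have the same length."""
    if len(sequence) != len(scores):
        raise ValueError("sequence and quality scores differ in length")
    return list(zip(sequence, scores))


# Invented example: the AATA sequence with made-up quality scores.
header, scores = parse_qual_record(">seq1\n30 25 18 40\n")
for base, q in pair_bases_with_scores("AATA", scores):
    print(base, q)
```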
In particular, the Phred algorithm evaluates the trend, the shape and other features of the individual peaks (and therefore of the individual nucleotides) against reference values. Based on this evaluation it calculates a value for each nucleotide, namely the Q value, which describes the probability that there is an error in the "call" of the nucleotide under consideration. Believe me, it is more complicated to write than to understand. Let's take an example:
- I have a DNA sequence like this => AATA
- The Phred algorithm applies the following equation base by base:

$$Q = -10 \log_{10} P$$

Where P is the probability of having an error at that precise point of the sequence, and therefore in the "call" of the nitrogenous base under consideration, and Q is that probability expressed on a logarithmic scale.
- Let's say we have a Q value of 30 for the first A of the AATA sequence. What does this mean? That there is a 0.1% probability that this A was identified (or "called") incorrectly at that precise point in the sequence during the sequencing process. Writing P for the error probability:

$$P = 10^{-Q/10} = 10^{-30/10} = 10^{-3} = 0.001 = 0.1\%$$
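The worked example above can also be checked with a few lines of Python. This is only a sketch of the standard Phred relationship between Q and the error probability; the function names are made up for illustration.

```python
import math


def phred_to_error_probability(q):
    """Convert a Q value into the probability of a wrong base call."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Convert an error probability (0 < p <= 1) into a Q value."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    return -10 * math.log10(p)


# Q = 30 corresponds to 1 error in 1000 calls.
p = phred_to_error_probability(30)
print(f"{p:.3%}")  # prints 0.100%
```

Because the scale is logarithmic, every 10 points of Q divide the error probability by 10: Q = 20 corresponds to 1%, Q = 30 to 0.1%, Q = 40 to 0.01%.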
Usually the Q value ranges from 0 to 60, and a value greater than 20 is considered acceptable. It is often very useful to know the Q value of each nucleotide in a sequence. Imagine, for example, that we want to design primers for a given DNA sequence: if in a certain region we find nucleotides with a Q value below 20, we can decide to discard that region or to pick another point in the sequence for the primers. In fact, primers designed on a sequence with a low quality score might not pair correctly with their target.
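As a sketch of how the primer check described above might look in practice, the following Python snippet flags the positions with a Q value below 20 and tests whether a candidate region is free of them. The function names, the example scores and the choice to treat exactly 20 as acceptable are assumptions made for illustration.

```python
def low_quality_positions(scores, threshold=20):
    """Return the 0-based positions whose Q value is below the threshold."""
    return [i for i, q in enumerate(scores) if q < threshold]


def region_is_reliable(scores, start, end, threshold=20):
    """True if every base in scores[start:end] has Q >= threshold.

    `start` is inclusive and `end` is exclusive, like a Python slice.
    """
    return all(q >= threshold for q in scores[start:end])


# Invented scores for a six-base stretch of sequence.
scores = [30, 25, 18, 40, 12, 35]
print(low_quality_positions(scores))
print(region_is_reliable(scores, 0, 2))
print(region_is_reliable(scores, 1, 4))
```

Here a candidate primer region covering positions 1 to 3 would be rejected, because position 2 has a Q value of 18.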
So, as mentioned, in the article "Fasta and Furious!" I left out a lot of useful information. Always remember that FASTA files are nothing without the QUAL files associated with them. I like to think that every time you find yourself working with one of these, you will remember us, the authors of Bioinformaticamente.
Now I have to go. But I remind you to leave a “like”, to share the article with your friends and colleagues and to subscribe to the blog or on Instagram and Twitter. Remember, we would love to grow more and more and with your little help this can become possible.
Bye-bye and see you soon.