As I told you in a previous article, data is stored in files, and these can take different forms, distinguishable by their extensions. The extension of a file is the short label that we commonly find at the end of its name, immediately after a dot. Maybe an example will make the idea clearer. Open Word, write a new document and save it as “ciccio”; then search for it on your computer and you will see that your file is called “ciccio.docx”. Alternatively, right-click on the file ciccio and go to Properties: in the information about the file type you will see the .docx extension. The extension is the label that describes the type of file; for example, Word files have the extension .docx, text files the extension .txt, PowerPoint files the extension .pptx, and so on.
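If you prefer to see the idea in code, here is a minimal Python sketch that reads the extension from a file name. The helper name `file_extension` and the list of file names are my own, chosen only for illustration.

```python
from pathlib import Path


def file_extension(filename: str) -> str:
    """Return the extension (dot included) in lower case, or '' if there is none."""
    return Path(filename).suffix.lower()


# Illustrative file names only; nothing is read from disk.
for name in ["ciccio.docx", "notes.txt", "slides.pptx"]:
    print(name, "->", file_extension(name))
```

Note that `Path.suffix` only returns the last extension, so a name with two of them (a compressed file, for instance) would give back only the final part.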
Obviously, file extensions are very useful in bioinformatics too, because they let us understand at a glance what kind of file we have in our hands. The types of file that a bioinformatician deals with are many and varied, but some are very frequent and I think it is right to say something about them:
- FASTA file, distinguishable by the extension .fasta, as well as .fa and .fna. A FASTA file is a text file that contains the sequence of a DNA, an RNA or a protein that has been sequenced.
The structure of a FASTA file is the following:
- In its simplest form, each sequence takes up only two lines (a file can hold several sequences one after another, and a long sequence may be wrapped over several lines).
- The first line always begins with a greater-than symbol (>) and forms the header, where we find information relating to the sequence.
- In the second line we find the nucleotide sequence of DNA or RNA, or the amino acid sequence in the case of proteins.
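To make this structure concrete, here is a small Python sketch that reads FASTA text and pairs each header with its sequence. The function name `parse_fasta` and the two sequences are invented for illustration; the sketch also tolerates sequences wrapped over several lines, which some files contain. For real work a dedicated library is probably a better choice.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each header (without the leading '>') to its full sequence."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # a header line opens a new record
            records[header] = ""
        elif header is not None:
            records[header] += line  # sequence lines are joined together
    return records


# Made-up example data, not taken from a real sequencing run.
example = """>seq1 hypothetical example sequence
ATGCGTACGTTAG
>seq2 another made-up sequence
GGCATT
CCGTA
"""

records = parse_fasta(example.splitlines())
for header, sequence in records.items():
    print(header, len(sequence), sequence)
```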
Two clarifications must be made regarding FASTA files:
- A FASTA file may result from the conversion of an .scf file, which contains the chromatogram obtained as output from Sanger sequencing, the first-generation sequencing technique that we will discuss in detail in one of the next articles.
- A FASTA file presents only the sequence obtained from sequencing, not information relating to the quality of the sequence. The quality information, expressed as numerical values, is contained in another type of file, which takes the extension .qual.
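As a rough illustration of how the numerical quality values sit next to the sequence, here is a Python sketch. It assumes the common layout in which the quality file mirrors the FASTA one: a header starting with > followed by whitespace-separated integer scores, one per base. The name `parse_qual` and the data are hypothetical.

```python
from typing import Dict, Iterable, List


def parse_qual(lines: Iterable[str]) -> Dict[str, List[int]]:
    """Map each header (without '>') to its list of integer quality scores."""
    scores: Dict[str, List[int]] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            scores[header] = []
        elif header is not None:
            # Assumption: scores are integers separated by whitespace.
            scores[header].extend(int(value) for value in line.split())
    return scores


# Made-up data: one sequence and its matching quality scores.
sequence = "ATGCGTA"
qual_text = """>seq1
40 40 38 35 30
12 9
"""

quals = parse_qual(qual_text.splitlines())
if len(quals["seq1"]) != len(sequence):
    raise ValueError("expected one quality score per base")
for base, score in zip(sequence, quals["seq1"]):
    print(base, score)
```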
- The SRA file, identified by the extension .sra, is a raw file obtained as the first output of next-generation sequencing (NGS) processes and stored in the Sequence Read Archive (SRA) database. SRA files are compressed, so in order to analyze the sequences contained within them they must first be decompressed with special tools such as the SRA Toolkit ( https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software ), which outputs a FASTQ file.
- The FASTQ file, with the extension .fastq or .fq, is in some respects similar to the FASTA file: it also contains the nucleotide sequence of a nucleic acid sequenced by means of next-generation sequencing (NGS) techniques, but unlike the FASTA file it also carries quality information about the sequence, expressed in ASCII characters, that is, by means of numbers, letters and symbols.
The structure of a FASTQ file is as follows:
- Each sequence (read) takes up four lines.
- In the first line we find the header, with various information about the nucleotide sequence. It does not start with the > symbol, as in the FASTA file, but with the @ symbol.
- In the second line we find the sequence.
- The third line starts with + and can optionally repeat the header.
- In the fourth and last line we find the quality information on the sequence in ASCII characters, one character per base. The quality of the sequence varies according to the type of next-generation sequencing technique used.
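The four-line structure can be sketched in Python as follows. This is only an illustration under one important assumption: that the quality characters use the widespread Phred+33 encoding, where the score is the ASCII code of the character minus 33 (older files may use a different offset). The names `FastqRecord`, `parse_fastq` and `error_probability`, and the read itself, are invented for the example.

```python
from typing import Iterable, Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    header: str
    sequence: str
    qualities: List[int]


def parse_fastq(lines: Iterable[str], offset: int = 33) -> Iterator[FastqRecord]:
    """Yield one record for every group of four lines.

    Assumption: quality characters are Phred scores shifted by `offset`
    (33 by default); adjust it if your files use another encoding.
    """
    cleaned = [line.rstrip("\r\n") for line in lines if line.strip()]
    if len(cleaned) % 4 != 0:
        raise ValueError("FASTQ input should contain four lines per record")
    for i in range(0, len(cleaned), 4):
        header, sequence, plus, quality = cleaned[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("record does not follow the @ / + layout")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lines differ in length")
        yield FastqRecord(header[1:], sequence, [ord(c) - offset for c in quality])


def error_probability(score: int) -> float:
    """Convert a Phred score Q into an error probability, 10 ** (-Q / 10)."""
    return 10 ** (-score / 10)


# Made-up read, not taken from a real sequencing run.
example = """@read1 hypothetical example read
GATTACA
+
II5I+!#
"""

reads = list(parse_fastq(example.splitlines()))
for read in reads:
    for base, score in zip(read.sequence, read.qualities):
        print(base, score, error_probability(score))
```

A higher score means a lower chance that the base was called wrongly, which is why low-scoring bases are often trimmed before the analysis.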
Well, I think we can say goodbye for today. Of course, the files described above are just some of the types of files that exist in the world of bioinformatics; we will meet many others in the next articles.
As always, I invite you to leave a comment and follow this blog.
Bye-bye and see you soon.