Hello, I hope you are well. Today I want to share an exercise that I do both for practice and for pleasure. Sometimes I download a sequence from NCBI, or pull one out of the material from the bioinformatics course I took during my master's degree, and, almost for fun, I ask it what it has to tell me. No, I'm not crazy. On the contrary, I believe that one of the basic tasks of bioinformatics is precisely to extract information from the biological data at hand, so just as an investigator squeezes information out of a suspect under questioning, I ask the sequence what it has to say. I think this is a useful exercise for many people, especially for building the habit of asking the right question when faced with biological data. The question changes with the data, so to obtain correct information you have to ask the right one. After all, as Immanuel Kant said: "Before evaluating if an answer is correct, one must evaluate whether the question is correct."

Well, to demonstrate what I mean, I dug out a sequence from the bioinformatics course I took about a year and a half ago. Let's see together what it has to tell us.

>JQ745270.1_HCT)_mRNA_complete cds
GAAACAGCCCCCTCCAACCATGAAGTCCCCTCCAGGCCACCACACCAAATCCCCAACCAATCTCTCTCTC
TCTCTCTCTTCCCCCCATCGTTCTCACCTTCAGTGGGACCCACGGGTAACGATGATCATTAACGTGAAGG
CGTCCACCATGGTGCGGCCGGCGGAGGAGACGCCTCGCCGGGCGCTGTGGAACTCCAACGTCGATCTGGT
CGTTCCTAATTTTCACACGCCTAGCGTCTACTTTTACCGTCCCACCGGTGCCGCTAACTTCTTTGACGCT
GAGGTTATGAAGCAAGCTCTCGCCAAGGCTCTGGTTCCGTTCTATCCTATGGCCGGCCGGCTCCGTCGCG
ATGAGGATGGTCGTGTTGAGATTGATTGCAACGGCGAGGGTGTGCTTTTAGTCGAGGCTGAGACTATCGG
CGTGATTGACGATTTTGGTGACTTCGCTCCCACACTCGAGCTGCGGCAGCTTATTCCGGCCGTCGATTAT
TCTGGCGGAATCGAAACGTATCCATTGTTAGTGTTGCAGGTAACGTACTTTAAATGTGGGGGCGTGTCCC
TTGGTGTGGGTATGCAGCACCACGCCGCAGATGGGTTCTCGGGTCTCCACTTTATCAACACATGGTCCGA
CATGGCCCGCGGCTTTGACCTCACGCTCCCGCCCTTCATTGATCGCACTTTGCTCCGAGCGCGTGACCCG
CCTCAGCCTGTTTTTGAGCACATTGAATACAAGCCCCCTCCAACAATGAAGTCCCCTCAAAACCCGGTCC
AGTCCCCTACAAAACCCGGTTCAGACCCCAACACAGCCACCGTCTCCATCTTCAAGATGACCCGTGCCCA
ACTCAACGCCCTCAAAGCCAAGTCCAAAGAAGCTGGTAACACCGTCAACTACAGCTCCTACGAGATGCTT
GCTGGTCATGTCTGGAGAAGCACGTGCAAGGCACGTGCACTCCCTGATGATCAAGAAACCAAATTGTACA
TTGCAACTGATGGACGGTCCAGATTGCAGCCGCCCCTTCCCCCAGGTTACTTTGGGAATGTGATCTTCAC
AGCCACGCCTATGGCTGTGGCTGGTGATCTCATGTCAAAACCAACTTGGTTTGCTGCAAGCAGGATTCAT
AATGCTCTCTCAAGAATGGATAATGAGTATTTGAGATCAGCTTTGGACTTCCTAGAACTTCAACCTGATC
TCAAAGCTCTGGTCCGTGGGGCCCATACTTTTAAGTGTCCAAATCTTGGAATCACAAGTTGGGTTAGGCT
TCCAATACATGATGCTGATTTTGGATGGGGTCGGCCCATATTTATGGGTCCTGGTGGGATAGCTTATGAG
GGGCTTTCTTTTATACTTCCAAGCTCAGGTAATGATGGAAGCTTATCAGTGGCCATAGCTCTACAGCCTG
AGCATATGAAGGTGTTCAAGGAAGTTTTGTACGAGATTTGATTTGGTTGAGGAATTGAATAGAAGCATCG
GGAACGCCAAAAATGTTCTCAGGTGGTGTTTTTCTTTCTACATATGTCATTATTGAGACTCGTTTTTTTT
AACCAGAGAGACTATTATTATATGCCTCTGCAAAGTATAGTAATTCTGTAAACTTTTTAAAACGAACTTC
GGGAACAAAAGTATGACTAATTTTGGAGGACATTTGAGAAAGATTTGTTGAACAAAAAAAAAAAAAAAAA
AAAAA

The sequence above is a cDNA, that is, a DNA molecule complementary to an mRNA. cDNA is produced by reverse transcription of the mRNA, a reaction catalyzed by the enzyme reverse transcriptase.

Now let's ask this sequence a few questions:

  1. Does the gene that produced this transcript encode proteins?

To answer this question we need to translate the cDNA in silico in all six reading frames (three on each strand), list the possible open reading frames (ORFs), and identify the protein sequence most likely produced by the transcript. Among the candidates, the longest ORF is the one most likely to correspond to the coding sequence (CDS) of the gene. Two different tools can do this:

  • getorf, a command-line program from the EMBOSS suite that is launched from the terminal to obtain the information mentioned above. What it reports is selected with the -find option.

In particular, each value of -find is a numerical code, as indicated in the table below:

Option number  Option meaning
0              Translation of regions between STOP codons
1              Translation of regions between START and STOP codons (most probably the protein encoded by the gene)
2              Nucleic sequences between STOP codons
3              Nucleic sequences between START and STOP codons (most probably the ORF contained in the gene)
4              Nucleotides flanking START codons
5              Nucleotides flanking initial STOP codons
6              Nucleotides flanking ending STOP codons

Naturally, the output of the command depends on the -find value used. Another important option is -minsize, which sets the minimum size (in nucleotides) of the ORFs that getorf will report for the queried sequence. As usual, to help you understand how this command works, I have put together a short tutorial.
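
To make the two options concrete, here is a minimal sketch of how the getorf call could be assembled and launched from Python. The flags -sequence, -outseq, -find and -minsize are getorf's own; the file names, the 300-nucleotide threshold and the two helper functions are only assumptions of mine for illustration.

```python
import shutil
import subprocess


def build_getorf_command(fasta_in, fasta_out, find=1, minsize=300):
    """Return the getorf argument list for a given -find code and -minsize."""
    if find not in range(7):
        raise ValueError("-find takes an integer code from 0 to 6")
    return [
        "getorf",
        "-sequence", fasta_in,     # input nucleotide sequence
        "-outseq", fasta_out,      # where the ORFs are written
        "-find", str(find),        # code from the table above
        "-minsize", str(minsize),  # minimum ORF size, in nucleotides
    ]


def run_getorf(fasta_in, fasta_out, find=1, minsize=300):
    """Run getorf if EMBOSS is installed; return True when it was launched."""
    if shutil.which("getorf") is None:
        return False  # getorf is not on the PATH
    subprocess.run(build_getorf_command(fasta_in, fasta_out, find, minsize), check=True)
    return True


if __name__ == "__main__":
    # Hypothetical file names: print the command that would be executed.
    print(" ".join(build_getorf_command("JQ745270.fasta", "orfs.fasta")))
```

With -find 1 the output file holds protein sequences, while -find 3 would give the corresponding nucleotide ORFs.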

  • ORFfinder, an NCBI program that runs remotely, through the web browser. It is more intuitive and more visually appealing than getorf, but the information it returns is similar. Again, I have made a short tutorial.
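
If you are curious about what both tools do under the hood, the idea can be sketched in a few lines of plain Python. This is my own simplified illustration, not the actual algorithm of getorf or ORFfinder: it assumes the standard genetic code, only reports ORFs that run from an ATG to a STOP codon, and the names find_orfs and min_nt are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq, min_nt=30):
    """Return (strand, frame, start, end, protein) for ATG...STOP ORFs.

    All six reading frames are scanned. start and end are 0-based positions
    on the scanned strand, and end includes the STOP codon. ORFs that reach
    the end of the sequence without a STOP codon are ignored.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start, protein = None, []
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                aa = CODON_TABLE.get(codon, "X")  # X for ambiguous codons
                if start is None:
                    if codon == "ATG":
                        start, protein = i, ["M"]
                elif aa == "*":
                    if i + 3 - start >= min_nt:
                        orfs.append((strand, frame, start, i + 3, "".join(protein)))
                    start = None
                else:
                    protein.append(aa)
    return orfs


if __name__ == "__main__":
    toy = "CCATGAAATTTGGGTAACC"  # toy sequence, not the transcript above
    candidates = find_orfs(toy, min_nt=9)
    print(max(candidates, key=lambda orf: len(orf[4])))
```

To try it on the transcript, paste the sequence above into a single string (without the header line) and pick the longest candidate exactly as in the last line.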

  2. What is the translation start codon of the transcript?

To answer this second question we need Kozak's criteria, which describe the sequence context a codon should sit in to be considered the translation start codon. In practice, a codon such as ATG is taken to be the translation initiator if it meets one of these two criteria:

  • Strong context: the ATG codon sits within the sequence RNNATGG.
  • Adequate context: the ATG codon sits within the sequence RNNATGR.

Here N is any base and R is a purine (adenine or guanine).

In practice, to find the codons that meet at least one of these two criteria, you need to write a regular expression, as I show in the video below.
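
For those who prefer to read rather than watch, this is roughly what such a regular expression can look like in Python. It is a sketch under my own assumptions: the function name kozak_sites is hypothetical, and since RNNATGG is a special case of RNNATGR (G is a purine), a single pattern is enough and the label is decided by the base that follows the ATG.

```python
import re

# R = [AG], N = [ACGT]. The lookahead lets overlapping candidates be reported.
KOZAK = re.compile(r"(?=([AG][ACGT]{2}ATG([AG])))")


def kozak_sites(seq):
    """Return (atg_position, context, label) for each ATG in a Kozak context.

    atg_position is the 0-based index of the A of the ATG codon.
    """
    sites = []
    for match in KOZAK.finditer(seq.upper()):
        context, after_atg = match.group(1), match.group(2)
        label = "strong" if after_atg == "G" else "adequate"
        sites.append((match.start() + 3, context, label))
    return sites


if __name__ == "__main__":
    # Toy fragments, not the transcript above.
    for fragment in ("ACCATGGTT", "GTTATGATC", "CTTATGGAA"):
        print(fragment, kozak_sites(fragment))
```

Combining these positions with the start of the longest ORF tells you whether the candidate start codon also sits in a favorable context.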

  3. To which organism does this transcript belong? And what is the function of the protein it encodes?

To answer this third and final question we need the BLAST alignment algorithm. I mentioned this algorithm in a previous article (click here to read it), and I intend to dedicate a more detailed article to it. For now I'll just show you, in the video below, how I use it remotely to answer the question.
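
The same remote search can also be scripted. The sketch below uses only the Python standard library and the parameters of the NCBI BLAST URL API as I understand them (CMD, PROGRAM, DATABASE, QUERY, RID, FORMAT_TYPE). Treat it as an illustration: the function names are mine, the choice of blastn against nt is an assumption, and a real script should wait between requests and respect NCBI's usage guidelines.

```python
import re
from urllib.parse import urlencode
from urllib.request import urlopen

BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_submission(sequence, program="blastn", database="nt"):
    """URL-encode the parameters that submit a search (CMD=Put)."""
    params = {"CMD": "Put", "PROGRAM": program, "DATABASE": database, "QUERY": sequence}
    return urlencode(params).encode("ascii")


def results_url(rid, format_type="Text"):
    """Build the URL that retrieves the results of a submitted search (CMD=Get)."""
    return BLAST_URL + "?" + urlencode({"CMD": "Get", "RID": rid, "FORMAT_TYPE": format_type})


def submit(sequence, **kwargs):
    """Send the search to NCBI and return the request ID (RID), or None."""
    with urlopen(BLAST_URL, data=build_submission(sequence, **kwargs)) as response:
        page = response.read().decode("utf-8", errors="replace")
    match = re.search(r"RID = (\S+)", page)  # the reply page is expected to contain "RID = ..."
    return match.group(1) if match else None


if __name__ == "__main__":
    # Nothing is sent here: only show what the request body would look like.
    print(build_submission("GAAACAGCCCCCTCCAACCATGAAGTCC"))
```

Once submit returns a request ID, results_url gives the address to poll until the report is ready. The best hits in that report point to the organism and to the annotated function of the protein.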

Well, that is enough for today. I hope you enjoyed the article (let me know by leaving a comment or a "like"), but above all I hope I have shown you that to become a good bioinformatician you have to practice and that, like an investigator, you have to know how to ask your data the right questions in order to get the right answers.

Goodbye and see you soon.