One of the operations I perform most often during the day is searching for and retrieving biological sequences of interest. So I think it is worth showing you how this can be done, especially if you are a novice, so that you can "play" with some retrieved genes or proteins. I remember that at the beginning of my bioinformatics studies I enjoyed retrieving gene sequences, and I was delighted to look at that multitude of letters (representing the nitrogenous bases) following one another within the sequence. In short, obtaining a biological sequence of interest is a basic skill for a bioinformatician, and in this article I will explain how to do it.
Let's start with a simple premise. Biological sequences of interest can be retrieved in two ways: by searching generic databases or databases dedicated to the species in question, or by running specific commands in the terminal to extract the desired sequence from previously downloaded files. But let's look at some practical examples right away to better understand how to do it.
RETRIEVE SEQUENCES FROM DATABASE
Regarding the recovery of one or more sequences from a database, two important premises must be made:
- There are different types of databases, more or less specialized in certain kinds of sequences, and on the web you can also find databases dedicated to particular species or families, such as the Sol Genomics Network for the Solanaceae family.
- To speed up and optimize the database search it is necessary to use some Boolean operators such as:
- AND, allows you to connect two or more search terms. Logically it is the intersection of the queries: only the records that match all of the terms are returned.
- OR, allows you to search for several terms at a time. Logically it is the union of the queries: the records that match at least one of the terms are returned.
- NOT, allows you to exclude from the results all the records that match a certain term. It is therefore used to narrow a search by removing what you are not interested in.
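The set logic behind these three operators can be illustrated with a minimal Python sketch. The tiny "database" below (record IDs and keywords) is invented purely for illustration and has nothing to do with the real query syntax of any specific database.

```python
# Hypothetical toy "database": record ID -> keywords attached to that record.
records = {
    "rec1": {"tomato", "rubisco"},
    "rec2": {"tomato", "actin"},
    "rec3": {"potato", "rubisco"},
}


def matching(term):
    """Return the IDs of all records annotated with `term`."""
    return {rec_id for rec_id, words in records.items() if term in words}


tomato = matching("tomato")
rubisco = matching("rubisco")

both = tomato & rubisco      # tomato AND rubisco: intersection
either = tomato | rubisco    # tomato OR rubisco: union
excluded = tomato - rubisco  # tomato NOT rubisco: difference

print(sorted(both), sorted(either), sorted(excluded))
```

AND shrinks the result set, OR enlarges it, and NOT removes unwanted records from it.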
But let's look at some examples:
When you want to search for something in a database, you need to know in which database a particular sequence is easiest to find. Let me explain. If you are looking for a specific protein, it makes sense to query the UniProt database, which is maintained by the UniProt Consortium, or the Protein database managed by NCBI; it would make little sense to look for that protein in the Gene database. Right? Or again, if your intention is to download the tomato genome, it is easier to find it in the Genome database than in the Assembly database, where we mainly find information and statistics about the assembly of that genome. In short, the take-home message regarding the search for sequences or data in databases is:
Search for what you need in the right database to be more successful in your research.
In fact, no one would ever go out and buy a steak from the greengrocer and an apple from the butcher … Do you understand what I mean?
RETRIEVE SEQUENCES FROM FILE USING COMMANDS EXECUTED ON THE TERMINAL
In some cases we have a file containing many sequences and other information, but we only want to extract some of them. This is where commands executed on the terminal help us reach our goal easily.
Let's consider some typical situations to better understand how this can be done.
- We download the Arabidopsis thaliana genome from the Phytozome database and want to find the sequence "TGTAGGGATGAAGTCTTTCTTCGTTGT" in it. How can we do it? The grep command can be used, but keep in mind that grep only returns the lines in which the query sequence is found, so a sequence that is split across two lines of the file will not be matched.
In the video below I have shown how to use these three tools:
There would be much more to say about grep, but I prefer not to do it now because it would take me away from the real goal of this article, so I will just promise to talk about it in depth in a future article.
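Because grep works line by line, a query that happens to span a line break in a wrapped FASTA file is missed. As a complement, here is a minimal Python sketch (not grep itself) that joins the lines of each record before searching. The toy genome and the function names are made up for illustration, and only the forward strand is searched.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def find_motif(lines, motif):
    """Return (header, 1-based start position) for every occurrence of motif."""
    hits = []
    for header, seq in read_fasta(lines):
        start = seq.find(motif)
        while start != -1:
            hits.append((header, start + 1))
            start = seq.find(motif, start + 1)
    return hits


# Made-up toy genome: in Chr1 the motif is split across two lines.
toy = """>Chr1
ACGTTGTAGGGATGAAG
TCTTTCTTCGTTGTACG
>Chr2
GGGGCCCCAAAATTTT
""".splitlines()

motif = "TGTAGGGATGAAGTCTTTCTTCGTTGT"
print(find_motif(toy, motif))
```

In this toy file no single line contains the whole query, so a plain line-based search would report nothing, while the joined search finds it in Chr1.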
- Now let's consider another possible situation. Let's say we still have the Arabidopsis thaliana genome file and want to extract the entire chromosome 3 from it. To do this it is useful to use the samtools faidx tool. Watch the video below to understand how:
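On the terminal this kind of request usually takes the form `samtools faidx genome.fa Chr3`, where the file name and the sequence name are placeholders that depend on your own data. To make the idea concrete without samtools, here is a small Python sketch that pulls one whole record out of a multi-FASTA file by its name; the toy data and the function name are assumptions made for illustration.

```python
def fetch_record(lines, name):
    """Return the sequence whose header's first word equals `name`, or None."""
    keep, found, chunks = False, False, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if keep:
                break  # the wanted record has ended, stop reading
            # Compare only the first word of the header (the sequence name).
            keep = line[1:].split(None, 1)[:1] == [name]
            found = found or keep
        elif keep:
            chunks.append(line)
    return "".join(chunks) if found else None


# Made-up toy genome with three short "chromosomes".
toy_genome = """>Chr1 first chromosome
ACGT
>Chr3 third chromosome
GGCC
TTAA
>Chr4
AAAA
""".splitlines()

print(fetch_record(toy_genome, "Chr3"))
```

The name has to match the first word of the header exactly, so it is worth checking how the chromosomes are named in your file before asking for one.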
- The situation presented in the previous point is very frequent in the work of a bioinformatician, but how can we extract more than one sequence from a given genome? Let's say we have the tomato gene set and want to extract from it all the genes that are involved in some way in the synthesis and activity of RuBisCO. In the video below I show you how:
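One possible way to think about this task is: select every record whose header mentions the keyword, then write those records out. The Python sketch below follows that idea under a strong assumption, namely that the functional description is written in the FASTA header; the gene names and descriptions are invented.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def select_by_keyword(lines, keyword):
    """Return the records whose header contains `keyword` (case-insensitive)."""
    keyword = keyword.lower()
    return [(h, s) for h, s in read_fasta(lines) if keyword in h.lower()]


# Invented gene set: names, descriptions and sequences are placeholders.
toy_genes = """>gene1 ribulose bisphosphate carboxylase small chain (RuBisCO)
ATGGCT
TCCTCA
>gene2 actin
ATGGCA
>gene3 rubisco activase
ATGGCC
""".splitlines()

selected = select_by_keyword(toy_genes, "rubisco")
for header, seq in selected:
    print(f">{header}\n{seq}")
```

If the headers of your gene set contain only identifiers, the list of gene names has to come from an annotation file first, and the sequences are then retrieved by name.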
- Let's take another example of a situation that can be quite common. Given the Solanum tuberosum genome, find the headers of scaffolds 1 to 3 and then extract their sequences. To do this it is convenient to use regular expressions, see how in the video below:
Regular expressions also deserve a deeper treatment so, as in the case of grep, I will try to cover them properly in a dedicated article in the future.
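In the meantime, here is a small Python sketch of the scaffold example. It assumes the headers look like `>scaffold_1`, which is a naming convention invented for this example, so adapt the pattern to your own file. The important detail is anchoring the pattern at both ends, otherwise "scaffold_1" would also match "scaffold_10".

```python
import re

# Anchored pattern: matches scaffold_1, scaffold_2, scaffold_3 and nothing else.
SCAFFOLD_1_TO_3 = re.compile(r"^scaffold_[1-3]$")


def extract_scaffolds(lines, pattern):
    """Return {name: sequence} for records whose name matches `pattern`."""
    records, name = {}, None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            words = line[1:].split()
            # Keep the record only if the first word of the header matches.
            name = words[0] if words and pattern.match(words[0]) else None
            if name is not None:
                records[name] = ""
        elif name is not None:
            records[name] += line
    return records


# Made-up toy assembly with five scaffolds.
toy_assembly = """>scaffold_1
AAAA
>scaffold_2
CCCC
GG
>scaffold_3
TTTT
>scaffold_10
ACGT
>scaffold_31
GGGG
""".splitlines()

result = extract_scaffolds(toy_assembly, SCAFFOLD_1_TO_3)
print(result)
```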
- Another very common situation for a bioinformatician. We have in our "hands" a vcf file (Variant Call Format), that is, a file that shows in a matrix of rows and columns the various information (especially positional) relating to the polymorphisms found within a given genome. It can sometimes be useful to extract only specific information from this file, and this can be done with the awk command (I will also talk about this in a separate article so as not to bore you too much) or through the use of our dear regular expressions. In the video below I have shown an example for awk.
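With awk, a filter of this kind could look like `awk '!/^#/ && $1 == "Chr1" {print $1, $2, $4, $5}' file.vcf`, where the file and chromosome names are placeholders. The same column logic is sketched below in Python: skip the lines that start with "#", split each record on tabs, and keep the chromosome, position, reference allele and alternative allele. The toy records are invented.

```python
def vcf_fields(lines, chrom):
    """Return (CHROM, POS, REF, ALT) tuples for the variants on `chrom`."""
    rows = []
    for line in lines:
        if line.startswith("#"):
            continue  # skip meta-information lines and the column header
        fields = line.rstrip("\n").split("\t")
        # VCF columns: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, ...
        if len(fields) >= 5 and fields[0] == chrom:
            rows.append((fields[0], int(fields[1]), fields[3], fields[4]))
    return rows


# Invented toy VCF content (tab-separated).
toy_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "Chr1\t1050\t.\tA\tG\t50\tPASS\tDP=20",
    "Chr2\t300\t.\tC\tT\t99\tPASS\tDP=35",
    "Chr1\t2210\t.\tGT\tG\t42\tPASS\tDP=18",
]

for row in vcf_fields(toy_vcf, "Chr1"):
    print(*row, sep="\t")
```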
- Finally, it is useful to say that with samtools faidx it is also possible to extract a sequence from a fasta file using its coordinates, i.e. its position in the genome, because this tool indexes the fasta file before retrieving the sequence.
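With samtools a region is requested as, for example, `samtools faidx genome.fa Chr1:7-10` (placeholder names, 1-based coordinates). The first time the tool runs on a file it writes a `.fai` index that stores, for each sequence, its name, its length, the byte offset of its first base, the number of bases per line and the number of bytes per line. The Python sketch below is a simplified imitation of that idea, not the samtools code: it assumes every line of a record (except the last) has the same length and that every header has a name.

```python
import io


def build_index(handle):
    """Map name -> [length, offset, bases_per_line, bytes_per_line]."""
    index, name, offset = {}, None, 0
    for raw in handle:
        if raw.startswith(b">"):
            name = raw[1:].split()[0].decode()
            # The sequence starts right after the header line.
            index[name] = [0, offset + len(raw), 0, 0]
        elif name is not None and raw.strip():
            entry = index[name]
            bases = len(raw.strip())
            if entry[2] == 0:  # first sequence line defines the layout
                entry[2], entry[3] = bases, len(raw)
            entry[0] += bases
        offset += len(raw)
    return index


def fetch(handle, index, name, start, end):
    """Return bases start..end (1-based, inclusive) of sequence `name`."""
    length, offset, line_bases, line_width = index[name]
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    end = min(end, length)
    if length == 0 or start > length:
        return ""
    # Convert base positions to byte positions, accounting for newlines.
    first = offset + ((start - 1) // line_bases) * line_width + (start - 1) % line_bases
    last = offset + ((end - 1) // line_bases) * line_width + (end - 1) % line_bases
    handle.seek(first)
    data = handle.read(last - first + 1)
    return data.replace(b"\n", b"").replace(b"\r", b"").decode()


# Made-up toy genome held in memory; a real file opened with "rb" works the same.
genome = io.BytesIO(b">Chr1\nACGTACGT\nTTGGCCAA\nGG\n>Chr2\nAAAA\n")
fai = build_index(genome)
print(fetch(genome, fai, "Chr1", 7, 10))
```

Thanks to the index, only the requested bytes are read, which is why this kind of lookup stays fast even on very large genomes.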
Here we are at the end of this article. As always, I hope I have managed to explain complex concepts in the simplest way possible. Let me know in a comment what you think of the article, or if there is something you feel should be added or corrected.
I wish you a good day. See you soon.