In bioinformatics, and in data science more generally, the protagonists are the data. But could you give me a definition of data? It sounds like an abstract concept, but it is actually very simple: a datum is nothing more than a packet of information. In our specific case, a biological datum is a packet of biological information. If you think about it, the term bio-informatics itself tells us what the data are for: “bio” defines the field of interest, biology, and “informatics” refers to the transfer of information from one user to another. A bioinformatician works on information and results that come from analyses carried out by a researcher in the laboratory and are stored as data. The bioinformatic analysis of those data produces results that in turn become new data, which can help explain a specific biological phenomenon. In short, it is a sort of loop. What matters is that the data are analyzed and manipulated by the bioinformatician through specific software: in the field, people say that input data are “fed” to a program, which then “spits out” output data.
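To make the “feed in, spit out” idea concrete, here is a minimal sketch in Python. The function name gc_content and the example sequence are made up for illustration; the point is only that a program takes input data (a DNA sequence) and returns output data (here, the fraction of G and C bases).

```python
def gc_content(sequence: str) -> float:
    """Return the fraction of G and C bases in a DNA sequence."""
    sequence = sequence.upper()
    if not sequence:
        # An empty sequence has no bases, so report 0.0 instead of dividing by zero.
        return 0.0
    gc = sum(1 for base in sequence if base in "GC")
    return gc / len(sequence)


# Input data: a made-up DNA sequence, not taken from any real organism.
input_data = "ATGCGCTA"

# Output data: a new piece of information derived from the input.
output_data = gc_content(input_data)
print(f"GC content: {output_data:.2f}")
```

The output could itself be stored and used as input for a later analysis, which is the loop described above.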

One of the reasons why bioinformatics is going strong, with no sign of the trend slowing down, is that laboratories now produce very large amounts of data, thanks to the lower cost and higher technical efficiency of today's instruments. It is therefore easy to obtain an enormous amount of data, well described by the term big data. But the data alone do not explain how a process or an organism works: statistics must be applied to them in order to draw objective conclusions.

I guess you are wondering where these data are stored. On a giant hard drive with a stratospheric capacity? Obviously not. The data are stored in databases, where they are organized and can be downloaded by anyone.

A database is an organized set of data, made usable and understandable so that it can provide information to its users. Databases can be of two types:

  1. Primary (also called Archival): databases that contain raw data obtained directly from laboratory analyses.
  2. Secondary (also called Curated): databases that contain processed data derived from the interpretation of the raw data.

Examples of databases are:

  • GenBank, a database of nucleotide sequences managed by the NCBI (https://www.ncbi.nlm.nih.gov/genbank/).
  • UniProt, a protein database born from the collaboration between the European Bioinformatics Institute (EBI), the Protein Information Resource (PIR) and the Swiss Institute of Bioinformatics (SIB) (https://www.uniprot.org/).
  • KEGG (Kyoto Encyclopedia of Genes and Genomes), a database that integrates genomic, chemical and systemic functional information (https://www.genome.jp/kegg/kegg1.html).

Furthermore, to submit a biological datum to a database, such as a DNA sequence, you can use tools such as BankIt (https://www.ncbi.nlm.nih.gov/WebSub/) or Sequin (https://www.ncbi.nlm.nih.gov/Sequin/). Before being accepted, submissions are reviewed by the institute that manages the database.

Finally, I remind you that bioinformatics data files, like many other files, have an extension in their name that allows them to be distinguished. There are many different data formats; some of the most common are .fasta, .fastq, .sra, .gff and many others. I will go into more detail on some of the most common ones.
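To give a first taste of what one of these formats looks like, here is a minimal sketch in Python that reads FASTA-formatted text. In a FASTA file each record starts with a header line beginning with ">", followed by one or more lines of sequence. The function name read_fasta and the two example records are made up for illustration, and the sketch ignores edge cases such as duplicate headers; for real work a dedicated library such as Biopython is usually a safer choice.

```python
from io import StringIO
from typing import Dict, List, Optional, TextIO


def read_fasta(handle: TextIO) -> Dict[str, str]:
    """Read FASTA text and return a dict mapping each header to its sequence."""
    records: Dict[str, List[str]] = {}
    header: Optional[str] = None
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record begins: keep the header without the leading ">".
            header = line[1:]
            records[header] = []
        elif header is not None:
            # Sequence lines belong to the most recent header.
            records[header].append(line)
    # Join the sequence lines of each record into a single string.
    return {name: "".join(parts) for name, parts in records.items()}


# Made-up FASTA content, held in memory so the example needs no file on disk.
example = StringIO(">seq1 made-up example\nATGC\nGGTA\n>seq2\nTTAA\n")
sequences = read_fasta(example)
for name, seq in sequences.items():
    print(name, len(seq))
```

To read an actual file, the same function can be given an open file handle, for example inside a with open(...) block.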

That’s all for today. As always, I invite you to leave a comment below, to subscribe to the blog if you have not done so yet, and to leave a like if you think the post deserves it.

Bye and see you soon.