Data science is a discipline that allows us to solve problems of various kinds through data analysis.
This is an interdisciplinary subject that draws on several kinds of expertise: statistics and mathematics, programming, data cleaning and management, and much more. A data scientist is therefore a professional capable of answering questions of interest through data analysis. I like to compare the data scientist to an investigator who looks for the culprit of a crime by studying the evidence collected.
Data science is usually represented with a Venn diagram that describes its three main areas:
- Substantive expertise: The field of study. In the case of bioinformatics, this is biology. The nature of the data used by the data scientist depends precisely on the field of study in question. For example, if our field is genomics, we will very likely work with data about DNA sequences.
- Hacking skills: The programming skills needed to give instructions to the computer so that it can analyze the data in our place and therefore solve the problems posed.
- Math & statistical knowledge: In order to answer the questions posed, the data scientist must apply statistical models which, based on mathematics, make it possible to analyze the available data and draw conclusions from it.
When it comes to data science, it is immediately clear that the main players in this discipline are … the data. A datum is a piece of information. It can be a number, a word, a fact … any kind of information. Data are "contained" in files with different extensions depending on the type of data. For example, my dog's name is "Balù". This is a piece of information, a datum. To make it known to the computer I have to put it in a file, such as a text file with the extension .txt.
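As a minimal sketch of this idea, the snippet below stores the dog's name in a text file and reads it back. The file name `dog_name.txt` and the temporary folder are assumptions made only for this illustration.

```python
import tempfile
from pathlib import Path

# Hypothetical location: a temporary folder, so nothing on disk is overwritten.
folder = Path(tempfile.mkdtemp())
data_file = folder / "dog_name.txt"

# Write the datum to the file; UTF-8 keeps the accented "ù" intact.
data_file.write_text("Balù", encoding="utf-8")

# Read it back: the information is now available to the computer.
dog_name = data_file.read_text(encoding="utf-8")
print(dog_name)
```

Stating the encoding explicitly is a deliberate choice here: without it, an accented character such as "ù" may be read back incorrectly on some systems.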
Today we are inundated with data. The development of increasingly powerful processors, the birth of social networks and, in general, the enormous technological advances of our age have led to an abundance of data that is described by the term "Big Data". The concept of big data is characterized by the so-called "three Vs":
- Volume: The data are very numerous and generally stored in databases (organized collections of data) which continuously grow in size.
- Velocity: Data is generated and collected extremely quickly.
- Variety: Data comes in many different types and formats (numbers, text, images and so on).
Having a lot of data available can be very useful, but the main problem with big data is that the data is often redundant, raw or not very useful for solving the specific problem. For this reason, one of the tasks of the data scientist is to extract from the disorder of big data, for example by cleaning it, the so-called "smart data" that is actually useful for the study being carried out.
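To make the idea of going from raw data to "smart data" a little more concrete, here is a small sketch that removes redundant and unusable records. The records, the identifiers and the function name `to_smart_data` are all hypothetical, and real cleaning usually involves many more checks than these two.

```python
# Hypothetical raw records: (sample_id, dna_sequence).
raw_records = [
    ("s1", "ATGC"),
    ("s2", "GGTA"),
    ("s1", "ATGC"),  # exact duplicate: redundant
    ("s3", ""),      # empty sequence: not useful
    ("s4", "TTAC"),
]


def to_smart_data(records):
    """Drop empty sequences and exact duplicates, keeping the original order."""
    seen = set()
    cleaned = []
    for record in records:
        _sample_id, sequence = record
        if not sequence:
            continue  # nothing to analyze in this record
        if record in seen:
            continue  # already kept an identical record
        seen.add(record)
        cleaned.append(record)
    return cleaned


smart_data = to_smart_data(raw_records)
print(smart_data)
```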
Another interesting way to define data is to consider it as the values of a variable. Taking the example above, my dog's name can be understood as a variable and the name "Balù" (the datum) is its value. In this regard, it is necessary to distinguish two types of variables:
- Qualitative variables: They describe the qualitative characteristics of an element or object (e.g. the color of a flower, the sex of the individuals in the population under examination, and so on). Qualitative variables usually have non-numeric values, and these can be ordered (ordinal) or unordered (nominal). The values of qualitative variables are collected mainly through observation.
- Quantitative variables: They describe the measurable characteristics of an element (e.g. the height of the people in the population under examination). The values of quantitative variables are numerical, either discrete (counts) or continuous (measurements), they can be sorted, and they are obtained by counting or measuring.
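The difference between the two kinds of variables shows up in what we can sensibly compute with them. The sketch below uses invented values and only the Python standard library: an unordered qualitative variable can be counted, an ordered one can be sorted once we state the order, and a quantitative one can be averaged.

```python
from collections import Counter
from statistics import mean

# Hypothetical observations, invented for this example.
flower_colors = ["red", "white", "red", "yellow"]  # qualitative, unordered
plant_sizes = ["large", "small", "medium"]         # qualitative, ordered
heights_cm = [172.5, 160.0, 181.0, 168.5]          # quantitative

# Unordered qualitative values can be counted, but not averaged.
color_counts = Counter(flower_colors)
most_common_color = color_counts.most_common(1)[0][0]

# Ordered qualitative values can be sorted once the order is made explicit.
size_order = {"small": 0, "medium": 1, "large": 2}
sorted_sizes = sorted(plant_sizes, key=size_order.get)

# Quantitative values support arithmetic and sort naturally.
average_height = mean(heights_cm)
sorted_heights = sorted(heights_cm)

print(most_common_color, sorted_sizes, average_height, sorted_heights)
```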
It is extremely useful to understand the data you work with during a project, but to do this you need to visualize and observe it, a bit like a sculptor observes a block of marble to understand what it can become … what the cold marble somehow wants to communicate. A very simple way to represent data is by means of tables or matrices like the one shown in the image below. In a table the columns are the variables and the rows are the samples, while each cell holds the value of one variable for one sample: our data, after all.
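The same layout can be sketched in a few lines of Python. The samples, variable names and values are invented for illustration; each dictionary is a row (a sample) and each key is a column (a variable).

```python
# Hypothetical table: one dictionary per sample (row), one key per variable (column).
table = [
    {"sample": "plant_1", "color": "red", "height_cm": 12.5},
    {"sample": "plant_2", "color": "white", "height_cm": 9.0},
    {"sample": "plant_3", "color": "red", "height_cm": 14.0},
]


def column(rows, variable):
    """Return the values of one variable across all samples."""
    return [row[variable] for row in rows]


heights = column(table, "height_cm")

# Print the table with aligned columns so it can be inspected by eye.
print(f"{'sample':<10}{'color':<8}{'height_cm':>10}")
for row in table:
    print(f"{row['sample']:<10}{row['color']:<8}{row['height_cm']:>10}")
```

In practice a dedicated library is normally used for tabular data; plain lists and dictionaries are used here only to keep the example self-contained.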
Unfortunately, data is rarely so structured and ordered when it is generated and collected, so the data scientist must often clean it up, put it in the correct format, sort it and store it so that it is easy to use. Once again we need to make order out of disorder. Make a note of this concept because it is very true. One thing that immediately seemed clear to me is that to become a data scientist, and in particular a bioinformatician, you need to be organized. You have to be able to trace a path and follow it in an orderly way, because it is easy to get lost in the whirlwind workflow of studying data.
So what is the reference workflow that a data scientist follows during a project?
Well, there are two reference workflows that can be followed. They are essentially very similar, so I will present only the scheme described by Chanin Nantasenamat.
In general, there are five fundamental steps that must be followed to complete a study on the data.
- Understand the problem to be solved and ask yourself the right question. This is the first step, but I would also say the most important one. If the question we want to answer is unclear or wrong, the whole study will be wrong or meaningless. Where I come from, we have a saying: "Those who start well are halfway done", meaning that it is important to start any process under the best conditions in order to perform at your best. I think this saying also applies to our specific case. Remember … the question is essential.
- Collect the data you need to solve the problem. If I want to make a cake I have to get eggs, flour, milk and other useful ingredients. I certainly can't expect to make a cake by gathering ants, glass marbles and other things that are of no use for the purpose.
- Clean the collected data, which is often rough and imperfect, and impute it (replace the missing values with estimated ones). Descriptive statistical methods are often applied in this phase, because they make it possible to assess the quality of the available data and the pre-processing it needs.
- Apply the algorithms and models needed to study and analyze the data, in order to solve the problem or answer the question initially asked.
- Once the model has been applied, we can take one of two paths. If the results of the data analysis answer our question, we present them through graphs and descriptions and store them in a database. If, on the other hand, the results are not useful for solving the problem, it is necessary to go back: maybe collect new data, repeat the pre-processing, apply new algorithms and models, or even ask new questions that previously escaped us.
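The cleaning step of this workflow can be sketched with the standard library alone. Everything here is an assumption made for illustration: the measurements are invented, `None` is used to mark a missing value, and mean imputation is only one simple strategy among many, not necessarily the right one for a given study.

```python
from statistics import mean, median, stdev

# Hypothetical measurements; None marks a missing value.
raw_heights_cm = [170.0, None, 165.0, 180.0, None, 175.0]


def impute_with_mean(values):
    """Replace each missing value with the mean of the observed ones.

    Raises statistics.StatisticsError if every value is missing.
    """
    observed = [v for v in values if v is not None]
    fill_value = mean(observed)
    return [fill_value if v is None else v for v in values]


clean_heights_cm = impute_with_mean(raw_heights_cm)

# Descriptive statistics give a first impression of the cleaned data.
summary = {
    "n": len(clean_heights_cm),
    "n_missing": sum(v is None for v in raw_heights_cm),
    "mean": mean(clean_heights_cm),
    "median": median(clean_heights_cm),
    "stdev": stdev(clean_heights_cm),
}
print(summary)
```

Note that filling gaps with the mean leaves the mean unchanged but makes the data look less spread out than it really is, which is one reason to report how many values were missing alongside the summary.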
Good. I have already said a lot for today. What is written in this article is the result of the study I carried out in this first week, in which I started a journey from scratch. It is a bit like constructing a building starting from the foundations. My goal? To become a really good bioinformatician. The journey has just begun, but I am already learning a lot and, above all, I am having fun. If you also like this initiative, show your appreciation with a like or a comment below.
Bye and see you soon.