I think this article may be a little heavy for many of you, but on Instagram you asked me to cover somewhat more technical content while simplifying it as much as possible. Many studies and analyses, including ones I have carried out myself, start from the sequencing process, so it is essential to know more about it. But let's go in order.

A bioinformatician often works with several molecules that are essential to the life of an organism, but the one most often in the spotlight is DNA. DNA stands for deoxyribonucleic acid, a nucleic acid made of building blocks called nucleotides. These bind together to form a chain, which in turn pairs side by side with a second nucleotide chain running in the opposite (antiparallel) direction. The DNA molecule therefore appears as a double helix, held together by hydrogen bonds between the two chains and by covalent bonds between the nucleotides of the same chain. Each nucleotide, the building block mentioned above, is itself built around a sugar, deoxyribose. At carbon 3 the sugar is linked to a phosphate group, which in turn binds carbon 5 of the sugar of the adjacent nucleotide in the chain. At carbon 1 the sugar is linked to a nitrogenous base, which forms hydrogen bonds with the nitrogenous base of the facing nucleotide on the antiparallel chain.

It should be noted, however, that there are 4 different types of nucleotides in the DNA molecule, usually named after their nucleoside (in DNA, strictly speaking, the deoxy forms):

  • Adenosine (deoxyadenosine), whose nitrogenous base is adenine (indicated with the capital letter A)
  • Cytidine (deoxycytidine), whose nitrogenous base is cytosine (C)
  • Guanosine (deoxyguanosine), whose nitrogenous base is guanine (G)
  • Thymidine, whose nitrogenous base is thymine (T)

The nitrogenous bases on the two DNA strands also bind to each other following a very precise and conserved pattern: adenine binds to thymine with two hydrogen bonds, and cytosine binds to guanine with three hydrogen bonds.
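The pairing rule above is simple enough to try out in a few lines of Python. This is only a small sketch: the function names (`complement`, `reverse_complement`, `count_hydrogen_bonds`) are my own choices for illustration, not part of any library.

```python
# Pairing rules described above: A pairs with T, C pairs with G.
PAIRING = {"A": "T", "T": "A", "C": "G", "G": "C"}

# Hydrogen bonds contributed by each base pair (A-T: 2, C-G: 3).
HYDROGEN_BONDS = {"A": 2, "T": 2, "C": 3, "G": 3}


def complement(strand: str) -> str:
    """Return the base that faces each position on the opposite strand."""
    return "".join(PAIRING[base] for base in strand.upper())


def reverse_complement(strand: str) -> str:
    """Return the opposite strand read in its own direction.

    The two strands are antiparallel, so the paired strand is reversed.
    """
    return complement(strand)[::-1]


def count_hydrogen_bonds(strand: str) -> int:
    """Count the hydrogen bonds holding this strand to its partner."""
    return sum(HYDROGEN_BONDS[base] for base in strand.upper())


print(complement("ATGC"))            # TACG
print(reverse_complement("ATGC"))    # GCAT
print(count_hydrogen_bonds("ATGC"))  # 10
```

Because C-G pairs contribute three bonds instead of two, a stretch rich in C and G is held together by more hydrogen bonds than an A-T rich stretch of the same length.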

DNA is the seat of an organism's genetic information, written in a code called, precisely, the genetic code. It is made up of 3-letter words called codons, and its alphabet has 4 letters: A stands for adenine, C for cytosine, G for guanine and T for thymine. Each letter therefore refers to the corresponding nitrogenous base of the nucleotides that make up the DNA double helix. The main features of the genetic code are:

  • Each codon refers to a specific amino acid (amino acids are the building blocks of proteins). But beware: the genetic code is redundant. As we can see from the diagram below, some codons refer to, or as we say encode, the same amino acid. This redundancy arises because there are only 20 standard amino acids, while there are many more codons: 64 in total.
  • Of the 64 codons, 61 are called sense codons because they code for specific amino acids, while the remaining 3 are called nonsense or stop codons because they encode stop signals, i.e. signals that establish at what point the assembly of a protein must stop.
  • Each codon is read in the same way by all living things, which is why the genetic code is described as universal. But be careful! There are exceptions to this rule: some organisms have been found to read certain codons differently.
  • Finally, the genetic code has a single way of being read: all living beings read it in the same direction and without interruptions.
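The numbers in the list above (64 codons, 61 sense codons, 3 stop codons, 20 amino acids) can be checked with a short Python sketch. It assumes the standard genetic code written with the DNA letters used in this article (T instead of the U found in RNA) and amino acids in one-letter notation. The names `CODON_TABLE` and `translate` are mine, chosen only for illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon, listed in TCAG order.
# "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map every 3-letter word (TTT, TTC, ..., GGG) to its amino acid.
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Read codons one after another, without gaps, until a stop codon."""
    protein = []
    for start in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[start:start + 3].upper()]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


stop_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")
sense_codons = [c for c, aa in CODON_TABLE.items() if aa != "*"]
leucine_codons = [c for c, aa in CODON_TABLE.items() if aa == "L"]

print(len(CODON_TABLE), len(sense_codons), stop_codons)  # 64 61 ['TAA', 'TAG', 'TGA']
print(len(leucine_codons))                               # 6
print(translate("ATGGCCTTATAA"))                         # MAL
```

Leucine (L) alone is encoded by six different codons, which is the redundancy described in the first point of the list.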

With this lengthy introduction I want you to understand how important it is to know the message carried by a DNA molecule: knowing its nucleotide sequence is the starting point for understanding its function and its biological importance.

One of the most important preliminary techniques for studying DNA is sequencing, that is, obtaining in a file the succession of nitrogenous bases that make up a given DNA molecule. It is also important to know that the RNA of an organism can be studied in the same way: RNA can be converted into DNA (called cDNA in this case) by an enzyme called reverse transcriptase and then sequenced.

There are two basic steps in sequencing DNA or cDNA:

  1. Building a library. To sequence a DNA or cDNA molecule, or even the entire genome or transcriptome of an organism, it must first be fragmented to make it easier to manipulate and sequence. The set of DNA fragments obtained is called a library.
  2. Sequencing. Once the library of DNA or cDNA fragments has been obtained, the fragments are sequenced.
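To make the idea of a library more concrete, here is a toy Python simulation of step 1. Everything in it is an assumption made for illustration: the function names (`fragment`, `build_library`), the random cut points, the fragment lengths and the made-up 200-base "genome". Real libraries are prepared at the bench, not in software.

```python
import random


def fragment(sequence: str, min_len: int, max_len: int, rng: random.Random) -> list[str]:
    """Cut one copy of `sequence` end to end into pieces of random length.

    The last piece may be shorter than `min_len` if little sequence is left.
    """
    fragments = []
    start = 0
    while start < len(sequence):
        length = rng.randint(min_len, max_len)
        fragments.append(sequence[start:start + length])
        start += length
    return fragments


def build_library(sequence: str, copies: int, min_len: int, max_len: int, seed: int = 0) -> list[str]:
    """Fragment several copies of the same molecule and pool the pieces."""
    rng = random.Random(seed)
    library = []
    for _ in range(copies):
        library.extend(fragment(sequence, min_len, max_len, rng))
    return library


# A made-up 200-base "genome", generated only for this example.
genome = "".join(random.Random(42).choice("ACGT") for _ in range(200))
library = build_library(genome, copies=5, min_len=20, max_len=40)

print(len(library), "fragments in the library")
print(library[0])
```

Since each copy is cut at different random positions, the same stretch of the genome ends up in several different, overlapping fragments.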

DNA or cDNA sequencing can be of two types:

  • Partial sequencing, when only one or a few regions of the genome (DNA) or transcriptome (the set of RNAs of an organism, converted into cDNA) are sequenced. If the sequenced region is chosen at random, it is referred to as "partial random sequencing"; if it is selected by the user because it is of specific interest to a particular study, it is referred to as "partial target sequencing".
  • Whole genome sequencing, when the whole genome or transcriptome of an organism is sequenced.

The sequencing of a DNA or cDNA molecule (RNA converted into DNA by reverse transcription) can follow two different approaches:

  • Hierarchical (also called clone-by-clone or top-down), i.e. sequencing that first builds a low-resolution physical map from the sequencing of large DNA sequences (called contigs). This map is then used to orient and assemble the sequences of small sequenced DNA fragments (called reads) in order to obtain a high-resolution physical map. It should be noted that this approach is now obsolete and therefore no longer used.
  • Shotgun, i.e. a sequencing method that builds a high-resolution physical map directly. In this case only small fragments of DNA or cDNA are sequenced, and the reads obtained are then assembled by dedicated bioinformatics software to build a high-resolution physical map of the sequence, genome or transcriptome under study. This is the approach used today, because it produces a physical map in less time: it does not require a low-resolution physical map to be built first as a reference.
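To give an idea of what "assembling the reads" means in the shotgun approach, here is a deliberately naive Python sketch that repeatedly merges the two reads sharing the longest overlap. It is a toy greedy strategy that I chose for illustration, not how real assembly software works, and the function names (`overlap`, `greedy_assemble`) and the three example reads are hypothetical.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Overlaps shorter than `min_len` are ignored and reported as 0.
    """
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Merge the best-overlapping pair of reads until no overlap is left."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads that are entirely contained in another read.
    reads = [r for r in reads if not any(r != o and r in o for o in reads)]

    while len(reads) > 1:
        best_size, best_i, best_j = 0, -1, -1
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing left to merge
        merged = reads[best_i] + reads[best_j][best_size:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Three overlapping reads from the made-up sequence ATGGCCTTAACGGATC.
reads = ["AACGGATC", "ATGGCCTT", "CCTTAACG"]
print(greedy_assemble(reads))  # ['ATGGCCTTAACGGATC']
```

With only three clean reads the original sequence is recovered. With repeated regions or sequencing errors a greedy merge like this one can easily go wrong, which is one reason dedicated assembly software is needed.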

At this point, I think it makes sense to talk about the factors that influence the sequencing of a DNA or cDNA molecule. There are several, but the ones that most influence the sequencing result are:

  • The sequencing technique, i.e. the sequencing method used, which directly affects the length of the reads obtained. Moreover, depending on the technique, it is possible to sequence only one end of the library fragments or both ends. In the first case we speak of single-end sequencing, in the second of paired-end sequencing.
Diagram showing the main sequencing techniques.
  • Representativeness of the library. A library must be representative of the genome (in the case of genomic libraries) or transcriptome (in the case of cDNA libraries) of the individual studied, i.e. it must contain all the fragments of the genome or transcriptome, so that each fragment has an equal chance of being sequenced and studied. In other words, the library must be redundant: it must contain multiple copies of the same DNA or cDNA fragment. Different methods can be used to estimate the representativeness of a library, depending on the sequencing technique. With the Sanger technique, the library is representative if N, the number of clones of the fragments actually contained in the library, is greater than n, the theoretical number of clones. To calculate N and n the following mathematical relations are used:

Where:
P is the probability we want to have of finding a given sequence within the library.
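The relations themselves are not shown here. As a minimal sketch, I assume they correspond to the classic Clarke and Carbon formula, in which the number of clones needed is ln(1 - P) / ln(1 - f), where f is the ratio between the average fragment size and the genome size, and in which the theoretical minimum number of clones is simply the genome size divided by the fragment size. The function names and the genome and fragment sizes in the example are made-up values for illustration.

```python
import math


def theoretical_clones(genome_size_bp: int, fragment_size_bp: int) -> float:
    """Minimum number of clones needed to span the genome exactly once."""
    return genome_size_bp / fragment_size_bp


def clones_needed(genome_size_bp: int, fragment_size_bp: int, probability: float) -> int:
    """Clones needed so that a given sequence is in the library with probability P.

    Assumes the Clarke and Carbon relation: ln(1 - P) / ln(1 - f),
    with f = fragment size / genome size.
    """
    fraction = fragment_size_bp / genome_size_bp
    return math.ceil(math.log(1 - probability) / math.log(1 - fraction))


# Hypothetical example: a 4.6 million base genome cloned as 20,000-base fragments.
genome_size = 4_600_000
fragment_size = 20_000

print(theoretical_clones(genome_size, fragment_size))   # 230.0
print(clones_needed(genome_size, fragment_size, 0.99))  # many more than 230
```

Under this assumption, a library needs several times more clones than the theoretical minimum before we can be 99% confident that a given sequence is in it, which is the redundancy described above.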

For second and third generation sequencing techniques, also called NGS (Next Generation Sequencing) techniques, the representativeness of a library (both DNA and cDNA) is given by the coverage level, i.e. the average number of times the same DNA or cDNA sequence is sequenced. The higher the coverage, the greater the confidence that all the fragments of the library have been sequenced.
The coverage level (cl) is calculated as follows:

$$cl = \frac{\text{output}}{\text{size of the genome or transcriptome (in bases)}}$$

By output we mean the total number of nitrogenous bases sequenced on the flow cell, i.e. the support on which the DNA sequencing process takes place. Since the output is specific to each flow cell type, choosing the type of flow cell lets us choose the output and therefore the coverage level.
It is worth pointing out that when a genome is sequenced completely for the first time, very high coverage is needed, usually 40X or more, i.e. a coverage at which the same sequence is sequenced 40 or more times on average. If, on the other hand, the genome has already been sequenced and we just want to compare genomes to look for variations, which is called re-sequencing, a coverage level of 3-5X is sufficient.
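The coverage reasoning above can be turned into a couple of one-line calculations. This sketch assumes that the coverage level is the output divided by the size of the genome, both counted in bases, in line with the definition of output given above. The function names and the 5 million base genome are hypothetical, chosen only to keep the numbers round.

```python
def coverage_level(output_bases: int, genome_size_bases: int) -> float:
    """Average number of times each base of the genome is sequenced."""
    return output_bases / genome_size_bases


def output_required(target_coverage: float, genome_size_bases: int) -> float:
    """Total number of sequenced bases needed to reach a target coverage."""
    return target_coverage * genome_size_bases


# Hypothetical 5 million base genome.
genome_size = 5_000_000

# A run producing 200 million bases gives 40X coverage.
print(coverage_level(200_000_000, genome_size))  # 40.0

# Output needed for a first complete sequencing (40X) versus re-sequencing (5X).
print(output_required(40, genome_size))  # 200000000
print(output_required(5, genome_size))   # 25000000
```

Read this way, choosing a flow cell comes down to checking that its output is at least the target coverage multiplied by the size of the genome.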

That brings us to the end for today. If you liked the article, or if you have a request for clarification or some constructive criticism, I would be very pleased to hear it, perhaps in a comment.

Bye and see you soon.