In previous articles I talked about how a molecule of DNA or cDNA (that is, RNA retro-transcribed into DNA by the enzyme reverse transcriptase, which is then called cDNA) can be sequenced, and how, by assembling the sequencing products, it is possible to reconstruct the sequence of the entire starting DNA or cDNA, i.e. the consensus sequence.

Thanks to the feedback received from one of you, I realized that I may have created a little confusion on a topic that I think is crucial for a bioinformatician. I will therefore try to take you by the hand and, as if I were a tour guide, walk you through how DNA is sequenced and assembled.

To do this I ask you to use a little imagination. Let's step into the professional life of Ivar, a novice bioinformatician who works at a research center that studies various microorganisms. His boss asks him to sequence the genome of the alga Chlamydomonas reinhardtii and to assemble it in order to obtain the entire consensus sequence.

Let's see how Ivar proceeds to carry out his boss's request:

  1. First of all, he dedicates himself to a series of laboratory techniques (also called WET techniques) that allow the extraction and sequencing of the genome of Chlamydomonas reinhardtii:
    • DNA extraction from the algae cells.
    • Quantification of DNA extracted from Chlamydomonas reinhardtii and evaluation of DNA quality.
    • Fragmentation of the extracted DNA and addition of Illumina adapters (if, for example, Illumina second-generation sequencing is chosen) to build the library. Remember that the library is simply the set of all the DNA fragments to be sequenced, with the adapters attached.
    • Bridge PCR (bPCR), i.e. bridge amplification of the library fragments.
    • Illumina paired-end sequencing, which returns the reads obtained by sequencing both ends of each fragment in two separate files in FASTQ format. These files can be deposited in archives such as the Sequence Read Archive (SRA), from which they can later be retrieved.
  2. Now Ivar has the FASTQ files of the reads from the two ends of each fragment and can therefore proceed with the bioinformatic analyses that assemble the reads into the consensus sequence of the entire sequenced genome. This type of analysis is called DRY. It requires the following steps:
    • Retrieve the raw FASTQ files of the paired-end reads from the SRA, if they have been stored there.
    • Remove the adapters from the reads and filter them by quality value with one of the available cleaning algorithms, such as the sliding window algorithm.
    • Check with FastQC whether the cleanup of the reads was successful.
    • Assemble the cleaned reads using Velvet in order to obtain contigs in FASTA format. Scaffolds and superscaffolds can also be obtained, but contigs are the most commonly used output. Velvet uses a de Bruijn graph algorithm for assembly.
  3. Now Ivar has to figure out whether the assembly has been done correctly. To do this, he uses tools that compute statistics indicative of the quality of an assembly. There are several useful metrics, as can be seen on Wikipedia, but among the most widely used are N50, L50 and N90 (see the Wikipedia page for the definitions). To get these metrics you can use QUAST (version 5.0.2).
Figure 1. Genome assembly
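To make the read-cleaning step a little less abstract, here is a minimal Python sketch of one variant of sliding window quality trimming. It assumes Phred+33 quality encoding (the one used in current Illumina FASTQ files). The function names, the window size of 4 and the threshold of 20 are my own illustrative choices, not the defaults of any particular tool.

```python
def phred33_scores(quality_string):
    """Decode a FASTQ quality string (assumed Phred+33) into integer scores."""
    return [ord(ch) - 33 for ch in quality_string]


def sliding_window_trim(sequence, quality_string, window=4, min_mean_quality=20):
    """Cut the read at the start of the first window whose mean quality
    drops below min_mean_quality. Reads shorter than the window are
    returned unchanged in this simplified version."""
    scores = phred33_scores(quality_string)
    for start in range(len(scores) - window + 1):
        chunk = scores[start:start + window]
        if sum(chunk) / window < min_mean_quality:
            return sequence[:start], quality_string[:start]
    return sequence, quality_string


# 'I' encodes quality 40 and '#' encodes quality 2 in Phred+33.
print(sliding_window_trim("ACGTACGTAC", "IIIIII####"))
# ('ACGTA', 'IIIII')
```

Note that the cut happens where the failing window starts, so a good base right before a run of bad ones can be lost too. Real trimming tools refine this in different ways.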

In the video below I tried to show you the individual steps of the DRY analysis:

I hope at this point you understand how the assembly of reads obtained from sequencing is achieved.
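If you are curious about what a de Bruijn graph assembler does in principle, the toy Python sketch below may help. It is not Velvet's implementation (real assemblers also correct errors, resolve branches and use paired-end information). It only builds the graph from the k-mers of a few error-free reads and follows the path as long as there is exactly one way forward. The reads, the value of k and the function names are invented for the example.

```python
from collections import defaultdict


def build_de_bruijn_graph(reads, k):
    """Nodes are (k-1)-mers; each k-mer in a read adds an edge prefix -> suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unambiguous_path(graph, start):
    """Extend a contig from start while the current node has exactly one successor."""
    contig = start
    node = start
    visited = {start}
    while len(graph.get(node, ())) == 1:
        (next_node,) = graph[node]
        if next_node in visited:  # stop instead of looping forever on a cycle
            break
        contig += next_node[-1]
        visited.add(next_node)
        node = next_node
    return contig


# Three overlapping, error-free reads from the made-up sequence ATGGCGTGCA.
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = build_de_bruijn_graph(reads, k=4)

# A natural starting point is a node that no edge points to.
targets = {t for successors in graph.values() for t in successors}
starts = [node for node in graph if node not in targets]

print(starts)                                    # ['ATG']
print(walk_unambiguous_path(graph, starts[0]))   # ATGGCGTGCA
```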

One thing that I omitted in previous articles, so as not to create too much confusion, is that sequencing and the subsequent assembly can be of two types, depending on whether or not a high-quality reference genome is available:

  • De novo sequencing, when there is no reference genome; in this case the assembly is more prone to errors. To evaluate the quality of the assembly obtained, one looks at the number of scaffolds and contigs required to represent the genome, the proportion of reads that can be assembled, the absolute length of contigs and scaffolds, and the length of contigs and scaffolds relative to the size of the genome. These quantities are summarized by metrics that describe the quality of the assembly. As mentioned above, there are several of them, and among the most important is N50: sort the contigs from longest to shortest and add up their lengths until you reach 50% of the total length of all contigs; N50 is the length of the last (shortest) contig you had to add. I know, I know. I have confused you, but perhaps the image below will help you understand this metric.
Figure 2. N50 explanation
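For those who prefer code to pictures, here is a small Python sketch of how N50, L50 and N90 can be computed from a list of contig lengths. The function name and the contig lengths are made up for illustration. In practice a tool such as QUAST reports these values for you.

```python
def nx_and_lx(contig_lengths, percent):
    """Return (Nx, Lx): the length of the shortest contig in the smallest set
    of longest contigs covering at least `percent` % of the total length,
    and the number of contigs in that set."""
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        # Integer comparison avoids floating-point rounding issues.
        if running * 100 >= total * percent:
            return length, count
    raise ValueError("contig_lengths must not be empty")


# Hypothetical contig lengths (in kb); the total is 300.
lengths = [80, 70, 50, 40, 30, 20, 10]

print(nx_and_lx(lengths, 50))  # (70, 2): 80 + 70 = 150, which is 50% of 300
print(nx_and_lx(lengths, 90))  # (30, 5): 80 + 70 + 50 + 40 + 30 = 270
```

So for this toy assembly N50 is 70 kb, L50 is 2 contigs and N90 is 30 kb.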

However, the N50 metric can be misleading. For example, a first assembly of Ciona intestinalis had an N50 of 234 kilobases, while a later assembly extended the N50 more than tenfold. Further analysis showed that the later assembly was missing several conserved genes, perhaps because the algorithms discarded repetitive sequences, and this is not an isolated case.

  • Re-sequencing, when the genome under study is assembled by “mapping” the reads onto a high-quality genome of the same species, which is used as a reference. It is like doing a jigsaw puzzle while using the picture of the finished result, usually printed on the front of the box, to work out where each piece goes.
Figure 3. Differences between de novo assembly and re-sequencing.
Source: https://journals.plos.org/ploscompbiol/article/figures?id=10.1371/journal.pcbi.1002821
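To give an idea of what “mapping” a read onto a reference means, here is a deliberately naive Python sketch based on a seed-and-verify strategy: the first k bases of the read are looked up in an index of the reference, and each candidate position is then checked base by base. Real read mappers use far more efficient indexes and also handle insertions, deletions and the reverse strand. The reference, the reads, the parameters and the function names below are all invented for the example.

```python
from collections import defaultdict


def index_reference(reference, k):
    """Map every k-mer of the reference to the list of positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read, reference, index, k, max_mismatches=1):
    """Return (position, mismatches) of the best placement of the read,
    or None if no placement has at most max_mismatches mismatches."""
    best = None
    for pos in index.get(read[:k], []):
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):  # the read would run off the end
            continue
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (pos, mismatches)
    return best


reference = "ATGGCGTGCAATGC"
index = index_reference(reference, k=3)

print(map_read("GGCGTG", reference, index, k=3))  # (2, 0): exact match
print(map_read("TGCTAT", reference, index, k=3))  # (6, 1): one mismatch
print(map_read("CCCCCC", reference, index, k=3))  # None: seed not found
```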

I have to be honest: that is not all I omitted. I also painted the process of assembling the reads as something flawless, whereas assembling reads to reconstruct a genome or a transcriptome (the set of mRNAs of the cell or tissue from which they are extracted; don't worry, we will talk about it sooner or later) is not easy at all. On the contrary, the errors and instability of assemblers mean that the results of an assembly are often not reproducible, which is why you can find different versions of the genome of the same organism.

Figure 4. Image retrieved from the Phytozome v12 plant genome database. The red arrows indicate the two versions, v3.4 and v4.03, of the Solanum tuberosum (potato) genome, obtained from two sequencing and assembly processes.

In fact, reading an article (De novo genome assembly: what every biologist should know) I understood that genome assembly is a technique that has enormous potential but still needs to improve. Something can go wrong both in the laboratory phase, where the choice of library and of sequencing technique greatly influences the final result of the assembly, and in the computational phase in front of the computer. In particular, during assembly you can have:

  • Errors in rejecting reads.
  • Errors due to the presence of repeated sequences that cause the formation of regions with high coverage by the reads and regions with little coverage by reads (gaps).
  • Errors in the orientation or positioning of the reads.
Figure 5. Errors related to genome assembly

In general, all these problems cause a loss of information and therefore contigs that are shorter than they should be.
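The problem with repeated sequences can be seen even in a toy example. In the Python sketch below (the sequence, the k values and the function name are invented for illustration), the 4-base repeat GCGT occurs twice in different contexts. With k = 4 the de Bruijn graph has a node with two possible continuations, so an assembler cannot tell which way to go and has to break the contig there. With a k large enough to span the repeat, the ambiguity disappears.

```python
from collections import defaultdict


def branching_nodes(sequence, k):
    """Return the (k-1)-mer nodes that have more than one successor
    in the de Bruijn graph built from the k-mers of the sequence."""
    successors = defaultdict(set)
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        successors[kmer[:-1]].add(kmer[1:])
    return {node: sorted(nxt) for node, nxt in successors.items() if len(nxt) > 1}


# GCGT appears twice: once followed by CC, once followed by AG.
sequence = "AAGCGTCCTTGCGTAG"

print(branching_nodes(sequence, k=4))  # {'CGT': ['GTA', 'GTC']}
print(branching_nodes(sequence, k=6))  # {}
```

In real genomes the repeats are often much longer than the reads themselves, so simply increasing k is not always an option.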

I wrote this article to summarize how the sequencing and assembly of a DNA or cDNA molecule takes place and to point out its main critical issues. When performing an assembly, bear in mind the limitations of this process, despite its extreme importance.

As usual, I ask you to leave a "like" or a comment, also to let me know whether the article contains errors or was to your liking.

Bye, and see you soon.

Sources: