Site icon BIOINFORMATICAMENTE

The assembling

Previously I have had the opportunity to talk about some sequencing techniques and how the reads obtained from this process can be "clean", but what will their fate be?

The cleaned and filtered reads can be used for the reconstruction of the entire DNA or cDNA sequence that we have decided to sequence through a process that takes the name of assembly. Personally, when I think about assembly, I enjoy making a comparison, perhaps a little forced but I must say that it works. Imagine that the letters "F", "W", "L" and "O" are reads (fragments of a sequence). Well, by assembling the letters together you can get a word "WOLF", which makes much more sense than the single letters. The more attentive reader will have noticed that the word "FLOW" can also be constructed with the above letters, and will be wondering how do I know which of the two "letter assemblies" is correct. These problems also arise for the reads and in this case the analysis of the quality of the assembly comes to our aid, but rest assured we will talk more about this in another article. What I would like you to understand with this example is that assemble it means arranging the reads obtained from sequencing in order to obtain the entire sequence of DNA or cDNA sequenced.

In particular, the assembly of the reads is allowed by overlapping regions that exist between them. The first product of the assembly are the contigs which are connected and oriented in order to form contiguous sequences called scaffolds. However these sequences have"empty spaces" that will be filled using reads obtained independently that somehow overlap precisely at these spaces, this allows to obtain an entire continuous sequence which as a whole constitutes the entire succession of nitrogenous bases that make up the DNA or cDNA sequenced in origin and defined consensus sequence. Despite this, it is possible that some spaces remain "uncovered" in the final product of the assembly. Sometimes scaffolds can be joined together based on genetic information defined by maps pre-existing or on the basis of conformation information obtained by technique Hi-C. The scaffold set is called superscaffold.

So if we want to be schematic we can say that, in temporal order, the products of the assembly of a DNA or cDNA molecule are:

a) Contigs, set of partially overlapping reads. Usually these are the output used by other algorithms that perform post-assembly studies.

b) Scaffold, set of contigs ordered and oriented according to position information. Furthermore, the scaffolds obtained present empty spaces that can be filled by mapping the “empty” points of reads usually obtained independently. The computational process that leads to obtaining scaffolds with no or few empty spaces is called scaffolding.

c) Superscaffold, set of scaffolds joined on the basis of information given by maps or by Hi-C technique. These are approximate to pseudo molecules when they are long enough to represent an entire chromosome.

But what tools do we have to make an assembly?

There are several types of algorithms and related software available (see image).

In any case, the most used and interesting algorithms are two:

  1. Algorithm of Overlap Layout Consensus (OLC)
  2. Algorithm of De Brujin graph

Actually the OLC algorithms have been mainly used for the assembly of long reads and while the De Brujin graph algorithms are more effective in the assembly of short reads.

Figure 2. The illustration of the de novo assembly pipeline (Current challenges and solutions of de novo assembly, https://link.springer.com/article/10.1007/s40484-019-0166-9).

OLC-based assemblers

To better explain how assemblers based on OLC algorithms work I thought I'd show the key steps with which they assemble the reads:

The assemblers based on De Brujin Graph's algorithms

To assemble the reads obtained with NGS techniques, De Brujin Graph algorithms are used which allow the construction of graphs. These are structures consisting of nodes and arches.

Figure 3. Algorithm structure De Brujin Graph

Let's try to see in detail how an algorithm of De Brujin Graph works in order to assemble reads:

Figure 5. K-mers and De Brujn graph structure

Algorithms based on De Brujin Graphs are certainly very effective but during the resolution of extended graphs in order to obtain the exact consensus sequence of the sequenced DNA or cDNA, problems can be encountered that lead to errors.

The problems encountered when solving extended graphs are:

  1. Graph resolution can sometimes lead to errors such as bifurcations where there are nodes with low coverage. This occurs due to the fact that reads can accumulate sequence errors at the 3 ′ ends. To solve these problems, you can filter and eliminate nodes with low coverage (and therefore rare) using specific options that are assigned to the command that launches the assembly algorithm.
  2. Sometimes “bubbles” can be created due to the presence of polyorphisms such as SNPs or Indels. To solve this problem it is advisable to use sequencing techniques that generate longer reads or to use particular libraries such as mate pair library which give positional information of the reads.
  3. Junctions between nodes of different graphs can be created due to the presence of repeated sequences in the sequenced DNA or cDNA. These regions are defined "Collapsing" and according to their length they are distinguished in "Short collapsing" "Long collapsing". The former are identified and remedied directly by the assembly algorithm by grouping the reads at the end of the assembly while the latter are excluded thanks to the use of the mate pair library during the sequencing.

Ok, I think we should stop the article here. I have to be honest, I struggled a lot to write it, partly because I had to review some arguments and partly because the algorithms in question are difficult to understand in detail. However, I hope I have been able to explain to you how the assembly of a DNA or cDNA molecule is achieved. Obviously when I say "molecule" I mean a portion of the genome or transcriptome, a chromosome or a whole genome or transcriptome.

In any case it does not seem correct to close without leaving you a practical tutorial related to the above. In fact, below you will find a video where I assemble reads using an alignment software called Velvet which is based precisely on an alignment algorithm by De Brujin graph.

Bye-bye and see you soon.

Sources:

Exit mobile version