The assembling

bioinformaticamente

5 years ago

Previously I have had the opportunity to talk about some sequencing techniques and how the reads obtained from this process can be "clean", but what will their fate be?

The cleaned and filtered reads can be used for the reconstruction of the entire DNA or cDNA sequence that we have decided to sequence through a process that takes the name of assembly. Personally, when I think about assembly, I enjoy making a comparison, perhaps a little forced but I must say that it works. Imagine that the letters "F", "W", "L" and "O" are reads (fragments of a sequence). Well, by assembling the letters together you can get a word "WOLF", which makes much more sense than the single letters. The more attentive reader will have noticed that the word "FLOW" can also be constructed with the above letters, and will be wondering how do I know which of the two "letter assemblies" is correct. These problems also arise for the reads and in this case the analysis of the quality of the assembly comes to our aid, but rest assured we will talk more about this in another article. What I would like you to understand with this example is that assemble it means arranging the reads obtained from sequencing in order to obtain the entire sequence of DNA or cDNA sequenced.

In particular, the assembly of the reads is allowed by overlapping regions that exist between them. The first product of the assembly are the contigs which are connected and oriented in order to form contiguous sequences called scaffolds. However these sequences have"empty spaces" that will be filled using reads obtained independently that somehow overlap precisely at these spaces, this allows to obtain an entire continuous sequence which as a whole constitutes the entire succession of nitrogenous bases that make up the DNA or cDNA sequenced in origin and defined consensus sequence. Despite this, it is possible that some spaces remain "uncovered" in the final product of the assembly. Sometimes scaffolds can be joined together based on genetic information defined by maps pre-existing or on the basis of conformation information obtained by technique Hi-C. The scaffold set is called superscaffold.

So if we want to be schematic we can say that, in temporal order, the products of the assembly of a DNA or cDNA molecule are:

a) Contigs, set of partially overlapping reads. Usually these are the output used by other algorithms that perform post-assembly studies.

b) Scaffold, set of contigs ordered and oriented according to position information. Furthermore, the scaffolds obtained present empty spaces that can be filled by mapping the “empty” points of reads usually obtained independently. The computational process that leads to obtaining scaffolds with no or few empty spaces is called scaffolding.

c) Superscaffold, set of scaffolds joined on the basis of information given by maps or by Hi-C technique. These are approximate to pseudo molecules when they are long enough to represent an entire chromosome.

But what tools do we have to make an assembly?

There are several types of algorithms and related software available (see image).

In any case, the most used and interesting algorithms are two:

Algorithm of Overlap Layout Consensus (OLC)
Algorithm of De Brujin graph

Actually the OLC algorithms have been mainly used for the assembly of long reads and while the De Brujin graph algorithms are more effective in the assembly of short reads.

Figure 2. The illustration of the de novo assembly pipeline (Current challenges and solutions of *de novo* assembly, https://link.springer.com/article/10.1007/s40484-019-0166-9).

OLC-based assemblers

To better explain how assemblers based on OLC algorithms work I thought I'd show the key steps with which they assemble the reads:

First, the algorithm compares the sequences in pairs with each other by means of alignment algorithms of the Blast type or based on shared k-mers, and on the basis of a homology value it is decided which sequences are most similar to each other. Furthermore, similar reads are grouped into a single cluster.
Subsequently the similar reads are globally aligned using global dynamic alignment algorithm. When the alignment score exceeds a certain threshold value, the overlap of similar aligned reads is defined as significant. A structure is therefore created on the basis of which a contig is generated.
The contigs obtained will then in turn be aligned or better extended always on the basis of overlapping points as in the case of reads in order to obtain much larger sequences defined scaffolds.
Finally, the scaffolds are placed side by side and oriented to each other until a single large consensus sequence is obtained which correspond the sequenced DNA or cDNA.

The assemblers based on De Brujin Graph's algorithms

To assemble the reads obtained with NGS techniques, De Brujin Graph algorithms are used which allow the construction of graphs. These are structures consisting of nodes and arches.

Figure 3. Algorithm structure De Brujin Graph

Let's try to see in detail how an algorithm of De Brujin Graph works in order to assemble reads:

First of all, the reads to be assembled are divided in k-mers, i.e., subsequences of known length (decided by the operator).
Later the k-mers are sorted according to a lexicographic criterion, in hash tables.

Figure 4. https://www.nature.com/articles/nmeth.1376

At this point the graphs for each of the reads are constructed starting from the k-mers obtained from each of these. In the graphs constructed, each node consists of a k-mer and the arcs connect two k-mers to each other.

Figure 5. K-mers and De Brujn graph structure

Once the graphs for each reads obtained from sequencing are constructed, they are compared with each other in order to extend the individual reads and obtaining longer sequences, the contigs, which are more representative of a portion of the sequenced DNA or cDNA. In particular, when two reads have the same node and, therefore k-mer, a collapse of the two graphs is observed at that point. The collapsed node will consequently have greater coverage and therefore that portion of DNA or cDNA is more represented and realistic than other nodes with less coverage.
Finally, all that remains is to solve the graphs of the contigs formed by the collapse of shared nodes. This resolution will allow the entire consensus sequence of the originally sequenced DNA or cDNA to be constructed with little or no gaps. The resolution of graphs can occur through two different types of resolution approaches: the Hamiltonian approach and the Eulerian approach. In both cases the k-mers are connected to their neighbors by superimposing the sequence prefix and suffix (see image below). I don't think I am able to explain to you how these two approaches work in detail as you enter a much more mathematical than biological field so I just tell you that if you want to know more you can read this. However, it should be considered that assemblers based on Eulerian de Bruijn Graph generally perform better in the assembly of a large genome than the Hamiltonian De Bruijn Graph approach.

Algorithms based on De Brujin Graphs are certainly very effective but during the resolution of extended graphs in order to obtain the exact consensus sequence of the sequenced DNA or cDNA, problems can be encountered that lead to errors.

The problems encountered when solving extended graphs are:

Graph resolution can sometimes lead to errors such as bifurcations where there are nodes with low coverage. This occurs due to the fact that reads can accumulate sequence errors at the 3 ′ ends. To solve these problems, you can filter and eliminate nodes with low coverage (and therefore rare) using specific options that are assigned to the command that launches the assembly algorithm.
Sometimes “bubbles” can be created due to the presence of polyorphisms such as SNPs or Indels. To solve this problem it is advisable to use sequencing techniques that generate longer reads or to use particular libraries such as mate pair library which give positional information of the reads.
Junctions between nodes of different graphs can be created due to the presence of repeated sequences in the sequenced DNA or cDNA. These regions are defined "Collapsing" and according to their length they are distinguished in "Short collapsing" e "Long collapsing". The former are identified and remedied directly by the assembly algorithm by grouping the reads at the end of the assembly while the latter are excluded thanks to the use of the mate pair library during the sequencing.

Figure 7. https://rna-seqblog.com/kissplice-de-novo-assembly-of-alternative-splicing-events-from-rna-seq-data/)

Ok, I think we should stop the article here. I have to be honest, I struggled a lot to write it, partly because I had to review some arguments and partly because the algorithms in question are difficult to understand in detail. However, I hope I have been able to explain to you how the assembly of a DNA or cDNA molecule is achieved. Obviously when I say "molecule" I mean a portion of the genome or transcriptome, a chromosome or a whole genome or transcriptome.

In any case it does not seem correct to close without leaving you a practical tutorial related to the above. In fact, below you will find a video where I assemble reads using an alignment software called Velvet which is based precisely on an alignment algorithm by De Brujin graph.

Bye-bye and see you soon.

Sources:

Share this: