In previous articles I have had the opportunity to talk about some sequencing techniques and how the reads obtained from this process can be "clean", but what will their fate be?

The cleaned and filtered reads can be used for the reconstruction of the entire DNA or cDNA sequence that we have decided to sequence through a process that takes the name of assembly. Personally, when I think about assembly, I enjoy making a comparison, perhaps a little forced but I must say that it works. Imagine that the letters "F", "W", "L" and "O" are reads (fragments of a sequence). Well, by assembling the letters together you can get a word "WOLF", which makes much more sense than the single letters. The more attentive reader will have noticed that the word "FLOW" can also be constructed with the above letters, and will be wondering how do I know which of the two "letter assemblies" is correct. These problems also arise for the reads and in this case the analysis of the quality of the assembly comes to our aid, but rest assured we will talk more about this in another article. What I would like you to understand with this example is that assemble it means arranging the reads obtained from sequencing in order to obtain the entire sequence of DNA or cDNA sequenced.

In particular, the assembly of the reads is allowed by overlapping regions that exist between them. The first product of the assembly are i contigs which are connected and oriented in order to form contiguous sequences called scaffold. These sequences however, come with "empty spaces" that will be filled using reads obtained independently that somehow overlap precisely at these spaces this allows to obtain an entire continuous sequence which as a whole constitutes the entire succession of nitrogenous bases that make up the DNA or cDNA sequenced in origin and defined consensus sequence. Despite this, it is possible that some spaces remain "uncovered" in the final product of the assembly. Sometimes scaffolds can be joined together based on genetic information defined by maps pre-existing or on the basis of conformation information obtained by technique Hi-C. The scaffold set is called superscaffold.

So if we want to be schematic we can say that, in temporal order, the products of the assembly of a DNA or cDNA molecule are:

a) Contigs, set of partially overlapping reads. Usually these are the outputs used by other algorithms that perform post-assembly studies.

b) Scaffold, set of contigs ordered and oriented according to position information. Furthermore, the scaffolds obtained present empty spaces that can be filled by mapping the “empty” points of reads usually obtained independently. The computational process that leads to obtaining scaffolds with no or few empty spaces is called scaffolding.

c) Superscaffold, set of scaffolds joined on the basis of information given by maps or by Hi-C technique. These are approximate to pseudo molecules when they are long enough to represent an entire chromosome.

But what tools do we have to make an assembly?

There are several types of algorithms and related software available (see image).

In any case, the most used and interesting algorithms are two:

  1. Overlap Layout Consensus (OLC) Algorithm
  2. De Brujin graph Algorithm

The OLC algorithms have been mainly used for the assembly of long reads and while the De Brujin graph algorithms are more effective in the assembly of short reads.

Fig. 3 The illustration of the de novo assembly pipeline (Current challenges and solutions of de novo assembly,

OLC-based assemblers

To better explain how assemblers based on OLC algorithms work I thought I'd present in detail the key steps with which they assemble the reads:

  • First, the algorithm compares the sequences in pairs with each other by means of alignment algorithms of the Blast type or based on shared K-mers, and on the basis of a homology value it is decided which sequences are most similar to each other. Furthermore, similar reads are grouped into a single cluster.
  • Subsequently these similar reads are globally aligned using global dynamic alignment algorithm. When the alignment score exceeds a certain threshold value, the overlap of similar aligned reads is defined as significant. A structure is therefore created on the basis of which a contig is generated.
  • The different contigs obtained will then in turn be aligned or better extended always on the basis of overlapping points as in the case of reads in order to obtain much larger sequences defined scaffolds.
  • Finally, the scaffolds are placed side by side and oriented to each other until a single large consensus sequence is obtained which constitutes the sequenced DNA or cDNA.

The assemblers based on De Brujin's algorithms

To assemble the reads obtained with NGS techniques, De Brujin algorithms are used which allow the construction of graphs. These are structures consisting of nodes and arches.

Fig. 5 Algorithm structure De Brujin Graph

Let's try to see in detail how an algorithm based on De Brujin graphs works in order to assemble reads:

  • First of all, the reads to be assembled are divided into k-mers, i.e., the subsequences of known length (decided by the operator).
  • Then the k-mers are sorted according to a lexicographic criterion, in said tables hash.
  • At this point we construct graphs for each reads starting from the k-mers obtained from each of these. In the graphs constructed, each node consists of a k-mer and the arcs connect two k-mers to each other.
Fig. 7 K-mers and De Brujn graph structure
  • Once the graphs for each reads obtained from sequencing are constructed, they are compared with each other with the aim of extending the individual reads and obtaining longer sequences, the contigs, which are more representative of a portion of the sequenced DNA or cDNA. In particular, when two reads have the same node and, therefore k-mer, a collapse of the two graphs is observed at that point. The collapsed node will consequently have greater coverage and therefore that portion of DNA or cDNA is more represented and realistic than other nodes with less coverage.
  • Finally, all that remains is to "solve" the graphs of the contigs formed by the collapse of shared nodes. This resolution will allow the entire consensus sequence of the originally sequenced DNA or cDNA to be constructed with little or no gaps. The resolution of graphs can occur through two types resolution approaches: the Hamiltonian approach and the Eulerian approach and in both cases the k-mers are connected to their neighbors by superimposing the sequence prefix and suffix (see image below). I don't think I am able to explain to you how these two approaches work in detail as you enter a much more mathematical than biological field so I just tell you that if you want to know more you can read this review. However, it should be considered that assemblers based on Eulerian de Bruijn graph generally perform better in the assembly of a large genome than the Hamiltonian de Bruijn graph approach in terms of assembly results.

Algorithms based on De Brujin graphs are certainly very effective but during the resolution of extended graphs in order to obtain the exact consensus sequence of the sequenced DNA or cDNA, problems can be encountered that lead to non-trivial errors.

The main problems encountered when solving extended graphs are:

  1. During the resolution of graphs, it can form bifurcations, this tips, where there are nodes with low coverage. This occurs due to the fact that reads can accumulate sequence errors at the 3 ′ ends. To solve these problems, you can filter and eliminate nodes with low coverage (and therefore rare) using specific options that are assigned to the command that launches the assembly algorithm.
  2. Sometimes “bubbles” can be created during the resolution of graphs due to the presence of polyorphisms such as SNPs or Indels. To solve this problem it is advisable to use sequencing techniques that generate longer reads or to use particular libraries such as mate pair library which give positional information relating to the reads.
  3. In addition, junctions between nodes of different graphs can be created due to the presence of repeated sequences in the sequenced DNA or cDNA. These regions are defined "Collapsing" and according to their length they are distinguished in "Short collapsing" e "Long collapsing". The former are identified and remedied directly by the assembly algorithm by grouping the reads at the end of the assembly while the latter are excluded thanks to the use of the mate pair library in the sequencing phase.

Ok, I think we should stop the article here. I have to be honest, I struggled a lot to write it, partly because I had to review some arguments and partly because the algorithms in question are difficult to understand in detail. However, I hope I have been able to explain to you how the assembly of a DNA or cDNA molecule is achieved. Obviously when I say "molecule" I mean a portion of the genome or transcriptome, a chromosome or a whole genome or transcriptome.

In any case it does not seem correct to close without leaving you a practical tutorial related to the above. In fact, below you will find one where I assemble reads using an alignment software called Velvet which is based precisely on an alignment algorithm by De Brujin graph.

Hello and see you soon.