Hi and welcome back! For the past two weeks I've been swamped with commitments. As I have already mentioned, I am working on a bioinformatics thesis project that aims to carry out a genome-wide association study (GWAS) on a population made up of different plant varieties. One day I will dedicate an article to this too.

I did manage to carve out some time to talk to you about an extremely useful type of algorithm in bioinformatics: alignment algorithms. They are among the most widely used algorithms in the field and appear in many different kinds of bioinformatics analysis. Before going through the different alignment algorithms, we need to understand what "alignment" means. Sequence alignment, whether of nucleic acids or proteins, is the operation that compares two or more sequences in order to evaluate their degree of similarity, and therefore their evolutionary relationship, by means of an alignment score computed by the algorithm itself.

At this point you may be wondering why it is so useful to align sequences. Well, sequence alignment allows us to:

  • Know whether two or more sequences are similar or identical, and so identify an unknown sequence by comparing it with a known one. To understand this use of alignment algorithms we need an essential clarification: "similar sequences" and "identical sequences" are not the same thing. When we say that a sequence X is identical to a sequence Y, we mean that they have the same residues (elementary units) in the same positions and are therefore the same molecule. The percentage of identity between two sequences is commonly expressed as:

% identity = (number of identical residues / number of aligned positions) × 100

By contrast, when we say that a sequence X is similar to a sequence Y, we mean that at each aligned position the two sequences have either identical residues or different residues that share the same biochemical function. The percentage of similarity between two sequences is commonly expressed as:

% similarity = ((number of identical residues + number of similar residues) / number of aligned positions) × 100
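As a rough illustration, both percentages can be computed from a pair of sequences that have already been aligned. The sketch below is only an assumption-laden example: the function names, the residue groups in `SIMILAR_GROUPS` and the choice of the number of aligned columns as denominator are mine, and real tools decide similarity with a substitution matrix.

```python
# Hypothetical grouping of amino acids by broad biochemical property,
# used only for this illustration; real tools rely on substitution matrices.
SIMILAR_GROUPS = [
    set("ILVM"),  # aliphatic / hydrophobic
    set("FWY"),   # aromatic
    set("KRH"),   # positively charged
    set("DE"),    # negatively charged
    set("STNQ"),  # polar, uncharged
]


def are_similar(a, b):
    """Return True if both residues belong to the same illustrative group."""
    return any(a in group and b in group for group in SIMILAR_GROUPS)


def identity_and_similarity(seq_x, seq_y):
    """Percent identity and percent similarity of two pre-aligned sequences.

    Both strings must have the same length; '-' marks a gap.
    Assumption: the denominator is the total number of aligned columns.
    """
    if not seq_x or len(seq_x) != len(seq_y):
        raise ValueError("sequences must be non-empty and of equal length")
    identical = similar = 0
    for a, b in zip(seq_x, seq_y):
        if a == "-" or b == "-":
            continue  # a gap is neither identical nor similar
        if a == b:
            identical += 1
        elif are_similar(a, b):
            similar += 1
    columns = len(seq_x)
    pct_identity = 100.0 * identical / columns
    pct_similarity = 100.0 * (identical + similar) / columns
    return pct_identity, pct_similarity


# Made-up toy sequences: K/K and M/M are identical, L/I and D/E are similar.
print(identity_and_similarity("KLMD", "KIME"))  # (50.0, 100.0)
```

Note that similarity is always at least as high as identity, because identical residues count toward both.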

Measuring the percentage of similarity between two sequences makes it possible to assess their homology and therefore to understand how they are related by descent. In particular, two aligned sequences can be classified as:

  1. Orthologous sequences: two sequences found in different genomes (typically different species) that have accumulated differences over time but generally keep the same function. This situation is the result of speciation.
  2. Paralogous sequences: two sequences found in the same genome that arose by duplication and have diverged over time, accumulating mutations that allowed them to acquire different functions.

As a rough rule of thumb, two sequences that share 80% or more of their residues can be considered similar and therefore homologous (orthologous or paralogous), although homology is often inferred at considerably lower values, especially for proteins.

  • Estimate the evolutionary relationships between the aligned sequences and use them to build phylogenetic trees.
  • Understand which regions of a sequence matter most for its function. Let's take as an example amylase, the enzyme that catalyzes the hydrolysis of starch. By aligning the amylases from different organisms, perhaps even genetically very distant ones, we can see that some amino acid residues are highly conserved, meaning they do not change, while others vary from one organism to another. From these observations it follows that the most conserved regions are those with the greatest influence on the function of the protein, which is the same in all the organisms examined. The variable parts, on the other hand, matter less in this respect.
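The idea of spotting conserved positions can be sketched in a few lines. This is a minimal example under my own assumptions: the sequences are already aligned, conservation is measured simply as the fraction of sequences sharing the most common residue in each column, and the toy sequences are invented (they are not real amylase fragments).

```python
from collections import Counter


def column_conservation(aligned_seqs):
    """Return one conservation value per alignment column.

    Each value is the fraction of sequences that carry the most common
    residue in that column (1.0 means fully conserved). Gaps ('-') are
    never counted as the conserved residue.
    """
    if not aligned_seqs:
        raise ValueError("at least one sequence is required")
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("all sequences must have the same aligned length")
    scores = []
    for column in zip(*aligned_seqs):
        residues = [r for r in column if r != "-"]
        if not residues:
            scores.append(0.0)  # column made only of gaps
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


# Invented toy alignment: columns 1 and 4 never change, column 2 always does.
toy_alignment = ["DVFH", "DIFH", "DLYH"]
for position, score in enumerate(column_conservation(toy_alignment), start=1):
    print(position, round(score, 2))
```

Columns with a value close to 1.0 are the candidates for functionally important residues; more refined measures (for example entropy-based ones) exist, but the principle is the same.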

Sequence alignment therefore gives us a lot of useful information. As mentioned earlier, though, there are different types of alignment algorithms, and each needs to be treated individually to be properly understood. For this reason I will just list them here with a few brief notes; if you would like me to cover one of them in detail, let me know in a comment.

  1. Sliding alignment algorithms. These were the first to be developed but are no longer used.
  2. Dot plot (dot matrix) alignment algorithms. They provide a graphical representation of the alignment between two sequences at a time. Each point of identity or similarity between the two sequences is marked by a dot, and runs of dots form lines. This makes it possible to see at a glance the regions affected by polymorphisms such as inversions, repeats, insertions and deletions.
[Figure: Dot plot comparison of the nucleotide sequences of Acinetobacter phages...]
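To give an idea of how a dot plot is built, here is a minimal sketch that marks only exact identities, with no window or threshold filtering (which real dot plot tools usually apply to reduce noise). The function names `dot_plot` and `render` are my own.

```python
def dot_plot(seq_x, seq_y):
    """Return a matrix of booleans with one row per residue of seq_y.

    Cell [i][j] is True when seq_y[i] and seq_x[j] are identical.
    """
    return [[a == b for b in seq_x] for a in seq_y]


def render(seq_x, seq_y):
    """Draw the dot plot as text: '*' for identity, '.' otherwise."""
    lines = ["  " + " ".join(seq_x)]  # header row with seq_x
    for residue, row in zip(seq_y, dot_plot(seq_x, seq_y)):
        cells = " ".join("*" if hit else "." for hit in row)
        lines.append(residue + " " + cells)
    return "\n".join(lines)


print(render("ACG", "ACG"))
#   A C G
# A * . .
# C . * .
# G . . *
```

Two identical sequences give an unbroken main diagonal; insertions and deletions shift the diagonal, repeats show up as parallel diagonals, and inversions appear as lines running in the opposite direction.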
  3. Dynamic alignment algorithms (dynamic programming). These algorithms use identity matrices or substitution matrices to align sequences. In these matrices the elements of the compared sequences (nitrogenous bases for nucleic acids, amino acids for proteins) are placed in the outer cells, and the identity or similarity values for each pair of elements are placed in the inner cells. In identity matrices, identical elements are assigned a score of 1 and different elements a score of 0. In substitution matrices, positive values indicate greater similarity, and therefore a higher probability that the two elements are treated as similar during an alignment, while negative values indicate low similarity and a correspondingly low probability. The two main families of substitution matrices are PAM and BLOSUM, each of which comes in several versions; the most used are PAM150 and BLOSUM62. Dynamic algorithms can work globally (e.g. the Needleman-Wunsch algorithm) or locally (e.g. the Smith-Waterman algorithm).
  4. Heuristic alignment algorithms. These are very similar to the dynamic ones and also use substitution matrices, but they carry out their alignments in a more approximate and faster way. For this reason they return a probabilistic estimate of the similarity between the two sequences, which is not guaranteed to be optimal but generally comes close to it. Heuristic algorithms are also divided into global ones, such as Clustal, and local ones, such as the widely used BLAST.
  5. Multiple alignment algorithms. These are very useful because they let you align multiple sequences at the same time. They carry out the alignment by grouping the sequences into clusters.
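To make the dynamic algorithms mentioned in the list a little more concrete, here is a minimal sketch of the score computation behind a Needleman-Wunsch style (global) and a Smith-Waterman style (local) alignment. It uses the identity scoring described above (1 for identical elements, 0 for different ones); the linear gap penalty of -1 and the function name `alignment_score` are my own assumptions, and the sketch returns only the score, not the aligned sequences.

```python
def alignment_score(seq_x, seq_y, match=1, mismatch=0, gap=-1, local=False):
    """Best alignment score of two sequences via dynamic programming.

    local=False: global score (Needleman-Wunsch style).
    local=True:  local score (Smith-Waterman style).
    The gap penalty is an assumption made for this example.
    """
    rows, cols = len(seq_y) + 1, len(seq_x) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # In a global alignment, leading gaps are penalised.
        for j in range(1, cols):
            score[0][j] = j * gap
        for i in range(1, rows):
            score[i][0] = i * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_y[i - 1] == seq_x[j - 1] else mismatch
            value = max(
                score[i - 1][j - 1] + pair,  # align the two residues
                score[i - 1][j] + gap,       # gap in seq_x
                score[i][j - 1] + gap,       # gap in seq_y
            )
            if local:
                value = max(value, 0)  # a local alignment may restart anywhere
                best = max(best, value)
            score[i][j] = value
    return best if local else score[-1][-1]


print(alignment_score("GATTACA", "GATTACA"))               # 7
print(alignment_score("ACGT", "AGT"))                      # 2
print(alignment_score("TTTACGTTT", "GGACGGG", local=True)) # 3
```

To score proteins with a substitution matrix instead, the `match`/`mismatch` pair would be replaced by a lookup of each residue pair in the chosen matrix. Heuristic tools avoid filling in the whole table, which is where their speed comes from.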

After this quick overview of alignment algorithms, all that remains is to say goodbye and remind you that you can subscribe to the blog if you don't want to miss the next articles. I also encourage you to leave a comment in the section below.

Bye-bye and see you soon.