As a bioinformatician, I often work with data sets of DNA, RNA or protein sequences, which is cool, but there is one thing I always have to be careful about when looking at a group of sequences: redundancy. When we think about redundancy, we usually picture something repeated several times, a sort of multiplication of an entity, but in bioinformatics it is not so simple. So I prefer to introduce the concept of redundancy in bioinformatics through definitions that are as exhaustive and precise as possible.

What does redundancy mean in bioinformatics?

In bioinformatics a sequence A is redundant in a data set when it has one or more similar or homologous sequences within the same data set.

Given this definition, two other important concepts need to be defined: similarity and homology between sequences.

What does it mean for two sequences to be similar?

In bioinformatics, we can say that a sequence A is similar to a sequence B when they share a certain percentage of elements. Similarity is a quantitative measure of the likeness between two sequences. It is calculated as the percentage of similar residues between the sequences over a given length of the alignment.

It's important to know that similarity does not always imply "same function": sequence A and sequence B can be similar and still have different functions.
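To make the "percentage over the alignment" idea concrete, here is a minimal sketch in bash and awk. It computes percent identity, the strictest form of similarity, where only exact matches count; scoring "similar" residues would additionally need a substitution matrix, which I leave out. The function name `percent_identity` and the choice to skip columns where both sequences have a gap are my own assumptions, and other conventions exist.

```bash
# Percent identity between two pre-aligned sequences of equal length.
# Assumption: columns where BOTH sequences carry a gap ("-") are ignored;
# every other column counts towards the alignment length.
percent_identity() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        if (length(a) != length(b)) {
            print "error: aligned sequences must have equal length" > "/dev/stderr"
            exit 1
        }
        for (i = 1; i <= length(a); i++) {
            x = substr(a, i, 1); y = substr(b, i, 1)
            if (x == "-" && y == "-") continue   # gap-only column
            cols++
            if (x == y) same++                   # identical residues
        }
        if (cols == 0) { print "0.00"; exit }
        printf "%.2f\n", 100 * same / cols
    }'
}

# Toy alignment: 7 identical columns out of 9
percent_identity "MKT-AYIAK" "MKTQAYLAK"   # prints 77.78
```

Keep in mind that the number depends heavily on the denominator (full alignment length, shorter sequence, or aligned columns only), so always check which one a tool uses before comparing thresholds.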

What does it mean for two sequences to be homologous?

In bioinformatics, we can say that a sequence A is homologous to a sequence B when they share a common ancestral sequence. This is not a quantitative relation between sequences, so it cannot be measured; it is a qualitative one and therefore has to be inferred.
We can distinguish two types of homologous sequences:

- Paralogous sequences, i.e. homologous sequences found within the same species as a result of a duplication process. Paralogous sequences can have a different function from the ancestral sequence (usually, the older the duplication event, the more likely a neo-functionalization of the duplicate sequence has occurred).
Common sources of sequence duplication include ectopic recombination, retrotransposition events, aneuploidy, polyploidy, and replication slippage.

- Orthologous sequences, i.e. homologous sequences found in different species as a result of separation by a speciation process. Orthologous sequences generally carry out the same function as the ancestral sequence.
There are four types of speciation:

  1. Allopatric (allo = other, patric = place): New species formed from geographically isolated populations.

  2. Peripatric (peri = near, patric = place): New species formed from a small population isolated at the edge of a larger population.

  3. Parapatric (para = beside, patric = place): New species formed from a continuously distributed population.

  4. Sympatric (sym = same, patric = place): New species formed from within the range of the ancestral population.

With redundancy defined, we can see that a data set may contain redundant sequences that, in practical terms, cause two types of redundancy:

1) Functional redundancy: two or more sequences have the same function. This can be due to a homology or similarity relationship between the sequences.
For sequences from the gut microbiota, functional redundancy implies that microorganisms of different species are able to complete the same metabolic process, producing the same metabolic products. For example, Liang Tian et al. have reported that: "Interleukin secretion can be promoted by Sutterella, Akkermansia, Bifidobacterium, Roseburia, and Faecalibacterium prausnitzii".

2) Sequence redundancy: two or more sequences are duplicated. This can be due to duplication and horizontal gene transfer processes.
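The simplest case of sequence redundancy, exact duplicates, can be spotted without any dedicated tool. The following is a minimal sketch, assuming a FASTA file on standard input; the function name `duplicated_sequences` and the toy records are hypothetical. It only catches 100% identical sequences, so it is no substitute for the similarity-based clustering discussed later.

```bash
# List sequences that occur more than once in a FASTA stream (stdin).
# Output: one line per duplicated sequence, "<count><TAB><comma-separated IDs>".
# Wrapped (multi-line) sequences are joined before comparison.
duplicated_sequences() {
    awk '
        function store() {
            n[seq]++
            ids[seq] = (ids[seq] == "" ? id : ids[seq] "," id)
        }
        /^>/ { if (id != "") store(); id = substr($1, 2); seq = ""; next }
             { seq = seq toupper($0) }
        END  {
            if (id != "") store()
            for (s in n) if (n[s] > 1) print n[s] "\t" ids[s]
        }
    ' | sort
}

# Toy input: seqA and seqC carry the same protein, seqC is just line-wrapped
duplicated_sequences <<'EOF'
>seqA
MKTAYIAK
>seqB
MKTQAYLAK
>seqC
MKTA
YIAK
EOF
```

On the toy input this reports a single duplicated sequence shared by seqA and seqC.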

At this point you may be wondering why you need to care about functional and sequence redundancy when analysing a given data set of sequences.

Well, for two main reasons:

- Sequence redundancy leads to an unnecessary increase in the size of the data set, which raises the hardware requirements (CPU and memory) and the time needed for the analysis.
For example, UniProt curators report that: "The UniProt Knowledgebase (UniProtKB) has witnessed exponential growth in the last few years with a two-fold increase in the number of entries in 2014. This follows the vastly increased submission of multiple genomes for the same or closely related organisms. This increase was accompanied by a high level of redundancy in unreviewed UniProtKB (TrEMBL), and many sequences were over-represented in the database. This was especially true for bacterial species where different strains of the same species have been sequenced and submitted (e.g. 1,692 strains of Mycobacterium tuberculosis, corresponding to 5.97 million entries). High redundancy led to an increase in the size of UniProtKB, and thus to the amount of data to be processed internally and by our users, but also to repetitive results in BLAST searches for over-represented sequences."

- Including functionally redundant sequences in the analyses introduces undesirable biases in the results.
For example, Bastian F. et al. say that: "Duplicated samples might provide a false sense of confidence in a result, which is in fact only supported by one experimental data point".

How can we control the functional and sequence redundancy of a given data set?

Several tools can counter functional and sequence redundancy, but I suggest approaching the problem through three main strategies:

  1. Align and Cluster strategy (especially useful for sequence redundancy)
  2. Orthologous grouping strategy (especially useful for functional redundancy)
  3. Gene Ontology strategy (especially useful for functional redundancy)

Below I will show you one tool for each of these strategies, but remember that many other bioinformatics tools can be used instead of the ones I mention. For simplicity, I decided to show those that, at least in my opinion, seem to work best.

1. Align and Cluster strategy:

The Align and Cluster strategy refers to clustering sequences and removing those that exceed a user-defined similarity threshold. The similarity between sequences is evaluated by pairwise sequence alignment.
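To give a feeling for how such a strategy works, here is a toy sketch of greedy incremental clustering: sequences are sorted from longest to shortest, the first one becomes a representative, and each following sequence either joins the first representative it matches above the threshold or becomes a new representative. This is a strong simplification and an assumption on my side: real tools use proper pairwise alignment and fast pre-filters, while this sketch compares sequences position by position, without gaps, over the length of the shorter one. The name `greedy_cluster` is hypothetical.

```bash
# Toy greedy clustering of a FASTA stream (stdin) at a given identity threshold.
# Output: "<representative ID><TAB><member ID>", one line per sequence.
# NOTE: ungapped, position-by-position comparison; for illustration only.
greedy_cluster() {
    local threshold="$1"
    awk '
        /^>/ { if (id != "") print length(seq) "\t" id "\t" seq
               id = substr($1, 2); seq = ""; next }
             { seq = seq toupper($0) }
        END  { if (id != "") print length(seq) "\t" id "\t" seq }
    ' | sort -s -k1,1nr | awk -F'\t' -v thr="$threshold" '
        # Fraction of identical positions over the shorter sequence
        function identity(a, b,    i, n, same) {
            n = (length(a) < length(b)) ? length(a) : length(b)
            if (n == 0) return 0
            for (i = 1; i <= n; i++)
                if (substr(a, i, 1) == substr(b, i, 1)) same++
            return same / n
        }
        {
            for (r = 1; r <= nrep; r++)
                if (identity(rep_seq[r], $3) >= thr) { print rep_id[r] "\t" $2; next }
            nrep++; rep_id[nrep] = $2; rep_seq[nrep] = $3   # new representative
            print $2 "\t" $2
        }'
}

# Toy input: p2 and p4 are close to p1, p3 is unrelated
greedy_cluster 0.9 <<'EOF'
>p1
MKTAYIAKQRQISFVK
>p2
MKTAYIAKQRQISFVR
>p3
GSHMLEDPVDAF
>p4
MKTAYIAKQR
EOF
```

On the toy input, p1 ends up representing p1, p2 and p4, while p3 forms its own cluster. Keeping only the representatives is what produces the "non-redundant" data set.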

Several tools could be used for this purpose, but the one I want to mention is cd-hit.

From cd-hit official site: "CD-HIT stands for Cluster Database at High Identity with Tolerance. The program (cd-hit) takes a fasta format sequence database as input and produces a set of 'non-redundant' (nr) representative sequences as output. In addition cd-hit outputs a cluster file, documenting the sequence 'groupies' for each nr sequence representative. The idea is to reduce the overall size of the database without removing any sequence information by only removing 'redundant' (or highly similar) sequences. This is why the resulting database is called non-redundant (nr). Essentially, cd-hit produces a set of closely related protein families from a given fasta sequence database."
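The cluster file mentioned in the quote is plain text, so it is easy to summarise. The sketch below assumes the usual .clstr layout as I understand it: a `>Cluster N` line opens each cluster, every member line contains `, >NAME...`, and the representative line ends with `*`. The function name `clstr_summary` and the toy records are hypothetical; check the layout against your own output before relying on it.

```bash
# Summarise a cd-hit style .clstr stream (stdin):
# "<cluster number><TAB><cluster size><TAB><representative ID>"
clstr_summary() {
    awk '
        function flush() { if (size > 0) print cluster "\t" size "\t" rep }
        NF == 0     { next }                      # skip empty lines
        /^>Cluster/ { flush(); cluster = $2; size = 0; rep = ""; next }
        {
            size++
            name = $0
            sub(/^.*, >/, "", name)               # drop member index and length
            sub(/\.\.\..*$/, "", name)            # drop "... *" or "... at 95.00%"
            if ($0 ~ /\*$/) rep = name            # "*" marks the representative
        }
        END { flush() }
    '
}

# Toy cluster file with two clusters
printf '>Cluster 0\n0\t120aa, >protA... *\n1\t118aa, >protB... at 95.76%%\n>Cluster 1\n0\t90aa, >protC... *\n' \
    | clstr_summary
```

Sorting this summary by the second column quickly shows which clusters absorbed the most redundant sequences.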

Usage example:


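path_to_multifasta="/home/omar/Dropbox/trial_redundancy_reduction"

# Take a look at the "toy" multifasta:

more "$path_to_multifasta/protein.faa"

# Count how many proteins there are in the multifasta:

grep -c "^>" "$path_to_multifasta/protein.faa"

# Run cd-hit:

# Always take a look at the manual in order to choose the right options for the coverage and identity thresholds:

cd-hit --help

# OK, now run it:
# -c 0.90: sequence identity threshold; -G 0: use local sequence identity; -aS 0.9: the alignment must cover 90% of the shorter sequence;
# -d 0: keep the full sequence name (up to the first space) in the .clstr file; -T 0: use all CPU threads; -M 9000: memory limit in MB; -n 5: word length

cd-hit -i "$path_to_multifasta/protein.faa" -o "$path_to_multifasta/cd_hit_out" -c 0.90 -G 0 -aS 0.9 -d 0 -T 0 -M 9000 -n 5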

# Let's look at the results:

path_to_out="/home/omar/Dropbox/trial_redundancy_reduction"

# What's in the output folder?
ls "$path_to_out"

# Two main outputs:

## 1) cd_hit_out.clstr ==> Clusters of redundant proteins. The one marked with "*" is the representative of its cluster, and you will find its sequence in the cd_hit_out file.

## 2) cd_hit_out ==> Multifasta with only the representative protein of each cluster.

gedit "$path_to_out/cd_hit_out.clstr"

more "$path_to_out/cd_hit_out"

# Example:

grep ">MGV-GENOME-0364295_139" "$path_to_out/cd_hit_out"

2. Orthologous grouping strategy:

The orthologous grouping strategy refers to assigning the sequences of a data set to orthologous groups, based on the orthology relationships between them as evaluated against a specific orthology database.

An excellent example is DeepNOG.
From user guide: "deepnog is a command line tool written in Python 3. It uses deep networks for extremely fast protein orthology assignments. Currently, it is based on a deep convolutional network architecture called DeepNOG trained on the root and bacterial level of the eggNOG 5.0 database (Huerta-Cepas et al. (2019))."
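Whatever tool produces the assignments, the redundancy-reduction step itself is small: group proteins by orthologous group and keep one per group. Here is a minimal sketch, assuming a three-column CSV (protein ID, orthologous group, confidence) with a header row; the header names, the function name `og_representatives` and the rule "keep the most confident protein" are my own illustrative assumptions, not something the tool prescribes.

```bash
# From a CSV stream (stdin) of "id,group,confidence" with a header row,
# keep assignments with confidence >= the given minimum and print
# "<group><TAB><number of proteins><TAB><most confident protein>".
og_representatives() {
    awk -F',' -v min="$1" '
        NR == 1          { next }   # skip the header row
        $2 == ""         { next }   # skip proteins without an assignment
        $3 + 0 < min + 0 { next }   # skip low-confidence assignments
        {
            n[$2]++
            if (!($2 in best) || $3 + 0 > best[$2]) { best[$2] = $3 + 0; rep[$2] = $1 }
        }
        END { for (g in n) print g "\t" n[g] "\t" rep[g] }
    ' | sort
}

# Toy table: prot1 and prot2 fall in the same group, prot3 is below the cut-off
og_representatives 0.99 <<'EOF'
sequence_id,prediction,confidence
prot1,COG0001,0.999
prot2,COG0001,0.995
prot3,COG0002,0.80
prot4,COG0003,0.997
EOF
```

On the toy table, COG0001 is reported with two proteins and prot1 as its representative, COG0003 with one, and prot3 is dropped.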

Usage example:


path_to_multifasta="/home/omar/Dropbox/DeepNOG_prova"

# Take a look at the "toy" multifasta:

more "$path_to_multifasta/protein.faa"

# Count how many proteins there are in the multifasta:

grep -c "^>" "$path_to_multifasta/protein.faa"

# Something to know about deepnog:

# DeepNOG has two modes of use:

# 1) deepnog infer, for assigning sequences to orthologous groups using precomputed models. This is the ready-to-use mode for orthology assignments.

# 2) deepnog train, for training such models (e.g. other taxonomic levels or future versions of eggNOG, different orthology databases, etc.). In other words, use this if you want to train your own convolutional network architectures.

# For the purpose of this report, let's just take a look at deepnog infer:

# Help:

deepnog --help

deepnog infer --help

# Run it:
deepnog infer "$path_to_multifasta/protein.faa" -db eggNOG5 -d auto -t 29 -V 3 -c 0.99 -o "$path_to_multifasta/out.csv"

# Look at the results:

### As output, deepnog generates a CSV file with three columns:

# 1) The unique name or ID of the protein extracted from the sequence file,

# 2) the assigned orthologous group, and

# 3) the network’s confidence in the assignment.

gedit "$path_to_multifasta/out.csv"

3. Gene Ontology strategy:

The Gene Ontology strategy refers to clustering the sequences of a data set based on similar Gene Ontology terms.

For this purpose, the tool that I would like to mention is GOMCL.

From github repo: "GOMCL is a tool to cluster and extract summarized associations of Gene Ontology based functions in omics data. It clusters GO terms using MCL based on overlapping ratios, OC (Overlap coefficient) or JC (Jaccard coefficient). The resulting clusters can be further analyzed and separated into sub-clusters using a second script, GOMCL-sub. This tool helps researchers to reduce time spent on manual curation of large lists of GO terms and minimize biases introduced by shared GO terms in data interpretation."
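The two overlap measures named in the quote are easy to compute by hand. For two sets A and B, the overlap coefficient is |A ∩ B| / min(|A|, |B|) and the Jaccard coefficient is |A ∩ B| / |A ∪ B|. The sketch below assumes that each GO term is represented by the set of genes annotated to it; the function name `overlap_and_jaccard` and the gene IDs are hypothetical, and this is not GOMCL's own code.

```bash
# Overlap coefficient (OC) and Jaccard coefficient (JC) between two gene sets,
# each given as a space-separated list. Duplicate IDs within a list count once.
overlap_and_jaccard() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, A, " "); nb = split(b, B, " ")
        for (i = 1; i <= na; i++) inA[A[i]] = 1
        for (i = 1; i <= nb; i++) inB[B[i]] = 1
        for (g in inA) { sa++; if (g in inB) shared++ }
        for (g in inB) sb++
        smaller = (sa < sb) ? sa : sb
        union = sa + sb - shared
        if (smaller == 0) { print "error: empty gene set" > "/dev/stderr"; exit 1 }
        printf "OC=%.2f JC=%.2f\n", shared / smaller, shared / union
    }'
}

# Two toy GO terms sharing 2 genes: sizes 4 and 6, union 8
overlap_and_jaccard "g1 g2 g3 g4" "g3 g4 g5 g6 g7 g8"   # prints OC=0.50 JC=0.25
```

Note how the overlap coefficient is more forgiving when a small term is nested inside a large one, which is often the case in the GO hierarchy.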

Usage example:


# GOMCL requires two types of input files:

# ==> an obo file, e.g. go-basic.obo

# ==> an enriched GO input file, which may come from different GO enrichment analysis tools; currently supported tools are: BiNGO, agriGO, GOrilla, gProfiler

path_to_input="/home/omar/Dropbox/GOMLC_prova/master/GOMCL-master/tests"

# Take a look at the obo file:

gedit "$path_to_input/go-basic.obo"

# Take a look at the enriched GO file:

gedit "$path_to_input/Wendrich_PNAS_SD2_LR_TMO5_H_vs_L.bgo"

# Run GOMCL:
GOMCL.py "$path_to_input/go-basic.obo" "$path_to_input/Wendrich_PNAS_SD2_LR_TMO5_H_vs_L.bgo" -gosize 3500 -gotype BP CC -I 1.5 -hm -nw -d -hg 0 -hgt -ssd 0

# Run GOMCL-sub.py:

# Each GO cluster generated by GOMCL.py can be evaluated and further divided into non-overlapping sub-clusters using GOMCL-sub.py
GOMCL-sub.py "$path_to_input/go-basic.obo" "$path_to_input/Wendrich_PNAS_SD2_LR_TMO5_H_vs_L.clstr" -C 1 -gosize 2000 -I 1.8 -ssd 0 -hg 0 -hgt -hm -nw

# REMEMBER: The outputs from both GOMCL and GOMCL-sub can be imported to Cytoscape (https://cytoscape.org/) for additional visualization effects.

Bye, and see you soon


References: