Hi and welcome back! This week, as busy as ever, I found some time to brush up on a few scripts I wrote during my bioinformatics internship, and I thought I would share the concepts and practices I worked with during that period. As I have said in other articles, with the advent of Next-Generation Sequencing (NGS) techniques, the amount of data produced by the study of biological processes has required, and continues to require, bioinformatics knowledge to understand and analyze the biological information acquired.

Personally, I believe that learning bioinformatics concepts and methods is necessary, if not essential, for anyone who wants to work in biotechnology. For this reason I chose an internship project aimed at dating the insertion of LTR (Long Terminal Repeat) retrotransposons in different plant species. Before getting into the bioinformatics tools I used, though, it is worth explaining what transposons are.

Transposons are DNA sequences capable of moving (transposing) within the genome by one of two mechanisms: "copy-paste", in which the transposon is copied and the copy integrates at a position in the genome different from the original one; and "cut-paste", in which the transposon is not copied but is excised from one point of the genome and reinserted at another. Transposons are very abundant in the genomes of living organisms, plants in particular. What causes them to arise is not yet clear, but strong stress can certainly favor their activity. Transposons also have several consequences for the genome and for gene expression in an organism, including:

  • An increase in genome size, caused in particular by transposons that move with a "copy-paste" mechanism, but also a partial reduction in genome size when some transposons are eliminated by the removal processes that can occur over time.
  • Positive effects on gene expression: when transposons insert close to genes they can increase transcription, because they can provide regulatory sequences for those genes.
  • Creation of new genes: while moving, especially in the case of "cut-paste" transposition, transposons can pick up exons, or portions of genes in general, which can then join together to form a new coding sequence.
  • Negative effects on gene expression: a transposon can insert inside or near a gene with harmful consequences, for example by "deactivating" a gene that plays a vital role for the organism.
Figure 1. Scheme of the different types of transposons present in the genome of living organisms. Source: Mat Razali, N., Cheah, BH, & Nadarajah, K. (2019). Transposable Elements Adaptive Role in Genome Plasticity, Pathogenicity and Evolution in Fungal Phytopathogens. International Journal of Molecular Sciences, 20 (14). https://doi.org/10.3390/ijms20143597
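
To make the difference between the two transposition mechanisms concrete, here is a toy model in Python. It is only an illustration: the genome is a plain list of labels, and the function names `copy_paste` and `cut_paste` are made up for this sketch, not taken from any tool used in the internship.

```python
def copy_paste(genome, source, target):
    """Copy the element at index `source` and insert the copy at index `target`."""
    new_genome = list(genome)
    new_genome.insert(target, genome[source])
    return new_genome


def cut_paste(genome, source, target):
    """Excise the element at index `source` and reinsert it at index `target`."""
    new_genome = list(genome)
    element = new_genome.pop(source)
    new_genome.insert(target, element)
    return new_genome


# Hypothetical four-element "genome" with one transposable element (TE).
genome = ["geneA", "TE", "geneB", "geneC"]

after_copy = copy_paste(genome, source=1, target=3)
after_cut = cut_paste(genome, source=1, target=3)

print(len(genome), len(after_copy), len(after_cut))
print(after_copy)
print(after_cut)
```

In this toy model the "copy-paste" genome ends up one element longer and carries two copies of the TE, while the "cut-paste" genome keeps its length and the TE simply changes position. That is the intuition behind the first bullet above.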

Transposons therefore play an important role in the biology of an organism, which makes them interesting to study. Although there are several types (see the diagram in Figure 1), during the internship I focused on LTR retrotransposons, a category of transposons that move with a "copy-paste" mechanism. Their name comes from their LTRs (Long Terminal Repeats), repeated sequences of 100-5000 nucleotides found at the two ends of the transposon, the 5' end and the 3' end. The main families of LTR retrotransposons are Gypsy and Copia, which differ from each other in the order of the protein-coding domains within the sequence.

The specific goal of the internship was to date LTR retrotransposons, based on the principle that the more differences the two LTR sequences of a transposon have accumulated, the older that transposon is. At the moment of insertion the two LTRs of an element are identical; the differences between the repeated sequences at the two ends arise from spontaneous mutations that accumulate over time.
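
A minimal sketch of how this principle can be turned into a number. The formula T = K / (2r) is the one commonly used for LTR dating, where K is the divergence between the two LTRs and r is the substitution rate per site per year. Several things here are assumptions on my side: the Jukes-Cantor correction, the rate of 1.3e-8 (a value often used for rice), the function names, and the two short sequences, which are invented. The pipeline actually used during the internship is not reproduced here and may differ.

```python
import math


def p_distance(ltr5, ltr3):
    """Proportion of differing sites between two aligned LTRs (gap columns are skipped)."""
    if len(ltr5) != len(ltr3):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    mismatches = 0
    for a, b in zip(ltr5.upper(), ltr3.upper()):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return mismatches / compared


def jukes_cantor(p):
    """Jukes-Cantor distance: corrects p for multiple substitutions at the same site."""
    if p == 0:
        return 0.0
    if p >= 0.75:
        raise ValueError("p-distance too large for the Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


def insertion_age(ltr5, ltr3, rate):
    """Estimated age in years, T = K / (2 * rate), both LTRs mutating independently."""
    k = jukes_cantor(p_distance(ltr5, ltr3))
    return k / (2.0 * rate)


# Invented, already-aligned 20-nucleotide LTR fragments differing at one site.
ltr_5prime = "ACGTACGTACGTACGTACGT"
ltr_3prime = "ACGTACGTACGAACGTACGT"
ASSUMED_RATE = 1.3e-8  # substitutions per site per year; an assumption, species-dependent

age_years = insertion_age(ltr_5prime, ltr_3prime, ASSUMED_RATE)
print(f"divergence p = {p_distance(ltr_5prime, ltr_3prime):.3f}")
print(f"estimated age = {age_years / 1e6:.2f} million years")
```

The factor 2 in the denominator is there because both LTRs accumulate mutations independently after insertion. In a real analysis the two LTRs would first be aligned with a dedicated aligner, and the rate would be chosen for the species under study.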

The internship project specifically included the following steps:

  1. Download of dicot and monocot plant genomes from the Phytozome v12.1 database (https://phytozome.jgi.doe.gov/pz/portal.html).
  2. Identification and structural annotation of the LTR retrotransposons in each genome, i.e. determining the coordinates of each transposable element to locate it in the genome. The bioinformatics tools used for this (such as LTRharvest and LTR_FINDER) are bundled in a single package designed to produce an automated, de novo annotation of all the transposons in a given genome: the Extensive de novo TE Annotator (EDTA). EDTA builds a high-quality, non-redundant library of the transposable elements found across the whole genome, so it identifies and collects not only LTR retrotransposons but the other transposable elements as well. The package itself can then use that library to annotate and analyze the transposons.
  3. Dating of the LTR retrotransposons in each genome, based on the differences between the 5' LTR and the 3' LTR of each transposable element. For this I used scripts I wrote on top of a pipeline built by the professor who supervised my internship.
  4. Comparison of the density plots of each genome, to get a clearer view of the age distribution of LTR retrotransposons in plants. I made this comparison with a script I wrote in R.
Figure 2. Diagram of the operation of EDTA. Source: Ou, S., Su, W., Liao, Y., Chougule, K., Agda, JRA, Hellinga, AJ, … Hufford, MB (2019). Benchmarking transposable element annotation methods for creation of a streamlined, comprehensive pipeline. Genome Biology, 20(1), 275. https://doi.org/10.1186/s13059-019-1905-y
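
The density comparison in step 4 can be sketched as follows. My original script was written in R and is not reproduced here; this is a plain-Python illustration of the same idea, using a Gaussian kernel density estimate with the normal reference bandwidth (1.06 · sd · n^(-1/5)). The genome names and the ages (in millions of years) are invented for the example, and `gaussian_kde` is a helper defined here, not a library function.

```python
import math
import statistics


def gaussian_kde(values, grid, bandwidth=None):
    """Evaluate a Gaussian kernel density estimate of `values` at each point of `grid`."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    if bandwidth is None:
        # Normal reference rule; an assumption, other bandwidth rules are possible.
        bandwidth = 1.06 * statistics.stdev(values) * n ** (-1 / 5)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)
        for x in grid
    ]


# Invented insertion ages (million years) for two hypothetical genomes.
ages_by_genome = {
    "genome_A": [0.2, 0.4, 0.5, 0.7, 0.9, 1.1, 1.6, 2.4],
    "genome_B": [1.8, 2.1, 2.5, 2.6, 2.9, 3.3, 3.8, 4.5],
}

step = 0.1
grid = [i * step for i in range(61)]  # 0.0 to 6.0 million years

densities = {name: gaussian_kde(ages, grid) for name, ages in ages_by_genome.items()}

# The grid position with the highest density is the most common age in that genome.
peaks = {
    name: grid[max(range(len(grid)), key=density.__getitem__)]
    for name, density in densities.items()
}

for name, peak in peaks.items():
    print(f"{name}: density peak at about {peak:.1f} million years")
```

Plotting each density curve on the same axes gives the kind of comparison described in step 4: a genome whose curve peaks closer to zero has, on average, more recently inserted LTR retrotransposons.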

To make the programs mentioned above easier and more intuitive to use, I worked in the graphical interface provided by RStudio, version 1.3.1093.

I do not think it is possible to write out the whole procedure in detail here, so for those interested I decided to put an explanatory video at the end of this article showing how to identify and date LTR retrotransposons starting from the downloaded genomes.

Well, I would say that is all for today. Below are the sources used for this article and the script I wrote, which you can use to replicate the study I carried out. One warning, though: running the EDTA package requires a computer with a lot of computing power; in fact, I used the university server to run the code. Anyway, I wish you the best of luck, and if you liked the article I would like to know your opinions, so leave a comment and a "like".

Bye and see you soon.


- Link to download the EDTA package: https://github.com/oushujun/EDTA

- Plant Retrotransposons. Amar Kumar and Jeffrey L. Bennetzen. (https://www.annualreviews.org/doi/abs/10.1146/annurev.genet.33.1.479?journalCode=genet)

- The population genetic structure approach adds new insights into the evolution of plant LTR retrotransposon lineages. (https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0214542)

- Diversity, Origin, and Distribution of Retrotransposons (gypsy and copia) in Conifers. Nikolai Friesen, Andrea Brandes, and John Seymour (Pat) Heslop-Harrison. (https://pdfs.semanticscholar.org/391b/5f713c2f08ab5677891cacbb992114bdb955.pdf)