Recently I have often sat down at the computer intending to write something more technical and present more in-depth bioinformatics analyses, but I kept running into the same difficulty: to really understand a certain type of bioinformatics analysis, you also need to know some fundamental concepts of molecular biology. So the question I kept asking myself was: should I first write about basic concepts, such as what a gene is, before presenting some useful bioinformatics analyses? Well, yes. I have chosen a topic that I think is a prerequisite for understanding what comes next. In this article I explain what a gene is and what types of analyses we can perform when we have one in our "hands".

Let's start with a simple but at the same time rigorous definition.

A gene is a DNA sequence capable of producing RNA following a process known as gene transcription.

RNAs are generally single-stranded nucleic acids whose functions depend on their structure. There are many types of RNA, but the ones I think are worth mentioning are:

  • mRNAs, which carry the message necessary for protein synthesis in the process known as translation.
  • rRNAs, which are found in ribosomes and are involved in the translation process.
  • tRNAs, which take part in translation by carrying amino acids to the ribosomes during protein synthesis.
  • short RNAs, small RNAs involved in various processes such as epigenetic silencing of genes, post-transcriptional RNA modification and defense against viruses. Two well-known classes of short RNAs are miRNAs and siRNAs, but let's not go into too much detail because the world of short RNAs is vast.
  • snRNAs (small nuclear RNAs), which are found in the nucleoplasm and, together with proteins imported from the cytoplasm, build the spliceosome, so they are essential for the splicing process that leads to the maturation of mRNA.
  • snoRNAs (small nucleolar RNAs), which are involved in the maturation of rRNA.
  • lncRNAs (long non-coding RNAs), RNAs longer than 200 nucleotides transcribed from non-coding DNA sequences. The role of lncRNAs is not yet very clear, since they were discovered relatively recently, but this type of RNA is known to be abundant in eukaryotic cells. It is thought that some of these RNAs are a by-product of RNA polymerase activity, while others play a key role in controlling gene expression.

Based on the type of RNA produced by transcription, genes are divided into:

  • Structural genes, i.e. genes that code for proteins, meaning that they produce mRNAs that will be translated into proteins.
  • Non-structural genes, i.e. genes that do not produce mRNA but RNAs that do not code for proteins, such as rRNA, tRNA, short RNAs and others.

Genes are located on chromosomes, and sometimes several genes share the same chromosomal region. For example, a gene on the upper strand and a gene on the lower strand can partially overlap, or a gene can behave like an intron of another gene when the latter is transcribed. For simplicity I have decided to focus on the genes of eukaryotic organisms, but if you are interested I could cover the genes of prokaryotes in another article. Genes in eukaryotes are generally made up of:

  1. A transcriptional unit, that is, the portion of the gene that is actually transcribed into RNA.
  2. The promoter, i.e. the region of the gene containing the cis elements, short sequences specialized in binding transcription factors (trans elements), the proteins that take part in transcription together with the enzyme RNA polymerase. The promoter is therefore the portion of the gene that regulates the transcription of the transcriptional unit. A gene can have one or more promoters.

At this point it is necessary to clarify how to pass from DNA to RNA. Trying to be very quick but at the same time clear, I thought of describing the two key steps that lead to the formation of a mature RNA starting from a gene.

  1. Gene transcription, a process in which the transcription machinery (RNA polymerase + transcription factors) binds to the promoter and begins to copy the DNA into RNA (read more here).
  2. RNA maturation. This step is fundamental because the RNA produced immediately after transcription, called pre-RNA, does not yet have its definitive, functional structure. The mature, functional RNA is obtained through a series of chemical modifications that differ according to the type of gene and therefore of RNA being produced, so I will not go into detail now. For now you just need to know that maturation turns a pre-RNA into a mature RNA and that, thanks to this process, the same gene can give rise to different mature RNAs depending on which maturation events occur on the pre-RNA molecules.
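
To make the first step more concrete, here is a minimal Python sketch of transcription. It is only an illustration under simple assumptions: the function name `transcribe` and the example sequence are made up for this article, and the template strand is assumed to be written 3'→5' so that the RNA comes out 5'→3'.

```python
# Base-pairing rules followed when the template strand is copied into RNA.
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_3to5: str) -> str:
    """Return the RNA (5'->3') complementary to a template strand given 3'->5'."""
    return "".join(DNA_TO_RNA[base] for base in template_3to5.upper())


# Hypothetical template strand, written 3'->5'.
template = "TACGGCATT"
pre_rna = transcribe(template)
print(pre_rna)  # AUGCCGUAA
```

Note how TAC on the template becomes AUG in the RNA, and ATT becomes UAA. The sketch ignores promoters, transcription factors and maturation entirely.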

Well, although I have tried to present the concepts in a very basic way, I think I have put a lot in the pot, but I ask you to make one last effort. I think it is right to present at least the structure of structural genes and mRNAs, as these are used in many bioinformatics analyses.

THE STRUCTURE OF STRUCTURAL GENES AND mRNAs:

A structural gene is a DNA sequence that can code for one or more proteins: it is transcribed into pre-mRNA, which then matures into mRNA, which is in turn translated into protein.
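
The last step, from mRNA to protein, can be sketched as follows. This is a simplified illustration, not how a real ribosome chooses its start site: I assume translation starts at the first AUG found and stops at the first in-frame stop codon. The function name `translate` and the example mRNA are invented for the example; the codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, with codons enumerated in U, C, A, G order.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Translate from the first AUG up to the first in-frame stop codon ('*')."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, nothing to translate
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical mature mRNA: short 5'-UTR, AUG CCG UUU UAA, short 3'-UTR.
print(translate("GGAUGCCGUUUUAAGG"))  # MPF
```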

I think many of you are wondering how more than one protein can be made from a single gene. Well, the answer is simple. A gene can lead to the production of multiple proteins for several reasons:

  1. A gene can have different transcription start and end sites in its transcriptional unit, so depending on where transcription begins and ends we get different pre-mRNAs and therefore different proteins.
  2. The maturation events can occur in alternative ways on the pre-mRNA molecules transcribed from the same gene, which leads to different mature mRNA molecules and therefore different proteins. In particular, the pre-mRNA maturation process includes:
    • Capping, i.e. the addition of the cap (a 7-methylguanosine) to the 5' end of the pre-mRNA, upstream of the 5'-UTR. Capping involves, among others, the enzymes guanylyltransferase and guanine-N7-methyltransferase.
    • Polyadenylation, i.e. the addition of the poly-A tail downstream of the 3'-UTR of the pre-mRNA once it has been released from the "molecular transcription machine", thanks to the presence of a specific signal sequence, namely AAUAAA. The addition of the poly-A tail is catalyzed by an enzymatic complex that includes an endonuclease and a poly-A polymerase.
    • Splicing, i.e. the removal of the intronic sequences (introns) of the gene, which are not translated (in plants they typically range from several tens of base pairs to a few thousand), so that only the exonic sequences (exons) remain. Exons are translated, since they carry the genetic code needed to synthesize a specific polypeptide in the cytosol during translation. Introns lie between exons, so besides removing them, splicing also joins the exons that flank each intron. This process takes place in the nucleoplasm thanks to a complex called the spliceosome. Splicing can also occur in alternative ways, which leads to different mRNAs and therefore different proteins. There are different types of alternative splicing:
      • Exon skipping, i.e. the removal of one or more exons together with the introns. This type of alternative splicing is very common in humans but rare in plants.
      • Alternative 5' splice site, i.e. the use of a different 5' (donor) splice site within an exon, so that part of that exon is removed and only the portion upstream of the chosen 5' site is kept.
      • Alternative 3' splice site, i.e. the use of a different 3' (acceptor) splice site within an exon, so that part of that exon is removed and only the portion downstream of the chosen 3' site is kept.
      • Intron retention, i.e. keeping an intron in the mature transcript. This is very frequent in plants but infrequent in humans.
 Different types of splicing events are shown schematically. (A) In constitutive splicing, all introns are spliced out and all exons are joined together to produce mRNA. (B) By alternative splicing, pre-mRNA can encode more than one mRNA isoform. Different isoforms can be generated by exon skipping/inclusion of alternative exons, the selection of alternative 5' or 3' splice sites, the retention of intron(s) or selection of the mutually exclusive exon(s). Exons and mRNAs are illustrated as boxes, while introns are represented by solid lines. From: https://hrjournal.net/article/view/2693
  • Trans-splicing, i.e. the formation of a mature mRNA by joining exons that come from two different pre-mRNA molecules.
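
To see how one pre-mRNA can give several mature mRNAs, here is a toy Python sketch of constitutive splicing, exon skipping and intron retention. Everything in it is an assumption made for illustration: the `PRE_MRNA` sequences are invented, the `splice` function is not a real tool, and alternative 5'/3' splice sites and trans-splicing are left out.

```python
# Hypothetical pre-mRNA: exons and introns in 5'->3' order (toy sequences).
# The toy introns start with GU and end with AG, as most real introns do.
PRE_MRNA = [
    ("exon", "AUGGCU"),
    ("intron", "GUAAGUUUCAG"),
    ("exon", "GAACCC"),
    ("intron", "GUAUGUCCUAG"),
    ("exon", "UUUUAA"),
]


def splice(segments, skip_exons=(), retain_introns=()):
    """Join segments into a mature transcript.

    skip_exons and retain_introns hold 1-based exon and intron numbers.
    """
    mature = []
    exon_number = intron_number = 0
    for kind, sequence in segments:
        if kind == "exon":
            exon_number += 1
            if exon_number not in skip_exons:
                mature.append(sequence)
        else:
            intron_number += 1
            if intron_number in retain_introns:
                mature.append(sequence)
    return "".join(mature)


constitutive = splice(PRE_MRNA)                         # all introns removed
exon_skipping = splice(PRE_MRNA, skip_exons={2})        # exon 2 removed as well
intron_retention = splice(PRE_MRNA, retain_introns={1}) # intron 1 kept

print(constitutive)      # AUGGCUGAACCCUUUUAA
print(exon_skipping)     # AUGGCUUUUUAA
print(intron_retention)  # AUGGCUGUAAGUUUCAGGAACCCUUUUAA
```

The three outputs are three different isoforms of the same hypothetical gene, which is exactly why alternative maturation can lead to different proteins.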

To summarize the maturation of pre-mRNA into mRNA, I thought I would include this image:

Ultimately a eukaryotic coding gene is composed of:

The promoter

The promoter of a gene is the DNA sequence (about 1000 bp long), generally placed upstream of the transcriptional unit, to which transcription factors bind so that RNA polymerase can begin transcription. As previously mentioned, a gene can have more than one promoter. Each promoter contains several key elements placed before the 5' end of the sequence to be transcribed; these are called cis-acting elements and are DNA sequences that regulate a specific gene by interacting with trans elements, i.e. the proteins called transcription factors. Some cis elements are located close to the gene (in the proximal promoter) while others are further away (in the distal promoter), but for simplicity we can say that the main cis-acting elements are:

  • The TATA-box (TATA stands for Thymine-Adenine-Thymine-Adenine), which is extremely important for transcription because it binds the general (or basal) transcription factors needed for RNA polymerase to bind the promoter. The TATA-box therefore ensures the correct positioning of RNA polymerase on the promoter. It is usually located about 30 nucleotides upstream of the transcription start site.
  • The CAAT-box, which lies upstream of the TATA-box, about 75-80 nucleotides from the transcription start site. The CAAT-box marks a binding site for transcription factors needed to express the transcriptional unit.
  • The GC-box, which works together with the CAAT-box in marking binding sites for the transcription factors necessary for transcription.
  • The G-box, a palindromic sequence (i.e. a DNA region made of inverted repeats, on the same strand or between the two strands, which can therefore form hairpin structures), usually a few nucleotides long and found within the 500-1000 nucleotides upstream of the transcription start site. This sequence is characteristic of plants, in particular of the promoters of genes whose expression is influenced by environmental stimuli and that are involved in the response to them. Transcription factors produced following a specific environmental stimulus bind the G-box; since they can interact with the G-box of several genes, these genes are regulated in a coordinated way to respond to that stimulus.
  • Enhancers, i.e. regulatory sequences located distal to the 5' or 3' end of the gene, or within introns. They are usually found far from the gene, in a region (or locus) of the genome called the distal promoter. Their function is to positively regulate the transcription of the gene: they help increase the activity of RNA polymerase. This is because enhancers bind transcription factors called activators, which favor transcription by interacting with the general transcription factors bound to the proximal promoter. The interaction takes place through a protein complex called Mediator, which acts as an intermediary between the two groups of transcription factors and helps form a chromatin structure that further favors the binding of RNA polymerase.
  • Silencers, i.e. regulatory sequences located distal to the 5' or 3' end of the gene or, like enhancers, within introns. They too are mostly found far from the gene, in the distal promoter. Their function is to negatively regulate the transcription of the gene, i.e. to repress it. Transcription factors called repressors bind the silencer sequences and block or slow down the transcription of the gene through different mechanisms:
    • INTERACTION WITH THE ACTIVATORS OF THE ENHANCERS => After binding the silencer regions, the repressors interact with the activators bound to nearby enhancers, reducing or blocking transcription.
    • SITE COMPETITION WITH ACTIVATORS => Repressors sometimes bind directly to enhancer sequences, preventing activators from binding them.
    • INFLUENCING THE CONDENSATION OF CHROMATIN => Repressors can act before transcription starts by modifying histones, which compacts the chromatin in the region of the controlled gene, so that transcription factors cannot reach its promoter and transcription cannot start.
Gene Identification | BioNinja
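
A typical bioinformatics task on promoters is to scan the sequence upstream of a gene for these cis elements. The sketch below shows the idea in Python. Treat it as a toy under loud assumptions: the patterns in `MOTIFS` are simplified consensus sequences that I chose for illustration (real motifs are more degenerate), the `upstream` sequence is invented, and its distances are compressed so that the example stays short.

```python
import re

# Simplified consensus patterns (illustrative assumptions, not a curated database).
MOTIFS = {
    "TATA-box": "TATA[AT]A",
    "CAAT-box": "CCAAT",
    "GC-box": "GGGCGG",
    "G-box": "CACGTG",
}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def scan_promoter(upstream: str):
    """Return (motif, position) pairs; -1 is the base just before the start site."""
    hits = []
    for name, pattern in MOTIFS.items():
        for match in re.finditer(pattern, upstream):
            hits.append((name, match.start() - len(upstream)))
    return sorted(hits, key=lambda hit: hit[1])


# Hypothetical upstream region (coding strand, 5'->3') ending right before the
# transcription start site; "AG" repeats are just neutral filler.
upstream = (
    "AG" * 5 + "CACGTG"       # G-box
    + "AG" * 10 + "GGGCGG"    # GC-box
    + "AG" * 3 + "CCAAT"      # CAAT-box
    + "AG" * 20 + "TATAAA"    # TATA-box
    + "AG" * 12               # spacer up to the start site
)

for name, position in scan_promoter(upstream):
    print(name, position)

# A palindromic site reads the same on both strands in the 5'->3' direction.
print(reverse_complement("CACGTG") == "CACGTG")  # True
```

In this toy promoter the TATA-box starts at position -30 and the CAAT-box at -75, in line with the distances mentioned above.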

The transcriptional unit

The transcriptional unit of the gene is the portion that is transcribed to produce an mRNA. It is made up of:

  1. The transcription start site: the site where transcription by RNA polymerase begins.
  2. The 5'-UTR: the untranslated region located at the 5' end of the transcriptional unit.
  3. The translation start codon: a codon that reads TAC on the template DNA strand (read 3'→5') and appears in the pre-mRNA as AUG. It follows the 5'-UTR, at the beginning of the coding sequence, and determines, in the cytosol, the beginning of translation of the gene message.
  4. The exons: the portions of the gene that actually code and are therefore translated.
  5. The introns: portions that lie between the exons but are non-coding and therefore not translated.
  6. The translation stop codon: a codon that reads ACT, ATT or ATC on the template DNA strand (read 3'→5') and, once transcribed, appears in the mRNA as UGA, UAA or UAG. It lies at the end of the coding sequence, just before the 3'-UTR, and causes translation of the gene message to stop in the cytoplasm.
  7. The 3'-UTR: the untranslated region located at the 3' end of the transcriptional unit. The 3'-UTR contains an important signal sequence (AAUAAA) that marks the pre-mRNA for processing.
  8. The transcription end site, i.e. the site where transcription by RNA polymerase ends.
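
The elements listed above can be located in a mature mRNA with a few lines of Python. This is a minimal sketch with assumptions stated up front: the coding region is taken to start at the first AUG and to end at the first in-frame stop codon, and the function name `annotate_mrna` and the example sequence are invented for illustration.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}


def annotate_mrna(mrna: str) -> dict:
    """Split an mRNA into 5'-UTR, coding region and 3'-UTR (simplified rules)."""
    start = mrna.find("AUG")
    if start == -1:
        raise ValueError("no start codon found")
    # Walk codon by codon from the start codon until an in-frame stop codon.
    for i in range(start + 3, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            end = i + 3
            break
    else:
        raise ValueError("no in-frame stop codon found")
    three_utr = mrna[end:]
    signal = three_utr.find("AAUAAA")  # polyadenylation signal
    return {
        "5'-UTR": mrna[:start],
        "coding": mrna[start:end],
        "3'-UTR": three_utr,
        "polyA_signal_offset": signal if signal != -1 else None,
    }


# Hypothetical mRNA: 5'-UTR + coding region + 3'-UTR containing AAUAAA.
mrna = "GGCACC" + "AUGGCUGAAUAG" + "CUUCAAUAAAGCC"
for region, value in annotate_mrna(mrna).items():
    print(region, value)
```

Here the offset of the polyadenylation signal is counted from the first base of the 3'-UTR (4 in this example).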

In bioinformatics, two terms constantly recur in the study of a protein-coding gene, and it is important to know them. These are:

  • ORF (Open Reading Frame). There are three different definitions of ORF:
  1. Definition 1: an ORF is a sequence that has a length divisible by three, begins with a translation start codon (ATG) and ends at a stop codon.
  2. Definition 2: an ORF is a sequence that has a length divisible by three and is bounded by stop codons.
  3. Definition 3: an ORF is a sequence delimited by an acceptor and a donor splice site. It thus refers to a potentially translated internal exon of a eukaryotic gene. The 5′- and 3′-terminal exons of a putative gene are determined at the end of the gene prediction process and are not considered in the actual ORF detection. Note that an ORF according to Definitions 1 or 2 can span both exons and introns; moreover, since the transcriptional unit of a gene may contain different translation start and stop codons, we can define different ORFs for the same gene.
Figure 1. Applying the Three Definitions Leads to Different Open Reading Frames (ORFs) (Indicated by Orange Lines) Concerning Their Boundaries. The corresponding ORFs vary between prokaryotes and eukaryotes. An ORF is delimited by a start codon and a stop codon (Definition 1; in the case of prokaryotes practically redundant with CDS), two stop codons (Definition 2), or donor and acceptor splice sites (Definition 3; only for eukaryotes). In all cases the ORFs are not interrupted by internal stop codons in the considered reading frame. According to Definition 2, the ORFs of a eukaryotic gene need not lie in the same reading frame. An ORF according to Definitions 1 or 2 may involve more than one exon if there are no stop codons in the intronic region in between and if they lie in the same reading frame.
  • CDS (Coding DNA Sequence, or simply coding sequence), i.e. the portion of a DNA or mRNA sequence made of the exons used to produce the protein. It should not be confused with an Open Reading Frame (ORF). A CDS can be derived from an ORF, but not every ORF yields a CDS: an ORF may contain codons that lead to a truncated protein, or cover only fragments of exons, and in such cases the sequence derived from it is not functional and cannot be called a CDS.
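
To show how the first two ORF definitions give different results on the same sequence, here is a small Python sketch. It is deliberately limited, and the limits are my own assumptions: it scans only the three forward reading frames (not the reverse strand), Definition 1 ORFs are reported with their stop codon included, and Definition 2 ORFs are reported as the stretch between two consecutive in-frame stop codons. The function names and the test sequence are invented for illustration.

```python
STOPS = {"TAA", "TAG", "TGA"}


def orfs_start_to_stop(dna: str, min_codons: int = 2):
    """Definition 1: from an ATG to the next in-frame stop codon (stop included)."""
    found = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if (i + 3 - start) // 3 >= min_codons:
                    found.append(dna[start:i + 3])
                start = None
    return found


def orfs_stop_to_stop(dna: str):
    """Definition 2: stretches lying between two consecutive in-frame stop codons."""
    found = []
    for frame in range(3):
        previous_stop_end = None
        for i in range(frame, len(dna) - 2, 3):
            if dna[i:i + 3] in STOPS:
                if previous_stop_end is not None and i > previous_stop_end:
                    found.append(dna[previous_stop_end:i])
                previous_stop_end = i + 3
    return found


# Hypothetical sequence: TAA CCC ATG GCT GAA TAG GC
dna = "TAACCCATGGCTGAATAGGC"
print(orfs_start_to_stop(dna))  # ['ATGGCTGAATAG']
print(orfs_stop_to_stop(dna))   # ['CCCATGGCTGAA']
```

The Definition 2 ORF starts before the ATG because it only requires the absence of stop codons, which is why the two definitions disagree on the boundaries.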

The image below schematically shows the structure of a eukaryotic protein-coding gene:

File:Gene structure eukaryote 2 annotated.svg

OK, now I really have come to the end of this very long but necessary article. I have tried to give a complete but simplified overview of some molecular biology concepts that are essential for a bioinformatician. Perhaps this article will also show you that a bioinformatician must know not only how to work on a computer but also how the organisms and molecules they study work.

Bye-bye and see you soon.

Sources: