Hi everyone. It’s been a while since I last wrote on the blog. Unfortunately, sometimes life gets tough—you lose energy, motivation, and above all, your sense of priorities. I’m not out of this dark period yet, and honestly I’ve lost some of my drive; I don’t have the same energy or desire as before to study and share my journey in the world of bioinformatics.
Lately, though, I’ve decided to slow down with work and start taking an hour or two each day for myself—to study, review, and explore concepts that I find useful. So why not? I’ve decided to start writing again.
It’s also wonderful to see that, despite my absence, the blog is still being visited just as frequently. That makes me happy and proud, and above all it pushes me to create more and more content. In fact, I’m also working on a restyling of the blog—you’ll notice the changes soon.
Alright, now let’s get to the good stuff.
When working with RNAseq data, we are dealing with mature mRNA molecules that are therefore devoid of intronic regions — and this holds true regardless of alternative splicing events, which produce different isoforms that are nonetheless all intron-free.
This creates a problem: if we try to use a classical aligner designed for DNA fragment alignment on a reference genome (e.g. BWA), we end up with 150bp reads that would simply be discarded because they span exon-exon junctions separated by intronic sequences — intronic sequences that can be up to 10,000bp long (wow). Due to the massive mismatch, the classical aligner would just throw the read away. Too bad.
STAR solves this major problem by accounting for the fact that, during alignment to the reference genome, a read can be “split” in two by the presence of an intron. This is why it is referred to as a “splice-aware” aligner.
STAR works in two main phases:
1. PHASE ONE: STAR takes each read and searches the reference genome for the longest possible matching sequence — this is called the “Maximum Mappable Prefix” (MMP). When it gets stuck because it hits an intron and a mismatch arises, it marks the stopping point (like placing a bookmark) and looks elsewhere in the genome for the remaining fragment that completes the read. This search is extremely fast because STAR uses an index called a “suffix array”, which allows it to search any position in the genome in nearly constant time.
2. PHASE TWO: Having now tracked the “splits” for every read, STAR uses the known splice junctions annotated in the provided GTF reference file — but it doesn’t stop there. It is ambitious enough to also discover novel splice junctions, meaning splicing events not yet annotated, directly from the data itself. The final output is a BAM file where, for every read, we know exactly where it aligned, with which mismatches, and across which junctions — known or freshly discovered.
But bear with me for one final step. These reads stored in the BAM file are what we use to generate the gene count matrix, which is the core input for differential expression analysis — the hallmark of any self-respecting RNAseq study. We need a tool like featureCounts, capable of reading every aligned read in the BAM file and asking, with great seriousness: “Which gene does this read fall on?” It does this for every single read, comparing alignment coordinates against those annotated in the GTF reference file, mapping each exon to its corresponding gene and tallying up the counts. The logic is straightforward: a gene produces n mRNA molecules, and from these we derive n reads. featureCounts counts how many reads map to a given gene, and at the end we obtain a count matrix representing the raw expression level of each gene. Naturally, if a read falls on an ambiguous region (two overlapping genes, or within an intron) it is discarded by default.

Wait, where are you going? Let me clarify one more thing. There is a key parameter to consider when generating the count matrix: strandedness. Don’t panic, I’ll keep it simple. When performing sequencing, the experimental design must specify whether we are working with a stranded or unstranded library. In the former case, we retain information about which strand of the genome was transcribed — that is, whether the read originates from the sense or antisense strand of the gene. This is far from a minor detail: on the same genomic locus there can be genes on both strands of the genome, and without this information we would have no way of knowing which gene to assign an overlapping read to. We therefore need to tell featureCounts whether we sequenced a stranded or unstranded library, so that it can perform its calculations correctly.
See you 🤗

