I love my job as a bioinformatician, don’t get me wrong, but it can be really frustrating. It’s a job that truly tests your patience, where stress is a daily occurrence. In this field, you quickly realize how much you still need to learn, and that can be discouraging—there’s never a point where you’ve “arrived.” However, this seemingly negative aspect can be the most stimulating. The constant need to improve pushes you to explore new concepts and cultivate curiosity. It’s a continuous attempt to surpass your own limits.

This introduction serves to emphasize that, given how vast bioinformatics is, it’s essential to have skills in statistics, mathematics, and computer science in general. However, one must never forget biology, even though I often tend to overlook it on this blog. Lately, I’ve been focusing on concepts useful for biological data analysis, at the risk of neglecting where those data come from. Today, I’d like to make up for that by discussing sequencing. I know, I’ve already dedicated articles to this topic in the past, but my knowledge has matured, so it’s time to update you on what I’ve learned. Sit back and join me for an overview of sequencing in bioinformatics.

What is sequencing in bioinformatics?

In bioinformatics, we work with data derived from the sequencing of nucleic acids, such as DNA and RNA. Sequencing is the process that allows us to "read" the sequence of nitrogenous bases in these molecules, converting biological information into digital data. In practice, it takes us from the biological molecule to digital data, which can then be analyzed with computational methods to answer research or medical questions.
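
To give a concrete picture of this "digital data": most sequencers ultimately deliver reads in FASTQ format, four lines per read, holding the base calls and a per-base quality string. Here is a minimal sketch of pulling a read apart; the read ID, sequence, and quality string below are invented for illustration.

```python
# Parse a single FASTQ record (four lines per read).
# The read ID, sequence, and qualities are hypothetical.
fastq_record = "@read_001\nGATTACAGATTACA\n+\nIIIIIHHHHGGGFF\n"

header, sequence, separator, qualities = fastq_record.strip().split("\n")

read_id = header[1:]     # drop the leading "@"
length = len(sequence)   # number of base calls in this read

print(read_id, length, sequence)
```

In real projects you would of course use an established parser rather than splitting strings by hand, but the four-line structure is exactly this simple.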

Sequencing methods

There are different sequencing methods, primarily divided into two approaches: first-generation methods and next-generation sequencing (NGS), which includes second- and third-generation methods.

  • First-generation sequencing: The main method is Sanger sequencing, in which a single template (one DNA fragment per reaction) is sequenced. Although revolutionary and instrumental in sequencing the first human genome, this method is now used only for small sequences or quality control, as it is too slow and costly for large-scale projects.

  • Second-generation sequencing: Also known as "high-throughput sequencing," it allows the simultaneous sequencing of millions of DNA fragments. The main technologies include Illumina and Ion Torrent, whose workflows share several general steps:
    1. Fragmentation: DNA or RNA is fragmented physically, chemically, or enzymatically into optimally sized pieces.
    2. Fragment end-repair: The ends of the fragments are modified to allow the ligation of sequencing adapters, short platform-specific nucleotide sequences that make the subsequent steps possible.
    3. Adapter ligation: The adapters include sequences for the sequencing primer, barcodes to distinguish different samples, and other useful regions. They also enable multiplexing, allowing for the simultaneous sequencing of multiple samples, which are then distinguished via computational demultiplexing. Additionally, having adapters on both ends of a fragment allows for sequencing in one direction (single-end reads) or both directions (paired-end reads) if needed.
    4. Fragment immobilization: The fragments with adapters are immobilized on the surface of the sequencer, using PCR-based methods or hybrid capture.
    5. In situ amplification: The fragments are amplified directly on the surface to strengthen the signal read during sequencing.

    Preparing the "library," or the collection of fragments ready for sequencing, is a crucial step that influences the type of sequencing. There are various types of libraries depending on the sequencing needed, for example:

    • Whole Genome Sequencing (WGS): Fragments are randomly generated, and the entire genome is sequenced.
    • Target Sequencing (TS): Focuses on specific regions, such as a gene or exon.
    • Whole Exome Sequencing (WES): Aims to sequence all coding regions of the genome (exons).
    • RNA Sequencing (RNA-seq): Requires converting RNA into cDNA and includes various library types, such as those for the whole transcriptome (WTS), mRNA only (mRNA-seq), or small RNA (small RNA-seq).

  • Third-generation sequencing: Third-generation methods, such as PacBio and Oxford Nanopore (ONT), do not require amplification of the template (and, in the case of nanopore sequencing, can read native molecules directly), reducing the biases those steps introduce. These technologies produce "long reads" (long sequences), often exceeding 10 kbp in length, which can detect structural variants and other large-scale genomic features. Although initially less accurate than short reads, recent technological improvements have steadily reduced error rates.
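
The computational demultiplexing mentioned above for second-generation libraries can be sketched as a simple barcode lookup at the start of each read. Everything here (barcodes, sample names, reads) is invented for illustration; real demultiplexers, such as Illumina's bcl2fastq, also tolerate mismatches in the barcode rather than requiring exact matches.

```python
# Assign reads to samples by matching the barcode at the start of each read.
# Barcodes and reads are hypothetical; real tools allow barcode mismatches.
barcodes = {"ACGT": "sample_A", "TGCA": "sample_B"}

reads = [
    "ACGTGATTACAGATTACA",  # barcode ACGT -> sample_A
    "TGCACCCGGGAAATTT",    # barcode TGCA -> sample_B
    "NNNNGATTACA",         # unknown barcode -> undetermined
]

demux = {}
for read in reads:
    barcode = read[:4]  # barcodes in this sketch are 4 nt long
    sample = barcodes.get(barcode, "undetermined")
    # store the read with its barcode trimmed off
    demux.setdefault(sample, []).append(read[4:])

print({sample: len(rs) for sample, rs in demux.items()})
```

The "undetermined" bucket is a real feature of demultiplexing output: reads whose barcode cannot be assigned to any sample end up there.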

What to do with the reads?

Once reads are obtained, their analysis varies depending on the project's objectives:

  • Quality control (QC): Verifying the integrity and reliability of the sequences.
  • Alignment: Reads are mapped to a reference genome to identify similarities and differences (resequencing).
  • De novo assembly: Building a new genome or transcriptome without using a reference.
  • Variant calling: Identifying variants such as SNPs, indels, and translocations.
  • Gene expression quantification: In RNA-seq, quantifying gene expression based on the number of mapped reads.
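
For the last point, the simplest read-count-based quantification just normalizes each gene's raw count by the library size, e.g. counts per million (CPM). The gene names and counts below are made up purely to show the arithmetic.

```python
# Counts-per-million (CPM) normalization of raw mapped-read counts.
# Gene names and counts are hypothetical.
raw_counts = {"geneA": 500, "geneB": 1500, "geneC": 8000}

library_size = sum(raw_counts.values())  # total mapped reads in this sample

cpm = {gene: count / library_size * 1_000_000
       for gene, count in raw_counts.items()}

print(cpm)
```

CPM only corrects for sequencing depth between samples; methods that also account for gene length (TPM) or composition biases (as in DESeq2 or edgeR) are used in practice for differential expression.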

When to use second- or third-generation methods?

The choice between second- and third-generation methods depends on the objectives:

  • Differential gene expression analysis: The second generation is preferable, offering more accurate reads for precise gene expression quantification.
  • Structural variant calling: Third generation is more suitable due to the ability to generate long reads, facilitating the identification of large variants.
  • Resequencing vs. de novo assembly: If a reference genome is available, use second-generation for resequencing. For constructing a new genome, third-generation long reads are better suited.

De-novo seq (left) vs Reseq (right)

Key technical terms

When discussing sequencing, several technical terms frequently come up that need to be understood and used appropriately. Here are some important ones:

  • Sequencing accuracy: During sequencing, the machine performs "base calling": each nucleotide of the fragment is "read," and in the process the machine may misidentify, and therefore miscall, a particular base. Accuracy measures the machine's ability to correctly call the nitrogenous bases of the fragment, expressed as the probability of a correct base call. For example, an accuracy of 90% indicates that out of 100 nitrogenous bases, 90 are called correctly on average. Typically, read accuracy ranges from ~90% for traditional long reads to >99% for short reads. Remember, sequencing accuracy depends on the machine used but especially on sequencing depth and coverage.

  • Sequencing depth: Sequencing depth expresses the number of times a particular nitrogenous base is found, on average, in a collection of raw reads. In practice, it indicates how many times, on average, a particular nucleotide is sequenced. Sequencing depth is therefore an average value and is indicated with nX, for example, 10X means that each nucleotide is sequenced 10 times on average. In general, it is considered that greater depth leads to higher accuracy but also higher costs. The main factors influencing sequencing depth are:

  1. Quantity of initial DNA: A greater amount of available DNA increases the chance of higher sequencing depth because there are more DNA molecules to sequence.
  2. Length of the reads: The longer the reads, the greater the probability of covering a specific region of the genome multiple times, contributing to greater sequencing depth.
  3. Sample quality: Low-quality, degraded, or contaminated DNA samples can reduce sequencing efficiency and negatively affect depth.
  4. Budget and project objectives: A larger budget allows for more sequencing cycles or the use of more advanced technologies to achieve greater depth. The project goals also determine the necessary level of depth depending on the required accuracy.
  5. Type of sample: For complex samples, such as metagenomes (containing DNA from many species), higher depth is often necessary to obtain a representative coverage of all species present.

  • Sequencing coverage: Coverage indicates how much of the DNA or RNA under study has been "read" at least once; in other words, what proportion of the input material is actually sequenced. Coverage is expressed as a percentage: for example, 95% coverage means that 95% of the DNA/RNA has been read at least once by the machine. It is very important not to confuse sequencing depth and coverage (as many unfortunately do) because, while they are interconnected, the former expresses how frequently each base is sequenced, while the latter expresses the proportion between sequenced and unsequenced portions. The factors influencing coverage largely overlap with those seen for depth; read length plays an especially direct role, since longer reads are more likely to span difficult or repetitive regions, contributing to greater coverage.
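
The three terms above can be made concrete with a toy per-base depth profile. The depth values and the 10-base "genome" are invented; the Phred relation Q = -10·log10(P_error) used for accuracy is the standard one.

```python
import math  # imported for completeness; log10 could also be used directly

# Per-base sequencing depth over a hypothetical 10-base region.
depth = [12, 10, 0, 9, 11, 13, 0, 10, 8, 12]

# Mean depth ("X"): average number of times each base was sequenced.
mean_depth = sum(depth) / len(depth)

# Coverage: fraction of bases read at least once.
coverage = sum(1 for d in depth if d > 0) / len(depth)

# Phred score Q relates to the base-calling error probability P via
# Q = -10 * log10(P), so Q30 means a 1-in-1000 error, i.e. 99.9% accuracy.
q30_accuracy = 1 - 10 ** (-30 / 10)

print(f"{mean_depth}X mean depth, {coverage:.0%} coverage, "
      f"Q30 accuracy {q30_accuracy:.3f}")
```

Note how the two zero-depth positions drag the coverage below 100% even though the mean depth is respectable: exactly the depth-versus-coverage distinction described above.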

In conclusion, sequencing in bioinformatics is constantly evolving, with technological improvements pushing the boundaries of what’s possible. Second-generation techniques remain the standard for many applications due to their accuracy and relatively low cost, while third-generation technologies are gaining ground as their precision improves.


Resources:

  • Sequencing depth and coverage: key considerations in genomic analyses. Sims et al.
  • Next-generation sequencing technologies: An overview. Hu et al.