Hi, how are you? We're approaching spring and brighter days, and that generally puts me in a good mood. At work, I'm enjoying learning new things and improving as a bioinformatician; the more days of work I put behind me, the more experience I accumulate, and with it, my imposter syndrome decreases.
In this recent period of work, I've had the good fortune to interact with the sequencing facility technicians at my institute, and I'm learning some technical aspects that I initially thought were handled exclusively by wet scientists. In reality, decisions such as selecting the sequencing depth, the flow cell, and the sequencing platform require direct input from the bioinformatician, so these are all areas where the bioinformatician must be knowledgeable. Inspired by these recent lessons, I've decided to write a slightly different article to explain how a bioinformatician should give advice and interact during the sequencing phase as well, which precedes the data analysis that a bioinformatician is usually called upon to perform. To share what I have learned, I have imagined a dialogue between a bioinformatician and a very inexperienced wet scientist who wants to carry out a bulk RNAseq analysis.
Well, make yourselves comfortable, prepare a good coffee, and let's get started.
Wet Scientist:
Hi, my lab and I would like to perform a transcriptomic study, but we're all quite inexperienced and would appreciate your help not only for the data analysis but also for data production.
Bioinformatician:
Certainly. Tell me more about the experiment you'd like to carry out.
Wet Scientist:
We'd like to perform a bulk RNAseq experiment to compare samples under two different conditions. Specifically, we have cultured cells from patients treated with a drug A and patients who are untreated.
Bioinformatician:
Okay. So, if I understand correctly, your goal is to investigate gene expression differences based on patient condition, specifically comparing treated versus untreated groups. Correct?
Wet Scientist:
Exactly.
Bioinformatician:
Great. How many samples do you have available for each condition?
Wet Scientist:
We have 3 individuals per condition, making a total of 6 samples.
Bioinformatician:
Very good, so you have three biological replicates per condition, totaling 6 samples. Three replicates per condition is generally considered the minimum for differential expression analysis, so we can work with this number, although more replicates would give you more statistical power, especially with patient-derived samples. Let's discuss transcriptome sequencing. Since your goal is differential expression analysis, short-read sequencing like that provided by Illumina platforms would be suitable. Which short-read sequencing platform do you have access to?
Wet Scientist:
We have access to an Illumina NovaSeq sequencer.
Bioinformatician:
Excellent. Keeping costs in mind, I would recommend:
- Performing paired-end sequencing to achieve better transcriptome coverage.
- Using sequencing cycles that provide reads of 2x100bp or 2x150bp. For each library fragment (typically 300-600bp in length), you'll obtain two reads (one forward and one reverse), one from each end, each 100 or 150bp long. Even though you sequence two ends totaling 200 or 300bp, any fragment longer than that still has a central region that's not directly sequenced.
- Given your aim of differential expression analysis, it's essential to 'count' a substantial number of reads mapping to each gene, so I'd recommend aiming for a sequencing depth of 30-40 million paired-end reads per sample. If you also want to analyze alternative splicing or detect rare transcripts or gene fusions, you might increase this to 50 million reads per sample, though obviously, it would cost more.
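To make the read geometry above concrete, here is a minimal Python sketch. The function name `unsequenced_gap` is my own invention for illustration, and the fragment lengths are just the example values from the range mentioned above.

```python
def unsequenced_gap(fragment_len: int, read_len: int) -> int:
    """Bases in the middle of a fragment that neither mate covers.

    A negative value means the two mates overlap by that many bases.
    """
    return fragment_len - 2 * read_len


# Example fragment lengths from the 300-600bp range, with both read lengths.
for fragment_len in (300, 450, 600):
    for read_len in (100, 150):
        gap = unsequenced_gap(fragment_len, read_len)
        print(f"fragment {fragment_len}bp, 2x{read_len}bp -> gap {gap}bp")
```

Note that a 300bp fragment sequenced at 2x150bp has no gap at all, while a 600bp fragment sequenced at 2x100bp leaves 400bp unread in the middle.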
Wet Scientist:
I see, but how can we ensure we produce up to 50 million reads per sample?
Bioinformatician:
Good question. Usually, the sequencing facility manages this, but I can tell you that a lot depends on the choice of flow cell, which is where the sample libraries are loaded. Imagine the flow cell as a raft onto which packages are loaded; each sequencing platform uses flow cells of different sizes, providing a range of choices. Naturally, larger flow cells produce more reads. In your case, assuming your Illumina NovaSeq is a NovaSeq 6000, the available flow cells are:
- SP (small) → Total output of around 650-800 million paired-end reads.
- S1 (medium) → Total output of around 1.6 billion paired-end reads.
- S2 (large) → Total output of around 3.3 billion paired-end reads.
- S4 (extra large) → Total output of around 10 billion paired-end reads.
I don't know the exact cost of each flow cell, but larger flow cells certainly cost more. In your situation, aiming for up to 50 million paired-end reads per sample and having 6 samples, you could comfortably select an SP flow cell, which gives you sufficient sequencing depth and lets you multiplex your samples efficiently.
Wet Scientist:
Multiplexing? What do you mean?
Bioinformatician:
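Multiplexing means pooling the libraries from multiple samples and sequencing them together in the same run, instead of using a separate flow cell for each sample. In your case, we have 6 samples x 50 million paired-end reads each = 300 million reads in total, which fits comfortably within the 650-800 million reads of a single SP flow cell.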
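The same arithmetic can be written as a small Python sketch. The capacities are the approximate figures quoted in this article (using the lower bound for SP), and `smallest_flow_cell` is a hypothetical helper of mine, not part of any Illumina tool; your facility's own numbers should take precedence.

```python
# Approximate flow cell outputs as quoted in this article.
# The lower bound (650 million) is used for SP to stay conservative.
FLOW_CELLS = [
    ("SP", 650_000_000),
    ("S1", 1_600_000_000),
    ("S2", 3_300_000_000),
    ("S4", 10_000_000_000),
]


def smallest_flow_cell(n_samples: int, reads_per_sample: int):
    """Return (name, required_reads, capacity) of the smallest flow cell that fits."""
    required = n_samples * reads_per_sample
    for name, capacity in FLOW_CELLS:  # listed from smallest to largest
        if capacity >= required:
            return name, required, capacity
    raise ValueError("No single flow cell provides enough reads")


name, required, capacity = smallest_flow_cell(n_samples=6, reads_per_sample=50_000_000)
print(f"{required:,} reads needed; {name} provides about {capacity:,}")
```

With 6 samples at 50 million reads each, the sketch selects SP and leaves a good margin, which is also why a facility may pool libraries from other projects onto the same flow cell.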
Wet Scientist:
I understand, but how do we then distinguish the reads originating from each sample?
Bioinformatician:
That's a valid question. During the preparation of multiplexed libraries, the fragments from each sample are tagged with a unique barcode called an index. During sequencing, the sequencer reads these indexes together with the fragments, which allows the samples to be separated computationally. This computational step, called demultiplexing, assigns each read to the file of the sample it came from. It is typically done immediately after sequencing with tools like bcl2fastq.
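To give an idea of what demultiplexing does conceptually, here is a toy Python sketch. It is not how bcl2fastq is implemented (real tools work on the sequencer's raw output and write FASTQ files); the index sequences, sample names, and the one-mismatch tolerance are all assumptions made up for this illustration.

```python
from collections import defaultdict

# Hypothetical index sequences and sample names, for illustration only.
SAMPLE_SHEET = {
    "ATCACG": "treated_1",
    "CGATGT": "treated_2",
    "TTAGGC": "untreated_1",
}


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def assign_sample(index_read, sample_sheet, max_mismatches=1):
    """Return the single matching sample, or None if no match or ambiguous."""
    hits = [
        sample
        for index, sample in sample_sheet.items()
        if len(index) == len(index_read)
        and hamming(index, index_read) <= max_mismatches
    ]
    return hits[0] if len(hits) == 1 else None


def demultiplex(reads, sample_sheet):
    """Group read sequences by sample; unmatched reads go to 'undetermined'."""
    bins = defaultdict(list)
    for index_read, sequence in reads:
        sample = assign_sample(index_read, sample_sheet)
        bins[sample or "undetermined"].append(sequence)
    return dict(bins)


# Toy reads as (index read, insert sequence) pairs.
reads = [
    ("ATCACG", "ACGTACGT"),  # exact match to treated_1
    ("ATCACT", "TTGACCAA"),  # one mismatch from treated_1's index
    ("TTAGGC", "GGGCCCAA"),  # exact match to untreated_1
    ("GGGGGG", "AAAATTTT"),  # matches no index
]
for sample, sequences in demultiplex(reads, SAMPLE_SHEET).items():
    print(sample, sequences)
```

Allowing a small number of mismatches only works when the indexes differ from each other at enough positions, which is one reason index sets are designed carefully.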
Wet Scientist:
Great. Are there any other precautions we should consider?
Bioinformatician:
Yes. I recommend distributing the samples randomly across the flow cell rather than grouping them by condition, although sequencing facility technicians are generally already careful about this. Random loading helps to avoid batch effects.
Wet Scientist:
Wait, what's a batch effect?
Bioinformatician:
Let me explain. Imagine you're doing a cooking experiment comparing two pizza recipes. To fairly evaluate differences caused by the recipe itself, you'd randomly distribute the pizzas within the same oven. If, instead, you place all pizzas from recipe A in the hottest part (the back) and all pizzas from recipe B in the cooler part (the front), you'd introduce differences in cooking caused by oven position rather than by the recipe itself (a batch effect). By randomly distributing the pizzas within the oven, you'd minimize this technical bias. Being random in your procedures helps, but sometimes batch effects are unavoidable or introduced unintentionally. Don't worry, it's the bioinformatician's job to detect potential batch effects and, as long as they are not completely confounded with the condition you want to study, to correct for them computationally using specialized tools designed to address these confounding variables.
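The randomization idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the sample names are hypothetical, a "batch" could stand for a lane or a library preparation day, and `assign_batches` is a helper I made up, not a function from any sequencing software.

```python
import random


def assign_batches(samples, n_batches=2, seed=0):
    """Randomly assign samples to batches, spreading each condition across batches.

    samples: list of (sample_name, condition) tuples.
    Returns a dict mapping batch number -> list of sample names.
    """
    rng = random.Random(seed)  # fixed seed so the layout can be reproduced
    by_condition = {}
    for name, condition in samples:
        by_condition.setdefault(condition, []).append(name)

    batches = {b: [] for b in range(n_batches)}
    offset = 0
    for condition in sorted(by_condition):
        names = by_condition[condition]
        rng.shuffle(names)  # random order within each condition
        for i, name in enumerate(names):
            batches[(i + offset) % n_batches].append(name)
        offset += len(names)  # keep dealing where the previous condition stopped
    return batches


# Hypothetical sample names for the 3 vs 3 design discussed above.
samples = [
    ("T1", "treated"), ("T2", "treated"), ("T3", "treated"),
    ("U1", "untreated"), ("U2", "untreated"), ("U3", "untreated"),
]
for batch, names in assign_batches(samples, seed=42).items():
    print(f"batch {batch}: {names}")
```

Each batch ends up with three samples and contains both treated and untreated ones, so the batch is never perfectly aligned with the condition, which is exactly the pizza-in-the-oven situation we want to avoid.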
Wet Scientist:
Okay, thanks for your help.
Bioinformatician:
You're welcome. I'll be happy to help you with other experiments like ATACseq, ChIP-seq, or single-cell sequencing in the future. See you soon!
Well, that's a typical conversation showing the interaction between a bioinformatician and a somewhat inexperienced wet scientist regarding RNAseq experiments. As you've probably guessed, I plan to provide similar conversations for other bulk and single-cell experiments in the future, but feel free to comment on which specific experiment you'd like me to cover next.
Bye and see you soon!
Omar Almolla