There are several methods to study chromatin accessibility, aiming to understand gene expression regulation at the pre-transcriptional level. One such method is ATAC-seq (Assay for Transposase-Accessible Chromatin using sequencing). This technique investigates genome-wide accessibility by leveraging DNA fragmentation mediated by the Tn5 transposase dimer, which binds to chromatin regions typically free of nucleosomes.

Recently, I had the opportunity to analyze ATAC-seq data. Below is an overview of the steps involved in such analyses:

  1. Sequencing read quality control (QC).
  2. Trimming of sequencing reads.
  3. Alignment of reads to the reference genome.
  4. Post-alignment QC.
  5. ATAC-seq specific quality assessments.
  6. Peak calling.
  7. Analysis of differential accessibility regions.
  8. Motif identification.
  9. Transcription factor (TF) footprinting analysis.

One of the most common mistakes for newcomers to biological data analysis is neglecting QC steps. This article can be used to guide you and better manage quality control during the analysis of ATACseq data.

Specifically, QC can be classified into three main levels:

- Raw read quality control.

- Post-alignment quality control.

- Pre-downstream analysis quality control.

Raw read quality control

ATAC-seq reads are typically sequenced using Illumina platforms. It's crucial to assess quality parameters such as adapter contamination and sequencing quality, using tools like FASTQC.

Key aspects to evaluate include:

- Read Quality
Phred Quality Scores: Ensure base quality scores are consistently high (above 30), especially for early read positions. A natural quality decline toward the end of reads is expected with Illumina technology. If quality sharply declines toward the ends, use trimming tools like fastp.

- Per base sequence content
Display the percentage of base into reads sample sequence. This should show an almost linear and parellel trend in all the bases.

- Adapter Contamination
Given the small DNA fragments used in ATAC-seq, reads often include adapter sequences, especially in shorter fragments. Use tools like Cutadapt or fastp to remove adapter sequences if contamination is detected.

- Overrepresented Sequences
Some sequences may be overrepresented due to:

  1. Genomic repetitive regions.
  2. Residual adapter sequences.
  3. PCR amplification artifacts.

While some overrepresentation is normal, problematic sequences should be filtered out.

- GC Content Distribution
Chromatin-accessible regions often exhibit atypical GC content. However, extreme GC bias may indicate technical issues, such as PCR amplification bias.

- Duplicate Reads
A certain level of duplicates is expected due to:

  1. PCR Bias: A technical artifact during library preparation.
  2. Biological Enrichment: True enrichment in highly accessible chromatin regions.

A duplicate rate exceeding 50% may suggest poor library quality. Mark duplicates using tools like Picard MarkDuplicates, and aim for improvements after trimming.

Post-alignment quality control

ATAC-seq analysis involves quantifying reads mapped to reference genome regions. To do this, a count matrix is generated, where columns represent individual samples, rows represent peaks (or open chromatin regions), and the cells contain the number of reads mapped to a specific peak for a given sample. To construct such a matrix, the first step is aligning the reads to the reference genome.

Alignment is performed using specialized tools like Bowtie2, which produce BAM files. Before assessing alignment quality, duplicate reads (reads with identical start and end positions on the genome) should be marked using tools such as Picard MarkDuplicates.

Marking duplicates involves identifying reads that are identical after alignment.

At this stage, it is essential to distinguish between two types of duplicates:

1. PCR Duplicates: These are technical artifacts introduced during the amplification process necessary for sequencing library preparation. Reducing these duplicates is desirable, as they artificially inflate mapping values for non-biological reasons.
2. Biological Duplicates: These reflect meaningful biological characteristics. For instance, a region may show numerous duplicate reads because it is highly accessible to Tn5 (indicating open chromatin) or because the region has been duplicated across the genome during evolution.

Marking duplicates serves several purposes:

  1. Preventing Bias in Quantitative Analysis:
    PCR duplicates do not reflect the true biological abundance of a fragment. If not removed or marked, they can artificially inflate signals, especially when quantifying genomic regions such as chromatin accessibility in ATAC-seq.
  2. Signal Normalization:
    By marking duplicates, downstream tools (e.g., peak callers like MACS2) can ignore or appropriately weigh them during analysis.
  3. Library Complexity Estimation:
    Marking duplicates allows for the estimation of the proportion of unique reads versus duplicates, which is an indicator of library quality.

Should you Remove the marked Duplicates?

The answer is: It depends!

It is important to consider that the bioinformatic tools used for marking duplicates cannot distinguish between PCR duplicates and biological duplicates. As a result, removing these marked duplicates may sometimes lead to the loss of valuable information.

- When to Remove Duplicates:
This is useful in experiments where PCR duplicates introduce significant bias, such as RNA-seq or amplicon-based studies. However, removing duplicates may also eliminate biologically relevant data, leading to information loss.

- When to Mark and Retain Duplicates:
This is the standard practice for ATAC-seq. Marking duplicates allows downstream tools to distinguish between technical and biological duplicates without discarding potentially useful data.
REMEMBER: When calculating signals or calling peaks with tools like MACS2, you can specify whether to ignore marked duplicates.

Ok but which Post-alignment quality control metrics should we consider?

After alignment, a thorough evaluation of data quality is crucial. Picard CollectMultipleMetrics is a versatile tool that generates a variety of metrics to assess the quality of reads aligned to the reference genome.
Below is an overview of the main metrics generated by Picard and how to interpret them.

- Alignment Summary Metrics
Provides general statistics about read alignment.
Key Metrics:

  • TOTAL_READS: Total number of reads processed.
  • PF_READS_ALIGNED: Number of reads successfully aligned. (PF stands for Pass Filter, referring to reads that meet quality criteria established during sequencing, such as Illumina's software).
  • PERCENT_ALIGNED: Percentage of reads aligned to the genome (note: not all aligned reads pass the quality filter).
  • PF_MISMATCH_RATE: Rate of mismatches in aligned reads. (A mismatch refers to a discrepancy between a read's sequence and the reference genome sequence at a specific position.)
  • PF_INDEL_RATE: Frequency of insertions and deletions (indels) in reads.
  • STRAND_BALANCE: Balance between reads mapped to the two strands. (Values far from 0.5 may indicate sequencing bias.)
    Interpretation:
    High alignment percentages (>80%) and low mismatch/indel rates indicate high-quality data. An imbalanced strand ratio may suggest library bias or technical issues.

- Insert Size Metrics
Measures the distribution of distances between paired-end read ends (fragment lengths).
Key Metrics:

  • MEAN_INSERT_SIZE: Average fragment length.
  • STANDARD_DEVIATION: Variability in fragment length.
  • MEDIAN_INSERT_SIZE: Median fragment length.
  • MODE_INSERT_SIZE: Most common fragment length.
    Interpretation:
    In ATAC-seq, expected fragment lengths are ~150 bp for mononucleosomes and ~300 bp for dinucleosomes. High variability may indicate fragmentation issues or library preparation errors.

- Quality Score Distribution Metrics
Analyzes the distribution of Phred quality scores across reads.
Key Metrics:

  • MEAN_QUALITY_SCORE: Average quality score.
  • STANDARD_DEVIATION_QUALITY_SCORE: Variability in quality scores.
    Interpretation:
    High average quality scores (>30) reflect good data quality. High variability may indicate the need for trimming low-quality bases.

- Base Distribution by Cycle Metrics
Examines the composition of bases (A, T, C, G) across sequencing cycles.
Key Metrics:

  • Percentage of each base (A, T, C, G) per cycle.
    Interpretation:
    Unbalanced base distributions may indicate sequencing bias, especially during early cycles.

- GC Bias Metrics
Assesses bias in GC content distribution across reads compared to the reference genome.
Key Metrics:

  • GC_BIAS_SCORES: Quantifies the level of GC bias.
  • WINDOWS_ALIGNED: Number of aligned windows for specific GC intervals.
    Interpretation:
    Significant GC bias may impact the interpretation of genomic accessibility.

Duplication Metrics
Provides statistics on duplicate reads.
Key Metrics:

  • PERCENT_DUPLICATION: Percentage of duplicated reads.
  • UNPAIRED_READ_DUPLICATES: Number of duplicates in single-end reads.
  • PAIRED_READ_DUPLICATES: Number of duplicates in paired-end reads.
    Interpretation:
    A high duplication rate (>50%) may indicate PCR amplification bias or low-complexity libraries.

Pre-downstream analysis quality control

Now we have reached the final level of quality control, but before diving into the importance of the metrics presented, it is essential to make three important preliminary observations.

1. Basic Principle of Tn5 and the Need for Peak Shifting

Remember that the Tn5 dimer identifies binding sites in chromatin regions that are typically nucleosome-free. Once bound, the dimer performs two staggered cuts, 9 bp apart—one on the + strand and one on the - strand—while adding two adapters in the process. As a result, the mapped reads originating from each open chromatin region bound by Tn5 will produce two peaks that are offset from the actual binding site and the true open chromatin region.

To correctly position a single peak at the site of open chromatin, an offset of +4 bp for the peak on the + strand and -5 bp for the peak on the - strand is applied.

Additionally, reads from each centered peak are divided into four distinct bins, typically using classification models like Random Forest, based on read length, GC content, and optionally the conservation score.
The conservation score measures the evolutionary conservation of a genomic region. In ATAC-seq, conservation score analysis provides insights into the biological validity of your data, as open chromatin regions with essential regulatory functions tend to be evolutionarily conserved. A good overlap between ATAC-seq peaks and genomic regions with high conservation scores indicates high-quality data.

The four classes are:

  1. Nucleosome-Free Bin (<100 bp).
  2. Mononucleosome Bin (180–247 bp).
  3. Dinucleosome Bin.
  4. Trinucleosome Bin.

Occasionally, some reads do not fit into any of these categories and are labeled as "discarded."

2. Nucleosome-Free Regions Enriched in Promoters and TSS

As mentioned earlier, Tn5 binds open chromatin regions, which is why we use this approach. It allows us to sequence fragments from these regions, which are generally associated with transcriptionally active genes. Based on this reasoning, we are particularly satisfied when our fragments predominantly come from nucleosome-free regions, rich in promoters and transcription start sites (TSS).

Nucleosome-free regions are highly enriched in transcriptionally active genes and transcription factor binding sites (TFBS). Promoters are especially important, as their accessibility typically allows transcription factors to bind and regulate gene expression. Similarly, the accessibility of the TSS region predisposes a gene to transcription.

However, it is crucial to exercise caution. The presence of a gene in a nucleosome-free region with accessible promoters and TSS does not guarantee its expression. Many factors come into play, such as:

  • Post-transcriptional regulation at the level of mRNA.
  • Protein-level regulation after translation.
  • Transcriptional repression, such as silencer activity or downregulation of activating transcription factors.

In essence, chromatin accessibility indicates a gene's predisposition to expression, but a comprehensive understanding requires additional data, such as transcriptomic analysis (RNA-seq), translational profiling via ribosome profiling (Ribo-seq), and so on.

3. From Peaks to Transcription Factor Footprints

An accessible chromatin region can be interrogated to identify DNA motifs that are enriched and potentially bound by specific transcription factors. ATAC-seq data can be used to predict transcription factor binding sites (TFBS) by leveraging a Position Weight Matrix (PWM), calculated using methods such as "matchPWM."
The coverage of reads (mapped reads) at each TFBS is calculated to create a footprint of potential binding by transcription factors.

Ok, now we could look at the Pre-Downstream Analysis Quality Control metrics:

- Library Complexity
This metric reflects the number of distinct DNA molecules identified in the total reads produced by sequencing. The more unique molecules identified as the total read count increases, the better the library quality, as it demonstrates high diversity. Ideally, one would expect a linear trend in this metric.

- Fragment Size Distribution
This metric shows the frequency distribution of different fragment lengths generated by Tn5 dimer activity. As noted earlier, a high proportion of reads with fragment lengths <100 bp (nucleosome-free regions) is ideal, followed by periodic declines corresponding to mononucleosome, dinucleosome, and trinucleosome fragments. Deviations from this pattern may indicate Tn5 malfunctions.

- Promoter/Transcript Body Score (PT Score)
This score measures how much of the signal (mapped reads) comes from promoter regions. It is calculated as the ratio of coverage at promoters to coverage at transcript bodies. Ideally, the majority of the ratios (calculated for all annotated promoters and transcript bodies in the reference genome) should lean toward higher promoter-derived read counts (5–10 or higher).

- Nucleosome-Free Region Score (NFR Score)
The NFR score quantifies the signal originating from nucleosome-free regions. It is calculated by comparing coverage at TSS regions to the coverage at flanking regions.

- Transcription Start Site Enrichment Score (TSS Enrichment Score)
This score indicates how much signal maps to TSS regions. It is calculated as the ratio of reads mapping to TSS to those mapping to flanking regions.

For each annotated TSS, a parent bin of +1,000 bp and -1,000 bp around the TSS (total of 2 kbp) is defined. This parent bin is subdivided into 100 bp child bins, and the signal (number of mapped reads) is calculated for each child bin. The ratio of the maximum signal bin to the average signal bin is computed for each TSS, as illustrated below.

After calculating the TSS enrichment score for each annotated TSS, the values are aggregated across all TSS within a 2 kbp window. The resulting plot shows:

  • X-axis: Distance from the TSS (center of the 2 kbp window).
  • Y-axis: Average aggregated TSS enrichment score per position in the bin.

- Sample Correlation
Finally, it is essential to test for sample correlation to identify potential redundancies in information between samples, which could introduce noise into downstream analyses.

We have now explored in detail how to evaluate the quality of ATAC-seq data. I hope you find this article useful. Feel free to leave a comment and start a good discussion if you’re interested.


Bye bye

Omar Almolla


References: