A small disclaimer !!!!!

At the bottom of the page, you’ll find an important message for the community. So I encourage you to enjoy the article and then, once you’re done, take a moment to read the message I’ve written for all of you. Bye, and happy reading!

Overview:

1) Introduction to ATACseq

2) Quality Control in ATAC-seq

3) ATACseq data analysis pipeline

Introduction

Recently, I authored an article on ATAC-seq data quality control, which garnered significant interest. Encouraged by this, I decided to write an in-depth article featuring an example workflow for a complete ATAC-seq data analysis.

Let’s start with the basics.

What is ATAC-seq?

ATAC-seq (Assay for Transposase-Accessible Chromatin using sequencing) is a method to study genome accessibility. It enables researchers to identify regions of the genome accessible to transcription machinery—indicating potential transcriptional activity—and regions that are condensed and transcriptionally inactive. Understanding differences in accessibility between samples under various conditions can shed light on epigenetic mechanisms and their roles in biological processes or diseases like cancer.

How Does ATAC-seq Work?

In organisms like humans or mice, DNA is organized into chromosomes and associated with histone proteins, forming nucleosomes. Interactions among nucleosomes can create chromatin regions that are either closed and inaccessible or open and accessible. Open regions, often nucleosome-free, are more likely to contain transcription factor binding sites, promoters, enhancers, insulators, transcription start sites (TSS), and genes.

ATAC-seq leverages the Tn5 transposase to randomly cut accessible chromatin regions. This produces DNA fragments that capture a snapshot of genome-wide accessibility. The general steps for the wet-lab portion of ATAC-seq include:

  1. Nuclei Isolation: Extract nuclei from fresh or frozen tissues using protocols like Omni-ATAC.
  2. Tn5 Exposure: Optimize the Tn5/nucleus count ratio.
  3. PCR Amplification: Adjust the PCR cycle count to amplify fragments.
  4. Sequencing: Perform paired-end sequencing (40–100 bp reads). For human samples, a minimum of 50M short reads is required, and for TF footprinting, 200M short reads are recommended.

After sequencing, the reads (in FASTQ format) proceed to bioinformatic analysis. Before delving into this, it’s essential to understand two key concepts: library fragment classification and read shifting.

Library Fragment Classification by Insert Size

In paired-end sequencing, library fragments consist of two adapters and an insert, producing paired reads.

The Illumina NexteraXT transposon protocol is a cost effective way to generate paired end libraries.

The insert size significantly influences quality control during sequencing. Based on the insert size, ATAC-seq library fragments are categorized as:

  • 50–100 bp: Nucleosome-free regions.
  • 150–200 bp: Mono-nucleosome regions.
  • 300–400 bp: Di-nucleosome regions.
  • 600 bp: Tri-nucleosome regions.

Nucleosome-free fragments are more likely to contain sites of interest, such as promoters, enhancers, insulators, TSS, and transcription factor binding motifs.

Read Shifting

The Tn5 dimer binds to nucleosome-free chromatin, making staggered cuts 9 bp apart on the + and − strands while adding adapters. This results in two peaks offset from the actual open chromatin region. To correctly position the peak, reads are shifted by +4 bp for the + strand and −5 bp for the − strand.

Reads from centered peaks are further classified into bins based on read length, GC content, and conservation score. The conservation score—a measure of evolutionary conservation—validates the biological relevance of open chromatin regions. High-quality ATAC-seq data shows a good overlap between peaks and conserved genomic regions.


Quality Control in ATAC-seq

Quality control (QC) is performed at multiple stages of the ATAC-seq workflow:

Nuclei QC (Microscopy)

The experiment begins with isolating high-quality nuclei. Key QC steps include:

  • Morphological Evaluation: Use Trypan Blue or DAPI staining to ensure intact nuclei with round or oval shapes and no clumping.
  • Accurate Counting: Use an automated cell counter to optimize tagmentation and minimize technical variability.

I'm not an expert on the wet part so I'm going to cite thi paper:

“ATAC-Seq experiment begins with the isolation of high-quality intact nuclei, we first introduced a quality control checkpoint consisting of the morphological evaluation of nuclei with either Trypan Blue or DAPI staining, followed by the accurate quantification of those nuclei using an automated cell counter. Precise counting of nuclei is important to ensure optimal tagmentation (the simultaneous fragmenting of the DNA and insertion of adapter sequences) and to limit the technical variability across samples. From a qualitative perspective, individual intact nuclei with a round or oval shape should be observed with no visible clumping.”

PCR QC

To exclude degraded or over-tagmented samples, QC involves:

  • Gel Electrophoresis: A periodic DNA ladder pattern (~200 bp) indicates intact chromatin and optimal transposase activity.
  • qPCR Analysis: Assess DNA enrichment using known open chromatin sites (positive controls) and Tn5-insensitive sites (negative controls). High-quality samples show at least a 10-fold enrichment of positive controls.

As above (paper🙂

“To exclude samples with severe degradation or over-tagmentation, we assessed the quality of the treated chromatin samples by gel electrophoresis, as described in Buenrostro et al.12; if the chromatin was intact and the transposase reaction was optimal, a DNA laddering pattern with a periodicity of about 200 bp should be observed, corresponding to fragments of DNA that were originally protected by an integer number of nucleosomes (nucleosome phasing). Furthermore, we measured the enrichment of DNA accessible regions by performing real-time qPCR analysis using known open-chromatin sites as positive controls and Tn5-insensitive sites as a negative controls. When assayed by real-time qPCR, high-quality ATAC-Seq samples should show at least a 10-fold enrichment of positive control sites compared to Tn5-insensitive sites.”

Raw Reads and Pre-alignment QC

Evaluate sequencing quality and adapter contamination using tools like FASTQC, Picard CollectMultipleMetrics and ATACseqQC.

Key aspects to evaluate include:

Library Complexity

This metric reflects the number of distinct DNA molecules identified in the total reads produced by sequencing. The more unique molecules identified as the total read count increases, the better the library quality, as it demonstrates high diversity. Ideally, one would expect a linear trend in this metric.

Read Quality

Phred Quality Scores: Ensure base quality scores are consistently high (above 30), especially for early read positions. A natural quality decline toward the end of reads is expected with Illumina technology. If quality sharply declines toward the ends, use trimming tools like fastp.


Note:

Phred is a score that measures the reliability with which a base is identified during the sequencing process. This score is calculated for each nucleotide base in the sequencing reads, and the frequencies of these scores (i.e., the number of bases with a certain Phred value) are typically visualized—e.g., through a histogram—to provide a general idea of sequencing reliability.
The Phred score is calculated as follows:
Q = -10 x log10(P)
Where:

  • Q is the quality score.
  • -10 is a constant coefficient.
  • log10(P) is the probability of identifying the base incorrectly, expressed as a logarithm in base 10.

Quality by Cycle

During sequencing, various cycles of base identification and reading occur. Throughout these cycles, the quality of base identification (Phred) may fluctuate due to technical reasons or systemic errors, such as issues with sequencing chemistry, instrument malfunction, or physiological problems like fragment degradation during cycles. For instance, it is well known that in short-read sequencing platforms like Illumina, the bases at the fragment ends are harder to identify, leading to significant quality fluctuations.
Studying the Phred value trends across sequencing cycles is useful for understanding the success of the sequencing process. Generally, three trends are of interest:

  • Stable Quality: A flat and consistent line at high Q values, indicating excellent sequencing with minimal or no fragment degradation during sequencing cycles.
  • Progressive Decline in Quality: If fragment degradation is gradual, without sudden drops or unusual fluctuations, the sequencing process is deemed successful and not concerning.
  • Significant Quality Fluctuations: A sudden or pronounced decline in quality values during sequencing cycles indicates issues with sequencing chemistry or the library, necessitating further investigation.

Per Base Sequence Content

Displays the percentage of bases in the reads' sample sequence. This should show an almost linear and parallel trend for all bases. However, note that Tn5 acts randomly but favors sequences containing guanosine (G) and cytidine (C). Therefore, in the scatter plot of "Per Base Sequence Content," two distinct trends may be observed: one where G and C frequencies are nearly parallel and another where A and T frequencies are generally parallel.
Observing fluctuations at the read ends is quite normal due to the presence of adapters and the effects of Tn5 cutting. Conversely, sharp and pronounced fluctuations in the central "body" of the reads are more problematic and may require intervention. For example, trimming overly fluctuating read ends (e.g., --trim_front1 13 --trim_tail1 40) can improve the central read quality. See the comparison image below for the "Per Base Sequence Content" of reads from the same sample before and after trimming.


Adapter Contamination

Due to the small DNA fragments used in ATAC-seq, reads often include adapter sequences, especially in shorter fragments. Use tools like Cutadapt or fastp to remove adapter sequences if contamination is detected.

Overrepresented Sequences

Some sequences may be overrepresented due to:

  • Genomic repetitive regions.
  • Residual adapter sequences.
  • PCR amplification artifacts.

    While some overrepresentation is normal, problematic sequences should be filtered out.

GC Content Distribution

Chromatin-accessible regions often exhibit atypical GC content. However, extreme GC bias may indicate technical issues, such as PCR amplification bias.

Duplicate Reads

Duplicate reads arise due to:

  • PCR Bias: These are technical artifacts introduced during the amplification process necessary for library preparation. Reducing these duplicates is desirable, as they artificially inflate mapping values for non-biological reasons.
  • Biological Enrichment: These reflect meaningful biological characteristics, such as highly accessible regions of chromatin or genomic duplications during evolution.

    A duplicate rate exceeding 50% may suggest poor library quality. However, for ATAC-seq data, it is recommended to mark duplicates using tools like Picard MarkDuplicates and not remove them, as MACS2 can handle these duplicates automatically during alignment. Tools like MACS2 allow users to specify whether to ignore marked duplicates when calculating signals or calling peaks.

Insert Size Histogram

The insert size histogram visualizes the distribution of sequencing fragment sizes from a sample. This is generated using tools like Picard CollectMultipleMetrics or the ATACseqQC package, where it is referred to as a "Fragment Size Distribution Plot." Both provide the same information.
In ATAC-seq data, we expect a multimodal distribution corresponding to categories such as NF fragments, mono-nucleosome fragments, tri-nucleosome fragments, and so on, with a main peak for NF fragments.

Significant deviations from this pattern may indicate issues during library construction.

The tool Picard CollectMultipleMetrics, compared to ATACseqQC, has the advantage of highlighting the proportion of reads in Forward-Reverse orientation and Reverse-Forward orientation. As I believe you already know, during paired-end sequencing, paired reads (R1 and R2) can orient themselves in different ways: FR, RR, and even in Tandem.

Sometimes it is necessary to evaluate these proportions, keeping in mind that in the case of a standard paired-end sequencing approach, it is common to have more pairs in FR orientation compared to RF-oriented reads, which may be seen as artifacts. Instead, in the case of a mate-pair library, the situation is entirely opposite. Therefore, make sure you know the sequencing approach used and then evaluate the presence of unexpectedly oriented reads.

Post Alignment QC

After alignment, a thorough evaluation of data quality is crucial. Picard CollectMultipleMetrics is a versatile tool that generates a variety of metrics not only to assess the quality of reads and library but also to assess the quality of alignment to the reference genome.
Alignment Summary Metrics, provides general statistics about read alignment.
Key Metrics:

  • TOTAL_READS: Total number of reads processed.
  • PF_READS_ALIGNED: Number of reads successfully aligned. (PF stands for Pass Filter, referring to reads that meet quality criteria established during sequencing, such as Illumina's software).
  • PERCENT_ALIGNED: Percentage of reads aligned to the genome (note: not all aligned reads pass the quality filter).
  • PF_MISMATCH_RATE: Rate of mismatches in aligned reads. (A mismatch refers to a discrepancy between a read's sequence and the reference genome sequence at a specific position.)
  • PF_INDEL_RATE: Frequency of insertions and deletions (indels) in reads.
  • STRAND_BALANCE: Balance between reads mapped to the two strands. (Values far from 0.5 may indicate sequencing bias.)

Interpretation:

High alignment percentages (>80%) and low mismatch/indel rates indicate high-quality data. An imbalanced strand ratio may suggest library bias or technical issues.

Post Peak Calling QC

Now we have reached the final level of quality control.

  • Promoter/Transcript Body Score (PT Score)
    This score measures how much of the signal (mapped reads) comes from promoter regions. It is calculated as the ratio of coverage at promoters to coverage at transcript bodies. Ideally, the majority of the ratios (calculated for all annotated promoters and transcript bodies in the reference genome) should lean toward higher promoter-derived read counts (5–10 or higher).\

  • Nucleosome-Free Region Score (NFR Score)
    The NFR score quantifies the signal originating from nucleosome-free regions. It is calculated by comparing coverage at TSS regions to the coverage at flanking regions.

In addition to the nucleosome-free region score, we can evaluate the quality of the signal coming from nucleosome-free and mononucleosomal regions, as these are of greatest interest because they are the regions where we can most frequently find structures of interest such as promoters, enhancers/insulators, TSS, etc., through a heatmap like the one shown below. Generally, we expect to find a strong alignment signal near zero (the point where the TSS is located) in the case of NFR fragments and an alignment signal in the regions flanking zero in the case of mononucleosomal fragments, and thus a situation that conforms to the following occupational model:

Trends that deviate from these indicate problems related to the action of Tn5, such as in the case where a sample may have been over-transposed by Tn5.

If we want to be even more meticulous, there is an additional visualization that I find very important, which provides the same information. Specifically, we can visualize the distribution of the signal of reads in regions associated with NFR fragments and those associated with nucleosome-bound fragments around the TSS as follows:

Abnormal situations are those in which the trend of the nucleosome-free signal (solid black line) and the nucleosome-bound signal (dashed red line) differ significantly and do not appear to be inversely correlated. In these cases, a problem in the action of Tn5 is hypothesized.

  • Transcription Start Site Enrichment Score (TSS Enrichment Score)
    This score indicates how much signal maps to TSS regions. It is calculated as the ratio of reads mapping to TSS to those mapping to flanking regions.
    For each annotated TSS, a parent bin of +1,000 bp and -1,000 bp around the TSS (total of 2 kbp) is defined. This parent bin is subdivided into 100 bp child bins, and the signal (number of mapped reads) is calculated for each child bin. The ratio of the maximum signal bin to the average signal bin is computed for each TSS, as illustrated below.

After calculating the TSS enrichment score for each annotated TSS, the values are aggregated across all TSS within a 2 kbp window. The resulting plot shows:

X-axis: Distance from the TSS (center of the 2 kbp window).
Y-axis: Average aggregated TSS enrichment score per position in the bin.

The difference between the visualization shown above and that related to the NFR analysis is that here we are evaluating the signal around the TSS globally, without distinguishing the origin of that signal, whether nucleosome-free or bound. Obviously, this visualization may seem redundant, but it is always better to evaluate it anyway.

ATACseq data analysis pipeline


Now that we've covered the experimental (wet-lab) side of ATAC-seq and its associated quality controls, we’re ready to look at the bioinformatic (dry) analysis. In this link, you’ll find my personal basic pipeline for processing ATAC-seq data. Please note that it does not yet include motif enrichment analysis or integration with other omics data. I hope you find it useful!


References


Message for you

Hello everyone,

It’s amazing to think that four years have already passed since this blog first came to life—a virtual space born from my passion for bioinformatics and steadily growing along with the community of readers who follow me. As a self-taught enthusiast, I’ve always viewed this space as a training ground for my curiosity: here, I share what I learn, engage with you, and keep pushing my limits by studying, experimenting, and challenging myself.

My greatest reward is knowing that, day by day, more and more people find inspiration, resources, and motivation in my content. I’ve never claimed to be a “great bioinformatician”—and I probably never will—because this field is an endless journey toward new knowledge, new challenges, and fresh possibilities.

However, running this blog requires time, commitment, and some expenses. That’s why I’ve decided to open the option to leave a small tip via PayPal. It’s a gesture that would help me cover the costs of maintaining the blog, devote more time to creating new articles, and—why not—share my work code transparently, providing an even more tangible contribution for anyone eager to learn.

If you’d like, you can freely choose how much to donate. There’s no right or wrong amount—every single bit of help is invaluable in sustaining this independent outreach project. I’ll continue writing with the same passion and dedication as always, happy to learn alongside you and help this wonderful community grow day by day.

Thank you from the bottom of my heart for your support and for the attention you’ve always given me. The journey continues, and I hope you’ll stay by my side as we explore new frontiers of knowledge!