Introduction
Alterations in chromatin structure and function are hallmark features of normal cell fate decisions, as well as pathological processes [
1,
2]. As such, understanding the epigenetic features that regulate chromatin states is essential to develop next-generation biomarkers and therapeutics. The chromatin landscape can be defined by the localization of histone post-translational modifications (PTMs) and chromatin associated proteins (CAPs), which together form a complex molecular language to govern genome transactions [
3]. Indeed, gene expression patterns are controlled by the interplay of distinct genomic regions (
e.g., promoters, enhancers, heterochromatin) marked by histone PTMs and engaged by chromatin regulatory complexes (
e.g., nucleosome remodelers and modifiers) that in turn modulate local genome accessibility [
4,
5,
6,
7] (
Figure 1).
Nucleosomes are the basic repeating unit of chromatin, comprising ~147 bp of DNA wrapped around a histone octamer [
8]. ‘Accessible’ or ‘open’ chromatin is conceptually defined as genomic regions containing stretches of free DNA longer than the average linker length between adjacent nucleosomes (~40 bp in human cells) [
9,
10,
11]. These open chromatin regions are commonly referred to as nucleosome depleted/free regions (NDR/NFR; hereafter NDRs), reflecting dynamic nucleosome turnover and the spectrum of accessibility in population-based assays [
12,
13,
14]. NDRs are characterized by relatively long free DNA stretches (~120-200 bp), over-represented in enhancers/promoters, bound by transcription factors (TFs), and positively correlated with transcriptional activity [
15,
16].
A variety of experimental strategies have been developed to map accessible chromatin at genome scale. Historically, treatment of chromatin with the nuclease DNase I followed by primer extension identified hypersensitive cleavage sites [
17,
18], representing the NDRs (reviewed in [
19,
20,
21]). Since their commercial release, high-throughput sequencing technologies (also known as next-generation sequencing (NGS): Roche 454 pyrosequencing in 2005 and Illumina (formerly Solexa) in 2006) have revolutionized genome-scale studies by delivering ever-increasing amounts of sequence data at ever-decreasing cost. The first assay to take advantage of NGS for open chromatin mapping was DNase-seq in human CD4+ T cells [
22], and shortly thereafter MNase-seq to map nucleosome positioning (an indirect approach: see below) in budding yeast [
23] (
Figure 2A,D). 2013 saw the first report of ATAC-seq (Assay for Transposase-Accessible Chromatin using sequencing), a Tn5-based assay that rapidly became established as the most frequently used chromatin profiling assay (
Figure 2B,D).
Whole genome-scale chromatin accessibility assays have delivered breakthrough insights in diverse fields [
6,
24,
25,
26]. However, their ability to fully advance clinical research has been hampered by incompatibility with formalin-fixed paraffin-embedded (FFPE) tissue. FFPE is the routine method for preservation of clinical samples, with >20 million banked samples in the United States alone [
27]. This material can be stored for years at ambient temperatures with minimal degradation of cytoarchitecture and proteomic content [
28], making these banked tissues a potential goldmine for clinical research, especially for rare diseases and longitudinal studies. The first FFPE specimen was reported nearly 130 years ago [
29], transforming the face of clinical research and enabling retrospective studies decades after initial tissue preservation [
30]. Compared to the analysis of native cells, genomic mapping in FFPE tissue presents a number of unique challenges that require specific protocol modifications and considerations [
31]. As an example, sample processing protocols must be implemented to extract biological material from the paraffin matrix and expose cross-linked chromatin epitopes. However, the primary challenge is genome quality, as FFPE sample processing induces significant DNA adducts and fragmentation [
32]. Further, DNA continues to degrade while in storage, increasing the challenge when analyzing older FFPE specimens.
In recent years, there has been increasing interest in chromatin accessibility studies of FFPE tissue, and thus mining this potentially rich data seam (
Figure 2C,D). The goal of this review is to discuss the most common approaches for mapping NDRs (
Figure 3), their suitability for profiling FFPE samples, and computational strategies to analyse the resulting data (
Figure 4).
Genome-wide profiling of open chromatin
The most common approaches to map NDRs leverage nucleases, a transposase, a nickase, or the biochemical fractionation of chromatin (
Figure 3). For the enzyme-based methods, catalytic properties, molecular size, and potential steric hindrances all influence the resulting open chromatin maps.
DNase I hypersensitivity mapping paved the way for genome-wide open chromatin profiling
Deoxyribonuclease I (DNase I) endonuclease (31 kilodaltons, kDa) specifically degrades double- and single-stranded DNA to products with 5’-phosphate and 3’-hydroxyl termini [
33], and preferentially cleaves accessible chromatin
in situ at eponymously named DNase I hypersensitive sites (DHSs). In a typical DNase-seq experiment (
Figure 3A and
Table 1), several million cells are digested to yield DHS subnucleosomal fragments, which are then identified by library preparation and NGS data analysis [
19]. DNase I has proven an excellent molecular tool for studying chromatin structure for over three decades, most notably in work by the ENCODE consortium [
15,
21]. In a pioneering study, ~14,000 DHSs were mapped in primary CD4+ T cells, and ~90% were shown to be shared across multiple cell types [
22]. Although DNase I was originally thought to lack intrinsic sequence bias, a recent systematic study showed that it exhibits a C/G preference at the 5’-end of DHSs [
34,
35], and several DNase-seq data analysis pipelines now correct for this bias [
36,
37,
38,
39]. In an effort to map DHSs in FFPE tissue, a more sensitive DNase-seq strategy was developed using a circular carrier DNA-mediated sequencing method (Pico-Seq) [
40,
41]. However, despite a proof-of-concept study in human follicular thyroid carcinoma, the field has not adopted DNase-seq approaches for FFPE samples (a PubMed search for “DNAse AND FFPE” returns only two related entries).
Micrococcal nuclease (MNase) digestion to decipher nucleosome positioning
Staphylococcus aureus MNase has been used to study chromatin for nearly five decades [
42], and employed in the NGS era to map genome-wide nucleosome positioning for multiple eukaryotes (
e.g., yeast, worm, fly, mouse and human) [
9,
10,
11,
43,
44]. The enzyme is a small (17 kDa), highly processive endo-exonuclease that degrades most types and forms of nucleic acids (
e.g., supercoiled, linear and circular single-stranded and double-stranded DNA and RNA) [
33]. These properties enable it to thoroughly digest chromatin until protected by nucleosome structure, cleaving both NDRs and linker DNA. As such, MNase-seq is distinct from other NDR mapping approaches since it enriches for protected DNA (
i.e., nucleosome occupancy and position), and open chromatin is then inferred from low signal regions (
Figure 3B and
Table 1). MNase shows a strong sequence bias toward A/T rich sequences, and a correction factor is thus built into many data analysis pipelines [
38,
45]. Recent efforts using MNase to map chromatin accessibility have focused on limiting the MNase digestion [
46,
47,
48], although these titration-based variants have largely been superseded by competing direct NDR-mapping methods (
Figure 3 and
Table 1). Applying MNase-seq to FFPE tissue sections has met with minimal success, with a PubMed search for “MNase AND FFPE” returning zero entries.
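The indirect logic of MNase-seq described above (open chromatin inferred from regions of low nucleosome signal) can be illustrated with a minimal sketch. It assumes fragments are given as 0-based, half-open (start, end) intervals on a single contig; the function names and the coverage and length thresholds are hypothetical choices for illustration, and the sketch does not correct for MNase sequence bias as dedicated pipelines do.

```python
def coverage(fragments, length):
    """Per-base coverage from (start, end) fragments (0-based, half-open)."""
    diff = [0] * (length + 1)
    for start, end in fragments:
        s = min(max(start, 0), length)  # clamp to the contig
        e = min(max(end, 0), length)
        if s < e:
            diff[s] += 1
            diff[e] -= 1
    cov, running = [], 0
    for delta in diff[:length]:
        running += delta
        cov.append(running)
    return cov


def low_signal_regions(cov, max_cov=0, min_len=1):
    """Return (start, end) runs where coverage <= max_cov for at least min_len bases."""
    regions, run_start = [], None
    # A sentinel above the threshold closes any run that reaches the contig end.
    for pos, value in enumerate(cov + [max_cov + 1]):
        if value <= max_cov:
            if run_start is None:
                run_start = pos
        elif run_start is not None:
            if pos - run_start >= min_len:
                regions.append((run_start, pos))
            run_start = None
    return regions


if __name__ == "__main__":
    # Toy nucleosome-protected fragments on a 20 bp contig (illustrative values only).
    toy_fragments = [(0, 5), (3, 8), (15, 20)]
    cov = coverage(toy_fragments, 20)
    print(low_signal_regions(cov, max_cov=0, min_len=5))
```

In this toy case the only uncovered stretch of at least five bases is reported as a candidate open region; with real data the thresholds would need to be tuned to digestion level and sequencing depth.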
FAIRE-seq identifies accessible chromatin regions through principles of biochemical separation and solubility
In contrast to nuclease-based methods to map chromatin accessibility, FAIRE (formaldehyde-assisted isolation of regulatory elements) identifies NDRs by building on the observation that transcriptionally active chromatin displays differential biochemical solubility after formaldehyde fixation [
49]. In brief, cells are treated with formaldehyde to crosslink chromatin, sheared by sonication, and phenol-chloroform extracted, where the aqueous phase contains DNA fragments associated with NDRs (
Figure 3B) [
50]. While FAIRE-seq does not have the sequence-specific cleavage bias of nucleases [
34,
45], it is highly dependent on crosslinking efficiency, and often undermined by poor signal-to-noise [
50,
51], false positives [
52], and the challenge posed by low cell numbers [
53] (
Table 1). Nevertheless, FAIRE-seq has been widely applied to model systems and cell lines [
50,
51,
53,
54,
55], particularly as part of ENCODE efforts to systematically identify active regulatory elements [
15]. A recent report mapping open chromatin by FAIRE-seq in
Drosophila cleverly circumvented the challenge that pupa cuticles present to
in situ enzyme-based methods, thus providing higher quality data than ATAC-seq for this particular tissue type [
56]. Despite this, over the last decade ATAC-seq has clearly emerged as the preferred assay for mapping chromatin accessibility while FAIRE has declined in use (
Figure 2A,B).
Tn5 transposon tagmentation of accessible chromatin (ATAC-seq)
Tn5 transposase was first discovered in the 1970s based on the kanamycin resistance it conferred to host bacteria [
57,
58]. In addition to providing a mechanistic model for transposases, Tn5 (106 kDa active dimer) has proven an invaluable molecular tool [
59]. Most recently it has been leveraged to identify NDRs via ATAC-seq, and to map histone PTMs via CUT&Tag [
60,
61]. ATAC-seq is currently the most widely used open chromatin mapping assay (
Figure 2B) due to its relative speed, efficiency, and sensitivity (
Table 1). The approach employs a genetically engineered hyperactive Tn5 transposase to insert loaded DNA adapters preferentially at accessible DNA
in situ (
i.e., tagmentation) for direct PCR amplification and NGS [
60,
62] (
Figure 3D). Tn5 displays an enzymatic sequence bias which, while more complex than that of the nucleases used for NDR mapping, can also be compensated for during data analysis [
38,
63]. With deep enough sequencing, TF binding footprints may also be inferred from protected fragments within the NDRs [
39,
64]. Early versions of the ATAC-seq protocol were hampered by high read duplication rates and contaminating mitochondrial DNA, which together consumed a majority of the sequencing bandwidth. These issues were largely circumvented by the development of Omni-ATAC, wherein nuclei are isolated with a cocktail of detergents to remove contaminating mitochondria, increasing library complexity and signal-to-noise [
65,
66].
Beyond the application of ATAC to interrogate the epigenomes of model organisms and cell lines, recent efforts have sought to enable clinical studies from FFPE tissue sections [
67,
68,
69,
70,
71]. To prepare amplicons from Tn5-based approaches, two independent tagmentation events are required in opposing orientation and in close proximity (<~700 bp). This inherently reduces library efficiency and effective yields, which is exacerbated by the highly damaged and fragmented DNA in FFPE material. To address this complication, a recent approach combined Tn5 tagmentation with an
in vitro transcription (IVT) step, such that a single insertion event can be amplified by T7 RNA polymerase [
68,
69,
70,
71]. While standard ATAC-seq yielded some success using nuclei isolated from mouse FFPE liver and kidney, the Tn5-IVT-modified approach improved library complexity, signal-to-noise, and other key metrics. However, the approach is limited by its complex, five-day procedure that relies on harsh chemical, mechanical, and enzymatic methods (
e.g., xylene, needle shearing, and a collagenase/hyaluronidase cocktail) to extract nuclei from FFPE tissue. Indeed, studies have shown that such exacting preparation methods contribute to genome fragmentation [
72,
73].
Noting these observations, Henikoff and colleagues instead used gentle heating and permeabilization, similar to how FFPE sections are routinely deparaffinized for histological analysis, to prepare samples for CUT&Tag [
67]. NDRs facilitate access to DNA by the transcriptional machinery, so the CUTAC (Cleavage Under Targeted Accessible Chromatin) protocol was developed to target Tn5 to active chromatin via RNA Pol II, and yields short tagmentation fragments (~60 bp) to reduce the impact of DNA damage in such samples. Of note, FFPE-CUTAC yielded higher quality data than FFPE-ATAC from mouse brain [
67], suggesting the potential of innovative Tn5-based approaches to map the epigenome of clinically relevant FFPE samples.
Nicking enzyme assisted accessible chromatin sequencing (NicE-seq)
NicE-seq is a recent approach (
Figure 2A,B) that excels at NDR profiling from heavily fixed cells, including FFPE [
74,
75,
76,
77]. In contrast to nuclease or Tn5 transposase double-strand cleavage (as above),
Chlorella virus Nt.CviPII (63 kDa) is a nicking endonuclease that cuts only one strand of double-stranded DNA at CCD sites (D=A/G/T), which occur by chance every ~21 bases [
78]. In the latest version of the protocol (One-pot UniNicE-seq), Nt.CviPII nicks at NDRs are filled in using a dNTP mix containing biotinylated dCTP and 5-methyl-dCTP (to respectively label the DNA and prevent further nicking), the genome is enzymatically sheared, biotin-labelled DNA is captured on streptavidin beads, and libraries are prepared on the matrix by PCR (
Figure 3E). The method is a fast, simple, and robust one-tube workflow, although it is incompatible with native cells and yields larger DNA fragment sizes than ATAC-seq, which limits resolution. NicE-seq has now been applied to a wide variety of mouse and human cell lines, primary tissues, and FFPE sections [
75,
76]. Central to the theme of this review, the approach identified NDRs in human lung and liver FFPE samples from as few as five thousand cells (
Table 1) [
75,
76]. Similar to FFPE-CUTAC, NicE-seq can be performed on permeabilized and minimally disrupted FFPE sections
in situ, obviating the need for nuclei purification by harsh methods that damage genomic DNA. As such, the approach shows great potential for broad adoption to map open chromatin in clinical FFPE material.
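The ~21-base spacing quoted for Nt.CviPII CCD sites can be reproduced with a short calculation. The sketch below assumes equal, independent base frequencies on a single strand (real genomes deviate from this), and the function names are illustrative only.

```python
import random


def expected_ccd_spacing() -> float:
    """Expected spacing between CCD sites on one strand under a uniform-base model."""
    p_site = (1 / 4) * (1 / 4) * (3 / 4)  # P(C) * P(C) * P(D), where D = A/G/T
    return 1 / p_site


def count_ccd_sites(seq: str) -> int:
    """Count CCD occurrences (overlaps allowed) on the given strand."""
    seq = seq.upper()
    return sum(
        1
        for i in range(len(seq) - 2)
        if seq[i] == "C" and seq[i + 1] == "C" and seq[i + 2] in "AGT"
    )


if __name__ == "__main__":
    print(round(expected_ccd_spacing(), 1))  # 64 / 3, i.e. about 21.3 bases

    # Empirical check on a simulated uniform random sequence (fixed seed, toy data).
    rng = random.Random(0)
    simulated = "".join(rng.choice("ACGT") for _ in range(100_000))
    print(len(simulated) / count_ccd_sites(simulated))
```

Under this model the site probability is 3/64 per position, so the expected spacing is 64/3 bases on one strand; sites on the complementary strand would need to be counted separately.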
Data analysis
ATAC-seq is the most popular open chromatin profiling approach (
Figure 2), and several excellent papers describe methods to analyse the resulting data [
79,
80,
81]. Instead, this review aims to provide a brief overview of key considerations and data analysis pipelines broadly applicable to ATAC-seq, FAIRE-seq and NicE-seq (
Figure 4). Because of their distinct data structure, DNase-seq and MNase-seq require application-specific pipelines [
79]. For example, open chromatin from MNase-seq data is inferred from nucleosome-centric maps, and so central considerations are to identify the nucleosome dyad, account for MNase sequence cleavage bias, and to quantify nucleosome occupancy and ‘fuzziness’ [
23,
46,
48]. The major features of a typical ATAC-seq pipeline involve: 1) read pre-processing and quality control (QC); 2) primary analysis (read alignment and filtering); and 3) secondary analysis (peak-calling, visualization, reproducibility and differential accessibility). Paired-end (PE) sequencing is highly recommended because it provides the DNA fragment length: an important metric for assay success and interpretation. A sequencing depth of 30-50M PE reads is usually sufficient for good genome coverage, but this will depend on how much bandwidth is consumed by mitochondrial DNA contamination and read duplicates.
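To make the dependence of effective depth on mitochondrial contamination and duplication concrete, a back-of-the-envelope sketch follows. It assumes duplicates are counted among the non-mitochondrial read pairs only; the function name and the example fractions are hypothetical and not taken from any dataset.

```python
def usable_read_pairs(total_pairs: int, mito_fraction: float, duplicate_fraction: float) -> int:
    """Estimate read pairs remaining after discarding mitochondrial and duplicate pairs.

    Assumes duplicate_fraction applies to the non-mitochondrial pairs only.
    """
    if not (0 <= mito_fraction <= 1 and 0 <= duplicate_fraction <= 1):
        raise ValueError("fractions must be between 0 and 1")
    nuclear_pairs = total_pairs * (1 - mito_fraction)
    return round(nuclear_pairs * (1 - duplicate_fraction))


if __name__ == "__main__":
    # Illustrative library: 40M pairs, 25% mitochondrial, 50% duplicates.
    print(usable_read_pairs(40_000_000, 0.25, 0.5))  # 15000000
```

With these illustrative fractions, fewer than half of the sequenced pairs remain informative, which is why the nominal depth target is best judged after filtering.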
Read pre-processing and quality control
Prior to alignment, several tools are used to assess the quality of the library and sequencing run. FastQC reports the base calling quality and overrepresented sequences, such as primer- and adapter-dimers. Low base calling scores (Q-score <20) may be indicative of a poor quality library and/or sequencing run. Overrepresented primer- and adapter-dimers are not as problematic in Tn5-based libraries as in ligation-mediated PCR libraries. If the accessible DNA fragment is shorter than the read length, sequencing will read through into the Illumina adapter regions. This readthrough can negatively impact genome alignment, so read trimming tools (
e.g., Trimmomatic [
82]) are used to detect and prune Illumina adapter sequences.
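As a small illustration of the Q-score threshold mentioned above, the following sketch decodes a FASTQ quality string and reports the fraction of bases below Q20. It assumes the standard Phred+33 encoding; the function names and the example quality string are hypothetical, and in practice this check is performed by FastQC rather than by hand.

```python
def phred_scores(quality: str, offset: int = 33) -> list:
    """Decode a FASTQ quality string into per-base Phred scores (Phred+33 by default)."""
    return [ord(ch) - offset for ch in quality]


def fraction_low_quality(quality: str, threshold: int = 20) -> float:
    """Fraction of bases whose Phred score is strictly below the threshold."""
    scores = phred_scores(quality)
    if not scores:
        return 0.0
    return sum(score < threshold for score in scores) / len(scores)


if __name__ == "__main__":
    example_quality = "IIIIIIII5555++++"  # toy string: Q40, Q20 and Q10 bases
    print(phred_scores(example_quality))
    print(fraction_low_quality(example_quality))
```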
Primary analysis pipeline
Genome alignment is typically the most time consuming and computationally intensive step in the primary pipeline, so fast, memory-efficient aligners optimized for short paired-end reads have been developed (
e.g., Bowtie2 [
83]). The goal is to identify the unique genomic location that corresponds to each read pair. However, multi-aligned read pairs are common, and must be flagged/removed from subsequent analysis since they would introduce ambiguity to the data. In addition to removing multi-aligned reads using Samtools [
84], other utilities (
e.g., Bedtools [
85,
86] and Picard (
http://broadinstitute.github.io/picard)) are used for read processing and filtering to remove PCR duplicates, contaminating mitochondrial DNA, and artifactual exclusion list regions [
87].
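The filtering logic described above can be sketched in plain Python. This is not how Samtools, Bedtools, or Picard are invoked; it is a toy model in which each alignment is a simple record, a mapping-quality cutoff stands in for multi-mapper removal (an assumption, with 30 as an illustrative threshold), and the record type, function names, and mitochondrial contig names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    name: str
    chrom: str
    start: int          # 0-based, inclusive
    end: int            # 0-based, exclusive
    mapq: int           # mapping quality reported by the aligner
    is_duplicate: bool  # as flagged by a duplicate-marking tool


def overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """True if two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end


def passes_filters(aln, exclusion, min_mapq=30, mito_names=("chrM", "MT")) -> bool:
    """Keep alignments that are confidently placed, nuclear, unique, and outside exclusion regions."""
    if aln.mapq < min_mapq:       # proxy for ambiguous / multi-aligned reads (assumed cutoff)
        return False
    if aln.chrom in mito_names:   # mitochondrial contamination
        return False
    if aln.is_duplicate:          # PCR duplicates
        return False
    regions = exclusion.get(aln.chrom, [])
    return not any(overlaps(aln.start, aln.end, s, e) for s, e in regions)


if __name__ == "__main__":
    exclusion_list = {"chr1": [(100, 200)]}  # toy exclusion region
    reads = [
        Alignment("r1", "chr1", 500, 560, 42, False),
        Alignment("r2", "chr1", 500, 560, 1, False),
        Alignment("r3", "chrM", 500, 560, 42, False),
        Alignment("r4", "chr1", 150, 210, 42, False),
    ]
    print([r.name for r in reads if passes_filters(r, exclusion_list)])
```

Only the first toy read survives; the others fail on mapping quality, mitochondrial origin, and exclusion-list overlap, respectively.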
Tools for secondary analysis
Identifying and visualizing statistically enriched NDRs enables data interpretation and provides biological insights. Accessible chromatin occurs in relatively narrow regions that can be identified using peak-calling tools, such as MACS2 [
88]. DeepTools2 [
89] is an excellent suite of utilities to assess data reproducibility, generate signal heatmaps, and convert alignment files for visualization in genome browsers, such as Integrative Genomics Viewer [
90]. When two conditions are compared, edgeR is widely used to identify peak locations that display a statistically significant differential signal [
91]. While many additional follow-up analyses can provide further insights to the patterns of open chromatin, they are outside the scope of this paper and we recommend one of the more comprehensive analysis reviews [
79,
80,
81].
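To give a feel for what a differential accessibility comparison measures, the sketch below normalizes per-peak read counts to counts per million (CPM) and computes a log2 fold change between two conditions. This is a deliberately simplified stand-in and not the statistical model used by edgeR (it has no dispersion estimate and yields no p-values); the function names and the pseudocount are assumptions for illustration.

```python
import math


def cpm(counts, library_size):
    """Scale raw per-peak read counts to counts per million mapped reads."""
    return [count * 1e6 / library_size for count in counts]


def log2_fold_change(count_a, count_b, lib_a, lib_b, pseudocount=1.0):
    """log2(condition B / condition A) on CPM values, with a pseudocount to avoid log(0)."""
    a = count_a * 1e6 / lib_a + pseudocount
    b = count_b * 1e6 / lib_b + pseudocount
    return math.log2(b / a)


if __name__ == "__main__":
    # Toy peak: 3 reads in condition A and 7 in condition B, 1M reads per library.
    print(cpm([3, 7], 1_000_000))
    print(log2_fold_change(3, 7, 1_000_000, 1_000_000))  # 1.0
```

A fold change alone does not indicate significance; replicate-aware tools are needed for that step.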
Discussion
Identifying NDRs throughout the genome provides a window into transcriptionally active regions in normal and disease states. Chromatin accessibility profiling has had an enormous impact on basic and pre-clinical research, and there is strong interest in applying it to clinical biopsy specimens in FFPE blocks. The goal of any useful genomics method is to yield the maximum amount of high quality DNA (or RNA) that meets defined QC metrics. However, major challenges associated with FFPE tissue have thus far slowed the application of epigenomic approaches to archived biopsy specimens.
There are three main areas of focus if users are to generate high quality data from FFPE tissues: 1) best practices during clinical tissue preparation and preservation; 2) improved methods of material preparation; and 3) the optimization of genomic assays specifically for FFPE material. The first is largely outside the control of the end-user, though minimizing any delay between tissue harvesting and fixation, and shortening storage times, improve DNA integrity and assay yields [
92]. For the second, two different strategies were used during FFPE tissue preparation for ATAC-seq and CUTAC: nuclei extraction or
in situ permeabilization [
67,
69]. Improved FFPE extraction and preparation methods should increase yields without further compromising genomic integrity. Of note, FFPE repair kits are increasingly available [
93], but their impact on data quality for open chromatin profiling remains to be seen. Finally, lessons may be learned from efforts to develop RNA-seq and ChIP-seq for archived FFPE material [
72,
92,
94]. For example, the crosslinking reversal step is a major source of DNA fragmentation [
92,
93], but this can be mitigated by high concentrations of Tris, which improved yields by three-fold and resulted in longer DNA fragments [
73].
It is a given that the direct analysis of primary tissue provides insights to the development of human disease. Indeed, histopathological analysis of FFPE brain samples has been essential to characterize mechanisms of normal and pathological aging [
95,
96]. The ability to perform comprehensive epigenomic analyses in such samples could provide further understanding of these processes and revolutionize clinical research.