1. Introduction
Transcription factors (TFs) are crucial proteins with sequence-specific DNA binding activity allowing them to regulate the transcription of target genes. Regulatory elements that have functional TF binding activity are called TF binding sites (TFBSs). Common patterns in DNA that are thought to have this activity for certain TFs are called motifs. Currently, the most popular straightforward approach to deduce the context specificity of TFBS motifs for particular tissue-/cell line/stage-specific conditions
in vivo is the chromatin immunoprecipitation (ChIP) based high throughput experiment ChIP-seq producing hundreds of binding loci (or peaks) for a given target TF [
1,
2]. Although this breakthrough approach has been widely used over the past 15 years [
3], it still remains quite expensive and can be technically challenging for many TFs. To overcome these issues, and to map TFBS motifs in whole genomes directly,
in vitro technology DAP-seq has been developed [
4,
5].
A stable framework for the systematic massive analysis of TFBS is the expandable hierarchical classification of TFs according to the structure of their DNA-binding domains (DBDs) (TFClass database, [
6,
7,
8,
9]). This framework has been consistently applied for mammals [
7,
8,
9], and then for other eukaryotic taxa [
10], including plants (Plant-TFClass database, [
11]). The first level of the hierarchy consists of the nine distinct superclasses of TFs. They are distinguished according to the general topology of DBDs. At the second level, TF classes imply structural and sequence similarities in the DBDs of TFs. The third level of TF families relies on sequence similarities in the DBDs of TFs. For many lineages of eukaryotes, structurally similar TFs often are very conservative [
12]. To date, no new superclasses have been identified in plant TFs compared to those found previously in mammals, and numerous TF classes are common to plants and mammals [
11]. Here and below, we use the notations from the TFClass database [
6,
7,
8,
9]. Namely, for a superclass, the digit from 0 to 9 in curly brackets denotes its ordinal number, and for a class two digits separated by a dot mean the corresponding number of the parent superclass and the ordinal number of the class in the superclass. For instance, the first superclass Basic domain {1} contains the first class Basic leucine zipper factors (bZIP) {1.1}.
In eukaryotes, TFs generally function as part of multiprotein complexes. The tissue- and developmental stage-specific regulation of gene transcription is largely achieved through the inherently combinatorial binding and functions of multiple TFs [
13,
14,
15]. Therefore, a mixture of binding sites assigned to either a target TF or to possible collaborative TFs and, moreover, possibly having various affinity, is critical to explain TF binding specificity
in vivo [
15,
16]. Consequently, the ChIP-seq technology not only can find the major context specific pattern of a target TF, which is encoded as its most enriched motif; in addition, the analysis of co-occurring motifs can shed light on the cooperation of a target TF and possible collaborative TFs specific for given
in vivo conditions. For example, the propensity to interact with closed chromatin is a particular feature of a special group of TFs, the pioneer TF [
17,
18,
19,
20]. Hence, pioneer and other TFs lacking pioneer function should bind cooperatively to DNA.
To unravel the intricate structure of the motifs grammar in the regulatory regions of genes, the term composite element (CE) is used as a simplest unit of the second level hierarchy above the first level of individual TFBS motifs. A CE is a pair of motifs of two TFs that co-occur more often than expected by chance close to each other in genomic DNA. Conventionally, three attributes of CEs have been considered [
21,
22,
23,
24,
25]. Since ChIP-seq or other similar data have been used to detect and analyze CEs, each СE contains at least one motif of a target TF from the corresponding ChIP-seq experiment (Anchor motif). The second motif of CE represents either the same or distinct motif model. Therefore, the first attribute of CE defines it either as homotypic or heterotypic CEs. The former corresponds to two identical or structurally similar TFs, the latter implies two structurally distinct TFs, representing a pair of Anchor and Partner motifs. The second and third attributes are the reciprocal orientation and positioning of motifs. The orientation types include the head-to-tail case (Direct) with both motifs in the same DNA strand; the opposite strands of two motifs define the tail-to-tail (Invert) and head-to-head (Evert) cases. The positioning types discern CEs with an overlap of motifs and with a spacer. Finally, we earlier proposed the fourth attribute of CE [
26], the presence or absence of systematic difference in the conservation of motifs within pairs, i.e. all CEs are divided into asymmetric and symmetric ones.
Numerous large-scale experimental and theoretical analyses have indicated that besides the evident exception of the class C2H2 zinc finger factors {2.3}, paralogous TFs from the same classes almost always have very similar DNA sequence preferences [
10,
12,
16,
27,
28,
29,
30]. Therefore, in the present study we categorized homotypic CEs into classes of target TFs.
We recently proposed the Motifs Co-Occurrence Tool (MCOT) package for CEs prediction in ChIP-seq data [
26,
31,
32]. Two particular features distinguish this approach from other similar tools [
21,
22,
23,
24,
25]. First, it uses a single ChIP-seq dataset and detects both CEs with a spacer and with an overlap of motifs. Second, it applies several recognition thresholds for each motif within a CE; consequently, it differentiate asymmetric and symmetric CEs. For heterotypic CEs of Anchor and Partner TFs two motifs are certainly distinct, either an Anchor TF or Partner TF may have a more conservative motif. These two types of CEs are predominantly asymmetric. We have shown previously that CEs with more conserved Partner motifs are substantially more abundant compared to those with more conserved Anchor motifs; this phenomenon have been typical for the majority of tested target TFs [
31]. Yet no one has tested whether the phenomenon of asymmetric CEs is also common for homotypic CEs.
The current study focuses on a massive analysis of ChIP-seq data to predict homotypic CEs with spacers and to test whether target TFs of certain classes show specificity to significant symmetry or asymmetry of motif conservation in homotypic CEs, or whether the real data show no significance in either direction. Notably, TFs from some classes, such as Basic leucine zipper factors (bZIP) {1.1} and Basic helix-loop-helix factors (bHLH) {1.2}, do not function as monomers. E.g., the degenerate motif of bHLH TFs, E-box CANNTG, already represents a TF dimer. However, pairs of E-box motifs implying a TF tetramer can form homotypic CEs [
33]. There are no common rules even for the largest TF classes. For example, TFs of the NR (Nuclear receptors with C4 zinc fingers {2.1}) class, can function either as monomers, dimers, or multimers of several units arranged with various overlaps or spacers of different lengths [
34,
35]. Thus, CEs in DNA may correspond either TF dimers, or multimers with a greater number of subunits. To avoid issues with distinction between structurally variated homotypic CEs with an overlap of motifs, in this study we focused only on CEs with a spacer. We compiled for analysis the three benchmark collections: ChIP-seq data for
M. musculus and
A. thaliana, and DAP-seq data for
A. thaliana.
Overall, our analysis showed that TFs from all classes tend to avoid significantly enriched symmetric homotypic CEs. Furthermore, we showed that the enrichment of the homotypic asymmetric CEs depends on the structural type of the DBD of a target TF. Among the large TF classes common for three benchmark collections, the classes of bHLH and bZIP TFs showed the highest significance of asymmetry within homotypic CEs. Thus, we have shown that different structural types of DBDs of target TFs are encoded in homotypic CEs that differ in the significance of enrichment of the asymmetry of motif conservation within their pairs.
3. Discussion
Prediction of TFBS motifs is an important step in deciphering of the nucleotide context responsible for transcription regulation. The next hierarchical level of gene regulatory regions above the level of individual TFBS motifs represents CEs as stable, recurring and overrepresented compared to a random expectation co-occurred pairs of TFBS motifs. The importance of this level is due to the basic cooperative nature of the action of TFs as major regulators of gene transcription.
Since the beginning of the next generation sequencing era, and, in particular, the massive application of ChIP-seq technology, many novel approaches have been proposed and consistently applied for CEs prediction [
21,
22,
23,
24,
25]. To support these researches, we have continuously developed our approach MCOT [
26,
31,
32]. So far, no CE prediction tool besides MCOT has been able to detect significant asymmetry, or symmetry, or lack of both within CEs. However, in the previous studies [
26,
31] we analyzed the asymmetry of motifs conservation only for heterotypic CEs. In this study, we extended the definition of asymmetric CEs to homotypic CEs (
Figure 1C). This allowed massive analysis of the benchmark collections of ChIP-seq and DAP-seq data and, in addition, revealed clear differences in the structure of homotypic CEs for target TFs from various TF classes.
To start our analysis, we chose the TF ARF5, a well-known master regulator of plant growth and development [
45], we took in analysis its well-known target gene IAA5 [
46], detected previously as one of top-ranked auxin induced genes [
38]. TF ARF5 belongs to the gene family of ARFs TFs, ARF5 is one of the most important master regulators in plants. It activates genes regulated by the plant hormone auxin, which is important in many processes of growth and development in plants. The IAA5 gene belongs to the gene family Aux/IAA of transcriptional repressors. At low concentrations of auxin, ARF5 occupies its binding sites in the promoters of auxin-responsive genes, but direct interaction of Aux/IAA with ARF5 represses the activity of ARF5. At high concentration of auxin, this direct interaction is disrupted, and ARF5 becomes functional. Such important process in plants, standing at the core of gene network have to be fine-tuned reliably and precisely. Therefore, we chose the homotypic CEs of the ARF5 TF as an example demonstrating the asymmetry in motifs conservation within homotypic CEs. Two highly asymmetric homotypic CEs in the promoter of IAA5 gene support effective regulation of the IAA5 gene under the action of ARF5. Further, the same trend to the asymmetry within homotypic CEs we confirmed for the entire dataset of DAP-seq peaks for ARF5 TF [
4,
39]. Whereas the specific pattern of spacer length distribution in homotypic CEs for ARF5 (
Figure 3A) was noted previously [
4,
37], the asymmetry of motifs conservation within these homotypic CEs has not been identified yet (
Figure 3B). The ARF5 TF belongs to the class B3 {9.*} (
Table S3), the other seven members of this class have substantially less significant enrichment of asymmetric homotypic CEs (
Table S9, p-value < 1-42 for ARF5, the median for the entire B3 class, p-value < 1E-12). This result for the particular DAP-seq dataset for the TF ARF5 motivated the analysis of the three benchmark collections of ChIP-seq and DAP-seq data (see
Section 4.1). Our main goal was to identify the relationship between the structure and functions of target TFs, on the one hand, and the significance of asymmetry, symmetry or neither in homotypic CEs, on the other hand.
Remarkably, our results of massive analysis of the proportion between asymmetric and symmetric homotypic CEs (
Figure 4 and
Figure 5) clearly indicate that TFs with different types of DBD structures, categorized to various TF classes, have specific patterns of asymmetric homotypic CEs. First, symmetric CEs are substantially depleted compared to asymmetric CEs. We detected symmetric CEs only in several datasets in the ChIP-seq data of
M. musculus (
Figure 4A,
Figure S1A and
Figure S3A). Symmetric CEs are completely absent in both benchmark collections of
A. thaliana (
Figure 4B,C,
Figure S1B,C and
Figure S3B,C). Second, the fractions of datasets with significant asymmetry and the fraction with neither significant asymmetry nor significant symmetry compete with each other. Third, the fractions with significant asymmetry and significant symmetry, as well as that with neither, are clearly differ in magnitude for target TFs from various classes. The first two statements are quite expected, they are most likely reflect the widespread cooperative mechanism of action of eukaryotic TFs. The third statement is not so expected, and even slightly striking. Nevertheless, many earlier studies have indicated that this result is quite reasonable.
Possible stereochemical structure underlying both TF-TF and TF-DNA interactions defines the diversity of CE structures. For example, TFs of the bZIP class {1.1} function only as dimers. Their homotypic CEs of TFs do not show any variation in the orientation of motifs, and show only a small change in a spacer length, from 1 to 4 bp [
35]. On the contrary, TFs from the NR class (Nuclear receptors with C4 zinc fingers {2.1)} can function as monomers or dimers, their homotypic CEs show diverse structure with various orientations and spacers [
34,
35] (Merkulov, Nagy). Therefore, a very important prerequisite to the analysis of the homotypic CEs is the propensity of TFs to the homotypic dimerization. This term was defined as the dimerization among members of the same TF class [
47]. This review indicated only three conservative TF classes among human and Arabidopsis with this ability. Namely, TFs from bZIP {1.1}, bHLH {1.2}, and MADS-box factor {5.1} (MADS-box) classes, emerged before the divergence of eukaryotes into plants, fungi and animals. Whereas TFs from the bZIPs and bHLHs classes have undergone independent lineage-specific expansion in plants and animals, TFs of the MADS-box class have done so only in plants, this class is very scarce in mammals [
48]. Besides these three classes, the review [
47] distinguished following clades of TFs with homotypic dimerization propensity: the classes of NR (Nuclear receptors with C4 zinc fingers {2.1}) and STAT (STAT domain factors {6.2}) TFs, and the families of NF-κB and HD-ZIP TFs. The NR class is specific to metazoa [
49]. The class STAT emerged early in the metazoan evolution [
50]. The NF-κB family (NF-kappaB-related factors {6.1.1}) belongs to the class Rel homology region (RHR) factors {6.1}, this family also specific to metazoa. The plant-specific HD-ZIP family belongs to the class Homeo domain factors {3.1} [
11]; this family emerged during the early chlorophyte evolution [
51]
. Taking into account the specific TF classes and families noticed for their dimerization ability [
47], we may conclude that among them the bHLH class shows the highest asymmetry in the conservation of motifs in homotypic CEs with a spacer (
Figure 4,
Figure 5,
Figure S1 and
Figure S3). The structurally related bZIP class shows slightly less significant results, it is still superior to all other classes. The NF-κB family and MADS-box class reveal the moderate significance, whereas the NR and STAT classes achieve a relatively low significance of the asymmetry within homotypic CEs. Conclusion about the superiority of bZIP and bHLH classes holds for the asymmetry ratio thresholds TAR of 1.1 and 1.5 for the
M. musculus ChIP-seq data (
Figure 4A,
Figure 5A,
Figure S1A and
Figure S2A), and for all three thresholds (TAR values of 1.1, 1.5 and 2) for the
A. thaliana ChIP-seq and DAP-seq data (panels B and C in
Figure 4,
Figure 5 and
Figures S1-S4). Curiously, for the
M. musculus ChIP-seq data with a high value TAR of 2, first two ranks belong to other classes known for their dimerization propensity [
47], NR and MADS-box (
Figures S3 and S4). Although this enrichment requires further studies due to the significant differences between pioneer TFs and other TFs only for low and moderate asymmetry thresholds (TAR 1.1 and 1.5,
Figure 6 and
Figure S5A).
Regarding the mechanism underlying the asymmetry of motif conservation within homotypic CEs, an analogy with the mechanism of ternary complex formation from two subunits of dimeric TF and DNA can be pointed out [
52]. Either one of the two subunits initially interacts with DNA independently and then it recruits the second subunit, or the two subunits initially interact with each other and then a TF dimer interacts with DNA. These two options were referred to as monomer and dimer pathways [
52]. This study experimentally studied efficiency of these two pathways on the example of the dimerization of the cFos and cJun subunits. They both contain DBDs of the bZIP class. Their heterodimer AP1 activates transcription of many genes. It was demonstrated that although the dimerization of cFos and cJun occurred rapidly in the absence of DNA, its rate was enhanced in the presence of DNA. Therefore, it was concluded that the monomer pathway is favored. The monomer pathways implied sequential binding of two subunits to DNA. The sequential binding was shown for other bZIP TFs [
53]. For bHLH TFs it was also shown that the monomer pathway showed a faster rate than the dimer pathway [
54]. The sequential binding was approved for TFs with various DBDs, e.g. NR [
55,
56].
Based on the results of our study, we propose that the asymmetry in conservation of motifs in homotypic CEs reflects a sequential mode of cooperative binding of TFs from certain classes. Since cooperative action of closely related TFs belonging to the same classes is deciphered as homotypic CEs in genomic DNA, and we detected the strong depletion of symmetric CEs for all classes of TFs, the sequential mechanism of formation of multiprotein complexes on DNA is ubiquitous. The nucleotide context of homotypic CEs of TFs from bZIP and bHLH classes indicates that they are more prone to sequential binding than TFs of other classes.
The propensity to recruit cooperating partner TFs is a well-known fundamental property of TFs as principal gene transcription regulators. We suppose that the significant enrichment of asymmetric homotypic CEs for TFs of any class shows ability of all TFs as basic transcription regulators to recruit collaborative TFs. Obviously, among the diversity of TFs, pioneer TFs are exactly the first to demand this recruiting potential. The significant difference in asymmetry within homotypic CEs between pioneer TFs and TFs lacking pioneer function suggests that asymmetry within CEs signifies the ability of TFs to recruit their partners to contiguous regions of genomic DNA (
Figure 6,
Figure S5 and
Figure S6). Although the higher recruitment capacity of pioneer TFs seems obvious, trying to decipher it as a part of the genomic regulatory code seems to be an open challenge.
Another explanation of weaker binding sites co-occurred with the stronger sites called the weaker ones as ‘traps’ [
34] or ‘antenna’ [
57]. These weaker sites operate as attractive intermediate elements providing a traffic of TF molecules towards binding sites of higher affinity. We recently combined the traditional motif model PWM neglecting the dependencies of various positions in motifs, and the alternative motif models BaMM and SiteGA allowing such dependencies in massive analysis of ChIP-seq data [
43]. We found that the PWM model was successive in prediction of sites of high conservation, whereas both alternative models efficiently complements the predictions of PWM at the recognition thresholds of sites with low conservation. Thus, we hope that further application of alternative motif models may clarify the nucleotide context of weaker motifs from asymmetric homotypic CEs.