3.1. Development of ROSE and application to the analysis of σ70-dependent promoters in E. coli
The ROSE method was developed based on ROMA [
4], which used DNA microarrays for genome-wide run-off transcription analysis. Accordingly, run-off transcription assays were performed employing the commercially available σ
70-saturated form of
E. coli RNA polymerase (Eσ
70). In contrast to ROMA and to RIViT-seq [
9], template DNA was fragmented to an average size of 6 kb by shearing rather than by restriction enzyme treatment, to avoid bias from the unequal distribution of restriction enzyme recognition sites or from cutting within a promoter. In addition, RNA yield was increased by using a commercially available Tris-HCl buffer (NEB, Ipswich, USA) instead of the potassium glutamate-based buffer system previously used in ROMA (data not shown). Although ROSE and RIViT-seq both aim for a genome-wide
in vitro transcriptome, there are distinct technical differences between the two approaches (See Supplemental
Table S1 for a complete list). The focus of ROSE is the construction of high-quality primary transcript libraries. Therefore, RNA carrying a 5′ di- or monophosphate is digested to enrich unprocessed, primary bacterial transcripts. Moreover, ROSE uses index adapter ligation to reduce sequencing noise, and transcription start sites (TSS) are identified automatically using the software ReadXplorer [
13,
17].
Before sequencing, in vitro transcribed mRNA was subjected to native 5′-end-specific transcript library preparation [
7]. Sequencing on Illumina MiSeq yielded around 2 million reads per library (See Supplemental
Table S2), which were quality-filtered and mapped to the respective reference genome (U00096.3). Three different approaches were tested for the isolation of chromosomal DNA of
E. coli K-12 MG1655. The different isolation methods did not result in notable differences in the quality or the distribution of reads (See Supplemental
Figure S1).
Mapped reads were visualized using the ReadXplorer software [
17], and transcription start site (TSS) detection was performed with the same tool and automatic parameter estimation (See Supplemental
Table S3). The automatic detection identified 3,226 putative TSS that were present in at least four of the six ROSE runs. Depending on their location relative to known genes, the TSS were classified into four categories according to Sharma et al. [
18]: primary TSS (44.6%), intragenic TSS (24.4%), antisense TSS (27.1%) and orphan TSS (3.9%). Primary TSS comprise all TSS located at a suitable distance from, and in the same orientation as, a protein-coding region or a known transcript. Intragenic TSS are located within a coding sequence in sense orientation. Antisense TSS are situated on the opposite strand of a protein-coding region or up to 100 bases upstream or downstream of it (± 100 nt), and orphan TSS do not meet any of these criteria.
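The four categories can be illustrated with a short Python sketch. It is not the ReadXplorer implementation. The `Gene` record, the function name `classify_tss`, and the 300 nt upstream window standing in for "a suitable distance" are assumptions made for illustration. Only the ± 100 nt antisense window is taken from the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Gene:
    start: int   # leftmost coordinate (inclusive)
    end: int     # rightmost coordinate (inclusive)
    strand: str  # "+" or "-"

def classify_tss(pos, strand, genes, max_upstream=300, antisense_flank=100):
    """Return 'primary', 'intragenic', 'antisense' or 'orphan' for one TSS.

    max_upstream is an assumed value for "a suitable distance";
    antisense_flank is the +/-100 nt window described in the text.
    Categories are tested in this order, so a TSS directly at a gene
    start counts as primary rather than intragenic.
    """
    # Primary: same strand, at or upstream of the 5' end of a gene.
    for g in genes:
        if g.strand != strand:
            continue
        dist = g.start - pos if strand == "+" else pos - g.end
        if 0 <= dist <= max_upstream:
            return "primary"
    # Intragenic: same strand, inside a coding sequence.
    for g in genes:
        if g.strand == strand and g.start <= pos <= g.end:
            return "intragenic"
    # Antisense: opposite strand, inside a gene or within the flank.
    for g in genes:
        if (g.strand != strand
                and g.start - antisense_flank <= pos <= g.end + antisense_flank):
            return "antisense"
    return "orphan"
```

For example, with one gene on the forward strand spanning 1000 to 2000, a forward-strand TSS at 950 is classified as primary and a reverse-strand TSS at 1500 as antisense.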
To validate the suitability of the ROSE method for promoter identification, upstream sequences of 50 nt length (positions -1 through -49 relative to the TSS) were extracted for further analysis. All 3,226 putative promoter sequences were subjected to motif enrichment analysis using Improbizer [
14]. Two distinct motifs corresponding to -10 and -35 regions of σ
70-dependent promoters were detected independently (
Figure 1). As expected from previous studies, the -10 region shows a considerably higher level of conservation [
19]. A total of 3,128 putative promoter sequences contained a region similar to the σ
70 -10 consensus.
In contrast, a -35 consensus motif, namely a conserved ttGA about 35 nucleotides upstream of the transcription start site, was derived from 2,922 promoter sequences. A total of 2,838 putative promoter sequences contained both -10 and -35 regions. Only eleven sequences did not resemble either the -10 or the -35 consensus sequence.
It is apparent that Eσ
70 recognizes natural σ
70-dependent promoters in vitro with high specificity and initiates transcription at well-defined nucleotides. Transcription initiation occurred preferentially at purine bases (A/G), which was observed in 81.0% (50.3% A and 30.7% G) of the detected promoters. Interestingly, position -1, directly upstream of the TSS, is preferentially occupied by pyrimidine bases, with 77.3% of the promoters harboring T (41.3%) or C (36.0%) at this position. Both findings agree with in vivo transcriptional profiling studies, which reported similar nucleotide preferences of 78.6% purine bases at +1 and 80.2% pyrimidine bases at -1 [
19,
20].
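The base preferences at positions -1 and +1 are simple frequency counts. The following Python sketch shows one way to compute them. The input format (a two-letter string per promoter, base at -1 followed by base at +1) and the function name `initiation_preferences` are assumptions for illustration, not part of the published pipeline.

```python
from collections import Counter

PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")

def initiation_preferences(dinucleotides):
    """Summarize base usage directly around the TSS.

    dinucleotides: iterable of 2-nt strings, base at -1 followed by the
    base at +1 (hypothetical input format).
    """
    pairs = [d.upper() for d in dinucleotides]
    if not pairs:
        raise ValueError("no sequences given")
    if any(len(p) != 2 for p in pairs):
        raise ValueError("each entry must contain exactly two bases")
    n = len(pairs)
    minus1 = Counter(p[0] for p in pairs)
    plus1 = Counter(p[1] for p in pairs)
    return {
        "minus1": {base: count / n for base, count in minus1.items()},
        "plus1": {base: count / n for base, count in plus1.items()},
        "pyrimidine_at_minus1": sum(minus1[b] for b in PYRIMIDINES) / n,
        "purine_at_plus1": sum(plus1[b] for b in PURINES) / n,
    }
```

Applied to the ROSE promoter set, the two summary values would correspond to the 77.3% and 81.0% reported above.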
3.2. Detailed Promoter Analysis by Comparison to Experimentally Characterized Promoters Listed in RegulonDB
The genome of
Escherichia coli K-12 MG1655 contains 4,146 genes organized in 2,376 transcription units. Of these, 1,523 are monocistronic, whereas 853 are operons comprising more than one gene [
21]. Thus, at least 2,376 primary TSS are expected, with the possible exception of TSS belonging to promoters that need to be activated by factors not contained in the in vitro transcription assay. The RegulonDB database [
22] includes the most comprehensive information regarding the transcriptional regulation of
E. coli, including experimentally determined transcriptional start sites of the strain K-12 MG1655. In a subset of the database, TSS are assigned to the different sigma factors and provided with a level of evidence (Confirmed, Strong, or Weak), depending on the informative value of the method used for TSS identification. For the following comparison, only TSS of the classes “Confirmed” or “Strong” were considered. In addition, to account for different experimental methods of TSS identification and for template properties such as the degree of supercoiling, two TSS positions were treated as matching if they deviated by no more than three nucleotides in either direction. The deviations of the matched TSS show a clear peak, with 64.7% having zero and 7.0% having one nucleotide deviation in either direction (See Supplemental
Figure S2).
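The position comparison with a tolerance of three nucleotides can be sketched as follows. This is an illustrative Python example, not the procedure actually used. The function names and the representation of a TSS as a `(position, strand)` tuple are assumptions. Each query TSS is paired with the nearest reference TSS on the same strand, and a reference TSS may be matched by more than one query TSS.

```python
from bisect import bisect_left
from collections import Counter

def match_tss(query, reference, tolerance=3):
    """Pair each query TSS with the nearest reference TSS on the same strand.

    query, reference: iterables of (position, strand) tuples.
    Returns (query_pos, ref_pos, strand, deviation) for every pair whose
    absolute deviation is at most `tolerance`; deviation = query_pos - ref_pos.
    """
    by_strand = {}
    for pos, strand in reference:
        by_strand.setdefault(strand, []).append(pos)
    for positions in by_strand.values():
        positions.sort()

    matches = []
    for pos, strand in query:
        ref = by_strand.get(strand, [])
        i = bisect_left(ref, pos)
        # Only the neighbours left and right of the insertion point
        # can be the nearest reference position.
        candidates = ref[max(i - 1, 0):i + 1]
        if not candidates:
            continue
        nearest = min(candidates, key=lambda r: abs(pos - r))
        if abs(pos - nearest) <= tolerance:
            matches.append((pos, nearest, strand, pos - nearest))
    return matches

def deviation_histogram(matches):
    """Count matches per absolute deviation (0, 1, 2, 3 nt)."""
    return Counter(abs(m[3]) for m in matches)
```

Dividing the histogram counts by the number of matches gives the kind of deviation distribution summarized in Supplemental Figure S2.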
In RegulonDB, 881 TSS are classified as σ
70-dependent; of these, 352 (40.0%) were also identified in the ROSE-Eσ
70 experiment. A total of 30 TSS found in our ROSE experiment are assigned to other sigma factors in the database with no affiliation to σ
70. Of these TSS, 25 are classified as σ
38- and the other five as σ
32-dependent promoters. However, it is known that the consensus sequence of σ
38-dependent promoters is similar to the σ
70 consensus sequence, and a clear distinction between both promoter sets cannot be made [
23]. Therefore, promoter sequences identified by ROSE-Eσ
70 but listed as σ
38-dependent in RegulonDB were compared to those of σ
38-dependent promoters that ROSE did not recognize. Again, the 50 nt upstream of each TSS were extracted and analyzed for conserved motifs. The resulting motifs differ clearly, mainly in the -10 region. The presumed σ
38-dependent promoters detected by ROSE show conserved bases at all positions from -12 through -7 (TATACT), whereas in the σ
38-dependent promoters not detected by ROSE-Eσ
70 only the bases at -12, -11, and -7 are conserved (TANNNT). Additionally, there is a C at position -13, upstream of the -10 region, described earlier as a distinct sequence characteristic in σ
38-dependent promoters [
24]. Another distinguishable feature of the exclusively σ
38-dependent promoters is a highly conserved GC at positions -33/-32 (ttGC), occurring in most σ
38-dependent promoter sequences, whereas the TT at positions -35/-34 is more strongly conserved in the promoters detected by ROSE-Eσ
70 (
Figure 2).
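A consensus such as TATACT or TANNNT is a per-position summary of aligned hexamers. The Python sketch below shows the idea with a simple majority rule. It is not the Improbizer algorithm, which uses a probabilistic motif model. The function name `consensus`, the 60% threshold, and the example sequences are invented for illustration.

```python
from collections import Counter

def consensus(aligned, min_fraction=0.6):
    """Consensus of equal-length aligned sequences.

    A position is reported as 'N' when its most frequent base occurs in
    fewer than `min_fraction` of the sequences (assumed threshold).
    """
    if not aligned:
        return ""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("sequences must have equal length")
    result = []
    for column in zip(*aligned):
        base, count = Counter(column).most_common(1)[0]
        result.append(base if count / len(aligned) >= min_fraction else "N")
    return "".join(result)

# Invented toy alignments, not data from this study:
conserved = ["TATACT", "TATACT", "TATAAT"]
variable = ["TATACT", "TAGCAT", "TACGGT", "TATTTT"]
print(consensus(conserved))  # TATACT
print(consensus(variable))   # TANNNT
```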
Following the same reasoning, the five presumably false-positive σ32-dependent promoters were compared to 66 σ32-dependent promoters from RegulonDB that ROSE did not detect. Because of the small number of false-positive promoters, no precise consensus sequence could be identified in the -10 region (data not shown). However, the similarity of the -10 region of the false-positive σ38-dependent promoters to that of σ70-dependent promoters suggests that ROSE-Eσ70 falsely identifies these promoters as σ70 promoters, possibly because in vivo regulatory mechanisms are absent from the in vitro ROSE-Eσ70 system.
3.3. Comparison of the ROSE Data to Existing Comprehensive Genome-wide in vivo RNA-Seq Data Sets of E. coli K-12 MG1655
To date, genome-wide transcription start site determination is mainly done by analyzing in vivo transcribed mRNA via approaches like dRNA-Seq [
18,
25]. To assess the sensitivity and selectivity of ROSE, results were compared to a transcriptome study by Thomason et al. [
20] and another high-throughput transcription initiation mapping study included in RegulonDB [
22]. Both studies were conducted on shaking flask cultivations of
Escherichia coli MG1655 in different media. After enriching 5′ triphosphorylated RNA species and high-throughput sequencing, the two studies detected 14,865 TSS and 5,197 TSS, respectively [
20,
22]. Although both studies relied on transcriptome sequencing for TSS identification, their suitability for validating ROSE is limited because no specific sigma factor-promoter interaction can be examined. However, as we performed ROSE using the primary sigma factor σ
70, a reasonable overlap in the detected TSS was expected.
Comparing the three TSS datasets showed that 2,006 (62.2%) of the TSS detected by ROSE were also determined by Thomason
et al., while 168 further TSS were confirmed by the study included in RegulonDB. A set of 755 TSS was contained in all three datasets (
Figure 3). Again, a deviation of up to three nucleotides was allowed when comparing two TSS positions. Here, ROSE-Eσ
70 and RegulonDB exactly matched in 76.0% (±1 bp: 13.9%) of the overlapping TSS, while ROSE-Eσ
70 and Thomason et al. had an exact match at 86.2% (±1 bp: 7.6%) of the TSS (data not shown).
3.4. Transcription Start Sites of Promoters That are Repressed Under Standard in vivo Assay Conditions are Comprehensively Identified in ROSE Experiments
By design, ROSE should be able to identify two classes of promoters not represented in RegulonDB. The first class comprises those present in the
E. coli genome but not described in existing TSS mapping studies. The second class includes promoters that are repressed or not activated under standard in vivo testing conditions. In total, 2,303 transcriptional start sites detected by ROSE-Eσ
70 are not yet described in RegulonDB. Thomason et al. identified 1,254 of those TSS
in vivo. The remaining 1,049 upstream regions were subjected to motif enrichment analysis using
Improbizer [
14]. To remove possible background signals, the sequences were sorted by the -10 region score given by
Improbizer, which corresponds to the similarity of a given sequence to the detected consensus motif. A randomized control run yielded a 95% confidence score of 6.20 for a given sequence. After filtering with this value as a cut-off, 598 sequences remained, each containing a clear σ
70 consensus sequence. These promoters were therefore presumably repressed under the conditions tested in the in vivo studies. Manual inspection showed regulator binding sites around many of these promoters, suggesting that transcription from them is prevented in vivo by known transcriptional regulators such as H-NS, Fur, or Fis. In the following, we describe two exemplary promoter regions for each of the transcriptional regulators H-NS, Fur, and Fis in more detail. For each regulator, we performed in vivo experiments with defined transcription factor knockout mutants from the KEIO collection [
26] to validate the results observed with ROSE. The knockout mutants were JW1225-2 for
Δhns, JW0669 for
Δfur, and JW3229-1 for
Δfis. Sequencing on Illumina MiSeq yielded, on average, 0.91 million reads per library (See Supplemental
Table S4). The mapped reads were visualized using the
ReadXplorer software [
17], and transcription start site (TSS) detection was performed with the same tool and automatic parameter estimation (See Supplemental
Table S5).
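The score filtering with a randomized control, described above for the 1,049 upstream regions, follows a general pattern: score shuffled sequences, take the 95th percentile as the cut-off, and keep real sequences at or above it. The Python sketch below illustrates that pattern only. The toy scoring function `best_hexamer_match` is a stand-in for the Improbizer score, and the shuffling scheme, the number of shuffles, and the use of "greater than or equal" are all assumptions.

```python
import random

def best_hexamer_match(seq, motif="TATAAT"):
    """Toy score: highest number of matching bases over all windows."""
    k = len(motif)
    return max(
        (sum(a == b for a, b in zip(seq[i:i + k], motif))
         for i in range(len(seq) - k + 1)),
        default=0,
    )

def shuffled(seq, rng):
    """Return seq with its bases in random order (composition preserved)."""
    chars = list(seq)
    rng.shuffle(chars)
    return "".join(chars)

def control_cutoff(seqs, score_fn, quantile=0.95, n_shuffles=20, seed=0):
    """Score shuffled copies of seqs and return the given quantile."""
    rng = random.Random(seed)
    scores = sorted(
        score_fn(shuffled(s, rng)) for s in seqs for _ in range(n_shuffles)
    )
    if not scores:
        raise ValueError("no sequences given")
    index = min(int(quantile * len(scores)), len(scores) - 1)
    return scores[index]

def filter_by_cutoff(seqs, score_fn, cutoff):
    """Keep sequences whose score reaches the cut-off."""
    return [s for s in seqs if score_fn(s) >= cutoff]
```

In this scheme the reported value of 6.20 would play the role of `cutoff`, with the Improbizer -10 region score in place of the toy score.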
The genes
stpA (b2669) and
ftnA (b1905) are both negatively regulated by H-NS, a global transcriptional silencer [
28], which is involved in the regulation of 5% of all
E. coli genes [
29]. In both cases, H-NS binds upstream of the TSS and represses transcription [
30,
31] (
Figure 4A). The gene
stpA has a TSS at position 2,798,556 and a perfect σ
70-like -10 region (TATAAT). The gene
ftnA with a TSS at position 1,988,682 has a complete σ
70-like promoter (TTGCAA-16-TATAGT). Both genes showed no transcription in the wildtype strain, but transcriptional activity was measured in the
Δhns knockout strain and in the ROSE approach (See Supplemental
Figures S3 and S4). Moreover, both genes were also described by Thomason et al. and RegulonDB.
The TSS of
yjjZ (b4567) has already been described for genomic position 4,605,777 in the
E. coli MG1655 genome. According to
EcoCyc, there are two ferric uptake regulator (Fur) binding sites in the vicinity of the transcription start site of
yjjZ [
27,
32] (
Figure 4B). Although the respective promoter harbors a σ
70-like consensus sequence (TTGCAA-18-TATGAT), Thomason et al. did not detect a transcription start site for
yjjZ, suggesting efficient transcriptional repression
in vivo. This was confirmed in our in vivo experiment, in which the wildtype strain showed no transcription of the
yjjZ gene. However, in the
Δfur knockout strain, transcription from the σ
70 promoter is measurable. Moreover, the TSS has been identified clearly in vitro (1,494 read starts) with the ROSE method (See Supplemental
Figure S5). Another example of a gene repressed by the regulator Fur is the gene
fepA (b0584) [
33,
34] (
Figure 4B). The
fepA promoter has a σ
70-like consensus sequence (TTGCAG-14-TATTAT) and was not detectable in vivo in the wildtype strain. However, both the
Δfur knockout strain (508 read starts,
in vivo) and ROSE (348 read starts,
in vitro) show transcriptional activity for the gene
fepA (See Supplemental
Figure S6).
The gene for the DNA-binding transcriptional dual regulator GlcC has a TSS at position 3,128,206. It has an unusual -10 region (CATAAT) and a -35 sequence (TTAACT). As stated in
EcoCyc, the gene’s promoter region has four binding sites for the global regulator Fis [
35] (
Figure 4C), which represses the gene. This repression was confirmed in the in vivo experiment with the wildtype strain and the
Δfis knockout mutant. The wildtype strain showed only 7 read starts for the gene
in vivo. In the knockout strain, the number of read starts was five times higher than in the wildtype strain, demonstrating increased transcription of
glcC in the absence of the Fis regulator. However, the highest number of read starts, and thus the strongest transcription of the gene, was observed with the in vitro ROSE approach (493 read starts) (See Supplemental
Figure S7). Another interesting gene is
aer (b3072), which shows a clear transcription start site in ROSE and the Δ
fis knockout strain at position 3,219,346 and harbors a σ
70-like consensus sequence (TTGTGC-19-TAACAT). This transcription start site is also described in the publication of Thomason et al. but is not defined in RegulonDB. Nevertheless, RegulonDB contains a Fis binding site with an unknown function upstream of the
aer gene (
Figure 4C). The fact that ROSE and the
Δfis knockout strain showed transcriptional activity, whereas there was no transcription in the wildtype strain, suggests that Fis is a transcriptional repressor of
aer (See Supplemental
Figure S8).
The gene
ndh of
E. coli encodes NADH dehydrogenase II. The corresponding promoter P
ndh is located at position 1,165,992 of the genome and harbors a standard σ
70-like consensus sequence (TTGGTA-21-TATTCT). This gene is negatively regulated by multiple transcription factors such as FNR [
36], Fur-Fe
2+ [
37], and NsrR [
38]. Due to the high number of different repressors of
ndh, no transcription was detectable in the
E. coli wildtype strain or the single knockout strains
in vivo. However, the ROSE method showed a distinct TSS at the known position of P
ndh with over 200 read starts (
Figure 5).
These findings underline that the bottom-up approach employed within ROSE aids the identification of previously undetected TSS, especially those that are repressed or not activated under a given in vivo testing condition.
3.5. Promoters activated by transcriptional regulators in vivo are not identified in vitro
A different type of σ
70-dependent promoters comprises those specifically activated by transcriptional regulators
in vivo, which possibly allows for weaker conservation of promoter motifs. For example, the well-known promoter of the
araBAD operon (CTGACG-18-TACTGT) of
E. coli can be activated and repressed by the transcriptional regulator AraC
in vivo, depending on the availability of arabinose [
39,
40]. It is furthermore activated by the cAMP receptor protein (CRP) in vivo [
41,
42,
43]. Since none of these regulators are included in the ROSE in vitro transcription assay, neither activation nor repression of pBAD should occur. Interestingly, no TSS has been identified upstream of the
araBAD operon by ROSE-Eσ
70, suggesting that activation by CRP and/or AraC is indeed critical for transcription initiation at pBAD. Another instance is the σ
70-dependent promoter of
csiE (b2535), known to be activated by both CRP and H-NS in vivo [
44,
45]. Dual activation allows for relatively weak -10 and -35 hexamers (TTCCCT-18-AACTTT). Consequently, the respective TSS at position 2,665,401 is included in both
in vivo-based studies but was not detected by ROSE-Eσ
70. The σ
70-dependent promoter of
alkA is activated upon binding Ada, a DNA repair protein, which is a critical component of the adaptive response [
46,
47]. The promoter upstream of its TSS at position 2,147,559 harbors a well-conserved -10 region (TATGCT) but has no -35 region. In contrast to both in vivo studies, this TSS was not detected by ROSE-Eσ
70, indicating that it requires activation by Ada. In conclusion, ROSE robustly and comprehensively identifies
bona fide promoters and those potentially repressed under in vivo conditions. It also allows conclusions to be drawn from negative results: a TSS that is detected in vivo but not by ROSE points to efficient activation
in vivo.
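The promoter notations used in Sections 3.4 and 3.5, such as TTGCAA-16-TATAGT, can be compared to the canonical σ70 hexamers with a few lines of Python. This is a crude descriptive sketch, not a promoter strength model. The canonical hexamers TTGACA and TATAAT are textbook values rather than the motifs derived in this study, and the function names are hypothetical.

```python
SIGMA70_MINUS35 = "TTGACA"  # canonical textbook -35 hexamer (assumption)
SIGMA70_MINUS10 = "TATAAT"  # canonical textbook -10 hexamer (assumption)

def parse_promoter(text):
    """Split 'TTGCAA-16-TATAGT' into (-35 hexamer, spacer length, -10 hexamer)."""
    minus35, spacer, minus10 = text.upper().split("-")
    if len(minus35) != 6 or len(minus10) != 6:
        raise ValueError("expected two hexamers separated by a spacer length")
    return minus35, int(spacer), minus10

def matches(hexamer, consensus):
    """Number of positions at which hexamer and consensus agree."""
    return sum(a == b for a, b in zip(hexamer, consensus))

def describe(text):
    """Return (-35 matches, spacer length, -10 matches) for one promoter."""
    minus35, spacer, minus10 = parse_promoter(text)
    return (
        matches(minus35, SIGMA70_MINUS35),
        spacer,
        matches(minus10, SIGMA70_MINUS10),
    )

print(describe("TTGCAA-16-TATAGT"))  # ftnA: (4, 16, 5)
print(describe("TTCCCT-18-AACTTT"))  # csiE: (3, 18, 2)
```

In this simple count, the ftnA promoter detected by ROSE matches nine of twelve consensus positions, whereas the activator-dependent csiE promoter matches only five, in line with the weak hexamers noted above.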