Comparing healthy and dystrophic transcripts
This transcriptional model also influences interpretation of expression under dystrophic conditions. As shown by the phenomenon of transcript imbalance, however, careful choice of target site allows nuanced assessment of transcriptional dynamics. Mutations eliciting a DMD phenotype predominantly introduce premature termination codons (PTCs) into the dystrophin transcript: these will be promptly degraded via nonsense mediated decay (NMD), but as noted above, this step occurs only upon nuclear export (i.e. after transcript completion -
Figure 5F). Consequently, in dystrophic muscle, mature transcripts are greatly reduced as expected, but nascent mRNAs remain (indeed
mdx muscle myonuclei retain strong 5′ probe foci under ISH [
39]). If transcriptional initiation remains otherwise unchanged, RT-qPCR from healthy and dystrophic muscle (
Figure 5G, H i-iii) will thus suggest different extents of NMD depending on the precise region of the transcript measured. 5′ sequence (exons 1-2) is present in essentially all transcripts (both nascent and mature) and will show only modest fold changes, while 3′ sequence (exons 62-63) is present chiefly in mature transcripts subject to NMD, so here strong reductions will be reported (
Figure 5I). Note this also complicates interpretation of isoform expression (for example within dystrophic brains): here first exon sequence is the only means to distinguish isoforms (and thus might not reflect the true effects of NMD), while 3′ sequence is common to all isoforms (and thus will not reflect isoform specific behaviour). Conversely, changes in transcriptional initiation (in the absence of NMD) will result in
en bloc alterations in transcript numbers (
Figure 5J-L), leading to more consistent fold changes regardless of region measured (
Figure 5M). These two scenarios are not merely of academic interest:
in vivo, both might be present simultaneously, and use of target sites that distinguish predominantly nascent mRNAs (exons 1-2) from predominantly mature (exons 62-63) permits contributions of NMD and transcriptional initiation to be assessed effectively independently. We and others [
37,
39] have shown that in the
mdx mouse, reductions in transcriptional initiation do indeed occur alongside NMD (see
Figure 2): levels of mature transcripts are markedly lower than in healthy muscle (consistent with degradation) but nascent transcripts levels are also reduced, and by a greater factor than can be attributed to NMD (accordingly under ISH, nuclear foci are prominent, but reduced in intensity). Notably, this phenomenon does not appear to be a generalised feature of dystrophic mammals: in the DE50-MD dog model of DMD loss of mature transcripts to NMD is profound, while transcriptional initiation remains essentially unchanged [
45].
Quantifying exon skipping
Of particular therapeutic interest, this model influences assessment of exon skipping (targeted exclusion of one or more exons to restore the dystrophin reading frame): if most dystrophin mRNA is nascent, and moreover co-transcriptionally spliced, many skipped transcripts will be immature at the time of measurement (
Figure 7A). Primers spanning the target site (such as exons 22 and 24 for the
mdx mouse) permit absolute quantification of skipped transcripts regardless of maturity, however expressing this as a fraction of overall dystrophin levels is nontrivial. Quantifying sequence 5′ to the region of interest (for example, exons 1:2, for ‘total’ dystrophin) underestimates skipping efficiency, as this necessarily includes nascent transcripts that do not yet even contain the target site. Use of a site more 3′ correspondingly overestimates efficiency, as fewer transcripts in total contain such 3′ sequence: indeed, for a skipping site at exon 23, normalising to 3′ sequence could give efficiency estimates in excess of 100%. Sequence closest to the target site gives greatest accuracy (or alternatively sequence unique to unskipped transcripts: measuring across the exon 23:24 splice, for example), however regardless of site, it should be remembered that all measurements are additionally subject to survivorship bias: unskipped mature transcripts are degraded (and thus underrepresented), while successfully skipped transcripts are not (
Figure 7A, i-iii).
This picture is further complicated by the dystrophin locus itself: while dp427 mRNA takes ~16 hours to complete, individual exons are not spaced equidistantly along the gene (
Figure 7B): some thus exhibit more closely matched transcriptional behaviour than others. The first ten exons, for example, are sparsely distributed across some ~800 kilobases of genomic DNA, while the following 30 exons occupy less than half that. Consequently, specific sequence regions do not emerge in a smooth gradient, but in a more ‘burst-like’ fashion (see
Figure 7C), with corresponding consequences for comparison of skipped/unskipped transcripts (or healthy/dystrophic). Exon 10 is transcribed almost 6 hours after initiation, but the entire following sequence to exon 41 then emerges in only ~2 hours: all exons within this 10-41 region thus represent similar fractions of total dystrophin mRNA (
Figure 7D), and will exhibit comparable fold differences between healthy/dystrophic (
Figure 7E).
The most rigorous approach consequently would be to quantify multiple regions independently, in both treated and untreated samples: comparison of early 5′ sequence would assess influence of treatment on transcriptional initiation, comparison of skipped sequence with adjacent representative sequence would assess skipping efficiency, and comparison of late 3′ sequence would establish the resultant fold enrichment of mature, viable mRNAs.
Dystrophin in the transcriptomic era
Next-generation high-throughput transcriptomic approaches (such as RNAseq) are increasingly popular and accessible even to those with modest budgets. As such, the eccentricities of dystrophin expression must also be extended to these techniques, here applying caveats both to sample preparation and data analysis/interpretation. For sample preparation, standard RNAseq pipelines first employ oligo dT column purification (eliminating ubiquitous non-coding RNAs such as ribosomes that might otherwise dominate sequencing). As noted in
Figure 5, even this simple step excludes all nascent dystrophin transcripts, a bias then potentially compounded by oligo dT-directed reverse transcription, which favours 3′ sequence. Following fragmentation, ligation and sequencing to FastQ format, analysis first requires mapping of sequence data to a reference genome to establish the genomic location of each read: these (large) BAM files of aligned reads are then subsequently compared to an annotated feature file to determine which transcripts each genomic location corresponds to, ultimately producing a simple value of ‘reads per transcript’ (or more accurately, ‘reads per feature’). Post-hoc corrections can be applied to normalise for transcript length (longer mRNAs generate more reads), but conventionally, the specific location of each read within a transcript is not considered relevant to downstream analysis. For a typical RNAseq pipeline, therefore: mRNA isolation eliminates nascent transcripts, reverse transcription biases against 5′ sequence of whatever remains, and downstream sequencing analysis then hides these concerns from the investigator. For dystrophin in particular, this approach also precludes analysis of isoform-specific expression behaviour: even when isoform variants are present within the feature file, all reads that map to the
Dmd gene feature are flagged simply as ‘Dmd’.
Addressing sample preparation concerns is challenging but not impossible: use of ribodepletion instead of oligo dT purification allows retention of nascent transcripts, and similarly (as in
Figure 5), use of random priming (rather than polyA-directed oligo dT priming) allows capture of sequence without 3′ bias (this approach can also be employed to investigate intronic sequence [
42]). These modifications cannot be applied post-hoc, however, and are thus of no benefit to sequencing data already obtained using conventional polyA purification/priming.
Data analysis is conversely more tractable to post-hoc reassessment. Genome feature files (.GFF or .GTF format) store genomic coordinates of each exon, and the gene ID of the corresponding spliced transcript, but there is no
a priori reason feature mapping cannot be conducted down to the level of individual exons. This can be done by opening aligned BAM files in a genome browser (such as IGV) and manually counting reads for each exon [
47], though this approach is somewhat painstaking. Alternatively, commonly used programs such as Htseq-count can conduct exon-level analysis innately, however this approach necessarily applies to all exons across the entire feature file (generating a vastly excessive dataset), and moreover the nomenclature used to designate individual exons is unwieldy, especially for transcripts with multiple listed variants (such as
Dmd). For dystrophin-focussed investigations, a more practical approach would be advantageous. As gene IDs are simple text fields, and genome feature files can be readily edited (via excel or text editor), we therefore modified the GRCmm39 mouse genome feature file to add each dystrophin exon as a unique, distinct gene ID (
Dmd_exon3, etc) alongside the conventional ‘Dmd’ assignment. This allows mapped reads to be evaluated both overall (total ‘Dmd’), and at single-exon resolution (
Figure 8A). We further assigned distinct IDs to unique isoform first exons, extending resolution to individual full length isoforms (dp427c, m, p) and to dp260, dp140, dp116 and dp71 (we note that a similar approach was elegantly employed by Doorenweerd
et al. to identify isoform-specific expression patterns in human brain RNAseq datasets [
20]).
We first assessed public repository FastQ data taken from a comprehensive transcriptional profiling study of muscle types in healthy mice (generated by Terry
et al [
48]), examining three commonly-studied pelvic limb muscles, the tibialis anterior (TA), the extensor digitorum longus (EDL) and the soleus (SOL). These datasets were then subjected to a standard analysis pipeline using a standard mouse feature file or our modified version (see methods). This dataset should reflect mature transcripts only: as discussed above, use of polyA purification excludes nascent transcripts prior to reverse transcription (
Figure 5E). Conventional analysis confirmed the authors’ original findings: as expected, myosin heavy chains represented a high percentage (~5%) of total reads yet also revealed differences associated with the characteristic roles of each muscle type. Reads per million (RPM) for the very fast IIB myosin heavy chain (
MYH4) were high (~50,000) in the faster TA and EDL muscles but lower in the slower SOL, with the TA also exhibiting higher levels of the fast IIX isoform (
MYH1). Conversely, the fast IIA (associated with fast oxidative fibres) and slow Iβ MHCs (
MYH2, 7) were elevated in SOL but not TA or EDL (
Figure 8B). RPM values for dystrophin (
Dmd) showed no prominent muscle-specific behaviour, being similar across all samples. Counts for this transcript were also markedly lower (~200 RPM) as expected for a low abundance transcript.
Reanalysis of these data using our exon-specific dystrophin feature file added substantial context to these findings. The dystrophin 3′ UTR (exon 79) is 2.7kb in length, almost 20% of the mature transcript: one would thus expect reads to exon 79 to be over-represented. Oligo dT primed reverse transcription also biases in favour of 3′ sequence, potentially compounding this over-representation. As shown (
Figure 8B,
Dmd exon 79) this was indeed the case: reads to the (predominantly untranslated) exon 79 represented 30-60% of total
Dmd reads and were consequently highly comparable to
Dmd values obtained via conventional analysis. Reads to all other exons were markedly lower, but 3′ bias was still evident: RPM values at the 3′ end numbered in the hundreds, gradually diminishing to mere tens as exons become more 5′ (
Figure 8C). We then adjusted per-exon RPM for exon length, effectively obtaining an “RPM per base” value for each exon: normalising both for the enormous size of the 3′ UTR, and for single reads assigned to multiple exons (see
Figure 8A). Expressed in this manner (
Figure 8D) the 3′ bias was rendered substantially more obvious, and moreover illustrated the highly consistent read counts between muscles. Finally, we plotted these corrected counts against transcript length, using the midpoint positions of each exon within the long dp427 mRNA as X-axis coordinates: as the efficiency of reverse transcription declines essentially as a first-order function of length, plotting in this manner allowed the 3′ bias of oligo dT-primed reverse transcription to be empirically calculated (
Figure 8E). For this dataset, we observed a mean gradient of ~-0.0005 log2(RPM).base
-2, i.e. a 2-fold drop in per base sequence capture for every 2000 bases of distance from the 3′ terminus: again, this gradient was remarkably consistent across muscles, suggesting that this could be used as a dataset-wide correction factor for 3′ bias. Finally, ~95% of isoform-specific first exon counts (
Figure 8F) mapped to dp427m, indicating (as expected) that essentially all sequence data represents the muscle isoform of full-length dystrophin. Low levels (<4 counts) of dp116 and dp71 were detected in some samples, consistent with minor contributions from peripheral nervous tissue and vasculature, respectively, but reads to all other isoforms were essentially absent.
Next we applied this approach to dystrophic muscle, assessing RNAseq data generated by Chemello
et al [
49] using the ΔEx51 mouse model of DMD (a gene-edited mouse model that lacks dystrophin exon 51). Conventional analysis of 4-week old TA muscle samples showed
Dmd was expressed in healthy samples at levels highly comparable to the mouse muscle data above, but was markedly decreased in ΔEx51 samples, as would be expected (loss of exon 51 causes frameshift, leading to transcript degradation by NMD). Exon-specific analysis again showed that exon 79 reads were over-represented, accounting for ~30% of total (
Figure 9G), but also demonstrated that ΔEx51-associated decreases in reads were comparable across the entire length of the transcript (reassuringly, zero reads mapped to the absent exon 51 in ΔEx51 samples -
Figure 9H). This consistent decrease suggests these ΔEx51sequences do indeed correspond to mature mRNAs, perhaps captured prior to nuclear export or
en route to degradation (note that first exon sequences were near exclusively dp427m,
Figure 9I). Interestingly, 3′-5′ bias in this instance was consistent with substantially higher efficiency reverse transcription (2-fold drop every ~7000 bases) potentially explaining the more modest enrichment of 3′ UTR sequence in this dataset.
Finally, to explore the broader utility of exon-based interpretations, we examined data generated by Schmitt
et al [
50], using murine brain samples collected from embryonic day 15.5 (E15.5) to 29 days post birth (P29). Given developmental expression of dystrophin (especially within the brain), this data should report multiple isoforms, and moreover, sample preparation for this dataset used ribodepletion and random priming, so should be more representative of both nascent and mature transcription. Conventional analysis readily identified expression of
Dmd, with expression increasing ~2-fold from E15.5 to P29 (Figure 10A). Exon-specific analysis again demonstrated the strong representation of the exon 79 3′ UTR, but here this exon contributed a smaller fraction than in the studies assessed above (~20%, Figure 10B). Examination of unique first exon reads moreover added considerable nuance to this data, demonstrating that expression of
Dmd was indeed distributed across multiple isoforms (Figure 10C). Both the cortical full-length isoform dp427c and the short dp71 isoform were robustly detected, and at comparable levels, with expression patterns that broadly mirrored overall
Dmd expression (though age-associated increases in dp427c were more dramatic than in dp71). Dp140 was the other major contributor, but here expression declined with age: a finding at odds with the behaviour of ‘Dmd’ under conventional analysis, but one that reflects the involvement of this isoform in earlier rather than later neural development. Both muscle and Purkinje full-length isoforms (dp427m, p) were initially essentially absent, but both were present at very low levels by P22-29, while expression of retinal dp260 and peripheral nerve dp116 remained consistent with stochastic noise. Expression of individual exons across the entire dystrophin locus was also markedly different from the patterns revealed above (Figure 10D-I): read counts here were biased in the opposite fashion, exhibiting a clear 5′ enrichment rather than 3′, alongside periodic ‘spikes’ in expression corresponding to distinct initiation events from the dp140 and dp71 promoters (upstream of exons 45 and 63, respectively). The magnitude of these ‘spikes’ differed according to relative isoform abundance within a sample (note the exon 45/dp140-associated spike is barely detectable by P29 -Figure 10I), and which moreover similarly decline from 5′-3′. These data are wholly consistent with the predominantly nascent expression model proposed here: 5′ sequence should be more abundant than 3′ regardless of isoform, and the contributions of multiple isoforms expressed within a sample (dp427, dp140, dp71, Figure 10J-L) should thus overlap to create a ‘saw-tooth’ distribution of exon abundance (Figure 10M).
In summary, this brief overview, using established repository-located RNAseq datasets, supports our transcriptional model and moreover demonstrates the limitations of polyA-purification and oligo dT priming in the generation of cDNA samples and sequencing libraries. This work focusses only on dystrophin expression, but these findings likely extend to other long transcripts or genes (such as titin and obscurin). We also illustrate the utility of applying a more nuanced exon-level analysis to dystrophin within RNAseq data: this approach offers insights into transcriptional dynamics and isoform expression and can potentially allow broader evaluation of dataset-wide bias.