An earlier sequence analysis provided hints that the product of the X ORF was functional, by detecting a decrease in synonymous codon variability in the region of overlap with VP1 [
17], but could not determine whether this decrease was significant. Here we show that it is highly significant, using Synplot2, a dedicated program extensively validated on experimentally proven overlaps. In addition, we show that the X ORF is conserved not only in all erythroparvoviruses but also in tetraparvoviruses (in which it was called ARF1 [
12]). Given the high rate of evolution of viruses, the conservation of the X ORF in two genera provides compelling evidence that it is expressed and plays a function essential for the viral life cycle.
4.1.1. The X protein could be translated either by a non-conventional mechanism or from an overlooked mRNA
The X ORF has a potential AUG start codon in all erythro- and tetraparvoviruses, but cannot be encoded in a monocistronic fashion by any known viral mRNA (see
Figure S3). These observations suggest that the X protein is translated either 1) by a non-canonical mechanism, or 2) from a currently unmapped mRNA. We briefly discuss both hypotheses, which we present only as a starting point to guide experimental approaches.
1) Translation of the X ORF through a non-canonical mechanism
In vertebrates, two main factors influence canonical translation: 1) the strength of the “Kozak sequence” surrounding the initiator AUG codon [
38]; and 2) the position of the AUG codon in the mRNA. In general, translation initiates at the first AUG with an optimal Kozak sequence, but many exceptions are known (for a review, see [
39]). For example, a downstream AUG can sometimes initiate translation even if it is separated from the first optimal AUG by intervening AUGs, thanks to a mechanism called “re-initiation” (for a review, see [
40]). In B19V, for instance, the VP1 AUG codon is preceded by 7 upstream AUG codons that form mini-ORFs (
Figure 9), and is accessed by re-initiation after ribosomes have first initiated translation at some of these mini-ORFs [
41]. Note that the presence of these 7 upstream AUGs severely decreases the translation level of VP1 [
41].
In principle, the B19V X ORF might also be translated from the VP1 mRNA by re-initiation, since it is separated from the VP1 AUG start codon by 4 AUGs (
Figure 9). However, the efficiency of translation would presumably be very low [
40]. Interestingly, in B19V, the 77 nucleotides upstream of the presumed AUG start codon of the X ORF have a significantly reduced variability in synonymous codons (nt 172-250 of the VP1 CDS, see
Figure 3B and
Table 2, corresponding to nucleotides 2795-2873 of the genome, see
Figure 9, bottom right). It is tempting to speculate that this region corresponds to a translation enhancer, i.e., a regulatory element that would increase the translation efficiency of the X ORF.
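The re-initiation argument above rests on counting the AUG codons that a scanning ribosome meets before reaching a given start codon, and on the mini-ORFs they open. The following minimal Python sketch illustrates that bookkeeping on a toy sequence. The function names and the toy mRNA are hypothetical and are not B19V data; the sketch considers only AUG positions and ignores Kozak strength.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}


def normalize(seq):
    """Upper-case the sequence and convert DNA (T) to RNA (U)."""
    return seq.upper().replace("T", "U")


def upstream_augs(mrna, target_start):
    """0-based positions of every AUG lying entirely upstream of target_start."""
    mrna = normalize(mrna)
    return [i for i in range(target_start - 2) if mrna[i:i + 3] == "AUG"]


def mini_orf(mrna, start):
    """(start, end) of the ORF opened at `start`, end exclusive and including
    the stop codon; None if no in-frame stop codon is found."""
    mrna = normalize(mrna)
    for i in range(start, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            return (start, i + 3)
    return None


def upstream_mini_orfs(mrna, target_start):
    """Mini-ORFs opened by each AUG upstream of the target start codon."""
    return [mini_orf(mrna, pos) for pos in upstream_augs(mrna, target_start)]


# Toy mRNA (hypothetical): two mini-ORFs followed by the AUG of interest.
toy = "GG" + "AUGAAAUAA" + "GG" + "AUGCCCUGA" + "CC" + "AUGGGG"
target = 24  # 0-based position of the last AUG in the toy sequence

print(upstream_augs(toy, target))       # [2, 13]
print(upstream_mini_orfs(toy, target))  # [(2, 11), (13, 22)]
```

Applied to a real transcript, the length of the list returned by `upstream_augs` is the number of AUGs separating the 5' end from the start codon of interest.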
Figure 9 legend. The nomenclature of transcripts and splice sites is as in [
6]. Thin boxes represent mini-ORFs. The mini-ORFs in black influence the translation of VP1 [
40], and might also influence that of the X protein. The mini-ORFs in red are expected to influence the translation of the X protein but presumably not that of VP1. The region immediately upstream of the X ORF has a decreased synonymous variability (see
Table 2 and
Figure 3B), suggesting it has a regulatory function and might act as a translation enhancer.
2) Translation of the X ORF from a currently unmapped mRNA
A second mechanism might in principle ensure translation of the X ORF: the existence of an unmapped mRNA, generated by an overlooked splice acceptor site. Two conditions would be required for a splice acceptor site to generate a monocistronic transcript that encodes the X protein of B19V: 1) this site should be conserved in all isolates of B19V; 2) it should be located in the region between the VP1 start codon and the presumed start codon of the X protein (nt 251-253 of the VP1 CDS).
We found 3 such potential sites (having the canonical sequence (C/U)AG preceded by a region rich in pyrimidines (C/U) [
42]), at nucleotides 158-160, 185-187, and 231-233 of the VP1 CDS. (The respective coordinates of the acceptor G in the genomic sequence of B19V are 2783, 2810 and 2856, see
Figure 9). Each acceptor site would yield a monocistronic transcript that encodes the X ORF, by splicing out both the VP1 AUG start codon and the 4 following AUG codons located upstream of the presumed AUG start codon of the X protein (in red in
Figure 9). Interestingly, these potential splice acceptor sites are located in or near the potential regulatory region immediately upstream of the X ORF (
Figure 9, bottom right), which has a decreased synonymous variability (
Figure 3B and
Table 2).
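The search for candidate acceptor sites described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure actually used: the tract length (10 nt) and the pyrimidine fraction threshold (0.7) are arbitrary choices, the function names are hypothetical, and the toy sequence is not B19V data. The offset of 2623 between VP1 CDS and genomic coordinates is inferred from the coordinate pairs quoted in the text (e.g., CDS nt 160 and genome nt 2783).

```python
import re

# Assumption: offset inferred from the coordinate pairs given in the text
# (VP1 CDS nt 160 <-> genome nt 2783); not taken from an annotation file.
VP1_CDS_GENOME_OFFSET = 2623


def cds_to_genome(cds_pos, offset=VP1_CDS_GENOME_OFFSET):
    """Convert a 1-based VP1 CDS coordinate to a 1-based genome coordinate."""
    return cds_pos + offset


def candidate_acceptors(seq, region_end, tract_len=10, min_pyrimidine_fraction=0.7):
    """1-based positions of the G of each (C/U)AG motif whose G lies before
    `region_end` (1-based) and whose preceding `tract_len` nucleotides are
    pyrimidine-rich. Both thresholds are illustrative assumptions."""
    seq = seq.upper().replace("T", "U")
    hits = []
    # Lookahead so that overlapping motifs are all reported.
    for match in re.finditer(r"(?=[CU]AG)", seq):
        start = match.start()  # 0-based index of the C/U
        g_pos = start + 3      # 1-based position of the acceptor G
        if g_pos >= region_end or start < tract_len:
            continue
        tract = seq[start - tract_len:start]
        fraction = sum(base in "CU" for base in tract) / tract_len
        if fraction >= min_pyrimidine_fraction:
            hits.append(g_pos)
    return hits


# Toy sequence (hypothetical): one CAG after a pyrimidine tract, one CAG
# after a purine-rich stretch, then an AUG whose A is at 1-based position 28.
toy = "GGAU" + "CCUUCUCCUU" + "CAG" + "GAAGAAG" + "CAG" + "AUG" + "AA"
print(candidate_acceptors(toy, region_end=28))  # [17]

# Coordinates of the three acceptor G residues quoted in the text.
print([cds_to_genome(p) for p in (160, 187, 233)])  # [2783, 2810, 2856]
```

A motif scan of this kind only nominates candidates; whether any of them is used as a splice acceptor has to be established experimentally.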
The X ORF most probably originated by overprinting the VP1 ORF
Most overlapping gene pairs originate by overprinting, a process in which substitutions in an ancestral reading frame enable the expression of a second reading frame (the novel frame), while preserving the expression of the first frame [
43,
44]. The ancestral frame can be identified by its phylogenetic distribution (the ORF with the widest distribution is most probably the ancestral one) [
43,
45], or by its codon usage [
46] if both frames have the same phylogenetic distribution [
14].
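To make the notion of two overlapping reading frames concrete, the following Python sketch reports the longest ORF (ATG to stop codon) in a chosen frame of a coding sequence. It is a toy illustration only: the function name and the sequence are hypothetical, and the sketch does not identify which frame is ancestral.

```python
STOPS = {"TAA", "TAG", "TGA"}


def longest_orf_in_frame(cds, frame_shift):
    """(start, end) of the longest ATG-to-stop ORF read in the frame shifted
    by `frame_shift` (0, 1 or 2) relative to the first nucleotide of `cds`.
    Coordinates are 0-based, end exclusive, stop codon included.
    Returns None if the frame contains no closed ORF."""
    cds = cds.upper().replace("U", "T")
    best = None
    open_start = None
    for i in range(frame_shift, len(cds) - 2, 3):
        codon = cds[i:i + 3]
        if codon == "ATG" and open_start is None:
            open_start = i  # first ATG since the previous stop codon
        elif codon in STOPS and open_start is not None:
            if best is None or (i + 3 - open_start) > (best[1] - best[0]):
                best = (open_start, i + 3)
            open_start = None
    return best


# Toy sequence (hypothetical): frame 0 reads ATG GAT GGC CTA AGC TGA,
# while the +1 frame contains ATG GCC TAA nested inside it.
toy_cds = "ATGGATGGCCTAAGCTGA"
print(longest_orf_in_frame(toy_cds, 0))  # (0, 18)
print(longest_orf_in_frame(toy_cds, 1))  # (4, 13)
```

In the toy sequence, nucleotides 4 to 12 are read in both frames, which is the situation created by overprinting.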
The phylogenetic distribution of X and of VP1 indicates that VP1 is necessarily the ancestral reading frame, since a PLA2 domain is found not only in most
Parvoviridae, but also in a wide variety of metazoans and plants [
47], whereas the X protein is found only in erythro- and tetraparvoviruses. Therefore, the X protein must have originated by overprinting the region of the VP1 frame encoding PLA2, in the putative common ancestor of erythro- and tetraparvoviruses.
An intriguing observation is that the VP1 gene of bPARV3, which is basal to the
erythroparvovirus phylogeny, contains an X-like ORF despite lacking a PLA2 domain (
Figure 7). This raises two hypotheses–either:
1) the X-like ORF of bPARV3 arose independently from the X ORF of erythro- and tetraparvoviruses (i.e., their similarity is coincidental–they are not homologous); or
2) the X-like ORF of bPARV3 is homologous to the X ORF of erythro- and tetraparvoviruses, and the PLA2 domain was lost in bPARV3. In that case, the X-like ORF would constitute a “genetic palimpsest” (a palimpsest is a manuscript that has been erased and written on again), i.e., the X-like ORF would have been overprinted (“written over”) on a now “erased” PLA2 domain.
In the absence of intermediate sequences to reconstruct the evolution of bPARV3, it is not yet possible to settle the issue.
The X protein is not homologous to the protoparvovirus SAT protein
An earlier study [
12] hypothesized that the ARF1/X protein of PARV4 was homologous to the SAT protein, another short, transmembrane protein encoded in the +1 frame of the VP1 gene in the genus
protoparvovirus [
48]. However, SAT and X cannot have a common origin, since SAT is encoded by the N-terminus of VP2, downstream of the region encoding the PLA2 domain (our observations), unlike the X protein, which overlaps PLA2 (see Figs 3 and 4).