2. Results
Overview of our procedure to detect extremely distant homologs by using contextual information
To detect extremely distant homologs, we used a recursive procedure described in
Figure 1. We had already used this procedure on several occasions (e.g. [
4,
16,
17]), but had never formally presented it. It is based on the idea that it is extremely difficult to
find distant homologs using sequence-based searches, but that it is easy to
confirm whether a candidate protein is homologous to the query. This idea can be further decomposed into 3 principles, described below.
Principle 1: Traces of sequence similarity can persist beyond the cutoff for statistical significance of programs for sequence homology detection.
Proteins detected by homology search programs such as Psiblast are considered homologous to the query if the statistical significance of their sequence similarity with the query (called “E-value”) is below a certain cutoff (typically E = 10⁻³) [18]. However, these programs can also return a long list of non-significant hits above the cutoff (up to an E-value of 1000 on the web-based version of Psiblast in the MPI toolkit [19]), which are often called “marginal” hits. Some of these hits might be homologs that have diverged considerably in sequence, which is why they are above the significance cutoff (the higher the E-value, the less significant the similarity). The question is how to identify these divergent homologs. This can be done thanks to principles 2 and 3 below.
Principle 2: To identify candidate homologs in the list of non-significant hits, we can use "contextual" information, e.g. taxonomy or type of host infected.
"Contextual" information is the information associated with the primary sequence of a protein, such as gene location, gene order, taxonomy, protein size, domain co-occurrence, domain order, function, type of host infected, etc. Contextual information has been used since the beginning of bioinformatics to detect more distant homologs (e.g. [
20,
21,
22]). Yet we noticed that it is particularly powerful in viruses for two reasons:
- First, viruses tend to have very few genes (fewer than 10 for most RNA viruses, for example, compared to over 20,000 in humans). Thus, a weak similarity between two viral proteins is much more meaningful than, say, a weak similarity between two human proteins;
- Second, some proteins are found primarily in viruses, or even restricted only to a certain type of viruses. For example, movement proteins of the 30K superfamily are restricted to plant-infecting viruses (as well as to certain plants) [23]. Therefore, a protein which has only weak similarity to a 30K movement protein, but which comes from a plant-infecting virus, might reasonably be considered a candidate homolog that would have considerably diverged in sequence.
In the present study, initial searches found that the WIV domain was only present in arthropod-infecting viruses (see below); we thus systematically considered weak hits from arthropod-infecting viruses as candidate homologs.
Principle 3: Candidate homologs can be validated using a highly sensitive method, pairwise profile-profile comparison.
Once a candidate homolog has been identified, it can be validated using a powerful method, HHpred pairwise comparison [
10]. Briefly, in pairwise comparison mode, HHpred automatically performs 4 steps: a) it collects, in parallel, the sequences of homologs of the query protein and of the candidate homolog; b) it generates separate alignments of these sequences; c) it converts these alignments into representations called sequence "profiles"; and d) it compares these two profiles, as well as their predicted secondary structure. Comparing sequence profiles is much more powerful than comparing single sequences, because the profiles contain information about how the sequences of homologs can evolve [
4,
24]. The comparison can yield two results:
- If HHpred detects a significant similarity, the two proteins are considered homologous;
- If HHpred does not detect a significant similarity, either the two proteins are not homologous, or they are homologous but have diverged beyond recognition even by sequence profile methods. In such cases, only structure-based methods can confirm or refute the homology (see below).
Based on these 3 principles, we designed a recursive procedure to identify distant homologs, described in
Figure 1 (see also the Methods section). It is composed of 3 parts:
1) The first part (
Figure 1, top) starts with a standard sequence-based homology search (step 1A) using highly sensitive software (Psi-blast [
25] and HHblits [
5]), with stringent significance cutoffs. Homologs detected in this step are aligned (step 1B) and are resubmitted to Psi-blast and HHblits until no new homolog is identified by this standard search. Then we proceed to part 2.
2) The second part (Figure 1, middle) is an advanced sequence-based search, which takes advantage of contextual information. It starts with examining weak hits that only have marginal similarity to the query, among which we select candidates that have the appropriate contextual information (i.e. that come from certain taxa, or infect certain hosts) (step 2A). We filter out those candidates that are homologous to known domains (step 2B). Then we compare the sequence of each remaining candidate with the sequence alignment of the WIV domain (step 2C), using HHpred pairwise comparison. If HHpred detects significant similarity (which means that the candidate is homologous to WIV), we incorporate these validated candidates into the alignment of WIV domains (step 2D) and repeat the procedure from the start (part 1). Candidates for which HHpred detects no significant similarity might be divergent homologs of WIV. We examine them in part 3.
3) The third part (Figure 1, bottom) consists of a structure-based search. This step has only recently become possible thanks to the success of the software Alphafold2 [
8] in reliably predicting 3D structures. First, using Alphafold2, we predict the structure of the divergent candidate homologs that could not be validated in part 2 (step 3A). Then, we compare their structure with that predicted for the WIV domain of Lake Sinai virus ORF4 (step 3B). If both structures are significantly similar, the candidate is homologous to WIV. Otherwise, the candidate is discarded.
This procedure continues until no new homolog is found. We will now describe its application to discovering homologs of the WIV domain.
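The control flow of the three parts can be summarized in a short Python sketch. It is only an illustration under stated assumptions: the four callables are hypothetical placeholders for the real searches (Psi-blast/HHblits, selection of marginal hits by context, HHpred pairwise comparison, and structure comparison), step 2B is assumed to happen inside `marginal_candidates`, and all candidates of a round are examined before the standard search is repeated, which simplifies the procedure described above.

```python
from typing import Callable, Iterable, Set


def find_distant_homologs(
    seed: str,
    standard_search: Callable[[Set[str]], Iterable[str]],      # part 1 (hypothetical)
    marginal_candidates: Callable[[Set[str]], Iterable[str]],  # steps 2A-2B (hypothetical)
    profile_validates: Callable[[str, Set[str]], bool],        # step 2C (hypothetical)
    structure_validates: Callable[[str], bool],                # part 3 (hypothetical)
) -> Set[str]:
    """Repeat parts 1 to 3 until a full round adds no new homolog."""
    homologs: Set[str] = {seed}
    rejected: Set[str] = set()
    while True:
        count_before = len(homologs)

        # Part 1: standard search, resubmitted until it finds nothing new.
        while True:
            new_hits = set(standard_search(homologs)) - homologs
            if not new_hits:
                break
            homologs |= new_hits

        # Part 2: marginal hits with suitable context, checked by profile comparison.
        unresolved = []
        for candidate in list(marginal_candidates(homologs)):
            if candidate in homologs or candidate in rejected:
                continue
            if profile_validates(candidate, homologs):
                homologs.add(candidate)
            else:
                unresolved.append(candidate)

        # Part 3: structure-based check of candidates that part 2 could not validate.
        for candidate in unresolved:
            if structure_validates(candidate):
                homologs.add(candidate)
            else:
                rejected.add(candidate)

        if len(homologs) == count_before:
            return homologs
```

Passing the searches as functions keeps the loop independent of any particular program; candidates rejected in part 3 are remembered so that they are not examined again in later rounds.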
The WIV domain is found in proteins from over 20 viral taxa, with a large variety of domain organizations
Lake Sinai virus ORF4 protein was initially classified as orphan (i.e. devoid of homologs) upon its discovery [
12]. To identify distant homologs, we applied the procedure described above. We first present the results of parts 1 and 2 (respectively standard and advanced sequence-based homology search), and later the results of part 3 (structure-based homology search). In part 1 (
Figure 1, top panel), we examined significant hits (E < 10⁻³) returned by Psi-blast and HHblits, i.e. easily identifiable homologs, and noticed that they all came from viruses that infect arthropods. We thereby discovered that the Lake Sinai virus ORF4 protein consists of a standalone domain, which we will hereafter call the WIV domain (see below). Therefore, in part 2, we considered as “candidate homologs” hits that both had weak sequence similarity to the WIV domain (10⁻³ ≤ E < 1000) and came from arthropod-infecting viruses. We verified these candidates using HHpred pairwise comparison (
Figure 1, middle panel), and incorporated validated homologs in a new round of homology search (parts 1 and 2), until no new homolog was detected.
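The selection rule used in parts 1 and 2 can be written as a small Python sketch. The two E-value bounds are those quoted in the text; the `Hit` record, its field names and the single `host` label are hypothetical simplifications introduced only for illustration (a real hit would carry taxonomy and host annotations in whatever form the search program reports them).

```python
from dataclasses import dataclass

SIGNIFICANT_E = 1e-3   # significance cutoff quoted in the text
MAX_E = 1000.0         # upper bound of the marginal range quoted in the text


@dataclass(frozen=True)
class Hit:
    accession: str   # hypothetical identifier
    e_value: float
    host: str        # hypothetical label, e.g. "arthropod" or "plant"


def classify_hit(hit: Hit, required_host: str = "arthropod") -> str:
    """Return 'homolog', 'candidate' or 'ignored' for one search hit."""
    if hit.e_value < SIGNIFICANT_E:
        return "homolog"      # part 1: significant hit
    if hit.e_value < MAX_E and hit.host == required_host:
        return "candidate"    # part 2: marginal hit with suitable context
    return "ignored"
```

A hit classified as "candidate" is not yet a homolog; it still has to be checked by HHpred pairwise comparison or, failing that, by structural comparison.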
This procedure enabled us to detect homologs of Lake Sinai ORF4 in proteins from over 20 viral genera and unassigned taxa, corresponding to 11 viral orders (see
Figure 2 and
Table 1). Thus the WIV domain has an exceptionally wide taxonomic distribution. In addition, WIV occurs in proteins with a strikingly wide variety of domain organizations. We noted 4 main types of architectures:
- As a standalone domain: WIV is found as a standalone domain in most positive-strand RNA viruses (Figure 2, panel A), except Picornavirales and an Amarillovirales; in a negative-strand RNA virus (Mononegavirales, panel B); and in some double-stranded DNA viruses (Pimascovirales, panel C). In some of these proteins, WIV is preceded by a signal peptide (e.g. in Lake Sinai virus ORF4);
- Appended to a coiled-coil: In some species, WIV is appended to an N- or C-terminal coiled-coil, for example in Dougjudy virga-like virus RNA2 ORF1 (panel A), or in Wiseana iridescent virus gp049 (panel C);
- Next to a double-stranded RNA-binding domain (dsRBD): WIV is wedged between a dsRBD domain and a capsid domain in some Picornavirales (panel A), and is located upstream of a dsRBD domain in a Ghabrivirales (panel D);
- Downstream of other types of domains: WIV is found at the very C-terminus of some proteins, such as Tospovirus NSs (panel B).
The SPD-like domain (panel D) has been discovered in this study, like the WIV domain (see main text). For cypoviruses (panel D), the segment that encodes each protein is indicated between brackets (e.g. S8 is segment 8).
The wide variety of domain contexts in which WIV occurs clearly shows that it is both structurally and functionally independent, justifying its name of “WIV”, for “Widespread, Intriguing, Versatile” domain.
The distribution of WIV suggests extensive horizontal transfer. In some cases, WIV is found only in a single species within an order, e.g.
Hangzhou frankliniella intonsa flavivirus 1 (
Figure 2A);
Fushun monolepta xinmovirus (
Figure 2B
); and
Hainan sediment Toti-like virus 9 [
28] (
Figure 2C). However, there are 5 large taxa in which all species encode a WIV domain. One is the genus
orthotospovirus. The 4 other taxa are currently unclassified; their member species are presented in
Table 2. These taxa are:
1) An unclassified taxon comprising Brandeis virus [29], distantly related to the families Virgaviridae and Kitaviridae, part of a larger group called “invertebrate virus group A” in a recent article [30];
2) An unclassified taxon comprising Dougjudy virga-like virus [31], also related to Virgaviridae and Kitaviridae;
3) The proposed family Acypiviridae [32];
4) An unclassified taxon related to Solenopsis invicta virus 7 [33], close to the family Acypiviridae (see Figure S1h in [34]).
Table 2.
Unclassified viral taxa that contain at least 5 species, all of which encode a WIV domain.
Taxon | Virus species in the taxon
Brandeis virus group | Brandeis virus, Beult virus, Muthill virus, Bofa virus, Marsac virus, Buckhurst virus, Hubei virga-like virus 18, Broome virga-like virus, Hubei virga-like virus 19, Zeugodacus cucurbitae negev-like virus, Erysiphe necator associated virga-like virus 2
Dougjudy virga-like virus group | Dougjudy virga-like virus, Hangzhou merodon fulcratus virga-like virus 1, Leuven Virga-like virus 1, Virga-like virus 21, Atrato virga-like virus 6, Atrato virga-like virus 7, Hammarskog virga-like virus, Erysiphe necator associated virga-like virus 1
Family Acypiviridae (proposed in [32]) | Acyrthosiphon pisum virus, Darwin bee virus 7, Hangzhou solinvi-like virus 2, Grapevine-associated RNA virus 1, Hubei picorna-like virus 55, Hubei picorna-like virus 56, Aphis citridus picorna-like virus, Rosy apple aphid virus, Changjiang crawfish virus 6, Lasius picorna-like virus 7, Electric ant virus 1
Solenopsis invicta virus 7-like group, closely related to the proposed family Acypiviridae (see Figure S1h in [34]) | Solenopsis invicta virus 7, Apis-Picorna-like virus 5, PNG bee virus 9, Hangzhou sesamia inferens solinvi-like virus 1, YCA-associated virus-like sequence 8 [34], HVAC-associated RNA virus 1, Apis picorna-like virus 3, Bundaberg bee virus 8, Milolii virus, Lasius picorna-like virus 9
The genome of 9 viral species contains an unannotated coding sequence that has significant similarity with the WIV domain, detectable by using the software tblastn (see Methods). These species comprise
Muthill virus,
Bofa virus,
Buckhurst virus and
Marsac virus [
35],
Beult virus [
29],
Ceratitis capitata Negev-like virus 2 [
36],
Atractomorpha sinensis Negev-like virus 1 [
37],
Solenopsis invicta virus 8 [
33], and
Bat tymo-like virus (Genbank accession number NC_030844). We included the corresponding WIV domain of these viruses in the alignment presented in
Supplementary File S1. These overlooked coding sequences are short (~300 nucleotides) and almost all are located at the very 3’ end of the genome, suggesting a bias of genome annotators against annotating short, 3’ coding sequences. Interestingly, in another species,
Ek Balam virus [
38], the 3’ end of the genome also contains an unannotated coding sequence with significant similarity to WIV, but it is interrupted by a stop codon.
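To illustrate why short coding sequences at the 3′ end are easy to overlook, and how a stop codon can split one into two shorter pieces, the following Python sketch translates a nucleotide sequence in its three forward frames and lists the stop-free stretches above a minimum length. It is not tblastn and does not perform any similarity search; the function names and the default threshold of 80 codons are assumptions chosen for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nt: str) -> str:
    """Translate complete codons; unknown codons become 'X', stops become '*'."""
    nt = nt.upper().replace("U", "T")
    return "".join(
        CODON_TABLE.get(nt[i:i + 3], "X") for i in range(0, len(nt) - 2, 3)
    )


def open_frames(nt: str, min_codons: int = 80):
    """Yield (frame, start_nt, protein) for stop-free stretches in forward frames."""
    for frame in range(3):
        protein = translate(nt[frame:])
        codon_index = 0
        for segment in protein.split("*"):
            if len(segment) >= min_codons:
                yield frame, frame + 3 * codon_index, segment
            codon_index += len(segment) + 1  # skip the stop codon as well
```

For a genome record, one would typically apply `open_frames` to the last few hundred nucleotides and then submit the resulting translations to a protein-level comparison.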
Finally, a WIV domain is found in over a dozen proteins annotated as coming from arthropod viruses, in particular thrips. Their sequence is included in the alignment from
Supplementary File S1.
The WIV domain is predicted to have a previously unknown fold composed of a 6-stranded β-sheet buttressed by 3 α-helices
We predicted the structure of the WIV domain (aa 29-129) of Lake Sinai virus ORF4 using Alphafold2 [
8]. The structure is presented in
Figure 3A. It is expected to be highly reliable (pLDDT = 0.95). WIV adopts a previously unknown fold, composed of a 6-stranded β-sheet buttressed by 3 α-helices (
Figure 3A). Its topology is presented in
Figure 3B: a long (~18 aas) helix (α1), followed by four β-strands (β1 to β4), by two helices (α2 and α3) and finally by two antiparallel β-strands (β5 and β6).
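The pLDDT values in this article are quoted on two scales (for example 0.95 here, and 92.8 or 86.8 for other models). A minimal Python sketch for putting them on a common 0-100 scale is given below. The band names follow the confidence categories commonly used for AlphaFold models (above 90, 70 to 90, 50 to 70, below 50); treating any value of at most 1 as a fraction, and including the lower bound in each band, are assumptions of this sketch.

```python
def plddt_percent(value: float) -> float:
    """Return a pLDDT on the 0-100 scale from either a 0-1 or a 0-100 input."""
    if not 0.0 <= value <= 100.0:
        raise ValueError(f"pLDDT out of range: {value}")
    # Assumption: values of at most 1 are fractions (0.95 means 95).
    return value * 100.0 if value <= 1.0 else value


def confidence_band(value: float) -> str:
    """Map a pLDDT value to a coarse confidence category."""
    score = plddt_percent(value)
    if score >= 90.0:
        return "very high"
    if score >= 70.0:
        return "confident"
    if score >= 50.0:
        return "low"
    return "very low"
```

With this convention, the Lake Sinai virus ORF4 model (0.95) falls in the "very high" band, whereas a model with a pLDDT of 86.8 falls in the "confident" band.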
A) 3D structure of the WIV domain (aa 29-129) of Lake Sinai virus ORF4, predicted by Alphafold2. B) Topology of the WIV domain. C) Residues corresponding to positions conserved across WIV homologs (see text and multiple sequence alignment in
Figure 4), visualized in a different orientation from those in panel A.
The WIV domain diverges considerably in sequence across viral taxa
Structure-based alignments are generally more reliable than sequence-based ones. Therefore, to generate an optimal alignment of the WIV domain, we first predicted the structures of two other representative WIV domains. We chose Tomato spotted wilt virus NSs and Acyrthosiphon pisum virus polyprotein as representatives, because they have numerous close homologs in the database, a prerequisite for a good prediction by Alphafold2. Accordingly, Alphafold2 predicted their 3D structure with high expected reliability (pLDDT = 92.8 and 91.3 respectively). The corresponding model coordinates are in
supplementary files S2 and S3. Next, we generated a sequence alignment of the WIV domains based on the superposition of the WIV domain of the 3 representatives (
Lake Sinai virus ORF4, Tomato spotted wilt virus NSs, and Acyrthosiphon pisum virus polyprotein), using Promals3D [
39]. The resulting structure-based alignment is shown in
Figure 4 (see
Supplementary File S1 for the alignment in text format). Only the N-terminal two thirds of WIV have meaningful sequence conservation (aa 37-99 in Lake Sinai virus); in the last third of WIV, conserved positions in the alignment correspond mainly to conservation of hydrophobicity.
Four aa positions are well conserved across homologs of the WIV domain. They are boxed and indicated above the alignment in
Figure 4:
1) Either a Q or a H (both large, polar aas) in most species, 9 aas after the start of the conserved region of WIV, in the middle of helix α1 (Q45 in Lake Sinai virus ORF4);
2) An A, or generally another tiny aa (G, S or C), 4 aas downstream of this conserved Q/H position (A49 in Lake Sinai virus ORF4);
3) An S, or generally a small aa, between strand β4 and helix α2 (S95 in Lake Sinai virus ORF4);
4) A Q, or generally a polar aa, 4 aas downstream of the conserved S position, in helix α2 (Q99 in Lake Sinai virus ORF4).
These residues are located towards the interior of the protein (
Figure 3C), and therefore, their conservation is probably due to structural reasons. In the WIV domain of Lake Sinai virus, the aa corresponding to conserved position 1, Q45, is facing the aa corresponding to conserved position 3, S95 (
Figure 3C). The aa corresponding to conserved position 4, Q99, is also located in the vicinity of these 2 aas (
Figure 3C). Finally, the residue corresponding to the conserved position 2, A49, is not visible in the orientation depicted in
Figure 3C, but is also located in the interior of the protein.
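The four conserved positions form a simple pattern: position 2 lies 4 aas after position 1, and position 4 lies 4 aas after position 3 (45/49 and 95/99 in Lake Sinai virus ORF4). The Python sketch below checks this pattern for one protein. The "tiny" class is the one given in the text; the "small" and "polar" classes are illustrative assumptions, since the text does not list their members, and the function name is hypothetical.

```python
from typing import Mapping

TINY = set("AGSC")              # listed in the text for conserved position 2
SMALL = TINY | set("TPDNV")     # assumption: one common definition of small aas
POLAR = set("QNSTHKRDEY")       # assumption: one common definition of polar aas


def matches_wiv_signature(residues: Mapping[int, str], pos1: int, pos3: int) -> bool:
    """Check the four conserved positions.

    residues maps a sequence position to its one-letter amino acid;
    pos1 is the position of the conserved Q/H, pos3 that of the conserved S.
    """
    return (
        residues.get(pos1) in {"Q", "H"}
        and residues.get(pos1 + 4) in TINY
        and residues.get(pos3) in SMALL
        and residues.get(pos3 + 4) in POLAR
    )


# Residues of Lake Sinai virus ORF4 at the four positions, as given in the text.
lake_sinai_orf4 = {45: "Q", 49: "A", 95: "S", 99: "Q"}
print(matches_wiv_signature(lake_sinai_orf4, pos1=45, pos3=95))
```

Such a check is far too permissive to detect homologs on its own; it is only a quick way to verify that an aligned sequence respects the pattern described above.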
Structure-based alignment of representative WIV domains (see text). Protein accession numbers are in
Table 1. For brevity, the term “virus” has been omitted in species names (e.g. “PNG bee 9” corresponds to “PNG bee virus 9”). Species names highlighted in bold are those for which experimental data regarding WIV are available. Aas substituted in the WIV domain of Tomato spotted wilt virus or of related tospoviruses are indicated in bold.
Proteins from several cypoviruses also contain a WIV domain, located at the C-terminus
During part 2 of the homology search (Figure 1, middle panel), we identified candidate homologs (i.e. marginal hits with 10⁻³ < E < 1000, from viruses infecting arthropods) in cypoviruses, both in Psiblast searches and in HHpred searches against the database of viral protein profiles Uniprot-SwissProt-viral70 (see Methods). Cypoviruses are double-stranded RNA viruses of the family
Reoviridae, which infect arthropods. Their genome consists of 10 to 12 segments, and they have a non-enveloped, icosahedral capsid [
40]. The cypoviral candidate proteins and the WIV domain had no significant sequence similarity (as seen using HHpred pairwise comparison), indicating that either 1) these candidates are not homologous to WIV, or 2) they are homologous but have diverged beyond identification by sequence-based methods.
Such divergent homologs can be identified by structural comparison instead. Therefore, to determine whether the cypoviral protein candidates were genuine homologs of WIV, we conducted structure-based homology searches, i.e. part 3 of our procedure (
Figure 1, bottom panel). We predicted the structure of the p44 protein of
cypovirus 1, containing a candidate WIV domain. Alphafold2 returned a reliable model of p44 (pLDDT = 86.8), predicting that it is composed of an N-terminal domain (aa 1-131), a variable linker (aa 132-277), and a C-terminal domain (aa 278-389) corresponding to the candidate WIV domain. The predicted 3D structure of this C-terminal domain is shown in
Figure 5 (middle panel). It has significant similarity with the structure of the WIV domain of Lake Sinai virus ORF4 (FATCAT E-value 3×10⁻⁵ with an RMSD of 3.29 Å), confirming that it is homologous to WIV. Note in
Figure 5 how the C-terminal domain of cypovirus 1 p44 has exactly the same arrangement of secondary structures as the WIV domain of Lake Sinai virus, in the same order. In conclusion,
cypovirus 1 p44 contains a divergent C-terminal WIV domain.
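The RMSD quoted above, and the notion of a common core of residue pairs lying within a distance cutoff, can be made concrete with a short Python sketch. It assumes that the two structures have already been superposed and that residues have already been paired, which is the difficult part performed by dedicated programs; it does not reimplement any of them, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def pair_distances(a: Sequence[Point], b: Sequence[Point]) -> List[float]:
    """Distances between paired atoms (e.g. C-alpha) of two superposed structures."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    return [math.dist(p, q) for p, q in zip(a, b)]


def rmsd(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Root-mean-square deviation over all paired atoms, in the input units."""
    distances = pair_distances(a, b)
    return math.sqrt(sum(d * d for d in distances) / len(distances))


def common_core(a: Sequence[Point], b: Sequence[Point], cutoff: float = 4.0) -> List[int]:
    """Indices of the pairs whose distance is below the cutoff (in angstroms)."""
    return [i for i, d in enumerate(pair_distances(a, b)) if d < cutoff]
```

Because the RMSD averages squared distances, a few poorly superposed loops can inflate it even when the core secondary structures match well, which is one reason why a statistical score is reported alongside it.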
A) WIV domain of Lake Sinai virus ORF4 (aa 37-127). B) WIV domain of cypovirus p44 (aa 278-389). C) Structural superposition, showing only common core regions (i.e. with no gaps, and aa distance < 4Å), identified with mTM-align [
41].
We applied to the WIV domain of
cypovirus 1 p44 the recursive homology search procedure described in
Figure 1, and thereby also identified a WIV domain in cypoviruses 4, 5, 14, 18 and 23 (
Table 3 and
Figure 2D).
Figure 6 presents a structure-based sequence alignment of cypoviral WIV domains (the alignment in text format is in
Supplementary File S6). Cypoviral WIV domains are very distant from each other, and the alignment contains no well-conserved position; only general physicochemical characters are conserved. The taxonomic distribution of WIV is mostly consistent with the phylogeny of cypoviruses [
42], in which
cypovirus 1,
cypovirus 18 and
cypovirus 14 are sister species, as are
cypovirus 5 and
Hubei lepidoptera virus 3. However,
cypovirus 23 is not closely related to these species [
42].
Structure-based alignment of the cypoviral WIV domains, based on the Alphafold models of
cypovirus 1 and
cypovirus 5 WIV. Conventions are the same as in
Figure 4.
Some cypoviral proteins contain a domain upstream of WIV related to the SPD domain of cypoviral capsid protein VP1
We attempted to map the domain organization of cypoviral proteins that contain a WIV domain. First, we identified regions with meaningful sequence similarity with known domains, using HHpred (see Methods). HHpred only identified one domain, PARG (Poly ADP-ribose glycohydrolase), in
cypovirus 5 P5, just upstream of WIV (
Figure 2D).
Second, we attempted to predict the structure of the remaining regions, using Alphafold2. It returned a reliable prediction (pLDDT = 0.89) for the domain located upstream of WIV in cypovirus 1 p44 (aa 1-131).
Figure 7 presents its structure, a helices/sheet/helices sandwich (its coordinates are in
Supplementary File S7). It has significant similarity (DALI Z-score 12.5) to the structure of the SPD domain ("Small Protrusion Domain") of
cypovirus 1 VP1, the major viral capsid protein [
45] (aa 828-859 of VP1, shown in
Figure 7B; PDB accession code 3IZX_C). The SPD domain of
cypovirus VP1 stabilizes the assembly of the viral capsid [
45], and accordingly mutations in it destabilize the capsid [
46]. The SPD domain has only been observed so far in the capsid of cypoviruses, which is composed of a single shell, unlike the capsid of other
Reovirales.
Owing to its structural similarity to the SPD domain, we called aa 1-131 of cypovirus 1 p44 an “SPD-like” domain. We could find no information regarding the function of the SPD-like domain, but in cypovirus 1 p44, it contains two experimentally verified glycosylation sites (aa N48 and N69), while a third one is located immediately downstream (aa N138), in the linker that separates the SPD-like domain from the WIV domain [
47].
Figure 7C shows the good superposition between the predicted structure of the SPD-like domain of cypovirus 1 p44 and the SPD domain of cypovirus 1 VP1 (FATCAT P-value 2.8×10⁻⁶, RMSD 3.21 Å over their whole length). The P8 protein of the closely related cypovirus 18 also contains an SPD-like domain, as indicated by a Psi-blast search.
Alphafold could also reliably predict the 3D structure of the N-terminal domain of
cypovirus 5 P5 (pLDDT =0.92; structure not shown, see
Supplementary File S8 for its coordinates), which also has significant structural similarity to the SPD domain of
cypovirus 1 VP1. The P5 protein of the closely related
Hubei lepidoptera virus 3 also contains an SPD-like domain, identifiable by Psi-blast. In contrast, Alphafold could not reliably predict the structure of the domain upstream of WIV in other cypoviral proteins, for lack of related sequences. Therefore, we could not determine whether they contain an SPD-like domain. In conclusion, the following proteins contain an SPD-like domain, located at the N-terminus:
cypovirus 1 p44,
cypovirus 18 P8,
cypovirus 5 P5, and
Hubei lepidoptera virus 3 P5.
Both the SPD and SPD-like domains are highly variable in sequence, even among species belonging to the same genus. For example, there is no detectable sequence identity between the SPD-like domain of cypovirus 1 p44 and that of cypovirus 5 P5. Likewise, we could only identify a domain with detectable similarity to the SPD domain of cypovirus 1 VP1 in VP1 of the closely related cypovirus 14. Thus, the fold of the SPD and SPD-like domains probably places few constraints on their sequence, allowing them to diverge very fast. Consequently, an SPD-like domain may be present in many more proteins than those in which we have detected it.
A) SPD domain (“Small Protrusion Domain”) of cypovirus 1 VP1 capsid protein (PDB accession code 3izx-C). B) SPD-like domain (aa 1-131) of cypovirus 1 p44, located upstream of the WIV domain. C) Superposition of both domains.
Altogether, cypovirus WIV is found at the C-terminus of a wide variety of structurally and functionally unrelated proteins (
Figure 2D), downstream of:
- an SPD-like domain in the p44 protein of cypovirus 1 and the related protein P8 of cypovirus 18;
- an SPD-like domain followed by a Poly ADP-ribose glycohydrolase (PARG) domain in the P5 protein of cypovirus 5 and Hubei lepidoptera virus 3;
- unknown domain(s) in the P8 protein of cypoviruses 4, 14 and 23.
WIV is probably a virulence factor that facilitates infections of arthropods, according to bibliographical information
Bibliographical information about function or gene expression is available only for 4 proteins containing a WIV domain: 1) p15, in Bombyx mori macula-like virus; 2) NSs, in tospoviruses; 3) 206R, in invertebrate iridescent virus 6; and 4) p44, in cypovirus 1. We present this information below.
The p15 protein of
Bombyx mori macula-like virus (BMLV, previously called
bombyx mori latent virus) is essentially a standalone WIV domain (
Figure 2A). BMLV was discovered in cell lines derived from the silk moth
Bombyx mori, which it persistently infects, accumulating at extremely high levels [
48]. It belongs to the family
Tymoviridae (proposed genus:
inmaculavirus [
49]). BMLV produces a protein, p15, which has homologs in the other members of the proposed genus
inmaculavirus (
bee macula-like virus 2 and
Nasturtium officinale macula-like virus 1), but not in the related genus
maculavirus. The p15 mRNA is highly expressed in BMLV-infected silkworm cells (much more than the capsid protein mRNA) [
50]. BMLV p15 is mostly located in the cytoplasm of infected
Bombyx mori BmN cell lines [
51] and is required to establish BMLV infections in silkworm cells [
50]. BMLV p15 had no RNA silencing suppressor activity in an Agrobacterium-mediated transient coexpression assay [
50].
Some functional information is also available for
tospovirus NSs. It contains a C-terminal WIV domain (aa 335-429 in
tomato spotted wilt virus, see
Figure 2B), preceded by a long (~300 aa) N-terminal domain. Among viruses encoding a WIV domain, tospoviruses are the only ones known to actively replicate both in arthropods and in plants, to which they are transmitted via thrips [
52,
53]. NSs is required for persistent infection and transmission by the thrips
Frankliniella occidentalis [
54]. NSs enhances baculovirus expression in various arthropod cell lines [
55] and increased baculovirus virulence in a caterpillar [
56]. In addition, NSs impairs RNA silencing in tick cells [
57] and in thrips [
58]. The NSs protein accumulated slower than the nucleocapsid (N) protein in primary cell cultures of thrips [
59]. NSs is found in the cytoplasm, as is BMLV p15 (see above), where it is uniformly scattered, both in thrips cell cultures and in thrips infected with
tomato spotted wilt virus, the type species of tospoviruses [
59,
60].
Despite this wealth of information on
tospovirus NSs, to our knowledge, there is no data indicating whether it is the WIV domain and/or other regions of NSs that are responsible for impairing RNA silencing in arthropods, or for enabling infection and transmission by arthropods. The reason for this lack of data is that studies of the function of NSs by targeted mutations have been carried out exclusively in plants. We will present these studies only briefly (for a review, see [
61]), since here we are mainly concerned with the role of WIV in arthropods.
On the one hand, NSs can enhance infection of plants (by impairing RNA silencing), and on the other hand, it can trigger resistance to infection in tomatoes [
62]. Most substitutions that abolished the RNA silencing suppressor activity of NSs in plants or its ability to trigger resistance were found in the N-terminal third, i.e. aa 1-133 [
63]. Yet several substitutions in the WIV domain also abolished the silencing suppressor activity of NSs, such as N355A/N356A [
63], or L413A [
64], in bold in
Figure 4. Likewise, several substitutions in the WIV domain abolished its ability to trigger resistance in plants, such as L396A/S397A [
63], in bold in
Figure 4, which includes a substitution to the conserved S position (boxed in
Figure 4). Both activities of NSs can be uncoupled: for example, another double substitution, S411A/Y412A (in bold in
Figure 4), preserved the RNA silencing suppressor activity of NSs but abolished its ability to elicit resistance in plants [
63]. Since the effect of these substitutions was only tested in plants, and not in arthropods, presenting them in further detail is beyond the scope of this study.
Besides the function of NSs, some mutational studies have also investigated its stability and multimerization. Substitutions within helix α1 of the WIV domain of watermelon silver mottle virus NSs abolished self-interaction of NSs but not its RNA silencing suppression activity [
65]. In another study, the stability of NSs was decreased by substitution by an alanine of aa Y398 of watermelon silver mottle virus (corresponding to Y394 in NSs of tomato spotted wilt virus (strain Br20) - in bold in
Figure 4), located in strand β4 [
66]. Thus, WIV may be required for multimerization and stability of NSs.
Gene expression data are available for invertebrate iridescent virus 6 (also called "chilo iridescent virus"; genus
iridovirus), in which the protein 206R is essentially a standalone WIV domain (
Figure 2C). 206R belongs to the “immediate-early” class [
67], i.e. is among the earliest viral genes expressed. Interestingly, the gene 206R is among the ones most targeted by small interfering RNAs [
68]. This would be consistent with a counter-defense mechanism that prevents expression of 206R and of its WIV domain. The protein 206R has not been detected in virions [
69].
Some information is also available for two cypoviral proteins containing a WIV domain. Cypovirus 1 p44 (also called NSP2) is a non-structural protein that plays a central role, since it interacts with most other viral proteins and drives the formation of viroplasms (i.e. cytoplasmic structures thought to be one of the main sites of viral replication) during infection [
43]. P44 is localized close to intracellular membranes and to the endoplasmic reticulum [
43]. Whether the C-terminal WIV domain of p44 contributes to viroplasm formation or binding to other viral proteins is unknown. In silkworms infected by cypovirus 1, the expression level of the mRNA encoding p44, i.e. segment S8, was lower than that of the main capsid protein VP1, encoded by segment S1 [
70]. This is in contrast to the p15 mRNA of bombyx mori macula-like virus, whose level of expression was much higher than that of the capsid (see above).
Finally, Cypovirus 5 p61, which also contains a WIV domain, is a structural protein, i.e. it can be detected in virions [
44].
We note that WIV is located near a known or putative RNA-binding domain in proteins with 3 types of organization, which may suggest a functional association with RNA binding:
1) in the polyprotein of various
Picornavirales, in which WIV is located immediately downstream of a predicted double-stranded RNA-binding domain (dsRBD, see
Figure 2A);
2) in the polyprotein of an
artivirus (family
Totiviridae) in which WIV occurs instead immediately upstream of a dsRBD domain (
Figure 2D);
3) in
cypovirus 1 p44, in which the region encompassing aa 104-201 binds single-stranded, but not double-stranded RNA [
71]. This region corresponds mainly to the variable linker (aa 132-277) upstream of WIV (
Figure 2D), but also encompasses a short part of the SPD-like domain (aa 1-131). The RNA binding of cypovirus 1 p44 is not sequence-specific (it might be mediated by the negative charge of the linker, highly enriched in glutamic acid).
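The statement that the RNA-binding region of cypovirus 1 p44 lies mainly in the linker can be checked with simple interval arithmetic, sketched below in Python. The residue ranges are those given in the text; the variable and function names are hypothetical.

```python
from typing import Dict, Tuple

Span = Tuple[int, int]  # inclusive residue range (first aa, last aa)

# Predicted regions of cypovirus 1 p44 and its RNA-binding region, from the text.
P44_REGIONS: Dict[str, Span] = {
    "SPD-like": (1, 131),
    "linker": (132, 277),
    "WIV": (278, 389),
}
RNA_BINDING: Span = (104, 201)


def overlap(a: Span, b: Span) -> int:
    """Number of residues shared by two inclusive ranges."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


shared = {name: overlap(RNA_BINDING, span) for name, span in P44_REGIONS.items()}
print(shared)  # {'SPD-like': 28, 'linker': 70, 'WIV': 0}
```

Of the 98 residues of the RNA-binding region, 70 fall in the linker and 28 in the SPD-like domain, while none overlap the WIV domain.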
However, WIV is by no means systematically appended to an RNA-binding domain; it is also found as a standalone domain in numerous species (
Figure 2). In fact, since RNA-binding activity is relatively common in proteins, its association with WIV in some taxa might be coincidental.
3. Discussion
Altogether, our results show that a domain of ~90 aa, WIV, is found in proteins of over 20 taxa of viruses infecting arthropods. In particular, WIV is encoded by tospoviruses, which infect plants through arthropod vectors, and are a global threat to food security [
72]; by
chronic bee paralysis virus, a widespread virus of honey bees [
73,
74]; and by cypoviruses [
40], which infect the silkworm
Bombyx mori, of economic importance for the production of silk.
According to bibliographical evidence, WIV is most probably a virulence factor, which enables infection of arthropods. There are no obvious common points between the proteins that contain a WIV domain and for which experimental information is available (see the last paragraphs of the Results, above). For example, in silkworms infected by cypovirus 1, the expression level of the mRNA encoding p44 was lower than that of the main capsid protein VP1 [
70]. In contrast, the level of expression of the p15 mRNA of Bombyx mori macula-like virus was much higher than that of the capsid [
50]. However, the proteins p15 of Bombyx mori macula-like virus [
51], tospovirus NSs [
59,
60], and cypovirus 1 p44 [
43] have at least one point in common: they are found in the cytoplasm during infection.
The domain organization of proteins containing WIV provides support to our predictions
WIV is extremely divergent in sequence across taxa, and its 3D structure is only a model. Nevertheless, two arguments provide strong support to our predictions:
- First, the reliability estimate (pLDDT) provided by Alphafold2 has been proven to be accurate: a predicted structure with a pLDDT≥0.90 is expected to be competitive with an experimentally determined structure [
8]. The Alphafold2 structure for the WIV domain of Lake Sinai virus ORF4 has a pLDDT of 0.95, and should therefore be close to the actual structure;
- Second, the boundaries of the WIV domain frequently correspond exactly to an unassigned protein region between two known domains (or between a known domain and the extremity of a protein). For example, in the ORF1 of the virus
wpk049shi07 [
75] (Genbank accession QKE55054.1), related to
Sinhaliviridae, the WIV domain, spanning aa 1-113, is immediately followed by a 2A “StopGo” sequence (aa 127-139) (our observations; see
Figure 2A, top). Such sequences (also called “Stop-Carryon”) mediate ribosome skipping during translation, which separates two proteins, akin to a cleavage, but without requiring a protease [
76]. Their core motif is DxExNPGP, and the proteins are separated between the penultimate G and the final P (respectively G134 and P135 in the sequence of ORF1). Therefore, in this virus, the WIV domain should be found essentially as a standalone domain (with a short C-terminal extension, aa 114-134), providing strong biological support to our prediction.
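The motif-based reasoning above can be illustrated with a short Python sketch. It is only a minimal illustration, not the analysis performed here: the function name `find_stopgo_sites` and the toy sequence are hypothetical, and the regular expression encodes only the core motif DxExNPGP, ignoring the upstream sequence context that also contributes to a functional 2A sequence.

```python
import re

# Core 2A "StopGo" motif DxExNPGP, where x is any residue.
# The lookahead allows overlapping occurrences to be reported as well.
STOPGO_CORE = re.compile(r"(?=(D.E.NPGP))")

def find_stopgo_sites(protein_seq):
    """Return a list of (motif_start, last_upstream_aa, first_downstream_aa).

    All positions are 1-based. The two products are separated between the
    penultimate G and the final P of the motif.
    """
    sites = []
    for match in STOPGO_CORE.finditer(protein_seq.upper()):
        start = match.start(1) + 1   # 1-based position of the D
        gly = start + 6              # penultimate G: end of upstream product
        pro = start + 7              # final P: start of downstream product
        sites.append((start, gly, pro))
    return sites

# Toy sequence (hypothetical; NOT the ORF1 sequence of the virus discussed above)
toy = "MKTAYLLDVESNPGPMSE"
for start, gly, pro in find_stopgo_sites(toy):
    print(f"motif at aa {start}-{pro}: upstream product ends at G{gly}, "
          f"downstream product starts at P{pro}")
```

Applied to a real polyprotein, the reported glycine position gives the expected C-terminus of the upstream product, which can then be compared with the predicted domain boundaries.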
Contextual information holds enormous untapped power for homology search in viruses
We presented here a procedure (
Figure 1) to identify extremely distant homologs by harnessing contextual information (namely taxonomy and infected host). This procedure enabled us to identify a previously overlooked domain, WIV, present in nearly a hundred species of arthropod viruses (the full list is in
Supplementary File S1). We had already used this procedure on several occasions to detect distant homologs [
4,
16,
17] but had never formally presented it. It is based on the idea that it is extremely difficult to
find distant homologs using sequence-based searches, but that given a candidate homolog, it is easy to
confirm whether it is homologous to the query. This confirmation can be done by comparing the sequence profiles of close homologs of the query and of the candidate, using HHpred pairwise comparison [
10].
An example, beyond WIV, will illustrate the power of our procedure: it has enabled us to identify homologs for all 4 proteins of
chronic bee paralysis virus initially annotated as "orphan" (i.e. devoid of recognizable homologs) upon sequencing of the viral genome [
77]. We had previously identified homologs for 3 of these orphan proteins [
4]. First, we had discovered that the protein encoded by RNA1 ORF1, closely related to the N-terminal domain of the
nodavirus replicase, was homologous to the
alphavirus methyltransferase. This homology has since been experimentally confirmed by structural studies [
78]. Second, we had discovered that the protein SP24, encoded by RNA2 ORF3, had homologs in a large group of viruses infecting plants and/or insects, related to
Virgaviridae and
Kitaviridae. Third, we had discovered that the predicted glycoprotein encoded by RNA2 ORF2 had homologs in the same group of viruses. Here, we discovered that the 4th orphan protein of
chronic bee paralysis virus, encoded by RNA2 ORF1, is essentially a standalone WIV domain (
Figure 2A). Thus, our procedure enabled us to identify homologs for all 4 orphan proteins of a phylogenetically isolated virus.
Methods that rely on contextual information to identify homologs have been presented elsewhere (e.g. [
20,
21,
22]), but our approach combines two original elements which account for its power:
1) We focused on viruses, in which contextual information is particularly powerful. Indeed, viruses tend to encode far fewer proteins than other organisms, making a weak similarity between two viral proteins much more meaningful than for other organisms. In addition, some proteins are frequently found exclusively in certain types of viruses, making a weak similarity between two proteins belonging to this type of virus particularly meaningful (see Introduction).
2) We looked for candidate homologs among marginal hits up to extremely high E-values (E=1000), far beyond the cutoff of statistical significance (E=10⁻³) of Psi-blast and HHblits. By comparison, E=10 is the traditional cutoff below which marginal hits are presented on the web-based version of most homology detection software; and the highest E-value that we could find in the literature up to which marginal hits were examined by a homology detection method is E=100 (for the method MorFeus [
79]). Using instead a relaxed cutoff of E=1000 enabled the detection of divergent WIV domains that otherwise could not have been detected, such as that of
Feksystermes virus [
80], an isolated member of the
Picornavirales.
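The selection of candidates among marginal hits, as described in the two points above, can be sketched as follows. This is a minimal Python illustration under stated assumptions: the `Hit` record, its field names, the example hits, and the representation of context as sets of taxa and host types are all hypothetical, and do not correspond to the output format of Psi-blast or HHblits. Only the two E-value thresholds (10⁻³ and 1000) come from the text. Candidates retained in this way would still need to be confirmed, for instance by profile-profile comparison.

```python
from dataclasses import dataclass

SIGNIFICANCE_CUTOFF = 1e-3   # hits below this value are already considered homologs
RELAXED_CUTOFF = 1000.0      # upper E-value limit up to which marginal hits are examined

@dataclass
class Hit:
    # Hypothetical record: real search output would have to be parsed into this form.
    accession: str
    evalue: float
    taxon: str
    host: str

def candidate_homologs(hits, expected_taxa, expected_hosts):
    """Keep marginal hits whose taxonomy or host type matches the query's context."""
    candidates = []
    for hit in hits:
        is_marginal = SIGNIFICANCE_CUTOFF <= hit.evalue <= RELAXED_CUTOFF
        in_context = hit.taxon in expected_taxa or hit.host in expected_hosts
        if is_marginal and in_context:
            candidates.append(hit)
    # Most similar candidates (lowest E-value) first
    return sorted(candidates, key=lambda h: h.evalue)

# Made-up hits, for illustration only
hits = [
    Hit("hypothetical_1", 1e-10, "Picornavirales", "arthropod"),  # significant, not marginal
    Hit("hypothetical_2", 35.0, "Picornavirales", "arthropod"),   # marginal, in context
    Hit("hypothetical_3", 420.0, "Caudovirales", "bacteria"),     # marginal, out of context
    Hit("hypothetical_4", 850.0, "unclassified", "arthropod"),    # marginal, host matches
]
selected = candidate_homologs(hits, {"Picornavirales"}, {"arthropod"})
print([h.accession for h in selected])
```

With a cutoff of E=10 or E=100 instead of `RELAXED_CUTOFF`, the last hit in this toy example would never be examined, which is the situation described above for the most divergent WIV domains.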
Using contextual information to complement homology searches is necessary in viruses, despite the progress of structure prediction
It seems paradoxical that sequence-based homology searches should still be necessary despite the tremendous progress of structure prediction [
7,
8,
81] and of structure-based comparison [
9]. Yet at least in the medium term, many viral homologs will remain inaccessible to structure-based homology search, for 5 main reasons:
1) Viral proteins tend to have a more loosely packed structure, with fewer secondary structure elements, and more structural disorder, than their homologs in cellular organisms [
82]. Thus, structural predictions are expected to be poorer for viral proteins;
2) There are few viral proteins with a solved 3D structure compared to bacterial or mammalian ones, perhaps because viral proteins are notoriously harder to produce and purify. As a consequence, the training datasets of Alphafold2 and of other methods comprise proportionally fewer viral protein structures (although more would be required, in view of their different characteristics, as outlined above). This most probably decreases these methods’ performance on viral proteins (although this has not yet been tested);
3) Viral proteins diverge extremely fast in sequence compared to proteins from other organisms, and thus for viral proteins, fewer close homologs are generally available (with the exception of well-sampled taxa such as
betacoronavirus, the genus of SARS-CoV-2). Yet close homologs are required both to enable prediction by methods that rely on sequence alignments (such as Alphafold2 [
8]), and to train methods that do not rely on alignments (such as ESMfold [
83]). As a consequence, structural predictions are expected to be poorer for proteins from poorly sampled viruses than for proteins from other organisms (although again, this has not yet been tested);
4) Alphafold2 intrinsically fails to predict the structure of a number of proteins that have a stable 3D fold [
84]; for these proteins, only sequence-based methods have the potential to identify distant homologs (this is true for all organisms, and not only for viruses);
5) Finally, at the time of writing, the Alphafold database does not include viral proteins (although it includes proteins from almost all other organisms) [
6], making it impossible to identify protein homologs in viruses only by using structural comparisons.
Clearly, it would be necessary to automate the procedure presented here, since it is too time-consuming to be applied to the enormous number of orphan viral proteins being discovered. Automatically incorporating contextual information into sequence-based homology search might require scoring the contribution of each piece of contextual information (taxonomy, type of host infected, etc.). Automated approaches will also need to be adapted to viral polyproteins, which raise special challenges, as they are often composed of many domains, which decreases the accuracy of homology searches. Such approaches are being developed (e.g. LAMPA [
85]), and will need to be combined with an exploitation of contextual information for maximum efficiency.
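As a purely illustrative sketch of what scoring contextual information might look like, the Python snippet below combines sequence similarity with weighted contextual matches. Everything in it is an assumption made for illustration: the scoring formula, the feature names, the weights (arbitrary placeholders, not estimated from any data), and the example candidates. It is not a method proposed or tested here.

```python
import math

# Hypothetical weights: arbitrary placeholders, not estimated from any data.
CONTEXT_WEIGHTS = {"same_taxon": 2.0, "same_host_type": 1.5, "similar_length": 0.5}

def context_score(evalue, context_matches, weights=CONTEXT_WEIGHTS):
    """Add a bonus for each matching contextual feature to -log10(E-value)."""
    similarity = -math.log10(evalue)
    bonus = sum(weights[name] for name, matched in context_matches.items() if matched)
    return similarity + bonus

def rank_candidates(candidates):
    """Sort (name, evalue, context_matches) tuples by decreasing combined score."""
    return sorted(candidates, key=lambda c: context_score(c[1], c[2]), reverse=True)

# Made-up candidates, for illustration only
candidates = [
    ("hit_A", 10.0, {"same_taxon": False, "same_host_type": False, "similar_length": True}),
    ("hit_B", 100.0, {"same_taxon": True, "same_host_type": True, "similar_length": False}),
]
for name, evalue, ctx in rank_candidates(candidates):
    print(name, round(context_score(evalue, ctx), 2))
```

In this toy example, the hit with the weaker sequence similarity ranks first because its context matches that of the query. In practice, the weights would have to be calibrated on known homologs, and the top-ranked candidates would still require confirmation.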