3.1. Construction of Repetitive Elements reference
To obtain a comprehensive view of REs in Decapoda and reduce the number of elements classified as “unknown”, we developed a standardized protocol to annotate TEs and satDNAs at the genomic level (see Methods and
Figure 1). This pipeline integrates the consensus sequences of the Arthropoda section of the RepBase database and the combination of
de novo identification of REs by RepeatModeler2 and the TAREAN pipeline of all species to generate an extensive library of consensus sequences. The TAREAN pipeline was used to specifically identify satDNAs. Due to their structure and high sequence homogeneity, satDNAs are extremely difficult to assemble and are often excluded from the assembly [
12]. This is why we search for satDNAs in Illumina raw reads paired-end sequences using the TAREAN pipeline to construct the “Satellite libraries”. By using the TAREAN pipeline, we retrieved between 0 and 43 satDNAs families annotated as “High fidelity” while RepetModeler2 identified only 0 to 4 satDNAs families (
Table 2).
Using our new developed pipeline, we identified between 3 643 and 11 431 families of REs in the different assemblies, including between 7.25% and 33.37% of “unknown” sequences (
Table 2). Unknown elements are repetitive sequences that couldn’t be further classified. The lowest percentage of unknown elements is observed in Dendrobranchiata species. This can be explained by the presence of the annotated TEs of the Dendrobranchiata
Penaeus vannamei in RepBase, allowing a better identification in closely related species.
All detected REs were renamed according to the RepBase nomenclature. In fact, the REs classification by Wicker et al., (2007) [
35] is widely used but new TEs have been characterized since the establishment of the classification in 2007, resulting in conflicts in TEs databases. Kojima (2019) brings new clarity to the classification of the RepBase database [
40], but TE annotations can differ between RepBase, RepeatModeler2 database, and DFAM due to capital letters or multiple naming of the same element, for example. A manual correction of repeat names was thus applied when needed in order to obtain a clear annotation.
All libraries generated by RepeatModeler2 and the TAREAN pipeline, and RepBase were merged into a single library. This extensive database contains a total of 71 601 sequences including sequences from RepBase. Among these families, known TEs represent 31 579 sequences. With this new merged library, we considerably extended the number of annotated families compared to what exists in RepBase database of Arthropoda REs. Indeed, RepBase provides consensus sequences of 13 906 repetitive elements in Arthropoda, including 109 satDNAs. These elements are distributed in 218 Arthropoda species and in Eukaryota or Metazoa common ancestors. However, only 16 crustacean and 6 Decapoda species are represented, with respectively 1419 and 328 sequences. Moreover, most decapod sequences (320) are from a single species, P. vannamei. This shows the lack of knowledge of REs in Decapoda species in established databases. Our work also extended the number of known satDNAs families in Decapoda species, with 405 consensus sequences compared to the 109 presents in RepBase. The new REs identified in this study are provided in supplementary materials (Supplementary File S1). Well categorized REs have also been submitted to RepBase.
3.2. Annotation of Repetitive Elements in Decapoda Genomes
With our new extensive database, we performed two rounds of annotation using RepeatMasker. In the first round we only used known TEs in order to have a better characterization and reduce the proportion of unknown, and in the second we used all the remaining REs. We identified between 6 805 and 31 798 consensus sequences of REs in the different assemblies (
Table 2). This represents an increase of approximately 16 500 families on average in Decapoda compared to previous annotation and 6 500 for the other crustaceans. Moreover, our standardized protocol successfully classified the type of REs previously unclassified for most species (now between 4.40% and 24.15%). This represents a considerable improvement over the results obtained with species-specific databases often used.
Once considering all the satDNAs families annotated in the genome with the merged library, we annotated between 11 and 109 different families (previously 10 to 40 using the species-specific strategy, Table 3). The Astacidea and Anomura infraorders have the higher number of satDNA families, going from 92 to 109, except for
H. americanus and
B. latro. The last two species have a number of satDNA families more similar to the other Decapoda species, with respectively 61 and 59 satDNA consensus sequences. The high number of satDNA families detected in Astacidea and Anomura is in agreement with the 258 families detected in the crayfish
Pontastacus leptodactylus [
5]. The diversification of satDNA families in Astacidea and Anomura is remarkable compared to what is observed in other species.
Drosophila species have for example in general less than 10 different families within their genomes, and human have 9 [
20,
49]. However, a large number of satDNA repeats has already been found in Arthropoda such as in
Triatoma infestans (42 families, genome size 1.4 Gb) [
50], the
Locusta migratoria (62 families, genome size 6 Gb) [
18], the morabine grasshoppers (129 families, genome size 5 Gb) [
51], and in the fish
Megaleporinus microcephalus (164 families, assembly size 1.2 Gb) [
52]. It is to note that our results may still underestimate the real number of satDNA families due to the fragmentation of available assemblies. In fact, some satDNA families identified by the TAREAN pipeline in Illumina reads weren’t retrieved in the genome assembly. It is likely that the missing satDNAs were contained in reads that were not included in the final assembly.
Interestingly, the number of families of REs is correlated with both estimate genome size and assembly size (
Table 1) with a Spearman rank correlation test of ρ=0.83, p-value=8.925E-8 and ρ=0.92, p-value=1.146E-6 respectively. The same can be observed with satDNA families, with respectively Spearman rank correlation test of ρ=0.84, p-value=6.875E-08 and ρ=0.90, p-value=3.83E-10. It reveals the diversification of RE families in larger genomes.
The strategy used in this study increased the knowledge of REs in Decapoda species and provides an extended library that can be used in future studies (Supplementary File S1). Unfortunately, there are still a high number of unknown REs in some of the annotated genomes. A manual curation of the library would be necessary but was beyond the scope of this study. We also want to mention that, due to high presence of REs, genome assemblies are often fragmented, preventing the exhaustive annotation of TEs that can be absent from the assemblies or split and contained by two contigs. The study of Sproul
et al., (2022) on more than 600 insect species showed the influence of sequencing technology on repeat detection, with long reads assemblies containing 36% more repeats than short-reads assemblies with huge impact on LTRs detection [
53]. This is because assemblies based on long reads are often more contiguous [
54,
55]. In our case, most of the genomes were assembled using long reads or a combination of long and short reads, and short read assemblies do not stand out concerning repeat content or diversification.
3.3. Proportion of Repetitive Elements in Decapoda Genomes
The RE proportions are variable both between and within phylogenetic clades of the analysed species. The proportion of REs in studied Arthropoda genomes is above 40%. Exceptions are two Decapoda species,
Cherax quadricarinatus and
Caridina multidentata, with respectively 38.73% and 39.02% of repeat content, and the non-Decapoda
Hyallela azteca with 26.12% (
Figure 2). Compared to the Decapoda species, which have an average of 59.7% REs in their genomes, the non-Decapoda crustaceans analysed in this study exhibit a lower proportion of REs, averaging at 46.4%. However, it is important to note that
Armadillidium vulgare stands out among non-Decapoda studied, as it has a remarkably high percentage of repeats (76.26%). If
A. vulgare proportion is excluded, the average of REs in non-Decapoda drops to 40.4% and the difference is significant with Wilcoxon p-value=0.0074. Within Decapoda species, Anomura presented an especially high percentage of REs with on average 73.6%. Indeed, the Anomura specie
Paralithodes platypus has the highest proportion of REs among the studied species with 78.89%. On the contrary, the genome with the lowest percentage of repeats was the non-Decapoda
Hyallela azteca with 26.12%. So, the REs proportions were highly variable among the phylogenetic clades, as was the content of REs categories.
Indeed, we could also see a variability in the content of REs within suborders. Among Decapoda, Dendrobranchiata exhibited only half the amount of LINEs than Pleocyemata, for which it could go up to 35.3% in the Astacidea C. destructor. Dendrobranchiata was characterized by a high proportion of DNA transposons as for A. vulgare with between 13% and 18% of DNA transposons. Anomura infraorder has the highest percentage of LTRs with more than 16% and the Achelata P. ornatus has the lowest, with 3.24%. SINEs elements were low in all genomes, going from 0.02% in H. Azteca to 2.54% in P. trituberculatus. DIRS elements in general contributes to less than 1% of the repeat content in almost all genomes, major exception was for M. nipponense where DIRS represented 8.84%. This species also has the highest proportion of Penelope elements with 5.18%. The second infraorder that has the highest number of Penelope elements was Astacidea, with a mean of 2.3%. Lowest numbers of unclassified elements were present in the Dendrobranchiata suborder with around 3.5%, probably because of the better characterization of REs in this suborder in the original RepBase database with the almost exclusive presence of annotations derived from P. vannamei. This is why less closely related species present a higher proportion of unclassified elements, such as E. affinis with 24.15%. The content variability shows that the different suborders of studied crustaceans species has specific major REs present in their genomes.
If we refer to REs study of Decapoda species included in assembly publications, the proportion of REs could vary from 8% to 82% [
56,
57,
58,
59,
60,
61,
62,
63,
64,
65,
66,
67,
68,
69,
70,
71]. Tan
et al., (2020) annotated repeatome of 8 decapod species and estimates repetitive content between 27% and 50% with the majority of the genomes having more LINEs, except for
Penaeus vannamei that has more DNA transposons [
60]. Compared to these studies, with our pipeline we annotated approximately 10% more repeats in the genomes. For
P. virginalis genome, 8.8% of repetitive elements were retrieved in the assembly Pvir0.4 (GenBank accession: GCA_002838885.1) and 27.52% in the study of Tan
et al., (2020). However, in the assembly DKFZ_Pvir_1.0 (GenBank accession: GCA_020271785.1), the new assembly version used in this study, we annotated 57.87% of repetitive elements [
59,
61]. In the assembly of
P. clarkii, Xu
et al., (2021) annotated 82.42% of repeats, while in our study we annotated only 71.26%. For
P. platypus genome, we annotated similar results to Tang
et al., (2021) [
72]. However, the percentage of LINEs and LTRs increase of almost 10% each, while unknown TEs dropped to 17%. REs percentage in
E. sinensis was determined as 40.5% and 61.42% in two different studies [
61,
73], here we determined that repetitive elements represent 58.93% of the genome. This shows that our method provided greater or equal proportion of REs but with a better characterisation.
The Decapoda species studied here all showed high proportions of REs ranging from 58 % to 79%. They are in the upper range of what is generally observed in Arthropoda. Indeed, comparative studies carried out on arthropods (mainly based on insects) report highly variable proportions of TEs ranging from 1% to 80% [
53,
74,
75]. We can expect even higher proportions of REs with the forthcoming sequencing of giant genomes in Decapoda or other Crustacea. Recently, the assembly of the Antarctic krill (belonging to a sister order of Decapoda) demonstrated that 92% of its genome is constituted of REs, 78 % of them being TEs, indicating that Arthropoda can have incredibly high amount of REs [
76]. In terms of TE landscape, Decapoda presented only a few SINE elements as for all Arthropoda. Previous studies in Dendrobranchiata species reported that the most abundant groups of repeats were, disregarding SRR, DNA transposons or LINEs with different results depending on the used bioinformatic tools [
53,
74,
75]. Here we showed that DNA transposons were the major subclass in all Dendrobranchiata species, followed by LINEs. This is similar to what is observed in most insects species, where DNA transposons are generally the major TE group present in genomes [
53,
74,
75]. Interestingly, our results revealed a different situation in the studied Pleocyemata species where LINE and LTRs elements are more abundant. This can be compared to what is observed in some insect orders exhibiting a different TE composition: LTRs are more abundant in Diptera species, and Odonata and Orthoptera species are richer in LINE elements [
53,
74]. The change in the major type of REs between suborders suggests an altered strategy for genome stability maintenance and regulation of REs between suborders. Sproul
et al., (2022) demonstrated that LINE-rich species lineages present many REs that are associated with protein-coding genes. Such associations suggest consequences regarding phenotype evolution. The presence of a TE near a gene can lead to methylation changes, indeed, it already has been showed that LINEs can serve as amplifier for the silencing away from X-chromosome inactivation center, and LINEs and SINEs for gene imprinting [
34,
77]. The movement of a LINE, or other TE, at a new genomic locus can so have an impact on nearby gene expression and at end reshape networks of gene expression and impact genome evolution.
3.4. Correlation between Genome Size and Repetitive Elements
The 20 Decapoda species analysed in the present study has large differences in genome size estimations (1.6 Gb to 8.5 Gb). This difference was also evident in assembly sizes, although less pronounced (1 Gb to 4.8 Gb). The variability of the genome sizes raised the question of the contribution of REs to their host genome. After masking each genome, we calculated the load of REs, i.e., the number of copies of REs and TEs only, the percentage of REs and TEs only, and we searched for a correlation with both assembly size and estimate genome size. The assembly size was both positively correlated with the load (ρ=0.87, p-value=1.864E-06) and the percentage of TEs (ρ=0.6, p-value=0.0015) (
Figure 3A,B). The estimated genome size (
Table 2) was positively correlated with the load of TEs (ρ=0.62, p-value=0.0007) while there was no significant correlation with the percentage of TEs (ρ=0.47, p-value=0.014) (
Figure 3C,D). Even though the number of satDNA families was correlated with both assembly size and estimated genome size, when including satDNA elements the significance of correlation decreases between load of REs and genome/assembly size (
Figure S1). The correlation between percentage of REs with both assembly and estimated genome size were not significant, with α=0.005 (
Figure S1).
For the first time in Decapoda species, assembly size and load of TEs are demonstrated to be strongly correlated. This reveals the consequent impact of the number of REs in the size of the assembly. The correlation of percentage of TEs or REs is more often analysed than the load. In our study the proportion of TEs was less significantly correlated than the load of TEs, and REs was not correlated with genome size. Like in our study, Petersen et al., (2019) found a positive correlation between percentage of TEs and assembly size, however, they also found a positive correlation between percentage of TEs and estimate size that was not retrieved in our study. Moreover, Sproul et al., (2022) found a positive correlation between REs proportion and assembly size not confirmed in our study. This can be explained by the fragmentation of the genomes analysed due to the difficulties to assemble REs: REs can either be excluded from the assembly although present in the genome and cannot be annotated, or they can be fragmented indicating that a part of the RE is not included in the assembly and so can contribute to the load of REs in the genome but not to the percentage. This is the case for satDNAs that are often concatenated since the assembler cannot define how much repetitions are present if they are not entirely covered by a long read. This explains the decrease or absence of the significance of the tests when including satDNAs. It is therefore expected that the load is rather correlated with the assembly size than with the estimated size.
3.6. Diversity of Repetitive Elements
We determined the number of copies (the load) of each superfamily of REs identified for each genome to see the diversity of REs (
Figure 5). With 67 superfamilies of TEs present in at least one species, the majority of the known superfamilies of REs were found in the investigated genomes, as seen in insects [
74], and appear highly conserved across all the genomes (
Figure 5). Among the Decapoda genomes studied, there was a clear pattern of high and low presence of repeat superfamilies, with only a few distinct variations between species by repeat suborder.
The load of REs of each repeat superfamily was then used as a profile for each genome to construct the dendrogram by clustering of the REs profiles. This dendrogram mainly followed the currently known species phylogeny [
4] except for
A. vulgare, whose RE proportions and composition were more similar to Decapoda (
Figure 2) and two Anomura species that together grouped with the Caridea. The genome of
A. vulgare (1.6 Gb) was larger than the other crustaceans analyzed in this study (238 Mb – 1 Gb) with the highest percentage of repeat among the non-decapoda crustaceans species studied. This may explain why
A. vulgare is clustered with Decapoda species and not with other crustaceans. Nevertheless, we could see a clear differentiation between Decapoda species and the other Crustaceans that have a lower number and a distinct composition of REs, except for
A. vulgare. Similarly, we could clearly distinguish Dendrobranchiata from Pleocyemata infraorders with the presence of LINE
ingi and SINE
MIR. Within Pleocyemata, Caridea were also separated from the other Reptantia species in agreement with established phylogeny [
4]. Many studies, including Petersen
et al., (2019), Sproul
et al. (2022), Wu and Lu (2019), based their RE analysis on already published phylogenetic tree. In our study, we clustered the repetitive profile of each genome and obtained a phylogenetic signal that respects major classification (
Figure 5) [
1]. In fact, REs have been recently used as evidence for phylogenetic trees construction in plants, with RE abundance resolving species relationships in a similar manner to DNA sequences from plastid and nuclear ribosomal regions [
78,
79]. This can be explained by the capacity of some REs to have a high conservation and synteny within species [
80,
81,
82]. This approach could therefore be used in the future to determine the phylogeny of non-model species using low-coverage, low-cost sequencing.
3.7. Sequence Divergence Distribution of Transposable Elements
The genetic distance between each annotated TE copy and the consensus sequence of the respective TE family was calculated using Kimura 2P distance in order to analyse sequence divergence distribution and approximate the age and intensity of duplication events (
Figure 6). The peak distribution shows the genomic coverage of TE copies according to their divergence from their family consensus estimated using the Kimura distance. A peak indicates that a group of TE copies has undergone an evolutionary event in the genome leading to the expansion of these elements. This event is more recent if the peak is located at a low Kimura distance. At a high Kimura distance, a wide peak can indicate that TE copies have undergone genetic drift or other processes leading high sequence divergence suggesting ancient expansion event.
In Dendrobranchiata, sequence divergence landscapes were similar for the five species. We could observe two very similar peaks. The first one presented a larger number of LTRs and smaller augmentation of LINEs elements between 10% to 15% of divergence. The peak of LTRs was particularly high in
P. japonicus and
P. indicus. At the same time point, we could see an increasing amount of DNA transposons with the same distance to the consensus in
P. monodon. A longer time ago, an augmentation of DNA transposons and LTR elements around 25% of divergence was shared by all species. This suggests that all the Dendrobranchiata were sharing the same old evolutionary events. The
P. monodon genome was one of the few analysed Decapoda genomes showing a recent peak of SINEs elements with the two
Procambarus species. We would therefore expect to see a higher proportion of SINEs in
P. monodon compared to other genomes. However, SINE elements were only slightly more abundant in this genome due to a higher presence of SINE MIR elements (
Figure 2 and
Figure 5). Interestingly, the content of repeat showed that DNA transposons are the most widespread among the suborder (
Figure 2). However, DNA transposons expansion was older and more spread out over time. On the contrary, the landscape and diversity of repeats showed higher peak of LTRs elements over time in the suborder compared to the other species, with Gypsy being the most abundant (
Figure 5 and
Figure 6). There were almost no sequences with a low divergence suggesting that TEs are not transcriptionally active in these genomes.
The two Caridea species presented a different sequence divergence landscape. In
C. multidentata, there was a recent peak of unknown elements between 5% to 10% of divergence. This peak could be caused by the expansion of one or several families of unknown TEs. We could also see that from high divergence, the fraction of the genome increased as the Kimura distances decreased. This trend could be seen until the event at 5% to 10% of divergence. After this event, and more recently, the number of TEs with really low divergence decreased, with almost no TEs at 0% of divergence. This suggests that despite the peak of recently active unknown elements, TEs are not active anymore for this species. For
M. nipponense we could observe two recent peaks at 1-4% and 10% of Kimura divergence corresponding to LINE, Penelope and LTR elements for the first one and DIRS for the second one. We could see integrated virus expansion between 5% and 25% of divergence. This was in accordance with the diversification of repeats (
Figure 5), where we could see that
M. niponnense genome was the Decapoda with the highest amount of integrated virus. The presence of sequences with little divergence with the consensus sequences suggests that TE are active in this genome.
Within Astacidea,
H. americanus has a different TE landscape compared to the other four species belonging to the infraorder. Indeed, the genome has a high peak at a divergence of 15% of unknown elements. Interestingly, we could observe an ancient event concerning integrated viruses at 40% to 45% of Kimura distance. The
H. americanus genome was the only Decapoda genome studied here presenting this characteristic. Integrated virus couldn’t be seen in the proportion of repeats because of their low presence in genomes and was included in category other REs (
Figure 2). Integrated virus in
H. americanus sequences correspond to the white spot syndrome virus [
83], suggesting that
H. americanus faced this virus long time ago and these sequences were then propagated. In
H. americanus genome, there was a clear increase of LINEs, LTRs and DNA transposons coverage with a low percentage of divergence, which leads us to conclude that TEs are still active in this genome. TEs are also active in
Procambarus species that has a similar landscape with several elements at a low divergence and especially LINEs. We could also see an augmentation of Penelope and SINE elements at low divergence for both species. In
P. clarkii there was also a small peak at 10% of divergence of unknown elements. Contrary to the TEs in
C. quadricarinatus, TE seems active in
C. destructor, with an increase of LINEs at low divergence. The expansion of LINEs in
C. quadricarinatus was, instead, more ancient, at 6% to 10% of divergence.
In Brachyura all genomes seemed to have active TEs, but TEs landscapes across the genomes of this infraorder differ from each other. In P. trituberculatus, the LINEs with no divergence with consensuses were three times more abundant than LINEs at 1% of divergence. These LINEs were in a really active phase in this genome. Penelope elements were also more abundant at 0% of divergence. C. sapidus genome showed an almost constant increased coverage of TEs with lower divergence for all elements. However, we could see an increasing number of LTRs with no Kimura divergence and a decreasing number of LINEs and DNA transposons. The genome of E. sinensis was the only Brachyura genome presenting two peaks. The oldest one was at 15% of Kimura distances and was caused only by unknown elements. The latest event involved LINE, LTR and unknown elements at a divergence between 0% and 7%. C. opilio was the less active genome of Brachyura concerning TEs. We could observe a large peak between 0% to 20% of divergence where LINEs and LTRs increase. The proportion of DNA transposons also increase during this time, but at a lesser coverage.
Concerning the last two infraorders, in Achelata, the
P. ornatus genome has a middle age peak at 15% of divergence corresponding to LTRs. There was also a recent and high peak, around 4-8% of divergence, caused by the expansion of LINE elements with 2% of the genome being represented by LINEs that are 6% divergent. This suggests that LINEs were, until recently, highly transcriptionally active in the genome but are now inactive. The high presence of LINE elements was also visible when considering the proportion of repeats in the genome (
Figure 2). In Anomura, intragroup with the highest percentage of LTRs within Decapoda (
Figure 2),
B. latro and the
Paralithodes species has very different landscapes.
B. latro genome seemed to have inactive TEs, with two peaks of LTRs and LINEs at 3% and 15% of Kimura distance. On the other hand,
Paralithodes species has impressively active LINEs and LTRs with 6.8% and 3.6% of LINE elements without divergence to consensuses in respectively
P. platypus and
P. camtschaticus. Finally, for other crustaceans the amount of unknown elements in their genomes was predominant, making the analysis of the divergence distribution of TEs in their genomes of difficult interpretation (
Figure S2).
A clear differentiation between Dendrobranchiata and Pleocyemata species could be observed in sequence divergence distribution such as seen with the proportion and diversity of repeats. Indeed, Dendrobranchiata has more non-transcriptionally active TEs compared to the majority of Pleocyemata. Among all Pleocyemata species studied here, almost all have at least one or more type of active TEs. The expansion of a particular subfamily of RE increases genome plasticity and can indicate periods of rapid evolutionary changes [
14,
33]. This suggests that Pleocyemata genomes have a rapid evolution on recent timescale. Genomes with recent accumulation of repeats present highly similar repeats or type of repeat that can be long (mostly LTRs and LINEs). These long repetitive regions are more difficult to assemble and so repeats resolution during assembly is even more problematic [
84]. Indeed, we could argue that a large number of the genomes studied presented recent accumulation of long REs. These long REs, being difficult to assemble, can be a possible explanation of assembly fragmentation. Moreover, species with larger genomes size tend to have more transcriptionally active TEs but also more REs.