1. Introduction
The microsporidian phylum, an evolved branch of the rozellids [
1] includes over 1,700 species divided into more than 220 genera infecting an extremely diverse range of hosts protists to all major animal phyla [
2]. Microsporidia are vastly represented in aquatic environments and food webs. Numerous species can also be found infecting animals of veterinary and economic importance such as bees or silkworms [
3]. In addition, some microsporidian species may be involved in human diseases with 17 species belonging to 10 genus [
4] that have been described as leading to severe syndromes predominantly in immunocompromised patients. This includes acquired immunodeficiency syndrome (AIDS) patients but also patients that have undergone organ transplants and been treated with immunosuppressive drugs [
4,
5,
6,
7].
With the advent of next-generation sequencing (NGS) technologies, the systematic sequencing of microsporidian genomes has been undertaken and over the last few decades, genomic data has rapidly been accumulating with more than 50 genomes now available [
8]. The study of these genomes has allowed researchers to highlight the consequences of the distinct evolutionary patterns in this parasitic group. Indeed, due to their adaptation to obligate intracellular parasitism, « the genomes of Microsporidia are under strong selective pressures which conduct them to present specific characteristics. Thus, microsporidian genome sizes are highly reduced, with some species having reduced their genomic content to potentially the lowest limit required for life [
9]. The human infecting species
Encephalitozoon intestinalis with 2.3 Mbp harbors the smallest microsporidian genome sequenced [
10]. The genomic reduction in microsporidia is not just limited to a massive loss of genes as it also affects the gene length. For example,
Encephalitozoon cuniculi Coding DNA Sequences (CDSs) are on average 15% shorter than their yeast orthologs [
11]. This CDS size reduction also leads to the presence of around 8.5% of the CDSs with a size under 300 nucleotides [
12,
13]. Another consequence of microsporidian gene compaction is the removal of intronic sequences. Introns with reduced size may remain in a small number of genes however, but this is not common to all species thus making their prediction more difficult [
13,
14]. Microsporidian genes also present a strong compaction of the 5' and 3'UTRs (UnTranslated Regions). In some cases, the 5’UTR can even be absent, and the transcribed mRNAs begin with the translation initiation codon [
12,
15,
16,
17,
18,
19]. This high reduction of 5’UTR length seems to be an advantage for the identification of transcriptional regulation signals that are therefore localized near the translation initiation codon. Numerous studies have shown that these signals are highly conserved amongst microsporidian species and consensus sequences have been defined [
12,
13,
18]. Thus, CCC-like or GGG-like signals are located upstream of the Translation Initiation Site (TIS). In genomes with a low G+C content, these signals can often be replaced by a strong A+T-bias (more than 90%) near to the TIS. A final feature affecting the size of microsporidian genomes is the shortening of intergenic regions which are essential for transcription as they contain promoters and enhancers (Jespersen et al 2022).
Microsporidian genomes are characterized by a high rate of sequence evolution which has induced difficulties in positioning these microorganisms in the phylogenetic eukaryotic tree [
1]. The identification of orthologous genes between microsporidian species has also proved to be challenging even for genes crucial to their infectious process [
20] or DNA repair [
21]. While genomes tend to present extreme reduction, one exceptionally large microsporidian genome of 51 Mbp has been reported in the mosquito parasite
Edhazardia aedis. This is mainly due to the expansion of Transposable Elements (TE) families in the genome of this microsporidian species. The high rate of sequence evolution in the microsporidian phylum also concerns TEs, making them hard to identify using comparative approaches alone with sequences from other organisms [
22,
23].
Rapid and cost-effective next-generation sequencing (NGS) technologies have produced, and are still producing, numerous new microsporidian genome sequences. After genome assembly, efficient genome annotation represents a crucial step to point out all biological processes that govern the life of these microorganisms. Computational genome annotation has become one of the principal research areas in computational biology [
24]. However, the microsporidian genome features previously described turn out to be the pitfall of classical methods for producing accurate prediction of complete gene repertoires. Until now, microsporidian genomes have been annotated using ab initio protein predictions that were based primarily on the detection of Open Reading Frames (ORF) using various generalist software such as GeneMarkES, Augustus [
25,
26] or Glimmer, that could be combined with the detection of CCC- and GGG-like motifs found in close proximity of microsporidian TISs [
13,
27,
28]. Such signals significantly improve prediction of translational initiation codons which are otherwise defined as the first AUG codon of the studied ORF. In addition, extrinsic data, such as the one available from orthologous gene sequences has also been intensely used to carry out structural annotation [
29]. Due to the high rate of sequence evolution in the microsporidian phylum, such approaches require an optimization of the comparison tool parameters (e.g. with BLAST). These parameters also need to ensure unambiguous prediction of TEs as much as possible. Finally, small protein-coding genes are often overlooked during structural annotation due to their shortness, lack of sequence conservation, and/or lack of known functions [
30].
To address the challenging question of microsporidian genome annotation, we developed a dedicated annotation pipeline called MicroAnnot. MicroAnnot ensures gene prediction as well as structural and functional annotation. Firstly, using curated databases of validated proteomes from four microsporidian species, an extrinsic approach has been implemented using the BLAST tools [
31] with optimized parameters to identify divergent or small orthologous sequences. The results were also exploited to predict potential translation initiation codons that were further validated using upstream transcriptional signals. Secondly, using these newly predicted genes as training set for the Glimmer tool [
32], an ab initio sequence annotation was carried out and the potential translation initiation codons for the newly identified sequences were once again validated using transcriptional signals. All predicted CDSs were then used as queries against a microsporidian specific TE database. The identification of rRNAs was also done by a comparative approach using a database containing small subunit (SSU) rRNA sequences representative of all the microsporidian phylogenetic diversity and tRNA detection done using tRNAScan-SE [
33]. MicroAnnot has been tested on four microsporidian genomes and it yielded an annotation of higher quality on all the evaluated criteria (specificity, sensitivity, translational initiation codon prediction, small gene characterization and TE identification). Furthermore, functional annotation using the InterProScan tool [
34] can also be included in the result files as either GENBANK, EMBL or GFF annotation files.
3. Discussion
Although the correct annotation of genomes over the last decade has led to the development of increasingly innovative and specific approaches that consider the characteristics of each studied genome, this initial step in the
in silico exploration of genomes is still often a source of error. Indeed, gene prediction is a complex process, especially in eukaryotes, and the prediction of all the genes in an organism is never 100% correct. A benchmark study of ab initio gene prediction methods in diverse eukaryotes using the most widely used gene prediction programs has shown that all these programs harbour numerous strengths but also various weaknesses [
36]. These authors also concluded that ab initio gene structure prediction is a very challenging task, which should be further investigated [
36]. For prokaryotic species, whose genome organization is close to that of Microsporidia, many common CDS predictors failed to identify the complete gene catalogue because some genes features fell outside the defined rules such as non-standard codon usage, overlapping genes and small genes [
37].
Due to their characteristics, the structural annotation of microsporidian genomes to define their genetic potential can quickly prove to be a real challenge. To take these constraints into consideration, we have developed a tool dedicated to the annotation of these particular genomes. The annotation carried out using the MicroAnnot tool on the four benchmark genomes gives particularly conclusive results in terms of specificity and sensitivity, while conventional softwares used for the annotation of different microsporidian genomes (
E. cuniculi ; Glimmer prediction [
11],
E. intestinalis, BLAST procedures [
10],
N. ceranae ; Glimmer [
38] and
E. bieneusi ; FunGene and Glimmer3 [
39]) do not offer predictions of the same quality given several badly predicted or even non predicted genes which can add up to as much as 10% of the total genes of the studied species [
13].
Achieving complete genome annotation based on ab initio predictions can be particularly effective when it comes to model organisms but the results in terms of sensitivity and specificity can drop for non-model species [
36,
40]. Annotations obtained using comparative methods such as the alignment of protein sequences against predicted CDS, give precise and reliable results [
41]. However, due to high rates of sequence evolution in microsporidian sequences [
42], comparative approaches are difficult to implement with these species: the correct alignment of orthologous sequences from different microsporidian species and the parameters of the comparative tools all need to be optimised. To provide high quality alignments with the proteome sequences but also with the small gene sequences, the parameters of the BLAST software were modified, notably with the use of the BLOSUM45 matrix. This « deeper » matrix provides very sensitive similarity searches but also produces alignment overextension into less homologous regions such as N-terminal regions [
43]. The alignment of the N-terminal regions has been used by MicroAnnot to unambiguously determine gene TISs. Indeed, correct recognition of this initiation codon is crucial in gene prediction to highlight gene structure and its product [
44]. Nevertheless, due to high rate of sequence evolution, local alignment with BLAST software may stop before reaching the methionine defining the N-terminal end of the sequence. In this case, MicroAnnot considers the length of the homologous database sequence and the position of the BLAST local alignment to propose a potential translation initiation site in the query sequence. Validation of predicted TISs can be achieved by evaluation of the ATG context [
45]. The search for the Kozak sequence cannot be applied to microsporidian genes because the mRNAs are characterized by highly reduced or even absent 5’ untranslated regions (UTR) and only a bias in +4 position for an adenine or a guanine residue has been described [
18]. However, 5’ UTR size reduction represents an advantage because the transcription regulatory signals are in close proximity to the translation initiation codon. Characterized CCC-like or GGG-like signals, or a strong adenine/thymine-rich sequence (approximately 90%) upstream of TISs are conserved within all microsporidian genomes [
13,
18,
46,
47] and their identification allows to unambiguously support TIS prediction. The search for these signals using the MicroAnnot tool proved particularly relevant by ensuring the correct prediction of more than 70% of the previously mis-predicted TISs. In the case of
E. cuniculi, more than 23% of TISs [
12,
13,
18] had been badly predicted during the first annotation of this genome not using detection of such signals [
11]. The annotation obtained using MicroAnnot, without considering the reference Encephalitozoon species, presents only around 20% of mis-predicted TISs (
Table 1). This value drops to less than 2.5% for the annotation of the
E. intestinalis genome, carried out using the
E. cuniculi’s proteome as a reference. It should also be noted that of the 32 badly predicted TISs in this species, 12 are contiguous to the putative correct ones (
supplementary data 2). This comparative approach coupled with the identification of the correct translation initiation codon also proves relevant for identifying potential short introns found mainly within genes coding for ribosomal proteins [
14,
18,
48].
The CCC- and GGG-like signals have been successfully used for the annotation of the gene TISs in different microsporidian species such as
Ordospora colligata [
47] or
N. ceranae [
28]. These signals also proved relevant to ensure the characterization of small microsporidian genes (CDS size < 300 nt) which are relatively frequent due to the general reduction of CDS sizes in microsporidia when compared to their orthologs in other fungal species [
11]. Despite their number, these small genes are often misreported by generalist gene predictors. Advances in high-throughput technologies have highlighted an emerging world of proteins composed by small open reading frame-encoded micro-peptides [
49]. Based on comparative approaches, the MicroAnnot tool proved particularly effective in ensuring the annotation of genes that had been ignored during the initial annotation of the four genomes studied in this work. These missing genes mostly correspond to small CDSs. For
N. ceranae, 296 new genes have been identified and 184 of them harbour a CDS smaller than 300 nt in length. Most of these new genes were previously highlighted by Pelin et al. who used an in-house script that combines Glimmer's ab initio gene prediction algorithm, and CCC- and GGG-like motifs found in close proximity to microsporidian transcription initiation sites [
28] reinforcing the relevance of our approach. Differences in genome sizes between microsporidian species are essentially due to the presence of TEs [
22,
50]. Unfortunately, gene predictors can predict CDS in these TEs which can thus lead to an incorrect estimation of the number of genes in microsporidia [
13]. The first step in structural annotation involving exhaustive identification of repetitive elements is still challenging [
24]. Despite their prevalence and importance, TE sequences remain poorly annotated and studied in almost all model systems [
51]. Functional annotation of many microsporidian CDSs revealed that they contained specific TE ORF domains and motifs (see for example product description of
Dictyocoela muelleri species [
52] in MicrosporidiaDB [
53]. In addition, the identification of large multigene families among which some members have a low percentage of similarity with TE sequences show difficulties to identify certain TE families within microsporidian genomes [
13]. Thus, MicroAnnot includes a specific module based on a comparative approach with well predicted TEs to scan all predicted CDSs. This method allows a fine TE detection and finally 693 predicted CDS were in fact TEs for the
N. ceranae species.
Although structural annotation methods based on sequence homology are very effective, they are closely dependent on the presence of orthologous genes in the queried databases used for annotation. This is not a problem if we consider the pangenome but it is more problematic for the core genome [
54] especially for organisms such as microsporidia that can be found in multiple ecological niches and therefore present variable gene contents [
23,
55]. So, to ensure the annotation of new genes, the implementation of an ab initio method is needed and the Glimmer tool which was used for the annotation of several microsporidian genomes was selected [
27,
28,
56,
57]. For the best possible prediction, the dataset used for the construction of the Glimmer model must be perfectly reliable. The sequences included in the composition of this dataset are produced during the comparative annotation approach carried out by the MicroAnnot tool. This approach uses as references, genes whose annotation has been validated manually and, in some cases, experimentally [
12,
13]. Hence, these sequences ensure the extraction of unambiguous CDSs with translational start sites well defined and validated by the presence of transcriptional regulatory signals in the upstream region. Once the CDSs are predicted by Glimmer, their position in the genome is evaluated and it makes it possible to eliminate wrongly predicted genes by overlap search. This overlap is notably responsible for poor prediction of 20% and 28% of genes in
E. bieneusi and
N. ceranae respectively [
13]. Incorrectly predicted genes may also result from sequencing errors leading to frameshifts. The comparative approaches utilized by MicroAnnot limit these erroneous gene predictions because the potential frameshifts are also evaluated. Frameshift characterization can only be done during the comparative approach step, and their identification is directly correlated to the reference proteomes available to the MicroAnnot tool and highlights the importance of integrating additional reference genomes more representative of the phylogenetic diversity of microsporidia for better identification (see below). This type of errors is however less frequent with the improvement of sequencing approaches, base calling algorithms [
58], and third generation techniques [
59,
60]. The comparative analysis implemented in MicroAnnot also allows the identification of pseudogenes that may be present in microsporidian genomes [
61]. The MicroAnnot tool also mis-predicted some genes. All these genes however, were predicted during the ab initio annotation step with the Glimmer tool. This is likely linked to Glimmer specificity concerns previously described [
36]. However, for
N. ceranae, 71% of incorrectly predicted genes (27 out of 38) harbour a “warning” giving the possibility to the user to invalidate these predictions. As for the initial annotation of the
E. bieneusi genome, sequences of low complexity are annotated as CDSs during the ab initio approach, but this number drops from 89 to 18 with the MicroAnnot tool while these annotations are carried out with the Glimmer tool in both cases.
Despite the progress made in developing increasingly efficient tools for genome annotation, the process requires manual curation to produce the most reliable results [
24,
62]. Furthermore, genome annotation is unfortunately not 100% accurate and needs to be updated regularly to take advantage of the new knowledge from comparative genomics, transcriptomics, proteomics, and metabolomics, continuously generated on the organisms under study, and more generally on all organisms, for annotation using sequence similarity search approaches for example. However, computer analysis methods have led to high levels of erroneous annotations which, when used, spread throughout international databases [
63]. The MicroAnnot tool, while significantly increasing the sensitivity and specificity of predictions and reducing the number of incorrectly predicted TIS, is not yet 100% effective. For this reason, we plan to update it constantly, particularly in regard to the content of the databases used for comparative approaches. Today, the tool includes four reference proteomes, but this number will need to be increased by implementing others available and validated proteomes, thus enabling the representation of the entire microsporidian diversity. Meanwhile the sequencing of new microsporidian genomes, the annotation of genomes available in international databases and microsporidiaDB [
53] for which transcriptomic data has also been produced to validate their annotation could rapidly be integrated in MicroAnnot for the annotation by the comparative approach. The annotation of these genomes would also provide new sequences for the databases used by the software. Following the integration of a new reference genome, a check of the existing data should be systematically carried out, as this may enable errors to be corrected. The objective is not to propagate annotation errors, but to correct them over time by adding new sequences. The annotation of each microsporidian genome is therefore no longer fixed but is a dynamic process enabling regular re-annotation [
64].
Increasing the number of reference genomes would be crucial also for developing and integrating a specific module for the characterization of non-coding RNAs. High-throughput sequencing technologies such as RNA-seq have largely shed light on the world of ncRNA regulators [
65], some of which have been systematically identified within microsporidian genomes using RNA-seq data [
66,
67,
68,
69]. Many methods predict ncRNA using sequence-derived features alone and they are difficult to apply to all species, especially microsporidia, which have a high rate of sequence evolution. To ensure the annotation of such ncRNAs, a synteny-driven “all-versus-all" BLASTN approach could be implemented following the addition of new genomes. This approach has previously been used to annotate the U1 small nuclear RNA [
70], almost 15 years after the initial sequencing of the
E.cuniculi genome [
11].