1. Introduction
Bees are a monophyletic lineage within a clade Anthophila in the superfamily Apoidea [
1]. There are around 20,000 different species of bees, and thanks to their effective pollination for both crops and natural flora, bees are now essential parts of practically all terrestrial ecosystems [
2,
3]. Bees can be broadly divided into two functional groups based on floral specificity: oligolectic species, which forage on one or a small number of closely related plant species, and polylectic species, which collect nectar and/or pollen from many different plant species [
4]. The most well-known bee species include the polylectic honey bees (
Apis spp.) and bumble bees (
Bombus spp.). Numerous biological and ecological investigations of the polylectic species have been conducted [
2,
5]. In contrast, oligolectic bees have received significantly less research despite making up a sizable fraction of the world's bee fauna [
6]. For oligolectic bee exploitation and conservation, therefore, a greater understanding of their biology and ecology is required.
Andrena Fabricius (Andrenidae) is a large bee genus of around 1,600 species with a wide distribution mainly throughout the Holarctic. They are a crucial pollinator in both natural and agricultural contexts, and they are a particularly important aspect of northern temperate ecosystems [
7].
Andrena species exhibit a spectrum of diet breadth, from polylectic to oligolectic, which makes this genus a superb group to study the evolution of diet specialization [
8,
9]. Oil-tea (
Camellia oleifera) is an important woody oil crop in many countries, including China, the Philippines, India, Brazil and South Korea [
10]. This plant only blooms in November and December [
11], when there are few wild pollinators present. As a result, crop yields are very limited due to significant pollinator constraint [
12,
13]. Local farmers have attempted utilizing domestic honey bees (
Apis mellifera and
Apis cerana) to increase pollination efficiency, however, both species are badly damaged by the nectar's toxicity [
14]. Interestingly, some wild bees such as
Andrena camellia and
Colletes gigas primarily rely on oil-tea nectar to survive, suggesting these species have coevolved to become an expert at oil-tea [
15,
16,
17].
Direct observation on floral visiting and microscope examination on pollen from pollen basket showed that
A. camellia near exclusively collects nectar and pollen from oil-tea blossom [
12].
A. camellia emergences in the middle of October and keep activities (mating, oviposition, and larval development) mainly during November and December, well synchronizing the blossom period of oil-tea tree [
11,
12]. Besides
A. camellia, at least three other
Andrena bees (
A. chekiangensis,
A. hunanensis and
A. striata) are also reported visiting oil tea flowers [
18,
19].
A. chekiangensis is much larger than
A. camellia, this makes it easier to identify between the two species. Our field observations find that,
A. chekiangensis is strictly not an oil-tea specialist, because it also frequently visits tea tree blossoms (
Camellia sinensis). In some sympatric habitats of
C. oleifera and
C. sinensis,
A. chekiangensis individuals were more frequently observed in
C. sinensis blossoms. In contrary,
A. hunanensis and
A. striata are almost indistinguishable from
A. camellia in terms of morphology features and feeding habits (oil-tea specialization). It should be noted that
A. hunanensis and
A. striata are frequently mistakenly classified as
A. camellia in many amateur fieldwork reports. Due to the morphology similarity among
A. camellia and its close relatives, the researchers believe there may even be hidden unknown species that have yet to be discovered.
Understanding the poisoning and detoxifying processes used by various bee species could be crucial to enhancing oil-tea flower pollination. According to chemical tests, the primary toxins affecting western honey bees are the galactose derivatives: raffinose, manninotriose, and stachyose [
20,
21]. Typically, these oligosaccharides must be broken down in two steps: first, the alpha-galactosidine linkages are hydrolyzed by galactosidases to produce sucrose and galactose, and then the galactose molecules are converted to UDP-glucose via the Leloir route [
22,
23,
24]. It should be noted that little is known about the genes involved in the metabolism of galactose derivates in
Andrena species. Here, we present the first next-generation sequencing of
Andrena species consuming oil-tea nectars. By combining these data with genomic information of other
Andrena species from GenBank, we performed bioinformatic analyses of genes involved in the metabolism of galactose derivatives. The objective is to determine whether the oil-tea specialized
Andrena species differ from the other
Andrena species in terms of evolutionary specialization.
2. Materials and Methods
All pollinators were live-trapped from oil-tea blossoms in Jiangxi, Anhui, Sichuan, Zhejiang, Guangdong, and Hunan Provinces, China (
Figure 1). In order to pick out
Andrena samples the specimens were identified based on morphological characteristics [
18]. One female sample for each species was chosen for genome sequencing. Total genomic DNA was extracted from the thorax of each individual using the QIAGEN DNeasy Blood and Tissue kit (Germany), following the manufacturer's protocols. DNA libraries with ~350 bp insertions were constructed and were then sequenced with both directions of 150-bp reads using the Illumina HiSeq 2000 sequencing platform (Illumina Inc. San Diego). Quality control for raw reads data was performed using fastp 0.20.0 with default settings and parameters [
25].
The clean reads were used for de novo assembly with MEGAHIT [
26]. The mitochondrial COI sequences were extracted by local blast using NCBI-BLAST+ program v2.13.0 [
27] and were used as queries to search in BOLD system (
www.boldsystems.org) to determine their taxonomic information. Previously published six
Andrena genomes downloaded from GenBank were used as comparison objects:
A. dorsata, GCA_929108735.1;
A. fulva, GCA_946251845.1; A. haemorrhoa, GCA_910592295.1;
A. hattorfiana, GCA_944738655.1;
A. minutula, GCA_929113495.1;
A. bucephala, GCA_947577245.1. All 13 mitochondrial coding sequences in the genomes sequenced in this study and those from GenBank were extracted and were concatenated to reconstruct phylogenetic trees using IQ-TREE [
28].
The galactose metabolism pathway of Hymenoptera in KEGG (ko00052) was used to identify candidate genes involved in galactose derivatives metabolism. With honey bee (
A. mellifera) candidate genes as query sequences, the exonerate program v2.4.0 [
29] was used to find homologous genes in
Andrena genomes. MEGA v10 [
30] was used to combine and align the coding sequences from all samples for each gene. DNasP v6 [
31] was used to identify the genetic variation information for each gene. The sequence similarity information among the genes as well as their translated protein sequences are calculated using clustal omega [
32].
The molecular evolution analyses were performed in PAML program package [
33]. The branch model was used to estimate the
dN/dS (nonsynonymous / synonymous mutation) ratios of the foreground clade (oil-tea specialized species). The likelihood ratio tests (LRTs) between M0 (null model) and the branch model were performed by comparing twice the difference in log-likelihood values (
2ΔlnL) against a chi-square distribution (
df = 2). We also used the branch-site model called Model A to test for positive selections in the foreground clade oil-tea specialized species. The null model for Model A is Model A1, which is a modification of Model A, but with
ω2 = 1 fixed [
34,
35]. The putative positive selection sites were deduced with Bayes Empirical Bayes (BEB) analysis. It should be noted that, the seven non-specialized
Andrena species lacked
NAGA-like gene. Because of the high sequence similarity between
NAGA and
NAGA-like genes, we arbitrarily used
NAGA genes of these species instead.
In order to analyze the relative expression level of each gene, we also carried out RNAseq sequencing for
A. camellia and
A. chekiangensis, representing specialized and non-specialized oil-tea pollinators, respectively. Total mRNA was isolated from whole specimens of each individual (4 individuals were analyzed for each species). Following the manufacturer's instructions, 150 bp reads were sequenced bidirectionally by the Illumina platform (Illumina, San Diego, CA). The obtained clean reads of a randomly selected individual of each species were used for de novo assembly using the Trinity program [
36]. The transcripts were processed by CD-HIT-EST [
37] to remove the redundant sequences, and the generated unigenes were then used to predict coding sequences (CDSs) with the GeneMarkS-T program [
38].
Orthologous genes were identified using OrthoFinder v2.3.11 [
39]. The salmon program v1.0.0 [
40] was used to calculate the expected read counts and transcripts per million (TPM) value for every orthologous gene. which were then used to identify differentially expressed genes (DEGs) with the DEBrowser program [
41]. The genes with posterior fold changes (
FC) of
A. camellia against
A. chekiangensis over two (i.e.,
FC > 2 or
FC < 0.5) and with highly significant posterior probabilities of differential expression (
Padj < 0.05) were considered to be DEGs. It should be noted that one of our candidate genes for the galactose derivatives metabolism had two closely related copies (
NAGA and
NAGA-like, see below), which made it challenging to pinpoint the true source of their mapped reads. In order to differentiate the relative expression levels between the copies, we initially extracted all reads that map to the two copies using bowtie2 program [
42] and samtools [
43]. We then randomly selected five 50 bp variable segments (in each segment, at least seven variable sites occurred between the two gene copies) as baits, and directly counted the number of reads that match the baits using grep module of seqkit program [
44].
3. Results
Based on morphological characteristics and DNA barcoding using mitochondrial COI sequences, six distinct
Andrena species were discovered. Four species were recognized:
Andrena camellia,
Andrena hunanensis,
Andrena striata, and
Andrena chekiangensis. Since neither GenBank blasting nor BOLDSYSTEMS searching produced any COI hits for the two remaining species, they were temporarily designated as
Andrena sp. 1 and
Andrena sp. 2 (
Table 1). A total of 62.74 giga bases (Gb) of WGS clean reads were obtained for the six
Andrena species (five oil-tea specialized species and
A. chekiangensis). After assembly 345~389 Mb of contigs were generated, with N50 contig size of 8.8~15.6 Kb (
Table 2). Phylogenetic analysis showed that, the known three species (
A. camellia,
A. hunanensis and
A. striata) and the unknow two species formed to a single clade, while
A. chekiangensis and
A. haemorrhoa formed another clade (
Figure 2).
According to the KEGG database, alpha-galactosidase (EC 3.2.1.22, also known as alpha-galactosidase A), which is prevalent in chordates, plants, and bacteria, is not found in arthropods such as bees (Hymenoptera) and other insects. As an alternative, bees have the equivalent alpha-N-acetylgalactosaminidase (
NAGA, EC 3.2.1.49, also called alpha-galactosidase B) which is a homologous gene of alpha-galactosidase. Similar to other bees, there was only one
NAGA gene in
A. chekiangensis genome. Intriguingly, two closely similar copies of
NAGA gene were discovered in the genome of the five oil-tea specialized species. One was the conventional
NAGA, while the other appeared to be a novel copy of the conventional
NAGA. For ease of use, we refer to the novel copy as a
NAGA-like gene. The
NAGA and
NAGA-like genes were highly similar, taking
A. camellia as an example, there were 86% and 78% identity sites between them at nucleotide and amino acid sequence levels, respectively. It was worth noting that
NANA-like had a termination mutation in the last exon (the sixth exon) which resulted in a shortened protein (
Figure 3). All genomes of the five oil-tea specialized species and the other seven species (including
A. chekiangensis) contained four of the Leloir pathway genes [
22], which most organisms use to metabolize galactose: aldose 1-epimerase (
galM, EC 5.1.3.3), galactokinase (
galK, EC 2.7.1.6), galactose-1-phosphate uridylyltransferase (
galT, EC 2.7.7.12), and UDP-galactose 4-epimerase (
galE, EC 5.1.3.2).
Genetic variations were surveyed within the five oil-tea specialized species. The coding sequence of
NAGA was 1,320 bp in length, with 16 (1.21%) variable sites among the five species. The
NAGA-like gene was shorter but interspecifically much more variable (2.66%) than
NAGA. The four Leloir pathway genes were shorter than
NAGA and
NAGA-like and the number of variable sites ranked as
galM >
galK >
galT >
galE (
Table 3). The sequences of the six genes of all the 12
Andrena species analyzed in this study were shown in the
supplementary Table S1. Branch model analyses were executed with the five oil-tea specialized species as foreground clade and the remaining seven species as background clade. The results showed that,
NAGA-like,
galK and
galT had significant greater
dN/
dS ratio in the foreground clade than in the background clade (
χ2 test,
df = 1,
p < 0.001). However, no significant divergence was seen for
NAGA,
galM and
galE (
p > 0.2) (
Table 4). We also assess the putative positive selection sites in the two genes using the branch-site model test. With the Bayes Empirical Bayes (BEB) analysis, twelve sites in
NAGA-like, four sites in
galK, and one site in
galT were found under positive selection with posterior probabilities > 0.95.
A total of 33.7 Gb and 35.1 Gb of RNASeq clean reads were obtained for A. camellia and A. chekiangensis, respectively (
Table 5). The assembly of A. camellia generated 23,087 unigenes, with a N50 value of 1,668 bp. For A. chekiangensis, 18,387 unigenes were produced with a N50 value of 1,692 bp. A total of 7,151 orthologs were shared by the two species, with a total length of 7,847,865 bp and N50 of 1,632 bp. The average value of relative expression level (TPM) of these orthologs in each sample were 140. The TPM values of the six candidate galactose metabolism genes are shown in
Table 6. Taking the A. chekiangensis samples as the control group, 1,987 differentially expressed genes were detected, including 1,155 up-regulated (FC > 2) and 853 down-regulated (FC < 0.5) genes in A. camellia. NAGA-like, galK, and galT were significantly up-regulated in A. camellia (FC > 2, Padj < 0.05), while NAGA, galM and galE were not deviated in expression levels between A. camellia and A. chekiangensis (Padj > 0.05) (
Figure 4).
4. Discussion
Oil-tea is an important woody edible and industrial oil tree species [
45]. Its products, tea oil, was categorized by the FAO (Food and Agriculture Organization of the United Nations) as a premium health-grade edible oil [
46]. This plant presents a low oil yield because of self-incompatibility. Previous studies showed that the oil yield can be improved by an increase in pollinating insects [
19,
47]. However, blooms occur in late autumn and winter (from October to January), when bee pollinators are few due to cold temperatures. Additionally, some compounds in the nectar are toxic to most bees, including managed honey bees [
13]. It's interesting to note that the poisonous elements in oil-tea nectar can be detoxified by both adults and larvae of several
Andrena species. Unfortunately, despite the significant attention these species have received [
47,
48,
49,
50], little is known about the molecular mechanisms of detoxification. In this study, we present the first genome and transcriptome sequencing on oil-tea specialized
Andrena species and carried out bioinformatic analyses on genes involved in galactose derivatives metabolism.
As stated in Introduction, in order to degrade galactose derivatives, the alpha-galactosidine bonds need to be firstly hydrolyzed to release galactose residues. In most organisms such as chordates, plants, and bacteria this process is accomplished by alpha-galactosidase A. However, honey bees and other insects have not yet been found to contain such a gene. Instead,
NAGA, a homologous gene of
alpha-galactosidase A, was commonly present in insect genomes. Although the protein produced by
NAGA has been initially thought to be an isozyme of α-galactosidase and given the name α-galactosidase B, it was actually an exoglycosidase acting on N-acetylgalactosamine [
51]. There is no proof that it can replace alpha-galactosidase A's role, which is why honey bees (such as
Apis mellifera and
Apis cerana) are unable to process oil-tea nectar. According to our molecular evolution analyses, there was no discernible selective differentiation of
NAGA between the five oil-tea specialized species and the other
Andrena species. According to gene expression assessments,
NAGA in
A. camellia was not significantly deviated from that in
A. chekiangensis (
p > 0.05). As a result, we hypothesize that the conventional
NAGA makes no contribution to the detoxification of oil-tea specialized
Andrena species.
The most intriguing discovery in this study might be the novel copy of
NAGA, say
NAGA-like gene, in the five oil-tea specialized
Andrena species. Such gene duplication pattern was not found in the other seven
Andrena species, including
A. chekiangensis which also consume oil-tea nectar, although unspecifically. We also examined the genomes and transcriptomes of
Colletes gigas, another crucial oil-tea pollinator [
17,
19], and no novel copy was found. Considering that the five oil-tea specialized
Andrena species formed to a monophyletic group, we speculated that the
NAGA-like gene was created from a particular gene duplicating event occurred in the common ancestor of these species. Molecular evolution analyses with branch models and branch-site models indicated that
NAGA-like was under strong positive selections, a sign that a new phenotype on this gene would arise for these species [
52]. The seqkit grep counts showed that the majority of reads mapping on
NAGA were actually from
NAGA-like. As a result, it is possible to estimate that the expression level of
NAGA-like in
A. camellia is ~120 times of
NAGA in the same species, or ~94 times of
NAGA in
A. chekiangensis. Additionally, given that each of the 7,151 orthologs had an average expression level (TPM) of 140,
NAGA-like had a TPM that was around 496 times of the average value. Such a high degree of
NAGA-like expression suggested that it is essential for the oil-tea specialization in
A. camellia, and maybe the other four oil-tea specialized species. Since the novel
NAGA-like gene was highly similar (86% DNA identity) with conventional
NAGA, it was logical to assume that
NAGA-like protein could likewise catalyze N-acetylgalactosamine residue. In other words, although more research is required to confirm this idea, we propose that
NAGA-like may have acquired a new role to break the alpha-galactosidine linkages from galactose derivatives.
There are four steps in the classic Leloir pathway. First, β-d-galactose (natural galactose) is epimerized to α-d-galactose by
galM. Second, α-d-galactose is phosphorylated to yield α-d-galactose 1-phosphate by
galK. Third,
galT catalyzes the transfer of a UMP group from UDP-glucose to galactose 1-phosphate, thereby generating glucose 1-phosphate and UDP-galactose. Finally, UDP-galactose is converted to UDP-glucose by
galE [
22]. Our results of the branch model and branch-site model tests showed that,
galK and
galT, but not
galM nor
galE, showed significant larger
dN/dS ratios in the five oil-tea specialized species than in the background
Andrena species. Moreover, the gene expression analyses showed that
galT and
galT, but not
galM nor
galE, was significantly upregulated in
A. camellia than in
A. chekiangensis. These findings suggested that
galK and
galT may have been crucial in helping the oil-tea specialized
Andrena species adapt.
Galactokinase (GALK) and galactose-1-phosphate uridylyltransferase (GALT) are the most important proteins for galactose metabolism in humans. GALT deficiency causes the inborn error of metabolism known as “classic galactosemia” (type I galactosemia); while deficiency causes type II galactosemia, which is the most clinically severe form of the disease [
53]. The galactosemia disease manifests soon after birth if the newborn is exposed to galactose, with acute symptoms such as failure to thrive, hepatocellular damage, bleeding, and possible death from bacterial sepsis [
54]. Catalyzing efficiency of GALK and GALT are sensitive to its amino acid mutations [
53]. It should be mentioned that mutations themselves are inherently directionless. For instance, in human, some mutations in
galK gene may result in significant decreases in catalyzing efficiency and lead to diseases, while others may, in the opposite direction, result in an increase in catalyzing function [
55]. We hypothesize that an improvement in catalytic efficiency may result from positive selection in
galK and
galT of the oil-tea specialized species. In contrast, although
galM and
galE were also involved in galactose metabolism, no significant deviations in molecular evolution and gene expression were detected between
A. camellia than in
A. chekiangensis, suggesting that the universal activity and quantity of these two epimerases are sufficient to deal with the catalytic demand in the oil-tea specialized species.