1. Introduction
Sunflower (
Helianthus annuus L.) is one of the most important crops for the production of high-quality oil and seeds consumed by both humans and livestock. Last year’s FAO scientific review states that climate change represents an unprecedented challenge to food security, and due to its impact, plant diseases that ravage economically important crops are becoming more destructive and pose an increasing threat to food security and the environment [
1]. A plant’s response to stress is complex, varies in time (e.g., early or delayed responses), and differs in space (e.g., whole plant, organs, tissues, or cell types). It is also governed by networks that are likely to involve numerous signaling pathways and genes [
2]. Moreover, abiotic and biotic stresses can interact in complex ways. Over the past decades, many groups have conducted genetic, genomic and transcriptomic experiments at different stages and under different conditions to understand the regulatory and genetic basis for many stress related responses in sunflower [
3,
4,
5,
6,
7]. Yet the identification of causal genes in these cases is very limited, mainly due to the lack of statistical power and poor experimental design. It is known that Helianthus species possess a rich spectrum of non-redundant defense mechanisms, many of which have been introduced into cultivated sunflowers from their wild relatives [
8,
9]. Moreover, fungal resistance in crops is typically governed by many interacting
loci, which each offer small contributions [
10]. Identifying those causal
loci via strategies such as genome wide association studies (GWAS) and quantitative trait
loci mapping (QTL mapping) is very challenging and cumbersome. The genomic regions surrounding GWAS markers and those encompassing QTL mapping intervals can be extensive, including hundreds of genes, only a few of which may be potentially causal. To make matters worse, the sunflower genome is only partially annotated. Only ~60% of the genes are annotated with at least one Gene Ontology (GO) term, and all annotations being in silico (inferred from phylogeny (IBA) or electronic (IEA) annotation). Moreover, only ~22% of gene products have at least one BP GO annotation, and only 864 (1.6%) out of the total gene products are annotated as “defense response”, whereas this figure rises to 8% in
Arabidopsis thaliana [
11,
12].
To date, several novel automated function prediction methods have been developed to address this knowledge gap. Among them, top scoring methods incorporate algorithms utilizing gene co-expression data to infer biological functions through the Guilt-By-Association principle [
13,
14]. Historically, BLAST [
15] and InterProScan [
16] using sequence similarity and presence/absence of domains, respectively, have been the major annotation pipelines. The top scoring methods include machine learning techniques and novel ways to integrate genomic features, sequence intrinsic properties, protein-protein interactions, and other omics data and transfer these new annotations to unknown genes [
17]. As expected, the assignment of molecular function (MF) GO terms significantly outperforms the assignment of BP GO terms in these machine learning methods [
18]. Moreover, given the scarcity of data from different functional omics studies, such as protein interaction assays, in non-model organisms, such as sunflower, gene-expression based automated functional prediction remains of great importance, especially for the prediction of BP GO terms [
13,
19]. Therefore, re-analyzing transcriptome data as a whole enhances functional annotation by leveraging the collective knowledge gained from multiple experiments. To date, several authors have focused on multi-study co-expression networks to gain a deeper understanding of specific biological processes [
20], for example in the model plant
Arabidopsis thaliana [
13,
14]. The work of Depuydt et. al. (2021) [
13] in Arabidopsis showed very promising results in annotating unknown Arabidopsis genes with multi-study co-expression networks, especially using the BP GO terms under the category “response to stimulus”. Taken together, these network analyses can provide new knowledge for the field under study.
In this work, we investigated the potential of gene co-expression networks generated from multi-study datasets from all public RNAseq bio-projects currently available for sunflower to identify and/or validate candidate
loci for defense or fungal resistance. We focused on identifying and categorizing gene candidates for two major fungal diseases, Verticillium wilt and Sclerotinia head rot.
Verticillium dahliae (Vd) is a soil-borne and seed-borne pathogen that is capable of infecting hundreds of plant species including sunflower. Infection by
V. dahliae causes vascular discoloration and systemic wilting [
21]. On the other hand, the necrotrophic fungus
Sclerotinia sclerotiorum can behave as an air-borne and a soil-borne pathogen, and is the causal agent of three distinct diseases in sunflower (sclerotinia root rot, sclerotinia stem rot, and sclerotinia head rot) [
22]. Altogether, these pathogens affect oil quality and under favourable conditions may lead to massive production loss [
23]. Different studies have identified many candidate genes potentially involved in the resistance against these diseases [
21,
24,
25,
26] but as we mentioned, narrowing down the culprits is difficult.
Here we present the first large-scale co-expression network for sunflower and perform functional prediction to further characterize previous candidate
loci for
S. sclerotiorum and
V. dahliae resistance and identify novel
loci involved in defense response, specifically, to fungal pathogens. introduction should briefly place the study in a broad context and highlight why it is important. It should define the purpose of the work and its significance. The current state of the research field should be carefully reviewed and key publications cited. Please highlight controversial and diverging hypotheses when necessary. Finally, briefly mention the main aim of the work and highlight the principal conclusions. As far as possible, please keep the introduction comprehensible to scientists outside your particular field of research. References should be numbered in order of appearance and indicated by a numeral or numerals in square brackets—e.g., [
1] or [
2,
3], or [
4,
5,
6]. See the end of the document for further details on references.
3. Discussion
To our knowledge, this is the first large scale multi study co-expression network reported for sunflower. The objective was to arrange genes into biologically significant clusters while investigating their connections, classify previously reported candidate
loci and deduce their possible function, and to discover new potential candidates for fungal disease resistance. For this, we constructed two weighted gene co-expression networks (GCNs) from different tissues of sunflower plants: green healthy tissue and stressed/healthy root tissue. It has been shown that merging samples from different tissues and studies generates networks with less gene connectivity and fewer modules[
33]. Hence, in our work, we decided to separate samples according to tissue origin and construct different networks to compare them later. Moreover, it is known that in both simulated and real datasets, approximately 100 samples or more are required to capture transcriptional genome-wide regulations in a consistent and robust manner [
32]. Therefore, we did not undertake the analysis of other relevant tissues such as the stem of fully grown plants or capitula since there were not enough samples to construct robust networks.
Co-expression network analyses have become a popular tool to find associations between complex expression data in plants [
34,
35,
36,
37,
38,
39]. The use of public transcriptomic data, while increasing the heterogeneity, also increases sensibly the sample size, enhancing the possibility to establish genuine associations and provide valuable insights into gene co-expression patterns and their potential roles in defense responses in sunflower. In our work, the inclusion of 733 samples in the GreenGCN made it possible to organize 25% of the genes in the sunflower’s genome in 62 modules, while the inclusion of 163 samples in the RootGCN grouped 40% of the genes in the sunflower’s genome in 77 modules. Both networks exhibited a scale-free topology, indicating a robust organization of gene co-expression [
40]. The identification of hub-genes with high intramodular connectivity scores suggests their importance in coordinating gene expression within the modules. To further establish a biological context for the modules, a Gene Ontology (GO) term enrichment analysis was performed. All modules of both networks showed significant enrichment for one or more GO terms, further supporting the functional relevance of the identified gene clusters. Notably, module Green1 stood out with the highest number of hub genes, indicating its potential significance in the biological processes of growth, cell division, and plant-type cell wall biogenesis. Moreover, the GO term analysis of the RootGCN modules revealed significant enrichment for defense responses, consistent with the control versus infected condition of the samples. Additionally, the networks’ performances were evaluated by measuring the Guilt-By-Association-AUROC (GBA-AUC), which quantifies each network’s ability to connect genes with shared GO terms. The obtained GBA-AUC values indicated a high level of functional coherence within each network. The analysis revealed a negative correlation between gene connectivity and dN/dS ratios, consistent with Masalia et al (2017) [
21] where genes with less overall connectivity appeared to be under more intense selective pressure. In summary, the networks serve as a valuable tool for investigating potential connections between genes and advancing towards the field of systems biology in sunflower.
The identification of modules related to fungal disease resistance was a key focus of the study. We identified eight defense-related modules, significantly enriched in GO terms associated with defense responses and/or WRKY genes: Green5, Green16, Green18, Green24, Green39, Root18, Root36, and Root44. Within them, the intramodular connectivity scores pointed out the main genetic drivers of each as hub genes. Among the 40 hub genes with predicted function, nine were directly associated to defense responses through their GO-Term annotation, while other 16 hub genes were linked to them through their A. thaliana homologs. Additionally, the positive correlation between the eigengenes of these modules within each GCN suggested a coordinated regulation in defense processes. Notably, module Green16 exhibited an inverse correlation with other defense-related modules, indicating potential functional distinctions. Supportively, no WRKY genes were found in this module. The WRKY family is considered as one of the largest transcription factor (TF) families in higher plants [
23,
33], with 119
loci in sunflower [
27,
34]. Members of the WRKY-family are involved in several stress related processes, including defense response and biotic or abiotic stress responses [
23]. However, only eight sunflower WRKY genes are annotated with the GO term “defense response” or its child terms in the HanXRQv1 genome. In sunflower, previous findings by Giacomelli et al. (2010) [
24], Liu et al. (2020) [
27], Li et al. (2020) [
35] and Filippi et al (2020) [
18] implicated WRKY genes in fungal defense responses. In this study, we also identified homologs of sunflower WRKY genes in our networks that were previously associated with defense responses in other plant species (see
Table 2). The presence of these sunflower gene homologs within the “defense modules”, along with their high intramodular connectivity, provides evidence for their function in a defense regulatory role. In support of the predictive ability of the networks, the WRKY hub of the Green24 module, HanXRQChr16g0509771 (IMK=0.89), without prior GO defense-related annotation, was found by Peluffo (2010) and Giacomelli et al. (2010) (under the name HaWRKY7) to be differentially expressed upon infection with S. sclerotiorum in two independent studies[
41,
42]. Furthermore, using this candidate gene as a marker in an association mapping study, Filippi et al. (2015) found it to be associated with low severity of Sclerotinia head rot [
43]. Furthermore, another five WRKY genes (four present in Green24 and one in Green39) were described by Giacomelli et al. (2010)[
41] to have differential expression under S. sclerotiorum infection: HanXRQChr16g0499381, HanXRQChr03g0088861, HanXRQChr03g0084521, HanXRQChr08g0216831 and HanXRQChr16g0505941 (IMKs = 0.64, 0.37, 0.37 and 0.18 respectively) (
Supp. Table S1). Lastly, the WRKY-like genes HanXRQChr14g0460611 and HanXRQChr15g0480431 from modules Green18 (IMK=0,41) and Green5 (IMK=0,13), respectively, have been shown to be upregulated in roots under
R. irregulare infection [
44]. It is worth mentioning that among the total of 26 WRKY genes identified in the studied modules, 8 genes are found in both the Green and Root modules of interest. This finding suggests a potential link or association between these gene clusters in different tissues. Our comprehensive analysis, based on public data, presents compelling evidence supporting the involvement of multiple sunflower WRKY genes in defense responses.
In addition to the WRKY family, a large number of other genes have been included in our defense-related modules with potential functional connections. To illustrate the functional potential of defense-related networks, we selected the gene AT3G55800, a homolog of the hub gene HanXRQChr14g0459921 in the Green16 module, which is described as a sedoheptulose biphosphatase. According to the Protein-Protein Interaction Networks of STRING Database (© STRING CONSORTIUM 2023), this protein has been reported to be coexpressed with several fructose-bisphosphate aldolases and glyceraldehyde-3-phosphate dehydrogenases, among others. In addition to its role in the process of carbohydrate biosynthesis, this protein, along with three other metabolic enzymes involved in glycolysis and the pentose phosphate pathway, decreased during the response to Pseudomonas syringae pv tomato DC3000 in A. thaliana leaves [
45]. At the same time, the ATP synthase subunit CF1δ, which is encoded by the gene AT4G09650 and plays a central role in oxidative or photosynthetic phosphorylation, decreased in abundance. Remarkably, the Green16 module is highly enriched in predicted enzymes of carbohydrate metabolism, including several fructose biphosphatases, glyceraldehyde-3-phosphate dehydrogenases, sedoheptulose biphosphatases, and malate dehydrogenases, many of which play a central role in this module. Green16 also contains a homolog of AT4G09650, HanXRQChr01g0001181. Moreover, this gene was down-regulated in a tolerant inbred line at early stages of infection with the fungus S. sclerotiorum in sunflower, suggesting a common functional role in pathogen-associated molecular pattern (PAMP)-triggered immunity [
46]. Several authors have identified modulation of primary metabolic pathways during the defense response, although the specific players in each species remain to be elucidated [
47,
48,
49].
Previous studies identified various
loci involved in the quantitative resistance to V. dahliae and
S. sclerotiorum, mostly through comprehensive analysis in biparental mapping, association mapping, or transcriptome differential expression studies [
25,
26,
30,
31,
32,
46] (citas). In an attempt to provide a biological context to the candidate genes derived either by co-localizing with a molecular marker in GWAS studies or by being located in between markers of QTL studies, we focused on the candidates predicted as related to defense functions in our modules of interest. This strategy allowed to select and annotate by association the candidates immersed in a defense-related biological context. The resulting list included 122 unique genes, of which 1 was from group A, 40 were from the close vicinity of 44 associated markers of group B, and the remaining 81 genes were included in 13 QTL regions from group C (see
File S5). Interestingly, 7 of the 122 genes had been repeatedly identified by different studies. Furthermore, of the 122 total
loci, 43 have no previous BP GO term annotation, 3 are WRKY transcription factors, and 8 are described as “unknown” or “uncharacterized” proteins.
Overall, the aforementioned findings present a novel approach to exploiting the great amount of data generated in sunflower to date. Although only a moderate proportion of the sunflower genome could be included in our networks, they highlight the presence of genetic drivers and group co-expressed elements, expanding our understanding of the associations established during different biological processes taking place. The insights into disease resistance in sunflowers are of particular interest for breeding programs. The identification of WRKY genes with a recurrent and central role in defense-related modules presents a compelling example of promising breeding targets. The dominant role of these transcription factors can potentially achieve more comprehensive and substantial effects on multiple
loci associated with defense responses by modulating the expression of various downstream genes. This could also apply to other hub-genes, allowing for fine-tuning the breeding strategies and leading to the development of sunflower varieties with enhanced defense capabilities against fungal pathogens. The defense-related annotation resource generated in this study is available in
File S5, where it can be explored by the sunflower and plant community to facilitate future research, as these novel annotations might provide the missing link in defense-related pathways.