Machine learning, a subfield of artificial intelligence, deals with the development of algorithms capable of learning from data. Recently, the application and development of machine learning methods for genomics have grown rapidly. Machine learning has proved valuable for analysing complex, high-dimensional genomics data and extracting previously unknown information. Examples of machine learning applications in the wider omics field include the identification of DNA sequences (splice sites [31], promoters [32], enhancers [33]), nucleosome positioning [34], taxonomic annotation [35], microbial enterotyping [36], sequence error learning [37], microbial host body site and subject classification [38], viral escape prediction [39], protein 3D structure estimation [40], evolutionary population genetics inference [41] and genomic selection [42].
2.1. Machine Learning Methods Frequently Adapted for GWAS
PubMed and Google Scholar were searched for journal articles that included the keywords “machine learning” and “genome wide association study”. We focused on papers written in English and published from 1 January 2004 to 6 November 2023. An initial set of 147 articles was selected and then screened for inclusion based on title, keywords and abstract. Papers that did not match the inclusion criteria were eliminated, resulting in 109 articles. We then assessed the full text of those papers and categorised them by context and relevance, covering research articles that applied machine learning algorithms to GWAS or PRS as well as review papers. We also included benchmarking research that used real data and excluded studies that used only synthetic data. Duplicate papers were also removed from this set. This resulted in 79 relevant papers, of which 60 were research articles and 19 were review articles. The methodology in each research article was analysed to identify the specific machine learning tools and their unique features. The most common methods included Support Vector Machines (SVMs), random forest and neural networks. We provide a short background for these methods below.
Random forest [43] is an ensemble learning method commonly used in GWAS. In a random forest, many weak classifiers (e.g. decision trees) are constructed, each using a random subset of the training data and a random subset of the features. This randomness in data and feature selection is a key element of the method, which mitigates the risk of overfitting and helps the model generalise to new, unseen data. Each tree in the forest independently makes predictions based on its specific subset of the data. When a new data point is presented to the model, it passes through each decision tree and the individual predictions are aggregated. In classification tasks, the final prediction is often determined by a majority vote among the trees, while in regression tasks, it is the average of the predictions. Random forests are particularly strong at handling the high-dimensional genomic data commonly encountered in GWAS, providing insights into the importance of individual genetic features and interactions among them [44]. Random forests can also be used to rank feature importance, helping researchers to identify key genetic variables contributing to complex traits, as discussed below.
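The bagging and voting steps described above can be illustrated with a minimal sketch. It is not a production random forest: each "tree" is a single-split stump, the genotype matrix and labels are toy data invented for illustration, and the function names (`fit_stump`, `fit_forest`, `predict`) are hypothetical.

```python
import random
from collections import Counter

def fit_stump(X, y, features):
    """Fit a one-split tree: pick the (feature, threshold) with fewest training errors."""
    best = None
    for j in features:
        for t in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            # Each side of the split predicts its majority label.
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = (sum(yi != left_label for yi in left)
                      + sum(yi != right_label for yi in right))
            if best is None or errors < best[0]:
                best = (errors, j, t, left_label, right_label)
    _, j, t, left_label, right_label = best
    return lambda row: left_label if row[j] <= t else right_label

def fit_forest(X, y, n_trees=25, n_features=2, seed=0):
    """Train each stump on a bootstrap sample and a random subset of SNPs."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        rows = [rng.randrange(n) for _ in range(n)]   # bootstrap sample of individuals
        feats = rng.sample(range(p), n_features)      # random subset of features
        forest.append(fit_stump([X[i] for i in rows], [y[i] for i in rows], feats))
    return forest

def predict(forest, row):
    """Aggregate the individual tree predictions by majority vote."""
    return Counter(tree(row) for tree in forest).most_common(1)[0][0]

# Toy genotypes coded 0/1/2. SNPs 0 and 1 are informative (assumed to be in
# perfect LD); SNP 2 is noise. Label 1 = case, 0 = control.
X = [[0, 0, 0], [0, 0, 1], [0, 0, 2], [0, 0, 0], [0, 0, 1], [0, 0, 2],
     [1, 1, 0], [2, 2, 1], [1, 1, 2], [2, 2, 0], [1, 1, 1], [2, 2, 2]]
y = [0] * 6 + [1] * 6
forest = fit_forest(X, y)
```

For regression, the vote in `predict` would be replaced by the mean of the tree outputs.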
SVMs [45] are a class of machine learning algorithms designed to classify data by identifying the optimal hyperplane that best separates different classes in a high-dimensional feature space. In the context of GWAS, genetic data are typically represented as high-dimensional feature vectors, which SVMs treat as points in a multi-dimensional space. The goal is to identify the hyperplane (decision boundary) that maximises the margin between the genetic variations associated with a particular trait or disease. SVMs work by selecting support vectors, which are the data points closest to the decision boundary. These vectors play a key role in determining the orientation and position of the hyperplane. The choice of the optimal hyperplane is critical because it reduces the risk of overfitting and aims to generalise well to unseen data. SVMs can also handle non-linear relationships through kernel functions, which transform the input data into a higher-dimensional space where a linear separator becomes feasible.
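As a rough illustration of the margin-maximisation idea, the sketch below trains a linear SVM by full-batch subgradient descent on the regularised hinge loss. This is one simple way to fit a linear SVM, chosen here for brevity; it is not the solver used by any of the cited tools. The data, the hyperparameters (`lam`, `lr`, `epochs`) and the function names are assumptions made for the example.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Minimise lam/2 * ||w||^2 + mean hinge loss; labels y must be -1 or +1."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]   # gradient of the regulariser
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:                # point violates the margin: hinge loss is active
                for j in range(p):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def svm_predict(w, b, x):
    """Classify by the side of the hyperplane w.x + b = 0 on which x falls."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Toy genotypes (0/1/2): the first SNP separates cases (+1) from controls (-1),
# the second SNP is uninformative.
X_svm = [[0, 0], [0, 1], [0, 2], [2, 0], [2, 1], [2, 2]]
y_svm = [-1, -1, -1, 1, 1, 1]
w, b = train_linear_svm(X_svm, y_svm)
```

The magnitude of each entry of `w` gives a crude indication of how much the corresponding SNP contributes to the decision boundary; here the first weight dominates.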
Neural networks [46] have rapidly gained significance in GWAS, mainly due to their ability to uncover complex genetic patterns within high-dimensional genomic datasets. The basic building block of a neural network is the artificial neuron (also referred to as a node). Each neuron transforms its input through a weighted sum, followed by the application of an activation function. By connecting neurons in layers, neural networks can model increasingly abstract and complex relationships. In the context of GWAS, these networks are often designed as deep neural networks [47, 48] with multiple hidden layers, to extract hierarchical features from genetic data. Neural networks are especially suited to capturing non-linear relationships among genetic variants [48]. During training, the network is fed genetic data and its internal parameters are adjusted iteratively to minimise prediction errors. Once trained, neural networks can be used for a variety of tasks, including classification, regression and feature selection.
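The neuron, the layered forward pass and the parameter update described above can be written out in a few lines. This is a minimal sketch: the sigmoid activation, the log-loss objective, the learning rate and all weights are illustrative assumptions rather than settings taken from the cited studies.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs followed by an activation function."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

def forward(x, layers):
    """Pass x through successive layers; each layer is a list of (weights, bias)."""
    for layer in layers:
        x = [neuron(x, w, b) for w, b in layer]
    return x

def log_loss(p, y):
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def gradient_step(inputs, y, weights, bias, lr=0.1):
    """One gradient-descent update of a single sigmoid neuron under log loss."""
    p = neuron(inputs, weights, bias)
    new_weights = [w - lr * (p - y) * x for w, x in zip(weights, inputs)]
    new_bias = bias - lr * (p - y)
    return new_weights, new_bias

# Toy genotype vector (0/1/2 coding) for one individual labelled as a case (y = 1).
genotype, label = [0, 1, 2], 1

# Forward pass through a hypothetical 3-2-1 network with arbitrary weights.
hidden = [([0.2, -0.1, 0.4], 0.0), ([-0.3, 0.5, 0.1], 0.1)]
output = [([0.7, -0.6], 0.0)]
prediction = forward(genotype, [hidden, output])

# One training step on a single neuron lowers the loss on this example.
w0, b0 = [0.0, 0.0, 0.0], 0.0
loss_before = log_loss(neuron(genotype, w0, b0), label)
w1, b1 = gradient_step(genotype, label, w0, b0)
loss_after = log_loss(neuron(genotype, w1, b1), label)
```

Training a full network repeats this kind of update for every layer via backpropagation, over many individuals and many passes through the data.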
2.2. Machine Learning Application Areas in GWAS
In this section we present the methods, benchmarking efforts and purpose-built tools that integrate machine learning approaches for high-dimensional genetic data, and whose results are promising for identifying novel disease-associated susceptibility loci. These studies suggest that machine learning could be used instead of traditional statistical GWAS methods, potentially aiding the understanding of complex multifactorial genetic diseases and the prediction of individuals at risk. Benchmarking efforts for machine learning in the field of GWAS have mainly focused on four methods: gradient boosting, random forest, SVM and neural networks. Here, we group applications into the following categories: prioritising top GWAS results, detecting epistasis among selected loci, prioritising variants for GWAS, predicting traits, identifying variants/loci and supporting PRS.
Prioritisation of top GWAS results
Machine learning applications developed for post-GWAS prioritisation (up until 2020) were summarised by Nicholls et al. [49], who pointed out that 7 out of 19 post-GWAS prioritisation methods were ensemble methods, namely gradient boosting and random forest. One remarkable benchmarking effort in this field was by Vitsios and Petrovski (2020) [50], who compared seven different machine learning methods to prioritise genes for amyotrophic lateral sclerosis, chronic kidney disease and epilepsy. They drew on a diverse pool of gene-annotation sources: generic resources (disease and/or tissue agnostic), resources filtered by tissue and disease-specific features. They also developed “mantis-ml”, an automated machine learning framework that enables learning from sets of gene-associated features. Random forest was reported as the top-performing classifier. An earlier benchmarking effort was by Roshan et al. (2011), who introduced random forest as a method for ranking causal variants once a GWAS has already been performed [51]. Their method helped to loosen the Bonferroni threshold, doubling the number of SNPs passing the threshold, and showed that both methods improve the ranks of causal variants and associated regions.
An example of how neural networks can be used to prioritise disease-associated genetic variants can be found in Liu et al. (2018) [
52]. They developed DEOPEN, a model which integrates a deep convolutional neural network and a three-layer feed-forward neural network. This model can predict chromatin accessibility and consider interactions between sequence patterns. The authors also demonstrated how their framework can be used to evaluate genetic variants of interest, including functional variants. Their model outperformed Basset [
53] and gkm-SVM [
54] for classification of genome susceptibility in 50 random cell lines. Most importantly, DEOPEN can be used to identify known and potentially new transcription factor motifs. The authors applied their framework to a GWAS breast cancer dataset which identified 29 SNPs associated with this condition from 1,057 SNPs that co-occurred with them, through their involvement with a cancer-related transcription factor.
A random forest-based classifier, GCDPipe [55], uses gene-level results derived from GWAS analysis. It expands the list of potential disease gene candidates by estimating each gene's probability of influencing disease risk. GCDPipe identifies the gene expression profiles across cell types and tissues that are most important for identifying putative disease genes. Additionally, it prioritises drugs based on their affinity to the putative disease genes using drug-gene interaction databases. Open Targets recently introduced new techniques for prioritising GWAS results [56]. Their “locus-to-gene” model derives features to prioritise likely causal genes at each GWAS locus, incorporating genetic and functional genomics features such as distance, molecular QTL colocalisation, chromatin interaction and variant pathogenicity. The locus-to-gene method uses a machine learning model to determine the weights of each evidence source, using a gold standard of previously identified causal genes as reference and relying on fine-mapping and colocalisation data.
Another method that uses epigenetic knowledge is DeepPerVar [57]. It was developed in two versions, based on two datasets: DeepPerVar-H3K9ac (paired whole genome sequence and H3K9ac ChIP-seq data) and DeepPerVar-methy (paired whole genome sequence and DNA methylation data), which predict quantitative signals and methylation ratios, respectively. Overall, DeepPerVar was able to interpret and prioritise causal variants in a GWAS risk locus linked to schizophrenia, quantify epigenetic signals and interpret the relationship of non-coding variants with a disease trait.
Epistasis detection among selected loci
Random forest was initially suggested as an alternative approach for modelling genetic interactions in 2004 [44]. The rationale behind employing random forest is that, where genuine interactions exist, SNPs exhibit modest individual effects but considerable interaction effects within a population. Such effects are unlikely to be detected at the genome-wide multiple testing thresholds used in GWAS screenings. Moreover, model-based screens that assess the interaction of each SNP with every other SNP in the dataset, aiming to pre-specify interacting SNPs, are impractical for datasets exceeding a thousand SNPs. Given that a typical GWAS dataset usually comprises more than 50,000 SNPs, which already corresponds to about 1.25 billion SNP pairs, such an approach becomes infeasible.
A random forest analysis of interacting genetic models with up to 32 independent SNPs showed that random forest performed better than Fisher’s exact test as a screening tool when genetic heterogeneity as well as random noise was accounted for. In this study, the authors recommended using thousands of trees to obtain stable estimates of variable importance [44]. An advantage of random forest is that the investigator does not need to propose a model, making it well-suited for hypothesis-free screens such as GWAS or candidate gene studies. It also captures interactions and reflects them in variable importance scores. Drawbacks of the method include a lack of concordance between variable importance and predictive index value [58] and a high chance of detecting spurious associations when the study design is sub-optimal [59]. A recent report by Leem et al. [60] suggested a three-step approach that allowed the authors to define up to 5-locus interactions in real WTCCC datasets and in synthetic datasets without marginal effects. There have also been multiple attempts to find interacting genetic loci using other machine learning methods, such as decision trees (DF-SNPs) [61], Deep Mixed Model [62] and grammatical evolution optimised neural networks (GENN) [63].
Variant prioritisation
One important area of machine learning for GWAS has been prioritising the loci to be included in a GWAS. To this end, both stand-alone and combinatory tools have been developed for search space reduction. In 2015, Nguyen et al. [64] developed ts-RF, a two-stage method. First, a p-value assessment is performed to find a cut-off point that separates the genome-wide SNP data into informative and irrelevant SNPs. The informative SNP group is further divided into two sub-groups: highly informative and weakly informative. Both sub-groups are then considered when sampling features for building trees, so that the feature subspace used to split a node is encouraged to contain highly informative SNPs, resulting in better performance. They applied ts-RF to real genome-wide datasets of Alzheimer’s and Parkinson’s disease and compared its performance with that of a linear-kernel SVM from LibSVM [65]. ts-RF performed better at prediction and was able to identify 25 SNPs associated with Parkinson’s disease that are located within gene regions studied by previous GWAS.
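The two-stage idea behind ts-RF can be sketched as follows. This is a loose illustration of the description above, not the published algorithm: the two p-value cut-offs, the `high_fraction` parameter and the function names are hypothetical, and the original method derives its cut-off from a p-value assessment rather than taking it as an argument.

```python
import random

def split_snps(p_values, cutoff, strong_cutoff):
    """Stage 1: drop irrelevant SNPs, then split the rest into two sub-groups."""
    informative = [i for i, p in enumerate(p_values) if p <= cutoff]
    high = [i for i in informative if p_values[i] <= strong_cutoff]
    weak = [i for i in informative if p_values[i] > strong_cutoff]
    return high, weak

def sample_subspace(high, weak, mtry, rng, high_fraction=0.5):
    """Stage 2: draw the feature subspace for one node split from both sub-groups,
    so that it always contains highly informative SNPs when any exist."""
    n_high = min(len(high), max(1, round(mtry * high_fraction))) if high else 0
    n_weak = min(len(weak), mtry - n_high)
    return rng.sample(high, n_high) + rng.sample(weak, n_weak)

# Toy association p-values for five SNPs (invented for illustration).
p_values = [1e-8, 0.5, 1e-3, 0.04, 0.9]
high, weak = split_snps(p_values, cutoff=0.05, strong_cutoff=1e-4)
subspace = sample_subspace(high, weak, mtry=2, rng=random.Random(0))
```

In a full implementation, `sample_subspace` would be called at every node of every tree in place of the uniform feature sampling used by a standard random forest.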
Silva et al. [66] showed that dimensionality reduction techniques based on random forest could effectively reduce dataset dimensions before conducting a cluster analysis of augmented GWAS data using a two-step machine learning approach. In the first step, dimension reduction, SNPs were ranked by relevance, and those with higher relevance proceeded to the second step, clustering. They tested the method on a GWAS of seroclearance in chronic hepatitis B, including the most significant SNPs in the clustering. The results included over 100 SNP sets associated with the phenotype of interest. SNPs were further detected and linked to HBsAg seroclearance with statistical significance based on Hamming distance-based association tests [67], in which a p-value was calculated for each predetermined causal SNP set. Knowing that statistically significant variants tend to cluster, the authors also investigated the functional relevance of SNPs found in the same SNP set, as well as in individual SNPs followed by random forest, and identified possible susceptibility loci that could otherwise be ignored when only performing GWAS. The SNP sets resulting from the cluster analyses were subsequently tested for trait association, which identified three susceptibility loci possibly associated with HBsAg seroclearance, one of which was reported in the literature to be significantly associated with HBsAg seroclearance in patients who had received antiviral treatment.
Random forest was further combined with SVMs and k-nearest neighbour (kNN) methods [68] by Gaudillo et al. and used for asthma genetic risk prediction. In their study, they applied random forest to identify the SNPs most strongly implicated in asthma. They then trained kNN and SVM algorithms to classify the identified SNPs by their association with asthma. Recently, Díez Díaz et al. [69] proposed GASVeM, which uses genetic algorithms together with SVMs to determine whether a certain biological pathway, assigned from a set of SNPs, can distinguish cases from controls. New frameworks using SVMs continue to be developed, although their performance has been shown to be heavily influenced by the heritability of the disease [70].
Recent research in Alzheimer’s disease [71] used a hybrid feature selection approach based on an association test, principal component analysis and the Boruta algorithm to identify the most promising predictors. The selected features were then passed to wide and deep neural network models to classify Alzheimer’s disease cases and healthy controls. In the first step, the authors conducted an association test to select the most significant SNPs influencing the disease, followed by the hybrid feature selection approach to reduce the number of features substantially. They subsequently used a selection process for neighbouring SNPs to generate a final set of SNPs. This set was then used to train the wide and deep learning classification models to distinguish cognitively normal individuals from those with Alzheimer’s disease. Another method is DeepGWAS, which uses a 14-layer deep neural network to enhance GWAS signals using GWAS summary statistics, linkage disequilibrium information and brain-related functional annotations. DeepGWAS was developed particularly for psychiatric diseases, starting with schizophrenia, and outperformed XGBoost and logistic regression methods [72]. The range of applications using combinatory approaches continues to expand (Table 1).
2.3. Tools for SNP Discovery From Whole-Genome SNP Data
There is a growing number of efforts that use SVMs and neural networks to narrow down the search space for GWAS. Additionally, there are tools designed to perform GWAS with no prior hypothesis or feature selection. Below we discuss algorithms and publicly available tools which have undergone internal benchmarking but warrant further testing in broader genetic epidemiological research (Table 2).
A method by Mieth et al. (2021), COMBI [73], trains a linear SVM and uses it as an indicator of SNP importance, for each chromosome separately. This filtering step selects SNPs which contribute to phenotype classification with either high individual effects or effects in combination with the rest of the SNPs, while removing results due to the correlation structure. At the application level, a phenotype vector and a genotype matrix, which can be directly converted from a Plink [74] genotype object, are generated. From these two objects, an SVM weight vector is generated and used as an importance measure.
Table 1.
An overview of machine learning tools classified by application categories and machine learning approaches.
Application categories | Applications and tools | Machine learning approach
Prioritisation of top GWAS results | Methods developed prior to 2021 [49] | Clustering, SVM, Random Forest, Neural Network
Epistasis detection among pre-selected SNPs | | Clustering, Random Forest, Neural Network
Variant prioritisation | clustering, random forest [66]; random forest, SVM, kNN [68]; Wide and Deep Learning [71] | SVM, Random Forest, Neural Network
Hypothesis-free GWAS | | SVM, Neural Network
Polygenic Risk Score | | Random Forest, Neural Network
In the second step of COMBI, the selected SNPs with the highest scores undergo a chi-square (χ²) based hypothesis test, performed together with Westfall-Young [83] type threshold calibration for each SNP, based on the permutation distribution of the re-sampled p-values. In this way, using a pre-selected list of SNPs and a relaxed p-value threshold, the proportion of true positives in the data is ultimately increased. In the simulated dataset, COMBI outperformed other SVM-based algorithms, including the previously mentioned one from Roshan et al. [51]. The authors then used data from the 2007 WTCCC phase 1, consisting of 14,000 cases of seven common diseases and 3,000 shared controls. When compared to the standard p-value thresholding approach, COMBI detected twelve additional SNPs, ten of which have already been replicated in later GWAS or meta-analyses of bipolar disorder, coronary artery disease, Crohn’s disease and type 2 diabetes.
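The screen-then-test logic can be sketched in simplified form. The sketch assumes that importance scores (for example, absolute SVM weights) are already available, tests each selected SNP with a 2x2 carrier-by-status chi-square test, and calibrates the significance threshold from the permutation distribution of the minimum p-value. It is a simplification for illustration only: the published COMBI procedure differs in its details, and the data, the parameter `k` and all function names are hypothetical.

```python
import math
import random

def chi2_pvalue_2x2(a, b, c, d):
    """Pearson chi-square test (1 degree of freedom) on the table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0
    stat = n * (a * d - b * c) ** 2 / denom
    return math.erfc(math.sqrt(stat / 2))   # upper tail of chi-square with 1 df

def carrier_table(genotypes, labels):
    """Count carriers and non-carriers of the minor allele among cases and controls."""
    a = sum(1 for g, s in zip(genotypes, labels) if g > 0 and s == 1)
    b = sum(1 for g, s in zip(genotypes, labels) if g == 0 and s == 1)
    c = sum(1 for g, s in zip(genotypes, labels) if g > 0 and s == 0)
    d = sum(1 for g, s in zip(genotypes, labels) if g == 0 and s == 0)
    return a, b, c, d

def screen_and_test(X, y, scores, k, alpha=0.05, n_perm=200, seed=0):
    # Step 1: keep only the k SNPs with the largest importance scores.
    top = sorted(range(len(scores)), key=lambda j: abs(scores[j]), reverse=True)[:k]

    def p_values(labels):
        return {j: chi2_pvalue_2x2(*carrier_table([row[j] for row in X], labels))
                for j in top}

    observed = p_values(y)
    # Step 2: permute the phenotype to obtain the null distribution of the
    # smallest p-value among the selected SNPs, and use its alpha-quantile
    # as the significance threshold.
    rng = random.Random(seed)
    shuffled = list(y)
    min_ps = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        min_ps.append(min(p_values(shuffled).values()))
    min_ps.sort()
    threshold = min_ps[max(0, int(alpha * n_perm) - 1)]
    return {j: p for j, p in observed.items() if p <= threshold}, threshold

# Toy data: 20 cases and 20 controls. SNP 0 is carried by all cases and no
# controls; SNP 1 is arbitrary; SNP 2 is balanced between the two groups.
X_toy = ([[1, i % 3, 1 if i < 10 else 0] for i in range(20)]
         + [[0, i % 3, 1 if i < 10 else 0] for i in range(20)])
y_toy = [1] * 20 + [0] * 20
significant, threshold = screen_and_test(X_toy, y_toy, scores=[0.9, 0.1, 0.5], k=2)
```

Because only `k` SNPs are tested, the permutation threshold is less stringent than a genome-wide Bonferroni correction over all SNPs, which is the source of the relaxed threshold mentioned above.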
DeepCOMBI [76]
The authors of COMBI subsequently developed a “deep” extension of COMBI, called DeepCOMBI [76]. This extension was designed to identify SNPs associated with a trait of interest, leveraging genotypic and phenotypic data from GWAS. The methodology includes the construction of deep neural networks for predicting phenotype from genotype and selecting SNPs according to a threshold, followed by the application of layer-wise relevance propagation to the SNPs and the selection of the most relevant variants. Lastly, a hypothesis test is performed for each variant. In addition, layer-wise relevance propagation yields relevance scores for each variant, and the permutation test supports the selection of novel SNPs based on their p-values. In their report, DeepCOMBI performed better than other methods and identified a higher number of significant SNPs with the lowest error rate.
GenNet [77]
Applying fully connected networks to millions of SNPs requires a large amount of computational time and memory. To overcome these limitations, the developers of GenNet provided a novel framework for predicting phenotype from genotype [77]. GenNet uses neural networks together with prior biological knowledge to create groups of nodes that are connected across layers, reducing the number of learnable parameters compared with a fully connected neural network. Biological knowledge may include information on gene annotation, local and global pathways, exon annotation, chromosome annotation, as well as cell and tissue type expression. In this model, neurons represent biological entities, and the weights signify the effects between neurons, resulting in a biologically interpretable network. This method allows human biological input via a straightforward framework, with the help of two other pieces of software, HASE [84] and ANNOVAR [85], which are embedded to generate the necessary files. The major drawback of the method is that different researchers can annotate layers differently, making standardisation difficult.
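The parameter saving from annotation-guided connectivity can be shown with a small sketch. It does not use the GenNet software or its API: the gene-to-SNP annotation, the weights, the tanh activation and the function name `gene_layer` are all assumptions made for illustration.

```python
import math

def gene_layer(snps, gene_to_snps, weights, biases):
    """Sparse layer: each gene node receives input only from its annotated SNPs."""
    out = {}
    for gene, idx in gene_to_snps.items():
        z = biases[gene] + sum(weights[(gene, i)] * snps[i] for i in idx)
        out[gene] = math.tanh(z)
    return out

# Hypothetical annotation: five SNPs assigned to two genes.
gene_to_snps = {"GENE_A": [0, 1], "GENE_B": [2, 3, 4]}
weights = {(g, i): 0.5 for g, idx in gene_to_snps.items() for i in idx}
biases = {g: 0.0 for g in gene_to_snps}

n_snps = 5
n_sparse = len(weights)                 # one weight per annotated SNP-gene link
n_dense = n_snps * len(gene_to_snps)    # weights of a fully connected layer

activations = gene_layer([0, 1, 2, 2, 2], gene_to_snps, weights, biases)
```

Because each weight links one named SNP to one named gene, a large weight can be read directly as a strong contribution of that SNP to that gene node, which is what makes such a network interpretable. The same masking idea extends to gene-to-pathway layers.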
GMStool [75]
The tool was developed and tested on soybean but can be easily applied to human GWAS with no modification. The overall workflow consists of three phases: preparation, marker selection and final modelling. The preparation phase prepares the input data, namely a genotype matrix, a phenotype file and a GWAS summary statistics file, as the training set. The marker selection phase applies the forward selection method of regression analysis and sequentially selects SNP markers that increase the correlation between observed and predicted phenotypes on the validation set. Ridge regression best linear unbiased prediction and bootstrap trees are provided as learning models. The final modelling phase performs prediction modelling using ridge regression, random forest, deep neural network and convolutional neural network models, using either one of them or all. Unfortunately, the current construction of GMStool requires the use of individual-level data in addition to GWAS summary statistics, limiting the application areas of the method.
Deep Mixed Model [62]
GWAS on moderately or cryptically related individuals use customised methods to correct for relatedness, usually based on either genetic components or mixed models. To account for relatedness in genome-wide deep learning applications, Wang et al. [62] proposed Deep Mixed Model, which consists of two components. The first component corrects for confounding factors using a convolutional neural network, while the second component uses long short-term memory (LSTM) for genetic variant selection. The results from Deep Mixed Model applied to Alzheimer’s disease genome-wide datasets of 1,017 individuals were not directly comparable to the literature because most findings in the GWAS Catalog were obtained through univariate testing methods. Nonetheless, six out of 20 SNPs selected by Deep Mixed Model were associated with Alzheimer’s disease.
GWANN [78]
Ashkenazy et al. (2022) [78] sought to exploit the ability of convolutional neural networks in image recognition by developing and training a method for classifying variants associated with a trait of interest, using genomic data converted to image patterns. The model, named GWANN, was trained using true positive and true negative data corresponding to trait association and then makes predictions in a tested population. GWANN performance deteriorated when the simulated population did not accurately represent the tested data. For example, a minor allele frequency of less than 5% affected the pattern of SNP images, affecting the model’s sensitivity. Therefore, parameters such as minor allele frequency, population structure, population size and sampling rate in the training populations need to be adjusted.
DeepWAS [86]
The multivariate functional unit-wide association study (DeepWAS) was developed with the aim of including only SNPs that have been prioritised based on their risk potential. Genome-wide SNPs are first analysed for their functional roles and their association with specific cell lines and transcription factors using the deep learning model DeepSEA [87]. DeepWAS was able to identify and validate novel disease-associated loci in multiple sclerosis, major depressive disorder and height that could not be identified in GWAS of smaller cohorts. It was also able to identify associations of SNPs within a functional unit relevant to a trait that are typically missed in traditional GWAS. This methodology is suitable for any GWAS dataset for which disease-associated genetic conditions (cell-type effects, chromatin features) and functional data are available. DeepWAS reduces the multiple testing burden of classical GWAS and makes regulatory information at the single-SNP level readily available without requiring a second analysis step.
iMEGES [79]
The Integrated Mental-disorder GEnome Score (iMEGES) was developed as a deep learning tool for analysing whole genome/exome sequencing data, primarily for mental disorders [79]. In the first step, iMEGES prioritises coding and non-coding variants using the tools EIGEN, CADD, DANN, GWAVA and FATHMM, known brain eQTLs from CommonMind, and enhancers/promoters from the PsychENCODE and Roadmap Epigenomics projects. In the second step, genes are prioritised based on the annotations of each variant from the first step.
Table 2 shows an overview of the practical properties of these tools. As they have only been benchmarked internally, parallel evaluations are required to compare their analytical power against each other.