1. Introduction
Networks provide immensely popular platforms that enable nodes to interact and communicate with each other, either freely or under defined conditions. A network in this context refers to an undirected graph in which vertices are connected by edges or links. Link prediction is concerned with finding missing links in static networks, or with predicting the probability of future links from the observation of existing links in dynamic networks [
1,
2]. Empirical studies show that new links between vertices can be predicted from the topology of a network, the properties that characterize this topology, and the evolving dependencies between the interactions of two nodes over time; such predictions can be used to infer social interactions, for example by suggesting possible friends to users, or to identify novel drug candidates from biological networks [
3]. The task of link prediction requires examining the proximity of different pairs of nodes and the type of interactions taking place, in order to know how frequently any two nodes interact; it therefore finds applications in domains such as biological networks and recommender systems [
4]. The growing popularity of these platforms has led to the extensive use of social network data in research across various fields, including sentiment analysis to extract people's opinions from their writings [
5], product recommendation systems based on user relationships, as deployed by the Amazon, Taobao, Jingdong and Alibaba platforms [
6], user interaction studies employing statistical techniques for extracting information features from shared images and textual content [
7], and social relationship analysis of networks spanning the internet, metabolic networks, food webs and neural networks across various scientific and academic disciplines [
8]. One crucial task in social network analysis is link prediction, which aims to identify potential or missing connections between users: the task is to predict missing and future links from the structural information already present in the network. Link prediction is a versatile technique applied across a wide array of domains to forecast potential connections within networks [
9]. In social networks, it supports friend recommendations and community detection, enhancing user engagement on platforms such as Facebook and LinkedIn [
10].
In the field of biomedicine [
11], it predicts protein-protein interactions and genetic correlations of diseases, which help in scientific discoveries and medical research [
12]. E-commerce and streaming services leverage link prediction to suggest products and content, respectively, based on user behavior [
13]. Knowledge graphs use link prediction to infer missing relationships, improving information retrieval and semantic understanding. Additionally, it plays a crucial role in fraud detection by identifying suspicious transaction patterns, and in infrastructure networks it helps improve transportation systems and power grids [
14]. Academic networks benefit from predicting future research collaborations, while telecommunications providers use link prediction for network optimization [
15]. Even in law enforcement, link prediction helps in uncovering hidden connections within criminal networks [
16]. This wide application underscores its importance in enhancing the functionality and efficiency of various complex systems.
The entities in networks may be proteins, neurons or persons, connected together by edges (or links) representing associations. Link prediction can suggest healthcare procedures that improve the survival of patients with fatal diseases, recommend products of interest in online shopping, and identify key actors in criminal investigations [
17,
18]. Recent studies on link prediction in social networks commonly employ two broad approaches: similarity-based methods and learning-based methods. The similarity-based method works on the assumption that nodes with higher similarity scores are more likely to be connected [
19,
20]. It determines the degree of similarity between nodes using a function that incorporates network data, such as topology or node attributes, with relevant weighting scores. This similarity measure is then applied to estimate the likelihood and strength of a link between nodes. The accuracy of the prediction heavily depends on the effective selection of network structure features. The learning-based method builds a model capable of extracting features of interest from the given network topology; in computational biology, for example, machine learning and data mining are applied to genomic, proteomic and epigenomic profiling datasets for drug sensitivity prediction. Such a model is trained on existing information about patients' responses to different drugs, accounting for environmental causes, genetic factors and tumor heterogeneity, and is then used to predict the probability of links between pairs of nodes [
21,
22]. In score-based heuristic methods, the scores of candidate links are used to gauge the closeness of connectivity between two nodes, assessing similarity by considering only the immediate neighbors shared by the two nodes.
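As a concrete illustration of such neighborhood-based scores, the sketch below computes the common-neighbors and Jaccard heuristics on a toy adjacency structure (the node names and edges are illustrative only, not taken from the paper's datasets):

```python
# Toy undirected network as an adjacency dict (illustrative only).
adj = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c"},
}

def common_neighbors(adj, u, v):
    """Score a candidate link (u, v) by the number of shared neighbors."""
    return len(adj[u] & adj[v])

def jaccard(adj, u, v):
    """Shared neighbors normalized by the size of the neighborhood union."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0

# The non-adjacent pair (a, d) shares the two neighbors b and c.
print(common_neighbors(adj, "a", "d"))  # → 2
print(jaccard(adj, "a", "d"))           # → 1.0
```

Both scores use only the immediate neighborhoods of the two endpoints, which is exactly why such heuristics are cheap to compute but blind to longer-range structure.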
In contrast, path similarity methods leverage global structural insights of networks, encompassing paths [
23,
24] and ant colony optimization to predict missing links in communities, to ascertain node similarity. However, structure-based methods rely solely on the topology of the networks. Moreover, they may not always be reliable: different networks can exhibit varying clustering and path lengths while sharing similar degree distributions. Consequently, their performance can differ across networks, making it challenging to effectively capture the underlying topological relationships between nodes.
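To make the contrast with neighborhood-based scores concrete, a representative path-based measure is the Katz index, which sums contributions from walks of every length, damped by a factor β. The pure-Python sketch below truncates the series ∑ βˡAˡ instead of inverting (I − βA), on an illustrative 3-node path graph; the matrix and parameters are assumptions for demonstration:

```python
# Katz index sketch: S = sum over l >= 1 of beta^l * A^l, truncated at max_len.
# beta must be small enough for the series to converge.

def matmul(X, Y):
    """Multiply two square matrices given as lists of lists."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def katz_scores(A, beta=0.1, max_len=6):
    n = len(A)
    S = [[0.0] * n for _ in range(n)]
    P = [row[:] for row in A]          # current power of A, starting at A^1
    w = beta                           # current weight beta^l
    for _ in range(max_len):
        for i in range(n):
            for j in range(n):
                S[i][j] += w * P[i][j]
        P = matmul(P, A)               # next power of A
        w *= beta
    return S

# Path graph 0-1-2: the pair (0, 2) is scored via walks of even length.
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
S = katz_scores(A)
print(round(S[0][2], 6))
```

Because every walk length contributes, the score for (0, 2) is nonzero even though the nodes share no direct edge beyond their common neighbor, which is the global information that purely local heuristics miss.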
In recent years, a significant number of learning-based algorithms, such as graph neural networks (GNNs) and data-driven deep learning methods, have evolved to improve the accuracy of link prediction in various types of networks more efficiently. Learning-based methods have led to graph convolutional network (GCN) and graph attention network (GAT) algorithms, which assign different weights or importance to neighbors, making them among the most sophisticated models for a wide range of applications [
25,
26]. These algorithms focus on extracting essential features from networks by constructing sophisticated models. Since the extracted features form the basis for precisely predicting probable linkages, their quality and relevance have a significant impact on model performance. Thus, feature extraction is one of the most important stages in the link prediction process. It involves identifying and selecting the most informative attributes of the network, which can include node characteristics, topological properties, and interaction patterns. By accurately capturing these features, the models can better understand the underlying structure and dynamics of the network, leading to more precise predictions of currently non-existing links. The success of learning-based link prediction algorithms is therefore largely determined by the robustness and comprehensiveness of the feature extraction stage. In response to this challenge, we propose a framework that revolves around feature extraction and the application of machine learning (ML) techniques to classify potential links into two categories: "will form" (positive) or "will not form" (negative). To achieve this classification, we conducted experiments employing diverse ensemble learning models: Random Forest (RF), Extreme Gradient Boosting (XGBoost), Light Gradient Boosting Machine (LightGBM), CatBoost, ExtraTrees, and AdaBoost.
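To ground this framing, the sketch below shows how candidate links become labeled training examples: existing edges serve as positives and sampled non-edges as negatives, after which any of the ensemble classifiers above can be fitted on X and y. The toy graph and the two pair-level features are illustrative assumptions, not the paper's 14-feature pipeline:

```python
import itertools
import random

# Toy undirected network as an adjacency dict (illustrative only).
adj = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c"},
    "e": set(),
}

# Undirected edge set: the positive ("will form") examples.
edges = {frozenset((u, v)) for u, nbrs in adj.items() for v in nbrs}

def pair_features(u, v):
    """Two simple pair-level features for a candidate link."""
    return [
        len(adj[u] & adj[v]),        # common neighbors
        len(adj[u]) * len(adj[v]),   # preferential attachment
    ]

# Negative ("will not form") examples: non-edges, balanced against positives.
random.seed(0)
non_edges = [p for p in itertools.combinations(sorted(adj), 2)
             if frozenset(p) not in edges]
negatives = random.sample(non_edges, k=min(len(edges), len(non_edges)))

X = [pair_features(*sorted(e)) for e in edges] + \
    [pair_features(u, v) for u, v in negatives]
y = [1] * len(edges) + [0] * len(negatives)
print(len(X), sum(y))  # 10 examples, 5 of them positive
```

A balanced sample of negatives keeps the classifier from trivially predicting the majority "no link" class, which dominates in sparse real networks.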
The significant contributions of this work are:
- To introduce a novel framework for link prediction using machine learning, aiming to predict the likelihood of new connections forming in a network.
- To investigate how extracted features and ensemble learning models impact the effectiveness of the link prediction process.
- To determine the optimal hyperparameter values using the GridSearchCV technique.
- To achieve the highest accuracy by applying machine learning classifiers with the most effective hyperparameter values determined through hyperparameter tuning.
- To evaluate the performance of various ensemble machine learning models using measures such as Accuracy, AUC, Recall, Precision, and F1-score.
The rest of this paper is organized as follows: Section 2 reviews recent research on link prediction within social networks, while Section 3 introduces fundamental theories associated with link prediction methods. Section 4 details the design of the proposed method and presents the experimental findings. Section 5 summarizes and discusses the results. Finally, Section 6 draws conclusions and outlines future work.
2. Related Work
Networks form one data type, consisting of objects of interest as nodes, with edges or links signifying some form of relation between the nodes. Such data are used in fields ranging from the biological sciences to the analysis of terrorist networks. Links can be associated with weights providing a measure of strength for each connection, and any two nodes may or may not be connected. In recent years, the scientific community has intensively studied the link prediction problem in networks, particularly those evolving through the addition and removal of nodes over time, leading to numerous algorithms based on similarity methods. However, there remains room for improving these approaches.
Currently, machine learning has significantly contributed to the development of several advanced link prediction methods. For example, the authors in [
27] present an innovative approach for enhancing link prediction in social networks by leveraging an ensemble of machine learning techniques. The authors investigate the limitations of existing link prediction methods, framing the problem as identifying potential connections between nodes in order to forecast how the structure will grow. They propose a novel ensemble framework that combines multiple machine learning algorithms to improve link prediction accuracy. Through extensive experiments and analysis, the study demonstrates the effectiveness of the proposed ensemble method in various social network scenarios. The results highlight the potential of ensemble techniques to address challenges in link prediction and contribute to more robust and accurate social network analysis. In a similar previous work [
28], the authors investigated a supervised approach to link prediction in social networks using embedding-based methods. That paper presents a novel technique that utilizes network embedding to capture the structural properties of social networks and to improve the accuracy of link prediction, both for detecting whether a link currently exists and for anticipating whether a link will appear between two nodes in the future; in dynamic networks, such future links can be either periodic or non-periodic.
By embedding the nodes of the social network into a continuous vector space, the method allows for more effective prediction of future links based on the learned representations. The study demonstrates the effectiveness of the embedding-based approach through experiments on various social network datasets, showcasing its potential to outperform traditional link prediction methods. Another important approach is given by the authors in [
29] who propose a novel link prediction approach in complex networks by integrating recursive feature elimination (RFE) and stacking ensemble learning. This method utilizes RFE to select the most relevant features and employs a two-level stacking ensemble model combining logistic regression, gradient boosting decision tree (GBDT), and XGBoost as foundational classifiers. Their approach leverages both global and local topological information to enhance the prediction accuracy and robustness across various network datasets.
The authors in [
30] propose a novel approach that combines the concept of "mean received resources" with various machine learning techniques to measure the similarity between nodes, improving the accuracy of link prediction models through network structure and node characteristics. They argue that traditional link prediction methods often overlook resource-related factors that can significantly influence the formation of links in social networks. By incorporating these metrics into machine learning algorithms, the study aims to address the limitations of existing deep-learning-based prediction models and to provide a more nuanced and effective prediction model. Another significant contribution to this field comes from the authors in [
31], who discuss dynamic network link prediction, which has emerged as a powerful tool in various fields. The authors investigate link prediction techniques through supervised learning approaches, examining methods for forecasting link formation in both single-layer and multiplex networks. While single-layer networks involve a single type of connection, multiplex networks feature multiple types or layers of connections. This study emphasizes how applying supervised learning models to these diverse network structures can enhance the accuracy of link prediction. Moreover, Ghorbanzadeh et al. [
32] introduced an innovative method that combines multiple techniques to improve the accuracy of predicting future links in directed graphs, where edges have a specific direction. Their approach integrates various prediction strategies to capitalize on different aspects of the graph’s structure and dynamics. By merging the strengths of these individual techniques, this method aims to enhance overall prediction performance, as demonstrated through experiments on directed graph datasets.
The authors in [
33] investigated a supervised link prediction method that uses structure-based feature extraction in social networks. They proposed extracting features derived from the network's structure to improve the prediction of future links. By emphasizing these structural features, their method seeks to enhance the accuracy of link prediction in social networks. The effectiveness of the approach is demonstrated through experiments conducted on various social network datasets.
Further, the authors in [
34] have explored link prediction in multiplex networks by employing recursive feature elimination with random forests to select representative and relevant structural features of the networks, and by using a stacking method to enhance the model's prediction results. The authors developed a method specifically designed to predict links in networks with multiple layers or connection types, employing logistic regression (LR), gradient boosting decision tree (GBDT), and XGBoost as the base models and XGBoost as the top-level model. Their approach leverages supervised learning to effectively navigate the complexities of multiplex networks and boost the accuracy of link predictions. The study validates the effectiveness of this strategy through extensive experimentation on multiplex network datasets. Different performance metrics are used to evaluate the methods discussed in this literature review.
Table 1 provides a summary of the best classifier for each approach based on the selected metrics.
4. Experimental Results and Discussion
The experimental results and comprehensive analysis described in this section are aimed at validating the effectiveness of the proposed approach. The experimental setup is outlined, detailing the procedures and the tools used. This is followed by an examination of feature importance across models, which provides insights into the features most significantly influencing model predictions. Analyzing these features allows us to better understand their role in model performance. Finally, an extensive discussion of the results is provided.
4.1. Experimental Setup
This section explains how the experiments were conducted. The study involved the development of machine learning classifiers and the incorporation of feature selection methods using Python scripts. All experiments were performed on a computer with 16 GB of RAM and an Intel Core i5 CPU running Windows 11. The algorithms were tested in the Google Colab environment. A hyperparameter tuning step was performed to enhance the accuracy of the approach by selecting the best parameters for each algorithm. The GridSearchCV method was used to optimize the hyperparameters of each model; it automates the process of selecting the best combination of parameters for a given algorithm by exhaustively searching through a specified parameter grid.
Table 3 displays the best hyperparameter combinations for the algorithms used with the Twitch and Facebook datasets. The steps involved in GridSearchCV are outlined below:
Define the Model: Select the machine learning algorithm to optimize.
Create the Parameter Grid: Specify the parameters and their ranges to test, typically using a dictionary where the keys are parameter names and the values are lists of possible values.
Configure GridSearchCV: Initialize the GridSearchCV object with the model, parameter grid, and options like the number of cross-validation folds.
Fit the Model: Train the model on the training data, cross-validating every combination in the specified parameter grid.
Evaluate the Results: Analyze the results to identify the best parameters and to evaluate the model’s performance with those parameters.
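The steps above can be sketched as follows. This is a minimal example on synthetic data with an illustrative two-parameter grid, not the grids reported in Table 3; scikit-learn is assumed to be installed:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Step 1: define the model, here on a synthetic binary dataset.
X, y = make_classification(n_samples=200, n_features=6, random_state=42)
model = RandomForestClassifier(random_state=42)

# Step 2: parameter grid — keys are parameter names, values are candidates.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

# Step 3: configure the exhaustive search with cross-validation.
search = GridSearchCV(model, param_grid, cv=5, scoring="accuracy")

# Step 4: fit; every grid combination is scored by 5-fold cross-validation.
search.fit(X, y)

# Step 5: inspect the best combination and its cross-validated score.
print(search.best_params_)
print(round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` holds a model refitted on the full training data with the winning parameters, ready for evaluation on held-out data.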
4.2. Feature Importance Models
"Feature importance" encompasses a range of techniques designed to assign a significance score to each input parameter based on its ability to predict a target variable [
40]. These scores are integral to a predictive modeling project for several reasons. Firstly, they provide valuable insights into the dataset, revealing which features have the greatest impact on the prediction results. This understanding can inform data preprocessing steps, such as cleaning and transforming data, to improve model performance. Secondly, feature importance scores shed light on the inner workings of the model itself. By identifying the features the model relies on most, practitioners can gain a deeper understanding of the model’s decision-making process. This aids model interpretation, making it easier to explain and justify predictions to stakeholders. The significance of the 14 features across the six models employed in our approach is described in this section.
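As an illustration of how such scores are obtained from a fitted ensemble, the sketch below ranks impurity-based importances on synthetic data; the feature names merely echo a few of the paper's features for readability and are not tied to the real datasets (scikit-learn is assumed):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Fit an ensemble on synthetic data and rank impurity-based importances.
names = ["Followers_Dst", "Followees_Src", "Page Rank_Dst", "Shortest Path"]
X, y = make_classification(n_samples=300, n_features=4, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Importances sum to 1; higher means more influence on the predictions.
ranked = sorted(zip(names, model.feature_importances_),
                key=lambda t: t[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

The same `feature_importances_` attribute is exposed by the other tree ensembles used here (ExtraTrees, AdaBoost, gradient-boosting variants), which is what makes the cross-model comparison in the figures possible.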
Figure 4 illustrates the significance of various features in the Twitch dataset.
Specifically,
Figure 4(a),
Figure 4(b), and
Figure 4(f) show that Followers_Dst and Followees_Src are highly significant in the RF, XGBoost, and AdaBoost models. These features consistently rank at the top across these models, underscoring their substantial impact on model performance. On the contrary, the LightGBM and CatBoost models emphasize the importance of Followees_Src and Page Rank_Dst, as depicted in
Figure 4(c) and
Figure 4(e). These features are critical to the prediction accuracy of LightGBM and CatBoost. In contrast, the Extra Trees model, as shown in
Figure 4(d), identifies katz_centrality and Shortest Path as the most influential features, highlighting their unique contribution to the model’s performance.
Similarly,
Figure 5 demonstrates the importance of various features in the Facebook dataset.
Figure 5(a),
Figure 5(b), and
Figure 5(d) reveal that Shortest Path and katz_centrality are highly significant in the RF, XGBoost, and Extra Trees models. These features consistently rank at the top, indicating their substantial impact on these models. On the other hand, the LightGBM and AdaBoost models prioritize Page Rank_Dst and Page Rank_Src, as shown in
Figure 5(c) and
Figure 5(f) highlighting their critical role in these models’ prediction accuracy. For the CatBoost model,
Figure 5(e) showcases katz_centrality and Followees_Src as the most influential features, suggesting their unique contribution to the model’s performance compared to the others.
4.3. Results
The performance metrics discussed in subsection 3.4 are used here to present the results of the classification models. The developed models are trained on the training dataset and validated on new data to check whether they suffer from overfitting.
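For reference, the metrics reported below can be computed as in the following sketch, using made-up labels and scores rather than the paper's predictions (scikit-learn is assumed):

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Hypothetical ground truth, hard predictions, and predicted probabilities.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
y_prob = [0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3]

print("Accuracy :", accuracy_score(y_true, y_pred))   # fraction correct
print("AUC      :", roc_auc_score(y_true, y_prob))    # needs scores, not labels
print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("F1-score :", f1_score(y_true, y_pred))         # harmonic mean of P and R
```

Note that AUC is computed from predicted probabilities (or decision scores), while the other four metrics operate on thresholded class labels.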
Figure 6 illustrates the precision of the classifiers before and after hyperparameter tuning on the Twitch and Facebook datasets. The experimental results clearly show that hyperparameter tuning enhanced the precision of most classifiers. These results were achieved by adjusting the hyperparameters to optimize precision; using 10-fold cross-validation, the precision of different hyperparameter combinations was evaluated. Most classifiers, including RF, XGBoost, CatBoost, ExtraTrees, and AdaBoost, improved in precision with hyperparameter tuning, while the precision of LightGBM remained unchanged on the Twitch dataset.
Table 4 provides a comparative analysis of the performance of different algorithms across multiple metrics, including Accuracy, AUC, Precision, Recall, and F1-Score. This evaluation reveals subtle variations in performance between classifiers when applied to the Twitch and Facebook datasets.
For the Twitch dataset, all classifiers performed exceptionally well, with XGBoost and CatBoost slightly outperforming the others in accuracy (0.968) and AUC (0.993). Random Forest, LightGBM, and AdaBoost also demonstrated strong results, with accuracy scores of 0.967 and comparable AUC values, indicating high reliability in class discrimination. It is worth noting that the precision scores for all classifiers were very high, ranging from 0.981 to 0.985, suggesting a low rate of false positives. The F1-scores were consistently around 0.968, reflecting balanced precision and recall. Although ExtraTrees still performed well, it had a marginally lower AUC of 0.981, indicating a slightly weaker but still robust discrimination capability.
In contrast, the Facebook dataset presented a greater challenge, resulting in lower overall accuracy scores across all classifiers. XGBoost again delivered strong results, with an accuracy of 0.921 and the highest AUC of 0.976, indicating superior performance in class separation. CatBoost and LightGBM followed closely, maintaining high AUC values of 0.974 and 0.972, respectively, with balanced precision and recall, resulting in strong F1-scores (around 0.930). Random Forest, while having a high AUC of 0.972, showed a slightly lower accuracy of 0.911 and an F1-score of 0.923, suggesting it had more difficulty with the Facebook dataset than with the Twitch dataset. ExtraTrees and AdaBoost had the lowest accuracy scores of 0.906 and 0.902, respectively, with AdaBoost showing a notably lower AUC of 0.942. Despite this, the precision scores remained relatively high, but the recall was slightly lower, reflecting a higher rate of missed positive instances. Overall, while all classifiers exhibited efficient performance, XGBoost and CatBoost consistently achieved the best results across both datasets, and the Facebook dataset proved more challenging to classify accurately than the Twitch dataset. The classifier performance in terms of the Area Under the Curve (AUC) is visualized in
Figure 7 and
Figure 8.
4.4. Discussion
The results of the analysis of the Twitch and Facebook datasets confirm the profound impact of ensemble learning models and feature selection on model performance. Ensemble methods, including XGBoost, CatBoost, and LightGBM, consistently outperformed other classifiers across both datasets, demonstrating their ability to improve accuracy, AUC, precision, and recall. These models leverage the strengths of multiple base learners to create a more powerful predictive system, effectively capturing complex data patterns and mitigating issues such as overfitting. For instance, both XGBoost and CatBoost achieved the highest accuracy and AUC scores, indicating their exceptional capability in distinguishing between classes and handling diverse data complexities.
Feature selection further amplifies the effectiveness of these models by selecting and retaining only the most relevant predictors. This process eliminates noise and redundant features, enhancing model accuracy and interpretability while reducing computational load. In the context of the Facebook dataset, which posed a greater challenge, the high AUC scores and balanced metrics achieved by the ensemble methods suggest that thoughtful feature selection played a critical role in managing increased complexity and noise.
Overall, the synergy between ensemble learning and effective feature selection leads to models that are not only highly accurate, but also efficient and interpretable. This balanced approach is essential to address complex real-world problems, ensuring that models are both robust and practical for deployment.