1. Introduction
In recent years, predicting student performance has become a pivotal aspect of educational data mining and learning analytics. The ability to accurately forecast student outcomes allows educators and administrators to identify at-risk students, tailor interventions, and enhance overall educational effectiveness [1]. Early identification of students who might struggle academically is crucial, as it enables timely support and resources to be provided, thereby improving retention rates and overall student success [2,3]. Traditional methods of predicting student performance often relied on simple statistical techniques and historical data. However, these methods frequently fell short due to their inability to handle the complex, high-dimensional data sets commonly found in educational environments [3].
With the advent of advanced machine learning techniques, ensemble models have emerged as powerful tools for making such predictions, leveraging the strengths of multiple algorithms to improve accuracy and robustness. Ensemble learning, which combines multiple learning algorithms to obtain better predictive performance than any of the constituent models alone, addresses many of the limitations of traditional methods. Such models work on the principle that a group of weak learners can come together to form a strong learner, thereby improving overall predictive performance [4,5,6].
Ensemble methods such as Random Forests, Gradient Boosting, and Stacking are particularly well-suited for educational data mining. Random Forests, for instance, can handle large datasets with numerous features, making them ideal for analyzing complex educational data. They work by constructing multiple decision trees during training and outputting the class predicted most often by the individual trees. Gradient Boosting, on the other hand, builds models sequentially, with each new model attempting to correct the errors made by the previous ones. This method is highly effective in improving predictive accuracy and reducing overfitting. Stacking involves training a new model to combine the predictions of several base models, which allows it to leverage the strengths and compensate for the weaknesses of each base model [7,8,9,10,11,12,13].
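To make these mechanics concrete, the sketch below shows how the first two ensemble families could be instantiated with scikit-learn; the hyperparameters are illustrative placeholders rather than the configuration used in this study (a stacking sketch follows later in this section).

```python
# Hedged sketch: Random Forest and Gradient Boosting built with scikit-learn.
# Hyperparameters are placeholders, not this study's settings.
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Random Forest: many decision trees trained on bootstrap samples;
# the predicted class is the one chosen by the majority of trees.
random_forest = RandomForestClassifier(n_estimators=200, random_state=42)

# Gradient Boosting: trees are added sequentially, each one fitted to
# the errors of the ensemble built so far.
gradient_boosting = GradientBoostingClassifier(
    n_estimators=200, learning_rate=0.1, random_state=42
)
```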
Many studies have demonstrated the success of ensemble methods for student performance prediction. For example, ensembles of models have been shown to substantially outperform single-model approaches in both accuracy and reliability when predicting student success and identifying at-risk students. These models become even more predictive when combined with other sources of data, such as demographic information, academic performance, and interaction data from learning management systems. At a basic level, this integration is the essence of learning analytics, whose goal is to optimize learning and learning environments by understanding and analyzing educational data [14,15,16,17].
The use of these models in an educational context builds on the wider field of learning analytics, in which data about learners and their contexts are collected, analysed, and reported in order to understand and optimize learning and the environments in which it occurs. [16] defines learning analytics as “the measurement, collection, analysis and reporting of data about learners and their contexts, for purposes of understanding and optimizing learning and the environments in which it occurs.” The field is also capitalizing on Big Data to improve educational outcomes by producing information that can drive decision-making processes at institutional levels [18].
Learning analytics combines multiple sources of data, especially academic records, demographic information, and data produced from interactions with learning management systems, to help educators understand student behaviour and performance. Academic records, for instance, contain historical performance data that can help predict outcomes [19]. Disaggregating data by demographic characteristics (e.g., age, gender, socioeconomic status, and educational background) provides additional context, facilitating more nuanced analysis and focused interventions [20]. Traces of students’ interactions within a learning management system, such as how often they log in, the time they spend in different modules, how they participate in online discussions, or when they submit their assignments, can be used to build models of student engagement and, in doing so, distinguish the engagement patterns of students at risk of failure [1].
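As a rough illustration (not the exact feature set used in this paper), interaction traces of this kind can be aggregated into per-student engagement features. The sketch below assumes an OULAD-style clickstream table (studentVle.csv) with code_module, code_presentation, id_student, date, and sum_click columns; the particular features chosen are illustrative only.

```python
# Hedged sketch: aggregate raw VLE clickstream logs into per-student
# engagement features. Column names follow the public OULAD studentVle
# table; the file path and feature choices are illustrative.
import pandas as pd

clicks = pd.read_csv("studentVle.csv")

keys = ["code_module", "code_presentation", "id_student"]
features = clicks.groupby(keys).agg(
    total_clicks=("sum_click", "sum"),     # overall volume of activity
    active_days=("date", "nunique"),       # distinct days with any activity
    mean_clicks=("sum_click", "mean"),     # average clicks per interaction record
    first_active_day=("date", "min"),      # how early the student engaged
    last_active_day=("date", "max"),       # how recently the student engaged
).reset_index()
```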
Moreover, including data from various sources in the creation of these models makes it possible to tackle a variety of educational problems. For example, predictive models can flag that a student is likely to drop out, allowing timely intervention to retain the student [21]. Analytics can also be applied to create personalized learning experiences, adjusting content and methods of instruction to meet the specific needs of each individual learner [22]. This is crucial, for example, in adaptive learning technologies, which improve student engagement by modifying content delivery on the fly based on data-driven insights into student performance and engagement [23].
Learning analytics is designed to turn such data into actionable intelligence that can be used to improve learning. With the right information about what drives student success, educators can design curricula more effectively, provide appropriate support, and cultivate learning environments built around academic accomplishment. The insights produced by learning analytics thus feed into the larger picture of a student’s success or failure and can be aligned with broader institutional goals, such as improving retention rates, increasing student satisfaction, and sustaining academic excellence [24].
Many studies have demonstrated the effectiveness of ensemble models in predicting student performance across multiple educational environments, showing both their widespread applicability and their robustness. In the studies by [14,15], Random Forests accurately identified students at risk of dropping out based on academic performance and engagement metrics, achieving significantly higher predictive accuracy than other models. Similarly, [25] utilized Random Forests to analyze engineering student data, successfully reducing dropout rates by identifying at-risk students early and enabling timely interventions.
Gradient Boosting, another powerful ensemble technique, has also been used successfully to predict whether students are likely to fail a course. For example, [26] investigated whether Gradient Boosting can help identify at-risk students in online learning environments; their research showed that Gradient Boosting achieved higher accuracy than standard classification techniques. In addition, [27] used Gradient Boosting to predict student performance in MOOCs (Massive Open Online Courses) with high accuracy, supporting more personalized learning experiences.
Because educational data mining tasks are often characterized by complex and voluminous data, the efficacy of these models is favoured by their ability to process large datasets with high-dimensional features. For example, [10] pointed out that Random Forests handle large feature sets effectively and resist overfitting, an important characteristic for educational datasets, which typically include varied inputs such as demographic information, prior academic records, and real-time interaction data from learning management systems. Additionally, the iterative nature of Gradient Boosting during model building provides the flexibility to capture complex patterns in student behaviour and performance, which is critical for accurate predictions in dynamic educational scenarios.
Additionally, Stacking, which combines multiple models to leverage their individual strengths, has proven effective in educational contexts. [11] introduced the concept of Stacked Generalization, which has since been applied in various domains, including education. By combining models like Random Forests, Gradient Boosting, and others, Stacking achieves superior predictive performance and provides more reliable insights into student outcomes. This approach has been particularly useful in predicting multifaceted educational phenomena, such as student retention and course completion rates. Moreover, the interpretability and reliability of ensemble models are particularly beneficial in educational contexts, where decisions based on predictive analytics can have significant impacts on students’ academic trajectories. The combination of high predictive accuracy and the capacity to provide actionable insights makes ensemble models a valuable asset in the toolkit of educational institutions.
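A minimal sketch of stacked generalization with scikit-learn is given below. The base learners mirror the families compared in this paper, but the specific meta-learner, hyperparameters, and cross-validation setting are assumptions for illustration only.

```python
# Hedged sketch: stacked generalization. Out-of-fold predictions from the
# base models become the training inputs of a meta-learner.
from sklearn.ensemble import (RandomForestClassifier, GradientBoostingClassifier,
                              ExtraTreesClassifier, AdaBoostClassifier,
                              BaggingClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression

base_learners = [
    ("rf", RandomForestClassifier(n_estimators=200, random_state=42)),
    ("gb", GradientBoostingClassifier(random_state=42)),
    ("et", ExtraTreesClassifier(n_estimators=200, random_state=42)),
    ("ada", AdaBoostClassifier(random_state=42)),
    ("bag", BaggingClassifier(random_state=42)),
]

stacking = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(max_iter=1000),  # illustrative meta-learner
    cv=5,  # out-of-fold predictions used to train the meta-learner
)
# Usage: stacking.fit(X_train, y_train); stacking.predict(X_test)
```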
Therefore, the objective of this paper is to explore the utilization of ensemble models and learning analytics techniques to predict student academic performance. The focus of this study is to harness the predictive power of ensemble models, which combine multiple algorithms to enhance accuracy and robustness. Specifically, the performance of various ensemble models is analyzed and compared using the Open University Learning Analytics Dataset. By evaluating these models across different classification scenarios, this study seeks to identify the most effective method for predicting student performance, with a particular focus on the stacking model.
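As a hedged sketch of how a modelling table could be assembled from the Open University Learning Analytics Dataset, the snippet below joins the standard OULAD studentInfo.csv file (demographics plus the final_result label) with the per-student engagement features sketched earlier; the actual preprocessing pipeline used in this study may differ.

```python
# Hedged sketch: build a feature matrix X and label vector y from OULAD.
# `features` is the engagement table sketched earlier; encoding is naive.
import pandas as pd

students = pd.read_csv("studentInfo.csv")   # demographics + final_result label
keys = ["code_module", "code_presentation", "id_student"]
data = students.merge(features, on=keys, how="inner")

y = data["final_result"]                    # Distinction / Pass / Fail / Withdrawn
X = pd.get_dummies(data.drop(columns=["final_result", "id_student"]))
```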
3. Results and Discussions
The results in Table 1, Table 2, and Table 3 illustrate the performance of the various ensemble models in predicting student outcomes across different classification schemes. The evaluation metrics used to measure performance are Precision, Recall, F1 Score, Accuracy, and AUC (Area Under the Curve), which together provide a comprehensive view of the models’ effectiveness in different scenarios. For the purpose of comprehensive model building, the performance of the various models was observed across three classification scenarios. Scenario 1 involves predicting four classes of student performance (Distinction, Fail, Pass, and Withdrawn); Scenario 2 involves predicting three classes (Pass & Distinction, Fail, and Withdrawn); and Scenario 3 involves predicting two classes (Pass & Distinction and Fail & Withdrawn).
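As a hedged sketch of how these scenarios and metrics could be realised in code, the snippet below collapses the four original OULAD outcomes into the Scenario 2 and Scenario 3 label schemes and computes the five reported metrics; the "weighted" averaging is an assumption, since the averaging scheme is not restated here.

```python
# Hedged sketch: scenario label mappings and the five evaluation metrics.
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Scenario 1 uses the raw labels: Distinction, Fail, Pass, Withdrawn.
three_class_map = {"Distinction": "Pass & Distinction", "Pass": "Pass & Distinction",
                   "Fail": "Fail", "Withdrawn": "Withdrawn"}                       # Scenario 2
two_class_map   = {"Distinction": "Pass & Distinction", "Pass": "Pass & Distinction",
                   "Fail": "Fail & Withdrawn", "Withdrawn": "Fail & Withdrawn"}    # Scenario 3

def evaluate(y_true, y_pred, y_proba, class_order):
    """Return the five metrics reported in Tables 1-3 for one fitted model."""
    return {
        "Precision": precision_score(y_true, y_pred, average="weighted"),
        "Recall":    recall_score(y_true, y_pred, average="weighted"),
        "F1 Score":  f1_score(y_true, y_pred, average="weighted"),
        "Accuracy":  accuracy_score(y_true, y_pred),
        # One-vs-rest AUC from class probabilities (columns follow class_order);
        # for the two-class scenario, pass the positive-class column instead.
        "AUC": roc_auc_score(y_true, y_proba, multi_class="ovr",
                             average="weighted", labels=class_order),
    }
```

For example, `y.map(three_class_map)` produces the Scenario 2 labels, after which each model is refitted and evaluated on the relabelled data.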
Table 1 presents the evaluation metrics for models predicting four classes of student results: Distinction, Fail, Pass, and Withdrawn. The Stacking model exhibits the highest performance across most metrics, with a Precision of 83%, Recall of 81%, F1 Score of 81%, Accuracy of 81%, and an AUC of 96%. This indicates that the Stacking model is the most effective in handling the complexity of four distinct classes. In contrast, the AdaBoost model performs the poorest, with the lowest metrics in all categories, especially an F1 Score of 46% and an Accuracy of 50%. This suggests that AdaBoost struggles with the multi-class classification problem.
Figure 2 and Figure 3 show the confusion matrices and ROC curves of the models for the four classes (Distinction, Fail, Pass, Withdrawn) of the students’ final results. Figure 2 shows that all the models consistently predict the correct classes except Gradient Boosting and AdaBoost, which return poor predictions for students in the Fail category. Similar results are observed in the ROC curves for the Fail category in Figure 3 for Gradient Boosting and AdaBoost.
Table 2 summarizes the performance of the models for three classes: Fail, Pass & Distinction, and Withdrawn. The Stacking model again leads with 88% across all metrics and an AUC of 97%, showing strong predictive capability. Random Forest and ExtraTrees also perform well, achieving 87% and 85%, respectively, across Precision, Recall, F1 Score, and Accuracy, each with an AUC of 96%. AdaBoost shows improvement compared to the four-class scenario but remains the weakest, with an F1 Score of 57% and an Accuracy of 63%. This indicates that reducing the number of classes improves model performance, but not uniformly across all models.
Figure 4 and Figure 5 show the confusion matrices and ROC curves of the models for the three classes (Fail, Pass & Distinction, Withdrawn). Figure 4 shows that all the models consistently predict the correct classes except Gradient Boosting and AdaBoost, which return poor predictions for students in the Fail category. Similar results are observed in the ROC curves for the Fail category in Figure 5 for Gradient Boosting and AdaBoost.
Table 3 displays results for the simplest classification, dividing the outcomes into two classes: Pass & Distinction versus Fail & Withdrawn. Here, all models show significant performance improvements. The Stacking and Random Forest models both achieve 96% on all four classification metrics, with an AUC of 99%. This near-perfect performance indicates that the models are highly effective when the classification task is simplified. ExtraTrees and Bagging also perform remarkably well, with metrics around 94-95% and AUCs of 98-99%. AdaBoost, while still the lowest performer, achieves a relatively high F1 Score of 85% and an Accuracy of 86%, demonstrating considerable improvement in the binary classification context.
The results indicate that model performance varies significantly depending on the complexity of the classification task. In the four-class scenario, the complexity is higher, and only the Stacking model manages to maintain high performance, while others like AdaBoost fall behind. As the number of classes decreases, all models show improvement, with the two-class scenario yielding the highest metrics across the board. This trend suggests that ensemble models handle binary classification tasks more effectively, likely due to reduced complexity and clearer distinctions between the classes.
Stacking emerges as the most robust model across all scenarios, consistently achieving the highest or near-highest scores. This model’s ability to combine the strengths of various base learners likely contributes to its superior performance. Conversely, AdaBoost’s relatively poor performance, especially in multi-class scenarios, indicates a potential limitation in its ability to handle more complex classification tasks.
Figure 6 and Figure 7 show the confusion matrices and ROC curves of the models for the two classes (Pass & Distinction, Fail & Withdrawn) of the students’ final results. Figure 6 shows that all the models consistently predict the correct classes except Gradient Boosting and AdaBoost, which return only moderate predictions for students in the Fail & Withdrawn category. Similar results are observed in the ROC curves for the Fail & Withdrawn category in Figure 7 for Gradient Boosting and AdaBoost.
4. Conclusions
The evaluation of various ensemble models across different classification scenarios reveals several key findings. The proposed stacking model consistently emerges as the best-performing model, excelling in both multi-class and binary classification tasks. Random Forest and ExtraTrees also demonstrate high effectiveness, particularly in binary and reduced class scenarios, where they achieve near-perfect metrics. Bagging performs commendably across all scenarios, though slightly below the top performers. On the other hand, Gradient Boosting and AdaBoost show weaker performance, particularly in multi-class scenarios, though they improve considerably as the classification task is simplified.
Overall, the results indicate that while complex classification tasks present challenges, ensemble models like Stacking, Random Forest, and ExtraTrees can achieve high accuracy and reliability. Simplifying the classification task generally enhances model performance, suggesting a potential strategy for improving predictive accuracy by reducing the complexity of classification schemes. These findings highlight the importance of selecting appropriate ensemble methods and optimizing classification granularity to achieve the best predictive outcomes for student performance analysis.
Figure 1. Open University Learning Analytics Dataset Schema.
Figure 2. Confusion matrices of the various machine learning models for the four classes (Distinction, Fail, Pass, Withdrawn) of the students’ final results.
Figure 3. Receiver operating characteristic (ROC) curves of the various machine learning models for the four classes (Distinction, Fail, Pass, Withdrawn) of the students’ final results.
Figure 4. Confusion matrices of the various machine learning models for the three classes (Fail, Pass & Distinction, Withdrawn) of the students’ final results.
Figure 5. Receiver operating characteristic (ROC) curves of the various machine learning models for the three classes (Fail, Pass & Distinction, Withdrawn) of the students’ final results.
Figure 6. Confusion matrices of the various machine learning models for the two classes (Pass & Distinction, Fail & Withdrawn) of the students’ final results.
Figure 7. Receiver operating characteristic (ROC) curves of the various machine learning models for the two classes (Pass & Distinction, Fail & Withdrawn) of the students’ final results.
Table 1. Evaluation metrics of the various ensemble models’ predictions for the four classes (Distinction, Fail, Pass, Withdrawn) of the students’ final results.
| Models | Precision | Recall | F1 Score | Accuracy | AUC |
|---|---|---|---|---|---|
| Random Forest | 79% | 79% | 79% | 79% | 95% |
| Gradient Boosting | 58% | 58% | 56% | 58% | 84% |
| ExtraTrees | 78% | 78% | 78% | 78% | 95% |
| AdaBoost | 49% | 51% | 46% | 50% | 79% |
| Bagging | 77% | 77% | 77% | 77% | 94% |
| Stacking | 83% | 81% | 81% | 81% | 96% |
Table 2. Evaluation metrics of the various ensemble models’ predictions for the three classes (Fail, Pass & Distinction, Withdrawn) of the students’ final results.
| Models | Precision | Recall | F1 Score | Accuracy | AUC |
|---|---|---|---|---|---|
| Random Forest | 87% | 87% | 87% | 87% | 96% |
| Gradient Boosting | 65% | 67% | 63% | 66% | 83% |
| ExtraTrees | 85% | 85% | 85% | 85% | 96% |
| AdaBoost | 59% | 63% | 57% | 63% | 78% |
| Bagging | 84% | 84% | 84% | 84% | 95% |
| Stacking | 88% | 88% | 88% | 88% | 97% |
Table 3. Evaluation metrics of the various ensemble models’ predictions for the two classes (Pass & Distinction, Fail & Withdrawn) of the students’ final results.
| Models | Precision | Recall | F1 Score | Accuracy | AUC |
|---|---|---|---|---|---|
| Random Forest | 96% | 96% | 96% | 96% | 99% |
| Gradient Boosting | 88% | 87% | 87% | 87% | 92% |
| ExtraTrees | 95% | 95% | 95% | 95% | 99% |
| AdaBoost | 87% | 86% | 85% | 86% | 91% |
| Bagging | 94% | 94% | 94% | 94% | 98% |
| Stacking | 96% | 96% | 96% | 96% | 99% |