3.2. Dataset Creation
The typical student performance dataset revolves around two key variables: Continuous Assessment (CA) and Examination (Exam) scores. These metrics have long been the sole basis for evaluating students' academic achievement. However, this research introduces a paradigm shift by expanding the dimensions of this performance evaluation framework. By integrating additional variables such as Attendance, Demeanor, Practical skills, Class Participation, and Presentation quality, this study pioneers a more comprehensive approach to assessing student success.
Because a dataset capturing the above-mentioned variables is not readily available, a new one had to be created. To achieve this, a suitable mathematical formulation was developed and then translated into a computer algorithm for generating such a higher-dimensional dataset, as shown in this section.
Let x1, x2, x3, ..., x7 denote Continuous Assessment, Practical, Demeanor, Presentation, Attendance, Participation in class, and Examination, respectively, where x1, x2, ..., x6 can each take on 10 distinct values (i.e., 10 marks each) while x7 can take on 40 distinct values (i.e., 40 marks). Consequently, the sum of all these variables cannot exceed 100.
To find the total number of combinations, multiply the number of possibilities for each variable:

Total combinations = 10 × 10 × 10 × 10 × 10 × 10 × 40 = 10^6 × 40 = 40,000,000
So, the total number of data points would be 40,000,000; that is, the new dataset containing all the above features would have 40,000,000 records, which is sufficient to train the proposed model. This formulation was implemented in Python, and the code snippet is displayed in Algorithm 1.
Algorithm 1: Python code for dataset generation
import itertools
import csv

# Ranges of marks for each variable: x1 to x6 take 10 distinct values,
# x7 takes 40, giving 10**6 * 40 = 40,000,000 combinations in total
range_values = range(1, 11)  # For x1 to x6
range_x7 = range(1, 41)      # For x7

# Generate all combinations of x1 to x6, then extend each with x7
# (generators are used so the 40,000,000 rows are streamed, not held in memory)
combinations = itertools.product(range_values, repeat=6)
combinations_with_x7 = ((c + (x7,)) for c in combinations for x7 in range_x7)

# Append the total score (sum of x1 to x7) to each combination
combinations_with_total = ((c + (sum(c),)) for c in combinations_with_x7)

# Specify the output file name
file_name = "resultPredictionDataset.csv"

# Write combinations with their totals to the CSV file
with open(file_name, 'w', newline='') as csvfile:
    csvwriter = csv.writer(csvfile)
    # Write header row
    csvwriter.writerow(["x1", "x2", "x3", "x4", "x5", "x6", "x7", "total"])
    # Write data rows
    csvwriter.writerows(combinations_with_total)

print(f"Final Dataset Generated Successfully to {file_name}")
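Note that Algorithm 1 writes the seven feature columns and the total only; the remarks class described in Section 3.2.1 is derived from the total afterwards. A minimal sketch of one such derivation, assuming five conventional grade bands (the exact thresholds behind the published dataset are not stated here):

import pandas as pd

# Load the generated dataset
df = pd.read_csv("resultPredictionDataset.csv")

# Hypothetical banding: map total (out of 100) to classes 1-5.
# The thresholds below are illustrative, not the published ones.
df["remarks"] = pd.cut(df["total"],
                       bins=[0, 39, 49, 59, 69, 100],
                       labels=[1, 2, 3, 4, 5],
                       include_lowest=True).astype(int)
df.to_csv("resultPredictionDataset.csv", index=False)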
3.2.1. Dataset Description
The dataset generated by Algorithm 1, which has been published in a Kaggle repository, contains 40,000,000 records, each detailing various aspects important for evaluating student academic performance. These records include Continuous Assessment (CA), Practical skills proficiency, Demeanor, Presentation quality, Attendance records, Participation in class, and Examination results. The final two columns represent the overall performance (total score) and the performance class (remarks), which ranges from 1 to 5, indicating different levels of student achievement.
To provide clarity and facilitate analysis, each feature is assigned a corresponding variable: x1 corresponds to CA, x2 to Practical proficiency, x3 to Demeanor, x4 to Presentation quality, x5 to Attendance, x6 to Participation in class, and finally, x7 to Examination results. This structured framework not only simplifies data interpretation but also lays the groundwork for comprehensive analysis and insights into student academic performance.
The CA variable evaluates ongoing performance through assignments, quizzes, and tests, providing insight into consistent engagement and mastery of material. The Practical variable assesses hands-on proficiency in applying theoretical knowledge through lab work and projects. Demeanor focuses on punctuality, attentiveness, and overall conduct, reflecting social and emotional intelligence. Presentation quality is evaluated through the clarity and effectiveness of student presentations, highlighting communication skills. Attendance records quantify commitment and consistency in attending classes. Participation in class measures active engagement and contribution during discussions and activities. Together, these variables offer a comprehensive framework for assessing student strengths, areas for improvement, and overall academic progress.
Table 1 presents the structure of the dataset generated in this research.
3.2.2. Data Analysis
Given the generated dataset, this study aims to explore the differences in student performance across the various classes. To achieve this, an Analysis of Variance (ANOVA) was employed to determine whether there are statistically significant differences in the performance scores among the different classes. To ensure the validity of the ANOVA results, normality was first checked using Q-Q plots, shown in Figure 2, and homogeneity of variance was assessed using Levene's test, as depicted in Figure 3.
The Q-Q plots indicated that the normality assumption was reasonably met, while Levene’s test revealed a p-value of 0.0, indicating significant differences in variances across groups.
As shown in Figure 3, the analysis of variance (ANOVA) performed on the dataset revealed significant differences in student performance across the different classes. Levene's test for homogeneity of variances produced a p-value of 0.0, indicating that the variances among the groups are significantly different. Although this result suggests a violation of the homogeneity of variance assumption, it is expected given the context of the dataset, which comprises distinct classes of student performance.
The ANOVA results further support the presence of significant differences between groups. The F-statistic was calculated to be 624,662.88 with a corresponding p-value of 0.0. This extremely low p-value allows us to reject the null hypothesis and conclude that there are significant differences in the mean performance scores among the different classes of students as expected in this context.
Following the ANOVA, Tukey’s HSD post-hoc test was conducted to identify which specific groups differed significantly from each other. The results showed that all pairwise comparisons between the groups were significant, with each group displaying a mean difference that was statistically significant (p-adj < 0.05) as expected. This indicates that the performance of students in each class is distinctively different from the others.
To sum up, the analysis demonstrates clear and significant differences in student performance across different classes. Although the homogeneity of variance assumption was not met, the context of varying student performance classes justifies these significant differences. The results align with expectations and provide a robust indication of distinct performance levels among the different classes.
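This statistical workflow can be reproduced with standard Python tooling. The sketch below uses SciPy and statsmodels; it assumes the generated CSV has been augmented with the remarks class described in Section 3.2.1 and that the column names follow Algorithm 1.

import pandas as pd
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Load the dataset; a 'remarks' class column is assumed present
df = pd.read_csv("resultPredictionDataset.csv")

# Group the total scores by performance class
groups = [g["total"].values for _, g in df.groupby("remarks")]

# Normality check data for a Q-Q plot of one group (cf. Figure 2)
stats.probplot(groups[0], dist="norm")

# Levene's test for homogeneity of variances (cf. Figure 3)
lev_stat, lev_p = stats.levene(*groups)
print(f"Levene's test: W = {lev_stat:.2f}, p = {lev_p:.4g}")

# One-way ANOVA across the performance classes
f_stat, p_val = stats.f_oneway(*groups)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_val:.4g}")

# Tukey's HSD post-hoc test for all pairwise class comparisons
tukey = pairwise_tukeyhsd(endog=df["total"], groups=df["remarks"], alpha=0.05)
print(tukey.summary())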
3.2.3. Data Preprocessing
As illustrated in Algorithm 2, the dataset is first loaded from a CSV file, ensuring that all subsequent operations are based on the complete dataset. Following this, relevant features and target variables are extracted. The features comprise the metrics `x1` through `x7`, while the target variables consist of the `total` score and `remarks`. The `remarks` target variable is adjusted to zero-based indexing to align the target values with the numerical representations typically used in model training. The feature set is then reshaped to fit the input requirements of a Long Short-Term Memory (LSTM) network, with dimensions corresponding to samples, timesteps, and features. The dataset is then divided into training and testing subsets, allowing the model to be trained on one portion of the data and evaluated on a separate, independent portion to assess its performance effectively. Finally, the input shape for the LSTM network is defined based on the reshaped data, which is important for setting up the network architecture correctly.
Algorithm 2: Data Preprocessing
dataPreprocessing(dataset)
    Load the dataset from the CSV file
    Extract features as X and target variables as y_total and y_remarks
    Adjust y_remarks for zero-based indexing by subtracting 1
    Reshape X to fit the LSTM input format (samples, timesteps, features)
    Split the dataset into training and testing sets:
        X_train, X_test
        y_total_train, y_total_test
        y_remarks_train, y_remarks_test
    Define the input shape for the LSTM network based on the reshaped data
end dataPreprocessing
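As a concrete illustration, the preprocessing routine can be written in Python as follows. The 80/20 split ratio and the single-timestep reshape are assumptions made for this sketch; the text does not fix either choice.

import pandas as pd
from sklearn.model_selection import train_test_split

def data_preprocessing(csv_path):
    # Load the dataset from the CSV file
    df = pd.read_csv(csv_path)

    # Extract features and the two target variables
    X = df[["x1", "x2", "x3", "x4", "x5", "x6", "x7"]].values.astype("float32")
    y_total = df["total"].values.astype("float32")
    y_remarks = df["remarks"].values - 1  # adjust to zero-based class indices

    # Reshape to (samples, timesteps, features) for the LSTM input;
    # one timestep carrying all seven features is assumed here
    X = X.reshape((X.shape[0], 1, X.shape[1]))

    # Split the dataset into training and testing sets
    (X_train, X_test,
     y_total_train, y_total_test,
     y_remarks_train, y_remarks_test) = train_test_split(
        X, y_total, y_remarks, test_size=0.2, random_state=42)

    # Define the input shape for the LSTM network: (timesteps, features)
    input_shape = X_train.shape[1:]
    return (X_train, X_test, y_total_train, y_total_test,
            y_remarks_train, y_remarks_test, input_shape)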
3.2.4. The Model
The proposed model integrates a Multi-Task LSTM with an Attention Mechanism to enhance the prediction of students' academic performance. As detailed in Table 3, the model addresses two dependent variables: total and remarks. Predicting the total variable is a regression task, while predicting the remarks variable is a classification task. Traditionally, handling these tasks would involve splitting the dataset in two and training separate models, which is time-consuming and resource-intensive. Therefore, a Multi-Task LSTM is employed to manage both tasks concurrently. Additionally, the Attention Mechanism is utilized to identify and extract the most relevant features from the dataset for the Multi-Task LSTM model. The integration of these models is mathematically presented in this section.
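To make the architecture concrete, the following is a minimal Keras sketch of a multi-task LSTM with an attention layer over the shared LSTM outputs. The layer widths, the optimizer, and the use of Keras's built-in Attention layer are illustrative assumptions, not the exact configuration of Table 3.

from tensorflow.keras import layers, Model

def build_multitask_lstm(input_shape, num_classes=5):
    # input_shape = (timesteps, features), e.g. (1, 7)
    inputs = layers.Input(shape=input_shape)

    # Shared LSTM trunk returning the full hidden-state sequence H
    h = layers.LSTM(64, return_sequences=True)(inputs)

    # Attention over the LSTM outputs (self-attention on H),
    # pooled into a single context vector
    context = layers.Attention()([h, h])
    context = layers.GlobalAveragePooling1D()(context)

    # Regression head for the 'total' score
    reg_hidden = layers.Dense(32, activation="relu")(context)
    total_out = layers.Dense(1, name="total")(reg_hidden)

    # Classification head for the five 'remarks' classes
    cls_hidden = layers.Dense(32, activation="relu")(context)
    remarks_out = layers.Dense(num_classes, activation="softmax",
                               name="remarks")(cls_hidden)

    # One model, two losses: MSE for regression,
    # cross-entropy for classification
    model = Model(inputs, [total_out, remarks_out])
    model.compile(optimizer="adam",
                  loss={"total": "mse",
                        "remarks": "sparse_categorical_crossentropy"},
                  metrics={"total": ["mae"], "remarks": ["accuracy"]})
    return model

Training then supplies both targets at once, e.g. model.fit(X_train, {"total": y_total_train, "remarks": y_remarks_train}, ...), so the shared LSTM layer learns from both tasks concurrently.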
For a sequence $S = [s_1, s_2, \ldots, s_T]$ with input features $s_t \in \mathbb{R}^d$, the LSTM layer computes the forget gate, input gate, cell state update, cell state, output gate, and hidden state as illustrated in Equations (1) through (6), respectively.
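In the standard LSTM formulation, consistent with the gates listed above (reconstructed here; the per-gate subscripts on $W$ and $b$ distinguish the respective weights and biases):

$f_t = \mathrm{SAF}(W_f \cdot [h_{t-1}, s_t] + b_f)$ (1)

$i_t = \mathrm{SAF}(W_i \cdot [h_{t-1}, s_t] + b_i)$ (2)

$\tilde{c}_t = \tanh(W_c \cdot [h_{t-1}, s_t] + b_c)$ (3)

$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$ (4)

$o_t = \mathrm{SAF}(W_o \cdot [h_{t-1}, s_t] + b_o)$ (5)

$h_t = o_t \odot \tanh(c_t)$ (6)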
Where SAF is the Sigmoid Activation Function, $\tanh$ is the hyperbolic tangent function, and $W$ and $b$ are the weights and biases.
Likewise, for a sequence $H = [h_1, h_2, \ldots, h_T]$, the attention mechanism calculates a context vector $C_t$ as a weighted sum of the input sequence, as shown in Equation (7).
The attention score $A_t$ is computed as depicted in Equation (8).
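In standard form, consistent with the notation in the text, these are:

$C_t = \sum_{i=1}^{T} A_i h_i$ (7)

$A_t = \frac{\exp(e_t)}{\sum_{k=1}^{T} \exp(e_k)}$ (8)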
Where $e_t = \mathrm{score}(h_t, h_{t-1}) = \mathrm{LeakyReLU}(W_a \cdot [h_t, h_{t-1}] + b_a)$.
For the regression task, the output, denoted $y_t$, is given in Equation (9), where the hidden layer is computed as in Equation (10), and $W$ and $b$ are weights and biases.
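One plausible reconstruction, assuming a ReLU hidden layer over the attention context (the activation is not specified in the text):

$y_t = W_{yr} h_r + b_{yr}$ (9)

$h_r = \mathrm{ReLU}(W_{hr} C_t + b_{hr})$ (10)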
Likewise, for the classification task, the output layer, denoted $y_o$, is given in Equation (11), where the hidden layer is computed as in Equation (12), and $W$ and $b$ are weights and biases.
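In the same assumed form:

$y_o = \mathrm{softmax}(W_{yo} h_c + b_{yo})$ (11)

$h_c = \mathrm{ReLU}(W_{hc} C_t + b_{hc})$ (12)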
Here, softmax is the activation function that outputs probabilities. Therefore, the integration is modeled as in Equations (13) through (16).
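Consistent with the description above, one compact reconstruction is:

$H = \mathrm{LSTM}(S)$ (13)

$C = \mathrm{Attention}(H)$ (14)

$y_{\mathrm{total}} = W_{\mathrm{total}} \cdot C + b_{\mathrm{total}}$ (15)

$y_{\mathrm{remarks}} = \mathrm{softmax}(W_{\mathrm{remarks}} \cdot C + b_{\mathrm{remarks}})$ (16)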
Where $H$ is the output sequence from the LSTM layer, and $W$, $b$ are the respective weights and biases for each task.
Then, the regression and classification losses are computed as illustrated in Equations (17) and (18), respectively.
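The standard choices for these two tasks, assumed here, are mean squared error for the regression head and categorical cross-entropy for the classification head:

$\mathcal{L}_{\mathrm{reg}} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$ (17)

$\mathcal{L}_{\mathrm{cls}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{5} y_{i,k} \log \hat{y}_{i,k}$ (18)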
Where $\hat{y}$ denotes the predicted values.