To proceed with classification modeling under the supervised learning paradigm, the available set of 636 vectors was divided into a training set, a validation set, and a test set in the ratio 80%:10%:10%. Six classification models were created, mostly based on the decision tree model with certain modifications. All model parameters and hyperparameters were left at the default settings of the IBM SPSS Modeler environment.
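For readers who wish to reproduce the partitioning outside IBM SPSS Modeler, the following minimal Python sketch illustrates the 80%:10%:10% split; the arrays X and y are placeholders for the 636 vectors and their labels, and scikit-learn is assumed to be available.

import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.integers(0, 10, size=(636, 6))   # placeholder feature vectors
y = rng.integers(1, 9, size=636)         # placeholder class labels

# First separate 20% of the records, then halve that into validation and test sets.
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.20, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_hold, y_hold, test_size=0.50, random_state=42)

print(len(X_train), len(X_val), len(X_test))   # 508 64 64

Note that a 64-vector test set is consistent with the accuracies reported below: 59/64 ≈ 92.19%.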
5.2.1. CHAID model
Chi-squared Automatic Interaction Detection (CHAID) is a machine learning method for building classification decision trees based on Pearson chi-square statistics. It is one of the most popular statistically based models of multivariate dependence, used to detect relationships between categorical dependent and independent variables [17,35]. The CHAID model applies the chi-square test of independence to assess the significance of the relationship between each independent variable and the dependent variable. When multiple relationships are statistically significant, CHAID selects the input variable with the lowest p-value. For input variables with more than two categories, categories with insignificant differences in outcome are merged, starting with those with the least significant differences, until all remaining categories differ significantly. The algorithm minimizes variation in the dependent variable within groups and maximizes it between groups, achieving optimal data segmentation [35,44]. The default settings of the CHAID model in IBM SPSS Modeler include, among other things, two important hyperparameters: Significance level for splitting = 0.05 and Significance level for merging = 0.05. These parameters, with values between 0 and 1, define the statistical significance thresholds used to decide whether a tree node should be split and whether multiple predictor categories should be merged. Minimum Change in Expected Cell Frequencies is another CHAID hyperparameter; it determines the smallest change in expected cell frequencies (in the contingency table) required to split or merge nodes. Maximum Iterations for Convergence caps the number of iterations the algorithm may perform, with a default of 100.
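The core of this selection procedure can be sketched in a few lines of Python; the function below is an illustrative approximation rather than the SPSS implementation, and it assumes a pandas DataFrame of categorical columns and SciPy for the chi-square test.

import pandas as pd
from scipy.stats import chi2_contingency

def best_chaid_split(df: pd.DataFrame, target: str, alpha: float = 0.05):
    # For each candidate predictor, test independence against the target
    # and keep the variable with the smallest p-value.
    best_var, best_p = None, 1.0
    for col in df.columns.drop(target):
        table = pd.crosstab(df[col], df[target])   # contingency table
        _, p, _, _ = chi2_contingency(table)
        if p < best_p:
            best_var, best_p = col, p
    # Split only if significant, mirroring Significance level for splitting = 0.05.
    return (best_var, best_p) if best_p < alpha else (None, best_p)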
Figure 8 shows the structure of the created decision tree, i.e., the CHAID model with three layers of nodes in addition to the root node. The first layer contains nine nodes, five of which branch into terminal nodes – leaves of the second layer. Each of these nodes contains a question or test that splits the data based on certain attributes, while the leaves themselves generate predictions – classifications. Only one second-layer node is connected by a branch to two third-layer leaves.
Next to each node in Figure 8 there is a label that defines the branching rule, structured in two parts. The label above the node shows the set of rules for assigning individual records to child nodes based on predictor values [44]. It consists of the name of the variable on which the branching was performed and its numerically coded categories, connected by the OR operator. The label below the node defines the classification value, which for categorical outputs, as in this case, is expressed as the statistical parameter Mode, the most frequent category in the observed branch. As an example, consider the classification of input vectors into category 5 of Event Significance in the leftmost first-layer node of the tree in Figure 8, given in the form [Mode: 5] => 5.0. For trees with numeric dependent variables, the prediction is the average value for the branch. The created CHAID model accurately classifies 92.19% of the vectors from the test set.
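The Mode label can be illustrated directly: a leaf simply predicts the most frequent target category among the training records routed to it. A minimal Python sketch, with hypothetical record labels:

from collections import Counter

leaf_labels = [5, 5, 4, 5, 6, 5, 5]        # hypothetical labels in one branch
mode, count = Counter(leaf_labels).most_common(1)[0]
print(f"[Mode: {mode}] => {float(mode)}")  # prints: [Mode: 5] => 5.0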
5.2.2. C&R Tree model
The Classification and Regression (C&R) Tree model is a predictive classification model based on a decision tree. The C&R Tree algorithm begins by analyzing the input nodes to find the best possible split, determined by the minimum impurity index. The Gini index, which is related to the probability of incorrectly classifying a random sample, is most commonly used to measure impurity [38]. In IBM SPSS Modeler, the default minimum change in impurity required to allow a new split in the tree is 10⁻⁴. All splits are binary, meaning that each split creates two subgroups, and each of those subgroups is further split into two until one of the stopping criteria is met [38]. The stopping criteria set in the C&R Tree model are:
Minimum records in parent branch – 2% of the total dataset,
Minimum records in child branch – 1% of the total dataset.
The same criteria apply to the CHAID model. In the C&R Tree model, “maximum surrogates” is an important hyperparameter that determines the maximum number of surrogate splits that can be used to split nodes if the data for the primary split variable is unavailable. By default, this hyperparameter has a value of 5.
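As a hedged sketch of this splitting criterion, the Python functions below compute the Gini index and the impurity decrease that a candidate binary split must exceed (the default threshold being 10⁻⁴); the node labels are hypothetical.

from collections import Counter

def gini(labels):
    # Gini impurity: probability of misclassifying a randomly drawn sample.
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def impurity_decrease(parent, left, right):
    # Parent impurity minus the size-weighted impurity of the two children.
    n = len(parent)
    return gini(parent) - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)

parent = [5, 5, 4, 4, 5, 4, 5, 5]          # hypothetical labels at a node
left, right = [5, 5, 5, 5, 5], [4, 4, 4]   # candidate binary split
print(impurity_decrease(parent, left, right) > 1e-4)   # True: accept the split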
Figure 9 shows the tree of the C&R Tree model, in which the best split was made on the Event Type variable, dividing the existing dataset into two subgroups. The final decisions, i.e., the assignments of inputs to the categories of the dependent variable, are contained in the terminal nodes, Node 1 and Node 2. This model showed a classification accuracy of 78.13% on the test data.
The following two inference rules define the structure of the previously shown tree:
To clarify, consider the first rule as an example: Event_type in [ 2.000 4.000 5.000 ] [ Mode: 5 ] => 5.0. The interpretation of this rule is as follows:
Event_type in [ 2.000 4.000 5.000 ]: the variable Event_type has a value within the specified set of values: 2.000, 4.000, 5.000.
[ Mode: 5 ]: Mode refers to the most frequent class among the records that satisfy the condition Event_type in [ ... ]. In this case, the most frequent class is 5.
=> 5.0: this part indicates the prediction or classification. If Event_type has a value within the specified set of values, then the predicted class is 5.0.
Based on these rules, it can be concluded that the classification is based on the category mode of the dependent variable. For Node 1 in Figure 9, the mode is 5, and for Node 2 the mode is 4.
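Rendered as executable logic, the two leaves amount to a single conditional. The sketch below is an illustrative paraphrase of the tree, with the category codes for Node 1 taken from the rule quoted above and Node 2 covering the remaining Event_type codes:

def crt_predict(event_type: float) -> float:
    # Node 1: Event_type in [ 2.000 4.000 5.000 ] [ Mode: 5 ] => 5.0
    if event_type in (2.0, 4.0, 5.0):
        return 5.0
    # Node 2: remaining Event_type codes [ Mode: 4 ] => 4.0
    return 4.0

print(crt_predict(4.0))   # 5.0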
5.2.4. C5.0 model
The C5.0 model splits the sample on the variable that provides the maximum information gain. Each resulting subsample is then split further, typically on a different variable, and the process continues until the subsamples can no longer be split. The splits at the lowest level are then re-examined, and those that do not contribute significantly to the classification are removed or pruned [47].
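The information-gain criterion can be sketched as follows; the entropy and gain functions mirror the textbook definitions rather than the exact C5.0 implementation, and the labels are hypothetical.

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of the class distribution at a node.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, subsets):
    # Parent entropy minus the size-weighted entropy of the subsamples.
    n = len(parent)
    return entropy(parent) - sum(len(s) / n * entropy(s) for s in subsets)

parent = [5, 5, 4, 4, 6, 5]            # hypothetical labels at a node
split = [[5, 5, 5], [4, 4], [6]]       # a candidate multiway split
print(round(information_gain(parent, split), 3))   # 1.459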
C5.0 can build two types of models: a decision tree and a rule set. The decision tree is a simple and intuitive way to represent the branching found by the algorithm. Each leaf of the decision tree describes a specific subset of the training data, where each case in the training data belongs to exactly one terminal node of the tree. In contrast, the rule set represents a simplified version of the structure that defines the branching.
The C5.0 model was created without the input variable Location due to warnings and errors reported by the software when running the algorithm. The reasons are inherent in the nature of this variable, which has a large number of unique values, i.e., categories.
Figure 10 shows the structure of the resulting decision tree, which has a depth of 2. Each node of the tree is marked with a number corresponding to a specific branching rule given below. When tested, the model showed an accuracy of 92.19%, matching that of the CHAID model.
For the C5.0 model, the Training Mode hyperparameter is defined, which can take the values Simple and Expert. In Simple training, most of the C5.0 parameters are set automatically, while Expert training allows more direct control over the training parameters. By default, C5.0 attempts to generate the most accurate tree possible (Accuracy), which can often lead to overfitting and poor performance. The Generality setting, on the other hand, uses algorithm settings that are less prone to this problem [54].
The similarity with the CHAID model is also reflected in the set of rules that define the branches of the tree shown, with the position of each rule in the tree determined by the numerical label of each node in Figure 10. In this case, the following 16 rules were generated, which can be interpreted as shown for the C&R Tree model:
1. Activity in [ 1.000 7.000 8.000 13.000 16.000 17.000 18.000 19.000 20.000 23.000 26.000 30.000 46.000 60.000 63.000 65.000 ] [ Mode: 5 ] => 5.0
2. Activity in [ 2.000 3.000 4.000 5.000 27.000 28.000 33.000 34.000 36.000 37.000 38.000 41.000 44.000 51.000 52.000 53.000 57.000 58.000 62.000 64.000 66.000 ] [ Mode: 4 ] => 4.0
3. Activity in [ 6.000 10.000 11.000 24.000 29.000 47.000 54.000 ] [ Mode: 6 ] => 6.0
4. Activity in [ 9.000 ] [ Mode: 4 ]
4.1. Event_type in [ 1.000 2.000 3.000 5.000 6.000 7.000 8.000 11.000 12.000 13.000 14.000 15.000 16.000 ] [ Mode: 2 ] => 2.0
4.2. Event_type in [ 4.000 ] [ Mode: 4 ] => 4.0
4.3. Event_type in [ 9.000 10.000 ] [ Mode: 2 ] => 2.0
5. Activity in [ 12.000 31.000 32.000 61.000 ] [ Mode: 7 ] => 7.0
6. Activity in [ 14.000 ] [ Mode: 5 ]
6.1. Periodicity in [ 1.000 2.000 3.000 6.000 7.000 8.000 9.000 10.000 ] [ Mode: 5 ] => 5.0
6.2. Periodicity in [ 4.000 ] [ Mode: 5 ] => 5.0
6.3. Periodicity in [ 5.000 ] [ Mode: 6 ] => 6.0
7. Activity in [ 15.000 ] [ Mode: 6 ]
7.1. Event_type in [ 1.000 2.000 3.000 6.000 7.000 8.000 11.000 12.000 13.000 14.000 15.000 16.000 ] [ Mode: 6 ] => 6.0
7.2. Event_type in [ 4.000 ] [ Mode: 6 ] => 6.0
7.3. Event_type in [ 5.000 9.000 10.000 ] [ Mode: 5 ] => 5.0
8. Activity in [ 21.000 ] [ Mode: 5 ]
8.1. Object in [ 1.000 2.000 3.000 4.000 5.000 7.000 8.000 9.000 10.000 11.000 12.000 13.000 14.000 15.000 16.000 17.000 18.000 19.000 20.000 21.000 22.000 24.000 25.000 26.000 27.000 28.000 29.000 ] [ Mode: 5 ] => 5.0
8.2. Object in [ 6.000 ] [ Mode: 5 ] => 5.0
8.3. Object in [ 23.000 ] [ Mode: 4 ] => 4.0
9. Activity in [ 22.000 ] [ Mode: 5 ]
9.1. Host in [ 1.000 2.000 3.000 4.000 5.000 6.000 7.000 8.000 9.000 10.000 11.000 12.000 13.000 14.000 15.000 16.000 17.000 18.000 19.000 20.000 21.000 23.000 24.000 25.000 26.000 28.000 29.000 30.000 31.000 32.000 33.000 34.000 ] [ Mode: 5 ] => 5.0
9.2. Host in [ 22.000 ] [ Mode: 4 ] => 4.0
9.3. Host in [ 27.000 ] [ Mode: 5 ] => 5.0
10. Activity in [ 25.000 ] [ Mode: 8 ] => 8.0
11. Activity in [ 35.000 40.000 43.000 45.000 48.000 49.000 ] [ Mode: 4 ] => 4.0
12. Activity in [ 39.000 50.000 56.000 68.000 ] [ Mode: 3 ] => 3.0
13. Activity in [ 42.000 ] [ Mode: 4 ]
13.1. Host in [ 1.000 2.000 3.000 4.000 5.000 6.000 8.000 10.000 11.000 13.000 14.000 15.000 16.000 19.000 20.000 21.000 22.000 23.000 24.000 26.000 28.000 29.000 30.000 32.000 33.000 34.000 ] [ Mode: 4 ] => 4.0
13.2. Host in [ 7.000 25.000 27.000 ] [ Mode: 4 ] => 4.0
13.3. Host in [ 9.000 ] [ Mode: 7 ] => 7.0
13.4. Host in [ 12.000 ] [ Mode: 3 ] => 3.0
13.5. Host in [ 17.000 18.000 31.000 ] [ Mode: 8 ] => 8.0
14. Activity in [ 55.000 ] [ Mode: 1 ] => 1.0
15. Activity in [ 59.000 ] [ Mode: 5 ]
15.1. Object in [ 1.000 2.000 3.000 4.000 5.000 6.000 7.000 8.000 9.000 10.000 11.000 12.000 13.000 14.000 15.000 16.000 17.000 18.000 19.000 20.000 21.000 22.000 23.000 24.000 26.000 29.000 ] [ Mode: 5 ] => 5.0
15.2. Object in [ 25.000 ] [ Mode: 7 ] => 7.0
15.3. Object in [ 27.000 ] [ Mode: 4 ] => 4.0
15.4. Object in [ 28.000 ] [ Mode: 5 ] => 5.0
16. Activity in [ 67.000 ] [ Mode: 4 ]
16.1. Event_type in [ 1.000 2.000 3.000 4.000 5.000 6.000 7.000 8.000 11.000 12.000 13.000 14.000 15.000 16.000 ] [ Mode: 4 ] => 4.0
16.2. Event_type in [ 9.000 ] [ Mode: 5 ] => 5.0
16.3. Event_type in [ 10.000 ] [ Mode: 4 ] => 4.0
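To make the interpretation concrete, rule 4 and its sub-rules can be paraphrased as a small Python function; the category codes are copied from the listing above, and the function is an illustrative rendering rather than the model itself.

def c50_predict_activity_9(event_type: float) -> float:
    # Applies only when Activity = 9.000 (rule 4, Mode: 4).
    if event_type == 4.0:
        return 4.0                     # rule 4.2: [ Mode: 4 ] => 4.0
    if event_type in (9.0, 10.0):
        return 2.0                     # rule 4.3: [ Mode: 2 ] => 2.0
    return 2.0                         # rule 4.1: remaining codes => 2.0

print(c50_predict_activity_9(4.0))   # 4.0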
The advantage of the C5.0 model is its robustness in the presence of issues such as missing data and large numbers of input variables. Its training time is relatively short and, in addition, C5.0 models are easier to interpret than many other machine learning models [47].