4. Discussion
Graphical presentation of the obtained results is a very important stage of working with data acquired through research. Presenting the obtained results in a clear manner and in accordance with the commonly accepted rules makes it easier to choose the method of further processing the data and interpreting the results of experimental research correctly.
In the present study, too, the analysis of the source data began with their graphical presentation. The study investigated two groups of patients: girls with scoliosis (group 1 and group 2) and girls without scoliosis (group 3 and group 4). Each of the groups is divided into two subgroups: girls who had not started menstruating yet (group 1 and group 3) and menstruating girls (group 2 and group 4).
The analysis of data began with the preparation of descriptive statistics characterising the studied groups. The plot presented in
Figure 1 shows an overview of the values of the hormones measured for the studied groups of girls; it does not show the values of the measured parameters. The presented relationships (
Figure 1) indicate that the ranges of the measured hormones differ widely, which usually makes data normalisation necessary when conducting, for instance, discriminant analysis. Basic statistical parameters were calculated for each variable characterising the studied groups of patients. The results are described in more detail in the authors’ previous articles [
8,
9]. In this study,
Figure 2 and
Figure 3 present only some of the obtained results, focusing on the LH variable, which proved to be the most diagnostic parameter in the subsequent statistical analyses. Such histograms may be a combination of several separately calibrated distributions. For each value of the categorising variable (experimental group), a separate frequency distribution can be drawn. The categorised histogram for the results of the measurement of the luteinising hormone is presented in
Figure 2.
Figure 3 shows scatterplots of the mean with marked standard deviations determined for all four studied groups of patients and the plot of probability distribution for the same variable. The results presented in
Figure 3 show that the variation of the analysed trait differs considerably between the studied groups, which suggests that there are significant differences between the mean values in the analysed groups. The Shapiro-Wilk W-test conducted for the LH variable supports the assumption that the distribution of this categorised variable is normal, because in all studied groups
p > 0.05. For the other biochemical measurements, the normality of the categorised distributions of the variables was confirmed in the majority of cases. For several variables, slight departures from normal distribution were observed (at the ends of the range of variability). The obtained tests of significance, however, remain reliable, because the samples are large and the lack of normality results only from the skewness of the histogram showing the data distribution.
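The normalisation step mentioned above can be illustrated with a z-score transformation, one common way of bringing variables with very different ranges onto a comparable scale. This is a minimal sketch and not the procedure used in STATISTICA: the function name `standardise` and the sample values are hypothetical and are not taken from the study.

```python
from statistics import mean, stdev


def standardise(values):
    """Return z-scores: (x - mean) / sample standard deviation."""
    m = mean(values)
    s = stdev(values)
    return [(x - m) / s for x in values]


# Hypothetical measurements on two very different scales (NOT study data).
lh = [2.1, 3.4, 5.0, 7.9, 12.6]
e2 = [18.0, 25.5, 31.0, 47.2, 60.3]

lh_z = standardise(lh)
e2_z = standardise(e2)

# After the transformation both variables have mean 0 and standard deviation 1,
# so neither dominates a distance-based or variance-based analysis.
print([round(z, 2) for z in lh_z])
print([round(z, 2) for z in e2_z])
```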
Normality of the distributions of variables is usually tested prior to more advanced statistical analyses. The present study used two methods implemented in the STATISTICA package: discriminant analysis and decision trees. The software classifies both methods as multivariate exploratory techniques.
The preliminary results of the conducted discriminant analysis confirmed that discrimination of the membership in a group is highly significant. This is indicated by the value of Wilks' lambda, which equals 0.0738851, and the approximate value of
F, which equals 26.18207, and the corresponding value of
p < 0.05. The variables that are the most useful in discriminant analysis were also identified. The analysis of the results presented in
Table 1 leads to the conclusion that seven variables, the significance of which was confirmed by the calculated values of the
p parameter (
p << 0.05), should be selected for further analysis: LH, E2, PROG, PTH, osteocalcin, calcium and vitamin D. The following variables, for which
p > 0.1, will be excluded: FSH, HGH, and phosphorus. The significance of discriminant functions was also confirmed. As can be seen (
Table 2), the first two functions are characterised by a high value of canonical correlation
R. This indicates a strong relationship between the groups of patients and discriminant functions. The first row in the table contains the significance test for all roots. The second row contains the results of the evaluation of the significance of roots that remained after removing the first root (etc.). Because the
p values for all three discriminant functions are very close to 0, all functions were found to be significant. Therefore, it can be said that the results of the conducted research come from a population in which four groups of the studied patients emerge naturally (the number of significant discriminant functions is one less than the number of groups, which is the maximum possible). The obtained results were compared in order to determine the size and direction of the contribution of each variable to each canonical discriminant function. For example, the variables LH and E2 have the strongest impact on the first function, and their impact is very similar (both coefficients are negative and equal -0.567 and -0.508, respectively). The first function accounts for 75.4% of the explained variance, that is, 75.4% of the entire discriminatory power, which makes it the most important one. The second function explains 21.2% and the third function only 3.4% of the discriminatory power. This is consistent with the earlier results: the value of the canonical correlation coefficient
R for the third function equals only 0.397 (
Table 2).
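The figures quoted above can be cross-checked with the standard relations between canonical correlations, eigenvalues and Wilks' lambda: each eigenvalue is R²/(1 − R²), the share of discriminatory power of a function is its eigenvalue divided by the sum of all eigenvalues, and Wilks' lambda for all roots is the product of 1/(1 + eigenvalue). The sketch below is only a consistency check under those textbook relations. It uses only the rounded values reported in the text (R for the third function and the three percentage shares), and the variable names are illustrative.

```python
from math import prod

# Values quoted in the text (rounded as reported).
r3 = 0.397                      # canonical correlation of the third function
shares = [0.754, 0.212, 0.034]  # share of discriminatory power per function

# Eigenvalue of the third function from its canonical correlation.
eig3 = r3 ** 2 / (1 - r3 ** 2)

# The third share fixes the sum of eigenvalues; the other two follow from it.
total = eig3 / shares[2]
eigenvalues = [share * total for share in shares]

# Implied canonical correlations and Wilks' lambda for all three roots.
canonical_r = [(ev / (1 + ev)) ** 0.5 for ev in eigenvalues]
wilks_lambda = 1 / prod(1 + ev for ev in eigenvalues)

print([round(r, 3) for r in canonical_r])
print(round(wilks_lambda, 4))
```

Because the inputs are rounded, the implied Wilks' lambda can only be expected to agree approximately with the reported value of 0.0738851.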
The presented results of the calculations of the mean values of discriminant functions indicate that the first discriminant function differentiates primarily the objects from
group 4 and, to a smaller degree, the objects from
group 1. The second discriminant function seems to distinguish the third group, but, as can be seen, the value of this discrimination is considerably lower. The situation is confirmed by the scatterplot of canonical values shown in
Figure 4. It can be observed in the figure that the girls from
group 4 (without scoliosis, menstruating) are located significantly more to the left than the other groups and create a distinctive cluster. This discriminant function is affected the most by the LH and E2 variables. The higher the values of these variables are, the more to the left the object is located (unambiguously belonging to
group 4). Similar interpretation may be applied to the second discriminant function, on which the LH and osteocalcin have a strong positive effect, and which can be used to test the membership of objects in
group 3. The cases belonging to
group 2 are located within all the remaining groups.
The classification of cases is possible after developing classification functions. The equations presenting the classification functions have the following form:
The usefulness of the established classifiers was assessed for the learning set and the testing set. The classification results for the learning set (180 cases) are presented in
Table 3. The classification matrix presented in the table contains information about the number and percentage of objects (patients) classified correctly in each group. The highest percentage of correctly classified cases is observed in the group of healthy girls, that is, in
group 4 (97.8%) and
group 3 (91.1%). The percentage of correctly classified girls with scoliosis equals 77.8% for
group 1 (non-menstruating girls) and 73.3% for
group 2 (menstruating girls). The obtained results match the areas of the overlapping of the objects belonging to the different classes observed in
Figure 4. It can be observed that
group 4 is the most isolated one and
group 2 is the least isolated one. In the case of
group 2, the calculated percentage of incorrectly classified patients is the highest and equals 26.7%. The established functions K
1, K
2, K
3, K
4 allow for classifying new cases from the testing set, which were not used for calculating the coefficients of the functions. For the testing set, the mean percentage of correctly classified patients is slightly lower than the result obtained for the learning set and equals 80%. The worst classification result (both for the learning set and the testing set) was obtained for the patients belonging to
group 2. Such a result can be explained by the fact that this group is characterised by the highest dispersion of the values of the measured biochemical parameters (
Figure 3). Furthermore,
group 2 included girls aged between 11 and 19 years, whereas the age range within each of the other groups is only 3 or 5 years. Such a wide age range in group 2 probably had an effect on the correctness of classification.
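The per-group percentages reported for the learning set can be converted back into counts to obtain an overall accuracy for comparison with the testing set. The sketch assumes that the 180 learning cases are split evenly, 45 per group. This assumption is not stated explicitly in the text, although the reported percentages are consistent with it.

```python
GROUP_SIZE = 45  # assumption: 180 learning cases split evenly over 4 groups

# Percentage of correctly classified cases per group, as reported in the text.
reported_pct = {1: 77.8, 2: 73.3, 3: 91.1, 4: 97.8}

# Recover whole-number counts of correctly classified girls in each group.
correct = {g: round(p / 100 * GROUP_SIZE) for g, p in reported_pct.items()}

# Overall accuracy of the discriminant classifier on the learning set.
overall = sum(correct.values()) / (GROUP_SIZE * len(correct))

print(correct)
print(f"{overall:.1%}")
```

Under this assumption, 153 of the 180 learning cases are classified correctly (85%), compared with 80% for the testing set.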
In order to compare the classification capabilities of various calculation methods, a diagnostic system was constructed, which was represented by a decision tree. The tree was also built using the STATISTICA software. First, analogously as in the case of
discriminant analysis, the significance of all 10 variables, which were the results of the laboratory research, was evaluated. The results of the ranking (
Figure 5) confirmed that FSH, HGH and phosphorus (in the figure, they are separated off by the red line) should be considered less significant if a construction algorithm for decision trees is used to create a classifier. The same conclusion was formulated in the case of discriminant analysis, which was discussed earlier (
Table 1). For this reason, only the top seven variables from the ranking were selected for constructing the decision tree.
The structure of the generated tree is presented in
Figure 6. The root of the tree contains the LH attribute, which possessed the ‘largest amount of diagnostic information’, that is, the values of the attribute allowed for distinguishing from the studied set the majority (74) of menstruating girls (29 girls from
group 2 and the entire
group 4), which is consistent with the physiological levels of this hormone in menstruating women. Then, the tree was constructed recursively, in accordance with the principle of placing the attribute that brings the highest information gain at the root of each subtree, which gives the best division of the studied sample at that step. In the obtained classification system, the LH variable was found to be the most important decision attribute, followed by E2, PROG, calcium, and finally, osteocalcin. Node no. 5 contains a question about the level of calcium, the value of which allows for distinguishing menstruating girls with scoliosis (5 persons from
group 2) from non-menstruating girls without scoliosis (33 persons from
group 3). In node no. 8, the calculated value of the osteocalcin parameter allows for indicating which of the non-menstruating girls form
group 1 and which belong to
group 3. The analysis of the content of the nodes that are the leaves of the tree allows us to conclude that groups 4 and 1 are sets better isolated than the other ones because there is only one decision pathway leading to them. To classify a case belonging to
group 3, two pathways have to be analysed, and classifying a case belonging to
group 2 requires analysing three decision pathways.
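The information gain of the root split can be estimated from the counts given above. The sketch uses Shannon entropy, which is one common way of defining information gain. Whether STATISTICA used exactly this criterion is not stated in the text. It also assumes 45 girls per group in the learning set, so that the branch not containing the 74 separated girls holds the remaining 106 cases.

```python
from math import log2


def entropy(counts):
    """Shannon entropy (in bits) of a class-count vector."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = sum(parent)
    weighted = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


# Class counts ordered as [group 1, group 2, group 3, group 4].
parent = [45, 45, 45, 45]   # assumption: 45 girls per group
high_lh = [0, 29, 0, 45]    # 74 girls separated by the LH split (from the text)
low_lh = [45, 16, 45, 0]    # the remaining 106 girls (derived under the assumption)

gain = information_gain(parent, [low_lh, high_lh])
print(round(entropy(parent), 3), round(gain, 3))
```

With four equally sized groups the parent entropy is 2 bits, so the gain shows how much of that uncertainty the single LH question removes.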
The results presented in the form of a classification tree may be converted into a set of rules and used in this form for predicting the classes of new medical cases. The rules allowing for classifying the studied girls into appropriate groups may be formulated as conditional sentences:
Rule 1: IF ((LH ≤ 13.887) AND (E2 ≤ 30.83) AND (PROG ≤ 1.27) AND (osteocalcin > 24.964)) THEN group 1
Rule 2: IF (((LH > 13.887) AND (E2 ≤ 52.019)) OR ((LH ≤ 13.887) AND (E2 ≤ 30.83) AND (PROG > 1.27)) OR ((LH ≤ 13.887) AND (E2 > 30.83) AND (calcium ≤ 2.43))) THEN group 2
Rule 3: IF (((LH ≤ 13.887) AND (E2 ≤ 30.83) AND (PROG ≤ 1.27) AND (osteocalcin ≤ 24.964)) OR ((LH ≤ 13.887) AND (E2 > 30.83) AND (calcium > 2.43))) THEN group 3
Rule 4: IF ((LH > 13.887) AND (E2 > 52.019)) THEN group 4
Using the rule knowledge base and software operating as an inference engine (e.g. SCANKEE [
21,
22]) allows for a full automation of the process of diagnosing scoliosis cases and makes it easier for a physician managing a patient to make a decision about treatment and rehabilitation.
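The four rules can be sketched as a single function that walks the tree from the root. This is an illustration only and not the SCANKEE implementation. The function name `classify` and its argument names are hypothetical, the sketch assumes a single LH threshold of 13.887 at the root, and the inputs are assumed to be in the same units as the thresholds in the rules.

```python
def classify(lh, e2, prog, calcium, osteocalcin):
    """Return the group number (1-4) according to Rules 1-4.

    Hypothetical helper: thresholds are copied from the rules in the text,
    assuming one LH split value (13.887) at the root of the tree.
    """
    if lh > 13.887:
        # Rule 4 (E2 > 52.019) or the first clause of Rule 2.
        return 4 if e2 > 52.019 else 2
    if e2 > 30.83:
        # Calcium node: second clause of Rule 3 or third clause of Rule 2.
        return 3 if calcium > 2.43 else 2
    if prog > 1.27:
        # Second clause of Rule 2.
        return 2
    # Osteocalcin node: Rule 1 or first clause of Rule 3.
    return 1 if osteocalcin > 24.964 else 3


# Hypothetical input values chosen only to exercise one branch.
print(classify(lh=20.0, e2=60.0, prog=1.0, calcium=2.4, osteocalcin=20.0))
```

Writing the rules as nested conditions rather than four independent tests makes it explicit that every case reaches exactly one leaf.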
The results of the assessment of the correctness of classification by the constructed decision tree (
Table 4) and the rule base derived from the tree lead to the conclusion that, out of 180 cases forming the learning set, five girls were classified into an incorrect group. In the testing set of 20 girls, two persons were classified incorrectly. Thus, the classification accuracy of the generated decision tree equals 97.2% for the learning set and 90% for the testing set. Both results are better than those obtained with discriminant analysis.