2.2.1. Interpreting Dataset
Our early work with PDD focused on interpretable pattern discovery. In this section, we present how PDD works for interpretability. As an overwhelming number of features (attributes) can complicate interpretability, we removed redundant features and retained 27 features describing patient demographics and physical assessments. These comprise 15 features related to physical measurements and patient information (GCS, systolic blood pressure, diastolic blood pressure, mean blood pressure, pulse pressure, heart rate, respiration rate, SpO2, age, gender, ethnicity, discharge status, admission weight, discharge weight, and height) and two additional observation features for each of six physical measurements, totaling 12 features. Together with one label feature, the dataset has dimensions 10,743 by 28.
Because numerical features have unlimited degrees of freedom, correlating them with the target variable and interpreting the resulting associations is challenging. The first step of the PDD process is therefore to discretize the numerical features into event-based or discrete categories according to clinical standards, as Table 1 shows. Other numerical features without clear clinical intervals, such as age and admission weight, were discretized by equal-frequency binning, so that each bin contains the same number of records. Equal-frequency binning maximizes the entropy of each discretized feature, which indirectly enhances the informational value of the data.
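The equal-frequency discretization described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the study: the function name `equal_frequency_bins`, the number of bins, and the age values are all hypothetical.

```python
def equal_frequency_bins(values, n_bins):
    """Assign each value a bin index (0..n_bins-1) so bins hold about equal counts.

    Bins are assigned by rank, so tied values may land in adjacent bins
    (a simplification made for this sketch).
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // len(values)
    return bins


# Hypothetical ages, not taken from the dataset.
ages = [23, 35, 41, 47, 52, 58, 63, 67, 74, 81, 85, 90]
age_bins = equal_frequency_bins(ages, n_bins=3)
print(age_bins)
```

With 12 values and 3 bins, each bin receives 4 records regardless of how the raw values are spread.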
Then, on the discrete dataset, using PDD [11], we construct a statistical residual matrix (SR-Matrix) to account for the statistical strength of the associations among feature values. In pattern discovery, the term "attribute" is used instead of "feature," so "attribute value" (AV) will be used subsequently. Since the meaning of the attribute values (AVs) and their class labels are implicit, the discovery of a statistically significant association of the AVs is unaffected by prior knowledge or confounding factors. To evaluate the association between each pair of AVs in the SR-Matrix, we calculated the adjusted standardized residual to represent the statistical weight of the association between distinct AV pairs. For instance, if $AV_1$ represents the value of attribute $A_1$ as H, and $AV_2$ represents the value of attribute $A_2$ as L, then the adjusted standardized residual of the association between $AV_1$ and $AV_2$, denoted as $SR(AV_1 \leftrightarrow AV_2)$, is calculated as in Equation (1):

$$SR(AV_1 \leftrightarrow AV_2) = \frac{\mathrm{Occ}(AV_1 \leftrightarrow AV_2) - \mathrm{Exp}(AV_1 \leftrightarrow AV_2)}{\sqrt{\mathrm{Exp}(AV_1 \leftrightarrow AV_2)\left(1 - \frac{\mathrm{Occ}(AV_1)}{N}\right)\left(1 - \frac{\mathrm{Occ}(AV_2)}{N}\right)}} \quad (1)$$

where $\mathrm{Occ}(AV_1)$ and $\mathrm{Occ}(AV_2)$ represent the number of occurrences of each attribute value; $\mathrm{Occ}(AV_1 \leftrightarrow AV_2)$ is the total number of co-occurrences of the two attribute values; $\mathrm{Exp}(AV_1 \leftrightarrow AV_2)$ refers to the expected frequency of co-occurrences of the two attribute values; and $N$ is the total number of entities.
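As a worked illustration of the adjusted standardized residual, the sketch below computes it for one AV pair. It assumes the textbook form of the statistic and takes the expected co-occurrence count under independence to be the product of the two occurrence counts divided by N; the function name and the counts are hypothetical.

```python
import math


def adjusted_residual(occ_1, occ_2, occ_both, n):
    """Adjusted standardized residual for a pair of attribute values.

    occ_1, occ_2: occurrences of each attribute value
    occ_both:     co-occurrences of the two attribute values
    n:            total number of entities
    """
    # Assumption: expected co-occurrences under statistical independence.
    expected = occ_1 * occ_2 / n
    variance = expected * (1 - occ_1 / n) * (1 - occ_2 / n)
    return (occ_both - expected) / math.sqrt(variance)


# Hypothetical counts: the pair co-occurs 30 times where 10 would be expected.
sr = adjusted_residual(occ_1=40, occ_2=50, occ_both=30, n=200)
print(round(sr, 3), sr > 1.96)
```

A pair that co-occurs exactly as often as independence predicts gets a residual of zero.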
Subsequently, PDD applies Principal Component Analysis (PCA), a linear transformation, to decompose the SR-Matrix into Principal Components (PCs). Each PC is functionally independent, capturing associations distinct from those identified by other PCs. PDD then reprojects each PC back onto the SR-Matrix to generate a Reprojected SR-Matrix (RSR-Matrix) for each distinct PC. If the maximum residual between a pair of AVs within an RSR-Matrix exceeds a statistical threshold, such as 1.96, which corresponds to a 95% confidence level, the association captured is considered statistically significant. The associations discovered within each RSR-Matrix (or PC) remain functionally independent from those in other RSR-Matrices (or PCs).
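The decomposition and reprojection step can be sketched with NumPy as below. This is only an interpretation under stated assumptions: reprojection is taken to mean projecting the column-centered SR-Matrix onto a single PC and mapping it back (a rank-1 reconstruction), and the function name and toy matrix are hypothetical.

```python
import numpy as np


def reprojected_sr_matrices(sr, threshold=1.96):
    """Return (pc, rsr) pairs for PCs whose RSR-Matrix has a residual > threshold."""
    centered = sr - sr.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    kept = []
    for k in np.argsort(eigvals)[::-1]:         # strongest PC first
        pc = eigvecs[:, [k]]                    # column vector, shape (n_av, 1)
        rsr = centered @ pc @ pc.T              # assumed rank-1 reprojection
        if rsr.max() > threshold:
            kept.append((pc.ravel(), rsr))
    return kept


# Toy SR-Matrix: AVs 0-1 and AVs 2-3 form two mutually opposed groups.
toy_sr = np.array([
    [5.0, 5.0, -5.0, -5.0],
    [5.0, 5.0, -5.0, -5.0],
    [-5.0, -5.0, 5.0, 5.0],
    [-5.0, -5.0, 5.0, 5.0],
])
kept = reprojected_sr_matrices(toy_sr)
print(len(kept))
```

In this toy case a single PC carries all of the structure, so only one RSR-Matrix passes the threshold, and the projections of the two AV groups fall at opposite ends of that PC.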
Figure 1 illustrates the concept of disentanglement of AV associations. After decomposing the SR-Matrix, PCs and their corresponding RSR-Matrices are obtained. Only the RSR-Matrices that contain SR values exceeding the statistical significance threshold, together with their corresponding PCs, are retained. For each retained PC, as shown in Figure 1, three groups of projected AVs can be identified along its axis, each showing a different degree of statistical association: those close to the origin, which are not statistically significant and thus do not associate with distinct groups or classes (marked as (a) in Figure 1); those at one end of the projection (marked as (b)); and those at the opposite end. The AV groups or subgroups in (b), if their AVs are statistically connected to one another but disconnected from other groups, may be associated with distinct sources or classes (marked as (c)). As a result, two AV groups at opposite extremes are discovered. That is, each AV within such a group is statistically linked to at least one other AV in the same group, and none of them is statistically connected to AVs in other groups.
Furthermore, to achieve a more detailed separation, each AV group is divided into several subgroups based on the entity groups in which its AVs appear. This is done using a similarity measure defined by the overlap of the entities that each AV covers. We denote such an AV subgroup by a three-digit code and refer to it as a Disentangled Space Unit (DSU). We hypothesize that these DSUs originate from distinct functional sources.
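The exact similarity measure is not specified here. As one plausible reading, the sketch below uses the Jaccard overlap between the sets of entities covered by two AVs; the choice of Jaccard, the function name, and the record IDs are assumptions made for illustration.

```python
def entity_overlap(entities_a, entities_b):
    """Jaccard overlap between the entity sets covered by two AVs."""
    union = entities_a | entities_b
    return len(entities_a & entities_b) / len(union) if union else 0.0


# Hypothetical record IDs covered by each AV.
covered = {
    "A1=H": {1, 2, 3, 4},
    "A2=L": {2, 3, 4, 5},
    "A3=H": {7, 8, 9},
}
sim_12 = entity_overlap(covered["A1=H"], covered["A2=L"])
sim_13 = entity_overlap(covered["A1=H"], covered["A3=H"])
print(sim_12, sim_13)
```

AVs with a high overlap would fall into the same subgroup, while AVs covering disjoint sets of records would be separated.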
Each DSU therefore contains a set of AVs, which are referred to as pattern candidates. We then developed a pattern discovery algorithm to grow high-order patterns, called comprehensive patterns, from the pattern candidates. In the end, a set of high-order comprehensive patterns is generated within each DSU, and they are all associated with the same distinct source.
The interpretable output of PDD is organized in a PDD Knowledge Base. This framework is divided into three parts: the Knowledge Space, the Pattern Space, and the Data Space. First, the Knowledge Space lists the disentangled AV subgroups, referred to as Disentangled Space Units (DSUs). Each DSU is denoted by a three-digit code, shown in the three columns of the Knowledge Space to indicate different levels of grouping, and links to the patterns that PDD discovered in the records. Second, the Pattern Space displays the discovered patterns, detailing their associations and their targets (the specified classes or groups). Third, the Data Space shows the record ID of each patient, linked to the knowledge source (DSU) and the associated patterns. The Knowledge Base thus links knowledge, patterns, and data together. If an entity (i.e., a record) is labelled as a class, we can trace the “what” (i.e., the patterns it possesses), the “why” (the specific functional group it belongs to), and the “how” (by linking the patterns to the entity clusters containing the pattern(s)).
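One way to picture how the three spaces link together is as a small data structure. The sketch below is a hypothetical layout, not the actual Knowledge Base format: the class names `DSU` and `Pattern`, the helper `explain`, and all values are illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class Pattern:                      # Pattern Space entry
    avs: frozenset                  # attribute values forming the pattern
    target: str                     # class or group the pattern is associated with
    record_ids: set = field(default_factory=set)   # link to the Data Space


@dataclass
class DSU:                          # Knowledge Space entry
    code: tuple                     # three-digit code, e.g. (1, 1, 1)
    patterns: list = field(default_factory=list)


def explain(record_id, knowledge_base):
    """List (DSU code, pattern AVs, target) for every pattern a record possesses."""
    return [
        (dsu.code, sorted(p.avs), p.target)
        for dsu in knowledge_base
        for p in dsu.patterns
        if record_id in p.record_ids
    ]


# Hypothetical knowledge base with two DSUs.
kb = [
    DSU((1, 1, 1), [
        Pattern(frozenset({"A1=H", "A2=L"}), "class 1", {101, 102}),
        Pattern(frozenset({"A1=H", "A3=H"}), "class 1", {102}),
    ]),
    DSU((2, 1, 1), [Pattern(frozenset({"A4=L", "A5=M"}), "class 2", {103})]),
]
trace = explain(102, kb)
print(trace)
```

Looking up a record returns the patterns it possesses (the "what") and the DSU they come from (the "why"); the `record_ids` of each pattern give the entity cluster (the "how").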
The novelty and uniqueness of PDD lie in its ability to discover the most fundamental, explainable, and displayable associations at the AV level from entities (i.e., records) associated with presumed distinct primary sources. This is based on robust statistics, unbiased by class labels, confounding factors, and imbalanced group sizes, yet its results are trackable and verifiable by other scientific methods.
2.2.2. Clustering Patient Records
Without specifying the number of clusters to direct the unsupervised process, PDD can cluster records based on the disentangled pattern groups and subgroups.
As described in section 2.2.1, the output of PDD is organized into a Knowledge Base, where each pattern subgroup is represented by a DSU. The set of AVs displayed in each DSU is a summarized pattern, representing the union of all the comprehensive patterns discovered on entities from that subgroup. We denote the number of comprehensive patterns discovered from the summarized pattern in a DSU as . For example, in , if 10 comprehensive patterns are found, then . Each record may possess none, one, or multiple comprehensive patterns for each DSU. We denote the number of comprehensive patterns possessed by a record in a specific DSU as . For example, and indicate that the record with possesses 5 comprehensive patterns in and 6 comprehensive patterns in .
Each DSU can represent a specific function or characteristic in the data, potentially associated with a particular class. For example, in this study, is associated with , while is associated with . The fact that and appear as two opposite groups in (Figure 1) indicates that their AV associations have significant differences as captured by . Some DSUs might reveal rare patterns that are not associated with any class, that is, patterns in which no class label takes part in the association.
Based on the definitions above, we cluster the records by assigning each record to the class whose comprehensive patterns it matches most. To explain the clustering process in more detail, consider the following example. The DSUs output by PDD are , , , and , which are associated with , , , and , respectively. The total numbers of comprehensive patterns in these DSUs are , , , and , respectively. Consider a record () for which the numbers of comprehensive patterns possessed are , , , and .
Because the number of comprehensive patterns varies across DSUs, we use a percentage rather than an absolute value to measure the association of the record with pattern groups. Hence, to determine how the record () is associated with the patterns, we calculate the average percentage of the comprehensive patterns associated with a specific class that the record possesses, denoted as . Because indicates that the record is not covered by DSU[2,1,1], this DSU is excluded from the calculation to avoid the significant impact of a zero value on the final percentage. Hence, the association of the record () with the patterns is calculated as . Similarly, . Since is greater than , the record is assigned to . To evaluate the accuracy of this assignment for all records, we compare the assigned class label with the original implicit class label.
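The assignment rule can be sketched as follows. All numbers here are hypothetical and are not the values of the example above; the DSU codes other than DSU[2,1,1], the class names, and the function name `class_scores` are likewise assumptions made for illustration.

```python
def class_scores(possessed, totals, dsu_class):
    """Mean fraction of comprehensive patterns possessed, per class.

    DSUs in which the record possesses no pattern are excluded, so that a
    zero does not pull down the class average.
    """
    fractions = {}
    for dsu, total in totals.items():
        count = possessed.get(dsu, 0)
        if count > 0:
            fractions.setdefault(dsu_class[dsu], []).append(count / total)
    return {cls: sum(f) / len(f) for cls, f in fractions.items()}


# Hypothetical setting: four DSUs, two per class.
totals = {(1, 1, 1): 10, (1, 2, 1): 20, (2, 1, 1): 8, (2, 2, 1): 16}
dsu_class = {(1, 1, 1): "class A", (1, 2, 1): "class A",
             (2, 1, 1): "class B", (2, 2, 1): "class B"}
possessed = {(1, 1, 1): 5, (1, 2, 1): 6, (2, 1, 1): 0, (2, 2, 1): 4}

scores = class_scores(possessed, totals, dsu_class)
assigned = max(scores, key=scores.get)
print(scores, assigned)
```

In this toy case the class A score is the mean of 5/10 and 6/20, while the class B score uses only 4/16 because the record is not covered by DSU[2,1,1]; the record is therefore assigned to class A.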
2.2.3. Detecting Abnormal Records
The evaluation of classification or prediction involves comparing the predicted labels with the original class labels. However, this comparison is unreliable if the original data contain mislabelled records. To address this issue, we propose an error detection method that identifies abnormal records using the patterns discovered by PDD. In our early work on PDD, we integrated both supervised and unsupervised methods for error detection and class association. In this paper, we simplify the process by using only a novel unsupervised method on a dataset with implicit class labels as the ground truth, making the error detection process more succinct.
To determine whether a record is abnormal, the proposed algorithm compares the class assigned by PDD with the record's original label, evaluating the consistency of the discovered patterns with their respective explicit or implicit class labels. We define three statuses for an abnormal record: Mislabelled, Outlier, and Undecided, which are detailed below.
Mislabelled: If a record is categorized into one class but matches more patterns from a different class according to the PDD output, it suggests the record may be mislabelled. For example, consider the same record with described in section 2.2.2 with the same setting of the pattern groups where and . If the record is originally labelled as in the dataset, but the relative difference is greater than , this suggests that the record () is more associated with than with . The relative difference is used instead of absolute difference because it provides a scale-independent comparison of the number of patterns associated with one class to another. A value greater than 0.1 indicates that the number of patterns associated with one class is statistically significantly greater than the number associated with another class. Hence, the record () may be mislabelled.
Outlier: If a record possesses no patterns or very few patterns, the record may be an outlier. For example, consider a record with under the previously described pattern group settings. The comprehensive patterns possessed by this record are: , , , and . Calculating the percentages, is and is . Both percentages are less than or equal to , meaning that record m possesses no more than 1% of the patterns associated with either class, which may indicate that it is an outlier.
Undecided: If the number of patterns a record possesses is similar across different classes, the record is classified as undecided. For example, consider a record with under the previously described pattern group settings. The comprehensive patterns possessed by this record are: , , , and . Calculating the percentages, is the mean of and , which is ; and is the mean of and , which is . Since the difference between the two percentages is zero or less than , record k may be associated with both classes, suggesting that it is undecided.
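The three statuses can be combined into a single decision function, sketched below. Several details are assumptions made for illustration: the order in which the checks are applied, the definition of the relative difference as the gap divided by the larger score, and the `tie_gap` value of 0.001. The thresholds 0.1 and 0.01 (1%) follow the text above, and the class scores are hypothetical.

```python
def record_status(scores, original_label,
                  mislabel_gap=0.1, outlier_max=0.01, tie_gap=0.001):
    """Classify a record as outlier, undecided, mislabelled, or normal.

    scores: per-class average fraction of patterns possessed (two or more classes).
    """
    best = max(scores, key=scores.get)
    top, second = sorted(scores.values(), reverse=True)[:2]
    if top <= outlier_max:                 # possesses almost no patterns
        return "outlier"
    if top - second <= tie_gap:            # classes are indistinguishable
        return "undecided"
    relative_diff = (top - second) / top   # assumed definition
    if best != original_label and relative_diff > mislabel_gap:
        return "mislabelled"
    return "normal"


# Hypothetical class scores for four records.
status_a = record_status({"class A": 0.4, "class B": 0.25}, "class B")
status_b = record_status({"class A": 0.005, "class B": 0.01}, "class A")
status_c = record_status({"class A": 0.3, "class B": 0.3}, "class A")
status_d = record_status({"class A": 0.4, "class B": 0.25}, "class A")
print(status_a, status_b, status_c, status_d)
```

Records whose status is anything other than "normal" would be the candidates for removal before classifiers are retrained.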
To avoid introducing new incorrect information, mislabelled, undecided, and outlier records are removed from the dataset rather than relabelled. To validate the effectiveness of the abnormal-record detection, we compared the classification results of various classifiers on the original dataset with those on the dataset without abnormal records.