5.1. Limitations and propositions for enhancing this workflow
When large data is too large As is often the case in data mining, one limitation concerns the size of the data. It is clear that for massive data, where the number of rows reaches several million, specific algorithms such as grid-based methods or the canopy pre-clustering algorithm (McCallum et al. 2000) are needed for the algorithms to scale up.
More specifically, in such cases, factor analysis may be impossible to compute, as it requires matrix calculations and the inversion of matrices of size n * p (n individuals, p binary variables). Note, though, that in the case of categorical variables, one may prefer the Anglo-Saxon MCA method, which applies the CA algorithm to a Burt table (p * p) instead of the complete disjunctive table (n * p); this is more efficient in computing time and thus more appropriate for large data (also implemented in the MCA() function in FactoMineR (Greenacre 2007)). Equally, in the case of very large data, the CLARA algorithm may be too time-consuming, as enough samples and observations per sample must still be maintained for representativeness. For all these reasons, we suggest simply analyzing a random sample of the original dataset, which is likely to be highly representative of the latter while allowing the use of the Qluster workflow.
Note that PCA() and FAMD() are known to take more computation time than MCA(). As a consequence, the notion of “large” data may depend on the nature of the data (continuous, categorical, mixed). We also suggest (when possible) converging on one type of data (continuous only or categorical only) in a data preparation step. This is especially true for mixed data, where the upstream scaling of the data can be challenging and where FAMD computation times are higher. Alternatives may consist in not using the proposed workflow but rather algorithms that are fast on (very) large data, such as Mini-Batch K-means applied to continuous variables or to one-hot-encoded categorical variables. However, in addition to relying on the Euclidean distance only, these strategies may not allow the beforehand use of factor analysis because of the data’s size, nor an easy and proper assessment of clusters’ stability, which hinders confidence and support in the results. We therefore recommend the first option.
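The sampling-plus-Burt-table strategy discussed above can be sketched as follows. The dataset, its size, and the variable names are purely illustrative; the key points are the random subsampling and the `method = "Burt"` option of FactoMineR's MCA():

```r
library(FactoMineR)

set.seed(42)  # reproducible sampling

# Hypothetical large categorical dataset (sizes are illustrative)
n <- 1e5
df <- data.frame(
  v1 = factor(sample(c("a", "b", "c"), n, replace = TRUE)),
  v2 = factor(sample(c("x", "y"),      n, replace = TRUE)),
  v3 = factor(sample(c("lo", "mid", "hi"), n, replace = TRUE))
)

# Step 1: draw a random sample, likely representative of the original data
smp <- df[sample(nrow(df), 1e4), ]

# Step 2: MCA on the Burt table (p x p) rather than the complete
# disjunctive table (n x p), which saves computing time on large data
res_mca <- MCA(smp, ncp = 5, method = "Burt", graph = FALSE)
```

The rest of the workflow then proceeds on the retained MCA coordinates exactly as with the full dataset.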
Conversely, when the number of columns is greater than the number of rows (p > n), the dimension-reduction step via factor analysis methods makes perfect sense to easily manage the high dimensionality of the data. However, in the most extreme cases, where p >> n, standard factor methods may fail to yield consistent estimators of the loading vectors, and the results may be difficult to interpret. In such situations, regularized methods may be a solution to improve the robustness, consistency, and interpretability of the results (e.g., penalized PCA, Lee et al. 2012). It is also recommended that, when possible, a subset of the variables be selected prior to the analysis.
Generalizability of this workflow to missing data Missing-value management is not covered in this workflow, and it is therefore assumed that no missing values are present in the dataset. Indeed, both the factor methods (PCA, MCA, FAMD) and the proposed clustering methods (PAM, CLARA, ...) require data without missing values. However, this workflow can easily be generalized to missing data by using the same missMDA package as for selecting the optimal number of dimensions in factor analysis, in order to first impute missing values using factor methods. The latter are state-of-the-art methods for handling missing values (e.g., the functions imputePCA(), imputeMCA(), and imputeFAMD() for single imputation (Audigier et al. 2013)) and can thus easily be integrated and/or used in an automated workflow to handle missing data. In addition, this R package makes it possible to perform multiple imputation (MIPCA(), MIMCA(), and MIFAMD()) to assess uncertainty from imputed values and increase confidence in the results (Josse, Pagès, et al. 2011). In this sense, the Qluster workflow is therefore suited to handling missing data (see Appendix section E for an example of the Qluster workflow adapted for handling missing values).
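A minimal sketch of this imputation step on a small hypothetical categorical dataset with missing values is given below (the dataset and the ncp value are illustrative; in practice ncp can be estimated with estim_ncpMCA() from the same package):

```r
library(missMDA)
library(FactoMineR)

set.seed(1)

# Hypothetical categorical dataset with missing values
df_na <- data.frame(
  v1 = factor(sample(c("a", "b", "c", NA), 200, replace = TRUE)),
  v2 = factor(sample(c("x", "y", NA),      200, replace = TRUE)),
  v3 = factor(sample(c("lo", "hi"),        200, replace = TRUE))
)

# Step 1: single imputation by a factor method (ncp = 2 is illustrative;
# estim_ncpMCA(df_na) can be used to choose it)
imp <- imputeMCA(df_na, ncp = 2)

# Step 2: MCA on the completed disjunctive table, then clustering as usual
res <- MCA(df_na, tab.disj = imp$tab.disj, graph = FALSE)
```

MIMCA() could be substituted in a multiple-imputation variant to quantify the uncertainty attached to the imputed values.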
Discussion on using factor analysis as a first step As mentioned on several occasions, factor analysis allows the transformation of structured data of all kinds into continuous data, while dealing with large, collinear, noisy, high-dimensional data. It also facilitates clustering by aggregating groups of homogeneous information within dimensions. Nevertheless, it cannot always be guaranteed that the results will be “better” or “as good” with factor analysis in the clustering process. Similarly, the choice of factor analysis in this workflow comes with drawbacks, including the following:
The packages used cannot handle the ordinal nature of the variables. The latter must be treated as categorical or continuous.
The observations x components matrix is continuous, although some raw variables may be categorical. This prevents the user from favoring (respectively, not favoring) positive co-occurrence over negative co-occurrence via the Jaccard (respectively, Hamming) distance.
Alternatives can be to perform data-dimension reduction using feature-selection methods, or manually, by grouping, transforming, and/or deleting variables based on clinical expertise.
Discussion on using a single K-medoid algorithm In order to offer a simple, yet generic and robust workflow that makes practical use of the same methodology in many applications, we made a deliberate (but well-chosen) selection of both algorithms and software packages. In particular, the choice of the PAM/CLARA algorithm is based on many aspects, such as the fact that it is:
one of the best-known, most studied, and most widely used algorithms in the community for general purposes,
adapted to the continuous setting (i.e., the most mature in the literature),
meant for the most frequent use case of clustering (i.e., hard partitioning),
compatible with the Manhattan distance, which is less sensitive to outliers, unlike its counterpart based on the Euclidean distance (K-means),
deterministic, thanks to its internal medoid initialization procedure, unlike the basic K-means algorithm, which may lead to inconsistent or non-reproducible clusters,
requiring few parameters to set up (e.g., conversely to BIRCH and DBSCAN; see Fahad et al. (2014)),
very well implemented within a recognized reference R package (the FPC package), facilitating its use within a complete and robust clustering approach,
usable within the same R function (pamk()) regardless of the volume of data.
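The last two points can be sketched in a few lines. The data below (scaled iris measurements standing in for factor-analysis coordinates) and the krange values are illustrative; the Manhattan distance is supplied as a precomputed dist object:

```r
library(fpc)

# Continuous data standing in for coordinates from a factor analysis
X <- scale(iris[, 1:4])

# pamk() runs PAM for each K in krange and keeps the partition
# maximizing the average silhouette width (ASW); a Manhattan
# distance matrix is passed directly
d   <- dist(X, method = "manhattan")
res <- pamk(d, krange = 2:6, criterion = "asw", usepam = TRUE, diss = TRUE)

res$nc                    # optimal number of clusters
res$pamobject$clustering  # hard partition of the observations

# For large data, the same function switches to CLARA
# (usepam = FALSE), keeping the rest of the call unchanged.
```

The same pamk() call thus covers both the PAM and CLARA settings, which is what makes it usable regardless of the volume of data.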
Yet, it is clear that algorithms other than the ones chosen could routinely be used instead, including those present in the FPC R package to facilitate their integration within the workflow (e.g., DBSCAN and HAC). In particular, it is well known that with non-flat geometry and/or uneven cluster sizes, DBSCAN is more appropriate than K-means and PAM. Equally, if the final goal is to obtain a hierarchy rather than a unique hard partition, the user will be encouraged to prefer an algorithm such as HAC, which can easily be used with the packages proposed in this workflow. However, the presence of additional parameters to tune or the lack of compatibility with massive data would make the workflow more complex. Also, this workflow is not meant to replace more in-depth work by the data scientist to find what is optimal for a specific case study. More experienced data scientists can use the generic Qluster workflow for a first look at the data, but they are encouraged to adapt the general principles of this workflow to their own case study (e.g., finding the most suitable algorithm, etc.). Such adaptations would be out of scope for this workflow in the sense of its initial objectives: genericity of application while preserving simplicity of implementation and reliability/robustness of methodology.
Equally, the user may want to benchmark several clustering algorithms, as suggested in Hennig (2020). The comparison of method solutions can be based on information measures (e.g., entropy, mutual information, etc.), internal validity measures (e.g., silhouette; see section 2.4), set-matching (i.e., mapping each cluster from the first clustering to the most similar cluster in the second clustering and computing recall, precision, or some other measure), and pair counting (including dedicated visualization tools; see Achtert et al. (2012)). Some of these strategies are directly implemented in the clusterbenchstats() function from the FPC R package or in the clValid() function of the clValid R package. However, as our goal is to propose a simple-to-use workflow, this complexification - which would also greatly impact computing times and memory capacities - is left to the user’s discretion. Moreover, multiplying the algorithms and combinations of parameters forces one to rely more heavily on a purely statistical criterion (e.g., ASW) to select the “best” clustering of the data, although this may not reflect the best partitioning in the clinical sense. Indeed, ASW remains a criterion characterizing the average separability over all the clusters, and its optimum may miss (the set of) results that are clinically relevant and/or useful for the desired objective. If the data scientist wishes to compare different algorithms, we would rather recommend fully investigating the results of a well-chosen first algorithm (here, PAM/CLARA) before challenging it with others, in order to be less dependent on the ASW as the sole selection criterion. This paper thus takes the opposite view to the auto-ML literature by first advocating a full investigation of a parsimonious workflow made of well-chosen algorithms, rather than directly a broad coverage of algorithmic possibilities. Nevertheless, readers may be interested in recent areas of research around meta-clustering (Caruana et al. 2006) and ensemble clustering methods (Greene et al. 2004; Alqurashi et al. 2019). The first aims to produce several partitioning results so that the user can select those which are most useful. The second is intended to combine the clusterings of several methods with the aim of proposing a consensual result.
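Should the user nevertheless wish to run such a benchmark, a minimal sketch with clValid() is shown below (the data, the candidate methods, and the range of K are illustrative, not recommendations):

```r
library(clValid)

# Illustrative continuous data
X <- scale(iris[, 1:4])

# Compare several algorithms on internal validity measures
# (connectivity, Dunn index, silhouette) over a range of K
cv <- clValid(X, nClust = 2:6,
              clMethods = c("hierarchical", "kmeans", "pam"),
              validation = "internal")

optimalScores(cv)  # best (method, K) pair per validity measure
```

As discussed above, the resulting table still has to be weighed against clinical relevance rather than read as a purely statistical verdict.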
Discussion on the clusters’ stability assessment step Bootstrapping and noising methods were chosen in the workflow both for their availability in the same clusterboot() function, from the same package as pamk(), and for their complementarity, as recommended by Hennig (2007). Nevertheless, other methods may also be used as sensitivity analyses, including those proposed in the same FPC package. Furthermore, although this step allows the clusters to be assessed, data scientists should keep in mind that stability is not the only important validity criterion - clusters obtained by very inflexible clustering methods may be stable but not valid, as discussed in Hennig (2008). Finally, although several choices were made to manage outliers as well as possible, such as using a K-medoid algorithm and the Manhattan distance, the Qluster workflow does not fully address the issues related to outliers and extreme values. One solution may be to define threshold values to manually detect extreme values as a pre-processing step (as in the case study in section 4), or to use more sophisticated statistical methods such as that of Yang et al. (2021).
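The combined bootstrap-and-noise assessment can be sketched as follows (the data, K = 3, and B = 50 are illustrative; claraCBI is the fpc interface wrapping PAM/CLARA for clusterboot()):

```r
library(fpc)

# Illustrative continuous data
X <- scale(iris[, 1:4])

# Stability of a PAM partition (K = 3) assessed by both the
# bootstrap and the noising method, as recommended by Hennig (2007)
cb <- clusterboot(X, B = 50,
                  bootmethod   = c("boot", "noise"),
                  clustermethod = claraCBI, k = 3, usepam = TRUE,
                  count = FALSE, seed = 123)

cb$bootmean   # mean Jaccard similarity per cluster (bootstrap)
cb$noisemean  # same, for the noising method
```

Per-cluster mean Jaccard similarities close to 1 indicate stable clusters; markedly different answers from the two methods are precisely the ambiguous situation discussed in section 5.2.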
Discussion on the clusters’ interpretation Cluster description is not covered in the Qluster workflow. However, many methods exist for interpreting clusters (see section 2.3). Data scientists can easily generalize Qluster to the description of clusters - without making the workflow too complex - by using functions already present in the FPC package, such as plotcluster() and cluster.varstats(), following the methodology recommended by Hennig (2004).
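A brief sketch of such a description step is given below (the data and K = 3 are illustrative; argument defaults are those of the fpc documentation):

```r
library(fpc)
library(cluster)

# Illustrative continuous data and a PAM partition on it
X  <- scale(iris[, 1:4])
cl <- pam(dist(X, method = "manhattan"), k = 3, diss = TRUE)

# Discriminant projection of the clusters for visual interpretation
plotcluster(X, cl$clustering)

# Per-cluster descriptive statistics of the original variables
cluster.varstats(cl$clustering, iris[, 1:4], ask = FALSE)
```

Both functions stay within the FPC package already used by the workflow, so no additional dependency is introduced.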
Discussion on the types of data that are supported by the Qluster workflow Although general, the Qluster workflow does not cover all types of data, and it is clear that for medical imaging data, omics data, or data in the form of signals, dedicated approaches must be considered. Nevertheless, most tabular data can be processed using the Qluster workflow. In this respect, and although the Qluster workflow was specifically designed in the context of healthcare data analysis, it can easily be applied in other fields.
5.2. Discussion and recommendation on the practical use of the workflow
Use of clusters’ stability as a criterion to be optimized Cluster stability assessment could be considered as a criterion to be optimized, by iterating on this step in order to make this property an integral part of the clustering process itself. For example, stability measures could be used to select the optimal number of clusters, assuming that clustering results are more stable with the correct number of clusters (Fränti 2020).
Attention should be paid, however, to the fact that the bootstrap and noising methods are more computationally intensive, and thus more time-consuming, than simple methods such as deleting variables one by one (methods used on biological measurements and proposed in the clValid R package). Also, it may not be obvious how to optimize the clustering on clusters’ stability if the two proposed methods do not give similar results. For example, relative to the noising method, the bootstrap method is more likely to give stable results as the volume of the dataset increases in the case of PAM, and as the percentage of representativeness of the samples increases in the case of CLARA.
What if results are not satisfying The question of the ultimate relevance of clusters has not been covered in this workflow. It is worth pointing out that an absence of results may be a result in itself, as it can characterize a population that cannot be described in several homogeneous subgroups (either because such subgroups do not exist, or because the variables used do not allow them to be found). Nevertheless, it is clear that, as in the data mining process, one can consider looping back on this workflow and changing certain parameters if the results are not satisfactory or if an important criterion of the clustering was not taken into account at the beginning (for example, the maximum number of clusters, etc.). More generally, the data scientist is encouraged to keep in mind that the final objective of clustering is often the clinical relevance and usefulness of the results generated. In this sense, and as mentioned in section 5.1, it is not forbidden to relax a purely statistical criterion such as the ASW (whose optimum may miss some relevant subgroups, as it is an indicator of overall separability) to better represent the diversity of the population studied, or to favor the generation of hypotheses in the case where the statistical optimum only gives broad results that are not specific enough for the initial objective.
In the same spirit, negative Silhouette values are often considered too pejoratively in cluster validity analyses (interpreted as clustering failures). Indeed, negative Silhouettes characterize patients who are, on average, closer to patients from another cluster than to those of their own cluster. Therefore, patients with a negative Silhouette may be informative of potential relationships between clusters, and should be considered as potential additional information about disease history and phenotypic complexity, such as one cluster being the natural evolution of another. Hence, it is recommended that an analysis of patients with negative Silhouettes be included in the workflow to better assess whether they reflect “bad” clustering or are the key to a better understanding of the disease.
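Extracting these patients is straightforward with the cluster package (the data and K = 3 below are illustrative):

```r
library(cluster)

# Illustrative continuous data, Manhattan distance, PAM with K = 3
X  <- scale(iris[, 1:4])
d  <- dist(X, method = "manhattan")
cl <- pam(d, k = 3, diss = TRUE)

sil <- silhouette(cl$clustering, d)

# Patients closer, on average, to another cluster than to their own
neg <- which(sil[, "sil_width"] < 0)

# Inspect them with their nearest "neighbor" cluster, e.g. to look
# for one cluster being the natural evolution of another
sil[neg, c("cluster", "neighbor", "sil_width")]
```

The "neighbor" column directly identifies, for each such patient, the cluster to which they are closest, which is the starting point of the analysis recommended above.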
What if the optimal number of clusters is the minimum of the range of K? In the case where the optimal number of clusters is the minimum of the range of K (as in our example in section 4), we recommend (if appropriate) that data scientists test lower values of K to challenge the obtained optimum. Equally, if the optimum is obtained for K = 2, data scientists should test whether the dataset should be split into two clusters at all, using the Duda-Hart test of the null hypothesis of homogeneity in the whole dataset. This can be done with the same pamk() function by setting the minimum of K to 1, or directly with the dudahart2() function (also in the FPC R package). In any case, if the primary objective is to provide fine-grained knowledge of the study population, it will still be possible to provide results with the optimal K initially obtained, keeping in mind that the levels of inter-cluster separability and intra-cluster homogeneity are not much higher than those that would be obtained with a smaller number of clusters.
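Both options can be sketched as follows (the data and the range of K are illustrative):

```r
library(fpc)

# Illustrative continuous data
X <- scale(iris[, 1:4])

# Option 1: let pamk() consider K = 1 via the Duda-Hart test
res <- pamk(X, krange = 1:6)
res$nc  # equals 1 if the homogeneity hypothesis is not rejected

# Option 2: run the test directly on a 2-cluster partition
cl2 <- pamk(X, krange = 2)$pamobject$clustering
dh  <- dudahart2(X, cl2)
dh$cluster1  # TRUE if the data should NOT be split into 2 clusters
```

Note that the Duda-Hart test needs the raw data matrix, not a distance matrix, which is why X is passed directly here.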
Using this workflow routinely The Qluster workflow lends itself well to automation for data scientists and companies that need a routine way to cluster clinical data. Indeed, data scientists may create a main function for applying this workflow, including settings for the nature of the data (categorical/continuous/mixed), its volume (normal/large), and the parameters of each function called. It is worth mentioning, however, that the quality of the input data or the structure of the groups to be found may not always allow the present workflow to identify relevant results. In that case, the data scientist can refer to the indications given above or, if necessary, consider an approach more adapted to their data.