Geometric Separability of Mesoscale Patterns in Data Represented in a Two-Dimensional Space
Measuring group separability in a geometric space is a fundamental task in data science and pattern recognition [1,2], because it allows assessing how well algorithms for dimension reduction, embedding, and representation of multidimensional data perform [2]. The quality of a representation can be assessed according to different criteria; the criterion of geometric group separability of mesoscale patterns evaluates the ability of an algorithm to best represent the mesoscale patterns hidden in the original high-dimensional data space. In this study, by mesoscale patterns we mean the organization of data samples into groups that are separated from one another (inter-group diversity) while retaining a meaningful internal distinction (intra-group diversity). Accordingly, to best represent mesoscale patterns in a two-dimensional space, the groups of samples should retain both inter-group diversity and intra-group diversity. This can be evaluated because the groups of samples are associated with labels, which can be provided: (1) in a supervised manner, using meta-features or metadata designed by the users; (2) in an unsupervised manner, by applying data clustering algorithms, with the aim of discovering new group stratifications or independently verifying the expected ones.
In 1973, with the introduction of the Dunn index [3], the concept of cluster validity index (CVI) was presented with the aim of evaluating the group separability of clusters detected in a geometric space by unsupervised data clustering algorithms. Over the years, following the same philosophy, many other cluster validity indices were proposed, including, to name some of the most used: the Calinski-Harabasz index (CH) in 1974 [4]; the Davies-Bouldin index (DB) in 1979 [5]; the Silhouette index (SIL) in 1987 [6]; the Generalized Dunn Index (GDI) in 1995 [7]; and the Density-involved Distance (CVDD) in 2019 [8]. For details on the mathematical formula of each of these indices, please refer to the original publications because, for the reasons we explain below, they are not the subject of this study.
Although not intentionally designed for it, and at the risk of inaccuracy, CVIs gained popularity in a similar applied problem, which is the one discussed in this study: measuring the geometric group separability of mesoscale patterns in data represented in a two-dimensional space. Each of these CVIs was introduced across the decades to address different evaluation issues, but all of them share the same conceptual problem. In Figure 1a we show that cluster validity indices belong to a special subclass of separability indices that enforces compactness, because the preservation of intra-group diversity is neglected. Indeed, as we show in Figure 1b, the CH index (like any CVI) assigns a higher score to the representation in which the points of each group tend to collapse (in the limit) into a single point (right panel of Figure 1b), favoring compactness over the retention of intra-group diversity (left panel of Figure 1b). CVIs favor compactness because they were designed to evaluate the performance of clustering algorithms, but this criterion is too restrictive for evaluating the geometric separability of mesoscale patterns in data represented in a two-dimensional space, because in this setting we want to reward representations in which intra-group diversity is preserved. Indeed, the goal of two-dimensional data representation is to explore the relative arrangement of the samples inside each group and between groups.
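The compactness bias described above can be illustrated with a minimal sketch. It assumes the standard textbook form of the Calinski-Harabasz index (between-group dispersion over within-group dispersion, each divided by its degrees of freedom); the toy coordinates and the helper names `calinski_harabasz` and `shrink` are hypothetical and are not taken from the study or from Figure 1b.

```python
from statistics import fmean


def centroid(points):
    """Mean centroid of a list of (x, y) points."""
    return (fmean(p[0] for p in points), fmean(p[1] for p in points))


def calinski_harabasz(groups):
    """CH index for a list of groups, each a list of (x, y) points.

    Assumes the standard definition: [B / (k - 1)] / [W / (n - k)].
    """
    all_points = [p for g in groups for p in g]
    n, k = len(all_points), len(groups)
    gx, gy = centroid(all_points)
    between = 0.0
    within = 0.0
    for g in groups:
        cx, cy = centroid(g)
        between += len(g) * ((cx - gx) ** 2 + (cy - gy) ** 2)
        within += sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in g)
    return (between / (k - 1)) / (within / (n - k))


def shrink(group, factor):
    """Pull every point toward its group centroid (hypothetical helper)."""
    cx, cy = centroid(group)
    return [(cx + factor * (x - cx), cy + factor * (y - cy)) for x, y in group]


# Two linearly separable toy groups that keep some internal spread.
spread = [
    [(0, 0), (0, 2), (1, 1), (-1, 1)],
    [(6, 0), (6, 2), (7, 1), (5, 1)],
]
# The same groups with each one almost collapsed onto its centroid.
compact = [shrink(g, 0.1) for g in spread]

ch_spread = calinski_harabasz(spread)
ch_compact = calinski_harabasz(compact)
print(ch_spread, ch_compact)
```

Both layouts are separated by the same vertical line, yet shrinking each group by a factor of 0.1 leaves the between-group term unchanged and divides the within-group term by 100, so the CH score grows by about two orders of magnitude.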
To this aim, in 1998 Thornton introduced the concept of geometric separability and an algorithm to compute the geometric separability index (GSI) [1]. Geometric separability is based on the criterion that a point should share the same label as its first-nearest neighbor in the geometric space. The GSI is defined as the proportion of data points whose classification labels are the same as those of their first-nearest neighbor. GSI can detect group separability in the presence of nonlinearity, but it cannot distinguish whether the separability is linear or nonlinear, and it seems to suffer more than the CVIs in the presence of noise or micro-cluster formations [2].
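The GSI definition given above translates almost directly into code. The following is a minimal brute-force sketch, not Thornton's original algorithm: it assumes Euclidean distance, breaks ties between equally distant neighbors by taking the first one found, and uses the hypothetical function name `geometric_separability_index`.

```python
from math import dist


def geometric_separability_index(points, labels):
    """Fraction of points whose first-nearest neighbor has the same label.

    Brute-force O(N^2) search; ties are resolved by lowest index (assumption).
    """
    n = len(points)
    same = 0
    for i in range(n):
        nearest = min(
            (j for j in range(n) if j != i),
            key=lambda j: dist(points[i], points[j]),
        )
        if labels[nearest] == labels[i]:
            same += 1
    return same / n


# Two well-separated toy groups: every nearest neighbor shares the label.
separated_points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
separated_labels = ["a", "a", "a", "b", "b", "b"]

# Alternating labels along a line: every nearest neighbor has the other label.
mixed_points = [(0, 0), (1, 0), (2, 0), (3, 0)]
mixed_labels = ["a", "b", "a", "b"]

gsi_separated = geometric_separability_index(separated_points, separated_labels)
gsi_mixed = geometric_separability_index(mixed_points, mixed_labels)
print(gsi_separated, gsi_mixed)
```

Because only the nearest neighbor matters, the index does not depend on the shape of the boundary between groups, which is why it cannot tell linear from nonlinear separability.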
The concept of linear separability in a geometric space was discussed in 1969 by Minsky and Papert [9,10], who described tasks that could be handled using the Perceptron method as ‘linearly separable’ [1], meaning that there exists a separability line which segregates two groups of samples from each other. However, a separability line was never used to design indices for evaluating the geometric separability of mesoscale patterns in data represented in a two-dimensional space. In 2022, our group, in the study of Acevedo et al. [2], proposed the general data science notion termed projection separability (PS), which contemplates diverse ways to define linear separability with respect to a projection separability line. The separability line (Figure 1b, vertical dashed black line) separates two groups of samples in a geometric space and indicates the presence of linear separability. In a 2D space, the projection separability line (Figure 1b, horizontal solid line) is orthogonal to the separability line and is used to project the samples and to assess the extent to which their organization departs from exact linear separability into two groups. Two examples of projection lines are: (1) the line that connects the centroids of two groups of nodes in the geometric space [2] (see the centroid projection line in Figure 3a,b), which defines the centroid projection separability (CPS); (2) the line defined according to a criterion of maximum linear data discrimination, namely the first component projection vector of linear discriminant analysis (LDA) [11] (see the LDA projection line in Figure 3a,b), which defines the linear discriminant projection separability (LDPS) [2]. The criterion of separability for the LDPS is to maximize the ratio of the variance between groups to the variance within groups [11]. In this study we concentrate on CPS and LDPS because they are the most efficient solutions currently at hand [2], as we motivate hereafter. Other examples and notions to define a projection line were discussed in our previous study [2]. A separability line can be obtained by any statistical or machine learning technique that maximizes a criterion of separability between two groups of data [2]. For instance, the linear binary soft margin Support Vector Machine (lbSVM) [12,13,14] maximizes the margin between the two groups, and the line orthogonal to the maximum-margin hyperplane (the decision boundary) can be used as a projection line. Hence, the criterion of separability for the support vector projection separability (SVPS) [2] is to maximize the geometrical margin between the two groups. However, lbSVM scales cubically with the number of samples [14,15], and its running time is in general longer than that of LDA. To address these time issues, Acevedo et al. [2] introduced the centroid projection separability (CPS) methodology, whose time complexity is O(ND), where N is the number of samples and D is the number of dimensions. Since in our study the representation is two-dimensional (D = 2 is a constant and does not impact the time complexity), CPS scales linearly with the number of samples. CPS computes the geometrical centroid (median estimator) of each of the two groups, and then takes the line that connects them as the projection line (Figure 3a,b). CPS offers a naïve solution to measure linear separability that is more approximate than LDPS and SVPS, but its advantage in running time and time complexity over the other solutions is remarkable.
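The two projection lines this study concentrates on can be sketched for the 2D case as follows. This is an illustrative sketch under stated assumptions, not the reference implementation of Acevedo et al.: the CPS centroid is taken as the coordinate-wise median (one reading of "median estimator"), the LDA direction is computed with the classical two-group Fisher formula (inverse pooled within-group scatter times the difference of the group means), and the function names `cps_direction`, `lda_direction` and `project` are hypothetical.

```python
from math import hypot
from statistics import fmean, median


def unit(dx, dy):
    """Normalize a 2D vector to unit length."""
    norm = hypot(dx, dy)
    if norm == 0:
        raise ValueError("direction is undefined: zero-length vector")
    return (dx / norm, dy / norm)


def cps_direction(group_a, group_b):
    """Unit vector from the median centroid of group_a to that of group_b."""
    ca = (median(p[0] for p in group_a), median(p[1] for p in group_a))
    cb = (median(p[0] for p in group_b), median(p[1] for p in group_b))
    return unit(cb[0] - ca[0], cb[1] - ca[1])


def lda_direction(group_a, group_b):
    """Unit Fisher discriminant direction for two groups of 2D points."""
    sxx = sxy = syy = 0.0  # pooled within-group scatter matrix entries
    means = []
    for group in (group_a, group_b):
        mx = fmean(p[0] for p in group)
        my = fmean(p[1] for p in group)
        means.append((mx, my))
        for x, y in group:
            sxx += (x - mx) ** 2
            sxy += (x - mx) * (y - my)
            syy += (y - my) ** 2
    det = sxx * syy - sxy * sxy
    if det == 0:
        raise ValueError("within-group scatter matrix is singular")
    dx = means[1][0] - means[0][0]
    dy = means[1][1] - means[0][1]
    # Closed-form product of the 2x2 inverse scatter with the mean difference.
    return unit((syy * dx - sxy * dy) / det, (sxx * dy - sxy * dx) / det)


def project(points, direction):
    """Scalar coordinate of each point along the projection line."""
    return [x * direction[0] + y * direction[1] for x, y in points]


group_a = [(0, 0), (0, 1), (0, 2), (1, 1), (-1, 1)]
group_b = [(4, 0), (4, 1), (4, 2), (5, 1), (3, 1)]

cps = cps_direction(group_a, group_b)
lda = lda_direction(group_a, group_b)
print(project(group_a, cps), project(group_b, cps))
```

The CPS direction needs a single pass over the samples to obtain the centroids, whereas the LDA direction additionally needs the within-group scatter; in this symmetric toy layout the two directions coincide, which will not hold in general.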
Finally, a projection separability index (PSI) is defined by applying any bi-class separability measure (such as the area under the precision-recall curve, AUPR [16], or any other measure for evaluating unbalanced data classification) directly on the projection line to measure the extent to which the two groups are linearly separable. For instance, PSI was successfully adopted to evaluate the geometric linear separability of spatially organized groups of single cells embedded in a 2D and 3D space by analyzing their transcriptome [17].
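As a rough illustration of how a bi-class measure is applied on the projection line, the sketch below scores 1D projected coordinates with average precision, which is used here as one common estimator of AUPR. That choice is an assumption of this example, as is the convention that the projection line is oriented so that the group treated as positive lies at larger coordinates; the source does not prescribe either. The function name `average_precision` is hypothetical.

```python
def average_precision(scores, is_positive):
    """Mean of the precision values measured at the rank of each positive.

    `scores` are 1D coordinates on the projection line; samples are ranked
    from the largest coordinate to the smallest (ties keep input order).
    """
    n_pos = sum(is_positive)
    if n_pos == 0:
        raise ValueError("at least one positive sample is required")
    ranked = sorted(zip(scores, is_positive), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, positive) in enumerate(ranked, start=1):
        if positive:
            hits += 1
            total += hits / rank
    return total / n_pos


# Projections of two linearly separable groups: positives all rank first.
psi_separable = average_precision(
    [0.0, 0.5, 1.0, 4.0, 4.5, 5.0],
    [False, False, False, True, True, True],
)

# Projections where the two groups interleave along the line.
psi_interleaved = average_precision(
    [0.0, 1.0, 2.0, 3.0],
    [True, False, True, False],
)
print(psi_separable, psi_interleaved)
```

The score depends only on the ordering of the samples along the line, not on how tightly each group is packed, which is consistent with the point that a PSI does not reward compactness.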
GSI and PSI values are bounded between 0 (worst result) and 1 (best result, indicating data separability), while the majority of CVIs are not bounded. Since GSI and PSIs evaluate mere geometric separability, they do not preferentially look for compactness as the CVIs do. For this reason, in the example of Figure 1b, PSI assigns the highest value (PSI = 1 indicates the presence of linear separability) to both patterns of separability, indicating that they are both valid and of interest, whereas the CH index overrates the separability pattern on the right side (CH = 259.67) because, like all cluster validity indices, it aims to reward compactness.
In the panel provided in Figure 1c we offer an overview of the CVIs, GSI and PSIs mentioned above together with their characteristics (see figure legend for details), based on a meta-analysis of empirical evidence conducted in the study of Acevedo et al. [2]. This comparison shows that geometric separability-based indices, such as GSI and PSIs, perform better than cluster validity indices on many requirements; hence, in this study we consider only GSI and PSIs. GSI is the second best, but it suffers in the case of overlapping clusters, cannot distinguish linear from nonlinear separability, and is affected by isotropic noise. PSIs are the best because they can encompass all the characteristics, but their results are affected by the presence of nonlinear separability between groups in the data. Therefore, the first aim of this study is to investigate how to extend the concept of projection separability to the nonlinear scenario.
Geometric Separability of Mesoscale Patterns in Complex Networks Represented in a 2D Space
In recent decades, the landscape of physics has expanded. It now embraces many data-driven approaches, emphasizing the analysis, representation, and interpretation of data using computing tools. Physics has become a dynamic blend of traditional principles and new cutting-edge tools, including data science and AI, allowing us to delve deeper into the mechanisms governing our universe. These tools play a pivotal role in unraveling complex systems - systems whose properties emerge from the interactions among their constituent parts - and this naturally directs scientists in complexity science to adopt networks as a framework to model the complexity behind the physics of the system. Micro-properties of complex networks, such as the average clustering coefficient and the degree probability distribution, are features of complexity that emerge from the statistical analysis of micro-structures around a network node. However, one of the most intriguing aspects of complexity is the capability to originate mesoscale patterns from microscopic interactions.
The emergence of mesoscale patterns in complex systems is a key feature of complexity, which occurs when micro-parts of a system tend to self-organize into groups as a result of interacting more closely with each other than with other micro-parts. These mesoscale structures are important because they can influence dynamic processes on the network, such as information flow. Representing mesoscale patterns helps in identifying the underlying principles of network organization and can have practical applications in various fields, from ecosystem management to the design of resilient infrastructure. Mesoscale patterns form at different physical scales in complex systems: proteins in molecular networks create stronger interactions inside functional complexes; insect swarms, bird flocks and fish schools create different internal meso-patterns in response to external stimuli (e.g., temperature of air or water) or threats; humans in social networks make tighter links inside communities. Some of these mesoscale patterns, such as those in bird flocks, are directly visible because they are generated in a patent space; other patterns, such as those in protein interactomes or social networks, emerge in a latent space, and the adoption of algorithms for network embedding is fundamental to visualize their presence. Nevertheless, regardless of their origin, once these mesoscale patterns are geometrically represented in a visual space, they are a feature of complexity that needs quantification and analysis. Thus, questions such as how close or far, how similar or distant the groups that form mesoscale structures are represent a challenge to address in complexity analysis. This requires introducing the notion of geometric separability between groups that form mesoscale patterns in networks derived from complex systems.
Examples of mesoscale patterns in complex networks are communities or modules, in-block nestedness, and core-periphery structures. In this study we concentrate on mesoscale patterns associated with community organization because they are investigated in many domains of applied network science. A community or module refers to a subset of nodes within a network that interact with each other more frequently than with nodes outside that specific community [18]. Communities are a crucial meso-property to analyze in order to reveal and understand the emergence of mesoscale mechanisms in the associated complex system; however, their visualization is not always straightforward. For instance, Figure 2a displays the adjacency matrix associated with the unweighted connectivity of an artificial complex network with 2 communities generated with the nPSO model [19,20]. Looking at the binary color visualization of the adjacency matrix in Figure 2a (orange for observed and black for missing interactions), it is not straightforward to visually distinguish the mesoscale patterns associated with the 2 communities in the network.
Network embedding in a two-dimensional (2D) geometric space [21,22] plays a crucial role in the visual representation, discovery, investigation and interpretation of mesoscale patterns hidden in the structure of a complex network. When the adjacency matrix of Figure 2a is represented by a network embedding algorithm (in this case, HOPE [23]) in a 2D geometric space (Figure 2b), we can visually recognize the presence of the 2 communities (compare their pattern with the ground-truth node colors in Figure 2c), showing the utility of network embedding to discover patterns in complex data analysis [24,25,26,27]. Yet, new challenges [28] emerge after the data embedding. For instance, how close or far, how similar or distant are the mesoscale structures associated with communities in the networks? The separability of the communities in the two-dimensional geometric space can be used, for instance: (1) to evaluate the performance of network embedding algorithms or to guide the tuning of their hyperparameters; (2) to evaluate the similarity between the mesoscale organization of diverse complex networks according to the geometric separability of their communities. In the first case, the more clearly an algorithm discloses and displays the community structure of the networks in the two-dimensional space, the better its performance is rated. In the second case, the closer the evaluations of community geometric separability are between networks generated from the same complex system, the higher the similarity of their mesoscale organization.
Hence, we introduce and test in network science as well the notion of linear and nonlinear geometric separability of mesoscale patterns [1,2], which, in the specific case of this study, concerns measuring the geometric separability of the groups of network nodes that form the communities.
Figure 2.
Geometric separability of community-based mesoscale patterns in complex networks. (a) The adjacency matrix of an artificial network with two communities generated with the nonuniform popularity similarity model (nPSO). From the adjacency matrix, the presence of any mesoscale structure associated with community organization is not visible. (b) Embedding of the nPSO network by the HOPE algorithm in a two-dimensional geometric space reveals a geometric representation composed of two groups of nodes (one up and one down), providing evidence of the efficacy of network embedding in visualizing the latent mesoscale structure of complex networks. (c) Attributing to each node a color related to its community (red or green) in the network, we note that nodes in the same community are located closer to each other, forming two groups in the geometric space. Evaluating the representation of a network in relation to the geometric separability of the groups of nodes formed by their communities is an innovation that we introduce in this article.