2.3.1. Principal Components Analysis
Principal Component Analysis (PCA) expresses the information contained in a data set as a small number of mutually orthogonal, uncorrelated variables, referred to as principal components (PCs), while retaining as much of the variability in the data as possible. This approach reduces the dimensionality of the spectral data to a few components, facilitating subsequent analysis and mitigating the risk of drawing erroneous conclusions. PCA was therefore employed for the initial exploration of the interrelationships among the subspaces corresponding to individual plants and for a preliminary estimate of the number of components used later in the discriminant analysis. The previously defined "fingerprint" ranges were used to construct the PCA models for the UV-Vis and ATR-FTIR spectra, i.e., 310-440 nm for UV-Vis and 1000-1300 cm⁻¹ for ATR-FTIR. During the analysis, it became evident that, while differentiating the spectra improves the direct visual distinguishability of the samples, it does not fundamentally alter the results obtained with chemometric methods. Consequently, all chemometric calculations were performed on PQN-normalized spectra without calculating derivatives. With 93 samples available in both cases, we obtained two input matrices of dimensions 93×159 (UV-Vis) and 93×156 (ATR-FTIR). The first three principal components for UV-Vis account for 81.9%, 14.7%, and 3.3% (99.9% in total) of the total variance; for ATR-FTIR, the corresponding values are 55.6%, 29.9%, and 9.1% (94.5% in total). In the case of UV-Vis, the contribution of components beyond the first three is insignificant and can be disregarded, which is justified by Kaiser's rule, since the eigenvalues beyond the third component are smaller than one. For the ATR-FTIR matrix, however, the first five components have eigenvalues greater than one, indicating a more substantial contribution. Besides the percentage of variance explained by an individual component, another way to determine the appropriate number of components in PCA is to evaluate the variance retained after dimensionality reduction, i.e., to analyze how effectively each component, together with the preceding ones, reconstructs the original data. The better the first n components reproduce the original spectrum, the more adequate the chosen number of components.
Figures S3 and S4 depict the ATR-FTIR (S3) and UV-Vis (S4) spectra treated with PQN alongside the spectra reconstructed using the first one (a), two (b), and three (c) principal components. It is easily noticed that reconstruction with only one component does not adequately reproduce either type of spectrum.
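As an illustration of how the explained variance, Kaiser's rule, and the n-component reconstruction can be inspected, a minimal Python sketch is given below (assuming scikit-learn; the matrix X stands for the PQN-normalized spectra, here replaced by random numbers as a placeholder, and the software actually used in this work may differ):

```python
import numpy as np
from sklearn.decomposition import PCA

# X stands for the PQN-normalized spectra, one row per measurement
# (93 x 159 for UV-Vis or 93 x 156 for ATR-FTIR); random placeholder data here.
rng = np.random.default_rng(0)
X = rng.normal(size=(93, 159))

pca = PCA().fit(X)

# Percentage of the total variance explained by each principal component
explained = pca.explained_variance_ratio_ * 100
print("Variance explained by PC1-PC3: %.1f%%, %.1f%%, %.1f%%" % tuple(explained[:3]))

# Kaiser's rule: retain components whose eigenvalues exceed one
# (most meaningful when the data are autoscaled beforehand)
n_kaiser = int(np.sum(pca.explained_variance_ > 1.0))
print("Components with eigenvalue > 1:", n_kaiser)

# Reconstruction of the spectra from the first n components only
def reconstruct(X, n):
    pca_n = PCA(n_components=n).fit(X)
    return pca_n.inverse_transform(pca_n.transform(X))

for n in (1, 2, 3):
    residual = np.linalg.norm(X - reconstruct(X, n)) / np.linalg.norm(X - X.mean(axis=0))
    print(f"Relative reconstruction residual with {n} component(s): {residual:.3f}")
```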
For the investigated samples, groups with similar patterns were visualized using PCA plots of the first two or three principal components. The score plots provide the most insightful representation of the diversity among samples from different species.
Figure 5(a-b) displays the score plots of the UV-Vis and ATR-FTIR data on the first three principal components, explaining 99.9% and 94.5% of the overall variance, respectively. Additional diagrams with centroids and confidence ellipses are provided in Figure S5 of the Supplementary Information.
The 2D score plots with only two components are shown in Figures S6(a-b), illustrating that the first two principal components distinguish well between the UV-Vis spectra of thyme and lavender. The addition of the third component helps to separate the sage spectra more clearly, albeit with noticeable outliers. The basil spectra present challenges, with individual points lying between those of sage and oregano. Eliminating outliers improves the separation of sage from oregano (Figure S5a). Despite the difficulties in differentiating basil, whose centroid lies close to those of sage and oregano, including PCA components beyond the third does not enhance the discriminative power of the model. The distinction is much simpler for the ATR-FTIR spectra, as Figure S5(c-d) indicates, where the sets representing particular plants are more clearly distinguishable than for UV-Vis. Similarly positive outcomes are expected for oregano, thyme, and lavender. While sage poses a potential challenge due to outliers interfering with lavender and oregano, the situation appears more favorable than for UV-Vis. For the ATR-FTIR spectra, consecutive principal components (i.e., n > 3) also carry significant information about the variance distribution. Notably, lavender and thyme clearly overlap in the 2D diagrams (Figure S6), but this overlap is no longer observed in the 3D plots (Figure S5(c-d)). The third component is therefore expected to differentiate the positions of lavender and thyme in the PCA space. However, it is crucial to emphasize that the analysis of the entire PCA space provides only a preliminary orientation and does not definitively determine the number of components necessary to construct an effective model; this determination occurs during the model-building process.
2.3.2. Discriminant Analysis Methodology
The qualitative nature of PCA results provides only an initial understanding of the potential for further differentiation among the five investigated herbs. Soft Independent Modeling of Class Analogy (SIMCA), originally developed in 1976 [43], stands as one of the most popular and widely utilized techniques in chemometrics [44,45]. As a Class Modeling (CM) method [3], SIMCA independently constructs a boundary in the variable hyperspace within which specimens associated with a specific category are likely to be located. This flexibility allows SIMCA to identify samples as belonging to none, one, or several modeled categories when multiple classes are considered. Another commonly used multiclass classification methodology is Partial Least Squares Discriminant Analysis (PLS-DA) [46]. In contrast to SIMCA, PLS-DA does not use PCA directly. Instead, it combines partial least squares (PLS) regression with discriminant analysis to build a model that maximizes the covariance between the predictor variables and the class labels. PLS-DA creates a unified predictive framework that considers all classes simultaneously to enhance discriminative ability. Whereas SIMCA constructs a separate PCA model for each class and categorizes new samples according to their similarity to these class models, PLS-DA discriminates between predefined classes by building a single predictive model that assigns samples to those classes. The Support Vector Machine (SVM) is a versatile supervised learning algorithm used for both classification and regression tasks [47]. SVM maximizes the margin between the nearest points of different classes; it does not create individual models for each class as SIMCA and PLS-DA do, but instead identifies the optimal hyperplane(s) that collectively separate the classes. Consequently, SVM serves as a more general classification tool than SIMCA or PLS-DA, separating the classes in the feature space without explicitly modeling each class. Linear Discriminant Analysis (LDA) [48] can also be employed as a classification technique for the differentiation of the five investigated herbs. LDA finds a linear combination of features that best separates two or more classes, maximizing the distance between the class means while minimizing the within-class variation. Unlike SIMCA, which models each class independently, LDA builds a single predictive model that uses this linear combination to classify samples into predefined groups, enhancing the discrimination between classes.
Whereas SIMCA, PLS-DA, and SVM share a focus on class separation and on modeling class-specific variability, conventional machine learning (ML) methods, such as Artificial Neural Networks, Random Forests (RF), and Convolutional Neural Networks, are more general-purpose algorithms with broader applicability across different domains [49]. The Multilayer Perceptron (MLP) is an artificial neural network with multiple layers of nodes that learns by adjusting the weights between neurons during training using backpropagation and gradient descent. Artificial Neural Networks (ANN) and MLP are essentially the same, MLP being a specific type of ANN consisting of multiple layers of neurons, including at least one hidden layer. Random Forest is an ensemble learning method based on decision trees. It constructs multiple decision trees during training, each tree being trained on a random subset of the training data and features. During classification, RF aggregates the predictions of the individual trees (usually through averaging or voting) to make the final prediction. RF is robust to overfitting, less sensitive to outliers, and generally requires less hyperparameter tuning than other algorithms. Despite their differences, both MLP and RF are powerful machine learning algorithms known for their robustness and ease of use.
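For illustration, the sketch below shows how the classifiers compared in this work could be set up in Python with scikit-learn; this is only one possible implementation, the hyperparameters are arbitrary placeholders, and the software actually used in the study is not implied. Since scikit-learn has no ready-made PLS-DA classifier, it is emulated here in the usual way, i.e., PLS regression on one-hot encoded class labels followed by an argmax decision; a simplified SIMCA is sketched separately in the discussion of the UV-Vis results below.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Generic multiclass classifiers used in the comparison (illustrative settings)
lda = LinearDiscriminantAnalysis()
svm = SVC(kernel="rbf")                                   # maximum-margin separation
rf = RandomForestClassifier(n_estimators=500, random_state=0)
mlp = MLPClassifier(hidden_layer_sizes=(50,), max_iter=2000, random_state=0)

class PLSDA:
    """PLS-DA emulated via PLS regression on one-hot encoded class labels."""

    def __init__(self, n_components=3):
        self.pls = PLSRegression(n_components=n_components)

    def fit(self, X, y):
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        Y = (y[:, None] == self.classes_[None, :]).astype(float)  # one-hot encoding
        self.pls.fit(X, Y)
        return self

    def predict(self, X):
        Y_pred = self.pls.predict(X)                      # continuous class scores
        return self.classes_[np.argmax(Y_pred, axis=1)]   # pick the highest-scoring class
```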
Discriminative methods are particularly useful when there is a clear boundary between different data classes and the goal is to find it and exploit it for classification. Traditional machine learning methods, on the other hand, may be more versatile and flexible in different scenarios, but they are not always focused on directly distinguishing classes. Moreover, a prerequisite for their effective application is a sufficient number of data series, enabling efficient analysis and interpretation of the results. As such, discriminative methods are likely to be more effective when only limited datasets are available, because they focus on clear class distinctions, whereas traditional machine learning methods are expected to be flexible and efficient for large, diverse datasets. Exploring this matter through a real example with potential commercial applications offers an interesting perspective from a research standpoint.
2.3.3. UV-Vis Discriminant Analysis
To assess the performance of a classification model in identifying class members and strangers, metrics such as sensitivity (true positive rate) and specificity (true negative rate) are commonly used. Additionally, the numbers of true/false positives and negatives, along with other statistics, provide insight into the model's effectiveness. Since all the methods employed here are supervised, training and test sets have to be defined before carrying out the classification. The training set was selected randomly using different values of k, which represents the fraction of the total number of samples used for training. Consequently, the number of items in the training set equals k×N, where N denotes the total number of samples (here N = 93). The corresponding test sets, containing about (1-k)×N elements never used in training, followed the same probability distribution as the training sets. The results are provided in Table 1, in which accuracy is defined as
Accuracy = (TP + TN) / (TP + TN + FP + FN),
where FP (FN) denotes False Positives (Negatives) and TP (TN) True Positives (Negatives). To compare the performance of the different discrimination methodologies on the same dataset and to assess their robustness and sensitivity to class boundaries, k coefficients ranging from 0.3 to 0.8 were applied. The k coefficient determines the proportion of data allocated to the training and test sets; for example, k = 0.3 means that 30% of the dataset is used for training, while the remaining 70% is used for testing. This procedure allows the evaluation of how each method copes with the varying degrees of class separation or overlap encountered for different k values. By analyzing the resulting classification accuracies, one can determine which methods are more consistent or exhibit superior discrimination under various conditions. Such a comparison is crucial for selecting the most suitable methodology for a specific dataset, especially when class separability varies. The results obtained with the techniques described above on the UV-Vis dataset for different k values are summarized in Table 1; owing to the identical structure of the sample set, this table accurately illustrates the comparative efficacy of the various methodologies.
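A hedged sketch of this evaluation protocol is shown below (Python/scikit-learn, with a random placeholder for the spectra matrix X and the herb labels y; Random Forest is used only as an example classifier, and none of the names or settings are taken from the original code):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Placeholder data: replace with the PQN-normalized spectra and the herb labels
rng = np.random.default_rng(0)
X = rng.normal(size=(93, 159))
y = rng.choice(["basil", "lavender", "oregano", "sage", "thyme"], size=93)

def evaluate(model, X, y, k, seed=0):
    """Train on a random fraction k of the samples and return the test-set accuracy,
    i.e., the fraction of correctly classified test samples."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=k, stratify=y, random_state=seed)
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))

# Accuracies for training fractions k = 0.3 ... 0.8, as compared in Table 1
for k in (0.3, 0.4, 0.5, 0.6, 0.7, 0.8):
    acc = evaluate(RandomForestClassifier(random_state=0), X, y, k)
    print(f"k = {k:.1f}  accuracy = {acc:.2f}")
```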
Among the individual techniques, SIMCA stood out as the most consistent and accurate method, achieving high accuracies across all k values. Its performance culminated in a perfect accuracy of 1.0 at k = 0.8, indicating excellent class discrimination even with varying training set sizes. LDA also showed strong results, particularly at higher k values, with perfect accuracy at k = 0.8, though it exhibited more variability at lower k values, suggesting some sensitivity to the amount of training data.
The other methods displayed a broader range of performance. PLS-DA consistently underperformed, with accuracies of at most 0.71, which may indicate limitations in capturing the complex relationships within the UV-Vis data. SVM and MLP improved with increasing training data, achieving accuracies of 0.88 and 0.97, respectively, at k = 0.8; however, these methods generally required more data to reach high accuracy, reflecting their dependence on large datasets to learn complex patterns effectively. RF also performed well, matching SIMCA and LDA with an accuracy of 1.0 at k = 0.8. This suggests that RF, with its ensemble learning approach, can also effectively handle the classification of herbs from the title family based on UV-Vis data.
Overall, while traditional machine learning methods such as MLP and RF are powerful tools, their performance often improves with larger datasets, and they may struggle with small, nuanced differences between classes. In contrast, class-based methods such as SIMCA are inherently designed to model class-specific features, making them more suitable for this type of problem, where detailed discrimination is critical. This study underscores the effectiveness of class-based approaches in accurately classifying herbs using UV-Vis spectral data. Among the different discriminative techniques, the class-based methods, particularly SIMCA, provided consistently high performance across all k values, highlighting their suitability for tasks requiring precise class distinction.
The best results, obtained with the SIMCA approach, warrant further analysis to better understand how the method works. For this purpose, three distinct training/test divisions were used. Each sample was measured three times, and each training set included a single measurement from every sample, so that elements from all analyzed samples were represented in different combinations to ensure representativeness. Consequently, each training set contained 31 elements, while the corresponding test set contained 62 elements. Training Set 1, Training Set 2, and Training Set 3 were constructed from the first, second, and third measurement of each sample, respectively. This approach resulted in three distinct training sets, each representing approximately one-third of the total data (k = 1/3). The SIMCA results for this division of samples are illustrated in Figure 6; the partial accuracies for the particular herbs are listed in Table S2 in the Supporting Info. Each point on the graph represents a single data sample, with colors indicating the different herbs, as detailed in the figure description. In SIMCA, an observation can be assigned to one class (herb), to multiple classes, or to none (Figure 6c). As explained later in this section, this specific feature may enhance the model's effectiveness.
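A minimal sketch of this replicate-based division is given below (Python; it assumes the 93 spectra are ordered so that each of the 31 samples contributes three consecutive measurements, which is an assumption made only for this illustration):

```python
import numpy as np

# Placeholder data: 31 samples x 3 consecutive replicate measurements = 93 spectra
rng = np.random.default_rng(0)
X = rng.normal(size=(93, 159))
herbs = np.array(["basil", "lavender", "oregano", "sage", "thyme"])
y = np.repeat(rng.choice(herbs, size=31), 3)     # the replicates of a sample share its label

def replicate_split(X, y, replicate, n_replicates=3):
    """Use the given replicate (0, 1, or 2) of every sample for training
    and the remaining measurements for testing (k = 1/3)."""
    train_mask = (np.arange(len(y)) % n_replicates) == replicate
    return X[train_mask], y[train_mask], X[~train_mask], y[~train_mask]

# Training Sets 1-3 use the first, second, and third measurement of each sample
for i in range(3):
    X_train, y_train, X_test, y_test = replicate_split(X, y, i)
    print(f"Training Set {i + 1}: {len(y_train)} spectra for training, {len(y_test)} for testing")
```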
SIMCA operates under the assumption that a principal component (PC) representation of the input data can capture systematic information related to resemblances among specific classes. Prior to constructing the classification model, a PCA is performed separately for each class. The crucial step is the selection of the number of PCA components for each plant, so that the models are trustworthy, being neither overfitted nor underfitted. The results obtained in Section 2.3.1 are valuable for this purpose. Sage has the largest number of samples, while thyme has the smallest, and this difference might raise concerns about the validity of the conclusions. However, the analysis of Figures 5a and 5b reveals a substantial distance between the PCA subspaces corresponding to these two plants, observed in both the UV-Vis and ATR-FTIR analyses. This large distance suggests that even a considerable increase in the thyme population would not decisively affect the separation between the two subspaces and thus would not impair the discrimination process. This hypothesis is supported by the minimal number of thyme components (1 for UV-Vis and 2 for ATR-FTIR) needed to achieve 100% performance for this herb. Given the significant distance between the subspaces corresponding to lavender and thyme, an easy separation of the spectra of these herbs is expected, accompanied by a simultaneous mixing of the others. A quick inspection of Figure 6 and Table S2 confirms this expectation: lavender and thyme are well separated from the rest, with only a small number of PC components needed, specifically one for thyme. The situation becomes more intricate for oregano. Figures 5 and S3 reveal that the subspace spanned by the first three PCA components of oregano significantly intersects with those of basil and sage; therefore, higher-order components must also be considered for a more accurate characterization.
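To make the per-class modeling concrete, a simplified SIMCA-style classifier is sketched below (Python/NumPy/scikit-learn): one PCA model is fitted per herb, and a spectrum is accepted by every class whose residual distance stays below that class's threshold, so it can end up in none, one, or several classes. The residual-distance criterion, the per-class component numbers, and the 95th-percentile threshold are simplifying assumptions for illustration, not the exact settings of the model used in this work.

```python
import numpy as np
from sklearn.decomposition import PCA

class SimpleSIMCA:
    """Simplified SIMCA: one PCA model per class, acceptance by residual distance."""

    def __init__(self, n_components_per_class, quantile=0.95):
        self.n_components = n_components_per_class   # e.g. {"thyme": 1, "sage": 5, ...}
        self.quantile = quantile
        self.models, self.thresholds = {}, {}

    def fit(self, X, y):
        y = np.asarray(y)
        for cls in np.unique(y):
            Xc = X[y == cls]
            pca = PCA(n_components=self.n_components[cls]).fit(Xc)
            # Orthogonal (residual) distance of the training spectra to the class subspace
            resid = np.linalg.norm(Xc - pca.inverse_transform(pca.transform(Xc)), axis=1)
            self.models[cls] = pca
            self.thresholds[cls] = np.quantile(resid, self.quantile)
        return self

    def predict(self, X):
        """Return, for each spectrum, the set of classes that accept it (possibly empty)."""
        assignments = []
        for x in np.asarray(X):
            accepted = set()
            for cls, pca in self.models.items():
                r = np.linalg.norm(x - pca.inverse_transform(pca.transform(x[None, :]))[0])
                if r <= self.thresholds[cls]:
                    accepted.add(cls)
            assignments.append(accepted)
        return assignments
```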
As mentioned above, SIMCA's class-oriented nature can also be used to further improve the performance. In Figure 6a, two oregano samples are erroneously attributed to basil, and in Figures 6b and 6c some are also misclassified as sage. However, 100% accuracy in identifying oregano can be achieved in each case by categorizing the points that 'belong' simultaneously to oregano and another species as oregano. Adhering to this rule ensures 100% oregano classification performance for all training sets, since each such point falls within the oregano subspace. This is not the case for sage. As mentioned earlier, the results for sage exhibit the largest scatter. Indeed, as observed in Figure 5a, using three PCA components may not provide satisfactory results; including two additional components improves the performance, but it is still not as outstanding as for thyme and lavender. For each training set, samples of other plants are also attributed to sage. Specifically, in Figure 6a two points belonging to basil are misclassified as sage; in Figure 6b one oregano sample is also erroneously identified as sage; and in Figure 6c two points belonging to basil and four to oregano are credited as sage. Additionally, one sage sample is assigned to none of the classes in Figure 6c. Consequently, the overall performance is below 100%. For sets 2 (Figure 6b) and 3 (Figure 6c), the overall performance can be raised to 100% by applying the same procedure as for oregano, i.e., classifying samples that simultaneously belong to basil (or oregano) and another plant (here, sage) as basil (or oregano). However, this corrective approach is not feasible for training set 1 (Figure 6a), where two samples of sage are misclassified as basil and two samples of basil are misclassified as sage. Regardless of the training set, some points are ambiguously credited to both basil and sage; 100% accurate identification is possible only for sets 2 and 3, where these ambiguously assigned points can be attributed to basil. It is impossible for set 1, where the ambiguity affects basil and sage symmetrically: two points of each herb are simultaneously assigned to both classes, preventing 100% accuracy. For all training sets, some spectra belonging to oregano and basil are incorrectly classified as sage, which decreases the accuracy of their identification.
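The corrective rule discussed above can be stated compactly; the short sketch below (Python) assumes multi-class assignments in the set form returned by the simplified SIMCA sketch above and merely illustrates the reasoning, not the exact procedure of this work.

```python
def resolve(accepted, priority=("oregano", "basil")):
    """Resolve a SIMCA assignment given as a set of accepted classes.

    Samples accepted by a priority class (oregano, then basil) together with
    another class are attributed to that priority class,
    e.g. {"oregano", "sage"} -> "oregano".
    """
    if len(accepted) == 1:
        return next(iter(accepted))    # unambiguous assignment
    for cls in priority:
        if cls in accepted:
            return cls
    return None                        # no class accepted, or ambiguity not covered by the rule
```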
In conclusion, the SIMCA method proves highly effective in classifying the UV-Vis spectra of herbs from the title family, even for small k values (below 0.4), achieving 100% accuracy for thyme, lavender, and oregano. For basil and sage, the performance consistently exceeds 90%. In practical applications, this signifies the ability to unambiguously extract three spectra from a mixture of five.
2.3.4. ATR-FTIR Discriminant Analysis
Table 2 compares the performance of the methodologies discussed in the previous section for the ATR-FTIR spectra. As for UV-Vis, SIMCA is the top-performing method, achieving perfect accuracy at k = 0.8. This result emphasizes SIMCA's robustness across different spectral domains and showcases its ability to discern subtle chemical variations in the herb spectra. LDA also performed well in the IR range, reaching a peak accuracy of 0.86 at k = 0.8, although it did not match the scores observed for the UV-Vis data. The traditional machine learning methods also exhibited strong performance, with RF achieving an accuracy of 0.95 and MLP reaching 0.92 at k = 0.8. Interestingly, SVM showed a notable improvement in the IR analysis, reaching an accuracy of 0.95, suggesting that SVM may be more sensitive to the spectral features captured in the IR range than to those in the UV-Vis range. PLS-DA, however, remained the least effective, with a maximum accuracy of only 0.81, consistent with its lower performance across both spectral ranges. The consistency of SIMCA's high performance across the UV-Vis and IR spectra underscores its suitability for tasks requiring the differentiation of complex and subtle spectral signatures. The results also highlight the importance of selecting a method tailored to the specific spectral characteristics of the dataset, particularly in applications such as herb classification, where precision is crucial. This comparative analysis further solidifies the advantage of class-based methodologies over traditional ML approaches in handling spectral data for the herbs from the title family, making them the preferred choice for accurate and reliable classification.
As discussed in Section 2.3.1, the PCA subspaces corresponding to the ATR-FTIR spectra of the individual herbs interpenetrate less than those for UV-Vis. In addition, the number of components carrying significant variance information is higher here, suggesting that discriminant models with more dimensions than in the UV-Vis case will be effective. As for UV-Vis, three training and test sets were distinguished; the results are shown in Figure 7 and Table S3 in the Supporting Info. Analogously to UV-Vis, the subspace of the thyme spectra is clearly disjoint from the others, and two components are sufficient for its unambiguous separation. Such good separation is likely due to the small number of thyme spectra; however, an analysis of Figures 6 and S3 reveals a different position of the corresponding subspace.
While more calibration data could complicate the picture, the pronounced differences would still maintain the model's performance. The same holds for basil, whose PCA subspaces are distinctly separated for both types of spectra. For UV-Vis, four or more components were sufficient to correctly classify this plant, which could easily be confused with sage. As for UV-Vis, the sage spectra pose the greatest complexity: a five-dimensional space is required to fully capture their variability, and this number of components was used to construct the SIMCA model while moderating the risk of overfitting and the resulting randomness. The complexity of the sage subspace, as in UV-Vis, leads to two plants (lavender and oregano) being simultaneously categorized as sage, both correctly and incorrectly. However, their membership can be quickly established by acknowledging that points belonging to both sage and lavender are classified as lavender; similarly, points classified as both sage and oregano are recognized as oregano. This straightforward procedure results in 100% accuracy for basil and lavender. Notably, while in UV-Vis basil and oregano overlapped with sage, in ATR-FTIR the basil subspace is well separated and the difficulty concerns lavender instead. However, this does not hamper achieving 100% model performance.