This research is grounded in a methodology that integrates data analysis and computational methods to advance the field of perfume engineering. Our process is comprehensive, aiming to harness the potential of the available data and to support informed decisions in creating and understanding perfumes. The steps involved in our methodology are described in detail as follows:
2.1. Data Collection via Web Scraping and In-Depth Statistical Analysis
At the outset of our methodology, we aimed to curate a database rich in subjective details about commercially available perfumes. Several data sources were evaluated, but we ultimately gravitated toward the forum Parfumo. This choice was influenced by its consistent updates and the substantial volume of information, with over 140 thousand perfumes listed during the development of our web scraping program.
Figure 2. Screenshot of two perfume pages: on the left, an example of a page with all the necessary information; on the right, a page with little information.
The first step of the proposed methodology is the creation of a database of commercially available perfumes. Each perfume entry includes baseline details: name, release year, target gender, and brand. Some entries additionally include ratings on scent, longevity, sillage, bottle design, and cost-effectiveness, and may also have rating counts, user reviews, visuals, community classifications, and fragrance notes, which may be segmented into top, heart, and base notes.
To retrieve data from Parfumo, we propose an adaptive web-scraping tool. Given the inconsistencies in the data available for each perfume, our scraper is designed to be flexible. A preliminary exploration of the forum confirms the consistent presence of name and brand details for every perfume, but other attributes differ. To accommodate this, we crafted a specific Python function for each data attribute, with exception handling ensuring that the scraper manages missing data smoothly; the extracted data are then compiled into a data frame using the Pandas library. These functions employ the Python libraries Requests, Selenium, and BeautifulSoup. While both Requests and Selenium facilitate data extraction from webpages, they differ in efficiency. The Requests library reads a page in milliseconds, making it the primary choice. However, for elements embedded in JavaScript, such as the numerical ratings, which are not accessible to Requests, the proposed method uses the Selenium library, which emulates a browser environment, albeit at a slower pace of about three seconds per page.
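As a minimal sketch of this attribute-wise design, assuming hypothetical CSS selectors and a placeholder URL (the real Parfumo markup differs), the Requests/BeautifulSoup path might look as follows; JavaScript-rendered fields such as the numerical ratings would go through Selenium instead:

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

def get_soup(url: str) -> BeautifulSoup:
    """Fetch a perfume page with Requests (fast path, no JavaScript)."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")

def extract_name(soup):
    # Selector is hypothetical; raises AttributeError when absent.
    return soup.select_one("h1.perfume-name").get_text(strip=True)

def extract_notes(soup):
    # Selector is hypothetical; returns [] when no notes are listed.
    return [tag.get_text(strip=True) for tag in soup.select("span.note")]

def scrape_perfume(url):
    soup = get_soup(url)
    record = {"url": url}
    # One function per attribute; exceptions yield None instead of aborting.
    for field, extractor in {"name": extract_name, "notes": extract_notes}.items():
        try:
            record[field] = extractor(soup)
        except (AttributeError, TypeError):
            record[field] = None
    return record

# Compile the extracted records into a Pandas data frame.
urls = ["https://www.parfumo.com/example-perfume"]  # from the year/brand link list
df = pd.DataFrame([scrape_perfume(u) for u in urls])
```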
Following successful trials on specific forum pages, the scraper is configured to navigate the entire forum, guided by a link list organized by perfume release year and brand. This methodology phase is illustrated in Figure 3.
Once the database is obtained, the information is subjected to statistical analysis to better understand the data it contains; the detailed results of these analyses are presented in the results section of this work. As mentioned, the main objective of this step is to answer pivotal questions that provide a vital comprehension of the data. The following approaches were taken to answer each question:
- Which fragrance notes appear most frequently across perfumes?
To answer this question, it is necessary to count, for every available fragrance note, how many perfumes contain it.
- Is there a correlation between certain aromatic notes and higher consumer ratings?
Here, it is necessary to consider all the perfumes that contain a given fragrance note and calculate the average of their ratings. It is also useful to calculate, for each fragrance note, the sum of the ratings of all perfumes that contain it.
- Are there specific fragrance notes that are commonly paired together?
Firstly, a matrix is created using a technique called one-hot encoding: it contains all the perfumes as rows and all the available fragrance notes as columns, and if a perfume contains a given fragrance note, the value of the cell is one, otherwise it is zero. Next, a co-occurrence matrix of fragrance notes is calculated by multiplying the original matrix with its transpose, as illustrated in the sketch after this list.
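As an illustration of the first and third analyses, the following sketch counts note frequencies and builds the co-occurrence matrix from a one-hot encoded table; the column names and the toy data are assumptions for demonstration only:

```python
import pandas as pd

# Toy data frame; in practice this is the scraped Parfumo database.
df = pd.DataFrame({
    "perfume": ["A", "B", "C"],
    "notes": [["bergamot", "musk"], ["bergamot", "vanilla"], ["musk", "vanilla"]],
})

# One-hot encoding: perfumes as rows, fragrance notes as columns (0/1).
onehot = pd.get_dummies(df["notes"].explode()).groupby(level=0).max().astype(int)

# Q1: number of perfumes containing each note.
note_frequency = onehot.sum().sort_values(ascending=False)

# Q3: note co-occurrence matrix = one-hot matrix multiplied by its transpose.
cooccurrence = onehot.T @ onehot

print(note_frequency)
print(cooccurrence)
```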
Figure 3. Flow diagram representing the steps to extract all the necessary information.
2.2. Database Clustering Using the k-Means Algorithm
The next step was to group all similar perfumes into clusters, depending on which fragrance notes they have. The idea behind this clustering is that if a customer likes a perfume, other perfumes in the same cluster will probably be enjoyable as well, since they have similar aromas but different combinations of fragrance notes; this helps customers buy similar products and helps manufacturers produce a product similar to one that was successful in the past. To that end, three processes were used: one-hot encoding to pre-process the data, an autoencoder to compact the data, and finally a k-means clustering algorithm to group the data.
The k-means algorithm is an iterative algorithm that attempts to partition a dataset into K distinct non-overlapping subgroups (clusters), with each data point belonging to only one group. It attempts to keep intra-cluster data points as close as possible while keeping clusters as far apart as possible. It assigns data points to clusters so that the sum of the squared Euclidean distances between the data points and the cluster's centroid (the arithmetic mean of all the data points in that cluster) is as small as possible. The objective function is given by:
$$J = \sum_{i=1}^{m} \sum_{k=1}^{K} w_{ik} \left\lVert x_i - \mu_k \right\rVert^2$$

where $K$ is the number of clusters, $m$ is the number of data points, $\mu_k$ is the centroid of cluster $k$, and $w_{ik}$ is 1 if data point $x_i$ belongs to cluster $k$ and 0 otherwise. First, the objective function is minimized with respect to $w_{ik}$ keeping $\mu_k$ fixed; next, $J$ is minimized with respect to $\mu_k$ keeping $w_{ik}$ fixed. Each data point is assigned to the cluster with the closest centroid, and the centroids are then recalculated (Sinaga & Yang, 2020). The algorithm is applied in a loop to determine the ideal number of clusters: at each iteration it is run with a sequentially increasing number of clusters, and a decision is made based on the silhouette coefficient. This coefficient measures the goodness of the clusters; its values range from -1 to 1, with 1 being the ideal value, representing a well-defined and contained cluster, as opposed to a scattered cluster, represented by a negative silhouette coefficient (Rousseeuw, 1987).
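As a sketch of this selection loop, assuming scikit-learn's implementations and a random placeholder matrix X standing in for the compressed perfume encodings:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))  # placeholder for the compressed perfume data

best_k, best_score = None, -1.0
for k in range(2, 15):  # sequentially increasing number of clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)  # -1 (scattered) to 1 (well defined)
    if score > best_score:
        best_k, best_score = k, score

print(f"best k = {best_k}, silhouette = {best_score:.3f}")
```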
The data relating to fragrance notes in the data frame are all in string format, unusable for the k-means clustering algorithm, which requires numerical inputs. One-hot encoding solves this problem by transforming the text strings into binary values: if a perfume contains a given note, its value for that note is one; otherwise, it is zero. The k-means clustering algorithm loses accuracy at higher dimensions (here, the number of all available fragrance notes) due to the curse of dimensionality: the algorithm minimizes the squared Euclidean distance between points, but at higher dimensions the Euclidean distance between any two points is nearly the same (Beyer et al., 1999). Therefore, the least common notes were dropped to reduce the dimensions to a manageable number.
To enhance the compactness of the data frame, we suggest two distinct methodologies: Multiple Correspondence Analysis (MCA) and an autoencoder. Both methods yield a data frame with the same number of rows as the original but with fewer columns, whose numerical values, in lieu of binary ones, encapsulate the information.
Initially, MCA is deployed due to its straightforward nature. Nevertheless, its data compression capabilities might be limited. To address this potential shortfall, an autoencoder is incorporated into this stage of the methodology. This ensures a more optimal data compression suitable for the k-means algorithm. An autoencoder, in essence, is an unsupervised artificial neural network. It is tailored to learn efficient data compression and subsequently to reconstruct the data to mirror the original as closely as feasible (Wen & Zhang, 2018).
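The source does not specify the implementation; as a minimal sketch in PyTorch, an autoencoder for the one-hot note matrix could look as follows, with the layer and latent sizes being illustrative assumptions:

```python
import torch
import torch.nn as nn

n_notes = 200   # one-hot note columns after dropping rare notes (assumed)
latent = 16     # size of the compressed representation (assumed)

class Autoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_notes, 64), nn.ReLU(),
            nn.Linear(64, latent),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent, 64), nn.ReLU(),
            nn.Linear(64, n_notes), nn.Sigmoid(),  # binary reconstruction
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()  # suits 0/1 one-hot targets

X = torch.randint(0, 2, (1000, n_notes)).float()  # placeholder one-hot data
for epoch in range(20):
    optimizer.zero_grad()
    loss = loss_fn(model(X), X)  # reconstruct the input as closely as possible
    loss.backward()
    optimizer.step()

compressed = model.encoder(X).detach()  # numerical columns for k-means
```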
Figure 4. Graphic representation of the layers of an autoencoder.
The described k-means clustering algorithm is also applied later to determine the vapor pressure intervals for the top, heart, and base note levels, since quantitative information about which molecules fit at which level is scarce.
2.3. Molecule Generation with GGNN
As highlighted earlier, clustering enables us to identify perfumes similar to a given one. We developed a straightforward program that accepts a perfume's name and a target gender (male, female, or unisex). This program determines the cluster to which the specified perfume belongs and then pinpoints the perfumes within that cluster with the highest weighted scent rating, subsequently revealing their fragrance notes.
The weighted rating is derived by multiplying the scent rating by the number of reviews it has received. To filter out perfumes that might have low quality but high visibility, we set a minimum threshold of 8 for the scent rating. This ensures that we exclude perfumes that, despite being popular, have garnered negative feedback. The fragrance notes returned by this process are poised to be appealing, given that they originate from a perfume that has resonated well with the market.
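A sketch of this selection step in Pandas, under the assumption of illustrative column names (`name`, `cluster`, `gender`, `scent_rating`, `n_ratings`, `notes`) standing in for whatever the real database uses:

```python
import pandas as pd

def top_notes_for(df: pd.DataFrame, perfume: str, gender: str, n: int = 5):
    """Return the fragrance notes of the best-rated perfumes in the
    same cluster as `perfume`, for the requested target gender."""
    cluster = df.loc[df["name"] == perfume, "cluster"].iloc[0]
    candidates = df[(df["cluster"] == cluster) & (df["gender"] == gender)]
    # Exclude popular-but-poor perfumes: scent rating must be at least 8.
    candidates = candidates[candidates["scent_rating"] >= 8].copy()
    # Weighted rating = scent rating multiplied by the number of reviews.
    candidates["weighted"] = candidates["scent_rating"] * candidates["n_ratings"]
    best = candidates.nlargest(n, "weighted")
    return best[["name", "weighted", "notes"]]
```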
To ensure a successful perfume formulation, simply knowing the fragrance notes is insufficient. Understanding the specific ingredients that deliver the desired aroma is crucial. In this context, we suggest employing Graph Neural Networks (GNNs) specifically designed for molecular synthesis. By training a Gated Graph Neural Network (GGNN) on a database of established fragrant molecules, the network becomes equipped to generate novel molecules, drawing inspiration from the characteristics of the original compounds.
To train the GGNN model, it is necessary to have a database of existing molecules with known fragrance notes, so that the model can learn which aspects of a molecule are likely to result in a specific note. Thus, a database of known molecules and their fragrance notes, sourced from The Good Scents Company (2021) webpage, is used to train the GGNN. Although this new database does not contain fragrance notes identical to those extracted from the Parfumo forum, it is possible to map all the fragrance notes from the new database to those from the forum. This database provides the SMILES (Weininger et al., 1989) representation of molecules and other related information. Given that the GGNN requires graphs as generative inputs, functions from the RDKit library, an open-source cheminformatics software, are employed to transform SMILES into graphs, following the encoding rules set by Weininger et al. (1989). This transformation interprets atoms as nodes and bonds as edges, embedding chemical information from the atoms and bonds; these embeddings facilitate understanding the relationships within the graph components. Additionally, the graph structure encompasses other features, such as the adjacency matrix and edge attributes. The adjacency matrix indicates how nodes interconnect, forming a square matrix whose size is given by the number of nodes in the graph, while edge attributes convey the distances between graph edges.
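A brief sketch of this SMILES-to-graph conversion with RDKit; the featurization shown here (atomic numbers, bond orders) is a simplified stand-in for the full embeddings described above:

```python
import numpy as np
from rdkit import Chem

smiles = "CC(=O)OCC1=CC=CC=C1"  # benzyl acetate, a known fragrant molecule

mol = Chem.MolFromSmiles(smiles)

# Nodes: one entry per atom, featurized here by atomic number only.
node_features = np.array([atom.GetAtomicNum() for atom in mol.GetAtoms()])

# Adjacency matrix: square matrix (n_atoms x n_atoms) of node connectivity.
adjacency = Chem.GetAdjacencyMatrix(mol)

# Edges: bonds with a simple attribute (bond order, e.g. 1.0, 1.5, 2.0).
edges = [
    (b.GetBeginAtomIdx(), b.GetEndAtomIdx(), b.GetBondTypeAsDouble())
    for b in mol.GetBonds()
]
```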
Molecules, now represented as graphs, undergo preprocessing to enable the generative network to effectively reconstruct them during generation. Each node and edge is numerically labeled based on its type. A starting node is determined, and molecules are sequentially deconstructed, adhering to the label order. These processed molecules then serve as inputs for the GGNN training.
The GGNN is a refined version of the graph neural network presented by Li et al. (2015). It incorporates gated recurrent units (GRU) in its propagation phase and utilizes back-propagation through time (BPTT) (Zhou et al., 2019). The integration of GRU and BPTT addresses the vanishing/exploding gradient challenge—a situation where modulating model weights becomes challenging due to the diminishing gradient as one navigates deeper into the recurrent neural network. Specifically, the GRU employs update and reset gates to discern and prioritize pertinent data for predictions. This chosen data is directed to the output and refines its understanding from historical data. Meanwhile, BPTT optimizes performance chronologically, adapting the traditional backpropagation used for systems without memory to those nonlinear systems equipped with memory, as exemplified by the GGNN (Campolucci et al., 1996).
The GGNN yields two primary outputs: the graph embedding (g) and the final transformed node feature matrix (HL). These outputs subsequently become inputs for the global readout block (GRB). The GRB, structured as a tiered multi-layered perceptron (MLP) architecture, is a distinct feedforward artificial neural network. It calculates the graph's action probability distribution (APD), a vector comprising probabilities for every conceivable action to evolve a graph. APD samples guide the model in graph creation. The potential actions are threefold: introducing a new node to the graph, linking the most recent node to a pre-existing one, or completing the graph. It is crucial to note that certain actions might be unsuitable for specific graphs, necessitating the model's ability to assign zero probabilities to such invalid actions. The cumulative probabilities of all actions must equal 1, establishing the target vectors that the model strives to learn during training. The following equations describe the operations undertaken by the GGNN:
$$h_v^{0} = x_v$$

$$r_v^{t+1} = \sigma\left( \sum_{u \in N(v)} \frac{1}{c_{vu}} W_{r}^{l_e} h_u^{t} + b_{r} \right)$$

$$z_v^{t+1} = \sigma\left( \sum_{u \in N(v)} \frac{1}{c_{vu}} W_{z}^{l_e} h_u^{t} + b_{z} \right)$$

$$h_v^{t+1} = \left(1 - z_v^{t+1}\right) \odot h_v^{t} + z_v^{t+1} \odot \rho\left( \sum_{u \in N(v)} \frac{1}{c_{vu}} W^{l_e} \left( r_v^{t+1} \odot h_u^{t} \right) + b \right)$$

where $h_v^{t}$ is the node feature vector for node $v$ at GGNN layer $t$, with $h_v^{0}$ equal to its node feature vector in the graph; $r_v^{t}$ and $z_v^{t}$ are the GRU gates at the specific MLP layer $t$, relative to node $v$; $c_{vu}$ are normalization constants; $N(v)$ is the set of neighbor nodes of $v$, and $u$ is a specific node in the graph; $W^{l_e}$ is a trainable weight tensor for the edge label $l_e$; $b$ is a learnable parameter; $\rho$ is a non-linear function; and $\odot$ denotes element-wise multiplication. The functional form of these equations is translated by the following:

$$m_i^{l+1} = \sum_{j \in N(i)} W_{e_{ij}}\, h_j^{l}, \qquad h_i^{l+1} = \mathrm{GRU}\left( m_i^{l+1}, h_i^{l} \right)$$

where $m_i^{l+1}$ and $h_i^{l}$ are the incoming messages and hidden states of node $v_i$, $e_{ij}$ is the edge feature vector between $v_i$ and $v_j$, $l$ is a GNN layer index, and $L$ is the final GNN layer index. The final graph embedding $g$ is given by:

$$g = \sum_{v \in V} \sigma\left( \mathrm{MLP}_{1}\left( h_v^{L}, h_v^{0} \right) \right) \odot \mathrm{MLP}_{2}\left( h_v^{L} \right)$$
The processes undertaken by the global readout block are translated by the following equation. The SOFTMAX function is the activation function of the block; it converts a vector of numbers into a vector of probabilities:

$$\mathrm{SOFTMAX}(x)_{i} = \frac{e^{x_{i}}}{\sum_{j} e^{x_{j}}}$$
The training phase is executed in small batches, and the activation function of the model is the scaled exponential linear unit (SELU), applied after every linear layer in the MLP:

$$\mathrm{SELU}(x) = \lambda x \quad \text{if } x > 0, \qquad \mathrm{SELU}(x) = \lambda \alpha \left( e^{x} - 1 \right) \quad \text{if } x \le 0,$$

where $\lambda \approx 1.0507$ and $\alpha \approx 1.6733$ are fixed constants. The loss in this phase is given by the Kullback-Leibler divergence (Kullback & Leibler, 1951) between the target APD and the predicted APD. Additionally, the model uses the Adam optimizer in several stages; developed by Kingma & Ba (2014), it is a straightforward first-order gradient-based optimization algorithm.
During the training phase, graph samples are taken at consistent intervals for evaluation. The metric of choice for this evaluation is the uniformity-completeness Jensen-Shannon divergence (UC-JSD), as introduced by Arús-Pous et al. (2019). UC-JSD serves as a measure of similarity for the distributions of the negative log-likelihood (NLL) per action sampled. Ideally, values should approach zero.
The culminating phase is dedicated to graph generation. Here, the APD, which was formed in the GRB, is sampled to construct the graphs. A graph persists in its growth either until the 'terminate' action is chosen from the APD or until an invalid action transpires. Invalid actions include connecting a new node to a non-existent node (unless the graph is empty), linking an already connected pair, or appending a node to a graph that has already reached its node capacity, as determined during preprocessing. It is worth noting that hydrogens are excluded during both the training and generation stages; they are later incorporated using RDKit functions, depending on the valency of each atom.
Figure 5. Schematic representation of the methodology of the GGNN platform.
2.4. Molecule Generation for Desired Perfume Profiles and Assessment of Vapor Pressure
The molecules generated previously were all fragrant, since the GGNN model was trained on a database containing only fragrant molecules. However, the output of the model carries no information on the fragrance notes of each molecule. Since the objective of this work is to present ingredients that can be used to formulate a perfume, it is necessary to generate only molecules that correspond to the desired fragrance notes. For this purpose, a technique called transfer learning was used.
The established definition of transfer learning was presented by Xue et al. (2019): it is a machine learning technique applied to solve problems with a lack of data by transferring knowledge from another related problem or data set. In this case, the knowledge obtained by the GGNN platform during training with all the molecules in the database was transferred to another model trained only with molecules known to have the desired fragrance note. The model could not be trained a priori with the known molecules alone, since there are not enough molecules per fragrance note in the database to generate accurate results. Instead, the already trained model was trained again with a new set of molecules. This process was performed in a loop to generate molecules for all fragrance notes available in the database, as sketched below.
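Schematically, and with hypothetical helper names (`load_pretrained`, `fine_tune`, `generate`) standing in for the GGNN platform's actual training and sampling routines, the loop reads:

```python
# Schematic transfer-learning loop; the helper names below are hypothetical
# stand-ins for the GGNN platform's actual routines.
def transfer_learning_loop(molecules_by_note):
    generated = {}
    for note, note_molecules in molecules_by_note.items():
        # Start from the model already trained on the full fragrant database...
        model = load_pretrained("ggnn_all_fragrant_molecules.ckpt")
        # ...then continue training only on molecules known to carry this note,
        # since each note alone has too few molecules to train from scratch.
        fine_tune(model, note_molecules, epochs=50)
        # Sample new candidate molecules biased toward the desired note.
        generated[note] = generate(model, n_samples=100)
    return generated
```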
Figure 6. Schematic representation of the transfer learning process.
Finally, the thermo library, an open-source Python package, was used to look up the vapor pressure of all generated molecules. The library queries thermodynamic databases to check whether the vapor pressure of a generated molecule is known; if not, it returns an error. To define the type of note of each generated molecule, the k-means clustering algorithm was also used to define vapor pressure intervals for the top, heart, and base levels, since quantitative information about which molecules fit at which level is scarce. A sample of 1,000 fragrant molecules was scraped from The Good Scents Company webpage, with information about each molecule's duration (how long the smell lasts) and vapor pressure in mmHg at 25 °C. The algorithm was executed to calculate three clusters, and the centroid of each cluster, a unique vapor pressure value, was used to define the intervals, which are presented at 95% confidence. At 25 °C, molecules with a vapor pressure between 0 and 0.0183 mmHg fall into the base category; molecules between 0.0183 and 0.0833 mmHg are classified as heart notes; and molecules with a vapor pressure above 0.0833 mmHg can be used as top notes. Thus, all the information necessary for the proposed methodology is obtained.
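A minimal sketch of this final classification using the thermo library, with the centroid-derived thresholds reported above (mmHg at 25 °C); the example molecule names are illustrative, and the identifier handling reflects an assumption about thermo's lookup:

```python
from thermo import Chemical

# Thresholds from the k-means centroids above (mmHg at 25 °C).
HEART_MIN, TOP_MIN = 0.0183, 0.0833
PA_PER_MMHG = 133.322

def note_level(identifier: str):
    """Classify a molecule as base/heart/top by vapor pressure at 25 °C.
    Returns None when thermo has no data for the molecule."""
    try:
        # Chemical accepts common identifiers (e.g. name, CAS); Psat is in Pa.
        vp_mmhg = Chemical(identifier, T=298.15).Psat / PA_PER_MMHG
    except Exception:
        return None  # vapor pressure unknown in the consulted databases
    if vp_mmhg < HEART_MIN:
        return "base"
    return "heart" if vp_mmhg < TOP_MIN else "top"

for mol in ["limonene", "vanillin", "benzyl acetate"]:  # illustrative examples
    print(mol, note_level(mol))
```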