Preprint Article Version 1 This version is not peer-reviewed

A Brief Survey of ML Methods Predicting Molecular Solubility: Towards Lighter Models via Attention and Hyperparameter Optimization

Version 1: Received: 10 September 2024 / Approved: 11 September 2024 / Online: 11 September 2024 (10:35:31 CEST)

How to cite: Lew, A. A Brief Survey of ML Methods Predicting Molecular Solubility: Towards Lighter Models via Attention and Hyperparameter Optimization. Preprints 2024, 2024090849. https://doi.org/10.20944/preprints202409.0849.v1

Abstract

Traditional chemical research often relies on trial-and-error synthesis and characterization. Modern machine learning (ML) now offers data-driven approaches for predicting properties, such as water solubility, directly from chemical structure. However, with many molecular representation schemes and model types to choose from, it can be difficult for non-experts to determine best practices for applying ML. To clarify this landscape of choices, this study uses the ESOL molecular solubility dataset to compare the performance of a selection of models on different data representations. First, we compare three classical regression methods (linear, ridge, LASSO) on three common data representations (RDKit fingerprint, Morgan fingerprint, intuition-selected molecular features). Then, we demonstrate how two distinct deep learning approaches (multilayer perceptron, graph convolution) can achieve accurate predictions even when prior intuition about feature-property correlations is absent. Finally, we outline a modern attention-based approach, inspired by successes in language modeling and tuned via Bayesian hyperparameter optimization, that yields a prediction methodology more general and performant than the previous approaches. This attention-based approach operates directly on the common SMILES string molecular representation, requiring fewer model parameters than the other deep learning approaches and no preprocessing into a fingerprint or a vector of intuitively selected features. In short, when specific molecular features are known to correlate with the property of interest, good predictive modeling may be achievable without resorting to large deep learning models. When such features are not known a priori, graph approaches are a common solution, and we further demonstrate that a hyperparameter-optimized attention approach can perform even better.
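
To illustrate what operating directly on a SMILES string can involve, the following minimal sketch converts SMILES into fixed-length integer token sequences of the kind an attention model typically consumes. The token pattern, the function names (`tokenize`, `build_vocab`, `encode`), the padding convention, and the example molecules are assumptions made for illustration; the abstract does not specify the tokenization used in the study.

```python
import re

# Assumed token pattern (not taken from the paper): bracket atoms such as
# [NH3+], the two-letter halogens Br and Cl, then any single character.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|.")


def tokenize(smiles):
    """Split a SMILES string into a list of tokens."""
    return SMILES_TOKEN.findall(smiles)


def build_vocab(smiles_list):
    """Map each distinct token to an integer id; id 0 is reserved for padding."""
    tokens = sorted({tok for s in smiles_list for tok in tokenize(s)})
    return {tok: i + 1 for i, tok in enumerate(tokens)}


def encode(smiles, vocab, max_len):
    """Convert a SMILES string to token ids, right-padded with 0 to max_len."""
    ids = [vocab[tok] for tok in tokenize(smiles)]
    if len(ids) > max_len:
        raise ValueError("SMILES has more tokens than max_len")
    return ids + [0] * (max_len - len(ids))


# Hypothetical example molecules: ethanol, chlorobenzene, methylammonium.
molecules = ["CCO", "c1ccccc1Cl", "C[NH3+]"]
vocab = build_vocab(molecules)
encoded = [encode(s, vocab, max_len=12) for s in molecules]
```

Each row of `encoded` has the same length, so the rows can be stacked into a batch and passed to an embedding layer. No fingerprint or hand-selected feature vector is computed at any point, which is the property the abstract highlights.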

Keywords

cheminformatics; water solubility; transformer network; Bayesian optimization

Subject

Chemistry and Materials Science, Materials Science and Technology
