Exploring new ways to represent and discover organic molecules is critical for developing novel therapies. With recent advances in bioinformatics, virtual screening of databases is possible. However, biochemical data must be encoded using computer algorithms to make them machine-readable, taking into account distance and similarity measures to support tasks such as similarity searching. Motivated by the ubiquity of the carbon element and the structured patterns that emerge, we propose a parametric approach to molecular encodings of carbon-based multilevel atomic neighborhoods. It implements a walk along the carbon chain of an organic molecule to compute different representations of its feature encoding in the form of a binary or numerical array that can be exported later into an image. Resulting encodings are reproducible and readily formatted for various domain tasks including machine learning tasks. This approach was evaluated using a 10-fold stratified cross validation for binary classification with eight data sets and six different encodings (384 models) in the domain knowledge of cell-penetrating peptides. The parametric approach is built on open-source software and is implemented as a Python package (cmangoes). Source code and documentation are available at https://github.com/ghattab/cmangoes.
Keywords:
Subject: Computer Science and Mathematics - Artificial Intelligence and Machine Learning
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permit the free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.