2.1. QSPR model validation
As a first approach, regression models were built using genetic algorithms to select the most appropriate descriptors. After selection of the descriptors, multiple linear regression analysis was performed to generate suitable models that could allow us to categorize the biological activity of the dataset. The best QSPR model for antibacterial activity against
A. baumannii consists of fifteen descriptors as follows:
All statistical parameters were obtained as their average values (see
Table S3), for example, the square correlation coefficient (
) of 70.278(±0.907), and the
ADJ of 69.162(±0.973). The Fischer F and the standard deviation (s) are 62.978(±8.418) and 0.462(±0.000) respectively, indicating that our model is acceptable. Also, redundancy and overfitting rules were checked to determine the nature of the descriptors used in the model. In this sense, the overfitting rule,
= -0.008 (-0.054), was approved fairly while the redundancy rule,
= 0.015 (0.100), indicated that some descriptors, nHDon and nHBonds, are correlated to the dependent variable. However, these descriptors cannot be removed as they are important for the correct description of our regression model. Furthermore, the prediction ability of the model was validated by the leave-many-out cross-validation,
LMO = 67.886(±1.043), a value indicating that the regression model has good predictive power. The robustness parameter was indicated by the high value of
BOOT = 66.882(±1.104) based on bootstrapping, which was repeated 5000 times.
An external validation was essential as a high
LMO only indicates a good internal validation, but it does not show a high prediction capability of the created model. Therefore, for the external validation procedure, 70% of all the molecules in the dataset were randomly selected for the training process, and the remaining 30% were used as the test set. This process was repeated six times; their plots are shown in
Figure 1, with their upper/lower confidence intervals at a 95% confidence level. The Y-scrambling test was used on the training-test set, giving the new values of
= 0.014(±0.000) and
= -0.068(±0.000). These new values were lower than the original ones, confirming that our model is reliable. With the same purpose, the Asymptotic
rule,
= 0.000 (-0.005), was employed. Therefore, the model in (1) passed all the statistical tests proposed by Roy
et al. [
45,
46,
47], as an average value derived from ten experiments shows: (a)
67.88(±1.043); (b)
0.679(±0.000); (c)
0.001(±0.000); (d)
0.999(±0.000) (or
0.991(±0.000)); (e)
0.127(±0.000). For an acceptable prediction, the value of
should preferably be less than 0.2, while
should be greater than 0.5. In our model,
presents a value of 0.229(±0.000), while
has an average value of 0.552(±0.000). A complete list for each evaluation can be seen in
Table S4.
The applicability domain is graphically depicted by the Williams plot in
Figure 2. For each compound, the leverage values can be calculated and, by plotting these values against the standardized residuals, it is possible to establish the applicability domain of the developed model [
48]. This allows detection of molecules that our model cannot predict adequately, thus considered as outliers [
49,
50], molecules with distinctive structures (high leverage outliers,
), or those associated to the response (predicted residuals > 3*SDEC). All compounds that are outside the limits established by the leverage warning and three times the standard deviation in error calculation are outliers.
As seen from the Williams plot, outliers are correlated to the structure of molecules. Due to the relative wide variety of molecular structures used in our model, detected outliers both from the training and test sets are very different (
Figure 3).
In compound
1, although it shares a similar structure with those of the Batzelladine alkaloids’ family used in this model [
51], the two cyclic ether-like motifs at the central positively charged nitrogen core, as well as two pendant primary aliphatic amine arms, are distinctively different from the rest of the analyzed molecules. Compounds
2 and
5 to
13 are considered as outliers because of the many positively charged nitrogen atoms present at the molecules. Two fluoroquinolone derivatives are present as outliers: compound
3 possesses a 3,5-difluoro-substituted pyridine instead of the common cyclopropyl or ethyl groups at nitrogen while compound
4 has a pyridine-type structure at the core as in nalidixic acid. These two features are unique among the set of fluoroquinolones used in our model. Compounds
14 and
15, being both aminoglycosides, are seen as outliers from our model as it is suggested that amino groups are responsible for this distinction. Compound
16 is a flavanone-7-O-glycoside. Although there are many flavanones in the dataset, none of them present a disaccharide (or any mono- or polysaccharide), which makes
16 unique. On the other hand, many examples of substituted triazoles are seen in our model, but molecule
17 has a benzotriazole which unique, thus, it is considered as an outlier. Even though there are many compounds with aromatic alcohols,
18 (gallic acid) possess a benzenetriol motif which is not encountered in any other molecule. Structure
19 has the hydantoin functional group, which is unique among the set of active molecules against
A. baumannii.
To test the reliability of our QSPR model, molecules which were not introduced in our initial dataset were employed as an external validation set to obtain their predicted
pMIC values. Three sets of compounds were used as follows: a) the first set from molecules reported from Matsingos
et al. [
52]; b) compounds reported by Singh [
53], Wang [
54], and Zhou [
55] as the second set, and finally, c) chemical structures described by Lyons and collaborators [
56]. For the three sets of data there is a good correlation between experimental and predicted
pMIC values, with
values of 72.89, 71.64 and 71.56 respectively. On the other hand, compounds that exhibit, for example, positively charged nitrogen atoms like those reported by Vereshchagin and co-workers [
57], are not well predicted by our model in accordance with the results of the outliers analyzed previously.
Our model applied to the first set of linezolid analogues with different C5-acylamino substituents gives an insight into their structural features. An increase in the
pMIC values is seen when moving from small-chain alkyl groups to cyclic non-aromatic and finally to aromatic substituents. This increase is shown in
Figure 4.
The second set of compounds comprises three different groups of molecules for which our model classifies first the divin derivatives, moving into pyrazinoindole analogues, and finally with the subset of 2-aminothiazole sulfanilamide oximes, as seen in
Figure 5.
The last set of compounds comprises several oxazolidinone derivatives in
Figure 6. The first molecules are classified in accordance with the structure of the 1,5-naphthyridin-2(1
H)-one, while the last ones have a 1,8-naphthyridin-2(1
H)-one. Molecules at the center possess the nitrogen atom at different positions of the quinolin-2(1
H)-one core.
2.2. QSPR interpretation
The understanding of the descriptors presented by the QSPR model allows us to gain some insights into chemical features of the molecules used in the model that are relevant for their antibacterial activity towards A. baumannii. Equation (1) displays two topological descriptors (D/Dr06 and TI2), one 2D-autocorrelation (GATS6m), six functional group counts (nArCOOH, nRCONH2, nROR, nImidazoles, nHDon and nHBonds), and six atom-centred fragments (C-018, C-029, C-032, H-051, N-075 and N-079), all of them being 2D-dimensional descriptors.
The first descriptor in the model is D/Dr06, a topological descriptor. Distance/detour ring indices (D/Dr
k) are calculated by summing up distance/detour quotient matrix row sums of vertices belonging to single rings in the molecule. These descriptors can be considered special substructure descriptors reflecting local geometrical environments in complex cyclic systems [
58]. D/Dr06 displays a positive coefficient value, indicating that the presence of this descriptor enhances the activity of the molecule. This descriptor appears when a 6-membered cyclic structure is present in the molecule. From the set of compounds, most of the cyclic structures belongs to benzene type rings (both carbocyclic and heterocyclic). D/Dr06 has been used in similar way for the description of the anticancer activity of aromatic molecules [
59]. The highest D/Dr06 value belongs to compound
2 where two adamantyl moieties are present in the molecule. Values of zero correspond to molecules which do not display any 6-membered cyclic system, such as compounds
22 to
25, seen from
Figure 7. Furthermore, molecules that display high values of D/Dr06 also show high
pMIC values.
The second Mohar index [
60], (TI2), is calculated from the eigenvalues of the Laplacian matrix as shown:
Where the
is the number of non-H atoms and
is the first non-zero eigenvalue. TI2 is a topological descriptor and belongs to the Mohar indices that are related to solubility of compounds. In general, it is associated with the size, shape, symmetry, as well with branching or cyclicity of the molecule. TI2 shows a positive coefficient value, indicating that by increasing the value of the descriptor, the expected
pMIC values will also increase. This descriptor has been used in the explanation of the activity of diaryl urea derivatives [
61] and in the QSAR analysis of aminomethyl-piperidones [
62].
The GATS6m [
63,
64] descriptor belongs to the 2D autocorrelation indices where the Geary coefficient is a distance-type function that can be any physicochemical property (
w), calculated for each atom, such as atomic mass, polarizability, or volume, among others, and is represented by (3). By summing the products of a certain property of two atoms located at a certain distance or spatial lag (
k), a spatial autocorrelation can be obtained.
Where A is the number of non-hydrogen atoms,
is the average of the
atomic property value,
is the Kronecker delta, and
is the number of vertex pairs at distance equal to
k. GATS6m is the mean Geary autocorrelation of lag 6/weighted by atomic mass, which means that this descriptor considers the atomic mass of any atom in the structure at different path lengths (lag) of 6. Strong spatial autocorrelation between pair of atoms produces low values of this index. Also, symmetric, or low-branched structures as well as molecules with low number of heteroatoms (atoms besides C and H) are expected to produce low to zero values. The GATS6m descriptor displays a negative coefficient in (1), which indicates that by increasing the autocorrelation between pairs of atoms considering their atomic masses at a distance of 6 between them, the value of this descriptor will increase, causing a reduction in its
pMIC value. As seen from
Figure 8a, there is a homogeneous distribution of the data when plotting the GATS6m descriptor against the corresponding
pMIC values. Eight molecules from the dataset have a zero value of GATS6m; their structures are displayed in
Figure 8b. Furthermore, these molecules are seen to have medium-interval of
pMIC (between 3.5 to 5) relative to their location in the scatterplot. In
Figure 8c, for the molecule with the highest GATS6m-value, selected pathways are shown for which the sum of their atomic masses produces the final value.
The next six descriptors belong to the functional-group counts (FGC), which are considered as indicator variables. Their value will depend on the number of functional groups present or absent from the molecule, meaning that not all compounds will feature them. The FGC have been used to identify structural features that are important for a property of particular interest. Therefore, their presence or absence can significantly alter the predicted activity in the model. Each FGC descriptor can be easily understood in terms of the nature of functional groups. For example, nArCOOH, nRCONH2, nROR, and nImidazoles accounts for the number of aromatic carboxylic acids, the number of aliphatic primary amides, the number of aliphatic ethers and the number of imidazole moieties, respectively (
Figure 9).
The nHDon indicates the number of hydrogen donor atoms (-NH
2 and -OH) for which the formation of hydrogen bonds is possible; in the same manner, nHBonds accounts for the number of intramolecular hydrogen bonds that are possible when there are acceptor atoms like N, O, or F, as shown in
Figure 10. Intramolecular hydrogen bonds are crucial for the biological activity of many compounds. It is well stablished that intramolecular hydrogen bond formation can lead to temporarily closed ring systems which are more lipophilic in nature, while open forms are exposed to solvent, lending more hydrophilic character to the molecule [
65]. For example, small hydrophilic molecules, such as β-lactams, use the pore-forming porins to enter cytoplasm/periplasm [
66], while hydrophobic drugs like macrolides diffuse across the lipid bilayer [
67]. In our model, the nHDon descriptor displays a positive value, indicating that a high number of donor atoms leads to an increase in the biological activity. However, as the nHBonds descriptor possesses a negative coefficient, it indicates that as the number of intramolecular hydrogen bonds increases, in part due to a high number of hydrogen donor atoms, the biological activity decreases, which is correlated to a more lipophilic nature of the molecules.
Six atom-centred fragment (ACF) descriptors are present. ACF descriptors are based on structural fragments which contain the information of the central atom and their bonding neighbors [
68,
69,
70]. Each ACF is defined by the type of bonding, as well as the number and nature of the neighbors bounded to the centered atom. For example, C018 (=CHX) corresponds to a
sp2 C atom which is single-bonded to a hydrogen and to any electronegative atom (such as N, O, S,
etc). The C029 (R--CX--X) descriptor, for which the “--" represents an aromatic bond (e.g., benzene) or delocalized bonds (as in the N-O bond in a NO
2 group), corresponds to a central
sp2 C atom that is single-bonded to an electronegative X atom, and also both double-bonded to a X atom and a R group, in which their bonds are delocalized. The C032 (X--CX--X) descriptor behaves in a similar fashion to C029, but instead of a R group it is replaced by a third X atom. This descriptor has been used also for the analysis of chemical features essential for anticoronaviral activity [
71]. The H051 descriptor stands for the environment in which a hydrogen atom is bonded. It is defined as a hydrogen that is attached to an alpha-C atom; an alpha-C may be defined as a carbon connected through a single bond with -C=X (double bond), -C≡X (triple bond), or -C--X (aromatic bond), where X represents any electronegative atom, like in the case of alpha-hydrogens in carbonyl compounds. This descriptor has been used to explain the activity of a series of molecules containing nitroaromatics motifs as radiosensitizers [
72]. The next two descriptors, N075 and N079, are nitrogen based structural fragments. The first one is defined as a central
sp2 N atom which is bonded to two R groups or to one R and X groups (R--N--R or R--N--X), like in pyridine type motifs. This descriptor is particularly important as many molecules in our set present these kinds of motifs. The second descriptor is related to any nitrogen atom which bears a positive charge. Representative examples for each of the ACF descriptors are presented in
Figure 11.
From a general view, descriptors in Equation (1) can be classified into global and indicator variables. Global terms like GATS6m and TI2 are present in the molecule and give information of the whole structure, while indicator variables only appear if the molecular structure contains the motif. Furthermore, descriptors can be associated to the steric and electronic properties of the molecule (D/Dr06, GATS6m, and nImidazoles, as well as the six-ACF descriptors), while others are more related to solubility of compounds, like in the case of nHDon, nHBonds, TI2, as well as functional groups like nArCOOH, nRCONH2, and nROR. Electronic parameters can be associated to atom-centred fragments which indicates the distribution of substituents around a specific atom. As many molecules include within their structure specific ACF moieties, their inclusion will lead to an increase or decrease in the predicted pMIC value. For example, the three ACF based on central carbon (C-018, C-029 and C-032) are positive in their signs indicating that their presence enhance bioactivity. Furthermore, as they are carbon ACF descriptors, they can be associated with core-structure features. However, H051, N075, and N079 ACF descriptors lead to a decrease in the activity. H-051 recall hydrogen atoms which are reactive and hence, they are prone to be abstracted by the use of bases. Nitrogen atoms like those described by the N075 descriptor are good hydrogen bond acceptors, leading to the generation of inter- and intramolecular interactions by the use of their lone pairs of electrons, which decreases the solubility of molecules, as stated by the nHBonds descriptor. On the other hand, molecules which are well solvated in aqueous media are expected to be high in pMIC values.
Figure 12 shows the percentage of distribution of the descriptors for molecules in the dataset. 95.9% of the molecules (568) have the nHDon functional group and almost all other descriptors fall within this category. The second major descriptor that appears in the dataset is nHBonds with 53.9% of the molecules (319), followed by the N075 descriptor is in 278 molecules of the subset (47%). Considering the high number of bioactive compounds which includes a pyridine-fused or pyridine-containing heterocycles, as well as their tendency to participate in hydrogen bonding, the presence of this descriptor in great percentage is important to account the description of the activity of molecules [
73,
74].
Some molecules are observed to be outside the boundaries of the nHDon/nHBonds descriptors, which agrees with the presence of compounds without donor groups such as hydroxyls (-OH), or amines (-NH2), like in natural products. The rest of the molecules are located into these major categories, which can be seen adequately in the Venn diagram [
75,
76,
77,
78,
79,
80] in
Figure 13. It is also seen that eight molecules lack of the rest of molecular descriptors used in the model. Thus, they are depicted outside the Venn diagram as a sole group.
2.3. Virtual screening using BIOFACQUIM dataset
Once we fully validated our model for antibacterial activity against
A. baumannii, we proceeded to search for new molecular candidates in an online database of molecular compounds. BIOFACQUIM [
81] is a Mexican natural product database which comprises 528 compounds isolated from many plants and other organisms from Mexico. After careful curation of the database and calculation of their descriptors (
Table S5), we performed the analysis of the molecules using our QSPR model. The predicted
pMIC values from molecules of the database range between 1.65 and 11.24.
Table 1 shows these values for the most active molecules, suggested by our model, and depicted for some structures in
Figure 14. As stated in Equation 1, a high value of the calculated
pMIC implies a small concentration of the compound, which correlates to an increase in its potency. In this sense, desirable molecules should exhibit high
pMIC values.
Table 1 shows that molecules with the highest predicted
pMIC values are compounds
32 to
35 and
53, which were isolated from several plants of the genus
Ipomoea [
82,
83]. Their molecular structures contain several functional groups that contribute to their predicted activity. Three important features are observed: 1) all of them have a high number of pyranose-like rings, which may contribute to their hydrophilicity properties; 2) most of them contain large aliphatic side chains and/or macrocyclic lactone rings, which may contribute to their lipophilicity; 3) all of them present at least one terminal ester group which may be prone to cleavage by hydrolysis in aqueous media. Molecules
59 to
62 and
64 also exhibit terminal carboxylic acid fragments.
Analyzing these characteristics in our model we can obtain some insights regarding the structural information that correlates to the predicted values. For example, all the molecules exhibit a great number of aliphatic ether groups and, according to our model in Equation 12, as the number of aliphatic ether motifs (nROR) increases, the greater their activity will be. This is highly correlated to the large number of donor atoms (oxygens) and therefore, as the number of nHDon increases, so does the predicted bioactivity. Nonetheless, a great number of donor atoms also increases the possible number of intramolecular hydrogen bonds (nHBonds) which, according to our model, diminishes the predicted values. Another descriptor that appears to affect the predicted values is H-051 that implies the presence of hydrogens attached to alpha-carbon atoms, known as alpha-hydrogens (α-H). As the number of α-H increases, the bioactivity tends to decrease. In those molecules that are predicted with the highest
pMIC values, ester and carboxylic acid groups appear in great numbers, suggesting that this kind of functional groups are not adequate for their pharmacokinetic profile, as all of them exhibit α-H. Another feature is the presence of a high number of pyranose-like rings which are 6-membered rings, thus, the high D/Dr06 value displayed. Furthermore, because of their structure, these molecules are highly branched which is seen in their high TI2 values. GATS6m is complex in nature but correlates well with the molecules under analysis. As the average number of possible 6-pathways for which heavy atoms can be included, there is a decrease in the predicted bioactivity. There are many known bioactive compounds for which their molecular masses are substantially high, for example the macrolides and some other natural products like digitoxin [
21,
22,
84,
85], thus violating one of the Lipinski’s rules used for the evaluation of possible new drugs [
86]. Molecules
31,
48 and
53 present relative medium high GATS6m values, hence high molecular mass; however, our model predicts elevated
pMIC for these compounds. This can suggest that there could be a limit in the mass of the molecule and the number of oxygen atoms or any other heavy element that will cause molecules to be less active.
2.5. Glycoside SAR analysis
As stated before, the increasing number of multidrug-resistant bacteria represents an important risk to human health worldwide. Although
A. baumannii represents a serious treat, the search for wide-spectrum antibiotics for the treatment of infections caused by several of the ESKAPE pathogens is crucial. To determine if the proposed molecules display antibacterial activity towards this bacterial critical group, the corresponding bioassays were tested using clinical isolates which are metallo-β-lactamase producers and resistant to betalactam antibiotics (
Table 3).
From the results, important features arise from the molecular structures of the glycosides. First, molecular structures of compounds
54,
55,
56 and
58 contain the same tetrasaccharide core which is connected by a macrolactone ring. From
54 to
55, removal of one carbon atom, from central 2-methylbutyrate to 2-methylpropionate, increases the activity of the glycoside, being active not only to
K. pneumonia but also now to
P. aeruginosa and
S. aureus. In compound
56, reinsertion of the carbon atom but with the addition of a hydroxyl group at position three of the 2-methylbutyrate group reinforces the activity spectrum by being active to
A. baumannii, as seen in
Figure 16. However, removal of the hydroxyl group of the central and outer 2-methylbutyrate groups and addition of one carbon atom of the macrolactone ring (from ten atoms to eleven) causes molecule
58 to lose wide spectrum activity and to be only active against
K. pneumoniae, this suggest that hydroxyl groups, located in specific regions of this molecular core, enhance the bioactivity of this set of glycosides.
Compounds 60 to 62 are the smallest compounds. They share in common a terminal carboxylic acid alongside a pyranose-ring. Although 60 has wide antibacterial activity against multidrug resistant bacteria, 61 and 62 only display activity against A. baumannii. This important loss of activity may be attributed to the removal of the aliphatic chain connecting the pyranose ring and the terminal carboxylic acid, being replaced by a more rigid phenyl core. Close inspection of compound 63 reveals the structure of compound 60 within it forming an ester bond at the terminal carboxylic acid group. This feature could explain the retained activity against A. baumannii. Similar to this, molecules 54 to 56 share common structural features, like at the macrolactone ring with the same set of atoms, the lack of hydroxyl groups at the outer methylbutyrates may affect the expected activity.
An insight into the chemical structures of
32 to
35 and
53, the most potent molecules according to model in Equation 12, reveals that the core of
60 is present (
Figure 17). Furthermore, the macrolactone ring alongside the chiral carbon is also a common feature, with the cycle formed of ten or eleven methylene groups as in
54 to
58. This could suggest that molecules of the BIOFACQUIM database would also exert antibacterial activity towards
A. baumannii and other resistant bacteria.
One of our remaining questions is which action mechanisms can exert these molecules? In order to propose one, we constructed a simplified version of the Venn diagram in which is possible to observe the correlation between the H-051 and the nROR descriptors seen in the isolated molecules. The purpose of this diagram in
Figure 18 is to identify molecules with known action mechanism and with structural similarity (same molecular descriptors) to our compounds. Furthermore, other type of compounds used are also part of the inner set of molecules. These compounds have different structural motifs when compared to compounds
54 to
68 and, they present different mechanisms of action
From a structural point of view, compounds
56 and
63 resemble those of the macrolide antibiotics [
85]. Examples of macrolides are Erythromycin A, oleandomycin, josamycin, and spiramycin, isolated from different microorganisms, as well as many semisynthetic derivatives like clarithromycin, flurithromycin, and other unique compounds like azithromycin. Also, the latest new members, the ketolides and fluoroketolides are also structurally related to the macrolide family. As stated above, when comparing the new molecules with macrolides, several features are shared (
Figure 19). Macrolides are well characterized by the presence of 14- to 16-membered macrocyclic lactone ring to which one or more deoxy sugars are attached. In the case of compounds like
56 and
63, the macrolactone ring is shown connecting two or three sugar-type rings. Furthermore, because of the relative high number of carbonyl motifs in macrolides, α-H are also present in great numbers. This is also true for many compounds from
54 to
58 and
65 to
68, where the ester group is observed. Moreover, a great number of aliphatic ether groups, and a great number of oxygen atoms present at the hydroxyl groups and other motifs are also features which are in common. Macrolides are potential bacteriostatic compounds, for which one mechanism of action relies on binding to the P site on the 50S subunit of the bacterial ribosome. Because of this, we can suggest that compounds
56 and
64, among others, could exhibit a similar action on bacteria, thus acting as protein synthesis inhibitors
Finally, compound
60 as a small molecule can be considered as a lead compound for which specific chemical transformations could improve its efficacy. Moreover, molecules isolated from
Ipomoea also share within their structures a deoxy-sugar moiety that could be relevant for their activity. By close inspection of the fragment, we search for molecules in the ChEMBL database for bioactive compounds which incorporate the deoxy sugar in their structures. A wide variety of molecules possesses the motif, from anticancer to anti allergenic [
90,
91,
92,
93,
94,
95,
96,
97,
98]. Chemical structures for these compounds can be seen in
Figure S3.
In summary, the model was validated statistically by internal and external parameters, showing good predictive power. This was demonstrated using the model, first applied to the BIOFACQUIM natural products’ database in the search of potential candidates and finally, by exploring the properties of isolated natural products from Ipomoea sp. We observed wide antibacterial spectra activity of compounds 56, 60, and 63 against several isolated bacterial strains, which agrees with the properties calculated by the model.