Replication data for: Towards the improvement of thermodynamic solubility prediction – a review

Evaluating thermodynamic solubility is crucial to designing successful drug candidates, yet predicting it with in silico approaches remains a challenge. Machine learning methods are used to develop regression models based on molecular descriptors. Recently, powerful solubility prediction models have been published using feature- and graph-based neural networks. These models often display attractive performance, yet their reliability may be deceptive when they are used for prospective prediction. This review investigates the origins of these discrepancies along three directions: a historical perspective, an analysis of the structure of the aqueous solubility dataverse, and data quality. We demonstrate that new models are not ready for public use because they lack a well-defined applicability domain and overlook some historical data sources. On the basis of a carefully reviewed dataset, we illustrate the influence of data quality on model predictivity. We comprehensively investigated over 20 years of published solubility datasets and models, highlighting overlooked and interconnected datasets. We benchmarked recently published models on a Sanofi dataset, as an example of a pharmaceutical context, and they performed poorly. We observed the impact of several factors on model performance: interlaboratory standard deviation, ionic state of the solute, and source of the solubility data. As a consequence, we outline a general workflow to curate aqueous solubility data with the aim of producing predictive models. Our results show how the data quality and applicability domain of public models affect their utility in a real pharmaceutical industry context. We found that some data sources, for instance the eChem dataset, may be less reliable than initially expected. This exhaustive analysis of aqueous solubility data led to the development of a curation workflow; the resulting models and datasets are publicly available.

Data are available as CSV files.

File AqSolDBc.csv and AqSolDB_Enriched: AqSolDBc is the final curated version of AqSolDB, obtained by filtering the AqSolDB_Enriched dataset. The available columns are:

Source: If in AqSolDBc, the value is "AqSolDBc"
ID: Compound ID (string)
Name: Name of the compound (string)
InChI: InChI code of the chemical structure (string)
InChIKey: InChI hash code of the chemical structure (string)
ExperimentalLogS: Base-10 logarithm of the thermodynamic solubility (mol/L) in water at pH 7 (+/-1) at ~300 K (float)
SMILEScurated: Curated SMILES code of the chemical structure (string)
SD: Standard laboratory deviation, default value: -1 (float)
Group: Data quality label imported from AqSolDB (string)
Dataset: Source of the data point (string)
Composition: Purity of the substance: mono-constituent, multi-constituent, UVCB (categorical)
Origin: Either organic, organometallic, NaN (categorical)
HasError: Yes or No, see ErrorType for details (boolean)
ErrorType: Identifier error on the data point, default value: None (string)
AtomCount: Number of atoms (integer)
AlertAtoms: True if the molecule contains an AlertAtom, see curation (boolean)
DuplicateGroup: ID used to group duplicate structures (integer)
DuplicateSD: Standard deviation of the measurements for each unique structure, based on duplicate observations (double)
DuplicateOccurrence: Number of measurements for each unique structure (integer)
SD: Experimental standard deviation as given in the original AqSolDB (double)
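
As a minimal sketch of how these columns might be consumed, the Python snippet below parses a few invented rows that reuse some of the column names listed above, maps the SD default value of -1 to a missing value, and sets aside the rows flagged in HasError. The sample rows, the comma delimiter, and the literal "Yes"/"No" encoding of HasError are assumptions; the Format field of this record mentions tab-separated values, so the delimiter may need to be "\t" for the distributed files.

```python
import csv
import io

# Invented sample rows using a subset of the documented AqSolDBc columns.
SAMPLE = """ID,Name,ExperimentalLogS,SD,HasError
A-1,compound one,-2.10,0.25,No
A-2,compound two,-4.75,-1,No
A-3,compound three,-0.30,-1,Yes
"""


def load_rows(text, delimiter=","):
    """Parse rows, typing the numeric fields and mapping SD == -1 to None."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text), delimiter=delimiter):
        sd = float(raw["SD"])
        rows.append({
            "ID": raw["ID"],
            "Name": raw["Name"],
            "ExperimentalLogS": float(raw["ExperimentalLogS"]),
            # -1 is the documented default, i.e. no deviation was reported.
            "SD": None if sd == -1 else sd,
            # Assumption: the flag is stored as the text "Yes" or "No".
            "HasError": raw["HasError"].strip().lower() == "yes",
        })
    return rows


rows = load_rows(SAMPLE)
clean = [r for r in rows if not r["HasError"]]
with_sd = [r for r in clean if r["SD"] is not None]
print(len(rows), len(clean), len(with_sd))
```

Treating -1 as missing rather than as a number keeps the sentinel from leaking into averages or weighting schemes.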

File AqSolDB.csv: Original data from AqSolDB. The available columns are:

ID: Compound ID (string)
Name: Name of the compound (string)
SMILES: Original SMILES code of the chemical structure (string)
SmilesCurated: Curated SMILES code of the chemical structure (string)
InChI: InChI code of the chemical structure (string)
InChIKey: InChI hash code of the chemical structure (string)
Composition: Purity of the substance: mono-constituent, multi-constituent, UVCB (categorical)
Origin: Either organic, organometallic, NaN (categorical)
Dataset: Source of the data point (string)
HasError: Yes or No, see ErrorType for details (boolean)
ErrorType: Identifier error on the data point, default value: None (string)
AtomCount: Number of atoms (integer)
AlertAtoms: True if the molecule contains an AlertAtom, see curation (boolean)
SD: Experimental standard deviation as given in the original AqSolDB (double)

File AqSolDB_Enriched_for_AqSolDBc.csv: An extended version of AqSolDB_Enriched supplemented with molecular descriptors. The available columns are:

ID: Compound ID (string)
Name: Name of the compound (string)
InChI: InChI code of the chemical structure (string)
InChIKey: InChI hash code of the chemical structure (string)
Solubility: Base-10 logarithm of the thermodynamic solubility (mol/L) in water at pH 7 (+/-1) at ~300 K (float)
SMILES: Original SMILES code of the chemical structure (string)
SD: Standard laboratory deviation, default value: -1 (float)
Ocurrences: Number of occurrences in the original merged dataset from the original AqSolDB dataset
Group: Data quality label imported from AqSolDB (string)
MolWt: Molecular weight (double)
MolLogP: Computed logP lipophilicity (double)
MolMR: Computed molecular refractivity (double)
HeavyAtomCount: Number of heavy atoms (integer)
NumHAcceptors: Number of hydrogen bond acceptors (integer)
NumHDonors: Number of hydrogen bond donors (integer)
NumRotatableBonds: Number of rotatable bonds (integer)
NumValenceElectrons: Number of valence electrons (integer)
NumAromaticRings: Number of aromatic rings (integer)
NumSaturatedRings: Number of saturated rings (integer)
NumAliphaticRings: Number of aliphatic rings (integer)
RingCount: Number of rings (integer)
TPSA: Total polar surface area in Å^2 (double)
LabuteASA: Labute approximate molecular surface area (double)
BalabanJ: Balaban topological descriptor (double)
BertzCT: Bertz molecular complexity descriptor (double)
CAS: Chemical Abstracts Service identifier (string)
Dataset: Source of the data point (string)
Composition: Purity of the substance: mono-constituent, multi-constituent, UVCB (categorical)
Origin: Either organic, organometallic, NaN (categorical)
HasError: Yes or No, see ErrorType for details (boolean)
ErrorType: Identifier error on the data point, default value: None (string)
ROMol: Molecular identifier from RDKit (integer)
Smiles7: Curated SMILES ionized at pH 7.0 (string)
hasGroupSeparatedPy: Whether the molecule has separated formal charges that neutralize each other (at pH 7.0), True / False (boolean)
totalChargePy: Count of formal charges on the molecule at pH 7.0 (integer)
sumChargePy: Molecular formal charge at pH 7.0 (integer)
ChargeRatioPy: sumChargePy / totalChargePy, at pH 7.0 (double)
Pass_ChargeRatioPy: Category of the compound according to the ionization state at pH 7.0: Uncharged, Negative, Zwitterion, Positive, PureChargeSeparation (categorical)
AtomCount: Number of atoms (integer)
AlertAtoms: True if the molecule contains an AlertAtom, see curation (boolean)
SD: Experimental standard deviation as given in the original AqSolDB (double)
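
The relation ChargeRatioPy = sumChargePy / totalChargePy can be illustrated with a short Python sketch. It assumes that totalChargePy counts the atoms carrying a non-zero formal charge and that the ratio is left undefined when no atom is charged; neither point is stated in this record, and the function name and the per-atom charge lists are hypothetical. The mapping from the ratio to the Pass_ChargeRatioPy categories is not described here, so it is not attempted.

```python
def charge_summary(formal_charges):
    """Return (total, net, ratio) for a list of per-atom formal charges.

    Assumption: `total` counts atoms with a non-zero formal charge,
    `net` is the signed sum, and `ratio` is net / total.
    """
    total = sum(1 for q in formal_charges if q != 0)
    net = sum(formal_charges)
    # Assumption: the ratio is undefined (None) for molecules without charges.
    ratio = net / total if total else None
    return total, net, ratio


# Hypothetical heavy-atom formal charges at pH 7.0.
zwitterion = [1, 0, 0, 0, -1]   # e.g. an ammonium and a carboxylate group
anion = [0, 0, 0, -1]           # e.g. a single carboxylate group
neutral = [0, 0, 0, 0, 0, 0]    # no charged atom

for label, charges in [("zwitterion", zwitterion),
                       ("anion", anion),
                       ("neutral", neutral)]:
    print(label, charge_summary(charges))
```

Under these assumptions, a ratio of 0 with a non-zero count indicates that the charges cancel, whereas a ratio of -1 or +1 indicates that all charges share the same sign.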

File OChem.csv: Raw file obtained from OChem. The available columns are:

index: A numerical identifier for each entry in the dataset (integer)
SMILES: Simplified Molecular Input Line Entry System (string)
CASRN: Chemical Abstracts Service Registry Number (string)
EXTERNALID: An external identifier that links to other databases or references (integer)
N: Identifier specific to the dataset (integer)
NAME: The name of the chemical compound (string)
ARTICLEID: Identifier for the article or publication where the data was reported (string)
PUBMEDID: Identifier for the article in the PubMed database, which indexes biomedical and life sciences literature (string)
PAGE: Page number in the publication where the data can be found (integer)
TABLE: Table number in the publication where the data can be found (integer)
Water solubility: The solubility of the chemical compound in water (double)
UNIT {Water solubility}: The unit of measurement for water solubility (e.g., mg/L, mol/L) (string)
Water solubility {measured, converted}: Water solubility data, indicating whether the value is measured directly or converted from another unit (string)
UNIT {Water solubility}.1: The unit of measurement for the converted water solubility value (string)
Dataset: The specific dataset or source from which the data is derived (string)
Temperature: The temperature at which the water solubility measurement was taken (double)
UNIT {Temperature}: The unit of measurement for temperature (e.g., Celsius, Kelvin) (string)
Ionic strength: The ionic strength of the solution in which solubility was measured (double)
UNIT {Ionic strength}: The unit of measurement for ionic strength (e.g., mol/L) (string)
comment (chemical): Additional comments or notes about the chemical compound (string)
source: The source from which the data was obtained (string)
pH: The pH value of the solution in which solubility was measured (double)
UNIT {pH}: The unit for pH, which is dimensionless (string)
Quality code: A code indicating the quality or reliability of the data (integer)
UNIT {Quality code}: The unit or scale used for the quality code (string)
MW: Molecular weight of the chemical compound (double)
LogS (Format): Logarithm of the solubility (double)
Temperature (Format): Temperature format (string)
Temperature Keep (Format): Indicates whether the row is to be kept based on the temperature (boolean)
NB Hetero (Format): Number of heteroatoms in the chemical compound (integer)
CpId: Compound identifier, a unique ID assigned to each chemical compound in the dataset (integer)
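
Because the raw file stores solubility together with a unit and a molecular weight, a logarithmic molar value can be derived with the standard conversion logS = log10(concentration in g/L / MW in g/mol). The Python sketch below illustrates that arithmetic only; this record does not state how the LogS (Format) column was actually computed, and the function name, the unit labels, and the example values are assumptions.

```python
import math

# Assumed unit labels; the strings stored in the UNIT column may differ.
TO_GRAMS_PER_LITRE = {"g/L": 1.0, "mg/L": 1e-3}


def log_s(value, unit, mw=None):
    """Convert a solubility value to log10 of the molar concentration (mol/L)."""
    if value <= 0:
        raise ValueError("solubility must be positive to take a logarithm")
    if unit == "mol/L":
        molar = value
    else:
        if mw is None or mw <= 0:
            raise ValueError("a positive molecular weight is required")
        # Mass concentration (g/L) divided by molar mass (g/mol) gives mol/L.
        molar = value * TO_GRAMS_PER_LITRE[unit] / mw
    return math.log10(molar)


# Invented example: 1800 mg/L for a compound of 180 g/mol is about 0.01 mol/L.
print(round(log_s(1800.0, "mg/L", mw=180.0), 3))
print(round(log_s(0.01, "mol/L"), 3))
```

Unknown unit labels raise a KeyError in this sketch, which is a deliberate choice: silently skipping an unrecognized unit could introduce errors of several log units.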

File OChemUnseen.csv: Solubility data from OChem, curated and orthogonal to AqSolDB. The available columns are:

SMILES: Curated SMILES code of the chemical structure (string)
LogS: Logarithm of the thermodynamic solubility (mol/L) in water at pH 7 (+/-1) (float)

File OChemOverlapping.csv: Solubility data from OChem, curated; the chemical structures are also present in AqSolDB. The available columns are:

SMILES: Curated SMILES code of the chemical structure (string)
LogS: Logarithm of the thermodynamic solubility (mol/L) in water at pH 7 (+/-1) (float)

File OChemCurated.csv: Solubility data from OChem, curated. The available columns are:

ID: Compound ID (string)
Name: Compound name (string)
SMILES: Curated SMILES code of the chemical structure (string)
SDi: Standard laboratory deviation, default value: -1 (float)
Reference: Unformatted bibliographic reference from which the data point originates (string)
LogS: Logarithm of the thermodynamic solubility (mol/L) in water at pH 7 (+/-1) (float)
EXTERNALID: Compound ID as it appears in its data source, default value: None (string)
CASRN: CAS number of the compound, default value: None (string)
ARTICLEID: Source ID linked to the column Reference (string)
Temperature: Temperature of the measurement, in K (float)

File OChem_Clean.csv: Curated version of OChem, annotated for overlap with AqSolDBc (ready for modeling). The available columns are:

SMILES: Simplified Molecular Input Line Entry System (string)
LogS (Format): OChem experimental solubility (double)
SD: Standard deviation of the experimental solubility assigned to the SMILES (double)
Reference: Origin of the data (string)
Overlapping: Whether the compound overlaps with AqSolDBc ('Overlap') or not ('New') (categorical)

File OChem_Predicted.csv: Predicted solubility values on the OChem dataset, obtained with the model trained on the AqSolDBc dataset. The available columns are:

smiles: Standardized Simplified Molecular Input Line Entry System (string)
log Solubility: Prediction of water solubility on the OChem dataset (double)
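
One way to use these predictions is to compare them with experimental values through a root-mean-square error (RMSE). The Python sketch below is a hedged illustration: it assumes that predicted and experimental entries can be matched on identical SMILES strings, which this record does not guarantee, and it does not claim to reproduce the metric reported by the authors. The function name and all numerical values are invented.

```python
import math


def rmse(experimental, predicted):
    """Return (RMSE, n) over the structures present in both mappings."""
    shared = sorted(set(experimental) & set(predicted))
    if not shared:
        raise ValueError("no shared structures to compare")
    squared = [(predicted[s] - experimental[s]) ** 2 for s in shared]
    return math.sqrt(sum(squared) / len(squared)), len(shared)


# Invented values keyed by SMILES (assumption: strings match exactly).
experimental = {"CCO": 0.5, "c1ccccc1": -1.5}
predicted = {"CCO": 1.0, "c1ccccc1": -2.5, "CCC": -1.0}

error, n = rmse(experimental, predicted)
print(f"RMSE = {error:.3f} log units on {n} shared structures")
```

Reporting the number of matched structures alongside the error makes it visible when the SMILES-based join silently drops entries.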

File OChem_StandardizedStructures.csv: OChem dataset with chemical structures standardized according to the AqSolDBc rules. The available columns are:

CpId: Compound identifier, a unique ID assigned to each chemical compound in the dataset (string)
SMILES: Standardized Simplified Molecular Input Line Entry System (string)

File AqSolDBc_Overlap.csv: Overlapping entries between AqSolDB and OChem. The available columns are:

CpId: Original AqSolDB compound identifier, a unique ID assigned to each chemical compound in the dataset (string)
SMILES: Simplified Molecular Input Line Entry System (string)
LogS: Logarithm of the thermodynamic solubility (mol/L) in water at pH 7 (+/-1) (float)
Pass_ChargeRatioPy: Category of the compound according to the ionization state at pH 7.0: Uncharged, Negative, Zwitterion, Positive, PureChargeSeparation (categorical)

Identifier
DOI https://doi.org/10.57745/CZVZIA
Metadata Access https://entrepot.recherche.data.gouv.fr/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=doi:10.57745/CZVZIA
Provenance
Creator Llompart, Pierre; Minoletti, Claire; Baybekov, Shamkhal; Horvath, Dragos; Marcou, Gilles; Varnek, Alexandre (ORCID: 0000-0003-1886-925X)
Publisher Recherche Data Gouv
Contributor Marcou, Gilles; Université de Strasbourg; Centre national de la recherche scientifique; Entrepôt-Catalogue Recherche Data Gouv
Publication Year 2023
Funding Reference ANRT Cifre 2021/1684
Rights etalab 2.0; info:eu-repo/semantics/openAccess; https://spdx.org/licenses/etalab-2.0.html
OpenAccess true
Contact Marcou, Gilles (CMC - UMR7140 ; CNRS, Université de Strasbourg ; Strasbourg ; France)
Representation
Resource Type Dataset
Format text/tab-separated-values; application/x-ipynb+json; text/plain
Size 519218; 2513344; 5527968; 3164076; 2857901; 72137; 792128; 1257; 786411; 993073; 205089; 1135906; 3373; 1009535; 12558825; 100787; 2104
Version 2.0
Discipline Chemistry; Natural Sciences
Spatial Coverage Laboratory of Chemoinformatics (CMC - UMR7140)