Natural products (NPs) are single chemical compounds, substances or mixtures produced by living organisms and found in nature. NPs have been used as healing agents for thousands of years and remain the most important source of potential new therapeutic preparations. They have played a key role in drug discovery for infectious diseases. Furthermore, following the increasing consumer demand for natural food ingredients, many efforts have been made in recent years to discover natural low-calorie sweeteners. SuperNatural 3.0 is a freely available database of natural products. The updated version contains 790,096 different structures (including isomers) and 449,058 unique natural compounds, along with their structural and physicochemical information. Additionally, information on pathways, mechanisms of action, toxicity, vendors (where available) and drug-like chemical space predictions for several disease areas (antiviral, antibacterial, antimalarial and anticancer) and for specific targets such as the central nervous system (CNS) is provided for the natural compounds. The updated version of the database also provides a valuable pool of natural compounds in which potential highly sweet compounds are expected to be found. The possible taste profile of the natural compounds was predicted using our published VirtualTaste models. The SuperNatural 3.0 database is freely available to all users, without any login or registration.
The data is stored in a relational MySQL database, which is hosted on the Charité IT system. For the handling of chemical information in the database, the Python package RDKit and ChemAxon software were used. The website back-end consists of a lab-based LAMP (Linux/Apache/MySQL/PHP) server, with PHP serving as the back-end language. The database connection is established through the MySQL interface, and data is delivered to the front-end through a mixture of HTML from submission responses and AJAX requests. Website functionalities are implemented using JavaScript and, by extension, its library jQuery. Additionally, the CSS framework Bootstrap 4 is used. Tables on the website were created with the jQuery plugin DataTables and its absolute sorting extension. For the chemistry interface, the JavaScript library ChemDoodle Web Components was used. A JavaScript-capable browser is essential, and the server was tested on the most recent versions of Google Chrome and Mozilla Firefox.
There are different ways to identify a specific natural compound. If its name is contained in our database, it can be retrieved via the name search option. During typing, a live search drop-down menu appears that shows compounds containing the current text, so you can conveniently see whether your compound name is known and choose it from the list. Additionally, you can enter SuperNatural IDs and vendor codes to retrieve a previously found compound more quickly. Other options to retrieve a list of potential candidates are the property search, where you can apply different filters, and the (sub-)structure searches (detailed in the FAQ point below).
The option to use a user-provided molecular structure is available in a variety of tabs (compound search, mechanism of action and pathways). As a first step, a molecular structure has to be loaded in the ChemDoodle web interface. Structures can be obtained by entering a PubChem name or a SMILES string, by loading a structure file, or by drawing with the provided tools (see below). Once a structure is loaded, it can be modified further. When you are satisfied with the result, the button "Start Similarity Search" starts the search or prediction.
We provide links to suppliers for each compound, but we also recommend using the links to the ZINC database and to PubChem to find further vendors.
You should be aware that most vendors do not distinguish between isomers and do not guarantee isomeric purity.
We receive no commission or payment of any kind for providing these links.
In version 3.0 of SuperNatural, an effort was made to change the concept of the database from a collection of unassociated stereoisomers to parent structures
with corresponding isomers as sub-IDs. Whereas parent structures have no defined stereo geometry, corresponding child structures have the same structure as
the parent, but at least one defined stereo geometry at a stereo center.
Searches for similarity, substructures, physical properties and activities are mostly based on fingerprints or properties that are identical for isomers of
the same structure. Calculating the results only on the basis of the parent structures leads to much more compact result tables, with the isomers still being
accessible from the parent structure pages. On the parent compound pages, the links to other databases cover the different isomers of the structure, while
the links on child structure pages point only to the specific isomer.
Parent structures are numbered according to the alphabetical order of their InChIKeys, starting with SN0000001 for InChIKey AAAACIDXVXFWLZ-UHFFFAOYSA-N.
Corresponding isomers start with the same ID, followed by a sub-ID corresponding to the alphabetical order of the InChIKey of the isomer (e.g. SN0000001-01
with AAAACIDXVXFWLZ-IIZQIQNSSA-N, SN0000001-04 with AAAACIDXVXFWLZ-YRHQEJDOSA-N).
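The numbering scheme described above can be sketched in a few lines of Python. This is only an illustration, not the code used to build the database. It assumes that an isomer is matched to its parent through the shared first InChIKey block; the function name `assign_ids` and the keys marked as placeholders are hypothetical.

```python
from collections import defaultdict


def assign_ids(parent_keys, isomer_keys):
    """Map each InChIKey to an ID of the form SN0000001 or SN0000001-01."""
    # Assumption: isomers share the first InChIKey block with their parent.
    children = defaultdict(list)
    for key in isomer_keys:
        children[key.split("-")[0]].append(key)

    ids = {}
    # Parents are numbered by the alphabetical order of their InChIKeys.
    for number, parent in enumerate(sorted(parent_keys), start=1):
        parent_id = f"SN{number:07d}"
        ids[parent] = parent_id
        # Isomers get a sub-ID by the alphabetical order of their own keys.
        block = parent.split("-")[0]
        for sub, child in enumerate(sorted(children[block]), start=1):
            ids[child] = f"{parent_id}-{sub:02d}"
    return ids


parents = [
    "AAAACIDXVXFWLZ-UHFFFAOYSA-N",
    "BBBBBBBBBBBBBB-UHFFFAOYSA-N",  # placeholder, not a real entry
]
isomers = [
    "AAAACIDXVXFWLZ-YRHQEJDOSA-N",
    "AAAACIDXVXFWLZ-IIZQIQNSSA-N",
    "AAAACIDXVXFWLZ-KKKKKKKKSA-N",  # placeholder, not a real entry
    "AAAACIDXVXFWLZ-PPPPPPPPSA-N",  # placeholder, not a real entry
]
example_ids = assign_ids(parents, isomers)
```

With the two placeholder isomers sorting between the two real keys, the sketch reproduces the sub-IDs -01 and -04 quoted above.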
All structures of the database are mapped to different natural product databases:
The score details the level of confidence derived from the available natural compound information. Structures that have an assigned taxonomy, or that are available from a vendor that
provides natural products only, are given a confidence score of 1. These natural product vendors include suppliers such as Ambinter, Indofine or IBS with their catalogues,
as well as the natural product vendors listed in Zinc15_np. Old vendors that disappeared from the market while SuperNatural 1.0 and 2.0 were being compiled are
not included in this process.
Additionally, structures that are listed in at least 3 different natural-product-only databases also get a confidence score of 1. In this way, 344,386 structures were
assigned a confidence score of 1. Structures without taxonomy and without a natural product vendor that are listed in only 1 or 2 natural product databases get a
lower score of 0.5 (104,672 structures).
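The scoring rules above can be summarized as a small decision function. This is a minimal sketch of the rules as stated in the text, not the code used to build the database; the function and parameter names are hypothetical.

```python
def confidence_score(has_taxonomy, from_np_vendor, n_np_databases):
    """Return the confidence score (1.0 or 0.5) for one structure.

    has_taxonomy:   True if a taxonomy is assigned to the structure
    from_np_vendor: True if a natural-product-only vendor lists it
    n_np_databases: number of natural product databases that list it
    """
    if has_taxonomy or from_np_vendor or n_np_databases >= 3:
        return 1.0
    if n_np_databases in (1, 2):
        return 0.5
    # The text does not describe structures without any supporting source.
    raise ValueError("no natural product evidence for this structure")
```

For example, a structure without taxonomy or vendor that appears in two natural product databases would receive 0.5 under these rules, while the same structure in three databases would receive 1.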
Please note: Even though a score of 1 is assigned to compounds which are available from natural product vendors, it is still possible that the concerned compounds are merely natural product-based derivatives or mimetics. This is due to some natural product vendors not only supplying natural products which are extracted from their respective natural sources, but also derivatives obtained from modifying natural compounds, as well as mimetics of strictly natural products. Therefore, a score of 1 merely assures the compound is at least analogous to a natural compound or a modified natural product, if not a natural product itself.
ProTox-II is the first freely available web server for toxicity prediction based on chemical similarity and the identification of toxic fragments and demonstrates good performance in comparison to available QSAR-based methods. ProTox predicts the median oral lethal doses (LD50 values) and toxicity classes in rodents. In addition to the oral toxicity prediction, the web server indicates possible toxicity targets based on a collection of protein-ligand-based pharmacophores ('toxicophores') and therefore provides suggestions for the mechanism of toxicity development. SuperNatural 3.0 provides a possible toxicity alert for the use of a particular natural compound. However, the absence of such toxicity prediction or alert for a compound should not be taken as an indication of safety.
The mechanism of action prediction can start either from a natural compound or a protein
target. Starting from a natural compound can be done using either identifiers or a
molecular structure, which works the same way as in the compound search (detailed in the
respective FAQ points for compound
and structure search).
When starting from a protein target, a distinction is made between human and
non-human targets. For human targets, it is enough to enter a
UniProt identifier, whereas for non-human
targets, you can choose directly from all contained targets. A target can be
identified by first choosing the broad species from a dropdown and then successively
choosing more specific organisms until the last dropdown specifies the desired target.
Pathway prediction can start either from a natural compound or a human
pathway of interest. Starting from a natural compound can be done
using either identifiers or a
molecular structure, which works the same way as in the compound search (detailed in the
respective FAQ points for compound
and structure search).
When starting from a human pathway, only the
KEGG ID of the desired pathway is needed
to start the search.
The various targeted libraries were downloaded from the Life Chemicals website. They include screening libraries for different disease areas, such as anticancer, antiviral, antibacterial and antimalarial, as well as libraries that target specific systems, for example the central nervous system (CNS). The chemical space of the targeted libraries was analysed using maximum common substructure (MCS) search and Morgan circular fingerprints with radius 2.
Users can select a specific target library from the choices given in the drop-down menu and enter thresholds for the Tanimoto similarity. As a result, all SuperNatural compounds that have a Tanimoto similarity in the range of the chosen values with an active reference compound for the selected library will be displayed (see below).
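The threshold filter described above can be illustrated with a minimal sketch. It assumes that each fingerprint is given as a set of on-bit indices (in practice these would come from Morgan fingerprints) and that a compound is a hit if its similarity to at least one reference falls within the chosen range. The function names and the toy data are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bits."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def filter_by_similarity(compounds, references, low, high):
    """Names of compounds whose similarity to any reference is in [low, high]."""
    hits = []
    for name, fingerprint in compounds.items():
        if any(low <= tanimoto(fingerprint, ref) <= high for ref in references):
            hits.append(name)
    return hits


# Toy data: bit indices are made up for illustration only.
toy_compounds = {"A": {1, 2, 3, 4}, "B": {1, 9}}
toy_references = [{1, 2, 3, 5}]
toy_hits = filter_by_similarity(toy_compounds, toy_references, 0.5, 1.0)
```

In the toy data, compound A shares 3 of 5 distinct bits with the reference (similarity 0.6) and passes a lower threshold of 0.5, while compound B shares 1 of 5 (similarity 0.2) and does not.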
Following the increasing demand of consumers for natural food ingredients, many efforts
have been made to discover natural low-calorie sweeteners in recent years. The taste of a
selection of natural compounds was predicted using our published VirtualTaste prediction models.
This dataset includes compounds that were predicted to either have a sweet (S), bitter (B)
or sour (Sr) taste. The taste of natural products can be searched via the “Taste” tab and parameters
like types of taste and confidence scores can be used to filter the natural compounds.
Additionally, new models for salty and umami tastes were developed, with logistic regression achieving
the highest accuracy on average. The models were evaluated
both on an external test set and using 10-fold cross-validation on the training set, achieving > 90% accuracy,
sensitivity, specificity and area under the curve in all cases (see statistics).
For the taste prediction, it is necessary to choose the desired taste from the 5 provided options. Additionally, it is possible to further narrow the confidence level the retrieved results have to satisfy.
With the ongoing SARS-CoV-2 pandemic there is an urgent need for the discovery of a treatment for the coronavirus disease (COVID-19). Machine learning models can assist drug discovery through the prediction of the best compounds based on previously published data. Additionally, molecular docking was established using the AutoDock software. The natural compounds can be searched for their activity against the main protease using confidence intervals and the choice of the prediction method.
Ensemble machine learning methods were used to develop predictive models
from recent SARS-CoV-2 in vitro inhibition data. The models were
evaluated on several performance measures and have achieved above
80% accuracy, sensitivity, specificity and AUC-ROC both on 10-fold
cross-validation and external validation sets.
The crystal structure of the COVID-19 main protease in complex with the inhibitor N3 (PDB ID 6LU7) was downloaded from the Protein Data Bank (PDB). Due to limited computational capacity, the search space was reduced based on a similarity search between the known N3 inhibitor and the natural compounds from the SuperNatural 3.0 database.
From the ChEMBL 29 database, all interactions between
ChEMBL compounds and human targets were extracted. Furthermore, activities were normalized
to nanomolar values and for each unique compound-target pair, the minimum activity value
was determined.
Afterwards, a similarity comparison to SuperNatural 3.0 compounds was performed, using Morgan fingerprints
of length 1024 and radius 2.
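The normalization and minimum-selection step described above can be sketched as follows. This is a simplified illustration, not the pipeline that was run on ChEMBL 29: the record layout, the unit table and the identifiers in the usage note are assumptions made for the example.

```python
# Conversion factors to nanomolar; the set of units is an assumption.
UNIT_TO_NM = {"pM": 1e-3, "nM": 1.0, "uM": 1e3, "mM": 1e6, "M": 1e9}


def min_activity_per_pair(records):
    """Keep the lowest activity in nM for each (compound, target) pair.

    records: iterable of (compound_id, target_id, value, unit) tuples
    """
    best = {}
    for compound, target, value, unit in records:
        value_nm = value * UNIT_TO_NM[unit]  # normalize to nanomolar
        pair = (compound, target)
        if pair not in best or value_nm < best[pair]:
            best[pair] = value_nm
    return best
```

Given two measurements of 2 uM and 500 nM for the same hypothetical pair ("C1", "T1"), the function keeps 500 nM, since the minimum value corresponds to the strongest reported activity.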
The resources used were the ChEMBL 29 database,
the UniProt database,
KEGG pathways and a list of druggable genes from
IDG, further enriched with 43 genes
that are targets of approved drugs from TTD.
First, a table was created containing data from human
assays. This table was filtered for real binders, taking into account a
combination of binding strength, IC50/EC50 values and confidence scores. The
ChEMBL IDs of the targets were mapped to UniProt IDs and to KEGG IDs. Neither
mapping is unique, so there are cases where a ChEMBL ID is
connected to more than one UniProt ID, or a UniProt ID is mapped to more than
one KEGG ID. Not every target ChEMBL ID has a UniProt ID.
The resulting table contains a mapping of unique
relations between ChEMBL molecules and KEGG IDs of human targets.
By submitting all human KEGG IDs derived from the UniProt mapping table as a
query to KEGG Mapper, a table was
derived containing all relations between KEGG pathways and human genes.
Filtering for KEGG pathways related to diseases and infections
yielded a table containing only human KEGG IDs that appear in disease pathways.
For each of these entries, information on whether the gene is druggable by small molecules
was added using HGNC mapping.
Using all of this, the number of distinct druggable KEGG IDs related to both
molecule and pathway was evaluated for each molecule and each disease pathway.
Additionally, the number of distinct druggable KEGG IDs for each molecule,
the number of distinct druggable KEGG IDs for each pathway,
and the number of distinct druggable KEGG IDs appearing in any disease-related pathway were assessed
and then used to evaluate the probability that a specific compound
acts on the evaluated pathways.
When searching for molecules for a specific pathway, this strategy is not useful
because it yields a large number of molecules with the same number of targets within
this pathway, so an additional score was needed for ranking.
Here, all molecules binding to any druggable target within the pathway were
evaluated and ranked according to their binding value.
To evaluate the p-value, the number of interactions of a compound with druggable genes and the share of these that falls within the given pathways are compared with the same values for all druggable genes. The p-value gives the probability of observing at least this many interactions within the given pathways by chance alone. This probability is evaluated with the cumulative hypergeometric distribution using R. In this way, a binding within a pathway containing only two druggable genes is ranked higher than a few bindings within pathways with many druggable genes, and a single binding in one pathway by a compound with many interactions is ranked lower.
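The upper tail of the hypergeometric distribution can be written out explicitly. The following is a Python sketch for illustration, not the R code used for the database. The symbols are assumptions chosen for the example: N is the number of druggable genes in any disease pathway, K the number of druggable genes in the pathway of interest, n the number of druggable genes the compound interacts with, and k the number of those interactions that fall within the pathway.

```python
from math import comb


def hypergeom_p_at_least(k, N, K, n):
    """P(X >= k) when drawing n genes from N, of which K are in the pathway."""
    total = comb(N, n)
    upper = min(n, K)
    # Sum the probabilities of observing k, k+1, ..., upper pathway hits.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Toy numbers (made up): one hit by a compound with a single interaction.
p_small_pathway = hypergeom_p_at_least(1, 100, 2, 1)   # pathway with 2 genes
p_large_pathway = hypergeom_p_at_least(1, 100, 50, 1)  # pathway with 50 genes
# Same small pathway, but the compound interacts with 40 druggable genes.
p_promiscuous = hypergeom_p_at_least(1, 100, 2, 40)
```

With these toy numbers, the hit in the small pathway gets the smallest p-value, while the hit in the large pathway and the hit by the compound with many interactions both get larger ones, matching the ranking behaviour described above.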
The e-value gives the probability of finding an interaction within a pathway by chance when searching across a number of pathways. For the evaluation of the e-values, the number of pathways with druggable targets and binding data in ChEMBL was taken into account (275 pathways). For very small p-values, the e-value can be approximated by multiplying the p-value by the number of pathways; for larger values it must be evaluated with the formula e = 1 - (1 - p)^275.
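As a small worked illustration of this formula, the sketch below computes the e-value for a given p-value and compares it with the simple product approximation. The function name is hypothetical, and the use of `log1p`/`expm1` is a numerical choice made here so that very small p-values are not rounded away; it evaluates the same expression, one minus (1 - p) raised to the number of pathways.

```python
import math


def e_value(p, n_pathways=275):
    """Probability of at least one chance hit among n_pathways pathways."""
    if p >= 1.0:
        return 1.0
    # Equivalent to 1 - (1 - p) ** n_pathways, but stable for tiny p.
    return -math.expm1(n_pathways * math.log1p(-p))


tiny_p = 1e-6
exact = e_value(tiny_p)
approximation = tiny_p * 275  # valid only for very small p-values
```

For p = 1e-6 the exact value and the product differ by less than 1e-7, whereas for p = 0.01 the product would exceed 1 and the full formula has to be used.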
All available indications are displayed in a dropdown menu from which the user can choose; accuracy thresholds can additionally be defined.
Searching for a specific indication displays a result table that includes both the performance of the specific machine learning model
(evaluated via 10-fold cross-validation) as “model accuracy” and the individual score that each natural compound achieved with this model as “accuracy”.
The prediction of the association of natural compounds with disease indications is based on information extracted from the Therapeutic Target Database.
Specific indications were identified by their ICD-11 code, and all associated compounds were used as the training set for the
development of machine learning models. For each indication, a number of different machine learning models were tested and evaluated regarding their performance,
including logistic regression, linear discriminant analysis, k-nearest neighbors, decision trees, support vector machines, Gaussian naive Bayes and random
forests. The model accuracy was evaluated using 10-fold cross-validation and the best-performing model was chosen for each indication, resulting in 32 models included on the webpage.
The SuperNatural 3.0 database provides extended ‘Search’ options covering different properties of natural compounds and offers the possibility to download all search results as tables and data sets in a user-friendly manner. Users can customize the compound download list through advanced search and manual selection. Downloads are available in PDF, CSV, Excel and SDF formats. To facilitate natural product-based drug discovery pipelines and virtual screening protocols, the entire dataset is available for bulk download as a CSV or SDF file.
Please cite as:
Gallo K, Kemmler E, Goede A, Becker F, Dunkel M, Preissner R and Banerjee P.
SuperNatural 3.0-a database of natural products and natural product-based derivatives. Nucleic Acids Research, gkac1008 2022 Nov 18.
doi: 10.1093/nar/gkac1008. PMID: 36399452.
Previous publication:
Banerjee P, Erehman J, Gohlke BO, Wilhelm T, Preissner R and Dunkel M.
Super Natural II: A database of natural products. Nucleic Acids Research, 2015 database issue.