What is SuperNatural 3.0?

Natural products (NPs) are single chemical compounds, substances or mixtures produced by living organisms and found in nature. NPs have been used as healing agents for thousands of years and remain the most important source of new potential therapeutic preparations. They have played a key role in drug discovery for infectious diseases. Furthermore, following the increasing consumer demand for natural food ingredients, many efforts have been made in recent years to discover natural low-calorie sweeteners. SuperNatural 3.0 is a freely available database of natural products. The updated version contains 790,096 different structures including isomers and 449,058 unique natural compounds, along with their structural and physicochemical information. Additionally, it provides information on pathways, mechanism of action, toxicity and vendors (if available), as well as drug-like chemical space predictions for several disease areas (antiviral, antibacterial, antimalarial and anticancer) and for specific target cells such as those of the central nervous system (CNS). The updated version also provides a valuable pool of natural compounds in which potential highly sweet compounds are expected to be found. The possible taste profile of the natural compounds was predicted using our published VirtualTaste models. The SuperNatural 3.0 database is available to all users without any login or registration.

What is the technical background for database and website?

The data is stored in a relational MySQL database, which is hosted on the Charité IT system. For the handling of chemical information in the database, the Python package RDKit and ChemAxon software were used. The website back-end consists of a lab-based LAMP (Linux/Apache/MySQL/PHP) server, with PHP serving as the back-end language. The database connection is established through the MySQL interface, and front-end data is delivered through a mixture of HTML from submission responses and AJAX requests. Website functionality is implemented using JavaScript and its library jQuery. Additionally, the CSS framework Bootstrap 4 is used. Tables on the website were created with the jQuery plugin DataTables and its absolute sorting extension. For the chemistry interface, the JavaScript library ChemDoodle Web Components was used. A JavaScript-capable browser is essential, and the server was tested on the most recent versions of Google Chrome and Mozilla Firefox.

How can I find a specific natural compound of interest?

There are different ways to identify a specific natural compound. If its name is contained in our database, it can be retrieved via the name search option. While you type, a live search drop-down menu shows compounds containing the current text, so you can conveniently see whether your compound name is known and choose it from the list. Additionally, you can enter SuperNatural IDs and vendor codes to retrieve a previously found compound more quickly. Other options to retrieve a list of potential candidates are the property search, where you can apply different filters, and the (sub-)structure searches (detailed in the FAQ entry below).

How can I use the structure search options?

The option to use a user-provided molecular structure is available in several tabs (compound search, mechanism of action and pathways). As a first step, a molecular structure has to be loaded into the ChemDoodle web interface. Structures can be obtained by entering a PubChem name or a SMILES string, by loading a structure file, or by drawing with the provided tools (see below). Once a structure is loaded, it can be modified further. When you are satisfied with the result, use the button "Start Similarity Search" to start the search or prediction.

Where can I buy a compound?

We provide links to suppliers for each compound, but we recommend also using the links to the ZINC database and to PubChem to find further vendors. Be aware that most vendors do not distinguish between isomers and do not guarantee isomeric purity.

We receive no commission or payment of any kind for providing these links.

What is the concept for the assignment of specific SNP-IDs?

In version 3.0 of SuperNatural, an effort was made to change the concept of the database from a collection of unassociated stereoisomers to parent structures with corresponding isomers as sub-IDs. Whereas parent structures have no defined stereo geometry, corresponding child structures have the same structure as the parent, but at least one defined stereo geometry at a stereo center.
Searches for similarity, substructures, physical properties and activities are mostly based on fingerprints or properties that are identical for isomers of the same structure. Calculating the results only on the basis of the parent structures leads to much more compact result tables, with the isomers still being accessible from the parent structure pages. On parent compound pages, the links to other databases cover all isomers of the structure, whereas the links on child structure pages point only to the specific isomer.
Parent structures are numbered according to the alphabetical order of their InChIKeys, starting with SN0000001 for InChIKey AAAACIDXVXFWLZ-UHFFFAOYSA-N. Corresponding isomers start with the same ID, followed by a sub-ID corresponding to the alphabetical order of the InChIKey of the isomer (e.g. SN0000001-01 with AAAACIDXVXFWLZ-IIZQIQNSSA-N, SN0000001-04 with AAAACIDXVXFWLZ-YRHQEJDOSA-N).
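The numbering scheme can be illustrated with a minimal Python sketch. It is not the code used to build the database: the function name assign_ids is hypothetical, and the sketch assumes that an isomer belongs to the parent whose InChIKey shares the same first 14 characters (the connectivity block of an InChIKey).

```python
from collections import defaultdict


def assign_ids(parent_keys, isomer_keys):
    """Assign SN IDs by alphabetical InChIKey order (illustrative only).

    parent_keys: InChIKeys of parent structures (no defined stereo geometry).
    isomer_keys: InChIKeys of isomers; the first 14 characters are assumed
    to identify the parent structure.
    """
    # Parents are numbered in alphabetical order, starting at SN0000001.
    parent_ids = {
        key: f"SN{number:07d}"
        for number, key in enumerate(sorted(parent_keys), start=1)
    }

    # Group isomers under the connectivity block they share with a parent.
    by_block = defaultdict(list)
    for key in isomer_keys:
        by_block[key[:14]].append(key)

    # Within each parent, isomers get sub-IDs in alphabetical order.
    isomer_ids = {}
    for parent_key, parent_id in parent_ids.items():
        for sub, key in enumerate(sorted(by_block[parent_key[:14]]), start=1):
            isomer_ids[key] = f"{parent_id}-{sub:02d}"
    return parent_ids, isomer_ids


parents = ["AAAACIDXVXFWLZ-UHFFFAOYSA-N"]
isomers = ["AAAACIDXVXFWLZ-YRHQEJDOSA-N", "AAAACIDXVXFWLZ-IIZQIQNSSA-N"]
parent_ids, isomer_ids = assign_ids(parents, isomers)
print(parent_ids)
print(isomer_ids)
```

Because only two isomers are supplied here, the second one receives sub-ID 02. In the database it carries sub-ID 04, presumably because further isomers of the same parent sort in between.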

How was the database curation performed?

All structures of the database are mapped to different natural product databases:

afromalaria, ambinter, analyticon, biofacquim, cmaup, cmnpd, coconut, exposome, food_db, gnps, ibs_2022, indofine, lotus, nanpdb, nci_dtp_np, npass, npatlas, npcare, np_mrd, nubbe, panapl, phenolexplorer, sanc_db, supernatural 2, streptomed, swmd, tipdb, tppt, uefs, unpd, zinc15np

First, the number of charged structures was reduced by removing all charged structures for which the uncharged version of the same structure is also contained in the database. The remaining charged structures were kept only if they could be mapped to at least 2 different exclusively natural product databases. In this way, the number of charged structures was reduced from 61,168 to 1,074 by the end of the curation process.
Additionally, a score was assigned to all remaining parent structures, detailing the level of confidence derived from available natural compound information. As a consequence, around 35,000 structures that were assigned a score of zero and were present in the SuperNatural 2 database were removed from the current version during data curation.
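The charged-structure filter described above can be sketched in a few lines of Python. This is only an illustration of the stated rule, not the curation pipeline itself: the record layout (the keys "neutral_key", "charged" and "np_databases") and the function name filter_charged are assumptions made for this example.

```python
def filter_charged(structures, min_np_databases=2):
    """Apply the charged-structure rule to a list of structure records.

    Each record is a dict with hypothetical keys:
      "neutral_key":  identifier shared by a structure and its uncharged form
      "charged":      True if the structure carries a charge
      "np_databases": set of exclusively natural product databases listing it
    """
    # Identifiers for which an uncharged version exists in the collection.
    neutral_present = {s["neutral_key"] for s in structures if not s["charged"]}

    kept = []
    for s in structures:
        if not s["charged"]:
            kept.append(s)  # uncharged structures are not affected
        elif s["neutral_key"] in neutral_present:
            continue  # uncharged version exists, so drop the charged one
        elif len(s["np_databases"]) >= min_np_databases:
            kept.append(s)  # charged, but supported by enough NP databases
    return kept


records = [
    {"neutral_key": "A", "charged": False, "np_databases": {"coconut"}},
    {"neutral_key": "A", "charged": True, "np_databases": {"coconut", "lotus"}},
    {"neutral_key": "B", "charged": True, "np_databases": {"coconut", "lotus"}},
    {"neutral_key": "C", "charged": True, "np_databases": {"coconut"}},
]
kept = filter_charged(records)
print([(s["neutral_key"], s["charged"]) for s in kept])
```

In this toy input, the charged form of A is dropped because its uncharged form is present, charged B is kept because two natural product databases list it, and charged C is dropped because only one does.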

What is the meaning of the “score” on parent compound pages?

The score details the level of confidence derived from available natural compound information. Structures with an assigned taxonomy, or which are available from a vendor that supplies only natural products, have a confidence score of 1. These vendors include suppliers such as Ambinter, Indofine and IBS with their catalogues, as well as the natural product vendors listed in Zinc15_np. Old vendors that disappeared from the market during the period when SuperNatural 1.0 and 2.0 were compiled are not included in this process.
Additionally, structures that are listed in at least 3 different natural-product-only databases also get a confidence score of 1. In this way, 344,386 structures were assigned a confidence score of 1. Structures without taxonomy and without a natural product vendor that are listed in only 1 or 2 natural product databases get a lower score of 0.5 (104,672 structures).

Please note: Even though a score of 1 is assigned to compounds which are available from natural product vendors, it is still possible that the concerned compounds are merely natural product-based derivatives or mimetics. This is due to some natural product vendors not only supplying natural products which are extracted from their respective natural sources, but also derivatives obtained from modifying natural compounds, as well as mimetics of strictly natural products. Therefore, a score of 1 merely assures the compound is at least analogous to a natural compound or a modified natural product, if not a natural product itself.
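The scoring rules can be summarised as a small decision function. The following Python sketch only restates the rules given above; the function name and its parameters are hypothetical, and the score of 0 for structures not found in any natural product database is an assumption based on the curation description, where zero-scored structures were removed.

```python
def confidence_score(has_taxonomy, has_np_vendor, n_np_databases):
    """Return the confidence score for a parent structure (illustrative).

    has_taxonomy:    True if a taxonomy is assigned to the structure
    has_np_vendor:   True if a natural-product-only vendor offers it
    n_np_databases:  number of natural-product-only databases listing it
    """
    if has_taxonomy or has_np_vendor or n_np_databases >= 3:
        return 1.0
    if n_np_databases in (1, 2):
        return 0.5
    return 0.0  # assumed: no supporting natural product information


print(confidence_score(True, False, 0))
print(confidence_score(False, False, 2))
print(confidence_score(False, False, 0))
```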

How is the toxicity class of a compound determined?

ProTox-II is the first freely available web server for toxicity prediction based on chemical similarity and the identification of toxic fragments and demonstrates good performance in comparison to available QSAR-based methods. ProTox predicts the median oral lethal doses (LD50 values) and toxicity classes in rodents. In addition to the oral toxicity prediction, the web server indicates possible toxicity targets based on a collection of protein-ligand-based pharmacophores ('toxicophores') and therefore provides suggestions for the mechanism of toxicity development. SuperNatural 3.0 provides a possible toxicity alert for the use of a particular natural compound. However, the absence of such toxicity prediction or alert for a compound should not be taken as an indication of safety.

Which starting points are there for the mechanism of action prediction?

The mechanism of action prediction can start either from a natural compound or from a protein target. Starting from a natural compound can be done using either identifiers or a molecular structure, which works the same way as in the compound search (detailed in the respective FAQ entries for compound and structure search).
When starting from a protein target, a distinction is made between human and non-human targets. For human targets, it is enough to enter a UniProt identifier, whereas for non-human targets, you can choose directly from all contained targets. A target can be identified by choosing the broad species first from a dropdown, and then consecutively for more specific organisms until the last dropdown specifies the desired target.

How can I use the pathway prediction?

Pathway prediction can start either from a natural compound or from a human pathway of interest. Starting from a natural compound can be done using either identifiers or a molecular structure, which works the same way as in the compound search (detailed in the respective FAQ entries for compound and structure search).
When starting from a human pathway, only the KEGG ID of the desired pathway is needed to start the search.

Where are the targeted libraries reference compounds from?

The various targeted libraries were downloaded from the Life Chemicals website. They include screening libraries for different disease areas, such as anticancer, antiviral, antibacterial and antimalarial libraries, as well as libraries that target specific cells, for example those of the central nervous system (CNS). The chemical space of the targeted libraries was analysed using maximum common substructure (MCS) search and Morgan circular fingerprints with radius 2.

How can I search the targeted libraries?

Users can select a specific target library from the choices given in the drop-down menu and enter thresholds for the Tanimoto similarity. As a result, all SuperNatural compounds that have a Tanimoto similarity in the range of the chosen values with an active reference compound for the selected library will be displayed (see below).
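The range filter can be illustrated with a short Python sketch. It is a simplified stand-in for the server-side search: fingerprints are represented here as plain sets of on-bit indices with made-up toy values, and the function names are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bits."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def search_library(compounds, references, low, high):
    """Return compounds whose similarity to any reference lies in [low, high].

    compounds:  dict mapping a compound ID to its fingerprint
    references: list of fingerprints of active reference compounds
    The best in-range similarity is reported for each hit.
    """
    hits = {}
    for compound_id, fp in compounds.items():
        in_range = [
            sim
            for sim in (tanimoto(fp, ref) for ref in references)
            if low <= sim <= high
        ]
        if in_range:
            hits[compound_id] = max(in_range)
    return hits


# Toy fingerprints, not real database entries.
compounds = {"cpd_a": {1, 2, 3, 4}, "cpd_b": {1, 9}}
references = [{1, 2, 3, 5}]
hits = search_library(compounds, references, low=0.5, high=1.0)
print(hits)
```

Here cpd_a shares 3 of 5 distinct bits with the reference (similarity 0.6) and is returned, while cpd_b (similarity 0.2) falls below the lower threshold.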

How is the taste prediction performed?

Following the increasing demand of consumers for natural food ingredients, many efforts have been made to discover natural low-calorie sweeteners in recent years. The taste of a selection of natural compounds was predicted using our published VirtualTaste prediction models. This dataset includes compounds that were predicted to either have a sweet (S), bitter (B) or sour (Sr) taste. The taste of natural products can be searched via the “Taste” tab and parameters like types of taste and confidence scores can be used to filter the natural compounds.
Additionally, new models for salty and umami tastes were developed, with logistic regression achieving the highest accuracy on average. The models were evaluated both on an external test set and using 10-fold cross-validation on the training set, achieving > 90% accuracy, sensitivity, specificity and area under the curve in all cases (see statistics).

Which options are there to specify the taste prediction?

For the taste prediction, it is necessary to choose the desired taste from the 5 provided options. Additionally, it is possible to further narrow the confidence level the retrieved results have to satisfy.

Which methods were used to predict COVID-19 main protease inhibition?

With the ongoing SARS-CoV-2 pandemic there is an urgent need for the discovery of a treatment for the coronavirus disease (COVID-19). Machine learning models can assist drug discovery through the prediction of the best compounds based on previously published data. Additionally, molecular docking was established using the AutoDock software. The natural compounds can be searched for their activity against the main protease using confidence intervals and the choice of the prediction method.

What is the achieved performance of the COVID-19 machine learning models?

Ensemble machine learning methods were used to develop predictive models from recent SARS-CoV-2 in vitro inhibition data. The models were evaluated on several performance measures and achieved above 80% accuracy, sensitivity, specificity and AUC-ROC on both 10-fold cross-validation and external validation sets.

How was the docking performed?

The crystal structure of the COVID-19 main protease in complex with the inhibitor N3 (PDB ID 6LU7) was downloaded from the Protein Data Bank (PDB). Because of limited computational capacity, the search space was reduced on the basis of a similarity search between the known N3 inhibitor and the natural compounds from the SuperNatural 3.0 database.

How are the activity values for the mechanism of action determined?

From the ChEMBL 29 database, all interactions between ChEMBL compounds and human targets were extracted. Furthermore, activities were normalized to nanomolar values and for each unique compound-target pair, the minimum activity value was determined.
Afterwards, a similarity comparison to SuperNatural 3.0 compounds was performed, using Morgan fingerprints of length 1024 and radius 2.
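The normalization and minimum selection can be sketched as follows. This is an illustrative example only: the record layout, the unit table and the identifiers are assumptions, and the actual extraction from ChEMBL 29 is not shown.

```python
# Conversion factors to nanomolar for a few common concentration units.
UNIT_TO_NM = {"pM": 1e-3, "nM": 1.0, "uM": 1e3, "mM": 1e6, "M": 1e9}


def min_activity_nm(records):
    """Return the minimum activity in nM for each compound-target pair.

    records: iterable of (compound_id, target_id, value, unit) tuples.
    """
    best = {}
    for compound_id, target_id, value, unit in records:
        value_nm = value * UNIT_TO_NM[unit]
        pair = (compound_id, target_id)
        if pair not in best or value_nm < best[pair]:
            best[pair] = value_nm
    return best


# Hypothetical identifiers and values for demonstration.
records = [
    ("CPD1", "TGT1", 2.0, "uM"),
    ("CPD1", "TGT1", 500.0, "nM"),
    ("CPD1", "TGT2", 0.5, "mM"),
]
activities = min_activity_nm(records)
print(activities)
```

For the pair (CPD1, TGT1), the two measurements correspond to 2000 nM and 500 nM, so 500 nM is kept.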

How was the pathway enrichment analysis performed?

The resources used were the ChEMBL 29 database, the UniProt database, KEGG pathways and a list of druggable genes from IDG, further enriched with 43 genes from TTD that are targets of approved drugs.
First, a table was created containing data from assays in which the organism is human. This table was filtered for real binders, taking into account a combination of binding strength, IC50/EC50 values and confidence scores. The ChEMBL IDs of the targets were mapped to UniProt IDs and to KEGG IDs. Neither mapping is unique, so a ChEMBL ID can be connected to more than one UniProt ID, and a UniProt ID can be mapped to more than one KEGG ID. Not every target ChEMBL ID has a UniProt ID. The resulting table contains a mapping of unique relations between ChEMBL molecules and the KEGG IDs of human targets. All human KEGG IDs derived from the UniProt mapping table were submitted as a query to KEGG Mapper, yielding a table with all relations between KEGG pathways and human genes. After filtering for KEGG pathways related to diseases and infections, a table containing only the human KEGG IDs that appear within disease pathways was built. For each of these entries, information on whether the gene is druggable by small molecules was added using HGNC mapping. From this, the number of distinct druggable KEGG IDs related to both molecule and pathway was determined for each molecule and each disease pathway. Additionally, the number of distinct druggable KEGG IDs for each molecule, the number for each pathway, and the number appearing in any disease-related pathway were determined and used to evaluate the probability that a specific compound acts on the evaluated pathways.
When searching for molecules for a specific pathway, this strategy is not useful, because a large number of molecules have the same number of targets within the pathway; an additional score for ranking was therefore needed. Here, all molecules binding to any druggable target within the pathway were evaluated and ranked according to the binding value.

What is the p-value?

To evaluate the p-value, the number of a compound's interactions with druggable genes and the share of these that fall within the given pathways are compared with the corresponding values for all druggable genes. The p-value gives the probability of observing at least as many interactions within the given pathways purely by chance. This probability is evaluated with the cumulative hypergeometric distribution using R. In this way, a binding within a pathway containing only two druggable genes is ranked higher than a few bindings within pathways with many druggable genes, and a single binding in one pathway for a compound with many interactions is ranked lower.
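As an illustration of the upper-tail hypergeometric probability, here is a minimal Python sketch. The database itself was evaluated in R; this is not that code, and the parameter names (k, K, n, N) and the toy numbers are assumptions chosen for the example.

```python
from math import comb


def hypergeom_p_value(k, K, n, N):
    """Probability of at least k pathway hits by chance (upper tail).

    N: number of druggable genes in any disease-related pathway
    K: number of druggable genes in the pathway of interest
    n: number of druggable genes the compound interacts with
    k: number of those interactions that fall within the pathway
    """
    total = comb(N, n)
    tail = sum(
        comb(K, i) * comb(N - K, n - i)
        for i in range(k, min(K, n) + 1)
    )
    return tail / total


# One hit for a compound with a single interaction, among 100 druggable genes:
p_small_pathway = hypergeom_p_value(k=1, K=2, n=1, N=100)
p_large_pathway = hypergeom_p_value(k=1, K=50, n=1, N=100)
print(p_small_pathway, p_large_pathway)
```

With these toy numbers, a single hit in a pathway with only 2 druggable genes yields a much smaller p-value than a single hit in a pathway with 50 druggable genes, which matches the ranking behaviour described above. In R, the same upper tail corresponds to phyper(k - 1, K, N - K, n, lower.tail = FALSE).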

What is the e-value?

The e-value gives the probability of finding an interaction within a pathway by chance when searching across a number of pathways. For the evaluation of the e-values, the number of pathways with druggable targets and binding data in ChEMBL was taken into account (275 pathways). For very small p-values, the e-value can be approximated by multiplying the p-value by the number of pathways; for larger values, it must be evaluated with the formula e = 1 - (1 - p)²⁷⁵.
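The relation between the exact formula and the small-p approximation can be checked with a few lines of Python. The function name e_value is hypothetical; the default of 275 pathways is taken from the text above.

```python
def e_value(p, n_pathways=275):
    """Probability of at least one chance hit across n_pathways pathways."""
    return 1.0 - (1.0 - p) ** n_pathways


for p in (1e-6, 1e-3, 1e-2):
    exact = e_value(p)
    approx = 275 * p  # only valid for very small p
    print(f"p={p:g}  exact={exact:.6f}  approx={approx:.6f}")
```

For p = 1e-6 both values agree closely, whereas for p = 0.01 the simple product exceeds 1 and is no longer a valid probability, so the exact formula is required.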

How can I search the predicted disease associations of natural products?

All available indications are displayed in a drop-down menu from which the user can choose; accuracy thresholds can additionally be defined.

Searching for a specific indication displays a result table including both the performance of the specific machine learning model (evaluated via 10-fold cross-validation) as “model accuracy” and the individual score of the natural compounds achieved using this model as “accuracy”.

What is the basis for the disease association prediction?

The prediction of the association of natural compounds with disease indications is based on information extracted from the Therapeutic Target Database.
Specific indications were identified by their ICD-11 code, and all associated compounds were used as the training set for the development of machine learning models. For each indication, a number of different machine learning models were tested and evaluated regarding their performance, including logistic regression, linear discriminant analysis, k-nearest neighbors, decision trees, support vector machines, Gaussian naive Bayes and random forests. The model accuracy was evaluated using 10-fold cross-validation, and the best-performing model was chosen for each indication, resulting in 32 models included on the webpage.

Which software was used during the development of the database?

AutoDock: AutoDock is a suite of automated docking tools. It is designed to predict how small molecules, such as substrates or drug candidates, bind to a receptor of a known 3D structure. Current distributions of AutoDock consist of two generations of software: AutoDock 4 and AutoDock Vina.
KNIME: KNIME, the Konstanz Information Miner, is a free and open-source data analytics, reporting and integration platform. KNIME was used for data curation, standardization and property calculation.
RDKit: RDKit is an open-source cheminformatics toolkit. It was used to generate properties and to perform molecular-fingerprint-based similarity searches within the database.
ChemDoodle: ChemDoodle Web Components, a minimal, open-source JavaScript library for fast, professional HTML5 scientific interfaces, was used to handle the chemical structures on the website.
ChemAxon: ChemAxon is a cheminformatics and bioinformatics software suite; the free academic version was used for file format conversion and for handling chemical datasets.
Python: The Python Programming Language version 3 was used for most of the programming related tasks.

Which download options do I have?

The SuperNatural 3.0 database provides extended ‘Search’ options covering different properties of natural compounds and offers the possibility to download all search results as tables and data sets in a user-friendly manner. Users can customize the compound download list through advanced search and manual selection. The available download formats are PDF, CSV, Excel and SDF. To facilitate natural product-based drug discovery pipelines and virtual screening protocols, the entire dataset is available for bulk download as a CSV or SDF file.

How to cite SuperNatural 3.0?

Please cite as:

Gallo K, Kemmler E, Goede A, Becker F, Dunkel M, Preissner R and Banerjee P. SuperNatural 3.0-a database of natural products and natural product-based derivatives. Nucleic Acids Research, gkac1008 2022 Nov 18.
doi: 10.1093/nar/gkac1008. PMID: 36399452.

Previous publication:

Banerjee P., Erehman J., Gohlke B. O., Wilhelm T., Preissner R. and Dunkel M. Super Natural II: A database of natural products. Nucleic Acids Research, 2015, database issue.