The task is to predict binding results for a set of inhibitors of human
β-secretase 1 (BACE-1).
Tasks: 1
Task type: classification
Total samples: 1513
Recommended split: scaffold
Recommended metric: AUROC
Parameters:
data_dir ({None, str, path-like}, default=None) – Path to the root data directory. If None, currently set scikit-learn directory
is used, by default $HOME/scikit_learn_data.
as_frame (bool, default=False) – If True, returns the raw DataFrame with columns: “SMILES”, “label”. Otherwise,
returns SMILES as list of strings, and labels as a NumPy array (1D integer binary
vector).
verbose (bool, default=False) – If True, progress bar will be shown for downloading or loading files.
Returns:
data – Depending on the as_frame argument, one of:
- Pandas DataFrame with columns: “SMILES”, “label”
- tuple of: list of strings (SMILES), NumPy array (labels)
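AUROC is the recommended metric for the binary classification datasets on this page. The sketch below illustrates what the metric measures, using the pairwise (Mann-Whitney) formulation on made-up labels and scores. The helper name auroc is hypothetical and not part of the library; in practice a tested routine such as scikit-learn's roc_auc_score would normally be used.

```python
def auroc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC is undefined when only one class is present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up binary labels and predicted scores, for illustration only.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auroc(labels, scores))  # 3 of 4 pairs are ordered correctly -> 0.75
```

The quadratic loop is only meant to make the definition visible; it is not suitable for datasets as large as HIV or PCBA.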
Load and return the BBBP (Blood-Brain Barrier Penetration) dataset.
The task is to predict blood-brain barrier penetration (barrier permeability)
of small drug-like molecules.
Tasks: 1
Task type: classification
Total samples: 2039
Recommended split: scaffold
Recommended metric: AUROC
Parameters:
data_dir ({None, str, path-like}, default=None) – Path to the root data directory. If None, currently set scikit-learn directory
is used, by default $HOME/scikit_learn_data.
as_frame (bool, default=False) – If True, returns the raw DataFrame with columns: “SMILES”, “label”. Otherwise,
returns SMILES as list of strings, and labels as a NumPy array (1D integer binary
vector).
verbose (bool, default=False) – If True, progress bar will be shown for downloading or loading files.
Returns:
data – Depending on the as_frame argument, one of:
- Pandas DataFrame with columns: “SMILES”, “label”
- tuple of: list of strings (SMILES), NumPy array (labels)
The task is to predict drug approval viability, by predicting clinical trial
toxicity and final FDA approval status. Both tasks are binary.
Tasks: 2
Task type: multitask classification
Total samples: 1477
Recommended split: scaffold
Recommended metric: AUROC
Parameters:
data_dir ({None, str, path-like}, default=None) – Path to the root data directory. If None, currently set scikit-learn directory
is used, by default $HOME/scikit_learn_data.
as_frame (bool, default=False) – If True, returns the raw DataFrame with columns “SMILES” and 2 label columns,
FDA approval and clinical trial toxicity. Otherwise, returns SMILES as list
of strings, and labels as a NumPy array (2D integer array).
verbose (bool, default=False) – If True, progress bar will be shown for downloading or loading files.
Returns:
data – Depending on the as_frame argument, one of:
- Pandas DataFrame with columns “SMILES” and 2 label columns
- tuple of: list of strings (SMILES), NumPy array (labels)
Load and return the ESOL (Estimated SOLubility) dataset.
The task is to predict aqueous solubility. Targets are log-transformed,
and the unit is log moles per litre (log mol/L).
Tasks: 1
Task type: regression
Total samples: 1128
Recommended split: scaffold
Recommended metric: RMSE
Parameters:
data_dir ({None, str, path-like}, default=None) – Path to the root data directory. If None, currently set scikit-learn directory
is used, by default $HOME/scikit_learn_data.
as_frame (bool, default=False) – If True, returns the raw DataFrame with columns: “SMILES”, “label”. Otherwise,
returns SMILES as list of strings, and labels as a NumPy array (1D float vector).
verbose (bool, default=False) – If True, progress bar will be shown for downloading or loading files.
Returns:
data – Depending on the as_frame argument, one of:
- Pandas DataFrame with columns: “SMILES”, “label”
- tuple of: list of strings (SMILES), NumPy array (labels)
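RMSE is the recommended metric for the regression datasets (ESOL, FreeSolv, Lipophilicity). A minimal sketch of the computation follows, assuming targets and predictions are plain sequences of floats. The helper name rmse and the numbers are illustrative, not taken from the dataset.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up log-solubility targets and predictions, for illustration only.
y_true = [-0.77, -3.30, -2.06]
y_pred = [-1.00, -3.00, -2.50]
print(round(rmse(y_true, y_pred), 3))
```

Because the targets are already log-transformed, the error is expressed in the same log units as the labels.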
Load and return the FreeSolv (Free Solvation Database) dataset.
The task is to predict hydration free energy of small molecules in water.
Targets are in kcal/mol.
Tasks: 1
Task type: regression
Total samples: 642
Recommended split: scaffold
Recommended metric: RMSE
Parameters:
data_dir ({None, str, path-like}, default=None) – Path to the root data directory. If None, currently set scikit-learn directory
is used, by default $HOME/scikit_learn_data.
as_frame (bool, default=False) – If True, returns the raw DataFrame with columns: “SMILES”, “label”. Otherwise,
returns SMILES as list of strings, and labels as a NumPy array (1D float vector).
verbose (bool, default=False) – If True, progress bar will be shown for downloading or loading files.
Returns:
data – Depending on the as_frame argument, one of:
- Pandas DataFrame with columns: “SMILES”, “label”
- tuple of: list of strings (SMILES), NumPy array (labels)
The task is to predict ability of molecules to inhibit HIV replication.
Tasks: 1
Task type: classification
Total samples: 41127
Recommended split: scaffold
Recommended metric: AUROC
Parameters:
data_dir ({None, str, path-like}, default=None) – Path to the root data directory. If None, currently set scikit-learn directory
is used, by default $HOME/scikit_learn_data.
as_frame (bool, default=False) – If True, returns the raw DataFrame with columns: “SMILES”, “label”. Otherwise,
returns SMILES as list of strings, and labels as a NumPy array (1D integer binary
vector).
verbose (bool, default=False) – If True, progress bar will be shown for downloading or loading files.
Returns:
data – Depending on the as_frame argument, one of:
- Pandas DataFrame with columns: “SMILES”, “label”
- tuple of: list of strings (SMILES), NumPy array (labels)
The task is to predict octanol/water distribution coefficient (logD) at pH 7.4.
Targets are already log-transformed and, being the logarithm of a ratio, are unitless.
Tasks: 1
Task type: regression
Total samples: 4200
Recommended split: scaffold
Recommended metric: RMSE
Parameters:
data_dir ({None, str, path-like}, default=None) – Path to the root data directory. If None, currently set scikit-learn directory
is used, by default $HOME/scikit_learn_data.
as_frame (bool, default=False) – If True, returns the raw DataFrame with columns: “SMILES”, “label”. Otherwise,
returns SMILES as list of strings, and labels as a NumPy array (1D float vector).
verbose (bool, default=False) – If True, progress bar will be shown for downloading or loading files.
Returns:
data – Depending on the as_frame argument, one of:
- Pandas DataFrame with columns: “SMILES”, “label”
- tuple of: list of strings (SMILES), NumPy array (labels)
Load and return the MoleculeNet benchmark datasets.
Datasets cover varied molecular property prediction tasks: regression, single-task
classification, and multitask classification. Scaffold split is recommended for all of them,
following Open Graph Benchmark [2]_. They differ in recommended metrics. For more
details, see loading functions for particular datasets.
Often only a subset of those datasets is used for benchmarking, e.g. only
single-task datasets, or only classification datasets excluding PCBA (due to its
large size). A subset of datasets can be selected with the subset argument.
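Dataset names are also returned (case-sensitive). Datasets, grouped by task, are:
- regression: ESOL, FreeSolv, Lipophilicity
- single-task classification: BACE, BBBP, HIV
- multitask classification: ClinTox, MUV, SIDER, Tox21, ToxCast, PCBA
Parameters:
subset ({None, "regression", "classification", "classification_single_task", "classification_multitask", "classification_no_pcba"}) – If not None, returns the given subset of datasets.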
data_dir ({None, str, path-like}, default=None) – Path to the root data directory. If None, currently set scikit-learn directory
is used, by default $HOME/scikit_learn_data.
as_frames (bool, default=False) – If True, returns the raw DataFrame for each dataset. Otherwise, returns SMILES
as a list of strings, and labels as a NumPy array for each dataset.
verbose (bool, default=False) – If True, progress bar will be shown for downloading or loading files.
Returns:
data – Depending on the as_frames argument, one of:
- Pandas DataFrame with columns: “SMILES”, “label”
- tuple of: list of strings (SMILES), NumPy array (labels)
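The subset names can be read as filters over task type and task count. The sketch below makes that reading concrete using the task types and task counts listed in the per-dataset sections of this page. The table DATASET_TASKS and the helper select_datasets are hypothetical and only illustrate the selection logic; they are not the library's implementation, and the order of names is arbitrary.

```python
# (task type, number of tasks), as stated in the per-dataset sections.
DATASET_TASKS = {
    "ESOL": ("regression", 1),
    "FreeSolv": ("regression", 1),
    "Lipophilicity": ("regression", 1),
    "BACE": ("classification", 1),
    "BBBP": ("classification", 1),
    "HIV": ("classification", 1),
    "ClinTox": ("classification", 2),
    "MUV": ("classification", 17),
    "SIDER": ("classification", 27),
    "Tox21": ("classification", 12),
    "ToxCast": ("classification", 617),
    "PCBA": ("classification", 128),
}


def select_datasets(subset=None):
    """Return the dataset names matching a subset name (hypothetical helper)."""
    filters = {
        None: lambda name, kind, n: True,
        "regression": lambda name, kind, n: kind == "regression",
        "classification": lambda name, kind, n: kind == "classification",
        "classification_single_task": lambda name, kind, n: kind == "classification" and n == 1,
        "classification_multitask": lambda name, kind, n: kind == "classification" and n > 1,
        "classification_no_pcba": lambda name, kind, n: kind == "classification" and name != "PCBA",
    }
    if subset not in filters:
        raise ValueError(f"Unknown subset: {subset!r}")
    keep = filters[subset]
    return [name for name, (kind, n) in DATASET_TASKS.items() if keep(name, kind, n)]


print(select_datasets("classification_single_task"))  # ['BACE', 'BBBP', 'HIV']
```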
Load and return the MUV (Maximum Unbiased Validation) dataset.
The task is to predict 17 targets designed for validation of virtual screening
techniques, based on PubChem BioAssays. All tasks are binary.
Note that targets have missing values. Algorithms should be evaluated only on
present labels. For training data, you may want to impute them, e.g. with zeros.
Tasks: 17
Task type: multitask classification
Total samples: 93087
Recommended split: scaffold
Recommended metric: AUPRC, AUROC
Parameters:
data_dir ({None, str, path-like}, default=None) – Path to the root data directory. If None, currently set scikit-learn directory
is used, by default $HOME/scikit_learn_data.
as_frame (bool, default=False) – If True, returns the raw DataFrame with columns “SMILES” and 17 label columns,
with names corresponding to MUV targets (see [1]_ and [2]_ for details).
Otherwise, returns SMILES as list of strings, and labels as a NumPy array.
Labels are 2D NumPy float array with binary labels and missing values.
verbose (bool, default=False) – If True, progress bar will be shown for downloading or loading files.
Returns:
data – Depending on the as_frame argument, one of:
- Pandas DataFrame with columns “SMILES” and 17 label columns
- tuple of: list of strings (SMILES), NumPy array (labels)
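The note on missing values implies two separate steps: dropping missing labels per task before evaluation, and optionally imputing them for training. A minimal sketch follows, assuming missing labels are represented as NaN in a 2D float structure (rows are molecules, columns are tasks). Plain nested lists stand in for the NumPy array here, and the helper names mask_task and impute_zeros are hypothetical.

```python
import math


def mask_task(labels, scores, task):
    """Return labels and scores of one task, keeping only rows with a present label."""
    y_present, s_present = [], []
    for label_row, score_row in zip(labels, scores):
        if not math.isnan(label_row[task]):
            y_present.append(label_row[task])
            s_present.append(score_row[task])
    return y_present, s_present


def impute_zeros(labels):
    """Replace missing labels with 0.0, e.g. to build training targets."""
    return [[0.0 if math.isnan(y) else y for y in row] for row in labels]


# Made-up labels for 4 molecules and 2 tasks; NaN marks a missing label.
labels = [
    [1.0, math.nan],
    [0.0, 1.0],
    [math.nan, 0.0],
    [1.0, 0.0],
]
scores = [[0.9, 0.2], [0.3, 0.8], [0.5, 0.1], [0.7, 0.4]]

for task in range(len(labels[0])):
    y_task, s_task = mask_task(labels, scores, task)
    print(task, y_task, s_task)  # pass these to the chosen metric

train_labels = impute_zeros(labels)
```

Imputed labels should be used for training only; evaluating on them would count missing measurements as true negatives.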
Load and return the MoleculeNet dataset splits from Open Graph Benchmark (OGB) [1]_.
OGB uses precomputed scaffold split with 80/10/10% split between train/valid/test
subsets. Test set consists of the smallest scaffold groups, and follows MoleculeNet
paper [2]_. Those splits are widely used in literature.
Dataset names here are the same as returned by load_moleculenet_benchmark function,
and are case-sensitive.
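For intuition, a scaffold split of this kind can be sketched as a greedy assignment of scaffold groups, largest first, so that the smallest groups end up in the validation and test sets. The code below is a sketch under stated assumptions, not OGB's code: scaffolds are assumed to be precomputed strings (one per molecule), the function name scaffold_split is hypothetical, and the tie-breaking rule is arbitrary. To reproduce published results, load the precomputed splits instead of recomputing them.

```python
from collections import defaultdict


def scaffold_split(scaffolds, train_frac=0.8, valid_frac=0.1):
    """Greedy group split: largest scaffold groups fill train, then valid, rest is test."""
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Largest groups first; ties broken by first index to keep the result deterministic.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n = len(scaffolds)
    train_cutoff = train_frac * n
    valid_cutoff = (train_frac + valid_frac) * n

    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= train_cutoff:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cutoff:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test


# Made-up scaffold labels for 12 molecules: groups of size 5, 4, 1, 1, 1.
scaffolds = ["A"] * 5 + ["B"] * 4 + ["C", "D", "E"]
train_idx, valid_idx, test_idx = scaffold_split(scaffolds)
print(train_idx, valid_idx, test_idx)
```

All molecules sharing a scaffold land in the same subset, which is what makes the split harder than a random one.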
Parameters:
dataset_name ({"ESOL", "FreeSolv", "Lipophilicity", "BACE", "BBBP", "HIV", "ClinTox", "MUV", "SIDER", "Tox21", "ToxCast", "PCBA"}) – Name of the dataset to load splits for.
data_dir ({None, str, path-like}, default=None) – Path to the root data directory. If None, currently set scikit-learn directory
is used, by default $HOME/scikit_learn_data.
as_dict (bool, default=False) – If True, returns the splits as a dictionary with keys “train”, “valid” and “test”,
and index lists as values. Otherwise, returns three lists of split indexes.
verbose (bool, default=False) – If True, progress bar will be shown for downloading or loading files.
Returns:
data – Depending on the as_dict argument, one of:
- three lists of integer indexes
- dictionary with “train”, “valid” and “test” keys, and lists of integer indexes as values
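The returned indexes refer to positions in the corresponding dataset, so they can be applied directly to the SMILES list and the labels. A minimal sketch follows, with a made-up dictionary standing in for the as_dict=True output and plain lists standing in for the loaded data. The helper apply_split is hypothetical.

```python
def apply_split(smiles, labels, indexes):
    """Select the molecules and labels at the given positions."""
    return [smiles[i] for i in indexes], [labels[i] for i in indexes]


# Made-up data and split indexes, for illustration only.
smiles = ["CCO", "c1ccccc1", "CC(=O)O", "CCN", "CCC"]
labels = [0, 1, 0, 1, 0]
splits = {"train": [0, 1, 2], "valid": [3], "test": [4]}

subsets = {name: apply_split(smiles, labels, idx) for name, idx in splits.items()}
smiles_train, y_train = subsets["train"]
smiles_test, y_test = subsets["test"]
print(len(smiles_train), len(smiles_test))
```

With NumPy label arrays, the same selection can be written as integer-array indexing, e.g. labels[indexes].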
Load and return the PCBA (PubChem BioAssay) dataset.
The task is to predict biological activity against 128 bioassays, generated
by high-throughput screening (HTS). All tasks are binary active/non-active.
Note that targets have missing values. Algorithms should be evaluated only on
present labels. For training data, you may want to impute them, e.g. with zeros.
Tasks: 128
Task type: multitask classification
Total samples: 437929
Recommended split: scaffold
Recommended metric: AUPRC, AUROC
Parameters:
data_dir ({None, str, path-like}, default=None) – Path to the root data directory. If None, currently set scikit-learn directory
is used, by default $HOME/scikit_learn_data.
as_frame (bool, default=False) – If True, returns the raw DataFrame with columns “SMILES” and 128 label columns,
with names corresponding to biological activities (see [1]_ and [2]_ for details).
Otherwise, returns SMILES as list of strings, and labels as a NumPy array.
Labels are 2D NumPy float array with binary labels and missing values.
verbose (bool, default=False) – If True, progress bar will be shown for downloading or loading files.
Returns:
data – Depending on the as_frame argument, one of:
- Pandas DataFrame with columns “SMILES” and 128 label columns
- tuple of: list of strings (SMILES), NumPy array (labels)
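AUPRC is recommended alongside AUROC for PCBA and MUV. One common estimator of the area under the precision-recall curve is average precision, sketched below on made-up data. The sketch assumes binary labels with no missing values and no tied scores; the helper name average_precision is hypothetical, and a tested routine such as scikit-learn's average_precision_score would normally be used instead.

```python
def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive sample."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_pos = sum(1 for _, y in ranked if y == 1)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positive samples")
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision among the top `rank` predictions
    return total / n_pos


# Made-up labels and scores, for illustration only.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(average_precision(labels, scores))  # (1/1 + 2/3) / 2
```

For a multitask dataset, the metric would be computed per task on present labels and then averaged across tasks.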
Load and return the SIDER (Side Effect Resource) dataset.
The task is to predict adverse drug reactions (ADRs) as drug side effects to
27 system organ classes in MedDRA classification. All tasks are binary.
Tasks: 27
Task type: multitask classification
Total samples: 1427
Recommended split: scaffold
Recommended metric: AUROC
Parameters:
data_dir ({None, str, path-like}, default=None) – Path to the root data directory. If None, currently set scikit-learn directory
is used, by default $HOME/scikit_learn_data.
as_frame (bool, default=False) – If True, returns the raw DataFrame with columns “SMILES” and 27 label columns,
with names corresponding to MedDRA system organ classes (see [1]_ for details).
Otherwise, returns SMILES as list of strings, and labels as a NumPy array (2D
integer array).
verbose (bool, default=False) – If True, progress bar will be shown for downloading or loading files.
Returns:
data – Depending on the as_frame argument, one of:
- Pandas DataFrame with columns “SMILES” and 27 label columns
- tuple of: list of strings (SMILES), NumPy array (labels)
The task is to predict 12 toxicity targets, including nuclear receptors and
stress response pathways. All tasks are binary.
Note that targets have missing values. Algorithms should be evaluated only on
present labels. For training data, you may want to impute them, e.g. with zeros.
Tasks: 12
Task type: multitask classification
Total samples: 7831
Recommended split: scaffold
Recommended metric: AUROC
Parameters:
data_dir ({None, str, path-like}, default=None) – Path to the root data directory. If None, currently set scikit-learn directory
is used, by default $HOME/scikit_learn_data.
as_frame (bool, default=False) – If True, returns the raw DataFrame with columns “SMILES” and 12 label columns,
with names corresponding to toxicity targets (see [1]_ and [2]_ for details).
Otherwise, returns SMILES as list of strings, and labels as a NumPy array.
Labels are 2D NumPy float array with binary labels and missing values.
verbose (bool, default=False) – If True, progress bar will be shown for downloading or loading files.
Returns:
data – Depending on the as_frame argument, one of:
- Pandas DataFrame with columns “SMILES” and 12 label columns
- tuple of: list of strings (SMILES), NumPy array (labels)
The task is to predict 617 toxicity targets from a large library of compounds
based on in vitro high-throughput screening. All tasks are binary.
Note that targets have missing values. Algorithms should be evaluated only on
present labels. For training data, you may want to impute them, e.g. with zeros.
Tasks: 617
Task type: multitask classification
Total samples: 8576
Recommended split: scaffold
Recommended metric: AUROC
Parameters:
data_dir ({None, str, path-like}, default=None) – Path to the root data directory. If None, currently set scikit-learn directory
is used, by default $HOME/scikit_learn_data.
as_frame (bool, default=False) – If True, returns the raw DataFrame with columns “SMILES” and 617 label columns,
with names corresponding to toxicity targets (see [1]_ and [2]_ for details).
Otherwise, returns SMILES as list of strings, and labels as a NumPy array.
Labels are 2D NumPy float array with binary labels and missing values.
verbose (bool, default=False) – If True, progress bar will be shown for downloading or loading files.
Returns:
data – Depending on the as_frame argument, one of:
- Pandas DataFrame with columns “SMILES” and 617 label columns
- tuple of: list of strings (SMILES), NumPy array (labels)