skfp.datasets.moleculenet

Functions

load_bace([data_dir, as_frame, verbose])

Load and return the BACE dataset.

load_bbbp([data_dir, as_frame, verbose])

Load and return the BBBP (Blood-Brain Barrier Penetration) dataset.

load_clintox([data_dir, as_frame, verbose])

Load and return the ClinTox dataset.

load_esol([data_dir, as_frame, verbose])

Load and return the ESOL (Estimated SOLubility) dataset.

load_freesolv([data_dir, as_frame, verbose])

Load and return the FreeSolv (Free Solvation Database) dataset.

load_hiv([data_dir, as_frame, verbose])

Load and return the HIV dataset.

load_lipophilicity([data_dir, as_frame, verbose])

Load and return the Lipophilicity dataset.

load_moleculenet_benchmark([subset, ...])

Load and return the MoleculeNet benchmark datasets.

load_muv([data_dir, as_frame, verbose])

Load and return the MUV (Maximum Unbiased Validation) dataset.

load_ogb_splits(dataset_name[, data_dir, ...])

Load and return the MoleculeNet dataset splits from Open Graph Benchmark (OGB) [1]_.

load_pcba([data_dir, as_frame, verbose])

Load and return the PCBA (PubChem BioAssay) dataset.

load_sider([data_dir, as_frame, verbose])

Load and return the SIDER (Side Effect Resource) dataset.

load_tox21([data_dir, as_frame, verbose])

Load and return the Tox21 dataset.

load_toxcast([data_dir, as_frame, verbose])

Load and return the ToxCast dataset.

skfp.datasets.moleculenet.load_bace(data_dir: str | PathLike | None = None, as_frame: bool = False, verbose: bool = False) → DataFrame | tuple[list[str], ndarray]

Load and return the BACE dataset.

The task is to predict binding results for a set of inhibitors of human β-secretase 1 (BACE-1).

Tasks: 1
Task type: classification
Total samples: 1513
Recommended split: scaffold
Recommended metric: AUROC

Parameters:
  • data_dir ({None, str, path-like}, default=None) – Path to the root data directory. If None, currently set scikit-learn directory is used, by default $HOME/scikit_learn_data.

  • as_frame (bool, default=False) – If True, returns the raw DataFrame with columns: “SMILES”, “label”. Otherwise, returns SMILES as list of strings, and labels as a NumPy array (1D integer binary vector).

  • verbose (bool, default=False) – If True, progress bar will be shown for downloading or loading files.

Returns:

data – Depending on the as_frame argument, one of:
  • Pandas DataFrame with columns: “SMILES”, “label”
  • tuple of: list of strings (SMILES), NumPy array (labels)

Return type:

pd.DataFrame or tuple(list[str], np.ndarray)
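The two return modes described above can be sketched as follows. `load_bace` and its parameters come from this page; `check_single_task` and `load_bace_both_ways` are hypothetical helper names introduced only for illustration. The loader is imported inside the wrapper because calling it may download the dataset.

```python
def check_single_task(smiles, labels):
    """Sanity-check a single-task binary dataset and return the positive rate."""
    if len(smiles) != len(labels):
        raise ValueError("SMILES and labels must have the same length")
    if any(y not in (0, 1) for y in labels):
        raise ValueError("labels must be binary (0 or 1)")
    # Fraction of positive samples, a quick look at class imbalance.
    return sum(int(y) for y in labels) / len(labels)


def load_bace_both_ways(data_dir=None):
    """Hypothetical wrapper: load BACE in both documented return modes."""
    # Imported lazily, since the call may download files into data_dir.
    from skfp.datasets.moleculenet import load_bace

    # Default mode: list of SMILES strings and a 1D NumPy label array.
    smiles, labels = load_bace(data_dir=data_dir)
    # as_frame=True: raw DataFrame with "SMILES" and "label" columns.
    df = load_bace(data_dir=data_dir, as_frame=True)
    return smiles, labels, df
```

Calling `check_single_task(smiles, labels)` on the first two values returned by `load_bace_both_ways()` should work unchanged for the other single-task classification loaders, assuming they follow the same return convention documented here.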


skfp.datasets.moleculenet.load_bbbp(data_dir: str | PathLike | None = None, as_frame: bool = False, verbose: bool = False) → DataFrame | tuple[list[str], ndarray]

Load and return the BBBP (Blood-Brain Barrier Penetration) dataset.

The task is to predict blood-brain barrier penetration (barrier permeability) of small drug-like molecules.

Tasks: 1
Task type: classification
Total samples: 2039
Recommended split: scaffold
Recommended metric: AUROC

Parameters:
  • data_dir ({None, str, path-like}, default=None) – Path to the root data directory. If None, currently set scikit-learn directory is used, by default $HOME/scikit_learn_data.

  • as_frame (bool, default=False) – If True, returns the raw DataFrame with columns: “SMILES”, “label”. Otherwise, returns SMILES as list of strings, and labels as a NumPy array (1D integer binary vector).

  • verbose (bool, default=False) – If True, progress bar will be shown for downloading or loading files.

Returns:

data – Depending on the as_frame argument, one of:
  • Pandas DataFrame with columns: “SMILES”, “label”
  • tuple of: list of strings (SMILES), NumPy array (labels)

Return type:

pd.DataFrame or tuple(list[str], np.ndarray)


skfp.datasets.moleculenet.load_clintox(data_dir: str | PathLike | None = None, as_frame: bool = False, verbose: bool = False) → DataFrame | tuple[list[str], ndarray]

Load and return the ClinTox dataset.

The task is to predict drug approval viability, by predicting clinical trial toxicity and final FDA approval status. Both tasks are binary.

Tasks: 2
Task type: multitask classification
Total samples: 1477
Recommended split: scaffold
Recommended metric: AUROC

Parameters:
  • data_dir ({None, str, path-like}, default=None) – Path to the root data directory. If None, currently set scikit-learn directory is used, by default $HOME/scikit_learn_data.

  • as_frame (bool, default=False) – If True, returns the raw DataFrame with columns “SMILES” and 2 label columns, FDA approval and clinical trial toxicity. Otherwise, returns SMILES as list of strings, and labels as a NumPy array (2D integer array).

  • verbose (bool, default=False) – If True, progress bar will be shown for downloading or loading files.

Returns:

data – Depending on the as_frame argument, one of:
  • Pandas DataFrame with columns “SMILES” and 2 label columns
  • tuple of: list of strings (SMILES), NumPy array (labels)

Return type:

pd.DataFrame or tuple(list[str], np.ndarray)


skfp.datasets.moleculenet.load_esol(data_dir: str | PathLike | None = None, as_frame: bool = False, verbose: bool = False) → DataFrame | tuple[list[str], ndarray]

Load and return the ESOL (Estimated SOLubility) dataset.

The task is to predict aqueous solubility. Targets are log-transformed, and the unit is log moles per litre (log mol/L).

Tasks: 1
Task type: regression
Total samples: 1128
Recommended split: scaffold
Recommended metric: RMSE

Parameters:
  • data_dir ({None, str, path-like}, default=None) – Path to the root data directory. If None, currently set scikit-learn directory is used, by default $HOME/scikit_learn_data.

  • as_frame (bool, default=False) – If True, returns the raw DataFrame with columns: “SMILES”, “label”. Otherwise, returns SMILES as list of strings, and labels as a NumPy array (1D float vector).

  • verbose (bool, default=False) – If True, progress bar will be shown for downloading or loading files.

Returns:

data – Depending on the as_frame argument, one of:
  • Pandas DataFrame with columns: “SMILES”, “label”
  • tuple of: list of strings (SMILES), NumPy array (labels)

Return type:

pd.DataFrame or tuple(list[str], np.ndarray)
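For the regression datasets, the recommended metric is RMSE, the square root of the mean squared difference between targets and predictions. The sketch below spells out that formula in plain Python; `rmse` is a hypothetical helper written for illustration and is not part of skfp. In practice, scikit-learn's metrics module provides an equivalent.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences of floats."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # Squared error per sample, then mean, then square root.
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

Because the ESOL targets are in log mol/L, an RMSE computed this way is expressed in the same unit.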


skfp.datasets.moleculenet.load_freesolv(data_dir: str | PathLike | None = None, as_frame: bool = False, verbose: bool = False) → DataFrame | tuple[list[str], ndarray]

Load and return the FreeSolv (Free Solvation Database) dataset.

The task is to predict hydration free energy of small molecules in water. Targets are in kcal/mol.

Tasks: 1
Task type: regression
Total samples: 642
Recommended split: scaffold
Recommended metric: RMSE

Parameters:
  • data_dir ({None, str, path-like}, default=None) – Path to the root data directory. If None, currently set scikit-learn directory is used, by default $HOME/scikit_learn_data.

  • as_frame (bool, default=False) – If True, returns the raw DataFrame with columns: “SMILES”, “label”. Otherwise, returns SMILES as list of strings, and labels as a NumPy array (1D float vector).

  • verbose (bool, default=False) – If True, progress bar will be shown for downloading or loading files.

Returns:

data – Depending on the as_frame argument, one of:
  • Pandas DataFrame with columns: “SMILES”, “label”
  • tuple of: list of strings (SMILES), NumPy array (labels)

Return type:

pd.DataFrame or tuple(list[str], np.ndarray)


skfp.datasets.moleculenet.load_hiv(data_dir: str | PathLike | None = None, as_frame: bool = False, verbose: bool = False) → DataFrame | tuple[list[str], ndarray]

Load and return the HIV dataset.

The task is to predict ability of molecules to inhibit HIV replication.

Tasks: 1
Task type: classification
Total samples: 41127
Recommended split: scaffold
Recommended metric: AUROC

Parameters:
  • data_dir ({None, str, path-like}, default=None) – Path to the root data directory. If None, currently set scikit-learn directory is used, by default $HOME/scikit_learn_data.

  • as_frame (bool, default=False) – If True, returns the raw DataFrame with columns: “SMILES”, “label”. Otherwise, returns SMILES as list of strings, and labels as a NumPy array (1D integer binary vector).

  • verbose (bool, default=False) – If True, progress bar will be shown for downloading or loading files.

Returns:

data – Depending on the as_frame argument, one of:
  • Pandas DataFrame with columns: “SMILES”, “label”
  • tuple of: list of strings (SMILES), NumPy array (labels)

Return type:

pd.DataFrame or tuple(list[str], np.ndarray)


skfp.datasets.moleculenet.load_lipophilicity(data_dir: str | PathLike | None = None, as_frame: bool = False, verbose: bool = False) → DataFrame | tuple[list[str], ndarray]

Load and return the Lipophilicity dataset.

The task is to predict octanol/water distribution coefficient (logD) at pH 7.4. Targets are already log transformed, and are a unitless ratio.

Tasks: 1
Task type: regression
Total samples: 4200
Recommended split: scaffold
Recommended metric: RMSE

Parameters:
  • data_dir ({None, str, path-like}, default=None) – Path to the root data directory. If None, currently set scikit-learn directory is used, by default $HOME/scikit_learn_data.

  • as_frame (bool, default=False) – If True, returns the raw DataFrame with columns: “SMILES”, “label”. Otherwise, returns SMILES as list of strings, and labels as a NumPy array (1D float vector).

  • verbose (bool, default=False) – If True, progress bar will be shown for downloading or loading files.

Returns:

data – Depending on the as_frame argument, one of:
  • Pandas DataFrame with columns: “SMILES”, “label”
  • tuple of: list of strings (SMILES), NumPy array (labels)

Return type:

pd.DataFrame or tuple(list[str], np.ndarray)


skfp.datasets.moleculenet.load_moleculenet_benchmark(subset: str | None = None, data_dir: str | PathLike | None = None, as_frames: bool = False, verbose: bool = False) → list[tuple[str, DataFrame]] | list[tuple[str, list[str], ndarray]]

Load and return the MoleculeNet benchmark datasets.

Datasets have varied molecular property prediction tasks: regression, single-task, and multitask classification. Scaffold split is recommended for all of them, following Open Graph Benchmark [2]_. They differ in recommended metrics. For more details, see loading functions for particular datasets.

Often only a subset of those datasets is used for benchmarking, e.g. only single-task datasets, or only classification datasets excluding PCBA (due to its large size). A subset of datasets can be selected with the subset argument.

Dataset names are also returned (case-sensitive). Datasets, grouped by task, are:

  • regression: ESOL, FreeSolv, Lipophilicity

  • single-task classification: BACE, BBBP, HIV

  • multitask classification: ClinTox, MUV, SIDER, Tox21, ToxCast, PCBA

Parameters:
  • subset ({None, "regression", "classification", "classification_single_task", "classification_multitask", "classification_no_pcba"}, default=None) – If not None, returns the given subset of datasets.

  • data_dir ({None, str, path-like}, default=None) – Path to the root data directory. If None, currently set scikit-learn directory is used, by default $HOME/scikit_learn_data.

  • as_frames (bool, default=False) – If True, returns the raw DataFrame for each dataset. Otherwise, returns SMILES as a list of strings, and labels as a NumPy array for each dataset.

  • verbose (bool, default=False) – If True, progress bar will be shown for downloading or loading files.

Returns:

data – Depending on the as_frames argument, one of:
  • list of tuples: dataset name, Pandas DataFrame
  • list of tuples: dataset name, list of strings (SMILES), NumPy array (labels)

Return type:

list[tuple[str, pd.DataFrame]] or list[tuple[str, list[str], np.ndarray]]
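As a rough illustration of how the subset argument relates to the dataset groups listed above, the sketch below maps each subset value to dataset names and then iterates over the loaded benchmark. Per the function signature, each returned item is a tuple that starts with the dataset name. `expected_datasets` and `describe_benchmark` are hypothetical helpers; the mapping is inferred from this page (including the ordering, which is an assumption) and is not the library's implementation.

```python
# Dataset groupings copied from the list on this page.
REGRESSION = ["ESOL", "FreeSolv", "Lipophilicity"]
SINGLE_TASK = ["BACE", "BBBP", "HIV"]
MULTITASK = ["ClinTox", "MUV", "SIDER", "Tox21", "ToxCast", "PCBA"]


def expected_datasets(subset=None):
    """Illustrative guess at which dataset names each subset value selects."""
    classification = SINGLE_TASK + MULTITASK
    groups = {
        None: REGRESSION + classification,
        "regression": REGRESSION,
        "classification": classification,
        "classification_single_task": SINGLE_TASK,
        "classification_multitask": MULTITASK,
        "classification_no_pcba": [n for n in classification if n != "PCBA"],
    }
    if subset not in groups:
        raise ValueError(f"unknown subset: {subset!r}")
    return groups[subset]


def describe_benchmark(subset="regression"):
    """Hypothetical helper: map dataset name to (sample count, label shape)."""
    # Imported lazily, since the call may download the datasets.
    from skfp.datasets.moleculenet import load_moleculenet_benchmark

    summary = {}
    # With as_frames=False, each item is (name, SMILES list, label array).
    for name, smiles, labels in load_moleculenet_benchmark(subset=subset):
        summary[name] = (len(smiles), labels.shape)
    return summary
```

Starting with a small subset such as "regression" keeps the first download short; the full benchmark includes PCBA, which is by far the largest dataset.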

References

skfp.datasets.moleculenet.load_muv(data_dir: str | PathLike | None = None, as_frame: bool = False, verbose: bool = False) → DataFrame | tuple[list[str], ndarray]

Load and return the MUV (Maximum Unbiased Validation) dataset.

The task is to predict 17 targets designed for validation of virtual screening techniques, based on PubChem BioAssays. All tasks are binary.

Note that targets have missing values. Algorithms should be evaluated only on present labels. For training data, you may want to impute them, e.g. with zeros.
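The note above can be sketched in plain Python: drop missing labels before scoring a task, and replace them with zeros only for training. `present_pairs` and `impute_zeros` are hypothetical helpers, and the sketch assumes missing labels are stored as NaN, which matches the float label arrays described for this loader.

```python
import math


def present_pairs(y_true, y_score):
    """Keep only (label, score) pairs whose label is present (not NaN)."""
    return [(t, s) for t, s in zip(y_true, y_score) if not math.isnan(t)]


def impute_zeros(label_rows):
    """Return a copy of a 2D label table with missing (NaN) entries set to 0."""
    return [
        [0.0 if math.isnan(v) else float(v) for v in row]
        for row in label_rows
    ]
```

For one task, `present_pairs` would be applied to a single label column and the matching predicted scores before computing AUROC or AUPRC. On NumPy arrays the same operations are usually written with `np.isnan` for masking and `np.nan_to_num(labels, nan=0.0)` for imputation.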

Tasks: 17
Task type: multitask classification
Total samples: 93087
Recommended split: scaffold
Recommended metric: AUPRC, AUROC

Parameters:
  • data_dir ({None, str, path-like}, default=None) – Path to the root data directory. If None, currently set scikit-learn directory is used, by default $HOME/scikit_learn_data.

  • as_frame (bool, default=False) – If True, returns the raw DataFrame with columns “SMILES” and 17 label columns, with names corresponding to MUV targets (see [1]_ and [2]_ for details). Otherwise, returns SMILES as list of strings, and labels as a NumPy array. Labels are 2D NumPy float array with binary labels and missing values.

  • verbose (bool, default=False) – If True, progress bar will be shown for downloading or loading files.

Returns:

data – Depending on the as_frame argument, one of:
  • Pandas DataFrame with columns “SMILES” and 17 label columns
  • tuple of: list of strings (SMILES), NumPy array (labels)

Return type:

pd.DataFrame or tuple(list[str], np.ndarray)

References

skfp.datasets.moleculenet.load_ogb_splits(dataset_name: str, data_dir: str | PathLike | None = None, as_dict: bool = False, verbose: bool = False) → tuple[list[int], list[int], list[int]] | dict[str, list[int]]

Load and return the MoleculeNet dataset splits from Open Graph Benchmark (OGB) [1]_.

OGB uses a precomputed scaffold split with an 80/10/10% division into train/valid/test subsets. The test set consists of the smallest scaffold groups, following the MoleculeNet paper [2]_. These splits are widely used in the literature.

Dataset names here are the same as those returned by the load_moleculenet_benchmark function, and are case-sensitive.

Parameters:
  • dataset_name ({"ESOL", "FreeSolv", "Lipophilicity", "BACE", "BBBP", "HIV", "ClinTox", "MUV", "SIDER", "Tox21", "ToxCast", "PCBA"}) – Name of the dataset to load splits for.

  • data_dir ({None, str, path-like}, default=None) – Path to the root data directory. If None, currently set scikit-learn directory is used, by default $HOME/scikit_learn_data.

  • as_dict (bool, default=False) – If True, returns the splits as a dictionary with keys “train”, “valid” and “test”, and index lists as values. Otherwise, returns three lists of split indexes.

  • verbose (bool, default=False) – If True, progress bar will be shown for downloading or loading files.

Returns:

data – Depending on the as_dict argument, one of:
  • three lists of integer indexes
  • dictionary with “train”, “valid” and “test” keys, and lists of split indexes as values

Return type:

tuple(list[int], list[int], list[int]) or dict
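A minimal sketch of applying the returned index lists to a loaded dataset. It assumes the indexes refer to row positions in the output of the matching loader (here `load_bace`); that is an assumption of this example, not something this page states. `take`, `split_fractions` and `load_bace_with_ogb_split` are hypothetical helper names.

```python
def take(items, indices):
    """Select elements of a sequence by a list of integer positions."""
    return [items[i] for i in indices]


def split_fractions(train_idx, valid_idx, test_idx):
    """Fraction of samples in each subset, e.g. to check the 80/10/10 ratio."""
    total = len(train_idx) + len(valid_idx) + len(test_idx)
    return (len(train_idx) / total, len(valid_idx) / total, len(test_idx) / total)


def load_bace_with_ogb_split():
    """Hypothetical helper: BACE divided into train/valid/test by OGB indexes."""
    # Imported lazily, since both calls may download files.
    from skfp.datasets.moleculenet import load_bace, load_ogb_splits

    smiles, labels = load_bace()
    # Dataset names are case-sensitive, as noted above.
    train_idx, valid_idx, test_idx = load_ogb_splits("BACE")

    subsets = {}
    for name, idx in (("train", train_idx), ("valid", valid_idx), ("test", test_idx)):
        # SMILES are a plain list; labels are a NumPy array indexed by a list.
        subsets[name] = (take(smiles, idx), labels[idx])
    return subsets
```

With `as_dict=True`, the same index lists would instead be read from the “train”, “valid” and “test” keys of the returned dictionary.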

References

skfp.datasets.moleculenet.load_pcba(data_dir: str | PathLike | None = None, as_frame: bool = False, verbose: bool = False) → DataFrame | tuple[list[str], ndarray]

Load and return the PCBA (PubChem BioAssay) dataset.

The task is to predict biological activity against 128 bioassays, generated by high-throughput screening (HTS). All tasks are binary active/non-active.

Note that targets have missing values. Algorithms should be evaluated only on present labels. For training data, you may want to impute them, e.g. with zeros.

Tasks: 128
Task type: multitask classification
Total samples: 437929
Recommended split: scaffold
Recommended metric: AUPRC, AUROC

Parameters:
  • data_dir ({None, str, path-like}, default=None) – Path to the root data directory. If None, currently set scikit-learn directory is used, by default $HOME/scikit_learn_data.

  • as_frame (bool, default=False) – If True, returns the raw DataFrame with columns “SMILES” and 128 label columns, with names corresponding to biological activities (see [1]_ and [2]_ for details). Otherwise, returns SMILES as list of strings, and labels as a NumPy array. Labels are 2D NumPy float array with binary labels and missing values.

  • verbose (bool, default=False) – If True, progress bar will be shown for downloading or loading files.

Returns:

data – Depending on the as_frame argument, one of:
  • Pandas DataFrame with columns “SMILES” and 128 label columns
  • tuple of: list of strings (SMILES), NumPy array (labels)

Return type:

pd.DataFrame or tuple(list[str], np.ndarray)

References

skfp.datasets.moleculenet.load_sider(data_dir: str | PathLike | None = None, as_frame: bool = False, verbose: bool = False) → DataFrame | tuple[list[str], ndarray]

Load and return the SIDER (Side Effect Resource) dataset.

The task is to predict adverse drug reactions (ADRs) as drug side effects to 27 system organ classes in MedDRA classification. All tasks are binary.

Tasks: 27
Task type: multitask classification
Total samples: 1427
Recommended split: scaffold
Recommended metric: AUROC

Parameters:
  • data_dir ({None, str, path-like}, default=None) – Path to the root data directory. If None, currently set scikit-learn directory is used, by default $HOME/scikit_learn_data.

  • as_frame (bool, default=False) – If True, returns the raw DataFrame with columns “SMILES” and 27 label columns, with names corresponding to MedDRA system organ classes (see [1]_ for details). Otherwise, returns SMILES as list of strings, and labels as a NumPy array (2D integer array).

  • verbose (bool, default=False) – If True, progress bar will be shown for downloading or loading files.

Returns:

data – Depending on the as_frame argument, one of:
  • Pandas DataFrame with columns “SMILES” and 27 label columns
  • tuple of: list of strings (SMILES), NumPy array (labels)

Return type:

pd.DataFrame or tuple(list[str], np.ndarray)

References

skfp.datasets.moleculenet.load_tox21(data_dir: str | PathLike | None = None, as_frame: bool = False, verbose: bool = False) → DataFrame | tuple[list[str], ndarray]

Load and return the Tox21 dataset.

The task is to predict 12 toxicity targets, including nuclear receptors and stress response pathways. All tasks are binary.

Note that targets have missing values. Algorithms should be evaluated only on present labels. For training data, you may want to impute them, e.g. with zeros.

Tasks: 12
Task type: multitask classification
Total samples: 7831
Recommended split: scaffold
Recommended metric: AUROC

Parameters:
  • data_dir ({None, str, path-like}, default=None) – Path to the root data directory. If None, currently set scikit-learn directory is used, by default $HOME/scikit_learn_data.

  • as_frame (bool, default=False) – If True, returns the raw DataFrame with columns “SMILES” and 12 label columns, with names corresponding to toxicity targets (see [1]_ and [2]_ for details). Otherwise, returns SMILES as list of strings, and labels as a NumPy array. Labels are 2D NumPy float array with binary labels and missing values.

  • verbose (bool, default=False) – If True, progress bar will be shown for downloading or loading files.

Returns:

data – Depending on the as_frame argument, one of:
  • Pandas DataFrame with columns “SMILES” and 12 label columns
  • tuple of: list of strings (SMILES), NumPy array (labels)

Return type:

pd.DataFrame or tuple(list[str], np.ndarray)

References

skfp.datasets.moleculenet.load_toxcast(data_dir: str | PathLike | None = None, as_frame: bool = False, verbose: bool = False) → DataFrame | tuple[list[str], ndarray]

Load and return the ToxCast dataset.

The task is to predict 617 toxicity targets from a large library of compounds based on in vitro high-throughput screening. All tasks are binary.

Note that targets have missing values. Algorithms should be evaluated only on present labels. For training data, you may want to impute them, e.g. with zeros.

Tasks: 617
Task type: multitask classification
Total samples: 8576
Recommended split: scaffold
Recommended metric: AUROC

Parameters:
  • data_dir ({None, str, path-like}, default=None) – Path to the root data directory. If None, currently set scikit-learn directory is used, by default $HOME/scikit_learn_data.

  • as_frame (bool, default=False) – If True, returns the raw DataFrame with columns “SMILES” and 617 label columns, with names corresponding to toxicity targets (see [1]_ and [2]_ for details). Otherwise, returns SMILES as list of strings, and labels as a NumPy array. Labels are 2D NumPy float array with binary labels and missing values.

  • verbose (bool, default=False) – If True, progress bar will be shown for downloading or loading files.

Returns:

data – Depending on the as_frame argument, one of:
  • Pandas DataFrame with columns “SMILES” and 617 label columns
  • tuple of: list of strings (SMILES), NumPy array (labels)

Return type:

pd.DataFrame or tuple(list[str], np.ndarray)

References