load_moleculenet_benchmark

skfp.datasets.moleculenet.load_moleculenet_benchmark(subset: str | list[str] | None = None, data_dir: str | PathLike | None = None, as_frames: bool = False, verbose: bool = False) → Iterator[tuple[str, DataFrame]] | Iterator[tuple[str, list[str], ndarray]]

Load and return the MoleculeNet benchmark datasets.

The datasets cover varied molecular property prediction tasks: regression, single-task classification, and multitask classification. Scaffold split is recommended for all of them, following Open Graph Benchmark [1]. The recommended metrics differ between datasets. For more details, see the loading functions for the particular datasets.

Often only a subset of these datasets is used for benchmarking, e.g. only single-task datasets, or only classification datasets excluding PCBA (due to its large size). A subset of datasets can be selected with the subset argument.

Each dataset is returned together with its name. Dataset names are case-sensitive. Datasets, grouped by task, are:

  • regression: ESOL, FreeSolv, Lipophilicity

  • single-task classification: BACE, BBBP, HIV

  • multitask classification: ClinTox, MUV, SIDER, Tox21, ToxCast, PCBA
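The grouping above can be written down as a small lookup table. The following is only an illustrative sketch, not the library's implementation: the helper resolve_subset is hypothetical, the meaning of each subset string is inferred from its name and from the description above, and the order of names is not claimed to match the order in which the loader yields datasets.

```python
# Grouping copied from the dataset list above.
REGRESSION = ["ESOL", "FreeSolv", "Lipophilicity"]
SINGLE_TASK = ["BACE", "BBBP", "HIV"]
MULTITASK = ["ClinTox", "MUV", "SIDER", "Tox21", "ToxCast", "PCBA"]

# Assumed meaning of each subset string, inferred from its name.
SUBSETS = {
    "regression": REGRESSION,
    "classification": SINGLE_TASK + MULTITASK,
    "classification_single_task": SINGLE_TASK,
    "classification_multitask": MULTITASK,
    "classification_no_pcba": [
        name for name in SINGLE_TASK + MULTITASK if name != "PCBA"
    ],
}


def resolve_subset(subset=None):
    """Hypothetical helper: map a subset argument to a list of dataset names."""
    if subset is None:
        # None selects every dataset.
        return REGRESSION + SINGLE_TASK + MULTITASK
    if isinstance(subset, str):
        # A string selects one of the predefined groups.
        return list(SUBSETS[subset])
    # A list of strings is taken as explicit (case-sensitive) dataset names.
    return list(subset)
```

For example, resolve_subset("classification_no_pcba") gives the nine classification datasets without PCBA, which is the kind of selection described above for avoiding the largest dataset.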

Parameters:
  • subset ({None, "regression", "classification", "classification_single_task", "classification_multitask", "classification_no_pcba"} or list of strings, default=None) – If None, returns all datasets. A string loads only the given subset of all datasets. A list of strings loads only the datasets with the given names.

  • data_dir ({None, str, path-like}, default=None) – Path to the root data directory. If None, the currently set scikit-learn data directory is used, which by default is $HOME/scikit_learn_data.

  • as_frames (bool, default=False) – If True, returns the raw DataFrame for each dataset. Otherwise, returns SMILES as a list of strings, and labels as a NumPy array for each dataset.

  • verbose (bool, default=False) – If True, a progress bar is shown while downloading or loading files.

Returns:

data – Loads and returns datasets with a generator. Each yielded item starts with the dataset name. The remaining types depend on the as_frames parameter, either:

  • Pandas DataFrame with columns: "SMILES", "label"

  • tuple of: list of strings (SMILES), NumPy array (labels)

Return type:

generator of tuples (str, pd.DataFrame) or tuples (str, list[str], np.ndarray)
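Because the datasets come from a generator, they are typically consumed one at a time in a loop. The sketch below assumes the default as_frames=False, so each item is a (name, SMILES list, labels) tuple as given in the signature. The names summarize and print_regression_summary are hypothetical, and the handling of 1D versus 2D label arrays is an assumption about label shapes, not documented behaviour.

```python
from collections.abc import Iterable, Iterator


def summarize(datasets: Iterable[tuple]) -> Iterator[tuple[str, int, int]]:
    """Yield (name, number of molecules, number of tasks) for each dataset."""
    for name, smiles, y in datasets:
        # Labels are NumPy arrays in the real loader; fall back to len() so
        # that plain sequences also work in this sketch.
        shape = tuple(y.shape) if hasattr(y, "shape") else (len(y),)
        # Assumption: 1D labels mean a single task, 2D labels one column per task.
        n_tasks = shape[1] if len(shape) > 1 else 1
        yield name, len(smiles), n_tasks


def print_regression_summary() -> None:
    """Not called here: needs scikit-fingerprints and may download data."""
    from skfp.datasets.moleculenet import load_moleculenet_benchmark

    for row in summarize(load_moleculenet_benchmark(subset="regression")):
        print(row)
```

Keeping the summary logic separate from the loader call means it can be tried on small in-memory tuples first, before any download is triggered.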

References