load_moleculenet_benchmark

skfp.datasets.moleculenet.load_moleculenet_benchmark(subset: str | list[str] | None = None, data_dir: str | PathLike | None = None, as_frames: bool = False, verbose: bool = False) → Iterator[tuple[str, DataFrame]] | Iterator[tuple[str, list[str], ndarray]]

Load the MoleculeNet benchmark datasets.

The datasets cover varied molecular property prediction tasks: regression, single-task classification, and multitask classification. Scaffold split is recommended for all of them, following Open Graph Benchmark [1]. The recommended metrics differ between datasets. For more details, see the loading functions for particular datasets.

Often only a subset of these datasets is used for benchmarking, e.g. only single-task datasets, or only classification datasets excluding PCBA (due to its large size). A subset of datasets can be selected with the subset argument.

Each dataset is returned together with its name; names are case-sensitive. Datasets, grouped by task, are:

regression: ESOL, FreeSolv, Lipophilicity
single-task classification: BACE, BBBP, HIV
multitask classification: ClinTox, MUV, SIDER, Tox21, ToxCast, PCBA
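
The grouping above can be written down as a small lookup, which may help when deciding what to pass as subset. This is only a sketch: DATASETS_BY_TASK and resolve_subset are hypothetical names, not part of the library, and the sketch assumes that "classification" means single-task plus multitask datasets and that "classification_no_pcba" is the same set without PCBA. The order of the returned names is also an assumption.

```python
# Illustrative only: the grouping is copied from the list above,
# not read from scikit-fingerprints itself.
DATASETS_BY_TASK = {
    "regression": ["ESOL", "FreeSolv", "Lipophilicity"],
    "classification_single_task": ["BACE", "BBBP", "HIV"],
    "classification_multitask": [
        "ClinTox", "MUV", "SIDER", "Tox21", "ToxCast", "PCBA",
    ],
}


def resolve_subset(subset=None):
    """Hypothetical helper: map a subset argument to dataset names."""
    single = DATASETS_BY_TASK["classification_single_task"]
    multi = DATASETS_BY_TASK["classification_multitask"]
    all_names = DATASETS_BY_TASK["regression"] + single + multi

    if subset is None:
        # No selection: every dataset in the benchmark.
        return list(all_names)
    if isinstance(subset, list):
        # Explicit dataset names; matching is case-sensitive.
        unknown = [name for name in subset if name not in all_names]
        if unknown:
            raise ValueError(f"Unknown dataset names: {unknown}")
        return list(subset)
    if subset == "classification":
        # Assumption: both classification groups combined.
        return single + multi
    if subset == "classification_no_pcba":
        # Assumption: all classification datasets except PCBA.
        return [name for name in single + multi if name != "PCBA"]
    if subset in DATASETS_BY_TASK:
        return list(DATASETS_BY_TASK[subset])
    raise ValueError(f"Unknown subset: {subset!r}")
```

For example, resolve_subset("classification_no_pcba") lists the classification datasets that remain once PCBA is dropped for its size.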

Parameters:

subset ({None, "regression", "classification", "classification_single_task", "classification_multitask", "classification_no_pcba"} or list of strings, default=None) – If None, returns all datasets. A string loads only the given subset of datasets. A list of strings loads only the datasets with the given names.
data_dir ({None, str, path-like}, default=None) – Path to the root data directory. If None, the currently set scikit-learn data directory is used, by default $HOME/scikit_learn_data.
as_frames (bool, default=False) – If True, returns the raw DataFrame for each dataset. Otherwise, returns SMILES as a list of strings, and labels as a NumPy array for each dataset.
verbose (bool, default=False) – If True, progress bar will be shown for downloading or loading files.

Returns:

data – Loads and returns datasets with a generator. Per the signature, each yielded item starts with the dataset name. The remaining types depend on the as_frames parameter, either:
- Pandas DataFrame with columns: "SMILES", "label"
- list of strings (SMILES) and NumPy array (labels)

Return type:

generator of pd.DataFrame or tuples (list[str], np.ndarray)
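
A minimal sketch of consuming the generator with the default as_frames=False, assuming each yielded item is a (name, SMILES list, labels) tuple as the signature indicates. count_molecules and count_regression_molecules are hypothetical helper names introduced for illustration. The import path is the one shown in the signature above, and calling the loader may download the datasets on first use.

```python
def count_molecules(datasets):
    """Return {dataset name: number of SMILES strings}.

    Expects an iterable of (name, smiles, labels) tuples, the shape
    the signature gives for as_frames=False.
    """
    counts = {}
    for name, smiles, labels in datasets:
        # labels is unused here; it is unpacked only to show the tuple shape.
        counts[name] = len(smiles)
    return counts


def count_regression_molecules():
    """Run count_molecules over the regression subset of the benchmark."""
    # Imported lazily so count_molecules works without the library installed.
    from skfp.datasets.moleculenet import load_moleculenet_benchmark

    return count_molecules(load_moleculenet_benchmark(subset="regression"))
```

Because the loader is a generator, datasets are produced one at a time as the loop advances, so a helper like this never needs to hold the whole benchmark in memory at once.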

References
