load_moleculenet_benchmark
- skfp.datasets.moleculenet.load_moleculenet_benchmark(subset: str | list[str] | None = None, data_dir: str | PathLike | None = None, as_frames: bool = False, verbose: bool = False) → Iterator[tuple[str, DataFrame]] | Iterator[tuple[str, list[str], ndarray]]
Load and return the MoleculeNet benchmark datasets.
Datasets have varied molecular property prediction tasks: regression, single-task, and multitask classification. Scaffold split is recommended for all of them, following Open Graph Benchmark [1]. They differ in recommended metrics. For more details, see loading functions for particular datasets.
Often only a subset of those datasets is used for benchmarking, e.g. only single-task datasets, or only classification datasets excluding PCBA (due to its large size). A subset of datasets can be selected with the subset argument. Dataset names (case-sensitive) are also returned. Datasets, grouped by task, are:
regression: ESOL, FreeSolv, Lipophilicity
single-task classification: BACE, BBBP, HIV
multitask classification: ClinTox, MUV, SIDER, Tox21, ToxCast, PCBA
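The grouping above, combined with the named subsets accepted by the subset argument, can be sketched as a plain lookup. This is an illustrative sketch only, not the library's implementation: MOLECULENET_GROUPS and resolve_subset are hypothetical names, the reading of "classification_no_pcba" as "all classification datasets except PCBA" is inferred from the description above, and the order in which the library yields datasets may differ.

```python
# Hypothetical helper mirroring the documented grouping; not part of skfp.
MOLECULENET_GROUPS = {
    "regression": ["ESOL", "FreeSolv", "Lipophilicity"],
    "classification_single_task": ["BACE", "BBBP", "HIV"],
    "classification_multitask": [
        "ClinTox", "MUV", "SIDER", "Tox21", "ToxCast", "PCBA",
    ],
}


def resolve_subset(subset=None):
    """Return the dataset names selected by a documented `subset` value."""
    single = MOLECULENET_GROUPS["classification_single_task"]
    multi = MOLECULENET_GROUPS["classification_multitask"]
    all_names = MOLECULENET_GROUPS["regression"] + single + multi

    if subset is None:
        # None selects every dataset.
        return list(all_names)
    if isinstance(subset, list):
        # Explicit dataset names; matching is case-sensitive.
        unknown = [name for name in subset if name not in all_names]
        if unknown:
            raise ValueError(f"Unknown dataset names: {unknown}")
        return list(subset)
    if subset == "classification":
        return single + multi
    if subset == "classification_no_pcba":
        # Assumed meaning: all classification datasets except PCBA.
        return [name for name in single + multi if name != "PCBA"]
    if subset in MOLECULENET_GROUPS:
        return list(MOLECULENET_GROUPS[subset])
    raise ValueError(f"Unknown subset: {subset!r}")
```

Because names are case-sensitive, a list such as ["esol"] is rejected by this sketch, while ["ESOL"] is accepted.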
- Parameters:
subset ({None, "regression", "classification", "classification_single_task", "classification_multitask", "classification_no_pcba"} or list of strings) – If None, returns all datasets. String loads only a given subset of all datasets. List of strings loads only datasets with given names.
data_dir ({None, str, path-like}, default=None) – Path to the root data directory. If None, the currently set scikit-learn directory is used, by default $HOME/scikit_learn_data.
as_frames (bool, default=False) – If True, returns the raw DataFrame for each dataset. Otherwise, returns SMILES as a list of strings, and labels as a NumPy array for each dataset.
verbose (bool, default=False) – If True, progress bar will be shown for downloading or loading files.
- Returns:
data – Loads and returns datasets with a generator. Each item is yielded together with the dataset name, as shown in the signature. Returned types depend on the as_frames parameter, either:
  - Pandas DataFrame with columns: "SMILES", "label"
  - tuple of: list of strings (SMILES), NumPy array (labels)
- Return type:
generator of pd.DataFrame or tuples (list[str], np.ndarray)
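A minimal usage sketch of consuming the generator with as_frames=False, where each item is a (name, SMILES list, labels array) tuple as given in the signature. The helper names summarize_benchmark and load_regression_summary are hypothetical. Treating a 1-D labels array as a single task and a 2-D array as one column per task is an assumption, not something this page states.

```python
import numpy as np


def summarize_benchmark(benchmark):
    """Collect (name, n_molecules, n_tasks) for each yielded dataset.

    `benchmark` is any iterable of (name, smiles, labels) tuples, i.e. the
    item shape documented for as_frames=False.
    """
    summary = []
    for name, smiles, labels in benchmark:
        labels = np.asarray(labels)
        # Assumption: 1-D labels mean one task, 2-D labels mean one column per task.
        n_tasks = 1 if labels.ndim == 1 else labels.shape[1]
        summary.append((name, len(smiles), n_tasks))
    return summary


def load_regression_summary():
    """Summarize the regression subset; needs skfp installed and may download data."""
    # Imported lazily so the helper above can be used without skfp.
    from skfp.datasets.moleculenet import load_moleculenet_benchmark

    return summarize_benchmark(load_moleculenet_benchmark(subset="regression"))
```

Since the datasets arrive through a generator, each one can be processed and discarded in turn, which is useful when the selection includes a large dataset such as PCBA.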
References