load_tdc_benchmark

skfp.datasets.tdc.load_tdc_benchmark(subset: str | list[str] | None = None, data_dir: str | PathLike | None = None, as_frames: bool = False, verbose: bool = False) -> Iterator[tuple[str, DataFrame]] | Iterator[tuple[str, list[str], ndarray]]

Load the TDC benchmark datasets.

TDC [1] datasets are varied molecular property prediction tasks. Scaffold split is recommended for all of them. The tasks are split into 3 different groups:

  • ADME - absorption, distribution, metabolism, excretion

  • HTS - high-throughput screening

  • Toxicity - toxicity

For more details, see the loading functions for the individual datasets. The allowed dataset names are listed below. Names are case-sensitive, and each dataset's name is returned along with its data.

ADME group:

  • “b3db_classification”

  • “b3db_regression”

  • “bioavailability_ma”

  • “caco2_wang”

  • “clearance_hepatocyte_az”

  • “clearance_microsome_az”

  • “cyp1a2_veith”

  • “cyp2c19_veith”

  • “cyp2c9_substrate_carbonmangels”

  • “cyp2c9_veith”

  • “cyp2d6_substrate_carbonmangels”

  • “cyp2d6_veith”

  • “cyp3a4_substrate_carbonmangels”

  • “cyp3a4_veith”

  • “half_life_obach”

  • “hia_hou”

  • “hlm”

  • “pampa_ncats”

  • “pampa_approved_drugs”

  • “pgp_broccatelli”

  • “ppbr_az”

  • “rlm”

  • “solubility_aqsoldb”

  • “vdss_lombardo”

High-throughput screening (HTS) group:

  • “sarscov2_3clpro_diamond”

  • “sarscov2_vitro_touret”

Toxicity group:

  • “ames”

  • “carcinogens_lagunin”

  • “dili”

  • “herg”

  • “herg_central_at_10um”

  • “herg_central_at_1um”

  • “herg_central_inhib”

  • “herg_karim”

  • “ld50_zhu”

  • “skin_reaction”

Parameters:
  • subset ({None, "ADME", "HTS", "Toxicity"} or list of strings, default=None) – If None, loads all datasets. A string loads only the datasets in the given group. A list of strings loads only the individual datasets with the given names.

  • data_dir ({None, str, path-like}, default=None) – Path to the root data directory. If None, the currently set scikit-learn data directory is used, by default $HOME/scikit_learn_data.

  • as_frames (bool, default=False) – If True, returns the raw DataFrame for each dataset. Otherwise, returns SMILES as a list of strings, and labels as a NumPy array for each dataset.

  • verbose (bool, default=False) – If True, a progress bar is shown while downloading or loading files.
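The parameter combinations described above could be written roughly as follows. This is a minimal sketch, not an official usage recipe: the names CALL_VARIANTS and load_variant and the path "./tdc_data" are illustrative assumptions, and the actual call needs scikit-fingerprints installed and may download data.

```python
# Keyword arguments for three illustrative calls; "./tdc_data" is an assumed path.
CALL_VARIANTS = {
    "all": {},  # subset=None (default): every dataset in the benchmark
    "group": {"subset": "ADME"},  # a group name: only datasets from that group
    "named": {  # a list of names: only those individual datasets
        "subset": ["ames", "herg"],
        "data_dir": "./tdc_data",
        "verbose": True,
    },
}


def load_variant(key: str):
    """Call load_tdc_benchmark with one of the argument sets above."""
    # Imported lazily: needs scikit-fingerprints, and loading may download data.
    from skfp.datasets.tdc import load_tdc_benchmark

    return load_tdc_benchmark(**CALL_VARIANTS[key])
```

For example, load_variant("group") would request only the ADME datasets.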

Returns:

data – Generator that loads and yields the datasets one at a time, each together with its name. The yielded types depend on the as_frames parameter, either:

  • dataset name and a Pandas DataFrame with columns: “SMILES”, “label”

  • dataset name, list of strings (SMILES), and NumPy array (labels)

Return type:

generator of tuples (str, pd.DataFrame) or (str, list[str], np.ndarray)
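Because the function returns a generator, the datasets are typically consumed in a loop. The sketch below assumes the default as_frames=False, where each item is a (name, SMILES list, labels) tuple as given in the signature. The helpers count_molecules and count_toxicity_molecules are hypothetical names introduced for illustration and are not part of the library.

```python
from collections.abc import Iterable, Sized


def count_molecules(datasets: Iterable[tuple[str, Sized, Sized]]) -> dict[str, int]:
    """Map each dataset name to its number of molecules."""
    counts = {}
    for name, smiles, labels in datasets:
        # SMILES and labels are expected to be aligned, one label entry per molecule.
        if len(smiles) != len(labels):
            raise ValueError(f"{name}: {len(smiles)} SMILES but {len(labels)} labels")
        counts[name] = len(smiles)
    return counts


def count_toxicity_molecules() -> dict[str, int]:
    """Count molecules in every dataset of the Toxicity group."""
    # Imported lazily: needs scikit-fingerprints, and loading may download data.
    from skfp.datasets.tdc import load_tdc_benchmark

    return count_molecules(load_tdc_benchmark(subset="Toxicity"))
```

With as_frames=True each item would instead be a (name, DataFrame) pair, so the loop would unpack two values rather than three.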

References