load_tdc_benchmark
- skfp.datasets.tdc.load_tdc_benchmark(subset: str | list[str] | None = None, data_dir: str | PathLike | None = None, as_frames: bool = False, verbose: bool = False) -> Iterator[tuple[str, DataFrame]] | Iterator[tuple[str, list[str], ndarray]]
Load the TDC benchmark datasets.
TDC [1] datasets are varied molecular property prediction tasks. Scaffold split is recommended for all of them. The tasks are split into 3 different groups:
ADME - absorption, distribution, metabolism, excretion
HTS - high-throughput screening
Toxicity - toxicity
For more details, see loading functions for particular datasets. Allowed individual dataset names are listed below. Dataset names are also returned (case-sensitive).
ADME group:
“b3db_classification”
“b3db_regression”
“bioavailability_ma”
“caco2_wang”
“clearance_hepatocyte_az”
“clearance_microsome_az”
“cyp1a2_veith”
“cyp2c19_veith”
“cyp2c9_substrate_carbonmangels”
“cyp2c9_veith”
“cyp2d6_substrate_carbonmangels”
“cyp2d6_veith”
“cyp3a4_substrate_carbonmangels”
“cyp3a4_veith”
“half_life_obach”
“hia_hou”
“hlm”
“pampa_ncats”
“pampa_approved_drugs”
“pgp_broccatelli”
“ppbr_az”
“rlm”
“solubility_aqsoldb”
“vdss_lombardo”
High-throughput screening (HTS) group:
“sarscov2_3clpro_diamond”
“sarscov2_vitro_touret”
Toxicity group:
“ames”
“carcinogens_lagunin”
“dili”
“herg”
“herg_central_at_10um”
“herg_central_at_1um”
“herg_central_inhib”
“herg_karim”
“ld50_zhu”
“skin_reaction”
- Parameters:
subset ({None, "ADME", "HTS", "Toxicity"}, default=None) – If None, returns all datasets. String loads only a given subset of all datasets. Alternatively the subset can contain names of individual datasets. List of strings loads only datasets with given names.
data_dir ({None, str, path-like}, default=None) – Path to the root data directory. If None, the currently set scikit-learn directory is used, by default $HOME/scikit_learn_data.
as_frames (bool, default=False) – If True, returns the raw DataFrame for each dataset. Otherwise, returns SMILES as a list of strings, and labels as a NumPy array for each dataset.
verbose (bool, default=False) – If True, progress bar will be shown for downloading or loading files.
- Returns:
data – Loads and returns datasets with a generator. Each yielded item starts with the dataset name, followed by the data. The data type depends on the as_frames parameter, either:
- Pandas DataFrame with columns: “SMILES”, “label”
- tuple of: list of strings (SMILES), NumPy array (labels)
- Return type:
generator of pd.DataFrame or tuples (list[str], np.ndarray)
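Examples

A minimal usage sketch, assuming scikit-fingerprints is installed and the datasets can be downloaded or are already cached. It relies on the tuple shape given in the signature for as_frames=False, i.e. (name, SMILES list, labels array). The helper names summarize_benchmark and toxicity_summary are hypothetical and not part of the library.

```python
from collections.abc import Iterable, Iterator


def summarize_benchmark(datasets: Iterable[tuple]) -> Iterator[tuple[str, int]]:
    """Yield (dataset_name, number_of_molecules) for each benchmark dataset.

    Expects items shaped like the as_frames=False output:
    (name, list of SMILES strings, array of labels).
    """
    for name, smiles, labels in datasets:
        # SMILES and labels are expected to be aligned one-to-one
        if len(smiles) != len(labels):
            raise ValueError(
                f"{name}: {len(smiles)} SMILES but {len(labels)} labels"
            )
        yield name, len(smiles)


def toxicity_summary() -> list[tuple[str, int]]:
    """Hypothetical convenience wrapper: sizes of all Toxicity group datasets."""
    # Imported lazily so summarize_benchmark works without the library installed
    from skfp.datasets.tdc import load_tdc_benchmark

    # subset="Toxicity" restricts loading to the Toxicity group listed above;
    # files may be downloaded into the data directory on first use
    return list(summarize_benchmark(load_tdc_benchmark(subset="Toxicity")))
```

Because the loader is a generator, datasets are consumed one at a time in the loop. Passing a list such as ["ames", "herg"] as subset should restrict loading to those individual datasets, as described in the parameters.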
References