load_tdc_benchmark

Load the TDC benchmark datasets.

TDC [1] datasets are varied molecular property prediction tasks. Scaffold split is recommended for all of them. The tasks are split into 3 different groups:

ADME - absorption, distribution, metabolism, excretion
HTS - high-throughput screening
Toxicity - toxicity

For more details, see the loading functions for the individual datasets. Allowed dataset names are listed below and are case-sensitive. Dataset names are also returned.

ADME group:

“b3db_classification”
“b3db_regression”
“bioavailability_ma”
“caco2_wang”
“clearance_hepatocyte_az”
“clearance_microsome_az”
“cyp1a2_veith”
“cyp2c19_veith”
“cyp2c9_substrate_carbonmangels”
“cyp2c9_veith”
“cyp2d6_substrate_carbonmangels”
“cyp2d6_veith”
“cyp3a4_substrate_carbonmangels”
“cyp3a4_veith”
“half_life_obach”
“hia_hou”
“hlm”
“pampa_ncats”
“pampa_approved_drugs”
“pgp_broccatelli”
“ppbr_az”
“rlm”
“solubility_aqsoldb”
“vdss_lombardo”

High throughput screening (HTS) group:

“sarscov2_3clpro_diamond”
“sarscov2_vitro_touret”

Toxicity group:

“ames”
“carcinogens_lagunin”
“dili”
“herg”
“herg_central_at_10um”
“herg_central_at_1um”
“herg_central_inhib”
“herg_karim”
“ld50_zhu”
“skin_reaction”

Parameters:

subset ({None, "ADME", "HTS", "Toxicity"}, default=None) – If None, all datasets are loaded. A string loads only the datasets in the given group. Alternatively, a list of strings loads only the individual datasets with the given names.
data_dir ({None, str, path-like}, default=None) – Path to the root data directory. If None, the currently set scikit-learn data directory is used, which by default is $HOME/scikit_learn_data.
as_frames (bool, default=False) – If True, returns the raw DataFrame for each dataset. Otherwise, returns SMILES as a list of strings, and labels as a NumPy array for each dataset.
verbose (bool, default=False) – If True, a progress bar is shown while downloading or loading files.
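
The way subset selects datasets can be sketched in plain Python. This is an illustration of the documented behavior only, not the library's implementation: the GROUPS dictionary and the resolve_subset helper are hypothetical names introduced here, and exact, case-sensitive matching of group names is an assumption.

```python
# Group membership copied from the dataset lists in this page.
GROUPS = {
    "ADME": [
        "b3db_classification", "b3db_regression", "bioavailability_ma",
        "caco2_wang", "clearance_hepatocyte_az", "clearance_microsome_az",
        "cyp1a2_veith", "cyp2c19_veith", "cyp2c9_substrate_carbonmangels",
        "cyp2c9_veith", "cyp2d6_substrate_carbonmangels", "cyp2d6_veith",
        "cyp3a4_substrate_carbonmangels", "cyp3a4_veith", "half_life_obach",
        "hia_hou", "hlm", "pampa_ncats", "pampa_approved_drugs",
        "pgp_broccatelli", "ppbr_az", "rlm", "solubility_aqsoldb",
        "vdss_lombardo",
    ],
    "HTS": ["sarscov2_3clpro_diamond", "sarscov2_vitro_touret"],
    "Toxicity": [
        "ames", "carcinogens_lagunin", "dili", "herg",
        "herg_central_at_10um", "herg_central_at_1um", "herg_central_inhib",
        "herg_karim", "ld50_zhu", "skin_reaction",
    ],
}


def resolve_subset(subset=None):
    """Hypothetical helper: map a subset argument to dataset names."""
    all_names = [name for names in GROUPS.values() for name in names]

    # None selects every dataset in every group.
    if subset is None:
        return all_names

    # A single string selects one whole group (assumed case-sensitive).
    if isinstance(subset, str):
        if subset not in GROUPS:
            raise ValueError(f"Unknown group: {subset!r}")
        return list(GROUPS[subset])

    # Any other iterable is treated as a list of individual dataset names.
    unknown = [name for name in subset if name not in all_names]
    if unknown:
        raise ValueError(f"Unknown dataset names: {unknown}")
    return list(subset)
```

For example, resolve_subset("HTS") yields the two screening datasets, while resolve_subset(["ames", "hlm"]) mixes datasets from different groups.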

Returns:

data – Generator that loads and returns the datasets. Returned types depend on the as_frames parameter, either:
- Pandas DataFrame with columns: “SMILES”, “label”
- tuple of: list of strings (SMILES), NumPy array (labels)

Return type:

generator of pd.DataFrame or tuples (list[str], np.ndarray)
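
A minimal usage sketch follows, assuming the default tuple output rather than DataFrames. The import path skfp.datasets.tdc is an assumption not stated on this page, and describe_dataset and describe_toxicity_group are hypothetical helpers. Because this page also says dataset names are returned, the loop reads SMILES and labels from the last two positions of each yielded item, which works whether or not a name is yielded first.

```python
def describe_dataset(smiles, labels):
    """Hypothetical helper: summarize one dataset's SMILES and labels."""
    return {
        "n_molecules": len(smiles),
        # Works for a plain list or a 1-D NumPy array of labels.
        "n_distinct_labels": len({float(value) for value in labels}),
    }


def describe_toxicity_group(data_dir=None):
    """Load the Toxicity group and summarize each dataset.

    Calling this downloads data on first use, so it is not run here.
    """
    # Assumed import path; adjust to wherever load_tdc_benchmark lives.
    from skfp.datasets.tdc import load_tdc_benchmark

    summaries = []
    # The result is a generator, so it is consumed by iterating over it.
    for item in load_tdc_benchmark(subset="Toxicity", data_dir=data_dir):
        # SMILES and labels are taken from the end of the item, so a
        # leading dataset name (if present) is simply ignored.
        smiles, labels = item[-2], item[-1]
        summaries.append(describe_dataset(smiles, labels))
    return summaries
```

Iterating over the result rather than indexing into it matters here, since a generator can only be consumed once and has no length.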

References
