load_toxcast#

skfp.datasets.moleculenet.load_toxcast(data_dir: str | PathLike | None = None, as_frame: bool = False, verbose: bool = False) DataFrame | tuple[list[str]] | ndarray#

Load and return the ToxCast dataset.

The task is to predict 617 toxicity targets from a large library of compounds based on in vitro high-throughput screening. All tasks are binary.

Note that targets have missing values. Algorithms should be evaluated only on present labels. For training data, you may want to impute them, e.g. with zeros.

Tasks

617

Task type

multitask classification

Total samples

8576

Recommended split

scaffold

Recommended metric

AUROC

Parameters:
  • data_dir ({None, str, path-like}, default=None) – Path to the root data directory. If None, currently set scikit-learn directory is used, by default $HOME/scikit_learn_data.

  • as_frame (bool, default=False) – If True, returns the raw DataFrame with columns “SMILES” and 617 label columns, with names corresponding to toxicity targets (see [1] and [2] for details). Otherwise, returns SMILES as list of strings, and labels as a NumPy array. Labels are 2D NumPy float array with binary labels and missing values.

  • verbose (bool, default=False) – If True, progress bar will be shown for downloading or loading files.

Returns:

data – Depending on the as_frame argument, one of: - Pandas DataFrame with columns “SMILES” and 617 label columns - tuple of: list of strings (SMILES), NumPy array (labels)

Return type:

pd.DataFrame or tuple(list[str], np.ndarray)

References

Examples

>>> from skfp.datasets.moleculenet import load_toxcast
>>> dataset = load_toxcast()
>>> dataset  
(['[O-][N+](=O)C1=CC=C(Cl)C=C1', ..., 'CN1CC2=C(N[C@H](CC(O)=O)C1=O)C=CC(=C2)C(=O)N1CCC(CC1)C1CCNCC1'], array([[ 0.,  0., nan, ...,  0.,  0.,  0.],
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       ...,
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan],
       [nan, nan, nan, ..., nan, nan, nan]]))
>>> dataset = load_toxcast(as_frame=True)
>>> dataset.head() 
                                      SMILES  ...  Tanguay_ZF_120hpf_YSE_up
0                [O-][N+](=O)C1=CC=C(Cl)C=C1  ...                       0.0
1  C[SiH](C)O[Si](C)(C)O[Si](C)(C)O[SiH](C)C  ...                       NaN
2                   CN1CCN(CC1)C(=O)C1CCCCC1  ...                       NaN
3                 NC1=CC=C(C=C1)[N+]([O-])=O  ...                       0.0
4                 OC1=CC=C(C=C1)[N+]([O-])=O  ...                       0.0
...