load_tox21

skfp.datasets.moleculenet.load_tox21(data_dir: str | PathLike | None = None, as_frame: bool = False, verbose: bool = False) → DataFrame | tuple[list[str], ndarray]

Load the Tox21 dataset.

The task is to predict 12 toxicity targets, including nuclear receptors and stress response pathways. All tasks are binary.

Note that targets have missing values. Algorithms should be evaluated only on present labels. For training data, you may want to impute them, e.g. with zeros.
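As a minimal sketch of the zero imputation mentioned above, the snippet below uses a small toy label matrix as a stand-in for the labels returned by this loader (the toy values are made up for illustration). It keeps a boolean mask of present labels, so that evaluation can later ignore the imputed entries.

```python
import numpy as np

# Toy stand-in for the Tox21 label matrix:
# 4 molecules x 3 tasks, NaN marks a missing label.
y = np.array([
    [0.0, 1.0, np.nan],
    [np.nan, 0.0, 1.0],
    [1.0, np.nan, 0.0],
    [0.0, 0.0, np.nan],
])

# True where a label is actually present; keep this mask for evaluation.
present_mask = ~np.isnan(y)

# Zero-imputed copy for training only; the original y is left untouched.
y_train = np.where(present_mask, y, 0.0)

# Number of labelled molecules per task.
labels_per_task = present_mask.sum(axis=0)
```

Zero imputation treats every missing label as non-toxic, which is a modelling assumption rather than a property of the data, so the mask should still be used whenever scores are reported.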

Tasks	12
Task type	multitask classification
Total samples	7831
Recommended split	scaffold
Recommended metric	AUROC
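One possible way to compute the recommended metric on present labels only is sketched below with scikit-learn's roc_auc_score. The helper name masked_mean_auroc and the toy arrays are hypothetical and not part of skfp. Tasks whose present labels contain a single class are skipped here, which is a choice made for this sketch.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def masked_mean_auroc(y_true, y_score):
    """Mean per-task AUROC, using only the present (non-NaN) labels.

    Hypothetical helper for illustration; not part of skfp.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_score = np.asarray(y_score, dtype=float)
    aurocs = []
    for task in range(y_true.shape[1]):
        present = ~np.isnan(y_true[:, task])
        labels = y_true[present, task]
        # AUROC is undefined unless both classes occur among present labels.
        if np.unique(labels).size < 2:
            continue
        aurocs.append(roc_auc_score(labels, y_score[present, task]))
    if not aurocs:
        raise ValueError("no task has both classes among its present labels")
    return float(np.mean(aurocs))


# Toy data: 4 molecules x 2 tasks, NaN marks a missing label.
y_true = np.array([[0.0, 1.0], [1.0, np.nan], [0.0, 0.0], [1.0, 1.0]])
y_score = np.array([[0.1, 0.4], [0.9, 0.5], [0.2, 0.6], [0.8, 0.7]])
score = masked_mean_auroc(y_true, y_score)
```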

Warning: in newer RDKit versions, 8 molecules from the original dataset are not read correctly due to disallowed hypervalent states of their aluminium atoms (see [release notes](rdkit/rdkit)). This version of the Tox21 dataset contains manual fixes for those molecules, removing the additional hydrogens, e.g. [AlH3] -> [Al]. In the OGB scaffold split, used for benchmarking, only the first of those 8 problematic molecules is in the test set.

Parameters:

data_dir ({None, str, path-like}, default=None) – Path to the root data directory. If None, the currently set scikit-learn data directory is used, by default $HOME/scikit_learn_data.
as_frame (bool, default=False) – If True, returns the raw DataFrame with columns “SMILES” and 12 label columns, with names corresponding to toxicity targets (see [1] and [2] for details). Otherwise, returns SMILES as a list of strings, and labels as a NumPy array. Labels are a 2D NumPy float array with binary labels and NaN for missing values.
verbose (bool, default=False) – If True, a progress bar is shown when downloading or loading files.

Returns:

data – Depending on the as_frame argument, one of:
- Pandas DataFrame with columns “SMILES” and 12 label columns
- tuple of: list of strings (SMILES), NumPy array (labels)

Return type:

pd.DataFrame or tuple(list[str], np.ndarray)
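The two return forms carry the same information, so a frame can be split into a SMILES list and a label array when needed. The sketch below assumes the column layout described above and uses a small hand-made frame as a stand-in for the real dataset, showing only 2 of the 12 label columns with made-up values.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the as_frame=True output: a "SMILES" column followed by
# label columns. Only 2 of the 12 task columns are shown here.
df = pd.DataFrame({
    "SMILES": ["CCO", "c1ccccc1", "CC(=O)O"],
    "NR-AR": [0.0, np.nan, 1.0],
    "SR-p53": [0.0, 1.0, np.nan],
})

# SMILES as a list of strings, labels as a 2D float array with NaN.
smiles = df["SMILES"].tolist()
labels = df.drop(columns="SMILES").to_numpy(dtype=float)

# Count of missing labels for each task column.
missing_per_task = df.drop(columns="SMILES").isna().sum()
```

Counting missing labels per task this way is a quick check of how sparse each target is before choosing an imputation or masking strategy.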

References

Examples

>>> from skfp.datasets.moleculenet import load_tox21
>>> dataset = load_tox21()
>>> dataset  
(['CCOc1ccc2nc(S(N)(=O)=O)sc2c1', ..., 'COc1ccc2c(c1OC)CN1CCc3cc4c(cc3C1C2)OCO4'], array([[ 0.,  0.,  1., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ..., nan,  0.,  0.],
       [nan, nan, nan, ...,  0., nan, nan],
       ...,
       [ 1.,  1.,  0., ...,  0.,  0.,  0.],
       [ 1.,  1.,  0., ...,  0.,  1.,  1.],
       [ 0.,  0., nan, ...,  0.,  1.,  0.]]))

>>> dataset = load_tox21(as_frame=True)
>>> dataset.head() 
                                              SMILES  NR-AR  ...  SR-MMP  SR-p53
0                       CCOc1ccc2nc(S(N)(=O)=O)sc2c1    0.0  ...     0.0     0.0
1                          CCN1C(=O)NC(c2ccccc2)C1=O    0.0  ...     0.0     0.0
2  CC[C@]1(O)CC[C@H]2[C@@H]3CCC4=CCCC[C@@H]4[C@H]...    NaN  ...     NaN     NaN
3                    CCCN(CC)C(CC)C(=O)Nc1c(C)cccc1C    0.0  ...     0.0     0.0
4                          CC(O)(P(=O)(O)O)P(=O)(O)O    0.0  ...     0.0     0.0
...
