load_muv
- skfp.datasets.moleculenet.load_muv(data_dir: str | PathLike | None = None, as_frame: bool = False, verbose: bool = False) → DataFrame | tuple[list[str]] | ndarray
Load and return the MUV (Maximum Unbiased Validation) dataset.
The task is to predict 17 targets designed for validation of virtual screening techniques, based on PubChem BioAssays. All tasks are binary.
Note that targets have missing values. Algorithms should be evaluated only on present labels. For training data, you may want to impute them, e.g. with zeros.
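The handling of missing labels described above can be sketched with plain NumPy. This is a minimal illustration on a small hand-made array, not on the real dataset; `y_toy` and the other variable names are assumptions introduced here and are not part of skfp.

```python
import numpy as np

# Toy stand-in for MUV labels: rows are molecules, columns are tasks,
# NaN marks a missing label (values are made up for illustration).
y_toy = np.array(
    [
        [0.0, np.nan, 1.0],
        [np.nan, 0.0, np.nan],
        [1.0, 0.0, np.nan],
    ]
)

# Boolean mask of labels that are actually present.
present = ~np.isnan(y_toy)

# For training, one option is to impute missing labels with zeros.
y_train_imputed = np.nan_to_num(y_toy, nan=0.0)

# For evaluation, keep only the present labels of each task.
labels_per_task = [y_toy[present[:, j], j] for j in range(y_toy.shape[1])]

print(present.sum(axis=0))  # number of present labels per task: [2 2 1]
```

Keeping the mask separate from the imputed array makes it possible to train on `y_train_imputed` while still scoring only the entries where `present` is True.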
Tasks: 17
Task type: multitask classification
Total samples: 93087
Recommended split: scaffold
Recommended metric: AUPRC, AUROC
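One possible way to compute the recommended metrics while respecting missing labels is sketched below. It assumes scikit-learn is available, uses average precision as the AUPRC estimate, and averages scores over tasks; the helper name `masked_multitask_scores` and the toy arrays are hypothetical, and the aggregation scheme is an assumption rather than a procedure prescribed by the dataset authors.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score


def masked_multitask_scores(y_true, y_score):
    """Mean AUROC and AUPRC over tasks, using only present (non-NaN) labels."""
    aurocs, auprcs = [], []
    for j in range(y_true.shape[1]):
        present = ~np.isnan(y_true[:, j])
        labels = y_true[present, j]
        # Both metrics need at least one positive and one negative label,
        # so tasks with a single class among present labels are skipped.
        if np.unique(labels).size < 2:
            continue
        scores = y_score[present, j]
        aurocs.append(roc_auc_score(labels, scores))
        auprcs.append(average_precision_score(labels, scores))
    return float(np.mean(aurocs)), float(np.mean(auprcs))


# Toy labels and predicted scores for 2 tasks (made up for illustration).
y_true = np.array([[0.0, np.nan], [1.0, 0.0], [np.nan, 1.0], [0.0, 0.0]])
y_score = np.array([[0.1, 0.5], [0.9, 0.2], [0.4, 0.8], [0.3, 0.1]])

auroc, auprc = masked_multitask_scores(y_true, y_score)
```

In this toy case every positive is ranked above every negative within its task, so both averages equal 1.0. If every task were skipped, the means would be NaN, which a caller may want to check for.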
- Parameters:
data_dir ({None, str, path-like}, default=None) – Path to the root data directory. If None, the currently set scikit-learn data directory is used, by default $HOME/scikit_learn_data.
as_frame (bool, default=False) – If True, returns the raw DataFrame with columns “SMILES” and 17 label columns, with names corresponding to MUV targets (see [1] and [2] for details). Otherwise, returns SMILES as a list of strings, and labels as a NumPy array. Labels are a 2D NumPy float array with binary labels and missing values.
verbose (bool, default=False) – If True, a progress bar will be shown while downloading or loading files.
- Returns:
data – Depending on the as_frame argument, one of:
  - Pandas DataFrame with columns “SMILES” and 17 label columns
  - tuple of: list of strings (SMILES), NumPy array (labels)
- Return type:
pd.DataFrame or tuple(list[str], np.ndarray)
References
Examples
>>> from skfp.datasets.moleculenet import load_muv
>>> dataset = load_muv()
>>> dataset
(['Cc1cccc(N2CCN(C(=O)C34CC5CC(CC(C5)C3)C4)CC2)c1C', ..., 'COc1ccc([N+](=O)[O-])cc1NC(=O)c1ccc(C)o1'],
 array([[nan, nan, nan, ..., nan, nan, nan],
        [ 0.,  0., nan, ..., nan,  0.,  0.],
        [nan, nan,  0., ..., nan, nan,  0.],
        ...,
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ...,  0., nan, nan],
        [nan, nan, nan, ...,  0., nan, nan]]))

>>> dataset = load_muv(as_frame=True)
>>> dataset.head()
                                              SMILES  MUV-466  ...  MUV-858  MUV-859
0    Cc1cccc(N2CCN(C(=O)C34CC5CC(CC(C5)C3)C4)CC2)c1C      NaN  ...      NaN      NaN
1               Cn1ccnc1SCC(=O)Nc1ccc(Oc2ccccc2)cc1      0.0  ...      0.0      0.0
2  COc1cc2c(cc1NC(=O)CN1C(=O)NC3(CCc4ccccc43)C1=O...      NaN  ...      NaN      0.0
3  O=C1/C(=C/NC2CCS(=O)(=O)C2)c2ccccc2C(=O)N1c1cc...      NaN  ...      0.0      NaN
4                         NC(=O)NC(Cc1ccccc1)C(=O)O      0.0  ...      NaN      NaN
...