load_hiv
- skfp.datasets.moleculenet.load_hiv(data_dir: str | PathLike | None = None, as_frame: bool = False, verbose: bool = False) → DataFrame | tuple[list[str], ndarray]
Load and return the HIV dataset.
The task is to predict the ability of molecules to inhibit HIV replication [1] [2].
Tasks: 1
Task type: classification
Total samples: 41127
Recommended split: scaffold
Recommended metric: AUROC
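The recommended split and metric can be combined into a quick baseline. The following is a minimal sketch, not part of the original documentation: the function name `evaluate_hiv_baseline`, the choice of ECFP fingerprints, the random forest classifier, and the 80/20 split ratio are all illustrative assumptions. It relies on `scaffold_train_test_split` from `skfp.model_selection`, `ECFPFingerprint` from `skfp.fingerprints`, and `roc_auc_score` from scikit-learn.

```python
def evaluate_hiv_baseline(test_size=0.2, random_state=0):
    """Scaffold-split the HIV dataset and return test AUROC for a simple baseline.

    Hypothetical helper for illustration only; the fingerprint, model and
    split ratio are assumptions, not recommendations from the documentation.
    """
    # Imports are local so that defining the function triggers no downloads.
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    from skfp.datasets.moleculenet import load_hiv
    from skfp.fingerprints import ECFPFingerprint
    from skfp.model_selection import scaffold_train_test_split

    # Default return form: list of SMILES strings and a 1D label array.
    smiles, y = load_hiv()

    # Scaffold split keeps molecules sharing a scaffold in the same subset.
    smiles_train, smiles_test, y_train, y_test = scaffold_train_test_split(
        smiles, y, test_size=test_size
    )

    # Turn SMILES strings into fixed-length fingerprint vectors.
    fingerprint = ECFPFingerprint()
    X_train = fingerprint.transform(smiles_train)
    X_test = fingerprint.transform(smiles_test)

    model = RandomForestClassifier(
        n_estimators=100, random_state=random_state, n_jobs=-1
    )
    model.fit(X_train, y_train)

    # AUROC is computed from the predicted probability of the positive class.
    positive_proba = model.predict_proba(X_test)[:, 1]
    return roc_auc_score(y_test, positive_proba)
```

Calling `evaluate_hiv_baseline()` downloads the dataset on first use, so it needs network access and may take a while on all 41127 molecules.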
- Parameters:
data_dir ({None, str, path-like}, default=None) – Path to the root data directory. If None, the currently set scikit-learn directory is used, by default $HOME/scikit_learn_data.
as_frame (bool, default=False) – If True, returns the raw DataFrame with columns: “SMILES”, “label”. Otherwise, returns SMILES as a list of strings, and labels as a NumPy array (1D integer binary vector).
verbose (bool, default=False) – If True, a progress bar will be shown for downloading or loading files.
- Returns:
data – Depending on the as_frame argument, one of:
- Pandas DataFrame with columns: “SMILES”, “label”
- tuple of: list of strings (SMILES), NumPy array (labels)
- Return type:
pd.DataFrame or tuple(list[str], np.ndarray)
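Because the return value has two possible shapes, downstream code may want to normalize it before use. The sketch below is illustrative only: `split_dataset` and `label_counts` are hypothetical helper names, and the toy tuple stands in for real `load_hiv()` output so the snippet runs without downloading anything.

```python
from collections import Counter


def split_dataset(data):
    """Return (smiles, labels) as plain lists from either return form."""
    # A DataFrame exposes .columns; the default return value is a plain tuple.
    if hasattr(data, "columns"):
        return data["SMILES"].tolist(), data["label"].tolist()
    smiles, labels = data
    return list(smiles), [int(v) for v in labels]


def label_counts(labels):
    """Count how many molecules fall into each binary class."""
    return Counter(int(v) for v in labels)


# Toy stand-in for the tuple returned by load_hiv(); not real dataset rows.
toy = (["CCO", "c1ccccc1", "CC(=O)O"], [0, 1, 0])
smiles, labels = split_dataset(toy)
counts = label_counts(labels)
print(counts)
```

With the real dataset, the same helpers would be called as `split_dataset(load_hiv())` or `split_dataset(load_hiv(as_frame=True))`; checking the class counts first is useful when interpreting AUROC on a binary task.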
References
Examples
>>> from skfp.datasets.moleculenet import load_hiv
>>> dataset = load_hiv()
>>> dataset
(['CCC1=[O+][Cu-3]2([O+]=C(CC)C1)[O+]=C(CC)CC(CC)=[O+]2', ..., 'CCCCCC=C(c1cc(Cl)c(OC)c(-c2nc(C)no2)c1)c1cc(Cl)c(OC)c(-c2nc(C)no2)c1'], array([0, 0, 0, ..., 0, 0, 0]))
>>> dataset = load_hiv(as_frame=True)
>>> dataset.head()
                                              SMILES  label
0  CCC1=[O+][Cu-3]2([O+]=C(CC)C1)[O+]=C(CC)CC(CC)...      0
1  C(=Cc1ccccc1)C1=[O+][Cu-3]2([O+]=C(C=Cc3ccccc3...      0
2                   CC(=O)N1c2ccccc2Sc2c1ccc1ccccc21      0
3    Nc1ccc(C=Cc2ccc(N)cc2S(=O)(=O)O)c(S(=O)(=O)O)c1      0
4                             O=S(=O)(O)CCS(=O)(=O)O      0