load_hiv

skfp.datasets.moleculenet.load_hiv(data_dir: str | PathLike | None = None, as_frame: bool = False, verbose: bool = False) → DataFrame | tuple[list[str], ndarray]

Load the HIV dataset.

The task is to predict the ability of molecules to inhibit HIV replication [1] [2].

Tasks	1
Task type	classification
Total samples	41127
Recommended split	scaffold
Recommended metric	AUROC
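
As a rough illustration of the recommended metric, the sketch below computes AUROC with the rank-sum identity: the fraction of (positive, negative) pairs in which the positive molecule receives the higher score. The `auroc` helper and the toy labels and scores are assumptions made for this example, not part of skfp. On the full dataset, `sklearn.metrics.roc_auc_score` is the more practical choice, since this pairwise version scales quadratically.

```python
def auroc(y_true, y_score):
    """Area under the ROC curve via the pairwise (Mann-Whitney) identity."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative label")
    # Count positive/negative pairs ranked correctly; ties count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy binary labels (mostly zeros) with made-up model scores.
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
y_score = [0.1, 0.2, 0.3, 0.35, 0.4, 0.8, 0.7, 0.9]
toy_auroc = auroc(y_true, y_score)  # 11 of 12 pairs are ordered correctly

# A constant predictor ties every pair, so its AUROC is 0.5.
constant_auroc = auroc(y_true, [0.0] * len(y_true))
```

Because the score depends only on how positives rank against negatives, a model that always predicts the majority class gains nothing, unlike with plain accuracy.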

Warning: in newer RDKit versions, 7 molecules from the original dataset are not read correctly due to disallowed hypervalent states of some atoms (see [release notes](rdkit/rdkit)). This version of the HIV dataset contains manual fixes for those molecules, made by cross-referencing the original NCI data, PubChem substructure search, and visualization with ChemAxon Marvin. In the OGB scaffold split, which is used for benchmarking, the first 2 of those 7 problematic molecules belong to the test set.

Parameters:

data_dir ({None, str, path-like}, default=None) – Path to the root data directory. If None, the currently set scikit-learn data directory is used, which defaults to $HOME/scikit_learn_data.
as_frame (bool, default=False) – If True, returns the raw DataFrame with columns: “SMILES”, “label”. Otherwise, returns SMILES as a list of strings and labels as a NumPy array (1D binary integer vector).
verbose (bool, default=False) – If True, a progress bar is shown while downloading or loading files.
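
To make the `data_dir=None` fallback concrete, here is a minimal sketch of how scikit-learn resolves its data home: it honours the `SCIKIT_LEARN_DATA` environment variable and otherwise uses `~/scikit_learn_data`. The `resolve_data_dir` helper is hypothetical and written only for illustration. It is an assumption that `load_hiv` defers to this same lookup, so treat the code as a description of the documented default, not of skfp internals.

```python
import os
from pathlib import Path


def resolve_data_dir(data_dir=None):
    """Hypothetical helper mirroring scikit-learn's data-home lookup."""
    if data_dir is None:
        # scikit-learn checks SCIKIT_LEARN_DATA first, then falls back
        # to a scikit_learn_data folder in the user's home directory.
        data_dir = os.environ.get(
            "SCIKIT_LEARN_DATA", os.path.join("~", "scikit_learn_data")
        )
    # Expand a leading "~" so the result is usable as a real path.
    return Path(os.path.expanduser(data_dir))
```

Passing an explicit `data_dir` to `load_hiv` bypasses this lookup, which can be handy for keeping benchmark data next to a project.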

Returns:

data – Depending on the as_frame argument, one of:
- Pandas DataFrame with columns: “SMILES”, “label”
- tuple of: list of strings (SMILES), NumPy array (labels)

Return type:

pd.DataFrame or tuple(list[str], np.ndarray)

References

Examples

>>> from skfp.datasets.moleculenet import load_hiv
>>> dataset = load_hiv()
>>> dataset  
(['CCC1=[O+][Cu-3]2([O+]=C(CC)C1)[O+]=C(CC)CC(CC)=[O+]2', ..., 'CCCCCC=C(c1cc(Cl)c(OC)c(-c2nc(C)no2)c1)c1cc(Cl)c(OC)c(-c2nc(C)no2)c1'], array([0, 0, 0, ..., 0, 0, 0]))

>>> dataset = load_hiv(as_frame=True)
>>> dataset.head() 
                                                  SMILES  label
0  CCC1=[O+][Cu-3]2([O+]=C(CC)C1)[O+]=C(CC)CC(CC)...      0
1  C(=Cc1ccccc1)C1=[O+][Cu-3]2([O+]=C(C=Cc3ccccc3...      0
2                   CC(=O)N1c2ccccc2Sc2c1ccc1ccccc21      0
3    Nc1ccc(C=Cc2ccc(N)cc2S(=O)(=O)O)c(S(=O)(=O)O)c1      0
4                             O=S(=O)(O)CCS(=O)(=O)O      0
