{ "cells": [ { "cell_type": "markdown", "id": "473e4b0b", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true } }, "source": [ "# Datasets and benchmarking" ] }, { "cell_type": "markdown", "id": "9a96a2ea", "metadata": {}, "source": [ "scikit-fingerprints hosts popular molecular datasets [on HuggingFace](https://huggingface.co/scikit-fingerprints) and implements functions to download them, which we used in previous tutorials. It also provides a few other utilities for easily testing and benchmarking molecular models.\n", "\n", "In this tutorial, we'll focus on the [MoleculeNet benchmark](https://doi.org/10.1039/C7SC02664A), the most popular one in molecular property prediction. Other benchmarks work very similarly." ] }, { "cell_type": "markdown", "id": "50656d53-e7e4-46d1-a5c9-56887de446f0", "metadata": {}, "source": [ "### Dataset loading functions\n", "\n", "Functions for loading datasets live in submodules of the `skfp.datasets` package. For example, `skfp.datasets.moleculenet` contains functions to load datasets from MoleculeNet.\n", "\n", "By default, data is downloaded to the same location as [scikit-learn datasets](https://scikit-learn.org/1.5/modules/generated/sklearn.datasets.get_data_home.html): the `scikit_learn_data` directory in the user's home directory, which can be changed with the `SCIKIT_LEARN_DATA` environment variable. The location can also be set per dataset with the `data_dir` parameter. Datasets are downloaded and cached on first use.\n", "\n", "Functions return a tuple with a list of SMILES strings and a NumPy array with labels. If the `as_frame` argument is True, they instead return a Pandas DataFrame with additional information, e.g. class names. This is useful, for example, for multioutput datasets like [MUV](https://doi.org/10.1021/ci8002649).\n", "\n", "Note that multioutput datasets may have missing labels. This is by design, since some values are unknown. After the train-test split, you can fill the training labels if needed, e.g. 
with the most common class. However, you **must not** modify the test labels, and should instead evaluate only on the labels that are available. We will see how to do that shortly. Since NumPy integer arrays cannot hold NaN values, the labels have the `float` data type, even though the values that are present are only 0 or 1." ] }, { "cell_type": "code", "execution_count": 1, "id": "bb06ca57-32ee-4c21-86dd-c69f65188acd", "metadata": { "execution": { "iopub.execute_input": "2025-01-31T18:48:34.543093Z", "iopub.status.busy": "2025-01-31T18:48:34.542890Z", "iopub.status.idle": "2025-01-31T18:48:35.554667Z", "shell.execute_reply": "2025-01-31T18:48:35.554236Z", "shell.execute_reply.started": "2025-01-31T18:48:34.543075Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['O1CC[C@@H](NC(=O)[C@@H](Cc2cc3cc(ccc3nc2N)-c2ccccc2C)C)CC1(C)C', 'Fc1cc(cc(F)c1)C[C@H](NC(=O)[C@@H](N1CC[C@](NC(=O)C)(CC(C)C)C1=O)CCc1ccccc1)[C@H](O)[C@@H]1[NH2+]C[C@H](OCCC)C1', 'S1(=O)(=O)N(c2cc(cc3c2n(cc3CC)CC1)C(=O)N[C@H]([C@H](O)C[NH2+]Cc1cc(OC)ccc1)Cc1ccccc1)C', 'S1(=O)(=O)C[C@@H](Cc2cc(O[C@H](COCC)C(F)(F)F)c(N)c(F)c2)[C@H](O)[C@@H]([NH2+]Cc2cc(ccc2)C(C)(C)C)C1', 'S1(=O)(=O)N(c2cc(cc3c2n(cc3CC)CC1)C(=O)N[C@H]([C@H](O)C[NH2+]Cc1cc(ccc1)C(F)(F)F)Cc1ccccc1)C']\n", "[1 1 1 1 1]\n" ] } ], "source": [ "from skfp.datasets.moleculenet import load_bace\n", "\n", "# downloaded on first use, then loaded from the local cache\n", "smiles_list, y = load_bace()\n", "print(smiles_list[:5])\n", "print(y[:5])" ] }, { "cell_type": "code", "execution_count": 2, "id": "17e94201-7394-4538-a4d8-10c1ff039346", "metadata": { "execution": { "iopub.execute_input": "2025-01-31T18:48:35.555603Z", "iopub.status.busy": "2025-01-31T18:48:35.555233Z", "iopub.status.idle": "2025-01-31T18:48:35.708568Z", "shell.execute_reply": "2025-01-31T18:48:35.707982Z", "shell.execute_reply.started": "2025-01-31T18:48:35.555578Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['CCOc1ccc2nc(S(N)(=O)=O)sc2c1', 'CCN1C(=O)NC(c2ccccc2)C1=O', 
'CC[C@]1(O)CC[C@H]2[C@@H]3CCC4=CCCC[C@@H]4[C@H]3CC[C@@]21C', 'CCCN(CC)C(CC)C(=O)Nc1c(C)cccc1C', 'CC(O)(P(=O)(O)O)P(=O)(O)O']\n", "[[ 0. 0. 1. nan nan 0. 0. 1. 0. 0. 0. 0.]\n", " [ 0. 0. 0. 0. 0. 0. 0. nan 0. nan 0. 0.]\n", " [nan nan nan nan nan nan nan 0. nan 0. nan nan]\n", " [ 0. 0. 0. 0. 0. 0. 0. nan 0. nan 0. 0.]\n", " [ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]\n" ] } ], "source": [ "from skfp.datasets.moleculenet import load_tox21\n", "\n", "# Tox21 is multioutput: y has one column per task, NaN marks a missing label\n", "smiles_list, y = load_tox21()\n", "print(smiles_list[:5])\n", "print(y[:5])" ] }, { "cell_type": "code", "execution_count": 3, "id": "be7970b1-f016-4fad-9dd1-d2d5b54ffc6a", "metadata": { "execution": { "iopub.execute_input": "2025-01-31T18:48:35.709714Z", "iopub.status.busy": "2025-01-31T18:48:35.709324Z", "iopub.status.idle": "2025-01-31T18:48:35.887681Z", "shell.execute_reply": "2025-01-31T18:48:35.887267Z", "shell.execute_reply.started": "2025-01-31T18:48:35.709691Z" } }, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<thead>\n",
"<tr><th></th><th>SMILES</th><th>NR-AR</th><th>NR-AR-LBD</th><th>NR-AhR</th><th>NR-Aromatase</th><th>NR-ER</th><th>NR-ER-LBD</th><th>NR-PPAR-gamma</th><th>SR-ARE</th><th>SR-ATAD5</th><th>SR-HSE</th><th>SR-MMP</th><th>SR-p53</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>CCOc1ccc2nc(S(N)(=O)=O)sc2c1</td><td>0.0</td><td>0.0</td><td>1.0</td><td>NaN</td><td>NaN</td><td>0.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>\n",
"<tr><th>1</th><td>CCN1C(=O)NC(c2ccccc2)C1=O</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>NaN</td><td>0.0</td><td>NaN</td><td>0.0</td><td>0.0</td></tr>\n",
"<tr><th>2</th><td>CC[C@]1(O)CC[C@H]2[C@@H]3CCC4=CCCC[C@@H]4[C@H]...</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>0.0</td><td>NaN</td><td>0.0</td><td>NaN</td><td>NaN</td></tr>\n",
"<tr><th>3</th><td>CCCN(CC)C(CC)C(=O)Nc1c(C)cccc1C</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>NaN</td><td>0.0</td><td>NaN</td><td>0.0</td><td>0.0</td></tr>\n",
"<tr><th>4</th><td>CC(O)(P(=O)(O)O)P(=O)(O)O</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>\n",
"</tbody>\n",
"</table>\n", "