{ "cells": [ { "cell_type": "markdown", "id": "473e4b0b", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true } }, "source": [ "# Introduction to scikit-fingerprints" ] }, { "cell_type": "markdown", "id": "9a96a2ea", "metadata": {}, "source": [ "scikit-fingerprints is a scikit-learn compatible library for computation of molecular fingerprints, with focus on ease of usage and efficiency. It's also called `skfp` for short, similarly to `sklearn`. It is based on hugely popular [RDKit](https://github.com/rdkit/rdkit) library.\n", "\n", "We use familiar scikit-learn interface with classes implementing `.fit()` and `.transform()` methods. This ease of usage is particularly powerful combined with our efficient and parallelized implementations of fingerprint algorithms.\n", "\n", "**Molecular fingerprints** are algorithms for vectorizing molecules. They turn a molecular graph, made of atoms and bonds, into a feature vector. It can then be used in any typical ML algorithms for classification, regression, clustering etc." ] }, { "cell_type": "markdown", "id": "cdd2bde5", "metadata": {}, "source": [ "### Practical introduction\n", "\n", "Typical ML task on molecules is **molecular property prediction**, which is basically molecular graph classification or regression. It's also known as [QSAR (quantitative structure-activity prediction)](https://en.wikipedia.org/wiki/Quantitative_structure%E2%80%93activity_relationship) or, more accurately, QSPR (quantitative structure-activity prediction).\n", "\n", "Molecules are typically stored in [SMILES text format](https://en.wikipedia.org/wiki/Simplified_Molecular_Input_Line_Entry_System), along with labels for prediction. [RDKit](https://github.com/rdkit/rdkit) reads them as `Mol` objects, and then scikit-fingerprints computes fingerprints for them. After computing fingerprints, we turn the problem of molecular graph classification into tabular classification.\n", "\n", "So a simple workflow looks like this:\n", "1. Store SMILES and labels in CSV file\n", "2. Read them and transform into RDKit `Mol` objects\n", "3. Split into training and testing subsets\n", "4. Compute molecular fingerprint for each molecule\n", "5. Use the resulting tabular dataset for classification\n", "\n", "Let's see an example with well-known [beta-secretase 1 (BACE) dataset](https://doi.org/10.1021/acs.jcim.6b00290), where we predict whether a drug inhibits the production of beta-secretase 1 enzyme, suspected to influence the development of Alzheimer's disease. It is a part of popular [MoleculeNet benchmark](https://doi.org/10.1039/C7SC02664A). It's integrated into scikit-fingerprints, so we can download and load the data with a single function.\n", "\n", "For train-test split, we'll use [scaffold split](https://www.oloren.ai/blog/scaff-split), which splits the molecules by their internal structure, known as Bemis-Murcko scaffold. This makes test molecules quite different from training ones, limiting data leakage.\n", "\n", "We compute the popular [Extended Connectivity Fingerprint (ECFP)](https://docs.chemaxon.com/display/docs/fingerprints_extended-connectivity-fingerprint-ecfp.md), also known as Morgan fingerprint. By default, it uses radius 2 (diameter 4, we call this ECFP4 fingerprints) and 2048 bits (dimensions). Then, we train Random Forest classifier on those features, and evaluate it using AUROC (Area Under Receiver Operating Characteristic curve).\n", "\n", "All those elements are described in [scikit-fingerprints documentation](https://scikit-fingerprints.github.io/scikit-fingerprints/index.html):\n", "- [BACE dataset](https://scikit-fingerprints.github.io/scikit-fingerprints/modules/datasets/generated/skfp.datasets.moleculenet.load_bace.html)\n", "- [scaffold split](https://scikit-fingerprints.github.io/scikit-fingerprints/modules/generated/skfp.model_selection.scaffold_train_test_split.html)\n", "- [ECFP fingerprint](https://scikit-fingerprints.github.io/scikit-fingerprints/modules/generated/skfp.fingerprints.ECFPFingerprint.html)" ] }, { "cell_type": "code", "execution_count": 1, "id": "a04d450a", "metadata": { "execution": { "iopub.execute_input": "2024-12-29T17:35:41.039230Z", "iopub.status.busy": "2024-12-29T17:35:41.038961Z", "iopub.status.idle": "2024-12-29T17:35:44.320140Z", "shell.execute_reply": "2024-12-29T17:35:44.319686Z", "shell.execute_reply.started": "2024-12-29T17:35:41.039213Z" }, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "AUROC: 78.25%\n" ] } ], "source": [ "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.metrics import roc_auc_score\n", "\n", "from skfp.datasets.moleculenet import load_bace\n", "from skfp.fingerprints import ECFPFingerprint\n", "from skfp.model_selection import scaffold_train_test_split\n", "from skfp.preprocessing import MolFromSmilesTransformer\n", "\n", "smiles_list, y = load_bace()\n", "\n", "mol_from_smiles = MolFromSmilesTransformer()\n", "mols = mol_from_smiles.transform(smiles_list)\n", "\n", "mols_train, mols_test, y_train, y_test = scaffold_train_test_split(\n", " mols, y, test_size=0.2\n", ")\n", "\n", "# there's no need to call .fit() on fingerprints, they have no learnable weights\n", "ecfp_fp = ECFPFingerprint()\n", "X_train = ecfp_fp.transform(mols_train)\n", "X_test = ecfp_fp.transform(mols_test)\n", "\n", "clf = RandomForestClassifier(random_state=0)\n", "clf.fit(X_train, y_train)\n", "\n", "y_pred = clf.predict_proba(X_test)[:, 1]\n", "auroc = roc_auc_score(y_test, y_pred)\n", "\n", "print(f\"AUROC: {auroc:.2%}\")" ] }, { "cell_type": "markdown", "id": "a7ab91ff-3421-4a33-bbac-90fdb35788e5", "metadata": {}, "source": [ "### Step-by-step analysis\n", "\n", "Let's analyze elements of this code more closely." ] }, { "cell_type": "markdown", "id": "a657897f", "metadata": {}, "source": [ "Dataset loader functions by default load a list of SMILES strings and labels as NumPy array. This is a simple, binary classification, so we get a vector of 0s and 1s." ] }, { "cell_type": "code", "execution_count": 2, "id": "87a59a66-25fd-43d6-87af-05336b245944", "metadata": { "execution": { "iopub.execute_input": "2024-12-29T17:35:44.320978Z", "iopub.status.busy": "2024-12-29T17:35:44.320795Z", "iopub.status.idle": "2024-12-29T17:35:44.731777Z", "shell.execute_reply": "2024-12-29T17:35:44.731216Z", "shell.execute_reply.started": "2024-12-29T17:35:44.320963Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "SMILES:\n", "['O1CC[C@@H](NC(=O)[C@@H](Cc2cc3cc(ccc3nc2N)-c2ccccc2C)C)CC1(C)C', 'Fc1cc(cc(F)c1)C[C@H](NC(=O)[C@@H](N1CC[C@](NC(=O)C)(CC(C)C)C1=O)CCc1ccccc1)[C@H](O)[C@@H]1[NH2+]C[C@H](OCCC)C1', 'S1(=O)(=O)N(c2cc(cc3c2n(cc3CC)CC1)C(=O)N[C@H]([C@H](O)C[NH2+]Cc1cc(OC)ccc1)Cc1ccccc1)C']\n", "\n", "Labels:\n", "[1 1 1]\n" ] } ], "source": [ "smiles_list, y = load_bace()\n", "print(\"SMILES:\")\n", "print(smiles_list[:3])\n", "print()\n", "print(\"Labels:\")\n", "print(y[:3])" ] }, { "cell_type": "markdown", "id": "cee42d26-00cc-4cde-aafd-2ea31d0c6a3c", "metadata": {}, "source": [ "RDKit `Mol` objects are the basic molecular graph representation, and we compute the fingerprints from them." ] }, { "cell_type": "code", "execution_count": 3, "id": "ff04b9cd-0993-4639-aa41-25ad2e772412", "metadata": { "execution": { "iopub.execute_input": "2024-12-29T17:35:44.732566Z", "iopub.status.busy": "2024-12-29T17:35:44.732351Z", "iopub.status.idle": "2024-12-29T17:35:44.735452Z", "shell.execute_reply": "2024-12-29T17:35:44.734798Z", "shell.execute_reply.started": "2024-12-29T17:35:44.732543Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Molecules:\n", "[, , ]\n" ] } ], "source": [ "print(\"Molecules:\")\n", "print(mols[:3])" ] }, { "cell_type": "markdown", "id": "12e80970-3a06-4fa2-8f0e-4def65a18467", "metadata": {}, "source": [ "Fingerprints are by default binary NumPy arrays. They are typically long, with some (e.g. ECFP) having the length as a hyperparameter." ] }, { "cell_type": "code", "execution_count": 4, "id": "5a16dc52-407b-4f53-ba4a-fd6a123eb815", "metadata": { "execution": { "iopub.execute_input": "2024-12-29T17:35:44.736603Z", "iopub.status.busy": "2024-12-29T17:35:44.736154Z", "iopub.status.idle": "2024-12-29T17:35:44.740763Z", "shell.execute_reply": "2024-12-29T17:35:44.740086Z", "shell.execute_reply.started": "2024-12-29T17:35:44.736575Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ECFP fingerprints:\n", "(1210, 2048)\n", "[[0 1 0 ... 0 0 0]\n", " [0 0 0 ... 0 0 0]\n", " [0 0 0 ... 0 0 0]]\n" ] } ], "source": [ "print(\"ECFP fingerprints:\")\n", "print(X_train.shape)\n", "print(X_train[:3])" ] }, { "cell_type": "markdown", "id": "ff705658-c906-4214-9393-2ffa34687222", "metadata": {}, "source": [ "From this point, the problem is just like any other tabular classification in scikit-learn." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.21" } }, "nbformat": 4, "nbformat_minor": 5 }