{ "cells": [ { "cell_type": "markdown", "id": "473e4b0b", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true } }, "source": [ "# Fingerprint types" ] }, { "cell_type": "markdown", "id": "9a96a2ea", "metadata": {}, "source": [ "scikit-fingerprints implements many different molecular fingerprints. We can roughly divide them into 3 types: descriptors, substructural, and hashed. Each is described in detail in the following sections.\n", "\n", "Of course, not all fingerprints fit neatly into this categorization, but it gives you a rough idea of what you can use.\n", "\n", "Most fingerprints operate on 2D, \"flat\" molecular graphs, also known as **topological** information. This is typically fast and suffices for most tasks. There are also **3D (conformational)** fingerprints, which calculate values based on the spatial structure of a molecule, known as a conformer. While they add more information, computing conformers is expensive, hard, and can even fail for some molecules. They are described in a separate tutorial; here we focus only on topological fingerprints.\n", "\n", "By default, fingerprints are **dense** NumPy arrays, which explicitly store all values, including zeros. However, many fingerprints are very **sparse**, i.e. the overwhelming majority of their values are zeros. Dense arrays are compatible with almost everything in the Python ML ecosystem, but storing all those zeros wastes memory. scikit-fingerprints can instead return SciPy sparse arrays when `sparse=True` is passed to the class constructor." ] }, { "cell_type": "markdown", "id": "2933abd8-01e7-4fbc-ba46-6a53e4e5900f", "metadata": {}, "source": [ "### Dataset\n", "\n", "We will use a well-known [beta-secretase 1 (BACE) dataset](https://doi.org/10.1021/acs.jcim.6b00290), where we predict whether a drug inhibits the beta-secretase 1 enzyme, which is suspected to influence the development of Alzheimer's disease. 
It is part of the popular [MoleculeNet benchmark](https://doi.org/10.1039/C7SC02664A)." ] }, { "cell_type": "code", "execution_count": 1, "id": "bb18205d-3082-457c-a283-98f030718a10", "metadata": { "execution": { "iopub.execute_input": "2024-12-29T17:13:14.780999Z", "iopub.status.busy": "2024-12-29T17:13:14.780602Z", "iopub.status.idle": "2024-12-29T17:13:17.783433Z", "shell.execute_reply": "2024-12-29T17:13:17.782891Z", "shell.execute_reply.started": "2024-12-29T17:13:14.780966Z" } }, "outputs": [], "source": [ "from skfp.datasets.moleculenet import load_bace\n", "from skfp.preprocessing import MolFromSmilesTransformer\n", "\n", "smiles_list, y = load_bace()\n", "\n", "mol_from_smiles = MolFromSmilesTransformer()\n", "mols = mol_from_smiles.transform(smiles_list)" ] }, { "cell_type": "markdown", "id": "cdd2bde5", "metadata": {}, "source": [ "### Descriptors\n", "\n", "**Descriptors** are sets of physicochemical properties of a molecule, e.g. number of heavy atoms (non-hydrogens), number of rings, estimated solubility, distributions of inter-atomic distances, and more. They are typically floating point numbers or counts of simple topological features (graph structure). They are often very interpretable, as each feature has a specific chemical meaning. Examples include 
[Mordred](https://scikit-fingerprints.github.io/scikit-fingerprints/modules/generated/skfp.fingerprints.MordredFingerprint.html) and [VSA](https://scikit-fingerprints.github.io/scikit-fingerprints/modules/generated/skfp.fingerprints.VSAFingerprint.html).\n", "\n", "**Pros:**\n", "- well-correlated with many global properties of a molecule\n", "- typically good performance for regression\n", "- interpretable\n", "\n", "**Cons:**\n", "- typically require feature selection\n", "- may have missing values\n", "- typically don't benefit from sparse arrays\n", "- some are very slow to compute\n", "\n", "Let's compute two descriptor fingerprints: [Mordred](https://scikit-fingerprints.github.io/scikit-fingerprints/modules/generated/skfp.fingerprints.MordredFingerprint.html) and [RDKit2DDescriptorsFingerprint](https://scikit-fingerprints.github.io/scikit-fingerprints/modules/generated/skfp.fingerprints.RDKit2DDescriptorsFingerprint.html). Mordred is a set of descriptors proposed in the [Mordred software publication](https://doi.org/10.1186/s13321-018-0258-y), and RDKit2DDescriptorsFingerprint is simply the collection of all topological descriptors available in RDKit.\n", "\n", "We will set a few options: `n_jobs=-1, batch_size=1, verbose=1`. Fingerprints can be computed for all molecules independently, so parallelism with multiple cores is very efficient. Setting `n_jobs=-1` uses all N available cores and, by default, divides the dataset into N equal-sized batches. `batch_size` gives us more fine-grained control, and combined with `verbose=1`, it shows a progress bar that lets us track progress molecule by molecule.\n", "\n", "Parallelism is very beneficial for descriptors, as there are typically a lot of them, e.g. Mordred has 1613 features, and they have to be computed one after another for each molecule. Even with parallelism, this will take a minute or more." 
] }, { "cell_type": "code", "execution_count": 2, "id": "921d018d-37bf-4795-8d66-2b34a51a47a0", "metadata": { "execution": { "iopub.execute_input": "2024-12-29T17:13:17.784160Z", "iopub.status.busy": "2024-12-29T17:13:17.783955Z", "iopub.status.idle": "2024-12-29T17:14:24.065818Z", "shell.execute_reply": "2024-12-29T17:14:24.065287Z", "shell.execute_reply.started": "2024-12-29T17:13:17.784145Z" } }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "56ef90ca46574874936780178ff119fe", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/1513 [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "1fab5c15f40a4474943eaa05f5b18eda", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/1513 [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Mordred shape: (1513, 1613)\n", "Mordred example values: [25.301933 18.76322 0. 0. 
40.36188 2.4430382\n", " 4.8860765 40.36188 1.2613088 4.398988 ]\n", "\n", "RDKit 2D descriptors shape: (1513, 200)\n", "RDKit 2D descriptors example values: [ 1.5215017 1138.9497 22.880104 19.44299 19.44299\n", " 15.215147 11.412099 11.412099 9.666773 9.666773 ]\n" ] } ], "source": [ "from skfp.fingerprints import MordredFingerprint, RDKit2DDescriptorsFingerprint\n", "\n", "fp_mordred = MordredFingerprint(n_jobs=-1, batch_size=1, verbose=1)\n", "fp_rdkit_2d = RDKit2DDescriptorsFingerprint(n_jobs=-1, batch_size=1, verbose=1)\n", "\n", "X_mordred = fp_mordred.transform(mols)\n", "X_rdkit_2d = fp_rdkit_2d.transform(mols)\n", "\n", "print(f\"Mordred shape: {X_mordred.shape}\")\n", "print(f\"Mordred example values: {X_mordred[0, :10]}\")\n", "print()\n", "print(f\"RDKit 2D descriptors shape: {X_rdkit_2d.shape}\")\n", "print(f\"RDKit 2D descriptors example values: {X_rdkit_2d[0, :10]}\")" ] }, { "cell_type": "markdown", "id": "37ca9cdc-23a4-48de-9581-9b2dc85347df", "metadata": {}, "source": [ "Some descriptors also have built-in feature names, so we can inspect them and use them, e.g. with explainable AI approaches like [SHAP](https://shap.readthedocs.io/en/latest/). To understand the exact meaning of those features, see the source publications listed in the documentation." ] }, { "cell_type": "code", "execution_count": 3, "id": "963b44e6-201c-4388-9d4f-9e50cc97b94a", "metadata": { "execution": { "iopub.execute_input": "2024-12-29T17:14:24.066631Z", "iopub.status.busy": "2024-12-29T17:14:24.066435Z", "iopub.status.idle": "2024-12-29T17:14:24.135063Z", "shell.execute_reply": "2024-12-29T17:14:24.134607Z", "shell.execute_reply.started": "2024-12-29T17:14:24.066610Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mordred feature names: ['ABC' 'ABCGG' 'nAcid' ... 'Zagreb2' 'mZagreb1' 'mZagreb2']\n", "\n" ] }, { "data": { "text/html": [ "
\n",
"| | ABC | ABCGG | nAcid | nBase | SpAbs_A | SpMax_A | SpDiam_A | SpAD_A | SpMAD_A | LogEE_A | ... | SRW10 | TSRW10 | MW | AMW | WPath | WPol | Zagreb1 | Zagreb2 | mZagreb1 | mZagreb2 |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| 0 | 25.301933 | 18.763220 | 0.0 | 0.0 | 40.361881 | 2.443038 | 4.886076 | 40.361881 | 1.261309 | 4.398988 | ... | 10.445841 | 68.262878 | 431.257263 | 6.634727 | 3233.0 | 52.0 | 172.0 | 201.0 | 10.923611 | 6.805555 |\n",
"| 1 | 36.076038 | 29.331556 | 0.0 | 1.0 | 58.299816 | 2.512416 | 4.903850 | 58.299816 | 1.240422 | 4.759111 | ... | 10.673619 | 100.694229 | 657.382202 | 6.707982 | 8159.0 | 72.0 | 240.0 | 278.0 | 17.118055 | 10.430555 |\n",
"| 2 | 32.828228 | 24.672882 | 0.0 | 1.0 | 54.320873 | 2.567860 | 5.012524 | 54.320873 | 1.293354 | 4.667789 | ... | 10.692922 | 94.122177 | 591.263550 | 7.299550 | 6374.0 | 74.0 | 224.0 | 266.0 | 13.756945 | 9.236111 |\n",
"| 3 | 31.013023 | 24.251154 | 0.0 | 1.0 | 47.993851 | 2.443327 | 4.886654 | 47.993851 | 1.199846 | 4.599177 | ... | 10.655257 | 77.270859 | 591.251038 | 7.484190 | 5853.0 | 64.0 | 210.0 | 241.0 | 17.048611 | 8.444445 |\n",
"| 4 | 34.657589 | 26.046909 | 0.0 | 1.0 | 55.686096 | 2.567861 | 5.012539 | 55.686096 | 1.265593 | 4.715560 | ... | 10.787399 | 96.491135 | 629.240417 | 7.865505 | 7300.0 | 78.0 | 238.0 | 282.0 | 15.569445 | 9.402778 |\n",
"| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |\n",
"| 1508 | 19.258291 | 15.801585 | 0.0 | 0.0 | 32.258286 | 2.484048 | 4.819666 | 32.258286 | 1.290331 | 4.141597 | ... | 10.035612 | 73.840050 | 364.166595 | 7.283332 | 1619.0 | 37.0 | 128.0 | 149.0 | 8.138889 | 5.611111 |\n",
"| 1509 | 19.258291 | 15.801585 | 0.0 | 0.0 | 32.258286 | 2.484048 | 4.819666 | 32.258286 | 1.290331 | 4.141597 | ... | 10.035612 | 73.840050 | 357.135651 | 7.936347 | 1619.0 | 37.0 | 128.0 | 149.0 | 8.138889 | 5.611111 |\n",
"| 1510 | 15.084601 | 13.327415 | 0.0 | 0.0 | 24.393719 | 2.519188 | 4.856795 | 24.393719 | 1.283880 | 3.924382 | ... | 10.022292 | 72.515038 | 319.032013 | 9.667637 | 730.0 | 29.0 | 104.0 | 125.0 | 6.638889 | 4.055555 |\n",
"| 1511 | 19.177412 | 15.661138 | 0.0 | 0.0 | 32.058102 | 2.524323 | 4.872796 | 32.058102 | 1.335754 | 4.155212 | ... | 10.236919 | 78.661003 | 317.152802 | 7.375647 | 1429.0 | 38.0 | 132.0 | 159.0 | 7.000000 | 5.166667 |\n",
"| 1512 | 16.470304 | 13.438284 | 0.0 | 0.0 | 26.983883 | 2.453935 | 4.742945 | 26.983883 | 1.284947 | 4.005924 | ... | 9.863030 | 74.256088 | 306.124725 | 7.653119 | 1074.0 | 28.0 | 110.0 | 128.0 | 6.527778 | 4.583333 |\n",
"\n",
"1513 rows × 1613 columns\n", "