{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "473e4b0b",
   "metadata": {
    "collapsed": true,
    "jupyter": {
     "outputs_hidden": true
    }
   },
   "source": [
    "# Fingerprint types"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9a96a2ea",
   "metadata": {},
   "source": [
    "scikit-fingerprints implements a lot of different molecular fingerprints. We can roughly divide them into 3 types: descriptors, substructural, and hashed. We describe them in detail in subsequent sections.\n",
    "\n",
    "Of course, not all fingerprints fit nicely into this categorization, but it gives you a rough idea what you can use. Further sections describe them in more detail.\n",
    "\n",
    "Most fingerprints operate on 2D, \"flat\" molecular graphs, also known as **topological** information. This is typically fast and suffices for most tasks. There are also **3D (conformational)** fingerprints, which calculate values based on the spatial structure of a molecule, known as a conformer. While they add more information, computing the conformations is expensive, hard, and can even fail for some molecules. They are described in a separate tutorial, and here we focus only on topological fingerprints.\n",
    "\n",
    "By default, fingerprints are NumPy **dense** arrays, which explicitly store all values, including zeros. However, many of them are very **sparse**, i.e. with overwhelming majority of zeros. NumPy arrays store them explicitly, and while this is compatible with almost everything in Python ML ecosystem, it also wastes this memory. scikit-fingerprints can also compute SciPy sparse arrays when `sparse=True` is specified in class constructor."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2933abd8-01e7-4fbc-ba46-6a53e4e5900f",
   "metadata": {},
   "source": [
    "### Dataset\n",
    "\n",
    "We will use a well-known [beta-secretase 1 (BACE) dataset](https://doi.org/10.1021/acs.jcim.6b00290), where we predict whether a drug inhibits the production of beta-secretase 1 enzyme, suspected to influence the development of Alzheimer's disease. It is a part of popular [MoleculeNet benchmark](https://doi.org/10.1039/C7SC02664A)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "bb18205d-3082-457c-a283-98f030718a10",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-12-29T17:13:14.780999Z",
     "iopub.status.busy": "2024-12-29T17:13:14.780602Z",
     "iopub.status.idle": "2024-12-29T17:13:17.783433Z",
     "shell.execute_reply": "2024-12-29T17:13:17.782891Z",
     "shell.execute_reply.started": "2024-12-29T17:13:14.780966Z"
    }
   },
   "outputs": [],
   "source": [
    "from skfp.datasets.moleculenet import load_bace\n",
    "from skfp.preprocessing import MolFromSmilesTransformer\n",
    "\n",
    "smiles_list, y = load_bace()\n",
    "\n",
    "mol_from_smiles = MolFromSmilesTransformer()\n",
    "mols = mol_from_smiles.transform(smiles_list)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cdd2bde5",
   "metadata": {},
   "source": [
    "### Descriptors\n",
    "\n",
    "**Descriptors** are sets of physicochemical properties of molecule, e.g. number of heavy atoms (non-hydrogens), number of rings, estimated solubility, distributions of inter-atomic distances, and more. Those are typically floating point numbers or counts of simple topology features (graph structure). They are often very interpretable, as each feature has a certain chemical meaning. Those are e.g. [Mordred](https://scikit-fingerprints.github.io/scikit-fingerprints/modules/generated/skfp.fingerprints.MordredFingerprint.html) and [VSA](https://scikit-fingerprints.github.io/scikit-fingerprints/modules/generated/skfp.fingerprints.VSAFingerprint.html).\n",
    "\n",
    "**Pros:**\n",
    "- well-correlated with many global properties of molecule\n",
    "- typically good performance for regression\n",
    "- interpretable\n",
    "\n",
    "**Cons:**\n",
    "- typically require feature selection\n",
    "- may have missing values\n",
    "- typically don't benefit from sparse arrays\n",
    "- some are very slow to compute\n",
    "\n",
    "Let's compute two descriptor fingerprints: [Mordred](https://scikit-fingerprints.github.io/scikit-fingerprints/modules/generated/skfp.fingerprints.MordredFingerprint.html) and [RDKit2DDescriptorsFingerprint](https://scikit-fingerprints.github.io/scikit-fingerprints/modules/generated/skfp.fingerprints.RDKit2DDescriptorsFingerprint.html). Mordred is a set of descriptors proposed in the [Mordred software publication](https://doi.org/10.1186/s13321-018-0258-y), and the RDKit2DDescriptorsFingerprint is simply the collection of all topological descriptors available in RDKit.\n",
    "\n",
    "We will set a few options: `n_jobs=-1, batch_size=1, verbose=1`. Fingerprints can be computed for all molecules independently, so parallelism with multiple cores is very efficient. Setting `n_jobs=-1` by default uses all available N cores, dividing the dataset into N equal-sized batches. `batch_size` gives us more fine-grained control, and combined with `verbose=True`, it will show a nice progress bar, allowing us to check the progress molecule by molecule.\n",
    "\n",
    "Parallelism is very beneficial for descriptors, as there are typically a lot of them, e.g. Mordred has 1613 features. They have to be computed one after another for each molecule. Nevertheless, this will take a minute or more."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "921d018d-37bf-4795-8d66-2b34a51a47a0",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-12-29T17:13:17.784160Z",
     "iopub.status.busy": "2024-12-29T17:13:17.783955Z",
     "iopub.status.idle": "2024-12-29T17:14:24.065818Z",
     "shell.execute_reply": "2024-12-29T17:14:24.065287Z",
     "shell.execute_reply.started": "2024-12-29T17:13:17.784145Z"
    }
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "56ef90ca46574874936780178ff119fe",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/1513 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "1fab5c15f40a4474943eaa05f5b18eda",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/1513 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Mordred shape: (1513, 1613)\n",
      "Mordred example values: [25.301933  18.76322    0.         0.        40.36188    2.4430382\n",
      "  4.8860765 40.36188    1.2613088  4.398988 ]\n",
      "\n",
      "RDKit 2D descriptors shape: (1513, 200)\n",
      "RDKit 2D descriptors example values: [   1.5215017 1138.9497      22.880104    19.44299     19.44299\n",
      "   15.215147    11.412099    11.412099     9.666773     9.666773 ]\n"
     ]
    }
   ],
   "source": [
    "from skfp.fingerprints import MordredFingerprint, RDKit2DDescriptorsFingerprint\n",
    "\n",
    "fp_mordred = MordredFingerprint(n_jobs=-1, batch_size=1, verbose=1)\n",
    "fp_rdkit_2d = RDKit2DDescriptorsFingerprint(n_jobs=-1, batch_size=1, verbose=1)\n",
    "\n",
    "X_mordred = fp_mordred.transform(mols)\n",
    "X_rdkit_2d = fp_rdkit_2d.transform(mols)\n",
    "\n",
    "print(f\"Mordred shape: {X_mordred.shape}\")\n",
    "print(f\"Mordred example values: {X_mordred[0, :10]}\")\n",
    "print()\n",
    "print(f\"RDKit 2D descriptors shape: {X_rdkit_2d.shape}\")\n",
    "print(f\"RDKit 2D descriptors example values: {X_rdkit_2d[0, :10]}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "37ca9cdc-23a4-48de-9581-9b2dc85347df",
   "metadata": {},
   "source": [
    "Some descriptors also have built-in feature names, so we can inspect them and use them, e.g. with explainable AI approaches like [SHAP](https://shap.readthedocs.io/en/latest/). To understand the exact meaning of those features, see the source publications listed in documentation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "963b44e6-201c-4388-9d4f-9e50cc97b94a",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-12-29T17:14:24.066631Z",
     "iopub.status.busy": "2024-12-29T17:14:24.066435Z",
     "iopub.status.idle": "2024-12-29T17:14:24.135063Z",
     "shell.execute_reply": "2024-12-29T17:14:24.134607Z",
     "shell.execute_reply.started": "2024-12-29T17:14:24.066610Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Mordred feature names: ['ABC' 'ABCGG' 'nAcid' ... 'Zagreb2' 'mZagreb1' 'mZagreb2']\n",
      "\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ABC</th>\n",
       "      <th>ABCGG</th>\n",
       "      <th>nAcid</th>\n",
       "      <th>nBase</th>\n",
       "      <th>SpAbs_A</th>\n",
       "      <th>SpMax_A</th>\n",
       "      <th>SpDiam_A</th>\n",
       "      <th>SpAD_A</th>\n",
       "      <th>SpMAD_A</th>\n",
       "      <th>LogEE_A</th>\n",
       "      <th>...</th>\n",
       "      <th>SRW10</th>\n",
       "      <th>TSRW10</th>\n",
       "      <th>MW</th>\n",
       "      <th>AMW</th>\n",
       "      <th>WPath</th>\n",
       "      <th>WPol</th>\n",
       "      <th>Zagreb1</th>\n",
       "      <th>Zagreb2</th>\n",
       "      <th>mZagreb1</th>\n",
       "      <th>mZagreb2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>25.301933</td>\n",
       "      <td>18.763220</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>40.361881</td>\n",
       "      <td>2.443038</td>\n",
       "      <td>4.886076</td>\n",
       "      <td>40.361881</td>\n",
       "      <td>1.261309</td>\n",
       "      <td>4.398988</td>\n",
       "      <td>...</td>\n",
       "      <td>10.445841</td>\n",
       "      <td>68.262878</td>\n",
       "      <td>431.257263</td>\n",
       "      <td>6.634727</td>\n",
       "      <td>3233.0</td>\n",
       "      <td>52.0</td>\n",
       "      <td>172.0</td>\n",
       "      <td>201.0</td>\n",
       "      <td>10.923611</td>\n",
       "      <td>6.805555</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>36.076038</td>\n",
       "      <td>29.331556</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>58.299816</td>\n",
       "      <td>2.512416</td>\n",
       "      <td>4.903850</td>\n",
       "      <td>58.299816</td>\n",
       "      <td>1.240422</td>\n",
       "      <td>4.759111</td>\n",
       "      <td>...</td>\n",
       "      <td>10.673619</td>\n",
       "      <td>100.694229</td>\n",
       "      <td>657.382202</td>\n",
       "      <td>6.707982</td>\n",
       "      <td>8159.0</td>\n",
       "      <td>72.0</td>\n",
       "      <td>240.0</td>\n",
       "      <td>278.0</td>\n",
       "      <td>17.118055</td>\n",
       "      <td>10.430555</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>32.828228</td>\n",
       "      <td>24.672882</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>54.320873</td>\n",
       "      <td>2.567860</td>\n",
       "      <td>5.012524</td>\n",
       "      <td>54.320873</td>\n",
       "      <td>1.293354</td>\n",
       "      <td>4.667789</td>\n",
       "      <td>...</td>\n",
       "      <td>10.692922</td>\n",
       "      <td>94.122177</td>\n",
       "      <td>591.263550</td>\n",
       "      <td>7.299550</td>\n",
       "      <td>6374.0</td>\n",
       "      <td>74.0</td>\n",
       "      <td>224.0</td>\n",
       "      <td>266.0</td>\n",
       "      <td>13.756945</td>\n",
       "      <td>9.236111</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>31.013023</td>\n",
       "      <td>24.251154</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>47.993851</td>\n",
       "      <td>2.443327</td>\n",
       "      <td>4.886654</td>\n",
       "      <td>47.993851</td>\n",
       "      <td>1.199846</td>\n",
       "      <td>4.599177</td>\n",
       "      <td>...</td>\n",
       "      <td>10.655257</td>\n",
       "      <td>77.270859</td>\n",
       "      <td>591.251038</td>\n",
       "      <td>7.484190</td>\n",
       "      <td>5853.0</td>\n",
       "      <td>64.0</td>\n",
       "      <td>210.0</td>\n",
       "      <td>241.0</td>\n",
       "      <td>17.048611</td>\n",
       "      <td>8.444445</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>34.657589</td>\n",
       "      <td>26.046909</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>55.686096</td>\n",
       "      <td>2.567861</td>\n",
       "      <td>5.012539</td>\n",
       "      <td>55.686096</td>\n",
       "      <td>1.265593</td>\n",
       "      <td>4.715560</td>\n",
       "      <td>...</td>\n",
       "      <td>10.787399</td>\n",
       "      <td>96.491135</td>\n",
       "      <td>629.240417</td>\n",
       "      <td>7.865505</td>\n",
       "      <td>7300.0</td>\n",
       "      <td>78.0</td>\n",
       "      <td>238.0</td>\n",
       "      <td>282.0</td>\n",
       "      <td>15.569445</td>\n",
       "      <td>9.402778</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1508</th>\n",
       "      <td>19.258291</td>\n",
       "      <td>15.801585</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>32.258286</td>\n",
       "      <td>2.484048</td>\n",
       "      <td>4.819666</td>\n",
       "      <td>32.258286</td>\n",
       "      <td>1.290331</td>\n",
       "      <td>4.141597</td>\n",
       "      <td>...</td>\n",
       "      <td>10.035612</td>\n",
       "      <td>73.840050</td>\n",
       "      <td>364.166595</td>\n",
       "      <td>7.283332</td>\n",
       "      <td>1619.0</td>\n",
       "      <td>37.0</td>\n",
       "      <td>128.0</td>\n",
       "      <td>149.0</td>\n",
       "      <td>8.138889</td>\n",
       "      <td>5.611111</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1509</th>\n",
       "      <td>19.258291</td>\n",
       "      <td>15.801585</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>32.258286</td>\n",
       "      <td>2.484048</td>\n",
       "      <td>4.819666</td>\n",
       "      <td>32.258286</td>\n",
       "      <td>1.290331</td>\n",
       "      <td>4.141597</td>\n",
       "      <td>...</td>\n",
       "      <td>10.035612</td>\n",
       "      <td>73.840050</td>\n",
       "      <td>357.135651</td>\n",
       "      <td>7.936347</td>\n",
       "      <td>1619.0</td>\n",
       "      <td>37.0</td>\n",
       "      <td>128.0</td>\n",
       "      <td>149.0</td>\n",
       "      <td>8.138889</td>\n",
       "      <td>5.611111</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1510</th>\n",
       "      <td>15.084601</td>\n",
       "      <td>13.327415</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>24.393719</td>\n",
       "      <td>2.519188</td>\n",
       "      <td>4.856795</td>\n",
       "      <td>24.393719</td>\n",
       "      <td>1.283880</td>\n",
       "      <td>3.924382</td>\n",
       "      <td>...</td>\n",
       "      <td>10.022292</td>\n",
       "      <td>72.515038</td>\n",
       "      <td>319.032013</td>\n",
       "      <td>9.667637</td>\n",
       "      <td>730.0</td>\n",
       "      <td>29.0</td>\n",
       "      <td>104.0</td>\n",
       "      <td>125.0</td>\n",
       "      <td>6.638889</td>\n",
       "      <td>4.055555</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1511</th>\n",
       "      <td>19.177412</td>\n",
       "      <td>15.661138</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>32.058102</td>\n",
       "      <td>2.524323</td>\n",
       "      <td>4.872796</td>\n",
       "      <td>32.058102</td>\n",
       "      <td>1.335754</td>\n",
       "      <td>4.155212</td>\n",
       "      <td>...</td>\n",
       "      <td>10.236919</td>\n",
       "      <td>78.661003</td>\n",
       "      <td>317.152802</td>\n",
       "      <td>7.375647</td>\n",
       "      <td>1429.0</td>\n",
       "      <td>38.0</td>\n",
       "      <td>132.0</td>\n",
       "      <td>159.0</td>\n",
       "      <td>7.000000</td>\n",
       "      <td>5.166667</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1512</th>\n",
       "      <td>16.470304</td>\n",
       "      <td>13.438284</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>26.983883</td>\n",
       "      <td>2.453935</td>\n",
       "      <td>4.742945</td>\n",
       "      <td>26.983883</td>\n",
       "      <td>1.284947</td>\n",
       "      <td>4.005924</td>\n",
       "      <td>...</td>\n",
       "      <td>9.863030</td>\n",
       "      <td>74.256088</td>\n",
       "      <td>306.124725</td>\n",
       "      <td>7.653119</td>\n",
       "      <td>1074.0</td>\n",
       "      <td>28.0</td>\n",
       "      <td>110.0</td>\n",
       "      <td>128.0</td>\n",
       "      <td>6.527778</td>\n",
       "      <td>4.583333</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>1513 rows × 1613 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "            ABC      ABCGG  nAcid  nBase    SpAbs_A   SpMax_A  SpDiam_A  \\\n",
       "0     25.301933  18.763220    0.0    0.0  40.361881  2.443038  4.886076   \n",
       "1     36.076038  29.331556    0.0    1.0  58.299816  2.512416  4.903850   \n",
       "2     32.828228  24.672882    0.0    1.0  54.320873  2.567860  5.012524   \n",
       "3     31.013023  24.251154    0.0    1.0  47.993851  2.443327  4.886654   \n",
       "4     34.657589  26.046909    0.0    1.0  55.686096  2.567861  5.012539   \n",
       "...         ...        ...    ...    ...        ...       ...       ...   \n",
       "1508  19.258291  15.801585    0.0    0.0  32.258286  2.484048  4.819666   \n",
       "1509  19.258291  15.801585    0.0    0.0  32.258286  2.484048  4.819666   \n",
       "1510  15.084601  13.327415    0.0    0.0  24.393719  2.519188  4.856795   \n",
       "1511  19.177412  15.661138    0.0    0.0  32.058102  2.524323  4.872796   \n",
       "1512  16.470304  13.438284    0.0    0.0  26.983883  2.453935  4.742945   \n",
       "\n",
       "         SpAD_A   SpMAD_A   LogEE_A  ...      SRW10      TSRW10          MW  \\\n",
       "0     40.361881  1.261309  4.398988  ...  10.445841   68.262878  431.257263   \n",
       "1     58.299816  1.240422  4.759111  ...  10.673619  100.694229  657.382202   \n",
       "2     54.320873  1.293354  4.667789  ...  10.692922   94.122177  591.263550   \n",
       "3     47.993851  1.199846  4.599177  ...  10.655257   77.270859  591.251038   \n",
       "4     55.686096  1.265593  4.715560  ...  10.787399   96.491135  629.240417   \n",
       "...         ...       ...       ...  ...        ...         ...         ...   \n",
       "1508  32.258286  1.290331  4.141597  ...  10.035612   73.840050  364.166595   \n",
       "1509  32.258286  1.290331  4.141597  ...  10.035612   73.840050  357.135651   \n",
       "1510  24.393719  1.283880  3.924382  ...  10.022292   72.515038  319.032013   \n",
       "1511  32.058102  1.335754  4.155212  ...  10.236919   78.661003  317.152802   \n",
       "1512  26.983883  1.284947  4.005924  ...   9.863030   74.256088  306.124725   \n",
       "\n",
       "           AMW   WPath  WPol  Zagreb1  Zagreb2   mZagreb1   mZagreb2  \n",
       "0     6.634727  3233.0  52.0    172.0    201.0  10.923611   6.805555  \n",
       "1     6.707982  8159.0  72.0    240.0    278.0  17.118055  10.430555  \n",
       "2     7.299550  6374.0  74.0    224.0    266.0  13.756945   9.236111  \n",
       "3     7.484190  5853.0  64.0    210.0    241.0  17.048611   8.444445  \n",
       "4     7.865505  7300.0  78.0    238.0    282.0  15.569445   9.402778  \n",
       "...        ...     ...   ...      ...      ...        ...        ...  \n",
       "1508  7.283332  1619.0  37.0    128.0    149.0   8.138889   5.611111  \n",
       "1509  7.936347  1619.0  37.0    128.0    149.0   8.138889   5.611111  \n",
       "1510  9.667637   730.0  29.0    104.0    125.0   6.638889   4.055555  \n",
       "1511  7.375647  1429.0  38.0    132.0    159.0   7.000000   5.166667  \n",
       "1512  7.653119  1074.0  28.0    110.0    128.0   6.527778   4.583333  \n",
       "\n",
       "[1513 rows x 1613 columns]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "print(f\"Mordred feature names: {fp_mordred.get_feature_names_out()}\")\n",
    "print()\n",
    "\n",
    "df_mordred = pd.DataFrame(X_mordred, columns=fp_mordred.get_feature_names_out())\n",
    "df_mordred"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f94f0f91-572a-4c21-9da3-93615a39f8c2",
   "metadata": {},
   "source": [
    "### Substructure fingerprints\n",
    "\n",
    "**Substructure fingerprints** check for existence of selected substructures (subgraphs, patterns) in a molecule, such as functional groups, rings of given size, or counts of atoms of particular element. They are often hand-crafted and selected by domain experts, reflecting which parts of molecule is typically interesting to e.g. medicinal chemists. Substructures are typically described using [SMARTS patterns](https://www.daylight.com/dayhtml/doc/theory/theory.smarts.html), which can be though of as a kind of \"regular expressions\" for molecule structures. Examples include [MACCS fingerprint](https://scikit-fingerprints.github.io/scikit-fingerprints/modules/generated/skfp.fingerprints.MACCSFingerprint.html) and [PubChem fingerprint](https://scikit-fingerprints.github.io/scikit-fingerprints/modules/generated/skfp.fingerprints.PubChemFingerprint.html).\n",
    "\n",
    "Like descriptors, they have a constant length, but they result in integer-valued vectors. We can distinguish **binary** and **count** variants of those fingerprints. Binary ones only check for existence of a given pattern, and often are created as if/else conditions, e.g. \"number of oxygens >= 4\" or \"is there a ring of size 6?\". Count variants instead use the number of occurrences of a substructure, e.g. \"number of oxygens\" or \"number of rings of size 6\". Due to this difference, the count version may have less features. Most of those are scikit-fingerprints novel propositions.\n",
    "\n",
    "**Pros:**\n",
    "- typically good substructure searching performance\n",
    "- quite interpretable (if you understand SMARTS patterns)\n",
    "- some are very sparse, and benefit a lot from sparse arrays\n",
    "\n",
    "**Cons:**\n",
    "- don't generalize well outside the chemical space they've been designed for\n",
    "- longer ones are quite slow\n",
    "\n",
    "We'll compute [MACCS fingerprint](https://scikit-fingerprints.github.io/scikit-fingerprints/modules/generated/skfp.fingerprints.MACCSFingerprint.html) and [PubChem fingerprint](https://scikit-fingerprints.github.io/scikit-fingerprints/modules/generated/skfp.fingerprints.PubChemFingerprint.html), in both binary and count versions. SMARTS pattern matching can be quite slow, so those fingerprints often benefit from parallelism, similarly to descriptors."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "cf6066ec-c9aa-4725-a1fd-aa502bdb024d",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-12-29T17:14:24.135800Z",
     "iopub.status.busy": "2024-12-29T17:14:24.135614Z",
     "iopub.status.idle": "2024-12-29T17:14:31.567718Z",
     "shell.execute_reply": "2024-12-29T17:14:31.567127Z",
     "shell.execute_reply.started": "2024-12-29T17:14:24.135782Z"
    }
   },
   "outputs": [],
   "source": [
    "from skfp.fingerprints import MACCSFingerprint, PubChemFingerprint\n",
    "\n",
    "fp_maccs = MACCSFingerprint(n_jobs=-1)\n",
    "fp_maccs_count = MACCSFingerprint(count=True, n_jobs=-1)\n",
    "\n",
    "fp_pubchem = PubChemFingerprint(n_jobs=-1)\n",
    "fp_pubchem_count = PubChemFingerprint(count=True, n_jobs=-1)\n",
    "\n",
    "X_maccs = fp_maccs.transform(mols)\n",
    "X_maccs_count = fp_maccs_count.transform(mols)\n",
    "\n",
    "X_pubchem = fp_pubchem.transform(mols)\n",
    "X_pubchem_count = fp_pubchem_count.transform(mols)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "e572fcf4-956b-47d5-bdb4-6f8ff799e319",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-12-29T17:14:31.568476Z",
     "iopub.status.busy": "2024-12-29T17:14:31.568303Z",
     "iopub.status.idle": "2024-12-29T17:14:31.573780Z",
     "shell.execute_reply": "2024-12-29T17:14:31.572994Z",
     "shell.execute_reply.started": "2024-12-29T17:14:31.568459Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Binary MACCS:\n",
      "Shape: (1513, 166)\n",
      "Example values: [1 1 1 1 1 1 1 1 1 0]\n",
      "\n",
      "Count MACCS:\n",
      "Shape: (1513, 159)\n",
      "Example values: [2 1 2 2 0 2 3 1 1 4]\n",
      "\n",
      "Binary PubChem:\n",
      "Shape: (1513, 881)\n",
      "Example values: [1 1 1 0 0 0 0 0 0 1]\n",
      "\n",
      "Count PubChem:\n",
      "Shape: (1513, 757)\n",
      "Example values: [31  0  0 27  3  2  0  0  0  0]\n",
      "\n"
     ]
    }
   ],
   "source": [
    "print(\"Binary MACCS:\")\n",
    "print(f\"Shape: {X_maccs.shape}\")\n",
    "print(f\"Example values: {X_maccs[0, -10:]}\")\n",
    "print()\n",
    "print(\"Count MACCS:\")\n",
    "print(f\"Shape: {X_maccs_count.shape}\")\n",
    "print(f\"Example values: {X_maccs_count[0, -10:]}\")\n",
    "print()\n",
    "print(\"Binary PubChem:\")\n",
    "print(f\"Shape: {X_pubchem.shape}\")\n",
    "print(f\"Example values: {X_pubchem[0, :10]}\")\n",
    "print()\n",
    "print(\"Count PubChem:\")\n",
    "print(f\"Shape: {X_pubchem_count.shape}\")\n",
    "print(f\"Example values: {X_pubchem_count[0, :10]}\")\n",
    "print()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cd96cb99-19bf-43b1-9ebd-2ad694aed4a3",
   "metadata": {},
   "source": [
    "### Hashed fingerprints\n",
    "\n",
    "**Hashed fingerprints** extract all subgraphs of general shape, such as shortest paths between pairs of atoms (linear subgraphs) or neighborhoods of bonded atoms (circular fingerprints). From each substructure, an integer identifier is then computed. Then we use the hashing function (hence the name), which translates the identifier into an index in the final vector, where we put information that subgraph was detected. Those are e.g. [ECFP fingerprint](https://scikit-fingerprints.github.io/scikit-fingerprints/modules/generated/skfp.fingerprints.ECFPFingerprint.html) and [Atom Pair fingerprint](https://scikit-fingerprints.github.io/scikit-fingerprints/modules/generated/skfp.fingerprints.AtomPairFingerprint.html).\n",
    "\n",
    "Initially, each atom gets assigned a numerical identifier, called atom invariant. It combines a few basic properties like e.g. atomic number, charge, and atomic mass, into a single integer value (typically with XOR function). This allows us to distinguish e.g. carbon in different contexts.\n",
    "\n",
    "Then, subgraphs are computed, with their shape depending on a fingerprint. For example, [ECFP](https://scikit-fingerprints.github.io/scikit-fingerprints/modules/generated/skfp.fingerprints.ECFPFingerprint.html) uses circular neighborhood, e.g. atom with neighbors (bonded atoms), then with their neighbors (radius 2 neighborhood), up to a given radius (by default 2). The identifiers of atoms and bonds in the subgraph are combined into a single identifier of a whole substructure.\n",
    "\n",
    "Output vector starts with only zeros. It has a given length (also called \"number of bits\"), which is a common hyperparameter of all hashed fingerprints. Subgraph identifiers are hashed into it, translating subgraph identifier into index of a vector, e.g. with a [modulo function](https://en.wikipedia.org/wiki/Modulo). Hashing collisions may occur, when two distinct substructures get the same index, but this is typically not a big problem. Binary variant ignores such collisions (just marks 1 at a given position), and count version sums up all occurrences at each index.\n",
    "\n",
    "**Pros:**\n",
    "- very flexible\n",
    "- typically good performance for classification\n",
    "- fast to compute\n",
    "- very sparse, and benefit a lot from sparse arrays\n",
    "\n",
    "**Cons:**\n",
    "- not interpretable\n",
    "\n",
    "Let's check out [ECFP fingerprint](https://scikit-fingerprints.github.io/scikit-fingerprints/modules/generated/skfp.fingerprints.ECFPFingerprint.html) and [Atom Pair fingerprint](https://scikit-fingerprints.github.io/scikit-fingerprints/modules/generated/skfp.fingerprints.AtomPairFingerprint.html). We will create default versions with length 2048, and short ones with 1024. Those fingerprints are often so fast to compute they don't benefit from parallelism at all."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "570ad67f-19d7-4312-a70e-64a00196915d",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-12-29T17:14:31.575948Z",
     "iopub.status.busy": "2024-12-29T17:14:31.575603Z",
     "iopub.status.idle": "2024-12-29T17:14:32.242811Z",
     "shell.execute_reply": "2024-12-29T17:14:32.242220Z",
     "shell.execute_reply.started": "2024-12-29T17:14:31.575924Z"
    }
   },
   "outputs": [],
   "source": [
    "from skfp.fingerprints import AtomPairFingerprint, ECFPFingerprint\n",
    "\n",
    "fp_ecfp = ECFPFingerprint()\n",
    "fp_ecfp_short = ECFPFingerprint(fp_size=1024)\n",
    "\n",
    "fp_ap = AtomPairFingerprint()\n",
    "fp_ap_short = AtomPairFingerprint(fp_size=1024)\n",
    "\n",
    "X_ecfp = fp_ecfp.transform(mols)\n",
    "X_ecfp_short = fp_ecfp_short.transform(mols)\n",
    "\n",
    "X_ap = fp_ap.transform(mols)\n",
    "X_ap_short = fp_ap_short.transform(mols)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "69d12b82-a958-45c3-80dd-653b0f4b6018",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-12-29T17:14:32.243704Z",
     "iopub.status.busy": "2024-12-29T17:14:32.243517Z",
     "iopub.status.idle": "2024-12-29T17:14:32.248958Z",
     "shell.execute_reply": "2024-12-29T17:14:32.248320Z",
     "shell.execute_reply.started": "2024-12-29T17:14:32.243691Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ECFP:\n",
      "Shape: (1513, 2048)\n",
      "Example values: [0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]\n",
      "\n",
      "Short ECFP:\n",
      "Shape: (1513, 1024)\n",
      "Example values: [0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]\n",
      "\n",
      "Atom Pair:\n",
      "Shape: (1513, 2048)\n",
      "Example values: [1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1]\n",
      "\n",
      "Short Atom Pair:\n",
      "Shape: (1513, 1024)\n",
      "Example values: [1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 1 0 0 0 1]\n",
      "\n"
     ]
    }
   ],
   "source": [
    "print(\"ECFP:\")\n",
    "print(f\"Shape: {X_ecfp.shape}\")\n",
    "print(f\"Example values: {X_ecfp[0, :25]}\")\n",
    "print()\n",
    "print(\"Short ECFP:\")\n",
    "print(f\"Shape: {X_ecfp_short.shape}\")\n",
    "print(f\"Example values: {X_ecfp_short[0, :25]}\")\n",
    "print()\n",
    "print(\"Atom Pair:\")\n",
    "print(f\"Shape: {X_ap.shape}\")\n",
    "print(f\"Example values: {X_ap[0, :25]}\")\n",
    "print()\n",
    "print(\"Short Atom Pair:\")\n",
    "print(f\"Shape: {X_ap_short.shape}\")\n",
    "print(f\"Example values: {X_ap_short[0, :25]}\")\n",
    "print()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5721fd10-6ce6-4851-8291-7bad18a9381d",
   "metadata": {},
   "source": [
    "### Comparison\n",
    "\n",
    "Let's compare the efficiency of those fingerprints: compute time, memory requirements, and classifier performance. Of course, this is just a single dataset comparison, but we'll already be able to see some general trends.\n",
    "\n",
    "For compute time, we'll use parallel versions for descriptors and substructure fingerprints, since this is a typically recommended approach. Hashed fingerprints for typically sized datasets are so fast that spawning and managing processes is slower than any gain from multiple cores. Reported time will be the average of 3 runs. Running this takes a few minutes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "dd1116f7-fb21-470a-9a2f-08202b817cc0",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-12-29T17:14:32.249998Z",
     "iopub.status.busy": "2024-12-29T17:14:32.249638Z",
     "iopub.status.idle": "2024-12-29T17:18:10.884696Z",
     "shell.execute_reply": "2024-12-29T17:18:10.884136Z",
     "shell.execute_reply.started": "2024-12-29T17:14:32.249977Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Mordred: 65.202s\n",
      "RDKit 2D descriptors: 3.350s\n",
      "MACCS: 0.278s\n",
      "PubChem: 3.699s\n",
      "ECFP: 0.123s\n",
      "Atom Pair: 0.221s\n"
     ]
    }
   ],
   "source": [
    "from time import time\n",
    "\n",
    "import numpy as np\n",
    "from rdkit.Chem import Mol\n",
    "\n",
    "\n",
    "def get_fp_compute_time(fp_obj, mols: list[Mol]) -> float:\n",
    "    times = []\n",
    "    for _ in range(3):\n",
    "        start = time()\n",
    "        fp_obj.transform(mols)\n",
    "        end = time()\n",
    "        times.append(end - start)\n",
    "\n",
    "    return np.mean(times)\n",
    "\n",
    "\n",
    "for fp_name, fp_obj in [\n",
    "    (\"Mordred\", MordredFingerprint(n_jobs=-1)),\n",
    "    (\"RDKit 2D descriptors\", RDKit2DDescriptorsFingerprint(n_jobs=-1)),\n",
    "    (\"MACCS\", MACCSFingerprint(n_jobs=-1)),\n",
    "    (\"PubChem\", PubChemFingerprint(n_jobs=-1)),\n",
    "    (\"ECFP\", ECFPFingerprint()),\n",
    "    (\"Atom Pair\", AtomPairFingerprint()),\n",
    "]:\n",
    "    fp_time = get_fp_compute_time(fp_obj, mols)\n",
    "    print(f\"{fp_name}: {fp_time:.3f}s\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "79b50915-ed80-4828-a675-e8f623ecd563",
   "metadata": {},
   "source": [
    "For measuring memory, we'll check the size of dense NumPy array, as well as sparse SciPy array, in KB. Then we'll be able to compare how much those fingerprints benefit from sparsity. Since computing sparse arrays has basically no impact on time, we can simply compute sparse arrays and then convert them to dense ones."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "aa236565-3a1f-4e92-810e-5a078ce4449f",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-12-29T17:32:12.788728Z",
     "iopub.status.busy": "2024-12-29T17:32:12.788104Z",
     "iopub.status.idle": "2024-12-29T17:33:24.788496Z",
     "shell.execute_reply": "2024-12-29T17:33:24.787989Z",
     "shell.execute_reply.started": "2024-12-29T17:32:12.788697Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Mordred: dense 9533 B, sparse 7788 B, 1.2x reduction\n",
      "RDKit 2D descriptors: dense 1182 B, sparse 634 B, 1.9x reduction\n",
      "MACCS: dense 245 B, sparse 90 B, 2.7x reduction\n",
      "PubChem: dense 1301 B, sparse 250 B, 5.2x reduction\n",
      "ECFP: dense 3026 B, sparse 89 B, 34.0x reduction\n",
      "Atom Pair: dense 3026 B, sparse 596 B, 5.1x reduction\n"
     ]
    }
   ],
   "source": [
    "def get_fp_memory_sizes(fp_obj, mols: list[Mol]) -> tuple[float, float]:\n",
    "    X_sparse = fp_obj.transform(mols)\n",
    "    X_dense = X_sparse.todense()\n",
    "\n",
    "    # transform B -> KB\n",
    "    memory_dense = X_dense.nbytes // 1024\n",
    "    memory_sparse = X_sparse.data.nbytes // 1024\n",
    "\n",
    "    return memory_dense, memory_sparse\n",
    "\n",
    "\n",
    "for fp_name, fp_obj in [\n",
    "    (\"Mordred\", MordredFingerprint(sparse=True, n_jobs=-1)),\n",
    "    (\"RDKit 2D descriptors\", RDKit2DDescriptorsFingerprint(sparse=True, n_jobs=-1)),\n",
    "    (\"MACCS\", MACCSFingerprint(sparse=True, n_jobs=-1)),\n",
    "    (\"PubChem\", PubChemFingerprint(sparse=True, n_jobs=-1)),\n",
    "    (\"ECFP\", ECFPFingerprint(sparse=True)),\n",
    "    (\"Atom Pair\", AtomPairFingerprint(sparse=True)),\n",
    "]:\n",
    "    memory_dense, memory_sparse = get_fp_memory_sizes(fp_obj, mols)\n",
    "    reduction = memory_dense / memory_sparse\n",
    "    print(\n",
    "        f\"{fp_name}: dense {memory_dense} B, sparse {memory_sparse} B, {reduction:.1f}x reduction\"\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a2fd6d70-1479-46e0-81bd-5b6f146f7db2",
   "metadata": {},
   "source": [
    "For evaluating the efficiency of downstream ML models, we'll train a Random Forest classifier and check its AUROC. Both this model type and metric are commonly used in chemoinformatics. For train-test splitting, we use [scaffold split](https://scikit-fingerprints.github.io/scikit-fingerprints/modules/generated/skfp.model_selection.scaffold_train_test_split.html), which typically gives better estimation than overly optimistic random split."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "id": "7c420ff0-4846-4ac3-be73-9df7466b2e18",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-12-29T17:37:48.049860Z",
     "iopub.status.busy": "2024-12-29T17:37:48.049621Z",
     "iopub.status.idle": "2024-12-29T17:38:55.529105Z",
     "shell.execute_reply": "2024-12-29T17:38:55.528647Z",
     "shell.execute_reply.started": "2024-12-29T17:37:48.049843Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Mordred AUROC 74.48%\n",
      "RDKit 2D descriptors AUROC 74.43%\n",
      "MACCS AUROC 77.01%\n",
      "PubChem AUROC 76.05%\n",
      "ECFP AUROC 78.25%\n",
      "Atom Pair AUROC 77.60%\n"
     ]
    }
   ],
   "source": [
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.metrics import roc_auc_score\n",
    "\n",
    "from skfp.model_selection import scaffold_train_test_split\n",
    "\n",
    "\n",
    "def get_rf_auroc(\n",
    "    fp_obj,\n",
    "    mols_train: list[Mol],\n",
    "    mols_test: list[Mol],\n",
    "    y_train: np.ndarray,\n",
    "    y_test: np.ndarray,\n",
    ") -> tuple[float, float]:\n",
    "    X_train = fp_obj.transform(mols_train)\n",
    "    X_test = fp_obj.transform(mols_test)\n",
    "\n",
    "    clf = RandomForestClassifier(n_jobs=-1, random_state=0)\n",
    "    clf.fit(X_train, y_train)\n",
    "\n",
    "    y_pred = clf.predict_proba(X_test)[:, 1]\n",
    "    auroc = roc_auc_score(y_test, y_pred)\n",
    "\n",
    "    return auroc\n",
    "\n",
    "\n",
    "mols_train, mols_test, y_train, y_test = scaffold_train_test_split(\n",
    "    mols, y, test_size=0.2\n",
    ")\n",
    "\n",
    "for fp_name, fp_obj in [\n",
    "    (\"Mordred\", MordredFingerprint(n_jobs=-1)),\n",
    "    (\"RDKit 2D descriptors\", RDKit2DDescriptorsFingerprint(n_jobs=-1)),\n",
    "    (\"MACCS\", MACCSFingerprint(n_jobs=-1)),\n",
    "    (\"PubChem\", PubChemFingerprint(n_jobs=-1)),\n",
    "    (\"ECFP\", ECFPFingerprint()),\n",
    "    (\"Atom Pair\", AtomPairFingerprint()),\n",
    "]:\n",
    "    auroc = get_rf_auroc(fp_obj, mols_train, mols_test, y_train, y_test)\n",
    "    print(f\"{fp_name} AUROC {auroc:.2%}\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.21"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}