{ "cells": [ { "cell_type": "markdown", "id": "473e4b0b", "metadata": { "collapsed": true, "jupyter": { "outputs_hidden": true } }, "source": [ "# Hyperparameter tuning" ] }, { "cell_type": "markdown", "id": "9a96a2ea", "metadata": {}, "source": [ "In machine learning, we learn parameter values, but **hyperparameter** values are tuned, typically using cross-validation. Most often, you can see hyperparameter tuning for estimators, e.g. number of trees in Random Forest or regularization strength for linear models. However, feature extraction and preprocessing methods also have their own hyperparameters, e.g. number of output dimensions in PCA.\n", "\n", "Molecular fingerprints, as major parts of molecular pipelines, also have hyperparameters. They can be tuned to achieve better performance, which results from better chemical representation.\n", "\n", "Most common hyperparameter is `count`, wheter to use count variant instead of binary. Counting substructures is particularly beneficial for larger molecules, when we can expect multiple occurrences of e.g. functional groups. For many fingerprints, this is the only tunable setting.\n", "\n", "Let's see the impact of using binary vs count variant on [beta-secretase 1 (BACE) dataset](https://doi.org/10.1021/acs.jcim.6b00290) from MoleculeNet benchmark, using [functional groups fingerprint](https://scikit-fingerprints.github.io/scikit-fingerprints/modules/generated/skfp.fingerprints.FunctionalGroupsFingerprint.html). It detects functional groups (fragments) [defined in RDKit](https://www.rdkit.org/docs/source/rdkit.Chem.Fragments.html)." ] }, { "cell_type": "code", "execution_count": 5, "id": "bb06ca57-32ee-4c21-86dd-c69f65188acd", "metadata": { "execution": { "iopub.execute_input": "2025-01-30T19:37:37.713553Z", "iopub.status.busy": "2025-01-30T19:37:37.713148Z", "iopub.status.idle": "2025-01-30T19:37:40.733691Z", "shell.execute_reply": "2025-01-30T19:37:40.733276Z", "shell.execute_reply.started": "2025-01-30T19:37:37.713531Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "AUROC binary: 71.92%\n", "AUROC count: 74.89%\n" ] } ], "source": [ "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.metrics import roc_auc_score\n", "from sklearn.pipeline import make_pipeline\n", "\n", "from skfp.datasets.moleculenet import load_bace\n", "from skfp.fingerprints import FunctionalGroupsFingerprint\n", "from skfp.model_selection import scaffold_train_test_split\n", "\n", "smiles_list, y = load_bace()\n", "smiles_train, smiles_test, y_train, y_test = scaffold_train_test_split(smiles_list, y)\n", "\n", "pipeline_binary = make_pipeline(\n", " FunctionalGroupsFingerprint(),\n", " RandomForestClassifier(n_jobs=-1, random_state=0),\n", ")\n", "pipeline_binary.fit(smiles_train, y_train)\n", "y_pred_binary = pipeline_binary.predict_proba(smiles_test)[:, 1]\n", "auroc_binary = roc_auc_score(y_test, y_pred_binary)\n", "print(f\"AUROC binary: {auroc_binary:.2%}\")\n", "\n", "pipeline_count = make_pipeline(\n", " FunctionalGroupsFingerprint(count=True),\n", " RandomForestClassifier(n_jobs=-1, random_state=0),\n", ")\n", "pipeline_count.fit(smiles_train, y_train)\n", "y_pred_count = pipeline_count.predict_proba(smiles_test)[:, 1]\n", "auroc_count = roc_auc_score(y_test, y_pred_count)\n", "print(f\"AUROC count: {auroc_count:.2%}\")" ] }, { "cell_type": "markdown", "id": "a718b936-d9d2-4e91-a6a9-17eae8830650", "metadata": {}, "source": [ "This is manual tuning and we compare the results on the test set. In practice, this should **never** be done this way, since it introduces data leakage. Instead, we should use only training data, e.g. with cross-validation." ] }, { "cell_type": "markdown", "id": "f3b4b0c0-b404-4667-a42b-53ccbb516610", "metadata": {}, "source": [ "### Scikit-learn tuning\n", "\n", "Scikit-fingerprints estimators are fully compatible with scikit-learn tuning interface. We can plug them directly into e.g. [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html), which will check all combinations of hyperparameters. They can be defined for fingerprint, estimator, or both. Let's see examples of all 3 situations.\n", "\n", "We will use [ECFP fingerprint](https://scikit-fingerprints.github.io/scikit-fingerprints/modules/generated/skfp.fingerprints.ECFPFingerprint.html), which has a lot of hyperparameters. This is typical for hashed fingerprints, e.g. Atom Pair, Topological Torsion, RDKit. For ECFP, two main hyperparameters are:\n", "- `fp_size`, number of features, typically a multiple of 512, e.g. 1024, 2048, 4096\n", "- `radius`, what subgraphs size should be used, e.g. ECFP4 uses radius 2 (diameter 4), ECFP6 uses radius 3 (diameter 6), and so forth\n", "\n", "Let's tune a few of those. We will also tune regularization strength of Random Forest with `min_samples_split`.\n", "\n", "We use [scikit-learn pipelines](https://scikit-learn.org/stable/modules/grid_search.html), and in such cases, the hyperparameter grid definition the key is step name + double underscore + hyperparameter name. Note that this is a general scikit-learn mechanism, and you could also include more steps and tune more complex pipelines this way. Using custom step names with `Pipeline`, instead of `make_pipeline`, is often useful in such cases." ] }, { "cell_type": "code", "execution_count": 10, "id": "e676863e-6504-4706-98c1-fce2fcc23287", "metadata": { "execution": { "iopub.execute_input": "2025-01-30T19:57:41.449629Z", "iopub.status.busy": "2025-01-30T19:57:41.449366Z", "iopub.status.idle": "2025-01-30T20:00:47.698748Z", "shell.execute_reply": "2025-01-30T20:00:47.698351Z", "shell.execute_reply.started": "2025-01-30T19:57:41.449611Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "AUROC fingerprint tuning: 78.83%\n", "AUROC Random Forest tuning: 78.48%\n", "AUROC fingerprint + Random Forest tuning: 79.44%\n" ] } ], "source": [ "from sklearn.model_selection import GridSearchCV\n", "from sklearn.pipeline import Pipeline\n", "\n", "from skfp.fingerprints import ECFPFingerprint\n", "\n", "pipeline = Pipeline(\n", " [\n", " (\"fp\", ECFPFingerprint()),\n", " (\"rf\", RandomForestClassifier(n_jobs=-1, random_state=0)),\n", " ]\n", ")\n", "fp_params = {\n", " \"fp__fp_size\": [1024, 2048],\n", " \"fp__radius\": [2, 3],\n", " \"fp__use_pharmacophoric_invariants\": [False, True],\n", " \"fp__include_chirality\": [False, True],\n", "}\n", "rf_params = {\n", " \"rf__min_samples_split\": [2, 5, 10],\n", "}\n", "\n", "for name, params in [\n", " (\"fingerprint\", fp_params),\n", " (\"Random Forest\", rf_params),\n", " (\"fingerprint + Random Forest\", fp_params | rf_params),\n", "]:\n", " cv = GridSearchCV(pipeline, params)\n", " cv.fit(smiles_train, y_train)\n", " y_pred = cv.predict_proba(smiles_test)[:, 1]\n", " auroc = roc_auc_score(y_test, y_pred)\n", " print(f\"AUROC {name} tuning: {auroc:.2%}\")" ] }, { "cell_type": "markdown", "id": "c86ca2dd-83a2-4904-9e9a-d52f58942eb0", "metadata": {}, "source": [ "### Optimized Scikit-fingerprints tuning\n", "\n", "scikit-learn pipelines are very convenient, but they have a significant performance downside - they don't consider any order or caching. For example, consider the situation:\n", "- you want to tune ECFP fingerprint and Random Forest classifier\n", "- there are 4 hyperparameter combinations for fingerprint, e.g. 2 values for `fp_size` and `radius` each\n", "- Random Forest checks 10 values for `min_samples_split`\n", "- we have 40 combinations in total\n", "\n", "scikit-learn will run all thsoe 40 combinations independently, recomputing fingerprint 40 times. But there is no need to do so! For a given set of fingerprint hyperparameters, we can compute it and check all values for Random Forest. Consider two nested loops:\n", "- go over fingerprint hyperparameter combination\n", "- for each tune Random Forest\n", "- pick the best combination of both\n", "\n", "This will also check all 40 combinations, but fingerprint is calculated only 4 times. This results in huge efficiency gains for more costly fingerprints, e.g. [RDKit fingerprint](https://scikit-fingerprints.github.io/scikit-fingerprints/modules/generated/skfp.fingerprints.RDKitFingerprint.html), which extracts all subgraphs up to `max_path` bonds.\n", "\n", "scikit-fingerprints implements this optimized scheme in [FingerprintEstimatorGridSearch](https://scikit-fingerprints.github.io/scikit-fingerprints/modules/generated/skfp.model_selection.FingerprintEstimatorGridSearch.html) and [FingerprintEstimatorRandomizedSearch](https://scikit-fingerprints.github.io/scikit-fingerprints/modules/generated/skfp.model_selection.FingerprintEstimatorRandomizedSearch.html) classes. They are much more efficient when you need to tune hyperparameters of both fingerprint and estimator. Their parameters are:\n", "- fingerprint object\n", "- parameters grid for fingerprint\n", "- tuning object for estimator, e.g. `GridSearchCV`\n", "\n", "Let's see how this works and compare the total time." ] }, { "cell_type": "code", "execution_count": 17, "id": "c47a9fca-47f2-4228-ad3e-a425936b7fc9", "metadata": { "execution": { "iopub.execute_input": "2025-01-30T20:21:16.053369Z", "iopub.status.busy": "2025-01-30T20:21:16.052882Z", "iopub.status.idle": "2025-01-30T20:21:34.142234Z", "shell.execute_reply": "2025-01-30T20:21:34.141839Z", "shell.execute_reply.started": "2025-01-30T20:21:16.053350Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "AUROC scikit-fingerprints tuning: 78.29%\n", "scikit-fingerprints tuning time: 18.02\n" ] } ], "source": [ "from time import time\n", "\n", "from skfp.fingerprints import RDKitFingerprint\n", "from skfp.model_selection import FingerprintEstimatorGridSearch\n", "\n", "# scikit-fingerprints approach\n", "fp = RDKitFingerprint(n_jobs=-1)\n", "fp_params = {\"fp_size\": [1024, 2048], \"max_path\": [5, 7, 9]}\n", "clf_cv = GridSearchCV(\n", " estimator=RandomForestClassifier(n_jobs=-1, random_state=0),\n", " param_grid={\"min_samples_split\": [2, 5, 10]},\n", ")\n", "\n", "start = time()\n", "fp_cv = FingerprintEstimatorGridSearch(fp, fp_params, clf_cv)\n", "fp_cv.fit(smiles_train, y_train)\n", "end = time()\n", "\n", "y_pred = fp_cv.predict_proba(smiles_test)[:, 1]\n", "auroc = roc_auc_score(y_test, y_pred)\n", "print(f\"AUROC scikit-fingerprints tuning: {auroc:.2%}\")\n", "print(f\"scikit-fingerprints tuning time: {end - start:.2f}\")" ] }, { "cell_type": "code", "execution_count": 20, "id": "f718178a-bc71-4949-a747-44116c8bb774", "metadata": { "execution": { "iopub.execute_input": "2025-01-30T20:21:50.958458Z", "iopub.status.busy": "2025-01-30T20:21:50.958153Z", "iopub.status.idle": "2025-01-30T20:23:17.747256Z", "shell.execute_reply": "2025-01-30T20:23:17.746781Z", "shell.execute_reply.started": "2025-01-30T20:21:50.958436Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "AUROC scikit-learn tuning: 78.29%\n", "scikit-learn tuning time: 86.71\n" ] } ], "source": [ "# scikit-learn approach\n", "pipeline = Pipeline(\n", " [\n", " (\"fp\", RDKitFingerprint(n_jobs=-1)),\n", " (\"rf\", RandomForestClassifier(n_jobs=-1, random_state=0)),\n", " ]\n", ")\n", "params_grid = fp_params = {\n", " \"fp__fp_size\": [1024, 2048],\n", " \"fp__max_path\": [5, 7, 9],\n", " \"rf__min_samples_split\": [2, 5, 10],\n", "}\n", "cv = GridSearchCV(pipeline, params_grid)\n", "\n", "start = time()\n", "cv.fit(smiles_train, y_train)\n", "end = time()\n", "\n", "y_pred = cv.predict_proba(smiles_test)[:, 1]\n", "auroc = roc_auc_score(y_test, y_pred)\n", "print(f\"AUROC scikit-learn tuning: {auroc:.2%}\")\n", "print(f\"scikit-learn tuning time: {end - start:.2f}\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.21" } }, "nbformat": 4, "nbformat_minor": 5 }