Hyperparameter tuning#

In machine learning, parameter values are learned from data, while hyperparameter values are tuned, typically using cross-validation. Hyperparameter tuning is most often associated with estimators, e.g. the number of trees in Random Forest or the regularization strength of linear models. However, feature extraction and preprocessing methods also have their own hyperparameters, e.g. the number of output dimensions in PCA.

Molecular fingerprints, as major parts of molecular pipelines, also have hyperparameters. Tuning them can improve performance, since a better-chosen fingerprint yields a better chemical representation.

The most common hyperparameter is count, i.e. whether to use the count variant instead of the binary one. Counting substructures is particularly beneficial for larger molecules, where we can expect multiple occurrences of e.g. functional groups. For many fingerprints, this is the only tunable setting.

Let’s see the impact of using the binary vs count variant on the beta-secretase 1 (BACE) dataset from the MoleculeNet benchmark, using the functional groups fingerprint. It detects functional groups (fragments) defined in RDKit.

[5]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline

from skfp.datasets.moleculenet import load_bace
from skfp.fingerprints import FunctionalGroupsFingerprint
from skfp.model_selection import scaffold_train_test_split

smiles_list, y = load_bace()
smiles_train, smiles_test, y_train, y_test = scaffold_train_test_split(smiles_list, y)

pipeline_binary = make_pipeline(
    FunctionalGroupsFingerprint(),
    RandomForestClassifier(n_jobs=-1, random_state=0),
)
pipeline_binary.fit(smiles_train, y_train)
y_pred_binary = pipeline_binary.predict_proba(smiles_test)[:, 1]
auroc_binary = roc_auc_score(y_test, y_pred_binary)
print(f"AUROC binary: {auroc_binary:.2%}")

pipeline_count = make_pipeline(
    FunctionalGroupsFingerprint(count=True),
    RandomForestClassifier(n_jobs=-1, random_state=0),
)
pipeline_count.fit(smiles_train, y_train)
y_pred_count = pipeline_count.predict_proba(smiles_test)[:, 1]
auroc_count = roc_auc_score(y_test, y_pred_count)
print(f"AUROC count: {auroc_count:.2%}")
AUROC binary: 71.92%
AUROC count: 74.89%

Here we tuned manually and compared the results on the test set. In practice, this should never be done, since selecting hyperparameters on the test set introduces data leakage. Instead, we should use only the training data, e.g. with cross-validation.
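
For illustration, a leakage-free comparison could look like the sketch below: it scores both variants with cross-validation on the training split only, and the test set would be touched just once at the very end with the chosen variant. The 5-fold split and AUROC scoring are arbitrary choices here, not part of the example above.

[ ]:
from sklearn.model_selection import cross_val_score

# compare binary vs count variant using only the training data
for name, fp in [
    ("binary", FunctionalGroupsFingerprint()),
    ("count", FunctionalGroupsFingerprint(count=True)),
]:
    pipeline = make_pipeline(fp, RandomForestClassifier(n_jobs=-1, random_state=0))
    # 5-fold cross-validation on the training split, scored with AUROC
    cv_scores = cross_val_score(pipeline, smiles_train, y_train, cv=5, scoring="roc_auc")
    print(f"CV AUROC {name}: {cv_scores.mean():.2%}")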

Scikit-learn tuning#

Scikit-fingerprints estimators are fully compatible with the scikit-learn tuning interface. We can plug them directly into e.g. GridSearchCV, which will check all combinations of hyperparameters. Hyperparameters to tune can be defined for the fingerprint, the estimator, or both. Let’s see examples of all three situations.

We will use the ECFP fingerprint, which has a lot of hyperparameters. This is typical for hashed fingerprints, e.g. Atom Pair, Topological Torsion, and RDKit. For ECFP, the two main hyperparameters are:

  • fp_size, the number of output features, typically a multiple of 512, e.g. 1024, 2048, 4096

  • radius, the size of circular subgraphs to use, e.g. ECFP4 uses radius 2 (diameter 4), ECFP6 uses radius 3 (diameter 6), and so forth
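
These hyperparameters map directly to constructor arguments. For example, instantiating ECFP4 and ECFP6 variants might look like the snippet below (an illustrative sketch; the 2048-bit size is an arbitrary choice).

[ ]:
from skfp.fingerprints import ECFPFingerprint

# ECFP4: radius 2, ECFP6: radius 3; both with 2048 output bits
ecfp4 = ECFPFingerprint(fp_size=2048, radius=2)
ecfp6 = ECFPFingerprint(fp_size=2048, radius=3)

X_ecfp4 = ecfp4.transform(smiles_train)
print(X_ecfp4.shape)  # (number of training molecules, 2048)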

Let’s tune a few of those. We will also tune the regularization strength of Random Forest via min_samples_split.

We use scikit-learn pipelines, and in that case each key in the hyperparameter grid is the step name, a double underscore, and the hyperparameter name. Note that this is a general scikit-learn mechanism, and you could also include more steps and tune more complex pipelines this way. Using custom step names with Pipeline, instead of make_pipeline, is often useful in such cases.

[10]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

from skfp.fingerprints import ECFPFingerprint

pipeline = Pipeline(
    [
        ("fp", ECFPFingerprint()),
        ("rf", RandomForestClassifier(n_jobs=-1, random_state=0)),
    ]
)
fp_params = {
    "fp__fp_size": [1024, 2048],
    "fp__radius": [2, 3],
    "fp__use_pharmacophoric_invariants": [False, True],
    "fp__include_chirality": [False, True],
}
rf_params = {
    "rf__min_samples_split": [2, 5, 10],
}

for name, params in [
    ("fingerprint", fp_params),
    ("Random Forest", rf_params),
    ("fingerprint + Random Forest", fp_params | rf_params),
]:
    cv = GridSearchCV(pipeline, params)
    cv.fit(smiles_train, y_train)
    y_pred = cv.predict_proba(smiles_test)[:, 1]
    auroc = roc_auc_score(y_test, y_pred)
    print(f"AUROC {name} tuning: {auroc:.2%}")
AUROC fingerprint tuning: 78.83%
AUROC Random Forest tuning: 78.48%
AUROC fingerprint + Random Forest tuning: 79.44%

Optimized Scikit-fingerprints tuning#

scikit-learn pipelines are very convenient, but they have a significant performance downside - they do not consider any ordering or caching of steps, so earlier steps are recomputed for every hyperparameter combination. For example, consider the following situation:

  • you want to tune ECFP fingerprint and Random Forest classifier

  • there are 4 hyperparameter combinations for the fingerprint, e.g. 2 values each for fp_size and radius

  • Random Forest checks 10 values for min_samples_split

  • this gives 4 × 10 = 40 combinations in total

scikit-learn will run all those 40 combinations independently, recomputing the fingerprint 40 times. But there is no need to do so! For a given set of fingerprint hyperparameters, we can compute the fingerprint once and check all values for Random Forest. Consider two nested loops:

  • go over each fingerprint hyperparameter combination

  • for each one, tune Random Forest

  • pick the best combination of both

This will also check all 40 combinations, but the fingerprint is calculated only 4 times. This results in huge efficiency gains for more costly fingerprints, e.g. the RDKit fingerprint, which extracts all subgraphs up to max_path bonds.
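
To make this concrete, below is a hand-written sketch of the nested-loop scheme. It is illustrative only: it reuses the ECFP and Random Forest grids from the previous section, and the FingerprintEstimatorGridSearch class described next does all of this for you.

[ ]:
from sklearn.model_selection import GridSearchCV, ParameterGrid

from skfp.fingerprints import ECFPFingerprint

best_score, best_fp_params, best_clf_params = -1.0, None, None
for fp_setting in ParameterGrid({"fp_size": [1024, 2048], "radius": [2, 3]}):
    # compute the fingerprint once per fingerprint setting...
    X_train = ECFPFingerprint(n_jobs=-1, **fp_setting).transform(smiles_train)
    # ...and reuse it for every Random Forest setting
    clf_cv = GridSearchCV(
        RandomForestClassifier(n_jobs=-1, random_state=0),
        {"min_samples_split": [2, 5, 10]},
    )
    clf_cv.fit(X_train, y_train)
    if clf_cv.best_score_ > best_score:
        best_score = clf_cv.best_score_
        best_fp_params, best_clf_params = dict(fp_setting), clf_cv.best_params_

print(best_fp_params, best_clf_params, f"{best_score:.2%}")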

scikit-fingerprints implements this optimized scheme in the FingerprintEstimatorGridSearch and FingerprintEstimatorRandomizedSearch classes. They are much more efficient when you need to tune hyperparameters of both the fingerprint and the estimator. Their parameters are:

  • fingerprint object

  • parameter grid for the fingerprint

  • tuning object for the estimator, e.g. GridSearchCV

Let’s see how this works and compare the total time.

[17]:
from time import time

from skfp.fingerprints import RDKitFingerprint
from skfp.model_selection import FingerprintEstimatorGridSearch

# scikit-fingerprints approach
fp = RDKitFingerprint(n_jobs=-1)
fp_params = {"fp_size": [1024, 2048], "max_path": [5, 7, 9]}
clf_cv = GridSearchCV(
    estimator=RandomForestClassifier(n_jobs=-1, random_state=0),
    param_grid={"min_samples_split": [2, 5, 10]},
)

start = time()
fp_cv = FingerprintEstimatorGridSearch(fp, fp_params, clf_cv)
fp_cv.fit(smiles_train, y_train)
end = time()

y_pred = fp_cv.predict_proba(smiles_test)[:, 1]
auroc = roc_auc_score(y_test, y_pred)
print(f"AUROC scikit-fingerprints tuning: {auroc:.2%}")
print(f"scikit-fingerprints tuning time: {end - start:.2f}")
AUROC scikit-fingerprints tuning: 78.29%
scikit-fingerprints tuning time: 18.02
[20]:
# scikit-learn approach
pipeline = Pipeline(
    [
        ("fp", RDKitFingerprint(n_jobs=-1)),
        ("rf", RandomForestClassifier(n_jobs=-1, random_state=0)),
    ]
)
params_grid = {
    "fp__fp_size": [1024, 2048],
    "fp__max_path": [5, 7, 9],
    "rf__min_samples_split": [2, 5, 10],
}
cv = GridSearchCV(pipeline, params_grid)

start = time()
cv.fit(smiles_train, y_train)
end = time()

y_pred = cv.predict_proba(smiles_test)[:, 1]
auroc = roc_auc_score(y_test, y_pred)
print(f"AUROC scikit-learn tuning: {auroc:.2%}")
print(f"scikit-learn tuning time: {end - start:.2f}")
AUROC scikit-learn tuning: 78.29%
scikit-learn tuning time: 86.71
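
FingerprintEstimatorRandomizedSearch works analogously, but samples a fixed number of fingerprint hyperparameter combinations instead of checking them all, which helps with larger grids. A rough sketch is below; the exact constructor arguments, in particular the n_iter parameter, are an assumption here, so check the scikit-fingerprints API reference for details.

[ ]:
from sklearn.model_selection import RandomizedSearchCV

from skfp.model_selection import FingerprintEstimatorRandomizedSearch

fp = RDKitFingerprint(n_jobs=-1)
fp_params = {"fp_size": [512, 1024, 2048, 4096], "max_path": [5, 6, 7, 8, 9]}

# randomized search over the classifier hyperparameters
clf_cv = RandomizedSearchCV(
    estimator=RandomForestClassifier(n_jobs=-1, random_state=0),
    param_distributions={"min_samples_split": [2, 5, 10, 20]},
    n_iter=3,
    random_state=0,
)

# n_iter limits how many fingerprint settings are sampled (assumed parameter name)
fp_cv = FingerprintEstimatorRandomizedSearch(fp, fp_params, clf_cv, n_iter=5)
fp_cv.fit(smiles_train, y_train)

y_pred = fp_cv.predict_proba(smiles_test)[:, 1]
print(f"AUROC randomized tuning: {roc_auc_score(y_test, y_pred):.2%}")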