Hyperparameter tuning#
In machine learning, parameter values are learned, while hyperparameter values are tuned, typically using cross-validation. Hyperparameter tuning is most often applied to estimators, e.g. the number of trees in Random Forest or the regularization strength for linear models. However, feature extraction and preprocessing methods also have their own hyperparameters, e.g. the number of output dimensions in PCA.
Molecular fingerprints, as major parts of molecular pipelines, also have hyperparameters. Tuning them can improve performance, since it results in a better chemical representation.
The most common hyperparameter is count, i.e. whether to use the count variant instead of the binary one. Counting substructures is particularly beneficial for larger molecules, where we can expect multiple occurrences of e.g. functional groups. For many fingerprints, this is the only tunable setting.
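To make the difference concrete, here is a minimal sketch (not part of the original example; aspirin is used purely as an illustrative molecule) comparing the binary and count outputs of the functional groups fingerprint for a single molecule:

from skfp.fingerprints import FunctionalGroupsFingerprint

smiles = ["CC(=O)Oc1ccccc1C(=O)O"]  # aspirin, just an illustrative molecule

X_binary = FunctionalGroupsFingerprint().transform(smiles)
X_count = FunctionalGroupsFingerprint(count=True).transform(smiles)

# binary features are 0/1, count features can exceed 1 for repeated groups
print(X_binary[0].max(), X_count[0].max())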
Let’s see the impact of the binary vs count variant on the beta-secretase 1 (BACE) dataset from the MoleculeNet benchmark, using the functional groups fingerprint. It detects functional groups (fragments) defined in RDKit.
[5]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline

from skfp.datasets.moleculenet import load_bace
from skfp.fingerprints import FunctionalGroupsFingerprint
from skfp.model_selection import scaffold_train_test_split

smiles_list, y = load_bace()
smiles_train, smiles_test, y_train, y_test = scaffold_train_test_split(smiles_list, y)

# binary variant
pipeline_binary = make_pipeline(
    FunctionalGroupsFingerprint(),
    RandomForestClassifier(n_jobs=-1, random_state=0),
)
pipeline_binary.fit(smiles_train, y_train)
y_pred_binary = pipeline_binary.predict_proba(smiles_test)[:, 1]
auroc_binary = roc_auc_score(y_test, y_pred_binary)
print(f"AUROC binary: {auroc_binary:.2%}")

# count variant
pipeline_count = make_pipeline(
    FunctionalGroupsFingerprint(count=True),
    RandomForestClassifier(n_jobs=-1, random_state=0),
)
pipeline_count.fit(smiles_train, y_train)
y_pred_count = pipeline_count.predict_proba(smiles_test)[:, 1]
auroc_count = roc_auc_score(y_test, y_pred_count)
print(f"AUROC count: {auroc_count:.2%}")
AUROC binary: 71.92%
AUROC count: 74.89%
This was manual tuning, and we compared the results on the test set. In practice, this should never be done, since it introduces data leakage. Instead, we should tune using only the training data, e.g. with cross-validation.
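For example, a leakage-free comparison could use cross-validation on the training split only; a minimal sketch (reusing the imports and data from the cell above) might look like this:

from sklearn.model_selection import cross_val_score

for count in [False, True]:
    pipeline = make_pipeline(
        FunctionalGroupsFingerprint(count=count),
        RandomForestClassifier(n_jobs=-1, random_state=0),
    )
    scores = cross_val_score(pipeline, smiles_train, y_train, scoring="roc_auc", cv=5)
    print(f"count={count}: mean CV AUROC {scores.mean():.2%}")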
Scikit-learn tuning#
Scikit-fingerprints estimators are fully compatible with the scikit-learn tuning interface. We can plug them directly into e.g. GridSearchCV, which checks all combinations of hyperparameters. Those can be defined for the fingerprint, the estimator, or both. Let’s see examples of all 3 situations.
We will use the ECFP fingerprint, which has a lot of hyperparameters. This is typical for hashed fingerprints, e.g. Atom Pair, Topological Torsion, RDKit. For ECFP, the two main hyperparameters are:
- fp_size, the number of features, typically a multiple of 512, e.g. 1024, 2048, 4096
- radius, what subgraph size should be used, e.g. ECFP4 uses radius 2 (diameter 4), ECFP6 uses radius 3 (diameter 6), and so forth
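For reference, a small sketch of how the radius value maps to the traditional ECFP names (parameter values here are only illustrative):

from skfp.fingerprints import ECFPFingerprint

ecfp4 = ECFPFingerprint(fp_size=2048, radius=2)  # ECFP4: radius 2, diameter 4
ecfp6 = ECFPFingerprint(fp_size=2048, radius=3)  # ECFP6: radius 3, diameter 6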
Let’s tune a few of those. We will also tune the regularization strength of Random Forest with min_samples_split.
We use scikit-learn pipelines, and in that case the key in the hyperparameter grid is the step name + double underscore + hyperparameter name. Note that this is a general scikit-learn mechanism, and you could also include more steps and tune more complex pipelines this way. Using custom step names with Pipeline, instead of make_pipeline, is often useful in such cases.
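As a side note, since this is standard scikit-learn behavior, you can list the valid grid keys of any pipeline; a quick sketch:

from sklearn.pipeline import Pipeline
from skfp.fingerprints import ECFPFingerprint

pipeline = Pipeline(
    [
        ("fp", ECFPFingerprint()),
        ("rf", RandomForestClassifier(n_jobs=-1, random_state=0)),
    ]
)
# prints keys like "fp__fp_size" or "rf__min_samples_split"
print([name for name in pipeline.get_params() if "__" in name])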
[10]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

from skfp.fingerprints import ECFPFingerprint

pipeline = Pipeline(
    [
        ("fp", ECFPFingerprint()),
        ("rf", RandomForestClassifier(n_jobs=-1, random_state=0)),
    ]
)

fp_params = {
    "fp__fp_size": [1024, 2048],
    "fp__radius": [2, 3],
    "fp__use_pharmacophoric_invariants": [False, True],
    "fp__include_chirality": [False, True],
}
rf_params = {
    "rf__min_samples_split": [2, 5, 10],
}

for name, params in [
    ("fingerprint", fp_params),
    ("Random Forest", rf_params),
    ("fingerprint + Random Forest", fp_params | rf_params),
]:
    cv = GridSearchCV(pipeline, params)
    cv.fit(smiles_train, y_train)
    y_pred = cv.predict_proba(smiles_test)[:, 1]
    auroc = roc_auc_score(y_test, y_pred)
    print(f"AUROC {name} tuning: {auroc:.2%}")
AUROC fingerprint tuning: 78.83%
AUROC Random Forest tuning: 78.48%
AUROC fingerprint + Random Forest tuning: 79.44%
Optimized Scikit-fingerprints tuning#
scikit-learn pipelines are very convenient, but they have a significant performance downside: they don’t consider any ordering or caching of steps. For example, consider the following situation:
- you want to tune the ECFP fingerprint and Random Forest classifier
- there are 4 hyperparameter combinations for the fingerprint, e.g. 2 values each for fp_size and radius
- Random Forest checks 10 values for min_samples_split
- we have 40 combinations in total
scikit-learn will run all those 40 combinations independently, recomputing the fingerprint 40 times. But there is no need to do so! For a given set of fingerprint hyperparameters, we can compute the fingerprint once and check all values for Random Forest. Consider two nested loops:
- go over fingerprint hyperparameter combinations
- for each of them, tune Random Forest
- pick the best combination of both
This will also check all 40 combinations, but the fingerprint is calculated only 4 times. This results in huge efficiency gains for more costly fingerprints, e.g. the RDKit fingerprint, which extracts all subgraphs up to max_path bonds.
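A rough sketch of this nested-loop idea, using the same grids as in the comparison below (illustration only, not the actual library implementation):

from itertools import product

from skfp.fingerprints import RDKitFingerprint

best_score, best_setup = -1.0, None
for fp_size, max_path in product([1024, 2048], [5, 7, 9]):
    # fingerprint computed once per setting, reused for all estimator candidates
    X_fp = RDKitFingerprint(fp_size=fp_size, max_path=max_path, n_jobs=-1).transform(smiles_train)
    clf_cv = GridSearchCV(
        RandomForestClassifier(n_jobs=-1, random_state=0),
        {"min_samples_split": [2, 5, 10]},
    )
    clf_cv.fit(X_fp, y_train)
    if clf_cv.best_score_ > best_score:
        best_score = clf_cv.best_score_
        best_setup = ({"fp_size": fp_size, "max_path": max_path}, clf_cv.best_params_)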
scikit-fingerprints implements this optimized scheme in the FingerprintEstimatorGridSearch and FingerprintEstimatorRandomizedSearch classes. They are much more efficient when you need to tune hyperparameters of both the fingerprint and the estimator. Their parameters are:
- fingerprint object
- parameter grid for the fingerprint
- tuning object for the estimator, e.g. GridSearchCV
Let’s see how this works and compare the total time.
[17]:
from time import time

from skfp.fingerprints import RDKitFingerprint
from skfp.model_selection import FingerprintEstimatorGridSearch

# scikit-fingerprints approach
fp = RDKitFingerprint(n_jobs=-1)
fp_params = {"fp_size": [1024, 2048], "max_path": [5, 7, 9]}
clf_cv = GridSearchCV(
    estimator=RandomForestClassifier(n_jobs=-1, random_state=0),
    param_grid={"min_samples_split": [2, 5, 10]},
)

start = time()
fp_cv = FingerprintEstimatorGridSearch(fp, fp_params, clf_cv)
fp_cv.fit(smiles_train, y_train)
end = time()

y_pred = fp_cv.predict_proba(smiles_test)[:, 1]
auroc = roc_auc_score(y_test, y_pred)
print(f"AUROC scikit-fingerprints tuning: {auroc:.2%}")
print(f"scikit-fingerprints tuning time: {end - start:.2f}")
AUROC scikit-fingerprints tuning: 78.29%
scikit-fingerprints tuning time: 18.02
[20]:
# scikit-learn approach
pipeline = Pipeline(
    [
        ("fp", RDKitFingerprint(n_jobs=-1)),
        ("rf", RandomForestClassifier(n_jobs=-1, random_state=0)),
    ]
)
params_grid = {
    "fp__fp_size": [1024, 2048],
    "fp__max_path": [5, 7, 9],
    "rf__min_samples_split": [2, 5, 10],
}
cv = GridSearchCV(pipeline, params_grid)

start = time()
cv.fit(smiles_train, y_train)
end = time()

y_pred = cv.predict_proba(smiles_test)[:, 1]
auroc = roc_auc_score(y_test, y_pred)
print(f"AUROC scikit-learn tuning: {auroc:.2%}")
print(f"scikit-learn tuning time: {end - start:.2f}")
AUROC scikit-learn tuning: 78.29%
scikit-learn tuning time: 86.71