FingerprintEstimatorGridSearch

class skfp.model_selection.FingerprintEstimatorGridSearch(fingerprint: BaseFingerprintTransformer, fp_param_grid: dict | list[dict], estimator_cv: BaseSearchCV, cache_best_fp_array: bool = False, verbose: int = 0)

Exhaustive search over specified hyperparameter values for a pipeline of a molecular fingerprint and scikit-learn estimator.

This approach is useful for pipelines that first compute fingerprints and then operate on the resulting matrices, when hyperparameters of both the fingerprint and the estimator are optimized. The regular scikit-learn combination of Pipeline and GridSearchCV would recompute the fingerprint for each set of hyperparameter values.

Here, we instead perform a nested loop:

  1. Loop over all possible combinations of fingerprint hyperparameter values

  2. Compute fingerprint

  3. Optimize estimator hyperparameters

This way, each computed fingerprint representation is reused for many sets of estimator hyperparameters. This is useful when tuning only the estimator, or both the fingerprint and the estimator. When only the fingerprint is tuned, a combination of Pipeline and GridSearchCV is enough. The difference is particularly significant for computationally heavy fingerprints and large estimator grids.
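The nested loop described above can be sketched roughly as follows. This is a minimal illustration and not the library's actual implementation: `toy_fingerprint` is a hypothetical stand-in for a real fingerprint (hashed character n-gram counts on strings), and the data, grids, and classifier are made up for the example.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, ParameterGrid


def toy_fingerprint(strings, n_bits=16, ngram=1):
    """Hypothetical stand-in for a fingerprint: hashed character n-gram counts."""
    X = np.zeros((len(strings), n_bits), dtype=int)
    for row, s in enumerate(strings):
        for i in range(len(s) - ngram + 1):
            # Sum of code points keeps the hashing deterministic across runs
            X[row, sum(map(ord, s[i:i + ngram])) % n_bits] += 1
    return X


# Made-up toy data: 10 strings ending in "O" (class 0), 10 ending in "N" (class 1)
strings = ["C" * k + "O" for k in range(1, 11)] + ["C" * k + "N" for k in range(1, 11)]
labels = [0] * 10 + [1] * 10

fp_param_grid = {"n_bits": [8, 16], "ngram": [1, 2]}
estimator_cv = GridSearchCV(LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}, cv=5)

n_fp_computations = 0
best_score, best_fp_params, best_estimator_cv = -np.inf, None, None

# 1. Loop over all combinations of fingerprint hyperparameter values
for fp_params in ParameterGrid(fp_param_grid):
    # 2. Compute the fingerprint once for this combination
    X = toy_fingerprint(strings, **fp_params)
    n_fp_computations += 1

    # 3. Optimize estimator hyperparameters on the precomputed array
    curr_cv = clone(estimator_cv).fit(X, labels)
    if curr_cv.best_score_ > best_score:
        best_score = curr_cv.best_score_
        best_fp_params = fp_params
        best_estimator_cv = curr_cv
```

In this sketch the featurization runs once per fingerprint setting (four times in total), and each resulting array is shared by all three values of C and all folds of the inner search.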

Note that much of the behavior is controlled via the passed estimator_cv object, e.g. the scoring metric used to select the best pipeline. In particular, if GridSearchCV is used, the result is equivalent to running a grid search over all hyperparameters, but faster. However, any other search strategy can be used for the estimator, e.g. RandomizedSearchCV.
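As a small sketch of that flexibility, the inner search object below uses RandomizedSearchCV with an explicit scoring metric. The specific values (`n_iter=2`, `scoring="roc_auc"`, `random_state=0`) are illustrative assumptions, not recommendations from the library.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Inner search: samples 2 of the 3 candidates and ranks them by ROC AUC
clf_cv = RandomizedSearchCV(
    RandomForestClassifier(n_jobs=-1),
    param_distributions={"min_samples_split": [2, 3, 4]},
    n_iter=2,
    scoring="roc_auc",
    random_state=0,
)

# clf_cv would then be passed as estimator_cv, e.g.
# FingerprintEstimatorGridSearch(fp, fp_params, clf_cv)
```

Because the scoring metric lives on the inner object, changing it there changes how the best fingerprint and estimator combination is selected.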

Parameters:
  • fingerprint (fingerprint object) – Instance of any fingerprint class. To maximize performance, consider setting n_jobs larger than 1, since no parallelization is performed here when iterating over the fingerprint hyperparameter grid.

  • fp_param_grid (dict or list[dict]) – Dictionary with names of fingerprint hyperparameters (str) as keys and lists of hyperparameter settings to try as values, or a list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored. This enables searching over any sequence of hyperparameter settings.

  • estimator_cv (object) – Inner cross-validation object for tuning the estimator, e.g. GridSearchCV. Should be an instantiated object, not a class.

  • cache_best_fp_array (bool, default=False) – Whether to cache the array of values from the best fingerprint in the best_fp_array_ attribute. Note that this can result in high memory usage.

  • verbose (int, default=0) – Controls the verbosity: the higher, the more messages.

    • >0 : size of parameter grid, parameter candidate for each fold

    • >1 : the computation time and score for each candidate
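The two accepted forms of fp_param_grid can be illustrated with a plain-Python expansion. This sketch assumes the same semantics as scikit-learn's param_grid (a Cartesian product within each dictionary, and a union across dictionaries); `expand_grid` is a hypothetical helper, and `fp_size` is used only as an illustrative second hyperparameter name.

```python
from itertools import product


def expand_grid(param_grid):
    """Expand a dict, or a list of dicts, into a list of candidate settings."""
    grids = [param_grid] if isinstance(param_grid, dict) else param_grid
    candidates = []
    for grid in grids:
        names = sorted(grid)
        # Cartesian product of the value lists within one dictionary
        for values in product(*(grid[name] for name in names)):
            candidates.append(dict(zip(names, values)))
    return candidates


# Single dictionary: every combination of radius and fp_size
single = expand_grid({"radius": [2, 3], "fp_size": [1024, 2048]})

# List of dictionaries: each grid is expanded separately, then concatenated
multiple = expand_grid([
    {"radius": [2, 3]},
    {"radius": [4], "fp_size": [1024, 2048]},
])
```

The list form is handy when some hyperparameters should only vary together with particular values of others, as with `fp_size` being searched only for `radius=4` here.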

cv_results_

List of dictionaries, each containing one set of hyperparameters (names and values) and a “score” key with the cross-validated performance of the pipeline with those hyperparameters.

Type:

list[dict]

best_fp_

Fingerprint that was chosen by the search, i.e. the fingerprint that gave the highest score (or smallest loss, if specified) on the left-out data. Use with best_estimator_cv_ to obtain the best found pipeline.

Type:

fingerprint object

best_fp_params_

Fingerprint hyperparameter values that gave the best results on the held-out data.

Type:

dict

best_fp_array_

Fingerprint values for best_fp_. If cache_best_fp_array is False, they are not stored and this attribute is None.

Type:

np.ndarray

best_score_

Mean cross-validated score of the best fingerprint and estimator.

Type:

float

best_estimator_cv_

Inner cross-validation object that gave the best results on the held-out data. Use with best_fp_ to obtain the best found pipeline.

Type:

CV object
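Combining best_fp_ and best_estimator_cv_ into the best found pipeline could look roughly like this. `predict_with_best_pipeline` is a hypothetical helper, and the sketch assumes an already fitted search object whose inner search was created with refit enabled, so that it exposes predict().

```python
def predict_with_best_pipeline(fp_cv, X):
    """Apply the best fingerprint, then the best tuned estimator, to new inputs.

    Assumes fp_cv is an already fitted search object, and that the inner
    search was created with refit=True so that it exposes predict().
    """
    # Step 1: featurize inputs with the selected fingerprint
    X_fp = fp_cv.best_fp_.transform(X)
    # Step 2: predict with the inner search object tuned on that fingerprint
    return fp_cv.best_estimator_cv_.predict(X_fp)
```

The class also lists its own predict(X) method, which is likely the more convenient route; this sketch only makes the two-step composition explicit.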

See also

FingerprintEstimatorRandomizedSearch

Related search class, but uses randomized search for fingerprint hyperparameters.

Examples

>>> from skfp.datasets.moleculenet import load_bace
>>> from skfp.fingerprints import ECFPFingerprint
>>> from skfp.model_selection import FingerprintEstimatorGridSearch
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.model_selection import GridSearchCV
>>> smiles, labels = load_bace()
>>> fp = ECFPFingerprint(n_jobs=-1)
>>> fp_params = {"radius": [2, 3]}
>>> clf = RandomForestClassifier(n_jobs=-1)
>>> clf_params = {"min_samples_split": [2, 3, 4]}
>>> clf_cv = GridSearchCV(clf, clf_params)
>>> fp_cv = FingerprintEstimatorGridSearch(fp, fp_params, clf_cv)
>>> fp_cv.fit(smiles, labels)  
FingerprintEstimatorGridSearch(estimator_cv=GridSearchCV(estimator=RandomForestClassifier(n_jobs=-1),
                                                         param_grid={'min_samples_split': [2,
                                                                                           3,
                                                                                           4]}),
                               fingerprint=ECFPFingerprint(n_jobs=-1),
                               fp_param_grid={'radius': [2, 3]})
>>> fp_cv.best_fp_params_  
{'radius': 2}

Methods

  • fit(X[, y])

  • get_metadata_routing() – Get metadata routing of this object.

  • get_params([deep]) – Get parameters for this estimator.

  • predict(X)

  • predict_proba(X)

  • set_params(**params) – Set the parameters of this estimator.

  • transform(X)

get_metadata_routing()

Get metadata routing of this object.

Please check the scikit-learn User Guide on how the routing mechanism works.

Returns:

routing – A MetadataRequest encapsulating routing information.

Return type:

MetadataRequest

get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params – Parameter names mapped to their values.

Return type:

dict

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self – Estimator instance.

Return type:

estimator instance
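The <component>__<parameter> form described for set_params can be demonstrated on a plain scikit-learn GridSearchCV, which is also a nested object. That the same pattern carries over to this class with its own constructor argument names as components (e.g. a fingerprint__radius key) is an assumption, not something confirmed here.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

clf_cv = GridSearchCV(RandomForestClassifier(), {"min_samples_split": [2, 3, 4]})

# "estimator" is the component name, "n_estimators" is its parameter
clf_cv.set_params(estimator__n_estimators=50)

# With deep=True, parameters of contained estimators appear under the same keys
nested_value = clf_cv.get_params(deep=True)["estimator__n_estimators"]
```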