FingerprintEstimatorRandomizedSearch

class skfp.model_selection.FingerprintEstimatorRandomizedSearch(fingerprint: BaseFingerprintTransformer, fp_param_distributions: dict | list[dict], estimator_cv: BaseSearchCV, greater_is_better: bool = True, n_iter: int = 10, cache_best_fp_array: bool = False, verbose: int | dict = 0, random_state: int | None = 0)

Randomized search over specified hyperparameter distributions for a pipeline of a molecular fingerprint and a scikit-learn estimator.

This approach is useful for pipelines which first compute fingerprints and then operate on the resulting matrices, when both fingerprint and estimator hyperparameters are optimized. The regular scikit-learn combination of Pipeline and RandomizedSearchCV would recompute the fingerprint for each set of hyperparameter values.

Here, we instead perform a nested loop:

  1. Randomly select a combination of fingerprint hyperparameter values

  2. Compute fingerprint

  3. Optimize estimator hyperparameters

This way, each computed fingerprint representation is reused for many sets of estimator hyperparameters. This is useful when tuning the estimator alone, or the fingerprint and the estimator together. When only the fingerprint is tuned, a regular combination of Pipeline and GridSearchCV (or RandomizedSearchCV) is enough. The difference is particularly significant for computationally heavy fingerprints and large estimator grids.
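The nested loop can be sketched in plain Python. This is a minimal illustration, not the library's implementation: `compute_fingerprint`, `score_estimator`, and the toy scoring rule are hypothetical stand-ins. The counter only demonstrates that features are computed once per outer iteration rather than once per candidate pipeline.

```python
import random


def compute_fingerprint(fp_params, counter):
    # Hypothetical stand-in for an expensive fingerprint computation.
    counter["fp_computations"] += 1
    return [fp_params["fp_size"]]  # pretend feature matrix


def score_estimator(features, est_params):
    # Hypothetical stand-in for cross-validating one estimator setting.
    return features[0] / 1000 - abs(est_params["min_samples_split"] - 4)


def nested_random_search(fp_grid, est_grid, n_iter, inner_n_iter, seed=0):
    rng = random.Random(seed)
    counter = {"fp_computations": 0}
    best = None
    for _ in range(n_iter):
        # 1. Randomly select a combination of fingerprint hyperparameters.
        fp_params = {k: rng.choice(v) for k, v in fp_grid.items()}
        # 2. Compute the fingerprint once for this combination.
        features = compute_fingerprint(fp_params, counter)
        # 3. Optimize estimator hyperparameters on the already computed features.
        for _ in range(inner_n_iter):
            est_params = {k: rng.choice(v) for k, v in est_grid.items()}
            score = score_estimator(features, est_params)
            if best is None or score > best["score"]:
                best = {
                    "score": score,
                    "fp_params": fp_params,
                    "est_params": est_params,
                }
    return best, counter["fp_computations"]


best, n_fp = nested_random_search(
    fp_grid={"fp_size": [512, 1024, 2048]},
    est_grid={"min_samples_split": [2, 4, 8]},
    n_iter=5,
    inner_n_iter=4,
)
```

In this sketch, 5 × 4 = 20 candidate pipelines are scored, but the fingerprint stand-in is called only 5 times.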

Note that much of the behavior is controlled via the passed estimator_cv object, e.g. the scoring metric used to select the best pipeline. In particular, the inner CV is run once for each of the n_iter random selections of fingerprint hyperparameters, i.e. for each iteration of the outer loop. This should be taken into consideration when selecting n_iter or the hyperparameter grids. If RandomizedSearchCV is used, the result is roughly equivalent to a randomized search over all hyperparameters, but faster. However, any other search strategy can be used for the estimator, e.g. GridSearchCV.
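To get a feel for the cost when choosing n_iter, the number of estimator fits can be estimated up front. The helper below is a hypothetical back-of-the-envelope sketch, not part of the library. It assumes the inner search is a scikit-learn search object with its default 5-fold cross-validation and default refit=True, so each outer iteration adds one extra fit of the best estimator.

```python
def count_inner_fits(n_iter_outer, n_candidates_inner, n_folds, refit=True):
    # Cross-validation fits: every inner candidate is fitted once per fold,
    # and the whole inner search is repeated for each outer iteration.
    cv_fits = n_iter_outer * n_candidates_inner * n_folds
    # Assumed: one refit of the best estimator per inner search.
    refits = n_iter_outer if refit else 0
    return cv_fits + refits


# Outer n_iter=5, inner n_iter=5, 5 folds.
total = count_inner_fits(n_iter_outer=5, n_candidates_inner=5, n_folds=5)
```

With these settings the estimate is 130 estimator fits, while the fingerprint is computed only once per outer iteration.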

Parameters:
  • fingerprint (fingerprint object) – Instance of any fingerprint class. To maximize performance, consider setting its n_jobs larger than 1, since this class does not parallelize over the sampled fingerprint hyperparameter combinations.

  • fp_param_distributions (dict or list[dict]) – Dictionary with names of fingerprint hyperparameters as keys and lists of hyperparameter settings to try as values, or a list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored. This enables searching over any sequence of hyperparameter settings.

  • estimator_cv (object) – Inner cross-validation object for tuning the estimator, e.g. RandomizedSearchCV. Should be an instantiated object, not a class.

  • greater_is_better (bool, default=True) – Whether higher values of the scoring metric in estimator_cv are better. Use False for error (loss) functions, which are typical in regression.

  • n_iter (int, default=10) – How many iterations of random search to perform.

  • cache_best_fp_array (bool, default=False) – Whether to cache the array of values from the best fingerprint in the best_fp_array_ attribute. Note that this can result in high memory usage.

  • verbose (int or dict, default=0) –

    Controls the verbosity when computing fingerprints.

    • >0 : size of parameter grid, parameter candidate for each fold

    • >1 : the computation time and score for each candidate

    If a dictionary is passed, it is treated as kwargs for tqdm(), and can be used to control the progress bar.
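The two accepted forms of fp_param_distributions can be illustrated with a small sampler. This is a sketch of one plausible sampling scheme, not the library's implementation: the helpers `expand_grids` and `sample_candidates` are hypothetical, the parameter names and values are illustrative only, and sampling without replacement from the enumerated grids is an assumption.

```python
import itertools
import random


def expand_grids(param_distributions):
    # Accept a single dict or a list of dicts, as fp_param_distributions does.
    if isinstance(param_distributions, dict):
        param_distributions = [param_distributions]
    candidates = []
    for grid in param_distributions:
        names = sorted(grid)
        # Enumerate the grid spanned by this dictionary.
        for values in itertools.product(*(grid[name] for name in names)):
            candidates.append(dict(zip(names, values)))
    return candidates


def sample_candidates(param_distributions, n_iter, seed=0):
    candidates = expand_grids(param_distributions)
    rng = random.Random(seed)
    # Sample without replacement; cap n_iter at the number of candidates.
    return rng.sample(candidates, min(n_iter, len(candidates)))


# Illustrative list-of-dicts form: two separate grids.
fp_param_distributions = [
    {"fp_size": [1024, 2048], "radius": [2, 3]},
    {"fp_size": [512], "count": [True, False]},
]
all_candidates = expand_grids(fp_param_distributions)
sampled = sample_candidates(fp_param_distributions, n_iter=4)
```

Here the first dictionary spans 4 combinations and the second spans 2, so 4 of the 6 candidates are drawn. Combinations that mix keys across dictionaries are never produced.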

cv_results_

List of dictionaries, each holding one set of hyperparameters (names and values) plus a "score" key with the cross-validated performance of the pipeline with those hyperparameters.

Type:

list[dict]
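As a sketch of how such a list can be inspected, the snippet below picks the best entry according to greater_is_better. The entries and scores are made up for illustration, and `select_best` is a hypothetical helper, not a library function.

```python
def select_best(cv_results, greater_is_better=True):
    # Highest score wins for metrics, lowest for losses.
    pick = max if greater_is_better else min
    return pick(cv_results, key=lambda entry: entry["score"])


# Made-up results in the documented shape: hyperparameters plus "score".
cv_results = [
    {"fp_size": 512, "score": 0.81},
    {"fp_size": 1024, "score": 0.84},
    {"fp_size": 2048, "score": 0.79},
]

best = select_best(cv_results)
# Drop the score to recover only the hyperparameter names and values.
best_params = {k: v for k, v in best.items() if k != "score"}
```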

best_fp_

Fingerprint that was chosen by the search, i.e. the fingerprint which gave the highest score (or the smallest loss, if greater_is_better is False) on the left-out data. Use with best_estimator_cv_ to obtain the best found pipeline.

Type:

fingerprint object

best_fp_params_

Fingerprint hyperparameter values that gave the best results on the hold-out data.

Type:

dict

best_fp_array_

Fingerprint values computed by best_fp_. If cache_best_fp_array is False, this attribute is not populated and is None.

Type:

np.ndarray

best_score_

Mean cross-validated score of the best fingerprint and estimator.

Type:

float

best_estimator_cv_

Inner cross-validation object that gave the best results on the hold-out data. Use with best_fp_ to obtain the best found pipeline.

Type:

CV object

See also

FingerprintEstimatorGridSearch

Related class that uses grid search for fingerprint hyperparameters.

Examples

>>> from skfp.datasets.moleculenet import load_bace
>>> from skfp.fingerprints import ECFPFingerprint
>>> from skfp.model_selection import FingerprintEstimatorRandomizedSearch
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.model_selection import RandomizedSearchCV
>>> smiles, labels = load_bace()
>>> fp = ECFPFingerprint(n_jobs=-1)
>>> fp_params = {"fp_size": list(range(512, 4097, 128))}
>>> clf = RandomForestClassifier(n_jobs=-1)
>>> clf_params = {"min_samples_split": list(range(2, 10))}
>>> clf_cv = RandomizedSearchCV(clf, clf_params, n_iter=5, random_state=0)
>>> fp_cv = FingerprintEstimatorRandomizedSearch(fp, fp_params, clf_cv, n_iter=5)
>>> fp_cv = fp_cv.fit(smiles, labels)  
>>> fp_cv.best_fp_params_  
{'fp_size': 768}

Methods

  • fit(X[, y])

  • get_metadata_routing() – Get metadata routing of this object.

  • get_params([deep]) – Get parameters for this estimator.

  • predict(X)

  • predict_proba(X)

  • set_params(**params) – Set the parameters of this estimator.

  • transform(X)

get_metadata_routing()

Get metadata routing of this object.

Please check the scikit-learn User Guide on how the routing mechanism works.

Returns:

routing – A MetadataRequest encapsulating routing information.

Return type:

MetadataRequest

get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params – Parameter names mapped to their values.

Return type:

dict

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self – Estimator instance.

Return type:

estimator instance