FingerprintEstimatorGridSearch
- class skfp.model_selection.FingerprintEstimatorGridSearch(fingerprint: BaseFingerprintTransformer, fp_param_grid: dict | list[dict], estimator_cv: BaseSearchCV, greater_is_better: bool = True, cache_best_fp_array: bool = False, verbose: int = 0)
Exhaustive search over specified hyperparameter values for a pipeline consisting of a molecular fingerprint and a scikit-learn estimator.
This approach is useful for pipelines that first compute fingerprints and then operate on the resulting matrices, when both fingerprint and estimator hyperparameters are optimized. The regular scikit-learn combination of Pipeline and GridSearchCV would recompute the fingerprint for each set of hyperparameter values. Here, we instead perform a nested loop:
Loop over all possible combinations of fingerprint hyperparameter values
Compute fingerprint
Optimize estimator hyperparameters
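The nested loop above can be sketched in plain Python. This is a minimal illustration of the idea, not the library's implementation: `compute_fingerprint` and `cv_score` are hypothetical stand-ins for the fingerprint computation and the inner estimator search, and the toy molecules and scores are made up.

```python
from itertools import product

# Counter showing how often the (hypothetical) fingerprint is computed.
calls = {"fingerprint": 0}

def compute_fingerprint(molecules, radius):
    """Toy stand-in for an expensive fingerprint computation."""
    calls["fingerprint"] += 1
    return [[len(mol) * radius] for mol in molecules]

def cv_score(features, min_samples_split):
    """Toy stand-in for the cross-validated score of one estimator setting."""
    return sum(row[0] for row in features) / (1 + min_samples_split)

molecules = ["CCO", "c1ccccc1", "CC(=O)O"]
fp_grid = {"radius": [2, 3]}
est_grid = {"min_samples_split": [2, 3, 4]}

best = None
# Outer loop: every combination of fingerprint hyperparameters.
for fp_values in product(*fp_grid.values()):
    fp_params = dict(zip(fp_grid, fp_values))
    # The fingerprint is computed once per fingerprint setting...
    features = compute_fingerprint(molecules, **fp_params)
    # ...and reused for every estimator setting in the inner search.
    for est_values in product(*est_grid.values()):
        est_params = dict(zip(est_grid, est_values))
        score = cv_score(features, **est_params)
        if best is None or score > best[0]:
            best = (score, fp_params, est_params)
```

With 2 fingerprint settings and 3 estimator settings, the fingerprint is computed 2 times here, whereas a flat grid over all 6 combinations would compute it 6 times.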
This way, each computed fingerprint representation is reused for many sets of estimator hyperparameters. This is useful when tuning only the classifier, or both the fingerprint and the classifier. When only the fingerprint is tuned, the combination of Pipeline and GridSearchCV is enough. The difference is particularly significant for more computationally heavy fingerprints and large estimator grids.
Note that much of the behavior is controlled via the passed estimator_cv object, e.g. the scoring metric used to select the best pipeline. In particular, if GridSearchCV is used, then the result is equivalent to using grid search on all hyperparameters, but faster. However, any other strategy can be used for the estimator, e.g. RandomizedSearchCV.
- Parameters:
fingerprint (fingerprint object) – Instance of any fingerprint class. To maximize performance, consider setting n_jobs larger than 1, since this class does not parallelize the loop over the fingerprint hyperparameter grid.
fp_param_grid (dict or list[dict]) – Dictionary with names of fingerprint hyperparameters as keys and lists of hyperparameter settings to try as values, or a list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored. This enables searching over any sequence of hyperparameter settings.
estimator_cv (object) – Inner cross-validation object for tuning the estimator, e.g. GridSearchCV. Should be an instantiated object, not a class.
greater_is_better (bool, default=True) – Whether higher values of the scoring metric in estimator_cv are better. False should be used for error (loss) functions, typically used in regression.
cache_best_fp_array (bool, default=False) – Whether to cache the array of values from the best fingerprint in the best_fp_array_ attribute. Note that this can result in high memory usage.
verbose (int, default=0) – Controls the verbosity: the higher, the more messages.
>0 : size of parameter grid, parameter candidate for each fold
>1 : the computation time and score for each candidate
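To illustrate the two accepted forms of fp_param_grid, here is a small pure-Python sketch of how a single dictionary or a list of dictionaries expands into candidate settings. The helper `expand_grid` is hypothetical and written only for this illustration, and `fp_size` is used as an assumed second hyperparameter name. scikit-learn's `ParameterGrid` follows the same dict-or-list-of-dicts semantics; whether this class uses it internally is not stated here.

```python
from itertools import product

def expand_grid(param_grid):
    """Yield every hyperparameter combination from a dict or a list of dicts."""
    grids = [param_grid] if isinstance(param_grid, dict) else param_grid
    for grid in grids:
        names = list(grid)
        # Cartesian product of the value lists within one dictionary.
        for values in product(*(grid[name] for name in names)):
            yield dict(zip(names, values))

# One dictionary: the full Cartesian product (2 x 2 = 4 candidates).
single = {"radius": [2, 3], "fp_size": [1024, 2048]}

# List of dictionaries: each grid is expanded separately (2 + 1 = 3 candidates),
# which allows skipping combinations that are not of interest.
several = [
    {"radius": [2], "fp_size": [1024, 2048]},
    {"radius": [3], "fp_size": [2048]},
]

single_candidates = list(expand_grid(single))
several_candidates = list(expand_grid(several))
```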
- cv_results_
List of dictionaries, where each one contains a set of hyperparameters (names and values) and a "score" key with the cross-validated performance of the pipeline with those hyperparameters.
- Type:
list[dict]
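As a rough illustration of how such a list can be inspected, the sketch below picks the best entry by its "score" key, honoring the direction given by greater_is_better. The list contents are invented placeholder values, the flat dictionary layout is an assumption based on the description above, and `pick_best` is a hypothetical helper, not part of the library.

```python
# Placeholder data shaped like the description of cv_results_;
# the hyperparameter names and scores are made up for illustration.
cv_results = [
    {"radius": 2, "score": 0.5},
    {"radius": 3, "score": 0.25},
]

def pick_best(results, greater_is_better=True):
    """Return the entry with the highest score, or the lowest for loss metrics."""
    chooser = max if greater_is_better else min
    return chooser(results, key=lambda entry: entry["score"])

best_for_score = pick_best(cv_results)                          # higher is better
best_for_loss = pick_best(cv_results, greater_is_better=False)  # lower is better
```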
- best_fp_
Fingerprint that was chosen by the search, i.e. the fingerprint that gave the highest score (or smallest loss, if specified) on the left-out data. Use with best_estimator_cv_ to obtain the best found pipeline.
- Type:
fingerprint object
- best_fp_params_
Fingerprint hyperparameter values that gave the best results on the held-out data.
- Type:
dict
- best_fp_array_
Fingerprint values for best_fp_. If cache_best_fp_array is False, this is not used and is None instead.
- Type:
np.ndarray
- best_score_
Mean cross-validated score of the best fingerprint and estimator.
- Type:
float
- best_estimator_cv_
Inner cross-validation object that gave the best results on the held-out data. Use with best_fp_ to obtain the best found pipeline.
- Type:
CV object
See also
FingerprintEstimatorRandomizedSearch
Related search class, which uses randomized search for fingerprint hyperparameters.
Examples
>>> from skfp.datasets.moleculenet import load_bace
>>> from skfp.fingerprints import ECFPFingerprint
>>> from skfp.model_selection import FingerprintEstimatorGridSearch
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.model_selection import GridSearchCV
>>> smiles, labels = load_bace()
>>> fp = ECFPFingerprint(n_jobs=-1)
>>> fp_params = {"radius": [2, 3]}
>>> clf = RandomForestClassifier(n_jobs=-1)
>>> clf_params = {"min_samples_split": [2, 3, 4]}
>>> clf_cv = GridSearchCV(clf, clf_params)
>>> fp_cv = FingerprintEstimatorGridSearch(fp, fp_params, clf_cv)
>>> fp_cv.fit(smiles, labels)
FingerprintEstimatorGridSearch(estimator_cv=GridSearchCV(estimator=RandomForestClassifier(n_jobs=-1), param_grid={'min_samples_split': [2, 3, 4]}), fingerprint=ECFPFingerprint(n_jobs=-1), fp_param_grid={'radius': [2, 3]})
>>> fp_cv.best_fp_params_
{'radius': 2}
Methods
fit(X[, y])
get_metadata_routing() – Get metadata routing of this object.
get_params([deep]) – Get parameters for this estimator.
predict(X)
predict_proba(X)
set_params(**params) – Set the parameters of this estimator.
transform(X)
- get_metadata_routing()
Get metadata routing of this object.
Please check the User Guide on how the routing mechanism works.
- Returns:
routing – A MetadataRequest encapsulating routing information.
- Return type:
MetadataRequest
- get_params(deep=True)
Get parameters for this estimator.
- Parameters:
deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns:
params – Parameter names mapped to their values.
- Return type:
dict
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters:
**params (dict) – Estimator parameters.
- Returns:
self – Estimator instance.
- Return type:
estimator instance
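The `<component>__<parameter>` naming convention used by set_params can be illustrated with a toy helper that splits a key at the first double underscore. This is only a sketch of the convention, not scikit-learn's code; `split_param_key` is a hypothetical name and the keys below are example strings.

```python
def split_param_key(key):
    """Split '<component>__<parameter>' at the first double underscore."""
    component, delimiter, sub_key = key.partition("__")
    # Keys without '__' address a parameter of the object itself.
    return (component, sub_key) if delimiter else (key, None)

top_level = split_param_key("verbose")
one_level = split_param_key("clf__max_depth")
# Deeper nesting keeps the remainder intact, so the named component
# can resolve it further on its own.
two_levels = split_param_key("estimator_cv__estimator__n_jobs")
```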