FingerprintEstimatorRandomizedSearch
- class skfp.model_selection.FingerprintEstimatorRandomizedSearch(fingerprint: BaseFingerprintTransformer, fp_param_distributions: dict | list[dict], estimator_cv: BaseSearchCV, greater_is_better: bool = True, n_iter: int = 10, cache_best_fp_array: bool = False, verbose: int | dict = 0, random_state: int | None = 0)
Randomized search over specified hyperparameter distributions for a pipeline of a molecular fingerprint and scikit-learn estimator.
This approach is useful for pipelines that first compute fingerprints and then operate on the resulting matrices, when both fingerprint and estimator hyperparameters are optimized. The regular scikit-learn combination of Pipeline and RandomizedSearchCV would recompute the fingerprint for each set of hyperparameter values. Here, we instead perform a nested loop:
1. Randomly select a combination of fingerprint hyperparameter values.
2. Compute the fingerprint.
3. Optimize the estimator hyperparameters.
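The three steps above can be sketched in plain Python. This is only an illustration of the nested-loop idea, not the library's implementation: `nested_random_search`, `compute_fp`, and `tune_estimator` are hypothetical names, and the stand-in functions simply count how often they are called.

```python
import random


def nested_random_search(fp_param_distributions, compute_fp, tune_estimator,
                         n_iter=10, greater_is_better=True, seed=0):
    """Toy outer loop: sample fingerprint params, compute once, tune estimator."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_iter):
        # 1. Randomly select a combination of fingerprint hyperparameter values
        fp_params = {name: rng.choice(values)
                     for name, values in fp_param_distributions.items()}
        # 2. Compute the fingerprint once for this combination
        X_fp = compute_fp(fp_params)
        # 3. Optimize estimator hyperparameters, reusing the computed matrix
        score = tune_estimator(X_fp)
        results.append({**fp_params, "score": score})
    pick = max if greater_is_better else min
    best = pick(results, key=lambda r: r["score"])
    return best, results


# Hypothetical stand-ins that only count how often they are called
calls = {"fingerprint": 0, "estimator": 0}


def compute_fp(fp_params):
    calls["fingerprint"] += 1
    return [fp_params["fp_size"]]  # placeholder for a fingerprint matrix


def tune_estimator(X_fp):
    scores = []
    for min_samples_split in range(2, 7):  # 5 estimator candidates
        calls["estimator"] += 1
        scores.append(1.0 / (1 + abs(X_fp[0] - 1024) + min_samples_split))
    return max(scores)


best, results = nested_random_search(
    {"fp_size": [512, 1024, 2048]}, compute_fp, tune_estimator, n_iter=4
)
print(calls)  # {'fingerprint': 4, 'estimator': 20}
```

Each fingerprint is computed once per outer iteration, while the estimator candidates are all evaluated on that same matrix.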
This way, the computed fingerprint representations are reused across many sets of estimator hyperparameters. This is useful when tuning only the classifier, or both the fingerprint and the classifier. When only the fingerprint is tuned, a combination of Pipeline and GridSearchCV is enough. The difference is particularly significant for computationally heavy fingerprints and large grids for estimators.
Note that much of the behavior is controlled via the passed estimator_cv object, e.g. the scoring metric used to select the best pipeline. In particular, the inner CV is evaluated for each of the n_iter random selections of fingerprint hyperparameters, i.e. once per outer loop iteration. This should be taken into consideration when selecting n_iter or hyperparameter grids. If RandomizedSearchCV is used, then the result is roughly equivalent to using randomized search on all hyperparameters, but faster. However, any other strategy can be used for the estimator, e.g. GridSearchCV.
- Parameters:
fingerprint (fingerprint object) – Instance of any fingerprint class. To maximize performance, consider setting n_jobs larger than 1, since parallelization is not performed here when going through the fingerprint hyperparameter grid.
fp_param_distributions (dict or list[dict]) – Dictionary with names of fingerprint hyperparameters as keys and lists of hyperparameter settings to try as values, or a list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored. This enables searching over any sequence of hyperparameter settings.
estimator_cv (object) – Inner cross-validation object for tuning the estimator, e.g. RandomizedSearchCV. Should be an instantiated object, not a class.
greater_is_better (bool, default=True) – Whether higher values of the scoring metric in estimator_cv are better. False should be used for error (loss) functions, typically used in regression.
n_iter (int, default=10) – How many iterations of random search to perform.
cache_best_fp_array (bool, default=False) – Whether to cache the array of values from the best fingerprint in the best_fp_array_ attribute. Note that this can result in high memory usage.
verbose (int or dict, default=0) –
Controls the verbosity when computing fingerprints.
>0 : size of parameter grid, parameter candidate for each fold
>1 : the computation time and score for each candidate
If a dictionary is passed, it is treated as kwargs for tqdm(), and can be used to control the progress bar.
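Because the inner search runs once per outer iteration, the total work multiplies with n_iter. The following back-of-the-envelope count assumes a randomized inner search with its own number of candidates and k-fold cross-validation, and ignores any final refit. The function name `search_cost` is hypothetical.

```python
def search_cost(outer_n_iter, inner_n_iter, cv_folds):
    """Rough count of fingerprint computations and estimator fits."""
    fingerprint_computations = outer_n_iter
    estimator_fits = outer_n_iter * inner_n_iter * cv_folds
    return fingerprint_computations, estimator_fits


# 5 fingerprint settings, 5 estimator candidates each, 5-fold CV
print(search_cost(outer_n_iter=5, inner_n_iter=5, cv_folds=5))  # (5, 125)
```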
- cv_results_
List of dictionaries, where each one contains a set of hyperparameters (names and values) and a "score" key with the cross-validated performance of the pipeline with those hyperparameters.
- Type:
list[dict]
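Since cv_results_ is described as a plain list of dictionaries, it can be inspected with ordinary Python. The sketch below uses a hand-written list with made-up scores in place of a fitted search; `rank_results` is a hypothetical helper.

```python
# Hypothetical cv_results_-style list; real entries come from a fitted search
cv_results = [
    {"fp_size": 512, "score": 0.71},
    {"fp_size": 768, "score": 0.78},
    {"fp_size": 2048, "score": 0.74},
]


def rank_results(cv_results, greater_is_better=True):
    """Sort entries from best to worst according to the score direction."""
    return sorted(cv_results, key=lambda r: r["score"], reverse=greater_is_better)


ranked = rank_results(cv_results)
# Drop the "score" key to recover only the hyperparameter values
best_params = {k: v for k, v in ranked[0].items() if k != "score"}
print(best_params)  # {'fp_size': 768}
```

For a loss metric, passing `greater_is_better=False` puts the smallest score first.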
- best_fp_
Fingerprint that was chosen by the search, i.e. the fingerprint which gave the highest score (or smallest loss if specified) on the left-out data. Use with best_estimator_cv_ to obtain the best found pipeline.
- Type:
fingerprint object
- best_fp_params_
Fingerprint hyperparameter values that gave the best results on the held-out data.
- Type:
dict
- best_fp_array_
Fingerprint values for best_fp_. If cache_best_fp_array is False, this will not be used and will be None instead.
- Type:
np.ndarray
- best_score_
Mean cross-validated score of the best fingerprint and estimator.
- Type:
float
- best_estimator_cv_
Inner cross-validation object that gave the best results on the held-out data. Use with best_fp_ to obtain the best found pipeline.
- Type:
CV object
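A minimal sketch of combining best_fp_ with best_estimator_cv_ follows. It assumes that the fingerprint exposes transform and that the inner search object exposes predict (as scikit-learn search objects do when refitting is enabled). The helper `predict_with_best` is hypothetical, and stand-in objects replace a fitted search so the snippet runs without dependencies.

```python
from types import SimpleNamespace


def predict_with_best(search, smiles):
    """Apply the best fingerprint, then the tuned inner search object."""
    X_fp = search.best_fp_.transform(smiles)
    return search.best_estimator_cv_.predict(X_fp)


# Stand-in objects (assumption); with a fitted
# FingerprintEstimatorRandomizedSearch, pass that object instead.
fake_search = SimpleNamespace(
    best_fp_=SimpleNamespace(
        transform=lambda smiles: [[len(s)] for s in smiles]
    ),
    best_estimator_cv_=SimpleNamespace(
        predict=lambda X: [int(row[0] > 3) for row in X]
    ),
)
print(predict_with_best(fake_search, ["CCO", "c1ccccc1"]))  # [0, 1]
```

The class also lists its own predict method, which may make such a helper unnecessary in practice.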
See also
FingerprintEstimatorGridSearch
Related class that uses grid search for fingerprint hyperparameters.
Examples
>>> from skfp.datasets.moleculenet import load_bace
>>> from skfp.fingerprints import ECFPFingerprint
>>> from skfp.model_selection import FingerprintEstimatorRandomizedSearch
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.model_selection import RandomizedSearchCV
>>> smiles, labels = load_bace()
>>> fp = ECFPFingerprint(n_jobs=-1)
>>> fp_params = {"fp_size": list(range(512, 4097, 128))}
>>> clf = RandomForestClassifier(n_jobs=-1)
>>> clf_params = {"min_samples_split": list(range(2, 10))}
>>> clf_cv = RandomizedSearchCV(clf, clf_params, n_iter=5, random_state=0)
>>> fp_cv = FingerprintEstimatorRandomizedSearch(fp, fp_params, clf_cv, n_iter=5)
>>> fp_cv = fp_cv.fit(smiles, labels)
>>> fp_cv.best_fp_params_
{'fp_size': 768}
Methods
fit(X[, y])
get_metadata_routing() – Get metadata routing of this object.
get_params([deep]) – Get parameters for this estimator.
predict(X)
predict_proba(X)
set_params(**params) – Set the parameters of this estimator.
transform(X)
- get_metadata_routing()
Get metadata routing of this object.
Please check the User Guide on how the routing mechanism works.
- Returns:
routing – A MetadataRequest encapsulating routing information.
- Return type:
MetadataRequest
- get_params(deep=True)
Get parameters for this estimator.
- Parameters:
deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns:
params – Parameter names mapped to their values.
- Return type:
dict
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters:
**params (dict) – Estimator parameters.
- Returns:
self – Estimator instance.
- Return type:
estimator instance
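The <component>__<parameter> convention can be illustrated with a toy helper. This is a simplified sketch, not scikit-learn's actual implementation (which also validates parameter names). The classes `Component` and `Search` and the function `set_nested_params` are hypothetical.

```python
class Component:
    """Hypothetical stand-in for a nested object such as a fingerprint."""

    def __init__(self, fp_size=2048):
        self.fp_size = fp_size


class Search:
    """Hypothetical container holding a nested component."""

    def __init__(self, fingerprint, n_iter=10):
        self.fingerprint = fingerprint
        self.n_iter = n_iter


def set_nested_params(obj, **params):
    """Toy version of the <component>__<parameter> convention."""
    for key, value in params.items():
        name, delim, sub_key = key.partition("__")
        if delim:
            # Recurse into the named component with the remaining key
            set_nested_params(getattr(obj, name), **{sub_key: value})
        else:
            setattr(obj, name, value)
    return obj


search = set_nested_params(
    Search(Component()), n_iter=20, fingerprint__fp_size=1024
)
print(search.n_iter, search.fingerprint.fp_size)  # 20 1024
```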