butina_train_valid_test_split
- skfp.model_selection.butina_train_valid_test_split(data: Sequence[str | Mol], *additional_data: Sequence, train_size: float | None = None, valid_size: float | None = None, test_size: float | None = None, threshold: float = 0.65, approximate: bool = False, return_indices: bool = False, n_jobs: int | None = None) → tuple[Sequence[str | Mol], Sequence[str | Mol], Sequence[str | Mol], Sequence[Sequence[Any]]] | tuple[Sequence, ...] | tuple[Sequence[int], Sequence[int], Sequence[int]]
Split using Taylor-Butina clustering.
This split uses deterministically partitioned clusters of molecules from Taylor-Butina clustering [1] [2] [3]. It aims to verify the model generalization to structurally novel molecules.
First, molecules are vectorized using the binary ECFP4 fingerprint (radius 2) with 2048 bits. They are then clustered using Leader Clustering, a variant of Taylor-Butina clustering by Roger Sayle for RDKit [4]. Cluster centroids (central molecules) are guaranteed to have at least a given Tanimoto distance between them, as defined by the threshold parameter.
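The clustering idea can be illustrated with a minimal pure-Python sketch. It is not the library's or RDKit's implementation: fingerprints are represented as sets of on-bit indices, and the helper names `tanimoto_distance` and `leader_clustering` are hypothetical. Each item joins the first existing leader closer than the threshold, which is a simplification, but it preserves the stated guarantee that leaders are at least `threshold` apart.

```python
def tanimoto_distance(a: frozenset, b: frozenset) -> float:
    """Tanimoto (Jaccard) distance between two sets of on-bit indices."""
    union = len(a | b)
    if union == 0:
        return 0.0
    return 1.0 - len(a & b) / union


def leader_clustering(fingerprints, threshold=0.65):
    """Greedy leader clustering (illustrative sketch only).

    An item joins the first leader whose distance is below `threshold`;
    otherwise it becomes a new leader. Returns lists of item indices,
    where the first index of each list is the cluster leader.
    """
    clusters = []
    for idx, fp in enumerate(fingerprints):
        for cluster in clusters:
            leader = fingerprints[cluster[0]]
            if tanimoto_distance(fp, leader) < threshold:
                cluster.append(idx)
                break
        else:
            # No leader is close enough, so this item starts a new cluster.
            clusters.append([idx])
    return clusters


# Toy "fingerprints": sets of on-bit positions, not real ECFP4 output.
fingerprints = [
    frozenset({1, 2, 3, 4}),
    frozenset({1, 2, 3, 5}),
    frozenset({10, 11, 12}),
    frozenset({10, 11, 13}),
    frozenset({20, 21}),
]
clusters = leader_clustering(fingerprints, threshold=0.65)
```

In this toy input, the first two and the next two fingerprints share most of their bits, so each pair falls into one cluster, while the last fingerprint forms a singleton.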
Clusters are divided deterministically, with the smallest clusters assigned to the test subset, larger ones to the validation subset, and the rest to the training subset.
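The assignment of clusters to subsets can be sketched as follows. This is a hedged illustration rather than the library's code: the function name `assign_clusters` is hypothetical, sizes are treated as fractions, and the handling of boundaries (a cluster is added whole while the subset is still below its target) is an assumption.

```python
def assign_clusters(clusters, valid_size=0.1, test_size=0.1):
    """Assign whole clusters to test, validation and train subsets.

    Clusters are visited from smallest to largest. The test subset is
    filled first, then the validation subset, and all remaining clusters
    go to the training subset. Returns three lists of item indices.
    """
    n_items = sum(len(cluster) for cluster in clusters)
    test_target = round(test_size * n_items)
    valid_target = round(valid_size * n_items)

    train, valid, test = [], [], []
    # sorted() is stable, so ties keep their input order (deterministic).
    for cluster in sorted(clusters, key=len):
        if len(test) < test_target:
            test.extend(cluster)
        elif len(valid) < valid_target:
            valid.extend(cluster)
        else:
            train.extend(cluster)
    return train, valid, test


# Ten items in four clusters of sizes 6, 2, 1 and 1.
example_clusters = [[0, 1, 2, 3, 4, 5], [6, 7], [8], [9]]
train_idx, valid_idx, test_idx = assign_clusters(example_clusters)
```

Because clusters are never broken up, the resulting subset sizes only approximate the requested fractions when cluster sizes do not line up with the targets.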
If train_size, valid_size and test_size are integers, they must sum up to the data length. If they are floating-point numbers, they must sum up to 1.
- Parameters:
data (sequence) – A sequence representing either SMILES strings or RDKit Mol objects.
additional_data (sequence) – Additional sequences to be split alongside the main data, e.g. labels.
train_size (float, default=None) – The fraction of data to be used for the train subset. If None, it is set to 1 - test_size - valid_size. If valid_size is not provided, train_size is set to 1 - test_size. If train_size, test_size and valid_size aren’t set, train_size is set to 0.8.
valid_size (float, default=None) – The fraction of data to be used for the validation subset. If None, it is set to 1 - train_size - test_size. If train_size, test_size and valid_size aren’t set, valid_size is set to 0.1.
test_size (float, default=None) – The fraction of data to be used for the test subset. If None, it is set to 1 - train_size - valid_size. If valid_size is not provided, test_size is set to 1 - train_size. If train_size, test_size and valid_size aren’t set, test_size is set to 0.1.
threshold (float, default=0.65) – Tanimoto distance threshold, defining the minimal distance between cluster centroids. Default value is based on ECFP4 activity threshold as determined by Roger Sayle [4].
approximate (bool, default=False) – Whether to use approximate similarity calculation to speed up computation on large datasets. It uses the NNDescent algorithm [5] [6] and requires the PyNNDescent library to be installed. However, it is much slower on small datasets, so the exact version is always used for data with fewer than 5000 molecules.
return_indices (bool, default=False) – Whether the method should return the input object subsets, i.e. SMILES strings or RDKit Mol objects, or only the indices of the subsets instead of the data.
n_jobs (int, default=None) – The number of jobs to run in parallel. transform() is parallelized over the input molecules. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Scikit-learn documentation on n_jobs for more details.
- Returns:
subsets – Tuple with train-valid-test subsets of provided arrays. First three are lists of SMILES strings or RDKit Mol objects, depending on the input type. If return_indices is True, lists of indices are returned instead of actual data.
- Return type:
tuple[list, list, …]
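The default rules for train_size, valid_size and test_size described under Parameters can be sketched as a small resolver. This is an illustration under stated assumptions, not the library's code: the name `resolve_sizes` is hypothetical, only fractional sizes are handled, a missing valid_size is taken to be 1 - train_size - test_size, and the case where only valid_size is given is treated as an error because the rules above do not determine it.

```python
import math


def resolve_sizes(train_size=None, valid_size=None, test_size=None):
    """Fill in missing fractional split sizes (illustrative sketch)."""
    if train_size is None and valid_size is None and test_size is None:
        return 0.8, 0.1, 0.1

    if valid_size is None:
        # Without valid_size, train and test complement each other.
        if train_size is None:
            train_size = 1.0 - test_size
        elif test_size is None:
            test_size = 1.0 - train_size
        # Clamp tiny negative values caused by floating-point rounding.
        valid_size = max(0.0, 1.0 - train_size - test_size)
    elif train_size is None and test_size is None:
        # Assumption: this case is underdetermined, so reject it.
        raise ValueError("provide train_size or test_size with valid_size")
    elif train_size is None:
        train_size = 1.0 - test_size - valid_size
    elif test_size is None:
        test_size = 1.0 - train_size - valid_size

    total = train_size + valid_size + test_size
    if not math.isclose(total, 1.0):
        raise ValueError(f"sizes must sum to 1, got {total}")
    return train_size, valid_size, test_size


defaults = resolve_sizes()
only_test = resolve_sizes(test_size=0.2)
no_valid = resolve_sizes(train_size=0.7, test_size=0.2)
```

Here `only_test` leaves no room for a validation subset, while `no_valid` assigns the remaining fraction to validation.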
References