randomized_scaffold_train_valid_test_split
- skfp.model_selection.randomized_scaffold_train_valid_test_split(data: Sequence[str | Mol], *additional_data: Sequence, train_size: float | None = None, valid_size: float | None = None, test_size: float | None = None, use_csk: bool = False, return_indices: bool = False, random_state: int | RandomState | Generator | None = None)
Split using randomized groups of Bemis-Murcko scaffolds.
This split uses randomly partitioned groups of Bemis-Murcko molecular scaffolds [1] for splitting. This is a nondeterministic variant of scaffold split, introduced in the MoleculeNet [2] paper. It aims to verify the model generalization to new scaffolds, as an approximation to the time split, while also allowing multiple train-test splits.
By default, core structure scaffolds are used (following RDKit), which include atom types. The original Bemis-Murcko approach uses the cyclic skeleton (CSK) of a molecule [3], which replaces all atoms with carbons. It can be enabled with the use_csk parameter.
This approach is known to have certain limitations. In particular, molecules with no rings will not get a scaffold, resulting in them being grouped together regardless of their structure.
This variant is nondeterministic: the scaffolds are randomly shuffled before being assigned to subsets (in order: test, valid, train). This approach is also known as “balanced scaffold split”, and typically leads to a more optimistic evaluation than the regular, deterministic scaffold split [4].
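The shuffling and assignment described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the helper name randomized_group_split is hypothetical, the scaffolds are assumed to be precomputed (one scaffold SMILES per molecule, with an empty string for molecules without rings), and the greedy filling rule is an assumption about how whole groups are assigned.

```python
import random
from collections import defaultdict


def randomized_group_split(scaffolds, train_size=0.8, valid_size=0.1,
                           test_size=0.1, seed=0):
    """Hypothetical helper: split sample indices so that each scaffold
    group lands in exactly one subset. Returns (train, valid, test)."""
    if abs(train_size + valid_size + test_size - 1.0) > 1e-9:
        raise ValueError("fractional sizes must sum up to 1")

    # Group sample indices by scaffold; acyclic molecules share the
    # empty scaffold and therefore end up in a single group.
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Shuffle whole groups (not single molecules) with a seeded generator.
    group_list = list(groups.values())
    random.Random(seed).shuffle(group_list)

    n_samples = len(scaffolds)
    test_target = round(test_size * n_samples)
    valid_target = round(valid_size * n_samples)

    # Fill subsets in the order test, valid, train (assumed greedy rule);
    # a group is never divided between subsets.
    train, valid, test = [], [], []
    for group in group_list:
        if len(test) + len(group) <= test_target:
            test.extend(group)
        elif len(valid) + len(group) <= valid_target:
            valid.extend(group)
        else:
            train.extend(group)
    return train, valid, test


# Ten molecules: three ring systems plus two acyclic molecules ("").
example_scaffolds = (["c1ccccc1"] * 4 + ["c1ccncc1"] * 3
                     + [""] * 2 + ["C1CCCCC1"])
train_idx, valid_idx, test_idx = randomized_group_split(example_scaffolds, seed=42)
```

Because groups are kept whole, the resulting subset sizes only approximate the requested fractions, and different seeds give different splits of the same data.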
If train_size, valid_size and test_size are integers, they must sum up to the data length. If they are floating-point numbers, they must sum up to 1.
- Parameters:
data (sequence) – A sequence of either SMILES strings or RDKit Mol objects.
additional_data (sequence) – Additional sequences to be split alongside the main data, e.g. labels.
train_size (float, default=None) – The fraction of data to be used for the train subset. If None, it is set to 1 - test_size - valid_size. If valid_size is not provided, train_size is set to 1 - test_size. If train_size, test_size and valid_size aren’t set, train_size is set to 0.8.
valid_size (float, default=None) – The fraction of data to be used for the validation subset. If None, it is set to 1 - train_size - test_size. If train_size, test_size and valid_size aren’t set, valid_size is set to 0.1.
test_size (float, default=None) – The fraction of data to be used for the test subset. If None, it is set to 1 - train_size - valid_size. If valid_size is not provided, test_size is set to 1 - train_size. If train_size, test_size and valid_size aren’t set, test_size is set to 0.1.
use_csk (bool, default=False) – Whether to use the molecule cyclic skeleton (CSK), instead of the core structure scaffold.
return_indices (bool, default=False) – Whether to return only the indices of the subsets instead of the data itself, i.e. the SMILES strings or RDKit Mol objects.
random_state (int, NumPy RandomState or Generator instance, default=None) – Seed for the random number generator, or a random state instance, used for shuffling the scaffolds.
- Returns:
subsets – Tuple with train-valid-test subsets of the provided arrays. The first three are lists of SMILES strings or RDKit Mol objects, depending on the input type. If return_indices is True, lists of indices are returned instead of the actual data.
- Return type:
tuple[list, list, …]
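The size rules stated above (integers must sum up to the data length, fractions must sum up to 1, and the 0.8 / 0.1 / 0.1 defaults apply when nothing is set) can be illustrated with a small check. This is a sketch under assumptions: resolve_split_sizes is a hypothetical helper, not part of skfp, and partially specified sizes, which the parameter descriptions fill in by subtraction, are deliberately left out.

```python
def resolve_split_sizes(n_samples, train_size=None, valid_size=None, test_size=None):
    """Hypothetical helper: turn the three sizes into sample counts,
    returned as (n_train, n_valid, n_test)."""
    sizes = (train_size, valid_size, test_size)

    # Nothing set: fall back to the documented 0.8 / 0.1 / 0.1 defaults.
    if all(size is None for size in sizes):
        sizes = (0.8, 0.1, 0.1)
    if any(size is None for size in sizes):
        raise ValueError("this sketch expects all three sizes or none")

    # Integer sizes are absolute counts and must cover the whole dataset.
    if all(isinstance(size, int) for size in sizes):
        if sum(sizes) != n_samples:
            raise ValueError("integer sizes must sum up to the data length")
        return tuple(sizes)

    # Floating-point sizes are fractions and must sum up to 1.
    if all(isinstance(size, float) for size in sizes):
        if abs(sum(sizes) - 1.0) > 1e-9:
            raise ValueError("fractional sizes must sum up to 1")
        n_valid = round(sizes[1] * n_samples)
        n_test = round(sizes[2] * n_samples)
        return n_samples - n_valid - n_test, n_valid, n_test

    raise TypeError("sizes must be all integers or all floats")


default_counts = resolve_split_sizes(100)
explicit_counts = resolve_split_sizes(10, 6, 2, 2)
```

With a scaffold-based split these counts act as targets only, since scaffold groups are assigned whole.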
References