randomized_scaffold_train_test_split
- skfp.model_selection.randomized_scaffold_train_test_split(data: Sequence[str | Mol], *additional_data: Sequence, train_size: float | None = None, test_size: float | None = None, use_csk: bool = False, return_indices: bool = False, random_state: int | RandomState | Generator | None = None)
Split using randomized groups of Bemis-Murcko scaffolds.
This split partitions the data using randomly assigned groups of Bemis-Murcko molecular scaffolds [1]. This is a nondeterministic variant of scaffold split, introduced in the MoleculeNet [2] paper. It aims to test how well the model generalizes to new scaffolds, as an approximation of the time split, while also allowing multiple train-test splits.
By default, core structure scaffolds are used (following RDKit), which include atom types. The original Bemis-Murcko approach instead uses the cyclic skeleton (CSK) [3] of a molecule, replacing all atoms with carbons. It can be enabled with the use_csk parameter.
This approach is known to have certain limitations. In particular, molecules with no rings will not get a scaffold, resulting in them being grouped together regardless of their structure.
This variant is nondeterministic: the scaffolds are randomly shuffled before being assigned to subsets (the test set is created first). This approach is also known as “balanced scaffold split”, and typically leads to a more optimistic evaluation than the regular, deterministic scaffold split [4].
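The shuffle-then-assign procedure described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the helper name `randomized_group_split`, the precomputed scaffold strings, the use of an empty string for acyclic molecules, and the rounding of the test size are all assumptions made for the example.

```python
import random
from collections import defaultdict


def randomized_group_split(scaffolds, test_size=0.2, seed=0):
    """Split sample indices by scaffold group; returns (train_idxs, test_idxs)."""
    # Group sample indices by scaffold string. Acyclic molecules are assumed
    # to carry an empty scaffold "", so they all end up in a single group.
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Shuffle whole scaffold groups, not individual molecules.
    group_list = list(groups.values())
    random.Random(seed).shuffle(group_list)

    # Fill the test set first; every remaining group goes to the train set.
    n_test = round(test_size * len(scaffolds))  # rounding is an assumption
    train_idxs, test_idxs = [], []
    for group in group_list:
        if len(test_idxs) < n_test:
            test_idxs.extend(group)
        else:
            train_idxs.extend(group)
    return train_idxs, test_idxs


# Hypothetical precomputed scaffold SMILES, one entry per molecule.
scaffolds = [
    "c1ccccc1", "c1ccccc1", "", "c1ccncc1", "",
    "C1CCCCC1", "c1ccncc1", "c1ccoc1", "c1ccccc1", "c1ccsc1",
]
train_idxs, test_idxs = randomized_group_split(scaffolds, test_size=0.2, seed=0)
```

Because whole groups are assigned at once, no scaffold appears in both subsets, and the test set may end up somewhat larger than the requested fraction. Changing `seed` gives a different, equally valid split.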
If train_size and test_size are integers, they must sum up to the data length. If they are floating numbers, they must sum up to 1.
- Parameters:
data (sequence) – A sequence representing either SMILES strings or RDKit Mol objects.
additional_data (sequence) – Additional sequences to be split alongside the main data, e.g. labels.
train_size (float, default=None) – The fraction of data to be used for the train subset. If None, it is set to 1 - test_size. If test_size is also None, it will be set to 0.8.
test_size (float, default=None) – The fraction of data to be used for the test subset. If None, it is set to 1 - train_size. If train_size is also None, it will be set to 0.2.
use_csk (bool, default=False) – Whether to use the molecule cyclic skeleton (CSK), instead of the core structure scaffold.
return_indices (bool, default=False) – Whether the method should return the input object subsets, i.e. SMILES strings or RDKit Mol objects, or only the indices of the subsets instead of the data.
random_state (int or NumPy Random Generator instance, default=0) – Seed for random number generator or random state that would be used for shuffling the scaffolds.
- Returns:
subsets – Tuple with train-test subsets of provided arrays. First two are lists of SMILES strings or RDKit Mol objects, depending on the input type. If return_indices is True, lists of indices are returned instead of actual data.
- Return type:
tuple[list, list, …]
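The rules for train_size and test_size stated above (defaults of 0.8 and 0.2, complements when only one is given, and the required sums) can be illustrated with a small sketch. The helper `resolve_split_sizes` is hypothetical and is not part of skfp. The rounding of fractional sizes is also an assumption, so the library may produce slightly different counts.

```python
def resolve_split_sizes(n_samples, train_size=None, test_size=None):
    """Return (n_train, n_test) as sample counts, following the documented rules."""
    # Documented defaults: 0.8 / 0.2 when neither size is provided.
    if train_size is None and test_size is None:
        train_size, test_size = 0.8, 0.2

    is_float = isinstance(train_size, float) or isinstance(test_size, float)
    total = 1.0 if is_float else n_samples

    # A missing size is the complement of the one that was provided.
    if train_size is None:
        train_size = total - test_size
    elif test_size is None:
        test_size = total - train_size

    if is_float:
        # Fractions must sum to 1 (small tolerance for float arithmetic).
        if abs(train_size + test_size - 1.0) > 1e-9:
            raise ValueError("float sizes must sum to 1")
        n_test = round(test_size * n_samples)  # rounding is an assumption
        return n_samples - n_test, n_test

    # Integer sizes must sum to the data length.
    if train_size + test_size != n_samples:
        raise ValueError("integer sizes must sum to the data length")
    return train_size, test_size
```

For example, `resolve_split_sizes(10, train_size=7)` yields seven training and three test samples, while `resolve_split_sizes(10, train_size=0.5, test_size=0.6)` raises a `ValueError`.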
References