randomized_scaffold_train_valid_test_split

Split using randomized groups of Bemis-Murcko scaffolds.

This split partitions the data by randomly assigned groups of Bemis-Murcko molecular scaffolds [1]. It is a nondeterministic variant of the scaffold split, introduced in the MoleculeNet [2] paper. It tests how well a model generalizes to new scaffolds, as an approximation to the time split, while also allowing multiple train-test splits.

By default, core structure scaffolds are used (following RDKit), which include atom types. The original Bemis-Murcko approach instead uses the cyclic skeleton (CSK) [3] of a molecule, replacing all atoms with carbons. Set the use_csk parameter to use it.

This approach is known to have certain limitations. In particular, molecules with no rings will not get a scaffold, resulting in them being grouped together regardless of their structure.

This variant is nondeterministic: the scaffolds are randomly shuffled before being assigned to subsets (in order: test, valid, train). This approach is also known as “balanced scaffold split”, and typically leads to a more optimistic evaluation than the regular, deterministic scaffold split [4].
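The shuffle-then-assign idea described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: the helper name randomized_group_split is hypothetical, scaffolds are assumed to be precomputed as strings (one per molecule, with an empty string for acyclic molecules), and the target sizes are given as absolute counts.

```python
import random
from collections import defaultdict


def randomized_group_split(scaffolds, valid_count, test_count, seed=0):
    """Hypothetical helper: assign whole scaffold groups to test, valid, train."""
    # Group molecule indices by scaffold key. Acyclic molecules all share
    # the empty-string key here, so they end up in a single group.
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Shuffle the groups, not the molecules, so a scaffold is never
    # divided between two subsets.
    group_list = list(groups.values())
    random.Random(seed).shuffle(group_list)

    # Fill subsets in the order test, valid, train. A subset may overshoot
    # its target count, because groups are always added whole.
    train, valid, test = [], [], []
    for group in group_list:
        if len(test) < test_count:
            test.extend(group)
        elif len(valid) < valid_count:
            valid.extend(group)
        else:
            train.extend(group)
    return train, valid, test


# Illustrative scaffold keys (assumed to be computed beforehand).
scaffolds = [
    "c1ccccc1", "c1ccccc1", "", "", "c1ccncc1",
    "C1CCCCC1", "c1ccccc1", "C1CCOC1", "c1ccncc1", "C1CCCCC1",
]
train_idx, valid_idx, test_idx = randomized_group_split(
    scaffolds, valid_count=2, test_count=2, seed=0
)
```

Changing seed changes which scaffold groups land in each subset, which is what allows multiple train-test splits of the same data. Every molecule still ends up in exactly one subset, and no scaffold is shared between subsets.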

If train_size, valid_size and test_size are integers, they must sum up to the data length. If they are floats, they must sum up to 1.
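The rule above can be written as a small check. The sketch below is illustrative only: check_split_sizes is a hypothetical helper, and the floating-point tolerance and the rejection of mixed integer and float sizes are assumptions, not documented behavior of the library.

```python
import math


def check_split_sizes(train_size, valid_size, test_size, n_samples):
    """Hypothetical helper: raise ValueError if the sizes are inconsistent."""
    sizes = (train_size, valid_size, test_size)
    if all(isinstance(s, int) for s in sizes):
        # Absolute counts must cover the whole dataset.
        if sum(sizes) != n_samples:
            raise ValueError(
                f"Integer sizes sum to {sum(sizes)}, expected {n_samples}"
            )
    elif all(isinstance(s, float) for s in sizes):
        # Fractions must sum to 1; isclose absorbs rounding error
        # (the tolerance is an assumption of this sketch).
        if not math.isclose(sum(sizes), 1.0):
            raise ValueError(f"Float sizes sum to {sum(sizes)}, expected 1.0")
    else:
        # Assumption: mixing integers and floats is rejected.
        raise ValueError("Sizes must be all integers or all floats")


check_split_sizes(0.8, 0.1, 0.1, n_samples=100)  # fractions, accepted
check_split_sizes(80, 10, 10, n_samples=100)     # counts, accepted
```

A call such as check_split_sizes(80, 10, 5, n_samples=100) would raise ValueError, since the counts cover only 95 of the 100 samples.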

Parameters:

data (sequence) – A sequence representing either SMILES strings or RDKit Mol objects.
additional_data (sequence) – Additional sequences to be split alongside the main data, e.g. labels.
train_size (float, default=None) – The fraction of data to be used for the train subset. If None, it is set to 1 - test_size - valid_size. If valid_size is not provided, train_size is set to 1 - test_size. If train_size, test_size and valid_size aren’t set, train_size is set to 0.8.
valid_size (float, default=None) – The fraction of data to be used for the validation subset. If None, it is set to 1 - train_size - test_size. If train_size, test_size and valid_size aren’t set, valid_size is set to 0.1.
test_size (float, default=None) – The fraction of data to be used for the test subset. If None, it is set to 1 - train_size - valid_size. If valid_size is not provided, test_size is set to 1 - train_size. If train_size, test_size and valid_size aren’t set, test_size is set to 0.1.
use_csk (bool, default=False) – Whether to use the molecule cyclic skeleton (CSK), instead of the core structure scaffold.
return_indices (bool, default=False) – Whether to return the indices of the subsets instead of the data itself, i.e. the SMILES strings or RDKit Mol objects.
random_state (int or NumPy Random Generator instance, default=0) – Seed for random number generator or random state that would be used for shuffling the scaffolds.

Returns:

subsets – Tuple with train-valid-test subsets of the provided arrays. The first three elements are lists of SMILES strings or RDKit Mol objects, depending on the input type. If return_indices is True, lists of indices are returned instead of the actual data.

Return type:

tuple[list, list, …]

References
