skfp.model_selection.randomized_scaffold_train_valid_test_split

skfp.model_selection.randomized_scaffold_train_valid_test_split(data: Sequence[str | Mol], *additional_data: Sequence, train_size: float | None = None, valid_size: float | None = None, test_size: float | None = None, use_csk: bool = False, return_indices: bool = False, random_state: int | RandomState | Generator | None = None)

Split using randomized groups of Bemis-Murcko scaffolds.

This split uses randomly partitioned groups of Bemis-Murcko molecular scaffolds [1]. It is a nondeterministic variant of the scaffold split, introduced in the MoleculeNet [2] paper. It aims to verify how well a model generalizes to new scaffolds, as an approximation of the time split, while also allowing multiple train-test splits.

By default, core structure scaffolds are used (following RDKit), which include atom types. The original Bemis-Murcko approach uses the cyclic skeleton (CSK) of a molecule [3], in which all atoms are replaced by carbons. It can be enabled with the use_csk parameter.

This approach is known to have certain limitations. In particular, molecules with no rings will not get a scaffold, resulting in them being grouped together regardless of their structure.

This variant is nondeterministic: the scaffolds are randomly shuffled before being assigned to subsets (in order: test, valid, train). This approach is also known as “balanced scaffold split”, and typically leads to a more optimistic evaluation than the regular, deterministic scaffold split [4].
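The shuffle-then-assign procedure described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the scaffold strings are hand-written placeholders (an empty string stands in for ring-free molecules, which all share one group), `assign_groups` is a hypothetical helper, and the greedy fill rule is an assumption.

```python
import random
from collections import defaultdict


def assign_groups(scaffolds, valid_size=0.1, test_size=0.1, seed=0):
    """Assign whole scaffold groups to test, valid, then train (illustrative)."""
    # Group molecule indices by scaffold key; ring-free molecules share the "" key.
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Shuffle the groups, not the individual molecules.
    group_list = list(groups.values())
    random.Random(seed).shuffle(group_list)

    n = len(scaffolds)
    n_test, n_valid = round(test_size * n), round(valid_size * n)
    train, valid, test = [], [], []
    for group in group_list:
        # Assumed greedy rule: fill test first, then valid, the rest goes to train.
        if len(test) < n_test:
            test.extend(group)
        elif len(valid) < n_valid:
            valid.extend(group)
        else:
            train.extend(group)
    return train, valid, test


# Hand-written placeholder scaffolds for ten molecules.
scaffolds = ["c1ccccc1"] * 3 + ["c1ccncc1"] * 2 + ["C1CCCCC1"] * 2 + [""] * 3
train, valid, test = assign_groups(scaffolds, valid_size=0.2, test_size=0.2, seed=0)
```

Because whole groups are moved at once, the subset sizes only approximate the requested fractions, and a different seed generally yields a different partition.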

The split fractions (train_size, valid_size, test_size) must sum to 1.

Parameters:
  • data (sequence) – A sequence representing either SMILES strings or RDKit Mol objects.

  • additional_data (sequence) – Additional sequences to be split alongside the main data, e.g. labels.

  • train_size (float, default=None) – The fraction of data to be used for the train subset. If None, it is set to 1 - test_size - valid_size. If valid_size is not provided, train_size is set to 1 - test_size. If train_size, test_size and valid_size aren’t set, train_size is set to 0.8.

  • valid_size (float, default=None) – The fraction of data to be used for the validation subset. If None, it is set to 1 - train_size - test_size. If train_size, test_size and valid_size aren’t set, valid_size is set to 0.1.

  • test_size (float, default=None) – The fraction of data to be used for the test subset. If None, it is set to 1 - train_size - valid_size. If valid_size is not provided, test_size is set to 1 - train_size. If train_size, test_size and valid_size aren’t set, test_size is set to 0.1.

  • use_csk (bool, default=False) – Whether to use the molecule cyclic skeleton (CSK), instead of the core structure scaffold.

  • return_indices (bool, default=False) – Whether to return the indices of the subsets instead of the data itself, i.e. the SMILES strings or RDKit Mol objects.

  • random_state (int, NumPy RandomState or Generator instance, default=None) – Seed for the random number generator, or a random state instance, used for shuffling the scaffolds.
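The rules for filling in missing split fractions can be illustrated with a small sketch. It is one reading of the parameter descriptions above, not the library's validation code: `resolve_sizes` is a hypothetical helper, and it only models the cases where none, one, or all three of the sizes are omitted.

```python
import math


def resolve_sizes(train_size=None, valid_size=None, test_size=None):
    """Fill in missing split fractions (illustrative reading of the rules above)."""
    sizes = {"train": train_size, "valid": valid_size, "test": test_size}
    missing = [name for name, value in sizes.items() if value is None]

    if len(missing) == 3:
        # Nothing specified: documented defaults.
        return 0.8, 0.1, 0.1
    if len(missing) == 2:
        # Not modelled in this sketch.
        raise NotImplementedError("sketch covers zero, one, or three missing sizes")
    if len(missing) == 1:
        # One fraction missing: it is the complement of the other two.
        known = sum(value for value in sizes.values() if value is not None)
        sizes[missing[0]] = 1.0 - known

    # All three fractions are known at this point and must sum to 1.
    if not math.isclose(sum(sizes.values()), 1.0):
        raise ValueError("train_size, valid_size and test_size must sum to 1")
    return sizes["train"], sizes["valid"], sizes["test"]
```

For instance, `resolve_sizes(train_size=0.7, test_size=0.2)` fills in a validation fraction of approximately 0.1, while `resolve_sizes(0.5, 0.1, 0.1)` raises a `ValueError` because the fractions do not sum to 1.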

Returns:

  • subsets (tuple[list, list, …]) – Tuple with train-valid-test subsets of provided arrays. First three are lists of SMILES strings or RDKit Mol objects, depending on the input type. If return_indices is True, lists of indices are returned instead of actual data.
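When indices are returned instead of data, they can be applied to any sequence aligned with the input. The sketch below shows the idea with hand-written placeholder index lists standing in for the function's output. The index values, the example molecules, and the `take` helper are all assumptions made for illustration.

```python
# Placeholder index lists, standing in for the output with return_indices=True.
train_idx, valid_idx, test_idx = [0, 2, 3, 5], [1], [4]

smiles = ["CCO", "c1ccccc1", "CCN", "CCC", "c1ccncc1", "CO"]
labels = [0, 1, 0, 0, 1, 0]


def take(sequence, indices):
    """Select the elements of a sequence at the given positions."""
    return [sequence[i] for i in indices]


# The same index lists slice the molecules and the labels consistently.
train_smiles, train_labels = take(smiles, train_idx), take(labels, train_idx)
test_smiles, test_labels = take(smiles, test_idx), take(labels, test_idx)
```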

References