skfp.model_selection.scaffold_train_valid_test_split

skfp.model_selection.scaffold_train_valid_test_split(data: Sequence[str | Mol], *additional_data: Sequence, train_size: float | None = None, valid_size: float | None = None, test_size: float | None = None, use_csk: bool = False, return_indices: bool = False) → tuple[Sequence[str | Mol], Sequence[str | Mol], Sequence[str | Mol], Sequence[Sequence[Any]]] | tuple[Sequence, ...] | tuple[Sequence[int], Sequence[int], Sequence[int]]

Split using groups of Bemis-Murcko scaffolds.

This split uses deterministically partitioned groups of Bemis-Murcko molecular scaffolds [1], as introduced in the MoleculeNet paper [2]. It tests how well a model generalizes to new and rare scaffolds, and serves as an approximation of a time split.

By default, core structure scaffolds are used (following RDKit), which include atom types. The original Bemis-Murcko approach uses the cyclic skeleton of a molecule, replacing all atoms with carbons. This variant is also known as CSK (Cyclic SKeleton) [3], and can be enabled with the use_csk parameter.

This approach has known limitations. In particular, molecules with no rings have no scaffold, so they are all grouped together regardless of their structure.

The split is fully deterministic, with the smallest scaffold sets assigned to the test subset, larger to the validation subset, and the rest to the training subset.
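The grouping and assignment described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the scaffold strings are written by hand (in practice they would be computed with RDKit), the helper name assign_by_scaffold is hypothetical, and the tie-breaking and rounding rules are assumptions.

```python
from collections import defaultdict


def assign_by_scaffold(scaffolds, valid_size=0.1, test_size=0.1):
    """Greedy scaffold split sketch (hypothetical helper, not the skfp code).

    `scaffolds` holds one scaffold string per molecule. Acyclic molecules
    are assumed to share the empty string, so they form a single group.
    """
    # Group molecule indices by scaffold.
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Smallest groups first; ties are broken by the first index so the
    # order is deterministic (assumed tie-break rule).
    ordered = sorted(groups.values(), key=lambda g: (len(g), g[0]))

    n = len(scaffolds)
    test_target = round(test_size * n)    # assumed rounding rule
    valid_target = round(valid_size * n)  # assumed rounding rule

    train, valid, test = [], [], []
    for group in ordered:
        # A whole group always lands in one subset, so no scaffold is
        # shared between subsets.
        if len(test) + len(group) <= test_target:
            test.extend(group)
        elif len(valid) + len(group) <= valid_target:
            valid.extend(group)
        else:
            train.extend(group)
    return train, valid, test


# Hand-written scaffolds for ten molecules; "" marks acyclic molecules.
scaffolds = [
    "c1ccccc1", "c1ccccc1", "c1ccccc1", "c1ccccc1", "c1ccccc1",
    "", "", "",
    "c1ccncc1",
    "C1CCCCC1",
]
train_idx, valid_idx, test_idx = assign_by_scaffold(scaffolds)
```

In this toy input the two singleton scaffolds fill the test and validation subsets, while the acyclic group and the large benzene group both end up in training.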

The split fractions (train_size, valid_size, test_size) must sum to 1.

Parameters:
  • data (sequence) – A sequence representing either SMILES strings or RDKit Mol objects.

  • additional_data (sequence) – Additional sequences to be split alongside the main data, e.g. labels.

  • train_size (float, default=None) – The fraction of data to be used for the train subset. If None, it is set to 1 - test_size - valid_size. If valid_size is not provided, train_size is set to 1 - test_size. If train_size, test_size and valid_size aren’t set, train_size is set to 0.8.

  • valid_size (float, default=None) – The fraction of data to be used for the validation subset. If None, it is set to 1 - train_size - test_size. If train_size, test_size and valid_size aren’t set, valid_size is set to 0.1.

  • test_size (float, default=None) – The fraction of data to be used for the test subset. If None, it is set to 1 - train_size - valid_size. If valid_size is not provided, test_size is set to 1 - train_size. If train_size, test_size and valid_size aren’t set, test_size is set to 0.1.

  • use_csk (bool, default=False) – Whether to use the molecule cyclic skeleton (CSK), instead of the core structure scaffold.

  • return_indices (bool, default=False) – Whether to return the indices of the subsets instead of the data. If False, subsets of the input objects (SMILES strings or RDKit Mol objects) are returned.
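The defaulting rules for train_size, valid_size and test_size can be read as the following sketch. The function resolve_sizes is a hypothetical helper written only to illustrate the documented rules; treating a missing valid_size as 0 and rejecting the case where only valid_size is given are assumptions, and the library's own validation may be stricter.

```python
import math


def resolve_sizes(train_size=None, valid_size=None, test_size=None):
    """Fill in missing split fractions (illustrative, not the skfp code)."""
    # Nothing provided: fall back to the documented 0.8 / 0.1 / 0.1 split.
    if train_size is None and valid_size is None and test_size is None:
        return 0.8, 0.1, 0.1

    # Only valid_size given: the documented rules do not cover this case.
    if train_size is None and test_size is None:
        raise ValueError("provide train_size or test_size")

    if train_size is None:
        # Assumption: a missing valid_size counts as 0 here.
        valid_size = 0.0 if valid_size is None else valid_size
        train_size = 1.0 - test_size - valid_size
    elif test_size is None:
        valid_size = 0.0 if valid_size is None else valid_size
        test_size = 1.0 - train_size - valid_size
    elif valid_size is None:
        valid_size = 1.0 - train_size - test_size

    sizes = (train_size, valid_size, test_size)
    # Small tolerance guards against floating-point noise in the subtraction.
    if any(s < -1e-9 for s in sizes) or not math.isclose(sum(sizes), 1.0):
        raise ValueError("split fractions must be non-negative and sum to 1")
    return sizes


default_sizes = resolve_sizes()
derived_sizes = resolve_sizes(train_size=0.7, test_size=0.2)
```

With train_size=0.7 and test_size=0.2, the remaining fraction of about 0.1 is assigned to the validation subset.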

Returns:

  • subsets (tuple[list, list, …]) – Tuple with train-valid-test subsets of provided arrays. First three are lists of SMILES strings or RDKit Mol objects, depending on the input type. If return_indices is True, lists of indices are returned instead of actual data.
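When indices are returned, they can be applied to any sequence aligned with the input, such as labels. The sketch below uses hand-written index lists that only mimic the shape of the result with return_indices=True; they are not actual output of the function, and the helper take is hypothetical.

```python
def take(sequence, indices):
    """Select the items of `sequence` at the given positions, in order."""
    return [sequence[i] for i in indices]


# Illustrative inputs: six SMILES strings with one label each.
smiles = ["CCO", "c1ccccc1", "c1ccccc1O", "CCN", "c1ccncc1", "C1CCCCC1"]
labels = [0, 1, 1, 0, 1, 0]

# Hypothetical index lists, shaped like the three lists returned when
# return_indices=True (not actual output of the function).
train_idx, valid_idx, test_idx = [0, 3, 1, 2], [5], [4]

smiles_train, labels_train = take(smiles, train_idx), take(labels, train_idx)
smiles_valid, labels_valid = take(smiles, valid_idx), take(labels, valid_idx)
smiles_test, labels_test = take(smiles, test_idx), take(labels, test_idx)
```

Working with indices keeps a single split reusable across several aligned arrays, for example precomputed fingerprints and labels.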

References