maxmin_train_test_split

skfp.model_selection.maxmin_train_test_split(data: Sequence[str | Mol], *additional_data: Sequence, train_size: float | None = None, test_size: float | None = None, return_indices: bool = False, random_state: int = 0) → tuple[Sequence[str | Mol], Sequence[str | Mol], Sequence[Sequence[Any]]] | tuple[Sequence, ...] | tuple[Sequence[int], Sequence[int]]

Split data into train and test subsets using the MaxMin algorithm.

MaxMinPicker is an efficient algorithm for picking an optimal subset of diverse compounds from a candidate pool. The original algorithm was introduced in [1], but here we use an optimized implementation by Roger Sayle [2] [3] [4].

First, molecules are vectorized using binary ECFP4 fingerprints (radius 2) with 2048 bits. The first test molecule is picked randomly. Each subsequent molecule is selected to maximize the minimal distance to the already selected molecules (hence the name MaxMin) [4]. Distances are calculated on the fly as required.

The test set is constructed first, and the training set consists of all remaining molecules.
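The selection loop described above can be illustrated with a minimal sketch. This is not the optimized implementation by Sayle that the function relies on: it is a naive version that computes every distance eagerly, and it assumes Tanimoto distance on binary vectors, which the text above does not state. The names `tanimoto_distance` and `maxmin_pick` are hypothetical helpers introduced only for this illustration.

```python
import numpy as np


def tanimoto_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Tanimoto (Jaccard) distance between two boolean vectors (assumed metric)."""
    union = np.count_nonzero(a | b)
    if union == 0:
        return 0.0
    return 1.0 - np.count_nonzero(a & b) / union


def maxmin_pick(fps: np.ndarray, n_pick: int, seed: int = 0) -> list[int]:
    """Naive MaxMin selection: return indices of n_pick diverse rows of fps."""
    n = len(fps)
    if not 0 < n_pick <= n:
        raise ValueError("n_pick must be between 1 and the number of rows")

    rng = np.random.default_rng(seed)
    picked = [int(rng.integers(n))]  # the first item is chosen at random

    # min_dist[i] is the distance from item i to its nearest picked item.
    min_dist = np.array([tanimoto_distance(fps[i], fps[picked[0]]) for i in range(n)])
    min_dist[picked] = -1.0  # picked items can never be selected again

    while len(picked) < n_pick:
        nxt = int(np.argmax(min_dist))  # maximize the minimal distance
        picked.append(nxt)
        new_dist = np.array([tanimoto_distance(fps[i], fps[nxt]) for i in range(n)])
        min_dist = np.minimum(min_dist, new_dist)
        min_dist[picked] = -1.0
    return picked


# Toy binary "fingerprints": rows 0 and 1 are identical, row 2 shares no bits with any other row.
fps = np.array(
    [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 0, 0]], dtype=bool
)
test_idx = maxmin_pick(fps, n_pick=2, seed=0)
# The training set is everything that was not picked for the test set.
train_idx = [i for i in range(len(fps)) if i not in test_idx]
```

In this toy input, row 2 is at the maximum distance from every other row, so it ends up in the test set regardless of which row is picked first.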

If train_size and test_size are integers, they must sum to the length of the data. If they are floats, they must sum to 1.
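As a rough sketch of these rules, the hypothetical helper below (`resolve_sizes` is not part of the library) fills in a missing size and checks the sum. It assumes that, for integer sizes, a missing value is the complement with respect to the data length. The documentation only states the complement rule for fractions.

```python
import math


def resolve_sizes(n_samples, train_size=None, test_size=None):
    """Return (train_size, test_size) following the documented defaults and checks."""
    if train_size is None and test_size is None:
        return 0.8, 0.2  # documented defaults

    given = train_size if train_size is not None else test_size
    # Integers are absolute counts; floats are fractions of the data.
    total = n_samples if isinstance(given, int) else 1.0

    if train_size is None:
        train_size = total - test_size
    elif test_size is None:
        test_size = total - train_size

    if not math.isclose(train_size + test_size, total):
        raise ValueError(f"train_size and test_size must sum to {total}")
    return train_size, test_size


sizes_default = resolve_sizes(100)
sizes_fraction = resolve_sizes(100, test_size=0.25)
sizes_counts = resolve_sizes(100, train_size=70)
```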

Parameters:
  • data (sequence) – A sequence representing either SMILES strings or RDKit Mol objects.

  • additional_data (list[sequence]) – Additional sequences to be split alongside the main data (e.g., labels or feature vectors).

  • train_size (float, default=None) – The fraction of data to be used for the train subset. If None, it is set to 1 - test_size. If test_size is also None, it will be set to 0.8.

  • test_size (float, default=None) – The fraction of data to be used for the test subset. If None, it is set to 1 - train_size. If train_size is also None, it will be set to 0.2.

  • return_indices (bool, default=False) – If False, return the data subsets themselves, i.e. SMILES strings or RDKit Mol objects. If True, return only the indices of the subsets instead of the data.

Returns:

subsets – Tuple with train-test subsets of the provided sequences. The first two elements are lists of SMILES strings or RDKit Mol objects, depending on the input type. If return_indices is True, lists of indices are returned instead of the actual data.

Return type:

tuple[list, list, …]
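When index lists are returned, the caller can apply them to any parallel sequence. The sketch below uses hand-written index lists as a stand-in for the function's output, so the particular indices are hypothetical and do not reflect what MaxMin would select. The helper `take` is likewise introduced only for illustration.

```python
smiles = ["CCO", "c1ccccc1", "CC(=O)O", "CCN", "CCCC"]
labels = [0, 1, 0, 1, 0]

# Hypothetical stand-ins for the (train, test) index lists
# returned when return_indices=True.
train_idx = [0, 2, 3, 4]
test_idx = [1]


def take(seq, indices):
    """Select the elements of seq at the given positions, preserving order."""
    return [seq[i] for i in indices]


smiles_train, smiles_test = take(smiles, train_idx), take(smiles, test_idx)
labels_train, labels_test = take(labels, train_idx), take(labels, test_idx)
```

Keeping the indices rather than the data can be convenient when several arrays, such as labels and precomputed features, need to be split consistently.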

References