Dataset splits#
Splitting a dataset into training and testing sets can be done in a few different ways. If we have enough data, we can also split the training set into train and validation parts, instead of using cross-validation.
scikit-fingerprints implements many split types specific to chemistry, which take the molecular data itself into consideration. This can result in much more realistic splits than simple random splitting.
They are generally divided into two settings:
Internal / interpolative testing, where we expect future data to be similar to the training data in terms of distribution. For example, new molecules would be similar to existing data structurally, in physicochemical properties, bioactivity, etc. In other words, we want to test the in-distribution performance of ML models.
External / extrapolative testing, where we know that future data will differ substantially from the training data. For example, we will work on novel structures, non-patented molecules, new chemical spaces, etc. Therefore, we need to test the out-of-distribution (OOD) generalization ability of ML models.
Splitting methods for internal testing are random and MaxMin (maximum diversity) splitting, whereas scaffold and Butina splits are designed for extrapolative testing. Which split you want to use depends on the use case, what assumptions you make, and what kind of generalization you want to check.
Let’s go over those types of splits.
Random split#
A typical splitting method, implemented in the train_test_split function in scikit-learn. It does not use the structure of the data, but instead just randomly assigns samples to the train and test sets.
It can use stratification, ensuring that the resulting splits have the same proportion of classes as the full dataset, which is useful for imbalanced classification. Since it relies on randomness, we can perform the split multiple times with different random_state values and calculate e.g. the standard deviation of the results.
However, it can overestimate performance if we expect novel data in the future (see e.g. the MoleculeNet paper, here, here and here). It is also susceptible to “clumping”, i.e. natural clustering of molecules in chemical space (see e.g. the MUV dataset paper), where training and testing molecules are very similar, which inflates the quality metric values. This is because random picking samples dense regions of chemical space more often, in proportion to their number of molecules. Therefore, the performance estimation underestimates the importance of more sparsely sampled areas, which may still be very interesting, e.g. due to patentability.
Pros:
simple
stratification
can use many random initializations
Cons:
often overestimates performance
very similar molecules in train and test sets
susceptible to clumping
uneven sampling of chemical space
[5]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

from skfp.datasets.moleculenet import load_bace
from skfp.fingerprints import ECFPFingerprint

# load the BACE dataset as SMILES strings and binary labels
smiles_list, y = load_bace()

# random split, 75%/25% train/test by default
smiles_train, smiles_test, y_train, y_test = train_test_split(
    smiles_list, y, random_state=0
)

# ECFP fingerprints + random forest classifier
pipeline = make_pipeline(
    ECFPFingerprint(),
    RandomForestClassifier(n_jobs=-1, random_state=0),
)
pipeline.fit(smiles_train, y_train)

y_pred = pipeline.predict_proba(smiles_test)[:, 1]
auroc = roc_auc_score(y_test, y_pred)
print(f"AUROC random split: {auroc:.2%}")
AUROC random split: 89.30%
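As mentioned above, random split also supports stratification. A minimal sketch, passing the labels to the stratify parameter of train_test_split and reusing the pipeline defined above:
[ ]:
# stratified random split: class proportions in the train and test sets
# match those in the full dataset
smiles_train, smiles_test, y_train, y_test = train_test_split(
    smiles_list, y, stratify=y, random_state=0
)
pipeline.fit(smiles_train, y_train)

y_pred = pipeline.predict_proba(smiles_test)[:, 1]
auroc = roc_auc_score(y_test, y_pred)
print(f"AUROC stratified random split: {auroc:.2%}")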
MaxMin split#
Maximum diversity picking is the task of selecting the \(k\) most diverse molecules from a set of \(n\) molecules. In practice, it is typically implemented as selecting the \(k\) molecules with the maximal sum of pairwise Tanimoto distances between them, using ECFP4 fingerprints as the representation. This way, we can select a maximally diverse test set. scikit-fingerprints uses the MaxMin algorithm from RDKit, which is an efficient approximation (the exact solution is NP-hard). Publications using this split include e.g. ApisTox, this paper and this paper.
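Under the hood, maximum diversity picking can be done directly with the MaxMinPicker class from RDKit. A rough sketch of the mechanism on ECFP4 bit vectors is shown below; the pick size of 100 is an arbitrary choice for illustration, and the split function used later handles all of this internally:
[ ]:
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.SimDivFilters.rdSimDivPickers import MaxMinPicker

# ECFP4 bit vectors (Morgan fingerprints with radius 2)
mols = [Chem.MolFromSmiles(smi) for smi in smiles_list]
fps = [AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048) for mol in mols]

# greedily pick 100 maximally diverse molecules (approximate MaxMin algorithm);
# the starting molecule is chosen at random
picker = MaxMinPicker()
picks = picker.LazyBitVectorPick(fps, len(fps), 100)
diverse_smiles = [smiles_list[i] for i in picks]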
Picking maximally diverse molecules results in relatively uniform coverage of the whole chemical space in the dataset. This alleviates problems like clumping and undersampling of sparse areas. It validates the internal generalization performance across the whole spectrum of molecular structures available. Since this is an approximation relying on a random starting molecule, it can also use random_state to obtain different split variants.
The performance estimation can still be influenced by very similar molecules in train and test sets. It also does not consider class distribution, and for very imbalanced datasets can pick almost only the majority class.
Pros:
measures performance uniformly in the entire chemical space
resistant to clumping and sparse areas
more robust internal measure than random split
can use many random initializations
Cons:
can have very similar molecules in train and test sets
may not work well for heavily imbalanced datasets
[2]:
from skfp.model_selection import maxmin_train_test_split
smiles_train, smiles_test, y_train, y_test = maxmin_train_test_split(smiles_list, y)
pipeline.fit(smiles_train, y_train)
y_pred = pipeline.predict_proba(smiles_test)[:, 1]
auroc = roc_auc_score(y_test, y_pred)
print(f"AUROC MaxMin split: {auroc:.2%}")
AUROC MaxMin split: 88.42%
Scaffold split#
Scaffold split divides molecules based on their Bemis-Murcko scaffold. The scaffold is the “backbone” of a molecule, built from its connected ring systems and using only carbons. The idea is that molecules with the same scaffold have the same general structure and shape, differing only by substituents on the “edges” of the molecule. In the scaffold split, we first group molecules by their scaffold, and the test set is made by combining the smallest scaffold groups. This way, we test on the most atypical, rare scaffolds, requiring out-of-distribution generalization to structurally novel molecules.
Using it for train-test splitting was proposed in the MoleculeNet paper and has been widely adopted, being arguably the most popular split in molecular ML nowadays. This split is very fast to compute, in contrast to many external testing splits. It typically results in a more realistic performance estimation than random split, particularly for tasks requiring novel molecule design.
However, it is susceptible to small changes in the scaffold atoms, where almost identical molecules can differ by a single atom and get different scaffolds. This is a consequence of rigidly defining similarity as “identical scaffold or not”, instead of answering the more general question of how structurally similar two molecules are. For further discussion, see e.g. the RDKit blog, this blog post, and this paper. In the canonical version, it also does not work for molecules with disconnected components, e.g. salts. RDKit and scikit-fingerprints allow them, using a predefined ordering, but the proper scaffold is not well defined in those cases. The split is fully deterministic, and only a single train-test split is possible.
There are also some variants of the scaffold definition, which can sometimes be useful, but can be challenging for reproducible benchmarks. While the original Bemis-Murcko scaffold is very “generic” and uses carbon-only connected ring systems (CSK, Cyclic SKeleton), RDKit scaffolds include some substituent atoms. See this RDKit discussion for details. scikit-fingerprints uses the RDKit version by default, but you can switch to the CSK with the use_csk parameter.
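To see the difference between the two definitions, we can compute both scaffolds directly with RDKit. A rough sketch, using the MurckoScaffold module; the example molecule is arbitrary:
[ ]:
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

mol = Chem.MolFromSmiles("CC(=O)Nc1ccc(O)cc1")  # arbitrary example molecule

# RDKit-style Bemis-Murcko scaffold: ring systems and the linkers between them
scaffold = MurckoScaffold.GetScaffoldForMol(mol)
print(Chem.MolToSmiles(scaffold))

# generic scaffold (CSK): every atom replaced by carbon, every bond by a single bond
csk = MurckoScaffold.MakeScaffoldGeneric(scaffold)
print(Chem.MolToSmiles(csk))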
Pros:
fast
popular
tests for structurally novel molecules
typically more challenging than random split
deterministic
Cons:
susceptible to small changes in scaffold
arguably not very realistic or challenging enough
differing scaffold definitions
not well-defined for disconnected molecules
only a single train-test split possible
[3]:
from skfp.model_selection import scaffold_train_test_split
smiles_train, smiles_test, y_train, y_test = scaffold_train_test_split(smiles_list, y)
pipeline.fit(smiles_train, y_train)
y_pred = pipeline.predict_proba(smiles_test)[:, 1]
auroc = roc_auc_score(y_test, y_pred)
print(f"AUROC scaffold split: {auroc:.2%}")
AUROC scaffold split: 78.25%
[4]:
from skfp.model_selection import scaffold_train_test_split
smiles_train, smiles_test, y_train, y_test = scaffold_train_test_split(
    smiles_list, y, use_csk=True
)
pipeline.fit(smiles_train, y_train)
y_pred = pipeline.predict_proba(smiles_test)[:, 1]
auroc = roc_auc_score(y_test, y_pred)
print(f"AUROC CSK scaffold split: {auroc:.2%}")
AUROC CSK scaffold split: 84.55%
Randomized scaffold split#
Scaffold split is fully deterministic, putting the smallest scaffold groups in the test set. However, we can also divide the scaffold groups randomly between the train and test sets, resulting in the randomized scaffold split. It is also known as the “balanced” scaffold split.
It can be run multiple times with different random_state values, allowing calculation of e.g. the standard deviation. We may also be interested in generalization to more popular scaffolds, rather than just the rarest ones.
However, the performance estimation is much more optimistic in this variant. This is because “simpler”, larger groups of scaffolds can easily dominate the test set. Furthermore, some authors unfortunately mix up this variant and the regular, more challenging scaffold split, e.g. in the GROVER paper, without any proper distinction. See Appendix G of the MOLTOP paper for a discussion. This can inflate the results quite a lot.
Pros:
can use many random initializations
Cons:
performance estimation can be too optimistic
often confused with scaffold split
[6]:
from skfp.model_selection import randomized_scaffold_train_test_split
smiles_train, smiles_test, y_train, y_test = randomized_scaffold_train_test_split(
    smiles_list, y, random_state=0
)
pipeline.fit(smiles_train, y_train)
y_pred = pipeline.predict_proba(smiles_test)[:, 1]
auroc = roc_auc_score(y_test, y_pred)
print(f"AUROC randomized scaffold split: {auroc:.2%}")
AUROC randomized scaffold split: 87.49%
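As mentioned above, we can repeat the randomized scaffold split with several random_state values and report the spread of the metric. A minimal sketch:
[ ]:
import numpy as np

# repeat the randomized scaffold split with different seeds
# and report the mean and standard deviation of AUROC
aurocs = []
for random_state in range(5):
    smiles_train, smiles_test, y_train, y_test = randomized_scaffold_train_test_split(
        smiles_list, y, random_state=random_state
    )
    pipeline.fit(smiles_train, y_train)
    y_pred = pipeline.predict_proba(smiles_test)[:, 1]
    aurocs.append(roc_auc_score(y_test, y_pred))

print(f"AUROC randomized scaffold split: {np.mean(aurocs):.2%} +/- {np.std(aurocs):.2%}")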
Butina split#
Butina split applies Taylor-Butina clustering to group similar molecules together, and then assigns the smallest clusters to the test set. Typically, ECFP4 fingerprints are used as features. As a density-based clustering, it can detect clusters of varied shapes and sizes. It typically results in a large number of small clusters, since it uses a Tanimoto similarity threshold to limit the maximal allowed dissimilarity within a cluster.
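The clustering step itself can be illustrated with the classic Taylor-Butina implementation from RDKit, computing pairwise Tanimoto distances between ECFP4 fingerprints and clustering with a fixed distance threshold. A rough sketch; the 0.35 threshold is an arbitrary choice for illustration, and scikit-fingerprints uses a faster implementation internally, as noted below:
[ ]:
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from rdkit.ML.Cluster import Butina

mols = [Chem.MolFromSmiles(smi) for smi in smiles_list]
fps = [AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048) for mol in mols]

# flattened lower triangle of the pairwise Tanimoto distance matrix
dists = []
for i in range(1, len(fps)):
    sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
    dists.extend(1 - s for s in sims)

# Taylor-Butina clustering: each cluster is a centroid plus all molecules
# within the distance threshold of it
clusters = Butina.ClusterData(dists, len(fps), 0.35, isDistData=True)
print(f"number of clusters: {len(clusters)}")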
The Butina split can be seen as a more flexible alternative to the scaffold split, using a more complex structural similarity measure (ECFP + Tanimoto, instead of identical Bemis-Murcko scaffolds). As such, it is often more realistic and challenging.
However, the computational cost is quite high, as it requires computing the full \(O(n^2)\) similarity matrix in the worst case. scikit-fingerprints uses the efficient Leader Clustering implementation from RDKit, but the scaling is still unfavorable for large datasets.
A tradeoff between accuracy and cost is the approximate solution, which uses the NNDescent algorithm to compute approximate nearest neighbors. It requires installing the PyNNDescent library.
Pros:
flexible
tests for structurally novel molecules
challenging out-of-distribution split
deterministic
approximate version available
Cons:
computationally expensive
only a single train-test split possible
[8]:
!pip install --quiet pynndescent
[7]:
from skfp.model_selection import butina_train_test_split
smiles_train, smiles_test, y_train, y_test = butina_train_test_split(smiles_list, y)
pipeline.fit(smiles_train, y_train)
y_pred = pipeline.predict_proba(smiles_test)[:, 1]
auroc = roc_auc_score(y_test, y_pred)
print(f"AUROC Butina split: {auroc:.2%}")
AUROC Butina split: 80.25%
[9]:
from skfp.model_selection import butina_train_test_split
smiles_train, smiles_test, y_train, y_test = butina_train_test_split(
    smiles_list, y, approximate=True
)
pipeline.fit(smiles_train, y_train)
y_pred = pipeline.predict_proba(smiles_test)[:, 1]
auroc = roc_auc_score(y_test, y_pred)
print(f"AUROC approximate Butina split: {auroc:.2%}")
AUROC approximate Butina split: 80.26%