Introduction to scikit-fingerprints
scikit-fingerprints is a scikit-learn compatible library for computing molecular fingerprints, with a focus on ease of use and efficiency. It’s also called skfp for short, similarly to sklearn. It is based on the hugely popular RDKit library.
We use the familiar scikit-learn interface, with classes implementing the .fit() and .transform() methods. This ease of use is particularly powerful when combined with our efficient, parallelized implementations of fingerprint algorithms.
Molecular fingerprints are algorithms for vectorizing molecules. They turn a molecular graph, made of atoms and bonds, into a feature vector. This vector can then be used with any typical ML algorithm for classification, regression, clustering etc.
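To make the idea of vectorization concrete, here is a minimal toy sketch that hashes substructure strings into a fixed-length bit vector. It is not the ECFP algorithm or anything implemented in scikit-fingerprints. The function name `toy_fingerprint`, the 16-bit length and the hand-written fragment list are all assumptions made for illustration.

```python
import hashlib


def toy_fingerprint(fragments, n_bits=16):
    """Hash each fragment string to one position in a fixed-length bit vector."""
    bits = [0] * n_bits
    for fragment in fragments:
        # A stable hash, so the same fragment always sets the same bit.
        digest = hashlib.sha256(fragment.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % n_bits
        bits[index] = 1
    return bits


# Hypothetical, hand-written fragments of ethanol (SMILES: CCO). A real
# fingerprint enumerates substructures from the molecular graph instead.
ethanol_fragments = ["C", "O", "CC", "CO", "CCO"]
vector = toy_fingerprint(ethanol_fragments)
print(vector)
```

Different fragments can hash to the same position, so the number of set bits may be smaller than the number of fragments. Real hashed fingerprints share this collision behavior, which is one reason their length matters.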
Practical introduction
A typical ML task on molecules is molecular property prediction, which is essentially molecular graph classification or regression. It’s also known as QSAR (quantitative structure-activity relationship) or, more accurately, QSPR (quantitative structure-property relationship) modeling.
Molecules are typically stored in the SMILES text format, along with labels for prediction. RDKit reads them as Mol objects, and then scikit-fingerprints computes fingerprints for them. By computing fingerprints, we turn the problem of molecular graph classification into tabular classification.
So a simple workflow looks like this:
Store SMILES and labels in CSV file
Read them and transform into RDKit Mol objects
Split into training and testing subsets
Compute molecular fingerprint for each molecule
Use the resulting tabular dataset for classification
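The first two steps of this workflow can be sketched with the standard library alone. The CSV contents, the column names `SMILES` and `label`, and the helper `read_smiles_and_labels` are hypothetical. Your own file will likely use different names.

```python
import csv
import io

# Hypothetical CSV contents, kept in memory so the sketch is self-contained.
csv_text = """SMILES,label
CCO,0
c1ccccc1,1
CC(=O)O,0
"""


def read_smiles_and_labels(file_obj):
    """Read SMILES strings and integer labels from an open CSV file."""
    reader = csv.DictReader(file_obj)
    smiles, labels = [], []
    for row in reader:
        smiles.append(row["SMILES"])
        labels.append(int(row["label"]))
    return smiles, labels


toy_smiles, toy_labels = read_smiles_and_labels(io.StringIO(csv_text))
print(toy_smiles)
print(toy_labels)
```

The resulting list of SMILES strings is what gets converted to RDKit Mol objects in the next step, for example with MolFromSmilesTransformer.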
Let’s see an example with the well-known beta-secretase 1 (BACE) dataset, where we predict whether a drug inhibits the beta-secretase 1 enzyme, suspected to influence the development of Alzheimer’s disease. It is a part of the popular MoleculeNet benchmark. It’s integrated into scikit-fingerprints, so we can download and load the data with a single function.
For the train-test split, we’ll use a scaffold split, which splits the molecules by their internal structure, known as the Bemis-Murcko scaffold. This makes test molecules quite different from training ones, limiting data leakage.
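The grouping logic behind a scaffold split can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation in scikit-fingerprints: scaffolds are given as precomputed strings (in practice they are derived from the molecules), the smallest scaffold groups are assigned to the test set, and `toy_scaffold_split` is a hypothetical name.

```python
from collections import defaultdict


def toy_scaffold_split(scaffolds, test_size=0.2):
    """Return (train_idx, test_idx) so that no scaffold appears in both subsets."""
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Smallest groups first, so that rare scaffolds end up in the test set.
    ordered = sorted(groups.values(), key=len)
    n_test_target = int(test_size * len(scaffolds))

    train_idx, test_idx = [], []
    for group in ordered:
        # A whole group goes to one side; groups are never divided.
        if len(test_idx) + len(group) <= n_test_target:
            test_idx.extend(group)
        else:
            train_idx.extend(group)
    return sorted(train_idx), sorted(test_idx)


# Hypothetical scaffold SMILES for ten molecules.
scaffolds = ["c1ccccc1"] * 5 + ["c1ccncc1"] * 3 + ["C1CCCCC1", "c1ccoc1"]
train_idx, test_idx = toy_scaffold_split(scaffolds, test_size=0.2)
print(train_idx)  # [0, 1, 2, 3, 4, 5, 6, 7]
print(test_idx)   # [8, 9]
```

Because whole groups are assigned to one side, the test set only contains scaffolds that the model never saw during training.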
We compute the popular Extended Connectivity Fingerprint (ECFP), also known as the Morgan fingerprint. By default, it uses radius 2 (diameter 4, which is why this variant is called ECFP4) and 2048 bits (dimensions). Then, we train a Random Forest classifier on those features, and evaluate it using AUROC (Area Under Receiver Operating Characteristic curve).
All of these elements are described in the scikit-fingerprints documentation.
[1]:
from skfp.datasets.moleculenet import load_bace
from skfp.fingerprints import ECFPFingerprint
from skfp.model_selection import scaffold_train_test_split
from skfp.preprocessing import MolFromSmilesTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
smiles_list, y = load_bace()
mol_from_smiles = MolFromSmilesTransformer()
mols = mol_from_smiles.transform(smiles_list)
mols_train, mols_test, y_train, y_test = scaffold_train_test_split(mols, y, test_size=0.2)
# there's no need to call .fit() on fingerprints, they have no learnable weights
ecfp_fp = ECFPFingerprint()
X_train = ecfp_fp.transform(mols_train)
X_test = ecfp_fp.transform(mols_test)
clf = RandomForestClassifier(random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict_proba(X_test)[:, 1]
auroc = roc_auc_score(y_test, y_pred)
print(f"AUROC: {auroc:.2%}")
AUROC: 78.25%
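To make the AUROC value above more tangible: it can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it by brute-force pair counting on made-up scores. The function `toy_auroc` is a hypothetical helper for illustration; in practice roc_auc_score from scikit-learn should be used, as in the code above.

```python
def toy_auroc(y_true, y_score):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]

    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Made-up labels and predicted probabilities of the positive class.
print(toy_auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

Here three of the four positive-negative pairs are ordered correctly, which gives 0.75. A value of 0.5 corresponds to random ranking and 1.0 to a perfect one.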
Step-by-step analysis
Let’s analyze the elements of this code more closely.
Dataset loader functions by default return a list of SMILES strings and the labels as a NumPy array. This is a simple binary classification task, so we get a vector of 0s and 1s.
[2]:
smiles_list, y = load_bace()
print("SMILES:")
print(smiles_list[:3])
print()
print("Labels:")
print(y[:3])
SMILES:
['O1CC[C@@H](NC(=O)[C@@H](Cc2cc3cc(ccc3nc2N)-c2ccccc2C)C)CC1(C)C', 'Fc1cc(cc(F)c1)C[C@H](NC(=O)[C@@H](N1CC[C@](NC(=O)C)(CC(C)C)C1=O)CCc1ccccc1)[C@H](O)[C@@H]1[NH2+]C[C@H](OCCC)C1', 'S1(=O)(=O)N(c2cc(cc3c2n(cc3CC)CC1)C(=O)N[C@H]([C@H](O)C[NH2+]Cc1cc(OC)ccc1)Cc1ccccc1)C']
Labels:
[1 1 1]
RDKit Mol objects are the basic molecular graph representation, and we compute the fingerprints from them.
[3]:
print("Molecules:")
print(mols[:3])
Molecules:
[<rdkit.Chem.rdchem.Mol object at 0x70f9d576e040>, <rdkit.Chem.rdchem.Mol object at 0x70f9d575f350>, <rdkit.Chem.rdchem.Mol object at 0x70f9d575f3c0>]
Fingerprints are by default binary NumPy arrays. They are typically long, with some (e.g. ECFP) having the length as a hyperparameter.
[4]:
print("ECFP fingerprints:")
print(X_train.shape)
print(X_train[:3])
ECFP fingerprints:
(1210, 2048)
[[0 1 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]]
From this point, the problem is just like any other tabular classification in scikit-learn.