Welcome to scikit-fingerprints documentation!#
A scikit-learn compatible library for efficient computation of molecular fingerprints.
Main features:
scikit-learn compatible
feature-rich, with >30 fingerprints
parallelization
sparse matrix support
commercial-friendly MIT license
Easiest way to get started is to install it with pip:
pip install scikit-fingerprints
Then, see the quickstart below, or go for more detailed Examples.
Quickstart#
Most fingerprints are based on molecular graphs (topological, 2D-based), and you can use SMILES input directly:
from skfp.fingerprints import AtomPairFingerprint
smiles_list = ["O=S(=O)(O)CCS(=O)(=O)O", "O=C(O)c1ccccc1O"]
atom_pair_fingerprint = AtomPairFingerprint()
X = atom_pair_fingerprint.transform(smiles_list)
print(X)
For fingerprints using conformers (conformational, 3D-based), you need to create molecules first
and compute conformers. Those fingerprints have requires_conformers
attribute set
to True
.
from skfp.preprocessing import ConformerGenerator, MolFromSmilesTransformer
from skfp.fingerprints import WHIMFingerprint
smiles_list = ["O=S(=O)(O)CCS(=O)(=O)O", "O=C(O)c1ccccc1O"]
mol_from_smiles = MolFromSmilesTransformer()
conf_gen = ConformerGenerator()
fp = WHIMFingerprint()
print(fp.requires_conformers) # True
mols_list = mol_from_smiles.transform(smiles_list)
mols_list = conf_gen.transform(mols_list)
X = fp.transform(mols_list)
print(X)
You can also use scikit-learn functionalities like pipelines, feature unions etc. to build complex workflows. Popular datasets, e.g. from MoleculeNet benchmark, can be loaded directly.
from skfp.datasets.moleculenet import load_clintox
from skfp.metrics import multioutput_auroc_score, extract_pos_proba
from skfp.model_selection import scaffold_train_test_split
from skfp.fingerprints import ECFPFingerprint, MACCSFingerprint
from skfp.preprocessing import MolFromSmilesTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline, make_union
smiles, y = load_clintox()
smiles_train, smiles_test, y_train, y_test = scaffold_train_test_split(
smiles, y, test_size=0.2
)
pipeline = make_pipeline(
MolFromSmilesTransformer(),
make_union(ECFPFingerprint(count=True), MACCSFingerprint()),
RandomForestClassifier(random_state=0),
)
pipeline.fit(smiles_train, y_train)
y_pred_proba = pipeline.predict_proba(smiles_test)
y_pred_proba = extract_pos_proba(y_pred_proba)
auroc = multioutput_auroc_score(y_test, y_pred_proba)
print(f"AUROC: {auroc:.2%}")