Similarity and distance metrics#

scikit-fingerprints implements multiple ways to measure similarity or distance between molecules, particularly between their fingerprints. Those similarity measures and distance metrics can be used e.g. in searching, clustering, dimentionality reduction, kNN classification, and more.

Similarities and metrics overview#

In general, we can divide those measures into two groups, depending on their input:

Working on molecular fingerprints (vectorized molecules). Different metrics and metric variants are used for binary and count fingerprints.
Using molecules (RDKit Mol objects) directly.

Most functions are naturally defined as similarities - the higher, the more similar two molecules are. Most similarity functions have a bounded value range, typically [0,1]. Every similarity also has a corresponding distance function implemented, usually equal to 1 - similarity.

Additionally, for batch computation, e.g. pairwise similarity measurements, every metric also has a bulk function, which works on whole matrices (or lists of molecules).

Fingerprint similarities#

Similarities and distances working on fingerprint vectors can be defined for binary or count fingerprints. Most similarities, however, can only be defined for binary fingerprints. This distinction is visible in function names, e.g. tanimoto_binary_similarity vs tanimoto_count_similarity. Most similarity functions have bounded value range, typically \([0, 1]\).

There are two major groups of similarities:

only considering “on” bits, i.e. 1s in vector representations
including both “off” and “on” bits

For two binary vectors x and y, we can define four values, which are used to compute metrics:

\(a\) – \(|x \cap y|\), the number of common “on” bits
\(b\) – \(|x \cap \bar{y}|\), the number of positions where \(x\) is 1 and \(y\) is 0
\(c\) – \(|\bar{x} \cap y|\), the number of positions where \(x\) is 0 and \(y\) is 1
\(d\) – \(|\bar{x} \cap \bar{y}|\), the number of positions where both are 0, the number of common “off” bits

We can also mark \(|x|\) as total number of “on” bits (1s) in the \(x\) vector.

Tanimoto binary similarity is the most commonly used similarity measure, defined as:

\[sim(x, y) = \frac{|x \cap y|}{|x \cup y|} = \frac{|x \cap y|}{|x| + |y| - |x \cap y|} = \frac{a}{a + b + c}\]

Tanimoto count similarity is an extension to count vectors, utilizing dot product as a measure of “common bits”, and vector length instead of just sum of 1s. The larger the dot product, the more similar the two vectors are. It is defined as:

\[sim(x, y) = \frac{x \cdot y}{\|x\|^2 + \|y\|^2 - x \cdot y}\]

Both variants of Tanimoto similarity use only “on” bits. Rogot-Goldberg similarity is an example of similarity that includes “off” bits information. It works only for binary vectors. It is defined as:

\[sim(x, y) = \frac{a}{2 \times (2a + b + c)} + \frac{d}{2 \times (2d + b + c)}\]

Tanimoto similarity in both versions and Rogot-Goldberg similarity have values in range \([0, 1]\).

Let’s see some code examples of those similarities. Functions take two vectors and output similarity/distance value as a float. NumPy arrays shoule be 1-dimensional vectors, i.e. rows of array after computing fingerprints.

[1]:

import numpy as np

from skfp.distances import (
    rogot_goldberg_binary_distance,
    rogot_goldberg_binary_similarity,
    tanimoto_binary_distance,
    tanimoto_binary_similarity,
)

vec_a = [1, 1, 0, 0]
vec_b = [0, 1, 1, 0]

vec_a_dense = np.array(vec_a)
vec_b_dense = np.array(vec_b)

tanimoto_sim = tanimoto_binary_similarity(vec_a_dense, vec_b_dense)
tanimoto_dist = tanimoto_binary_distance(vec_a_dense, vec_b_dense)

rogot_sim = rogot_goldberg_binary_similarity(vec_a_dense, vec_b_dense)
rogot_dist = rogot_goldberg_binary_distance(vec_a_dense, vec_b_dense)

print(f"Tanimoto similarity: {tanimoto_sim:.2f}")
print(f"Rogot-Goldberg similarity: {rogot_sim:.2f}")
print()
print(f"Tanimoto distance: {tanimoto_dist:.2f}")
print(f"Rogot-Goldberg similarity: {rogot_dist:.2f}")

Tanimoto similarity: 0.33
Rogot-Goldberg similarity: 0.50

Tanimoto distance: 0.67
Rogot-Goldberg similarity: 0.50

For sparse data, SciPy sparse arrays should be in CSR format and have a single row with values. Everything else works exactly the same.

[2]:

from scipy.sparse import csr_array

vec_a_sparse = csr_array([vec_a])
vec_b_sparse = csr_array([vec_b])

tanimoto_sim_sparse = tanimoto_binary_similarity(vec_a_sparse, vec_b_sparse)
rogot_sim_sparse = rogot_goldberg_binary_similarity(vec_a_sparse, vec_b_sparse)

print(f"Tanimoto similarity: {tanimoto_sim_sparse:.2f}")
print(f"Rogot-Goldberg similarity: {rogot_sim_sparse:.2f}")

Tanimoto similarity: 0.33
Rogot-Goldberg similarity: 0.50

Count variants work exactly the same way as binary ones.

[3]:

import numpy as np

from skfp.distances import tanimoto_count_distance, tanimoto_count_similarity

vec_a = [2, 3, 4, 0]
vec_b = [2, 3, 4, 2]

vec_a_numpy = np.array(vec_a)
vec_b_numpy = np.array(vec_b)

count_sim_numpy = tanimoto_count_similarity(vec_a_numpy, vec_b_numpy)
count_dist_numpy = tanimoto_count_distance(vec_a_numpy, vec_b_numpy)

print(f"Tanimoto count similarity: {count_sim_numpy:.2f}")
print(f"Tanimoto count distance: {count_dist_numpy:.2f}")

Tanimoto count similarity: 0.88
Tanimoto count distance: 0.12

Molecule similarities#

Some similarities include molecule structure directly in their calculation. Sometimes they can be more flexible, as they don’t lose any information during the fingerprint calculation step.

Fraggle similarity is designed to be less sensitive to small changes in the middle of molecule, compared to fingerprint-based measures. It looks more on the “overall shape similarity” of molecules. Its calculation consists of a few steps:

fragment molecule into “interesting” substructures by acyclic and ring cuts, leaving only “large” parts of the molecule (>60%)
compare fragments with Tversky similarity, keep only appropriately similar ones
compute RDKit fingerprints with path length 5, compare with Tanimoto similarity
largest Tanimoto similarity is the Fraggle similarity value

This measure is asymmetric, i.e. sim(mol_a, mol_b) can be potentially quite different from sim(mol_b, mol_a). Its value range is \([0, 1]\).

Maximum Common Substructure (MCS) similarity checks the size of the maximum common substructure (MCS) between two molecules as their structural overlap, with the formula:

\[sim(mol_a, mol_b) = \frac{numAtoms(MCS(mol_a, mol_b))}{numAtoms(mol_a) + numAtoms(mol_b) - numAtoms(MCS(mol_a, mol_b))}\]

It also penalizes difference in molecule sizes. Its value range is \([0, 1]\).

Let’s jump into the code. Here, we use RDKit Mol objects, rather than fingerprints. We can see that Fraggle similarity is indeed asymmetric.

[4]:

from rdkit.Chem import MolFromSmiles

from skfp.distances import fraggle_similarity

mol_query = MolFromSmiles("COc1cc(CN2CCC(NC(=O)c3cncc(C)c3)CC2)c(OC)c2ccccc12")
mol_ref = MolFromSmiles("COc1ccccc1")

fraggle_sim = fraggle_similarity(mol_query, mol_ref)
fraggle_sim_reverse = fraggle_similarity(mol_ref, mol_query)

print(f"Fraggle similarity query-reference: {fraggle_sim:.2f}")
print(f"Fraggle similarity reference-query: {fraggle_sim_reverse:.2f}")

Fraggle similarity query-reference: 0.16
Fraggle similarity reference-query: 0.26

MCS similarity is used identically, but it is symmetric.

[5]:

from skfp.distances import mcs_similarity

mcs_sim = mcs_similarity(mol_query, mol_ref)
mcs_sim_reverse = mcs_similarity(mol_ref, mol_query)

print(f"MCS similarity query-reference: {mcs_sim:.2f}")
print(f"MCS similarity reference-query: {mcs_sim_reverse:.2f}")

MCS similarity query-reference: 0.26
MCS similarity reference-query: 0.26

Scikit-learn compatibility#

scikit-fingerprints is designed and tested to be fully compatible with scikit-learn. As such, you can use similarity metrics in your ML pipelines, e.g. for k nearest neighbors classifier. Same mechanism would also work for density-based clustering like DBSCAN.

Let’s see how this works using a kNN classifier on BACE dataset. Note that in scikit-learn, the interface expects distances, not similarities, so you have to use an appropriate function.

[6]:

from sklearn.metrics import roc_auc_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

from skfp.datasets.moleculenet import load_bace
from skfp.fingerprints import ECFPFingerprint
from skfp.model_selection import scaffold_train_test_split

smiles, y = load_bace()
smiles_train, smiles_test, y_train, y_test = scaffold_train_test_split(
    smiles, y, test_size=0.2
)

pipeline = make_pipeline(
    ECFPFingerprint(), KNeighborsClassifier(metric=tanimoto_binary_distance)
)
pipeline.fit(smiles_train, y_train)

y_pred = pipeline.predict(smiles_test)
auroc = roc_auc_score(y_test, y_pred)

print(f"AUROC: {auroc:.2%}")

AUROC: 74.62%

If we use count fingerprint, then we should also switch the metric appropriately. Everything else stays the same.

[7]:

pipeline = make_pipeline(
    ECFPFingerprint(count=True), KNeighborsClassifier(metric=tanimoto_count_distance)
)
pipeline.fit(smiles_train, y_train)

y_pred = pipeline.predict(smiles_test)
auroc = roc_auc_score(y_test, y_pred)

print(f"AUROC: {auroc:.2%}")

AUROC: 71.21%

Bulk similarity computation#

If you need to quickly compute similarity or distance between many vectors, you can use bulk variant of a function. Bulk variants are equivalent to scikit-learn’s pairwise distances, but are much faster, thanks to Numba JIT and optimized representation. Note that they do not support sparse arrays due to Numba limitations.

In the example below, the similarity is computed between i-th rows and j-th columns of both arrays.

[8]:

from skfp.distances.tanimoto import (
    bulk_tanimoto_binary_similarity,
)

arr_1 = np.array(
    [
        [1, 1, 1],
        [0, 0, 1],
        [1, 1, 1],
    ]
)

arr_2 = np.array(
    [
        [1, 0, 1],
        [0, 1, 1],
        [1, 1, 1],
    ]
)

sim = bulk_tanimoto_binary_similarity(arr_1, arr_2)
sim

[8]:

array([[0.66666667, 0.66666667, 1.        ],
       [0.5       , 0.5       , 0.33333333],
       [0.66666667, 0.66666667, 1.        ]])

If we pass a single array, then the similarities will be computed between its rows. This is useful for self-similarity computation, e.g. evaluation chemical diversity of a dataset.

[9]:

X = np.array(
    [
        [1, 1, 1],
        [0, 0, 0],
        [1, 1, 1],
    ]
)

sim = bulk_tanimoto_binary_similarity(X)
sim

[9]:

array([[1., 0., 1.],
       [0., 1., 0.],
       [1., 0., 1.]])

Scikit-fingerprints acceleration#

Let’s see how much scikit-fingerprints with Numba JIT and other optimizations speeds things up. We will compare with a manual nested loop, used by many older projects.

[10]:

from skfp.fingerprints import ECFPFingerprint

mols_list = [
    "CC(C)CC1=CC=C(C=C1)C(C)C(=O)O",  # Ibuprofen
    "CN1C=NC2=C1C(=O)N(C(=O)N2C)C",  # caffeine
    "c1ncccc1[C@@H]2CCCN2C",  # nicotine
    "C1CC1N2C=C(C(=O)C3=CC(=C(C=C32)N4CCNCC4)F)C(=O)O",  # Ciprofloxacin
    "CC(=O)CC(C1=CC=CC=C1)C2=C(C3=CC=CC=C3OC2=O)O",  # Warfarin
    "CC(=O)Nc1ccc(O)cc1",  # Paracetamol
    "CCC[C@@H](C(=O)C(=O)NC1CC1)NC(=O)[C@@H]2[C@H]3CCC[C@H]"  # Telaprevir
    "3CN2C(=O)[C@H](C(C)(C)C)NC(=O)[C@H](C4CCCCC4)NC(=O)c5cnccn5",  # Atorvastatin
    "O=C(O)C[C@H](O)C[C@H](O)CCn2c(c(c(c2c1ccc(F)cc1)c3ccccc3)C(=O)Nc4ccccc4)C(C)C",  # Telmisartan
    "CS(=O)(=O)CCNCc1ccc(o1)c2ccc3c(c2)c(ncn3)Nc4ccc(c(c4)Cl)OCc5cccc(c5)F",  # Lapatinib
    "O=C(N)C(C)(C)CNC(=O)[C@H](C(C)C)C[C@H](O)[C@@H](N)C[C@@H](C(C)C)Cc1cc(OCCCOC)c(OC)cc1",  # Aliskiren
    "C=12CCC=3C=C(C=C(C3[C@H](C1N=CC(=C2)Br)C4CCN(CC4)C(=O)CC5CCN(CC5)C(N)=O)Br)Cl",  # Ergotamin
    # Rinfampin
    r"CN1CCN(CC1)/N=C/c2c(O)c3c5C(=O)[C@@]4(C)O/C=C/[C@H](OC)[C@@H](C)[C@@H](OC(C)=O)[C@H](C)[C@H](O)[C@H](C)[C@@H](O)[C@@H](C)\C=C\C=C(\C)C(=O)Nc2c(O)c3c(O)c(C)c5O4",
    # Probucol
    "S(c1cc(c(O)c(c1)C(C)(C)C)C(C)(C)C)C(Sc2cc(c(O)c(c2)C(C)(C)C)C(C)(C)C)(C)C",
    # Kanamycin
    "O([C@@H]2[C@@H](O)[C@H](O[C@H]1O[C@H](CN)[C@@H](O)[C@H](O)[C@H]1O)[C@@H](N)C[C@H]2N)[C@H]3O[C@@H]([C@@H](O)[C@H](N)[C@H]3O)CO",
]

fp = ECFPFingerprint(count=True)
fps = fp.transform(mols_list)

[11]:

%timeit -r 3 -n 10 [tanimoto_count_similarity(fps[i], fps[j]) for i in range(len(fps)) for j in range(len(fps))]

4.72 ms ± 148 μs per loop (mean ± std. dev. of 3 runs, 10 loops each)

[12]:

%timeit -r 3 -n 10 [bulk_tanimoto_count_similarity(fps, fps)]

The slowest run took 315.64 times longer than the fastest. This could mean that an intermediate result is being cached.
13.9 ms ± 19.4 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)

Speedup is visible even on such a small sample. For hundreds or thousands of compounds the time performance can be improved by several orders of magnitude.