fraggle_similarity#

skfp.distances.fraggle_similarity(mol_query: Mol, mol_ref: Mol, tversky_threshold: float = 0.8) float#

Fraggle similarity between molecules.

Computes the Fraggle similarity [1] [2] between two RDKit Mol objects. It is a hybrid between substructure and similarity search, using a “fuzzy” matching based on molecule fragments and Tanimoto similarity of path-based fingerprints.

It is designed to properly recognize small changes in the “middle” of a molecule, where typical fingerprint-based measures would result in a very low similarity.

This is an asymmetric measure, with query and reference molecules. It consists of 3 phases: query fragmentation, fragment-reference Tversky similarity comparison, and Tanimoto similarity comparison using RDKit fingerprints.

Query molecule is fragmented into “interesting” substructures by acyclic and ring cuts, leaving only “large” parts of a molecule (>60%). Fragements are then compared with the reference molecule using Tversky similarity [3] (alpha=0.95, beta=0.05), keeping those with at least tversky_threshold similarity. Lastly, the Tanimoto similarity of RDKit fingerprints with path length 5 is computed for kept fragments and the reference molecule. The highest one is the Fraggle similarity value.

The calculated similarity falls within the range \([0, 1]\). Passing all-zero vectors to this function results in a similarity of 0. Note that this measure is asymmetric, and order of query and reference molecules matters.

Parameters:
  • mol_query (RDKit Mol object) – Query molecule.

  • mol_ref (RDKit Mol object) – Reference molecule.

  • tversky_threshold (float, default=0.8) – Required minimal Tversky similarity [3] between a fragment and the reference molecule.

Returns:

similarity – Fraggle similarity between mol_query and mol_ref.

Return type:

float

References

Examples

>>> from rdkit.Chem import MolFromSmiles
>>> from skfp.distances import fraggle_similarity
>>> mol_query = MolFromSmiles("COc1cc(CN2CCC(NC(=O)c3cncc(C)c3)CC2)c(OC)c2ccccc12")
>>> mol_ref = MolFromSmiles("COc1ccccc1")
>>> sim = fraggle_similarity(mol_query, mol_ref)
>>> sim
0.1640625