Data Science for Drug Discovery Research: Morgan Fingerprints in Python

Darko Medin
Nov 2, 2022

Data Science for drug discovery is one of the most important areas of today’s life science and pharmaceutical research.

This tutorial covers both the theory and the code.

The full code is available on GitHub (https://github.com/DarkoMedin/Morgan-fingerprints-exemplar-project-in-Python), and the individual blocks of code are also explained in this article.

One of the best ways to engineer data for molecules is to make fingerprints out of them. In this tutorial the Morgan fingerprint will be used. This process belongs to the area of Data Science called feature engineering and is essential in cheminformatics and drug discovery.

Morgan fingerprints map the substructures found around each atom of a molecule within a chosen radius, measured in bonds.

The libraries used for this tutorial are numpy, pandas and rdkit. Make sure you have installed rdkit following its original installation guidelines and that you run the code in the environment created in the process. In addition to numpy and pandas, I will load the Chem, AllChem and Draw modules directly from rdkit to make the coding part more effective.

import pandas as pd
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import Draw

For the first example I will use a real-world biomolecule, the amino acid arginine, and fingerprint it using Morgan fingerprint principles. Arginine has 6 carbon atoms, an amino group and a carboxyl (-COOH) group on the alpha carbon, and a guanidino group at the end of the side chain, attached through a linker N atom. Its SMILES is 'C(C[C@@H](C(=O)O)N)CNC(=N)N'. Converting it to an rdkit molecule object takes very little code:

m = Chem.MolFromSmiles('C(C[C@@H](C(=O)O)N)CNC(=N)N')
m
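
Chem.MolFromSmiles() returns None instead of raising an error when it cannot parse a SMILES string, so a small check can save some confusion later. The sketch below is one possible way to do this; the helper name mol_from_smiles_checked is made up for this example and is not part of rdkit.

```python
from collections import Counter

from rdkit import Chem


def mol_from_smiles_checked(smiles):
    """Parse a SMILES string and fail loudly if RDKit cannot read it.

    Hypothetical helper written for this tutorial, not an RDKit function.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # RDKit signals a parse failure by returning None
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    return mol


arginine = mol_from_smiles_checked('C(C[C@@H](C(=O)O)N)CNC(=N)N')

# Count heavy atoms by element (hydrogens are implicit in this SMILES)
element_counts = Counter(atom.GetSymbol() for atom in arginine.GetAtoms())
print(dict(element_counts))
```

Counting the elements this way is a quick sanity check that the SMILES describes the molecule you expect.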

To store information about the substructure behind each fingerprint bit, I will first create a simple empty dictionary called bit. Then the GetMorganFingerprintAsBitVect() function is used to create the Morgan fingerprint. I will specify the radius, which is the maximum number of bonds from a central atom that a single mapped substructure can span, and nBits, the length of the bit vector into which the substructures are hashed.

# bit is filled in by rdkit: bit ID -> ((central atom index, radius), ...)
bit = {}
morganfp = AllChem.GetMorganFingerprintAsBitVect(m, useChirality=True, radius=5, nBits=1024, bitInfo=bit)
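
To get a feel for the two parameters, the sketch below counts how many bits are set for a few values of radius and nBits. It keeps the same function used in this article (newer rdkit releases may print a deprecation notice for it and point to rdFingerprintGenerator instead). The helper name count_on_bits and the particular values tried are only illustrative choices.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles('C(C[C@@H](C(=O)O)N)CNC(=N)N')  # arginine


def count_on_bits(mol, radius, n_bits):
    """Return how many bits are set for a given radius and vector length.

    Hypothetical helper for illustration only.
    """
    fp = AllChem.GetMorganFingerprintAsBitVect(
        mol, radius, nBits=n_bits, useChirality=True)
    return fp.GetNumOnBits()


# A larger radius adds larger atom environments on top of the smaller ones
for radius in (0, 1, 2, 5):
    print('radius', radius, '->', count_on_bits(mol, radius, 2048), 'bits set')

# A shorter vector makes it more likely that two environments share a position
for n_bits in (64, 1024, 2048):
    print('nBits', n_bits, '->', count_on_bits(mol, 5, n_bits), 'bits set')
```

If the number of set bits drops noticeably when nBits is reduced, different substructures are colliding on the same position, which is the usual reason to pick a longer vector.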

Check that the morganfp object has been created:

morganfp
Out: <rdkit.DataStructs.cDataStructs.ExplicitBitVect at 0x23cb1270f40>

The morganfp object is then converted to a numpy array for ease of further use. Remember, the GetMorganFingerprintAsBitVect() function creates the fingerprint as a bit vector, meaning the resulting vector is composed of 0s and 1s. The 1s represent the presence of a certain molecular substructure, while the 0s represent its absence. The np.nonzero() function can then be used to identify the positions of the bits that are set.

mfpvector = np.array(morganfp)
print(np.nonzero(mfpvector))
Out:(array([ 1, 3, 80, 110, 128, 139, 140, 147, 197, 330, 353,
389, 427, 553, 565, 623, 650, 667, 708, 739, 752, 786,
807, 820, 852, 884, 887, 893, 894, 983, 1002, 1018, 1020],
dtype=int64),)

It can be seen that 33 bits are set in the fingerprint, and their positions in the array are listed in the output above.
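
The count can be checked without running rdkit at all. The sketch below rebuilds the 1024-long vector from the positions printed above (copied by hand from that output) and counts the set bits with numpy; the variable names are just for this example.

```python
import numpy as np

# On-bit positions copied from the np.nonzero output printed above
on_positions = [
    1, 3, 80, 110, 128, 139, 140, 147, 197, 330, 353,
    389, 427, 553, 565, 623, 650, 667, 708, 739, 752, 786,
    807, 820, 852, 884, 887, 893, 894, 983, 1002, 1018, 1020,
]

vector = np.zeros(1024, dtype=np.uint8)
vector[on_positions] = 1

# np.nonzero returns a tuple with one index array per axis, hence the [0]
positions = np.nonzero(vector)[0]
n_on = positions.size
print(n_on, 'of', vector.size, 'bits are set')
```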

Using the numbers that represent the positions of the set bits in the array, together with the previously created bit dictionary, these structures can now be visualized with the Draw.DrawMorganBit() function. I will focus on only a few for the purposes of this tutorial. The first one is bit 110.

Draw.DrawMorganBit(m,110, bit)

The fingerprint bit at position 1020 is more complex and contains multiple bonds and atomic groups.

Draw.DrawMorganBit(m,1020, bit)

The fingerprint bit at position 1002 includes both N atoms and multiple central bonds.

Draw.DrawMorganBit(m,1002, bit)
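
Besides drawing, the bit dictionary can also be read as text. Each key is a bit ID and each value is a tuple of (central atom index, radius) pairs. The sketch below prints a few entries together with a SMILES string for the corresponding atom environment, using rdkit's FindAtomEnvironmentOfRadiusN and PathToSubmol. The helper environment_smiles is a hypothetical name for this example, and the fragment SMILES it prints are only a rough textual description of each environment.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles('C(C[C@@H](C(=O)O)N)CNC(=N)N')  # arginine
bit_info = {}
AllChem.GetMorganFingerprintAsBitVect(
    mol, 5, nBits=1024, useChirality=True, bitInfo=bit_info)


def environment_smiles(mol, atom_idx, radius):
    """Return a SMILES string for the atom environment behind one bit.

    Hypothetical helper for illustration only.
    """
    # Bond indices within `radius` bonds of the central atom
    bond_ids = list(Chem.FindAtomEnvironmentOfRadiusN(mol, radius, atom_idx))
    if radius == 0 or not bond_ids:
        # With no bonds, the environment is just the central atom itself
        return mol.GetAtomWithIdx(atom_idx).GetSymbol()
    submol = Chem.PathToSubmol(mol, bond_ids)
    return Chem.MolToSmiles(submol)


# Show the first occurrence of the first few set bits
summary = {}
for bit_id in sorted(bit_info)[:5]:
    atom_idx, radius = bit_info[bit_id][0]
    summary[bit_id] = environment_smiles(mol, atom_idx, radius)
    print(bit_id, (atom_idx, radius), summary[bit_id])
```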

Now I will create the rdkit molecular structure for a somewhat more complex molecule, testosterone, a steroid with four rings fused together. This is the SMILES of testosterone:

[H][C@@]12CCC3=CC(=O)CC[C@]3(C)[C@@]1([H])CC[C@]1(C)[C@@H](O)CC[C@@]21[H]

m2 = Chem.MolFromSmiles('[H][C@@]12CCC3=CC(=O)CC[C@]3(C)[C@@]1([H])CC[C@]1(C)[C@@H](O)CC[C@@]21[H]')
m2

The next step is to repeat the process of creating the Morgan fingerprint object, this time for testosterone. The fingerprint object is then again converted to a numpy array.

# Start from a fresh dictionary so it only holds testosterone's bits
bit = {}
morganfp2 = AllChem.GetMorganFingerprintAsBitVect(m2, useChirality=True, radius=2, nBits=1024, bitInfo=bit)
mfpvector2 = np.array(morganfp2)
print(np.nonzero(mfpvector2))
Out:
(array([ 33, 36, 39, 40, 60, 84, 129, 138, 182, 193, 225,
233, 242, 246, 250, 262, 301, 314, 356, 425, 479, 504,
507, 546, 561, 650, 714, 759, 790, 807, 837, 841, 849,
873, 926, 929, 953, 975, 1019], dtype=int64),)

Exactly 39 bits are set in this fingerprint, and their positions in the numpy array are presented in the output above.
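
Since every fingerprint has the same length, the vectors of several molecules can be stacked into one table, which is the usual shape for a machine learning dataset. The sketch below is one possible way to do that with pandas. The helper name morgan_row, the dictionary smiles_by_name, and the shared settings radius=2 and nBits=1024 are assumptions made for this example.

```python
import numpy as np
import pandas as pd
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

smiles_by_name = {
    'arginine': 'C(C[C@@H](C(=O)O)N)CNC(=N)N',
    'testosterone': '[H][C@@]12CCC3=CC(=O)CC[C@]3(C)[C@@]1([H])CC[C@]1(C)[C@@H](O)CC[C@@]21[H]',
}


def morgan_row(smiles, radius=2, n_bits=1024):
    """Return the Morgan bit vector of one molecule as a 0/1 numpy array.

    Hypothetical helper for illustration only.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f'Could not parse SMILES: {smiles!r}')
    fp = AllChem.GetMorganFingerprintAsBitVect(
        mol, radius, nBits=n_bits, useChirality=True)
    row = np.zeros((n_bits,), dtype=np.int8)
    DataStructs.ConvertToNumpyArray(fp, row)  # fills `row` in place
    return row


# One row per molecule, one column per fingerprint bit
features = pd.DataFrame(
    [morgan_row(s) for s in smiles_by_name.values()],
    index=list(smiles_by_name),
)
print(features.shape)
print(features.sum(axis=1))  # number of bits set per molecule
```

Using the same radius and nBits for every molecule matters here, because a column only means the same thing across rows when all fingerprints were built with identical settings.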

Now the Morgan fingerprint based on bit 2 will be plotted and examined visually.

Draw.DrawMorganBit(m2,2, bit)
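
Draw.DrawMorganBit() looks the bit ID up in the bit dictionary, so it can only draw bits that are actually set for the molecule; any other ID raises a KeyError. Which IDs are set depends on the radius, nBits and chirality settings, so it is worth checking before drawing. The sketch below shows one way to guard the call; draw_bit_if_set is a hypothetical helper name, not an rdkit function.

```python
from rdkit import Chem
from rdkit.Chem import AllChem, Draw

testosterone = Chem.MolFromSmiles(
    '[H][C@@]12CCC3=CC(=O)CC[C@]3(C)[C@@]1([H])CC[C@]1(C)[C@@H](O)CC[C@@]21[H]')
bit_info = {}
AllChem.GetMorganFingerprintAsBitVect(
    testosterone, 2, nBits=1024, useChirality=True, bitInfo=bit_info)


def draw_bit_if_set(mol, bit_id, bit_info):
    """Draw a bit's environment, or return None when the bit is not set.

    Hypothetical helper for illustration only.
    """
    if bit_id not in bit_info:
        print(f'bit {bit_id} is not set; set bits start with {sorted(bit_info)[:5]}')
        return None
    return Draw.DrawMorganBit(mol, bit_id, bit_info)


first_set_bit = min(bit_info)  # any key of bit_info is a valid choice
image = draw_bit_if_set(testosterone, first_set_bit, bit_info)
missing = draw_bit_if_set(testosterone, -1, bit_info)  # -1 is never a bit ID
```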

The Morgan fingerprint based on bit 5 contains structures from three of the rings. It can be seen that the fingerprints are much more complex in testosterone than in arginine, which is logical: the molecule itself is much more complex and as such allows for more complex structures within the radius set in the call above.

Draw.DrawMorganBit(m2,5, bit)

The fingerprint bit at position 6 is smaller and contains part of a ring, but it is bound to the oxygen atom.

Draw.DrawMorganBit(m2,6, bit)

The molecular fingerprint of the bit at position 91 is, as can be seen, part of a ring.

Draw.DrawMorganBit(m2,91, bit)

The molecular fingerprint for bit 94 contains structures from three rings, so it also includes the specific bond path through which the rings are connected to each other.

Draw.DrawMorganBit(m2,94, bit)
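
When several bits need to be compared, drawing them one call at a time gets repetitive. rdkit also provides Draw.DrawMorganBits(), which takes a list of (molecule, bit ID, bit dictionary) tuples and lays the drawings out in a grid. The sketch below simply takes the first six set bits as an arbitrary choice; the variable names are made up for this example.

```python
from rdkit import Chem
from rdkit.Chem import AllChem, Draw

testosterone = Chem.MolFromSmiles(
    '[H][C@@]12CCC3=CC(=O)CC[C@]3(C)[C@@]1([H])CC[C@]1(C)[C@@H](O)CC[C@@]21[H]')
bit_info = {}
AllChem.GetMorganFingerprintAsBitVect(
    testosterone, 2, nBits=1024, useChirality=True, bitInfo=bit_info)

# Pick the first few set bits; any keys of bit_info would work here
chosen_bits = sorted(bit_info)[:6]
panels = [(testosterone, bit_id, bit_info) for bit_id in chosen_bits]

# One panel per bit, three panels per row, each labelled with its bit ID
grid = Draw.DrawMorganBits(
    panels, molsPerRow=3, legends=[f'bit {b}' for b in chosen_bits])
```

In a notebook, leaving grid as the last expression of a cell should display the whole panel at once.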

Final notes: This data science for drug discovery tutorial covered one of the most important parts of Data Science for cheminformatics, which is feature engineering. We started with SMILES notation and ended up with 30-plus binary Morgan fingerprint bits per molecule. These fingerprint methods are most useful for fingerprinting small to medium-sized organic molecules, while larger molecules such as proteins or DNA are better served by other Data Science and Bioinformatics methods. This step is essential for creating the datasets for the Machine Learning parts, which are going to be the theme later in this series of tutorials. In this tutorial the molecules of interest were smaller organic molecules like arginine and testosterone, while in the next one the main topic will be working with larger molecules like peptides or proteins and the FASTA format. Thanks for reading and happy learning!

Created by a Data Scientist and Life Science Researcher,

Darko Medin
