Virtual screening/Docking workflow

Whilst high-throughput screening (HTS) has been the starting point for many successful drug discovery programs, the cost of screening, the lack of access to a large, diverse sample collection, or the low throughput of the primary assay may preclude HTS as a starting point. In those cases it may be preferable to identify a smaller selection of compounds with a higher probability of being a hit. Directed or virtual screening is a computational technique used in drug discovery research to identify potential hits for evaluation in primary assays. It involves the rapid in silico assessment of large libraries of chemical structures in order to identify those structures that are most likely to be active against a drug target. The in silico screen can be based on similarity to known ligands or on docking ligands into the desired binding site.

In this workflow I’ll be looking at using docking to identify potential hits. There are a number of docking algorithms available, including AutoDock, AutoDock Vina and smina.

AutoDock Vina is reported to be orders of magnitude faster than AutoDock whilst improving binding mode predictions. Smina is a fork of AutoDock Vina that focuses on improving scoring and minimization. More details are given in this publication DOI. I used smina; pre-built binaries are available for Mac OS X and Linux.

Ligands for docking

Whilst there are a number of sites that offer compilations of structures for docking, probably the most comprehensive is ZINC, a free database of commercially available compounds for virtual screening containing 35 million purchasable structures. The ZINC structures are nicely categorised, so you can download subsets based on calculated physicochemical properties.

Once downloaded you will need to generate multiple reasonable conformations for each molecule. Whilst systematic enumeration of all conformational space is possible, with flexible ligands it can rapidly lead to an explosion in the number of possible conformations. For reviews see DOI and DOI.

The generation of conformations for small molecules is a problem of continuing interest in cheminformatics and computational drug discovery. This review will present an overview of methods used to sample conformational space, focusing on those methods designed for organic molecules commonly of interest in drug discovery.

Since release 2015.09.1, a new conformer generation method, termed ETKDG, has been available in the RDKit. It is a knowledge-based method that combines the distance geometry approach with experimental torsion-angle preferences obtained from small-molecule crystallographic data DOI. A Jupyter notebook was created to run the conformation generation, using the parallelisation code contributed by Andrew Dalke and described in the RDKit cookbook http://www.rdkit.org/docs/Cookbook.html. The notebook can be downloaded here: ConformationsNotebook.ipynb.zip. One thing to note is that, to aid molecule tracking, the molecule name field needs to be populated; this is the first line of each record in the SDF file. There is a definition of the SDF file format here.
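Populating the name field can be scripted. The following is a minimal sketch in plain Python, assuming the identifier is already stored in an SD data field. The tag name `zinc_id` and the function `set_sdf_titles` are hypothetical and chosen for illustration, and the file is assumed to use Unix line endings.

```python
def set_sdf_titles(sdf_text, id_tag):
    """Copy the value of the data field `id_tag` into the title line of each record.

    Assumes records are terminated by a "$$$$" line and use "\n" line endings.
    Records without the tag are passed through unchanged.
    """
    out = []
    for record in sdf_text.split("$$$$\n"):
        if not record.strip():
            continue  # skip the empty chunk after the final terminator
        lines = record.split("\n")
        for i, line in enumerate(lines):
            # Data headers look like "> <TAG>"; the value is on the next line.
            if line.startswith(">") and "<" + id_tag + ">" in line and i + 1 < len(lines):
                lines[0] = lines[i + 1].strip()  # line 1 of a record is the molecule name
                break
        out.append("\n".join(lines) + "$$$$\n")
    return "".join(out)


# Hypothetical single-record example with an empty name line.
example = (
    "\n"
    "  demo\n"
    "\n"
    "  0  0  0  0  0  0  0  0  0  0999 V2000\n"
    "M  END\n"
    "> <zinc_id>\n"
    "ZINC000001\n"
    "\n"
    "$$$$\n"
)
print(set_sdf_titles(example, "zinc_id").split("\n")[0])
```

Setting the name before conformer generation means every conformer of a molecule carries the same identifier, which the later pose-selection step relies on.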

The target protein

More than 58 million children are afflicted annually with diarrheal disease associated with the most prevalent infections of the small intestine, including Escherichia coli, Rotavirus, Giardia lamblia, and Cryptosporidium parvum, which ultimately results in the death of 2.5 million children. C. parvum is an obligate parasite in the same phylum of Apicomplexa as Plasmodium and the same order of Eucoccidiorida as Toxoplasma and Eimeria. It is one of the pathogenic agents responsible for cryptosporidiosis, a zoonotic and enteric disease. Children in resource-poor settings are particularly at risk, not only with an increased incidence of Cryptosporidium spp. infection, but also with increased acute and long-lasting morbidity. 

The Protein Data Bank is a fantastic resource that at the time of writing contains 41305 distinct protein sequences. For this study I’ll be using 2WEI, the kinase domain of Cryptosporidium parvum calcium-dependent protein kinase in complex with 3-MB-PP1 DOI. Whilst scientists will often debate the accuracy and validity of the algorithms used, it is equally important to check the quality of the starting protein structure.

We tend to believe that PDB files are a message from God (Derek Lowe, In The Pipeline).

The quality of an X-ray crystal structure can vary, and it is useful to understand how the model is derived from the electron density maps. The crystallographer builds a 3D model of the protein, attempts to fit it to the electron density, and then uses the built structure to generate a calculated electron density map. This is used to derive a difference map showing where the modelled structure does not match the experimental density map. The modelled structure is then adjusted and the process repeated until a refined structure is obtained.

As the image below highlights the resolution can have a significant impact on the quality of the eventual structure.

If you are downloading a PDB structure, the page for each entry provides a lot of very useful information, giving the resolution and a graphical display of various parameters (red = poor, blue = good).

The Molecular Description provides information about the protein and also indicates which residues are in the crystal structure; many proteins will have been modified to aid crystallisation. This information is also contained within the header of the PDB file you can download, which can be read in a text editor.
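The resolution can also be pulled out of the header programmatically. A minimal sketch, assuming the standard `REMARK   2 RESOLUTION.` record is present; the function name and the example value are illustrative and are not taken from 2wei.

```python
import re


def pdb_resolution(pdb_text):
    """Return the resolution in angstroms from the REMARK 2 record, or None."""
    for line in pdb_text.splitlines():
        if line.startswith("REMARK   2"):
            match = re.search(r"RESOLUTION\.\s+([0-9.]+)\s+ANGSTROMS", line)
            if match:
                return float(match.group(1))
    return None  # e.g. NMR entries report "NOT APPLICABLE"


# Made-up header fragment for illustration.
header = (
    "HEADER    TRANSFERASE\n"
    "REMARK   2\n"
    "REMARK   2 RESOLUTION.    1.65 ANGSTROMS.\n"
)
print(pdb_resolution(header))
```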

Unfortunately, even when working with a high-resolution X-ray crystallographic structure, researchers can spend considerable time and effort correcting common problems such as missing hydrogen atoms, incomplete side chains and loops, ambiguous protonation states, and flipped residues. Fortunately there are a number of software tools that can be used to tidy up the structures, such as the “Protein Preparation tools” in MOE or Schrödinger.

Once the PDB file was downloaded it was opened in MOE and the structure corrected. The ligand was then selected and saved in PDB format, and the protein minus the ligand was also saved in PDB format.
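If MOE is not to hand, the split into ligand and protein files can be approximated with a few lines of Python once the structure has been corrected. This is only a sketch: it does no structure preparation, the residue name `LIG` is a placeholder for whatever three-letter code the ligand carries in your file, and waters and other HETATM records stay with the protein.

```python
def split_ligand(pdb_lines, ligand_resname):
    """Separate ligand HETATM records from the rest of the coordinates.

    Relies on the fixed-column PDB format: the residue name sits in
    columns 18-20 (string indices 17:20). Header and CONECT records are dropped.
    """
    protein, ligand = [], []
    for line in pdb_lines:
        if line.startswith("HETATM") and line[17:20].strip() == ligand_resname:
            ligand.append(line)
        elif line.startswith(("ATOM", "HETATM", "TER")):
            protein.append(line)
    return protein, ligand


def _record(kind, serial, resname):
    """Build a well-formed coordinate record for the example below."""
    return "%-6s%5d %-4s %3s A%4d    %8.3f%8.3f%8.3f%6.2f%6.2f" % (
        kind, serial, " C1 ", resname, 1, 0.0, 0.0, 0.0, 1.0, 20.0)


# Hypothetical three-atom "structure": one protein atom, one ligand atom, one water.
records = [_record("ATOM", 1, "MET"), _record("HETATM", 2, "LIG"), _record("HETATM", 3, "HOH")]
protein_part, ligand_part = split_ligand(records, "LIG")
print(len(protein_part), len(ligand_part))
```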

Some of the potential issues are:

  • Alternates, Residues with alternate locations and/or ambiguous sequence identities (choose highest occupancy).
  • Termini, Protein chain C- or N-termini which need to be charged or capped, or if DNA the terminal PO4 may only have three oxygens bonded to the phosphorus and an additional oxygen needs to be added.
  • Loops, Sometimes loops are very disordered and appear as breaks in the chain; it may be possible to use a loop library to model a replacement.
  • Hydrogens, Often not visible and so need to be added/checked, particularly check hydrogens on heteroatoms.
  • Ligand, Novel ligands in particular need checking to confirm atoms and bond orders are correct.
  • Conformation, Check that torsions are reasonable and there are no clashes.
  • Charge, It is worth checking the charge on all ionisable groups.
  • Flips, It can be difficult to be certain of the position of the nitrogens in His or the orientation of the primary amide in Asn and Gln.
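The first item on the list, alternate locations, is easy to handle in a script. The sketch below keeps, for each residue, the alternate location with the highest occupancy, relying on the fixed-column PDB format (altLoc in column 17, occupancy in columns 55-60). It is an illustration under simple assumptions (a single model, well-formed records) rather than a substitute for a proper preparation tool.

```python
def keep_highest_occupancy(pdb_lines):
    """Keep one alternate location per residue: the one with the highest occupancy."""
    occupancy = {}  # (chain, resSeq + iCode) -> {altLoc: highest occupancy seen}
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")) and line[16] != " ":
            residue = (line[21], line[22:27])
            per_alt = occupancy.setdefault(residue, {})
            per_alt[line[16]] = max(per_alt.get(line[16], 0.0), float(line[54:60]))

    # On a tie the first altLoc encountered (usually "A") wins.
    chosen = {res: max(alts, key=alts.get) for res, alts in occupancy.items()}

    kept = []
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")) and line[16] != " ":
            if line[16] != chosen[(line[21], line[22:27])]:
                continue  # discard the lower-occupancy alternate
            line = line[:16] + " " + line[17:]  # blank the altLoc flag
        kept.append(line)
    return kept


def _atom(serial, name, alt, occ):
    """Build a well-formed ATOM record for the example below."""
    return "ATOM  %5d %-4s%s%3s %s%4d%s   %8.3f%8.3f%8.3f%6.2f%6.2f" % (
        serial, name, alt, "SER", "A", 10, " ", 0.0, 0.0, 0.0, occ, 20.0)


# Hypothetical serine with two side-chain positions.
atoms = [_atom(1, " N  ", " ", 1.0), _atom(2, " OG ", "A", 0.35), _atom(3, " OG ", "B", 0.65)]
cleaned = keep_highest_occupancy(atoms)
print(len(cleaned))
```

Choosing per residue rather than per atom avoids mixing atoms from two different alternate conformations in one side chain.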

We could check that all is working by redocking the ligand into the X-ray structure, but that seems a slightly trivial exercise. It would be better to dock a range of known ligands.

There is currently no data in ChEMBL related to this target but there are a number of other CDPK1 targets that have been the subject of screening efforts. CDPK1 is a potential malaria target and the results of a screen are available with confirmation of hits identified. It might be interesting to dock molecules shown to be active at this related protein to see if useful starting points can be identified.

The 581 molecules with IC50 data were downloaded and imported into MOE. The structures were then “washed” to remove counterions, correct structures etc., exported in SDF format and subjected to conformation generation to yield a file containing 2579 conformations.

Running the docking

With smina installed it is relatively easy to run the docking from the command line. The --cpu option allows you to define how many cores to use; I usually keep a couple back to keep my Mac Pro responsive to user input. For reproducibility, we specify a random number seed. The bounding box for docking is specified automatically with the autobox_ligand option, which creates a box with an 8 Å buffer around the provided ligand. The results are saved in a compressed SDF file.
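As a rough illustration, the command can be assembled in Python. This sketch assumes the binary is on the PATH as `smina` (the pre-built binaries may carry a different name), that the file names are placeholders, and that the 8 Å buffer is requested through `--autobox_add`. It is not the exact command used for this post.

```python
import os
import shlex


def smina_command(receptor, ligands, reference_ligand, output,
                  reserve_cores=2, seed=0, buffer_angstrom=8, exe="smina"):
    """Build a smina argument list; all file names are supplied by the caller."""
    # Keep a couple of cores free so the machine stays responsive.
    cores = max(1, (os.cpu_count() or 1) - reserve_cores)
    return [
        exe,
        "--cpu", str(cores),
        "--seed", str(seed),                      # fixed seed for reproducibility
        "--autobox_ligand", reference_ligand,     # box is built around this ligand
        "--autobox_add", str(buffer_angstrom),    # buffer added around the ligand
        "-r", receptor,
        "-l", ligands,
        "-o", output,                             # .sdf.gz gives compressed SDF output
    ]


# Placeholder file names.
cmd = smina_command("protein.pdb", "conformers.sdf", "ligand.pdb", "docked.sdf.gz")
print(shlex.join(cmd))
# To launch the run: subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids quoting problems if any of the paths contain spaces.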

Analysis of the results

The output from the docking run is a compressed SDF file that can be read into Vortex as shown below. The output includes an energy calculation that can be used to select molecules for further evaluation. However, it is also possible to apply additional scoring functions, and the use of multiple scoring functions is now well established.

Recently, machine-learning scoring functions trained on protein-ligand complexes have shown significant promise, an example being RF-Score-VS, which was trained on 15 426 active and 893 897 inactive molecules docked to a set of 102 targets DOI.

Our results show RF-Score-VS can substantially improve virtual screening performance: RF-Score-VS top 1% provides 55.6% hit rate, whereas that of Vina only 16.2% (for smaller percent the difference is even more encouraging: RF-Score-VS top 0.1% achieves 88.6% hit rate for 27.5% using Vina). In addition, RF-Score-VS provides much better prediction of measured binding affinity than Vina (Pearson correlation of 0.56 and −0.18, respectively). Lastly, we test RF-Score-VS on an independent test set from the DEKOIS benchmark and observed comparable results. 
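The hit rates quoted above are the fraction of known actives among the top-scoring x% of a ranked list. A small sketch of that calculation, using made-up scores and activity labels; the function name and the rounding of the cut-off (rounded up, at least one compound) are my own choices for illustration.

```python
import math


def hit_rate_at_top(scored, fraction, higher_is_better=True):
    """Fraction of actives among the best-scoring `fraction` of compounds.

    scored: list of (score, is_active) pairs; fraction: e.g. 0.01 for the top 1%.
    Use higher_is_better=False for energies such as Vina scores.
    """
    if not scored:
        raise ValueError("no compounds to rank")
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=higher_is_better)
    n_top = max(1, math.ceil(fraction * len(ranked)))
    actives = sum(1 for _, is_active in ranked[:n_top] if is_active)
    return actives / n_top


# Made-up example: four compounds, two of them active.
example_scores = [(0.9, True), (0.7, False), (0.4, True), (0.1, False)]
print(hit_rate_at_top(example_scores, 0.5))
```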

Binaries for RF-Score-VS are available at https://github.com/oddt/rfscorevs_binary and require only minimal input:

  • -i input file format; if not present then based on extension [optional]
  • --receptor a protein file; format based on extension [required]
  • -O output file; if -o is not present file format is based on extension [optional]
  • -o output file format; if -O is not present then molecules are printed to standard output [optional]

Thus the command line is
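A sketch of how such a command might be assembled from the options listed above. The executable name `rf-score-vs`, the positional input file and the file names are assumptions for illustration; check them against the help output of the binary you downloaded.

```python
import shlex


def rfscore_command(receptor, docked_poses, output_csv, exe="rf-score-vs"):
    """Build an argument list for rescoring docked poses, writing CSV output."""
    return [
        exe,
        "--receptor", receptor,   # protein file used for the docking
        docked_poses,             # input poses; format inferred from the extension
        "-o", "csv",              # output file format
        "-O", output_csv,         # output file
    ]


# Placeholder file names.
rescore_cmd = rfscore_command("protein.pdb", "docked.sdf", "rescored.csv")
print(shlex.join(rescore_cmd))
# To launch the run: subprocess.run(rescore_cmd, check=True)
```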

This can be accessed via a Vortex script, first we get the path to the imported SDF file, then use a dialog box to get the path to the PDB file used for the docking. Then we construct the command for rescoring and submit it. Finally we parse the output returned and populate the workspace.

Vortex script for rescoring docking results

This script rescores all the docking poses and populates the table as shown below. Sometimes it can be a little difficult to discern the 3D structures displayed in Vortex so a useful trick is to click on the “Tools” menu and select “Calculate Properties”, then in the dialog box select the check box alongside “SMILES code of molecule”. An additional column will be added that contains the SMILES string rendered as a 2D molecule.

The next task is then to select the molecules for biological screening. You can work your way through the list line by line looking at the docking, or you can simply choose the lowest energy pose for each molecule. The following Vortex script selects the lowest energy pose; it requires that there is a unique identifier (name) for each row in the workspace. Once selected, you can export the selected structures to a new workspace. In a similar manner you can select the highest scoring poses.

Vortex script for selecting lowest energy
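Outside Vortex the same selection can be sketched in plain Python. The example assumes each pose is a dictionary with a `name` column and a `minimizedAffinity` column holding the docking energy; both column names are assumptions about how the results were exported, and the values are made up.

```python
def best_pose_per_molecule(rows, name_key="name",
                           score_key="minimizedAffinity", lowest=True):
    """Return one row per molecule name: the lowest (or highest) scoring pose."""
    best = {}
    for row in rows:
        name = row[name_key]
        score = float(row[score_key])
        current = best.get(name)
        if current is None:
            best[name] = row
            continue
        current_score = float(current[score_key])
        better = score < current_score if lowest else score > current_score
        if better:
            best[name] = row
    return list(best.values())


# Made-up poses: two for the first molecule, one for the second.
poses = [
    {"name": "CHEMBL1", "minimizedAffinity": "-7.2"},
    {"name": "CHEMBL1", "minimizedAffinity": "-8.9"},
    {"name": "CHEMBL2", "minimizedAffinity": "-6.1"},
]
for row in best_pose_per_molecule(poses):
    print(row["name"], row["minimizedAffinity"])
```

Passing `lowest=False` together with a different `score_key` gives the highest scoring pose instead, which is the direction needed for scores where larger is better.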

Exploring Docking Results

With a workspace containing the selected docking results we can use the embedded AstexViewer to look at the docked poses.

Add AstexViewer Script

Right click on the plot area to access the plot setting dialog box, click on Configure single protein to navigate to the PDB file used in the docking.

Given that the ligands that were docked were taken from ChEMBL, we also have the activity at the related CDPK1 target, and it is perhaps interesting to see how the docking results reflect the experimental affinities. The bioactivity data was appended to the workspace using the ChEMBL ID as the key. A plot of IC50 versus molecular weight (MW) (below) shows that there is no correlation between activity and molecular weight.

However, if we plot the affinity score versus MW there appears to be a bias, with higher molecular weight molecules yielding lower energies. Colour coding the molecules with IC50 <100 nM (green) serves to emphasise the bias towards higher molecular weight compounds. This bias is perhaps not surprising, since there is little penalty for adding MW to the ligands and larger molecules can often pick up additional binding interactions.

If we do a similar analysis based on the poses selected based on RF-Score-VS (below) it does seem there is much less of a bias towards high molecular weight compounds.
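One way to put a number on this kind of bias is the Pearson correlation between molecular weight and score. A minimal sketch with made-up values (not the data from this workspace); a coefficient close to -1 for an energy score would indicate that heavier molecules systematically score better.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Raises ZeroDivisionError if either sequence is constant.
    return cov / math.sqrt(var_x * var_y)


# Made-up molecular weights and docking energies (kcal/mol).
mw = [250.0, 300.0, 350.0, 400.0]
energy = [-6.0, -7.0, -8.0, -9.0]
print(pearson(mw, energy))
```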

The next step is to screen a large library of novel ligands; that is described in the following post, A workflow for docking/virtual screening 2.

All the Vortex scripts can be downloaded here 

Last Updated 25 August 2017
