
Improvements, trends, and new ideas in molecular docking: 2012–2013 in review


Molecular docking is a computational method for predicting the placement of ligands in the binding sites
of their receptor(s). In this review, we discuss the methodological developments that occurred in the
docking field in 2012 and 2013, with a particular focus on the more difficult aspects of this
computational discipline. The main challenges and therefore focal points for developments in docking,
covered in this review, are receptor flexibility, solvation, scoring, and virtual screening. We specifically
deal with such aspects of molecular docking and its applications as selection criteria for constructing
receptor ensembles, target dependence of scoring functions, integration of higher-level theory into
scoring, implicit and explicit handling of solvation in the binding process, and comparison and evaluation
of docking and scoring methods. Copyright © 2015 John Wiley & Sons, Ltd.


Molecular docking is a computational method for placing ligands (often small molecules) into the
binding site of their receptor (macromolecular target). Docking algorithms and scoring functions are
capable of generating structures of receptor–ligand complexes, ranking compounds, and sometimes
estimating binding energies/affinities.

The current review covers the advances in the molecular docking field that were published in 2012–
2013. While extremely useful in a multitude of tasks and scenarios, docking and, particularly, scoring
suffer from many shortcomings, specifically when it comes to accounting for entropy, solvation, and
receptor flexibility.

In this review, we specifically focus on the more challenging aspects of molecular docking and its
applications, such as selection criteria for constructing receptor ensembles, target dependence of
scoring functions, integration of higher-level theory into scoring, implicit and explicit handling of
solvation in the binding process, and comparison and evaluation of docking and scoring methods.

This review does not deal with the basic principles of the docking methodology; for that, novice users are
referred to an excellent recent review.[1] We also have not systematically surveyed miscellaneous case
studies, but included those studies where important methodological developments are described. Most
sections of this review conclude by listing additional examples for further exploration of specific topics.
Unlike in the previous installments,[2,3] the areas of protein–protein docking and targeting protein–protein interfaces are not covered.


Receptor flexibility plays a crucial role in biomolecular recognition. A number of techniques have been
developed to account for receptor flexibility in docking. These techniques can be broadly divided into on-the-fly methods, based on modifications of standard rigid-receptor protocols, and methods applying multiple receptor or ensemble input, obtained either from experiment or by employing enhanced sampling.

On-the-fly methods
These methods deal with receptor flexibility by enumerating its conformations or modeling their
changes during docking. This is achieved via various conformational searching and/or optimization
approaches and sometimes includes treatment of ligands simultaneously with the receptor.

GalaxyDock (original version[4] and version 2,[5] Tables 1 and 2) accounts for protein flexibility of pre-
selected residues within the receptor binding site by means of global optimization and compares
favorably to a range of programs (FLIPDock, AutoDock (versions 3 and 4), RosettaLigand, and SCARE) in
binding pose prediction accuracy, with success rates of 80–87%. The success rate is the fraction of docked poses with an rmsd of ≤2 Å relative to the experimental structures.
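The ≤2 Å success-rate metric above is straightforward to compute once docked and reference poses are available. A minimal sketch in pure Python, assuming the poses are already expressed in the same receptor frame (docking rmsd is conventionally computed without re-superposition):

```python
import math

def rmsd(pose_a, pose_b):
    """Root-mean-square deviation between two equal-length lists of (x, y, z)
    coordinates, assumed to be in the same reference frame."""
    assert len(pose_a) == len(pose_b)
    sq = sum((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
             for (ax, ay, az), (bx, by, bz) in zip(pose_a, pose_b))
    return math.sqrt(sq / len(pose_a))

def success_rate(docked, experimental, cutoff=2.0):
    """Fraction of docked poses within `cutoff` Å rmsd of the crystal pose."""
    hits = sum(1 for d, e in zip(docked, experimental) if rmsd(d, e) <= cutoff)
    return hits / len(docked)
```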

Kalid et al. presented the consensus induced-fit docking (cIFD) method for adapting a receptor binding site to several diverse ligands, a critical feature for virtual screening (VS).[6] The cIFD workflow involves two steps: (i) IFD of multiple ligands for preliminary binding-mode determination and (ii) receptor refinement in the presence of a “hybrid” ligand, which combines selected poses of the IFD-docked ligands. cIFD was validated on three targets previously shown to be challenging for docking. It offers a pragmatic way to implicitly account for receptor flexibility while avoiding the additional computational cost of accounting for full receptor flexibility during docking.

The latest release of RosettaLigand[7] implements both receptor and ligand structural changes during
the docking stage, incorporating full protein backbone and side-chain flexibility and multiple ligand
simultaneous docking. Similar to RosettaLigand, the new program Sampler for Multiple Protein–Ligand
Entities (S4MPLE, Tables 1 and 2) simultaneously deals with receptor and ligand(s) conformations.[8] It is
based on a hybrid genetic algorithm and handles intramolecular and intermolecular degrees of freedom
(DoF) and an arbitrary number of independent species (1=conformational sampling; 2=docking; 3 or
more=multiple ligand docking). S4MPLE operators work on a randomly picked molecular substructure. If
it is covalently connected, the movements are restricted by covalent constraints (e.g., bond length). If a
substructure is not connected, as in a ligand–receptor complex, intermolecular contacts act as
constraints. An analogous methodology is implemented in AutoDock, where selected side chains can
undergo conformational changes during docking. This is achieved by separating the chosen side chains
from the rigid portion of the receptor and making them part of the input ligand file, however without
freedom to change the translation or orientation. Implementations of this approach were done with
both AutoDock 4[9] and AutoDock Vina.[10] Other recent approaches to account for receptor flexibility
include the following: mixed coarse grained/all-atom simulations with pre-calculated protein side chains
and explicit treatment of bonded interactions along the backbone;[11] combining IFD with quantum
polarized docking to obtain realistic affinity constants and enantioselectivity estimates;[12] IFD by
combinatorial rearrangement of binding site side chains, followed by grouping rearranged residues in
sterically independent families and side-chain conformer clustering;[13] fast, graph-based optimization
algorithm for assignment of the near-optimal set of residue rotamers;[14] and efficient multistage
backbone reconstruction algorithm for loop regions in receptors.[15]
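The AutoDock-style flexible side-chain treatment described above amounts to appending the chosen side-chain torsions to the ligand's degrees of freedom, so that a single search vector covers both. A toy sketch of that idea follows; the scoring function here is a hypothetical smooth surrogate, not a real force field, and the greedy random search stands in for the genetic algorithms actually used:

```python
import random

def toy_score(dof):
    # Hypothetical smooth energy surface with its minimum at all-zero DoF
    # values; a real docking run would evaluate a force-field-based score.
    return sum(x * x for x in dof)

def random_search(n_ligand_dof, n_sidechain_dof, n_steps=2000, seed=1):
    """Optimize one combined DoF vector covering ligand torsions/position
    and the selected flexible side-chain torsions."""
    rng = random.Random(seed)
    n = n_ligand_dof + n_sidechain_dof
    best = [rng.uniform(-180.0, 180.0) for _ in range(n)]
    best_score = toy_score(best)
    for _ in range(n_steps):
        cand = [x + rng.gauss(0.0, 10.0) for x in best]
        s = toy_score(cand)
        if s < best_score:  # greedy accept, as in a simple local search
            best, best_score = cand, s
    return best, best_score
```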

The range of new methods and tools described above is encouraging and clearly provides new
opportunities to tackle hitherto intractable problems (e.g., highly flexible targets) in a computationally
efficient manner. However, there is currently a lack of objective independent evaluation and comparison
of these methods.
Ensemble-based methods

These methods rely on providing multiple receptor conformations as input to docking programs. Many
studies have now demonstrated that using an ensemble approach is superior to a single receptor
conformation input.[16] However, the main shortcoming of these approaches is a lack of a broadly
applicable protocol for a priori selection of the most predictive structures.[2,3] Furthermore, even the
issue of how many structures should be included into an ensemble for its optimal performance has not
yet been resolved.[2,3] Korb et al. identified several key factors affecting ensemble docking with respect
to pose prediction and VS performance: sampling accuracy, choice of the scoring function, and the
similarity of docked ligands to the ligands bound to the protein structures in an ensemble.[17] They also
comprehensively evaluated the ensemble performance compared with the performance of the
individual ensemble members and found the following: that (i) in almost all cases, ensembles perform
better than the worst single structure; (ii) in many cases, ensembles perform better than the average
single protein structure; and (iii) in some cases, ensembles perform better than the best single protein
structure. Based on these findings, they concluded that the rational prospective selection of optimum
ensembles is a challenging task, critically in need of further research. Most importantly, protocols are
required to generate ensembles, in terms of both size and membership, leading to increased docking
efficiency and reduced false positive rate in VS.

To address the issue of the rational selection of
ensemble members, Rueda et al. modified their LiBERO method, which relies on the use of ligand
information for selecting the best performing individual pockets. They developed ALiBERO, a new tool,
which expands the pocket selection from single to multiple.[18] The dual method of ALiBERO uses
exhaustive combinatorial searching and individual addition of pockets, selecting only those that
maximize the discrimination of known active compounds from decoys. ALiBERO was tested with the
human estrogen receptor α and led to improved VS performance. Xu and Lill dealt with the issue of
rational ensemble construction by considering three potential selection strategies: clustering based on
pairwise rmsd; pose prediction performance; and VS performance in terms of actives/decoys
differentiation.[19] Using the VS performance as selection criterion was shown to be the most successful
method. Thus, using the earlier developed Limoc concept, they were able to achieve a balance between
the extent of protein flexibility accounted for and the risk of false positives resulting from the excessive
ensemble size. Specifically, successful docking was shown to be performed using ensembles of relatively
small size: 3–5 receptor structures. Other methods for ensemble construction included the following: a
binding site shape diversity approach based on clustering binding site volume overlaps;[20] selection of
conformations of predefined binding site residues based on rotamer libraries;[21] rank-averaging of
frames from MD simulations;[22] computing and combining a set of grid maps using an energy
weighting scheme;[23] using multiple crystal structures with varied binding site geometries for a given
target;[24] and using HYBRID, a variation of FRED that exploits the knowledge of bound ligands[25–27]
(programs FRED and HYBRID are discussed in more detail in the “Systematic evaluation of popular and
widely used docking programs and scoring functions” subsection).

Applications of experiment-derived
ensemble docking included a range of targets: chikungunya envelope proteins;[28] histone deacetylase
8;[29] sortase A;[30] kinases;[25] and GPCRs.[31]

From these studies, a theme is emerging whereby
potential ensemble members are tested against known experimental data (structures, activities) with
only those that give the best predictions retained. The ability of potential ensemble members to
distinguish between known actives and decoys emerges as a critical criterion. Conceptually, this
approach is similar to using ligand training in binding site optimization and then selecting the binding
site(s) performing the best in VS. This method is commonly used in GPCR modeling.[32] However, an
opposite viewpoint was proposed by Nasr et al., based on time and resource cost of retrospective VS
evaluation.[33] They tested the properties of the binding sites as metrics for receptor inclusion into
ensembles. Specifically, they computed the area of the opening and volume of the binding sites,
hydrophobicity, and solvent accessible surface area. Using the volume of the binding site as a criterion,
they were able to select the best single structure and ensemble members. It could be argued that cost
considerations are not critical given the increasing availability of computational resources; however, the
usefulness of the “binding site properties-based” guidelines[33] goes beyond resources and is
particularly beneficial for systems for which the number of known actives is limited or non-existent.
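The emerging selection criterion, namely the ability of ensemble members to distinguish known actives from decoys, can be sketched as a greedy procedure: score every ligand against every candidate receptor, define the ensemble score of a ligand as its best (lowest) score over the members, and add receptors only while the actives/decoys discrimination (AUC) improves. This is an illustrative simplification, not ALiBERO's actual algorithm:

```python
def auc(active_scores, decoy_scores):
    """Probability that a random active outscores (lower = better) a decoy."""
    wins = ties = 0
    for a in active_scores:
        for d in decoy_scores:
            if a < d:
                wins += 1
            elif a == d:
                ties += 1
    return (wins + 0.5 * ties) / (len(active_scores) * len(decoy_scores))

def ensemble_scores(members, score_table):
    """Ensemble score of each ligand = best (lowest) score over members.
    score_table: one row per ligand, one column per receptor."""
    return [min(row[m] for m in members) for row in score_table]

def greedy_select(n_receptors, active_tbl, decoy_tbl, max_size=3):
    """Add receptors one at a time, keeping only additions that raise AUC."""
    members, best_auc = [], 0.0
    while len(members) < max_size:
        gains = []
        for r in range(n_receptors):
            if r in members:
                continue
            trial = members + [r]
            a = auc(ensemble_scores(trial, active_tbl),
                    ensemble_scores(trial, decoy_tbl))
            gains.append((a, r))
        if not gains:
            break
        a, r = max(gains)
        if a <= best_auc:
            break
        members, best_auc = members + [r], a
    return members, best_auc
```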

Other methods

Combining docking with molecular dynamics

If extensive and diverse crystallographic data are not available for a given target, simulation-based ensembles can be used.[16] A generally accepted view is that greater sampling of biomolecular conformational space is desirable for optimal ensemble construction.[16] The main obstacles to such coverage are the small conformational changes generally observed in relaxed-complex and induced-fit schemes and, conversely, the short timescales typically accessible to MD simulations. Approaches to overcoming these obstacles include brute-force hardware advances and algorithmic improvements. Conceptually, the simplest brute-force solutions to the problem of conformational space coverage come from increases in the speed and efficiency of computer hardware. Recent advances in this area stem from the implementation of GPU-accelerated computing,[34,35] the construction of special-purpose supercomputers such as Anton,[36] or moving calculations to the cloud.[37] Conceptually more demanding are advances in algorithms that allow more efficient coverage of the space. Enhanced sampling methods, such as temperature-accelerated replica exchange, Hamiltonian-based accelerated MD, umbrella sampling, metadynamics, and Markov state models, accelerate calculations by introducing artificial biases into the simulations.[16] Chaudhuri et al. developed a new algorithm, essential dynamics/molecular dynamics (ED/MD), to generate perturbed ensembles that represent ligand-induced binding site flexibility in a very accurate manner.[38] Used in docking, these perturbed ensembles delivered superior performance compared with a single structure or with ensembles derived from conventional MD simulations. The MM2QM tool made it possible to combine docking, MD, MM (molecular mechanics), and QM (quantum mechanics) calculations to generate alternative conformations of protein receptors or to optimize final docked structures.[39]

Other receptor features

Structural features of protein binding sites, beyond their conformational aspects discussed above,
started drawing increased attention in recent times. Kim et al. directed their attention to the effects of
ionizable residues of protein receptors, in particular histidine protonation,[41] while Wirth et al.
analyzed the protein pocket shape and size characteristics.[42] Both studies tested the inclusion of these
protein-related factors into VS scenarios. In an interesting and timely methodological study, Sherman
and co-workers systematically explored steps involved in preparing a system for VS.[43] They
determined that VS enrichment is improved with proper preparation and that ignoring certain
preparation steps produces a systematic deterioration in enrichment, which can be large for some
targets. Users, beware!

Effect of input ligand structure

In order to be realistic, prediction of binding should take into account ligand speciation, that is,
ionization and tautomerism. Natesan et al. pursued that goal by implementing multispecies docking
where simulations were performed individually for each species pair, and the results were combined in a
correlation equation, calibrated using the experimental data.[44] This treatment significantly improved
the correlation between experimental inhibition potencies and the predicted interaction energies.
Others have also studied the effects of ligand protonation states in docking.[45,46] Feher and Williams
continued with the theme of docking outcomes as a function of ligand input and random chaotic effects
due to the sensitivity to input perturbations. Strikingly, they showed that even identical input structures,
only varying in the atom order, can produce different docking outcomes.[47]
Using GOLD (stochastic) and Glide (deterministic) programs, they demonstrated that variations could range
from small and normally distributed, in well-behaved cases, to significant and widely different. The
authors asked whether these outcomes are algorithm-specific and whether other commonly used programs should be evaluated for such sensitivity. However, they also concluded that such reproducibility issues are inherent in docking, as in many other computational fields, and, rather than being a problem, should be viewed as a feature. As such, they should be accounted for, without the expectation that future, improved algorithms can fully overcome them.
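The recommendation to treat run-to-run variability as a feature suggests routinely quantifying it. A sketch with a stand-in stochastic "docking" function; a real study would rerun GOLD or AutoDock with varied seeds or permuted input atom order and summarize the resulting score spread:

```python
import random
import statistics

def mock_stochastic_dock(seed):
    """Stand-in for one stochastic docking run, returning a 'score'.
    The -8.0 center and noise level are illustrative, not real values."""
    rng = random.Random(seed)
    # Pretend optimisation: keep the best of several noisy evaluations.
    return min(-8.0 + rng.gauss(0.0, 0.5) for _ in range(20))

def score_spread(n_runs=30):
    """Mean and standard deviation of the docking score over repeated runs."""
    scores = [mock_stochastic_dock(seed) for seed in range(n_runs)]
    return statistics.mean(scores), statistics.stdev(scores)
```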

Ligand flexibility

Bohari and Sastry tested five programs (Glide, GOLD, FlexX, CDocker, and LigandFit) on a dataset of 199 FDA-approved drug–target complexes.[48] Even given the limited range of ligand conformational
complexity in such dataset, they noted the dependence of pose prediction accuracy on ligand “size” and
observed better performance for low or medium flexibility. In a similar study, using docking with
AutoDock 4 and AutoDock Vina, Houston and Walkinshaw[49] suggested that there appears to be a certain ligand size that maximizes pose prediction accuracy, because of optimum flexibility. Using
these programs, as well as others,[2,3] tends to result in increased failure rates when docking small,
fragment-like molecules. This is generally accepted to be a scoring failure. Conversely, the failure to
correctly dock large, often highly flexible molecules is ascribed to the shortcoming of sampling. To
address the issues of sampling, new methods are designed, allowing variable DoF, which depend on
ligand complexity and availability of secondary constraints. S4MPLE, a new conformational tool, is
adoptable for docking because of its unique treatment of DoFs.[8,50] Because of its generalized DoF-
based design, S4MPLE can consider conformational space of the flexible parts of the system: receptor,
or ligand, or both. Some algorithmic details of S4MPLE are described earlier, in the “On-the-fly methods”
subsection. LiGenDock, also designed with improved sampling in mind,[51] uses pharmacophore-derived
constraints, allowing flexible docking without conformer enumeration. Specifically, flexible alignment of
a molecule on pharmacophore points eliminates the bias imposed by the degree of coverage of
conformational space. While the idea is not strictly novel (similar approaches to search space reduction
have been developed previously, as far back as anchor-first methodology of DOCK[52]) and the success
rate is modest (Table 2), LiGenDock was successfully implemented in the high performance de novo
workflow LiGen[53] described later in the “Fragment docking” subsection. An alternative way of
reducing search complexity was proposed by Joung et al.[54] They developed an algorithm to generate
initial ligand poses within a binding site; based on the property-weighted vector (P-weiV), the 3D vector
determined by the molecular property of hydration-free energy density, which performed favorably
compared with LigandFit.


Poor performance in VS is commonly manifested by scoring functions’ inability to correlate with experimentally measured binding affinities and to perform equally well for various targets (target
dependence). These shortcomings stimulate further research into the development of new and the
improvement of existing scoring functions. These efforts are summarized in Table 3.

Empirical functions

Empirical scoring functions are based on meaningful terms that are weighted to reproduce binding
affinities and binding modes. Developments in empirical functions are commonly concerned with one or
more of their aspects: descriptors (new and/or improved), training sets (larger and/or higher quality),
and alternative methods for regression/correlation. Development of SFCscoreRF (Table 3) by Zilian and
Sotriffer[55] considered the last two aspects to improve on the SFCscore function (see in the “Machine
learning approaches to scoring function development” subsection). The descriptor aspect is illustrated
by a new treatment implemented in the SVR-SF (Table 3),[56] the first SVM-based scoring function able
to compute novel valuable information, namely, thermodynamic components of protein–ligand binding
energies. Unlike the usual practice of scoring, where ΔG is estimated as a function of descriptors, the
SVR-SF calculates ΔG from the values of ΔH and TΔS, which are first estimated from the descriptors. Two
new scoring functions (HotLig[57] and knowledge-based and empirical combined scoring algorithm
(KECSA)[58] (Table 3)), which demonstrated improved pose and binding affinity prediction metrics, were
the result of combining the empirical and knowledge-based terms.
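The weighted-term construction of empirical scoring functions can be illustrated in a few lines: given per-complex descriptor terms (here a constant plus two hypothetical counts, e.g. hydrogen bonds and lipophilic contacts) and experimental affinities, the weights are the least-squares solution of the normal equations. A self-contained sketch with a small Gauss–Jordan solver:

```python
def solve(A, b):
    """Gauss-Jordan elimination for a small dense linear system."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col and M[r][col]:
                f = M[r][col] / M[col][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit_weights(terms, affinities):
    """Least-squares weights: `terms` holds one descriptor row per complex
    (leading 1.0 for the constant), `affinities` the experimental values."""
    n = len(terms[0])
    AtA = [[sum(t[i] * t[j] for t in terms) for j in range(n)] for i in range(n)]
    Atb = [sum(t[i] * y for t, y in zip(terms, affinities)) for i in range(n)]
    return solve(AtA, Atb)

def score(weights, term_row):
    """Predicted affinity: weighted sum of the descriptor terms."""
    return sum(w * x for w, x in zip(weights, term_row))
```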

Theory-based functions

Higher-level theory such as QM, including semiempirical QM (SQM), can be leveraged to improve the performance of empirical scoring functions. However, the common limitation of first-principles calculations is their computational cost when applied to large systems such as protein–ligand complexes, even when hybrid QM/MM methodology is used. Nevertheless, recent developments in the QM/MM and DFT fields have enabled Rao et al. to derive the first high-level theory-based scoring function involving no fitting, tested against two databases: a CDK2 inhibitor database (76 complexes) and a p21-activated kinase database (20 complexes).

The calculated score values and experimental inhibitor efficacies correlated very satisfactorily with
R2=0.76–0.88. Considering such high degree of correlation, currently out of reach of empirical and
knowledge-based functions (Table 3), and its assumed target independence due to its first-principles
nature, this function holds promise of overcoming both major challenges faced by conventional scoring
functions. In a similar development, carried out by Brahmkshatriya et al.,[60] a DFT/SQM/MM function
was also tested for a series of CDK2 inhibitors and produced a correlation with R2=0.64.
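The R2 values quoted throughout this section are squared Pearson correlations between predicted scores and experimental efficacies, computed, for example, as:

```python
def r_squared(experimental, predicted):
    """Squared Pearson correlation between two series, as commonly reported
    when validating scoring functions against measured affinities."""
    n = len(experimental)
    mx = sum(experimental) / n
    my = sum(predicted) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(experimental, predicted))
    sxx = sum((x - mx) ** 2 for x in experimental)
    syy = sum((y - my) ** 2 for y in predicted)
    return sxy * sxy / (sxx * syy)
```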


Machine learning approaches to scoring function development

Zilian and Sotriffer[55] demonstrated that the SFCscoreRF (Table 3), trained as a regression-based RF model, produced improvements of 0.1–0.6 in the Pearson coefficient for the PDBBind2007 test set compared
with the original SFCscore function and matched (if not slightly exceeded) the level of performance of
the RF-Score function.[61] A more modest improvement of 0.1–0.2 was produced for the CSAR-NRC HiQ
test set, indicating the sensitivity of the SFCscoreRF to the composition of the dataset and, thus, its
target dependence. Another ML function, SVR-SF[56] (Table 3), had its performance improved by the
inclusion of comprehensive ligand-based descriptors. Other new ML functions (Table 3) included the
following: empirical B2BScore,[62] knowledge-based interaction fingerprint-guided,[63] eSimDock,[64]
ID-Score,[65] SVRR (SVR regression),[66] MIEC-SVM,[67,68] and hybrid RF-MNLR (RF-multinomial logistic
regression).[69] With the advent of ML-based functions, the field was ripe for a systematic evaluation
and comparison with conventional scoring functions. In such an evaluation, Ashtawy and Mahapatra
compared six ML methods (boosted regression trees (BRT), k-nearest neighbors (kNN), multivariate adaptive regression splines (MARS), MLR, RF, and SVM) and 16 popular conventional scoring functions
(from Discovery Studio (DS) [LigScore, Piecewise Linear Potential (PLP), PMF, Jain, and LUDI], Sybyl [D-
Score, PMF-Score, G-Score, F-Score, and ChemScore], GOLD [GoldScore, ChemScore, and ASP],
GlideScore, DrugScore, and X-Score).[70] Using the non-overlapping subsets of the PDBBind2007 set as training and test sets, they tested all functions for their scoring and ranking powers. ML-based functions
have systematically outperformed conventional functions with the top three being RF, BRT, and SVM
methods and the remaining three taking the sixth (kNN), eighth (MLR), and ninth (MARS) positions.
Unlike individual comparisons, mostly at the hands of function developers, such an evaluation bolsters
the confidence in the performance of ML-based approaches. By testing scoring and ranking powers
separately, the authors found that the good performance in one did not necessarily deliver a good
performance in another. This finding relates to other observations of task-specific performance of many
scoring functions (e.g., for GlideScore[71] and HYdrogen bond and DEhydration energies (HYDE)[72]) and points to the need to tailor their use, following appropriate evaluation for specific targets or tasks.
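The ML scoring functions compared above replace a fixed functional form with a learner over descriptors. As a minimal stand-in for the kNN entry in that comparison, affinity can be predicted as the mean over the k nearest training complexes in descriptor space; this is illustrative only, and the published functions use far richer descriptor sets:

```python
import math

def knn_predict(train_X, train_y, x, k=3):
    """Predict the binding affinity of a complex with descriptor vector `x`
    as the mean affinity of its k nearest training complexes."""
    neighbours = sorted(
        (math.dist(x, tx), ty) for tx, ty in zip(train_X, train_y))
    return sum(y for _, y in neighbours[:k]) / k
```

On a nonlinear structure-activity landscape such a learner can outperform a single global linear fit, which is the pattern Ashtawy and Mahapatra observed at scale.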

Target dependence

Target dependence of functions remains an as yet unconquered challenge in the scoring field,[2,3] and
currently, the most practical approach is to screen as many different protocols as possible (if known
binders are available) and choose the most target appropriate. This approach could be strengthened by
various rescoring algorithms, such as docking with AutoDock Vina and rescoring with NNScore, as
suggested by McCammon and co-workers.[73] To tackle the target-dependence challenge at its core, multiple research directions are currently underway. For example, testing SFCscoreRF[55] against
two common benchmark sets (PDBBind2007 and CSAR-NRC HiQ) ranked it among the best performing
functions with top prediction metrics for these sets (Table 3). However, a cross-validation test indicated
its target-dependent performance. The authors suggested that a possible direction for overcoming this challenge may lie in fine-tuning the applicability domain of the scoring function, specifically via a
rational choice of descriptors. To address the issue of target dependence conceptually, Ross et al.
formally and systematically analyzed a structure-based modeling process and demonstrated conclusively
that target-specific empirical scoring functions are on average more accurate than the very best generalized/universal models.[74] The authors further suggested that, if generalized functions are used, they could be optimized for a given target type by incorporating prior knowledge (e.g., known actives).
It will be extremely interesting to see whether this study heralds the beginning of the end for
development of universal empirical scoring functions. On the other hand, recent advances in QM
applications in scoring (refer to subsections “Theory-based functions” mentioned earlier and
“Combination of scoring with quantum mechanics calculations” in the succeeding paragraphs) indicate
that a universal theory-based scoring function is in principle possible but most likely is not
computationally tractable at the moment. It will be fascinating to watch further developments in this
area of research.


Water plays a critical role in the thermodynamics of ligand binding. Accurately accounting for
solvation/desolvation effects is still a significant challenge, which is mainly addressed by developing
methods that complement, rather than integrate into, the scoring processes described above.

Explicit solvent approaches

The WaterMap protocol[75] uses inhomogeneous solvation theory to estimate the free energy of
solvent displacement based on short MD runs in explicit solvent and clustering of resulting water
positions. A survey of WaterMap implementation studies was published recently by Yang et al.[76]
WaterMap is indeed a powerful protocol, when used appropriately. Namely, it is effective when applied to binding driven by hydrophobic interactions and is likely to struggle when the dominant nature of binding is electrostatic. For example, Kohlmann et al. studied an extensive congeneric series of SRC
tyrosine kinase inhibitors,[77] and WaterMap-generated energies failed to correlate well with the
energies determined experimentally (R2=0.55). However, when applied to a subset of ligands with the
increase in potency due to the addition of a hydrophobic substituent, the correlation improved
(R2=0.65). This finding is similar to the observations of Nurisso et al.,[78] related to the importance of
hydrophobic interactions when scoring with GOLD (for more details on Nurisso study, refer to
subsection “Systematic evaluation of popular and widely used docking programs and scoring functions”
in the succeeding paragraphs).

In the study of Kohlmann et al., WaterMap correctly identified main structure–activity relationship
trends of this ligand series, while it was outperformed by MM/GBSA in terms of ligand ranking (R2=0.68
and 0.80, respectively). The authors listed specific factors (ligand–protein interactions, ligand
conformation, and water-mediated interactions), which are not accounted for by WaterMap and which
are likely to decrease WaterMap usefulness for systems where such factors predominate. Thus,
WaterMap is useful for estimation of desolvation as a contribution to binding energy; however, it should
not be used as a sole scoring function, at least without a detailed analysis of the factors driving
interaction for a specific target-ligand system.

Another explicit water-based protocol was implemented in SZMAP, which uses a semi-continuum approach combining an explicit treatment of water with Poisson–Boltzmann surface area (PBSA) electrostatics. SZMAP was incorporated with RosettaLigand docking of ligands into six protein systems, as part of the CSAR 2012 benchmark exercise, using the CSARdock 2011 set.[79] It was found that, while inclusion of all experimentally identified water positions led to decreased performance (in terms of docking and ranking ability), carefully selected water molecules improved the reproduction of the crystal-structure binding modes. Waters were selected using neutral-difference free energy values (positive or near zero for displaceable waters and negative for tightly bound waters). Such water-mapping calculations provided a quantitative criterion for selecting water molecules that favor binding.
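The water-selection rule described above, positive or near-zero neutral-difference free energies for displaceable waters and negative values for tightly bound ones, reduces to a sign test. A sketch, where the near-zero tolerance is an assumed illustration rather than a published cutoff:

```python
def classify_waters(ddg_by_water, tol=0.2):
    """Partition binding-site waters by neutral-difference free energy
    (kcal/mol): values above -tol => displaceable, clearly negative =>
    tightly bound (retained). `tol` is an assumed near-zero band."""
    retained, displaceable = [], []
    for name, ddg in ddg_by_water.items():
        (displaceable if ddg > -tol else retained).append(name)
    return retained, displaceable
```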

Forli and Olson came up with an alternative concept to consider explicit waters during docking: the ligand hydration method.[80] In their modification of the standard AutoDock 4 function, water molecules are
not placed in specific positions within the binding site. Instead, they are attached to ligands prior to
docking and continuously evaluated for their effects on ligand–receptor interaction. If their influence is
stabilizing, they are retained, otherwise they are removed. This method does not require any prior
knowledge of the apo or holo protein hydration state and its main advantages are the following: (i) the
ease of determining water-binding positions on ligand structures (compared with target surfaces); (ii)
the consideration of ligand-specific hydration patterns; (iii) limiting calculations to relevant water
molecules (i.e., only those directly interacting with the ligand); and (iv) allowing for hydration of novel
structurally diverse ligands. The method yielded a 12% improvement in docking power. WaterDock,
based on AutoDock (specifically, AutoDock Vina), is a pipeline used to predict the following: (i) the
location of water molecules (with 97% accuracy); (ii) whether they are likely to be conserved or
displaced after ligand binding (with 75% accuracy); and (iii) the probability that predicted molecules will
be displaced by polar or non-polar groups (with 80% accuracy).[81] An interesting approach to the
treatment of waters in docking was implemented in S4MPLE (discussed in “On-the-fly methods”
subsection). It allows explicit waters to be included as ligands docked simultaneously with small
molecules/fragments, within a continuum solvent.[8] While the success rates were relatively modest at
56–69%, the advantage of this approach (i.e., the combination of implicit and explicit solvent treatment)
may be realized after further fine-tuning of S4MPLE. Other studies to account for explicit waters
included the following: development of the criterion based on the strength of binding to the receptor
and/or a reference ligand;[14] improving fragment docking;[82,83] effect on VS performance in the
context of various degrees of receptor flexibility;[27] improvement of RosettaLigand performance for
both protein-centric and ligand-centric docking, where waters are located relative to the protein and
ligand entities, respectively.[84]

Implicit solvent approaches (MM/PBSA and MM/GBSA calculations)

Rescoring docked poses with MM/PBSA and MM/GBSA improves ligand ranking and, marginally, binding
affinity prediction.[2,3] The accuracy of MM/GBSA and the factors affecting it were interrogated using a
set of 106 carefully selected complexes.[85] The correlation between MM/GBSA energies and
experimentally determined binding affinities was found to degrade with greater variability in the affinity
data and/or increased structural faults in the complex structures, either experimental or computational.
These findings demonstrate that improving pose prediction methods could improve the computation of
binding affinities using implicit solvation methods. Greenidge et al. tested the assumption that
MM/GBSA only works for a congeneric series of ligands. They used a diverse (non-congeneric) dataset
(largest to date: 855 complexes from PDBbind2009) to identify the reasons for MM/GBSA failure.[86]
They showed that it is possible to obtain MM/GBSA binding affinity values for a diverse set, which are
comparable to those obtained for congeneric series. They also demonstrated that the inclusion of water
molecules worsens the predictive quality, while the inclusion of ligand strain slightly increases the overall accuracy.
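Accuracy in these studies is reported as the squared correlation (R² or q²) between computed energies and experimental binding affinities. As a reference point for the metric itself, a minimal sketch in plain Python (the data values are illustrative only, not from any cited study):

```python
def pearson_r2(computed, experimental):
    """Squared Pearson correlation coefficient between computed binding
    energies and experimental affinities (paired, equal-length lists)."""
    n = len(computed)
    mean_x = sum(computed) / n
    mean_y = sum(experimental) / n
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(computed, experimental))
    var_x = sum((x - mean_x) ** 2 for x in computed)
    var_y = sum((y - mean_y) ** 2 for y in experimental)
    return cov * cov / (var_x * var_y)

# A perfectly linear relationship gives R^2 = 1.0.
r2 = pearson_r2([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

Degradation of this value with noisier affinity data or flawed complex structures is exactly the effect interrogated in the 106-complex MM/GBSA study above.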

To further improve the affinity prediction, MM/PBSA and MM/GBSA models can be augmented with
terms related to specific binding phenomena. For example, to better represent ligand solvation, Liu et al.
incorporated the polarized protein-specific charge model into MM/PBSA and, using eight test cases,
demonstrated the effectiveness of their approach for correctly ranking the docked poses.[87] They also
confirmed the importance of bridging water molecules for correct pose prediction. Another example of
such augmentation is the study by Chung et al., who used the π–π interaction energy terms alongside
MM/GBSA.[88] Using the benzimidazole Raf inhibitors as a case study, they obtained a correlation
between the experimental and calculated inhibitory activities with q2=0.56. Examples of using
MM/PBSA or MM/GBSA for ligand rescoring included the following: study of enzyme inhibitors[89] and
lectin-binding carbohydrates;[90] fragment docking;[91,92] porting to GPUs for faster calculation;[11]
evaluation of a range of targets;[93] and combining MM/GBSA with calculations of Jarzynski identity.[94]

Additional method for solvation

Nikolic et al. implemented their 3D reference interaction site model with the Kovalenko-Hirata closure
(3D-RISM-KH) in the 3D-RISM-DOCK protocol.[95] Specifically, they treated the ligand fragments as a
part of the solvent and mapped the density distributions of all atomic sites onto the spatial grid around
the protein. Following that, conventional docking with AutoDock and scoring with PMF were used. This
approach has several advantages in terms of transferability of its solvation part, which can account for
(i) competition between different ligands, (ii) the effects of ligand concentration and solvent
composition, and (iii) the thermodynamic state of a system. Another advantage of this protocol is
calculating free energy within the same statistical mechanical framework and the same force field as
used for docking.

Combination of scoring with quantum mechanics calculations

Quantum mechanics calculations offer
opportunities for improving scoring success in general and in particular for more challenging and
electronically complex binding sites (e.g., with metal ions or highly polar groups).[96] Apart from the
recent applications of novel theory-based scoring functions,[59,60] QM usually contributes to docking as
a rescoring tool. Recent examples in the succeeding paragraphs illustrate such applications, one each at
the ab initio and semiempirical QM levels of theory. Natesan et al. developed the QM/MM-
Linear Interaction Energy approach to account for multiple ligand tautomers and protomers.[44] They
applied single-point B3LYP/6-31G*/AMBER level of theory to 66 inhibitors of MAPK-activated protein
kinase. This protocol resulted in improving the correlation between calculated and experimental binding
affinities (R2) from 0.66 to 0.91. Mikulskis et al. compared three semiempirical methods (AM1, RM1, and
PM6) for rescoring ligand binding to avidin, factor Xa, and ferritin.[97] The best correlation between
calculated and experimental binding affinities was obtained for the AM1-DH2 method applied to ferritin
(R2=0.92). The authors also noted the target-dependent performance of these SQM calculations. Other
QM-derived properties used in conjunction with scoring and rescoring included the following: π–π
interaction energies;[88] partial charges[98] (empirical Gasteiger–Hückel charges were most suitable for
VS of large databases, AM1-BCC charges – for lead optimization and accurate VS of small databases);
and quantum entanglement contributions.[99]


New programs

A range of new programs have appeared in the literature in the last couple of years. Classification into a
“new docking program” (Tables 1 and 2) or a “new scoring function” (Table 3) is sometimes fuzzy; in this
survey, we made a call based on authors’ own designations and on whether additional features (beyond
the nature of the scoring function) are included in these publications. The new programs mostly address
the issues of computational efficiency,[5,100,101] tailoring for specific interaction types and
applications,[51,100–103] novel optimization algorithms,[4,5,104] and improved performance
(compared with earlier versions or other programs).[4,5,26,102,104] Novel features and evaluation
details are summarized in Tables 1–3. Several of the new programs are freely available (Table 4).

Fragment docking

Docking of fragments continued to be a hot topic in the docking field, particularly due
to the increased use of various ligand efficiency metrics. It has been shown that such metrics could be
used to rationalize the success rates of fragment docking.[105] SAMPL3 fragment-based VS challenge
(Journal of Computer-Aided Molecular Design 2012, Volume 26, issue 5) gave researchers a valuable
opportunity to test their programs, methods, and screening protocols in a blind testing environment using
bovine pancreatic trypsin inhibitors. Kumar and Zhang evaluated their VS protocol with RosettaLigand by
screening a 500-fragment Maybridge library.[106] They highlighted the importance of an appropriate
method for the calculation of partial charges. They also confirmed that using multiple receptor
ensembles in docking does not always yield better enrichment than individual receptors, a finding
previously observed in other studies.[3] Surpateanu and Iorga assessed the ability of several
combinations of docking programs (GOLD and Glide) and scoring functions (GoldScore, ChemScore,
ChemPLP, standard precision (SP), and extra precision (XP)) to predict the binding of a fragment-like
library.[107] The combination of GOLD with GoldScore, with or without rescoring, was the most suitable
protocol, with good results for the SAMPL3 dataset and enrichment factors of about 10 for Top 20
compounds. Sulea et al. evaluated the combination of their docking program Wilma, the solvation
interaction energy for scoring, and the solvation model (first shell of hydration, which captures some of
the discrete properties of water within a continuum model).[108] Good VS enrichment was achieved
with ROC AUC of approximately 0.7. The early enrichment was also quite good: 50% of true actives
recovered with 15% and 3% false positive rates in prospective and retrospective calculations,
respectively. The predicted binding affinities were generally ≤2 kcal/mol away from the experimental
values, but the rank ordering of affinities differing by ≤2 kcal/mol was not well predicted. Merz and co-workers
tested versions of the empirical scoring functions LISA and LISA + (a total of 11 methods),[109] which
showed relatively low absolute errors but also low correlation with experiment. Overall, the SAMPL3 challenge
revealed the difficulties in predicting binding affinities of small molecular fragments and the benefit of
using quality training data. It showed that success could be increased significantly via meticulous
selection of receptor structures, proper consideration of flexibility, accurate assignment of partial
charges, and sophisticated solvation models.
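Several of the VS evaluations above are summarized by the ROC AUC (e.g., the value of approximately 0.7 reported for Wilma with the solvation interaction energy). For reference, a minimal sketch of that metric via the pairwise (Mann–Whitney) formulation, with toy score lists; this is not the scoring pipeline of any cited program:

```python
def roc_auc(active_scores, decoy_scores):
    """ROC AUC computed as the fraction of (active, decoy) pairs in which
    the active is scored better; lower (more negative) docking scores are
    treated as better, as is conventional for energy-like scores."""
    wins = 0.0
    for a in active_scores:
        for d in decoy_scores:
            if a < d:
                wins += 1.0
            elif a == d:
                wins += 0.5
    return wins / (len(active_scores) * len(decoy_scores))

# Perfect separation of actives from decoys gives AUC = 1.0;
# random scoring tends toward 0.5.
auc = roc_auc([-10.0, -9.0], [-5.0, -4.0])
```

Early-enrichment figures (e.g., the fraction of true actives recovered at a fixed false positive rate) probe the left edge of the same ROC curve rather than its total area.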

S4MPLE (Tables 1 and 2) was used to dock a dataset of fragment ligands, and the success rates were
found to be on par with those for drug-like molecules.[50] However, although experimental poses were
often generated, they were not always scored preferentially. To create a fragment-based drug design
workflow,
S4MPLE was combined with GenLinkersDB, to create a linker library by fragmenting a compound
database, and JMolEvolve, to build novel compounds by attaching chemically compatible linkers to the
starting fragments.[50] S4MPLE was thus used to probe the optimal placement of the linkers within the
binding cavity, based on the initial fragment restraints. Several examples from the fragment-based drug
design literature were used as test cases with good results for pose prediction of reference ligands,
ability to produce the optimized ligand or its close analogs, and enrichment of analogs against
decoys. Following on the earlier work of Verdonk et al., where the reasons of docking failures were
compared for fragments and small molecules,[105] Vass et al. used Glide to further investigate factors
affecting success rates for the placement of linked fragments in sequential fragment docking.[82,110]
Using 129 protein–ligand complexes, they tested three sampling protocols (SP, XP, and SP hard (SP
without scaling ligand atom radii)) and three different scoring functions (GlideScore, Emodel, and
GlideEnergy) as well as other factors: effects of ligand number, docking order, drug-likeness, and
closeness of the binding sites.[82] The average success rate of 36% increased to 50% for docking drug-
like ligands into closed binding sites by the SP hard protocol. Glide was able to reproduce the positions
of multiple bound fragments if conserved water molecules were retained. For 32 complexes
representing 18 targets,[110] it was found that, in the docking of the first fragment, only two fragments
were not ranked top out of the 32 cases. However, in the docking of the linked fragment, the scoring
function was unable to score the experimental binding mode as top in seven cases. Two possible
reasons for the lower success rate of the second stage were suggested: (i) the inaccuracy of the prediction
of the pose of the first fragment, which translates into greater error in the second, especially if the first
docked fragment partially overlaps with the experimental binding mode of the second, and (ii) intrinsic
difficulty of identifying the correct binding mode at secondary sites due to primary fragments exploiting
the specific interactions of the protein hotspot, leaving the secondary site usually less attractive and
therefore more challenging to predict accurately. Beccari et al. addressed the issue of incorporating
synthetic feasibility of de novo ligands into the LiGen workflow via accurate and flexible reactant mapping.
[53] The advantage of LiGen is in the user-controlled prioritization of the chemical reactions through the
definition of a probability of the following: (i) a reaction class and (ii) reactive groups, with the most
reactive group having a greater chance of being selected by the system for incorporation into the final
molecule. Other studies showed that fragment docking may be improved by the following: MM-PBSA
rescoring;[91,111] a combination of structure-based and ligand-based screening;[112] protein mapping
with FTMap;[113] templating of fragment ligands on known structures;[114] and GPU-accelerated MD.

Protein–protein docking

While protein–protein docking is in principle similar to protein–small molecule docking, specialized
programs are usually used because of the greater complexity of the systems. These
are regularly evaluated within the Critical Assessment of Prediction of Interactions (CAPRI) community-
wide comparative evaluations. In 2013, the fifth CAPRI meeting was held, and the results for the eight
rounds (20–27) held during the years of 2010–2012 were published in a special issue of Proteins (2013,
Volume 81, issue 12) and well analyzed.[115,116] Of the 15 targets during these rounds, 10 targets
represented classical docking and scoring problems. Round 27 was the first to contain a protein–
polysaccharide complex with a pleasing 50% of predictions producing at least an acceptable model.[116]
Online servers did surprisingly well. For example, a new server SwarmDock[117] (Table 4) participated
after round 20 and predicted at least one correct model for each target. Following rounds 20–27, areas
that still need work are the following: predicting the solvation of interfaces (where there is a tendency
to over-solvate) and affinity prediction (where no single group was able to predict the binding affinities
of interfaces). These bottleneck areas parallel similar problems in protein–small molecule docking,
discussed in the “Scoring” section.

Docking methods and scoring functions

One of the main approaches
in protein–protein docking is Fast Fourier Transformation (FFT), which efficiently searches 6D space on
rigid receptors and ligands to find low energy conformations. These methods are well established and
used in programs such as KBDOCK, which implements the Hex algorithm.[118] A recent advance in this
methodology is the parallelization (e.g., with MEGADOCK[119]), which allows tackling large-scale
biological protein–protein interaction (PPI) networks.

FFT methods are often combined with algorithms that allow for flexibility (e.g., versions of
AutoDock[9,10]). Two flexibility protocols were recently incorporated into the Rosetta suite.
ReplicaDock is an expansion of RosettaDock, which uses the Metropolis–Monte Carlo temperature
replica exchange method.[120] This method generated a significantly greater fraction of near-native
conformations for the set of 30 proteins than the shotgun sampling method previously employed by
RosettaDock, at approximately the same computational cost. However, where the native binding
surface area was not contiguous, exhibiting patchy sections of interactions, ReplicaDock did not
perform as well and predicted more deeply buried binding surface areas. The second protocol
identifies flexible loops in the proteins prior to docking.[121] Using 28 test cases, including three
particularly difficult CAPRI targets, it predicted almost twice as many acceptable models as other
methods.
Other developments in protein–protein docking included the following: review of coarse-grain modeling
of PPIs, particularly relevant for large-scale systems such as membrane complexes;[122] benchmarking
of 115 scoring functions;[123] formulation of a highly efficient manifold rigid body minimization
algorithm;[124] and investigation of optimal workflows for alignment of docking and scoring.[125]

Binding site prediction

For protein–protein docking, the binding sites are often not clearly defined pockets, and the greater the
information about the binding site, the higher the chance of getting the correct docking solution.
However, there is growing evidence that, in most cases, protein–protein interfaces are within 6 Å of
known ligand binding sites.[126] Advances in technologies and the growing understanding of protein–
protein interaction hotspots[127] are allowing for some significant improvement in binding site
prediction programs. Xwalk, implemented into Rosetta, capitalized on experimental data from cross-
linking mass spectrometry, which resulted in up to 5 Å average rmsd improvements in pose prediction.[128]
Lopes et al. docked over 300,000 conformations per protein pair for the set of 28,224 possible pairs
(168 proteins of the Mintseris Benchmark 2.0) in a large and comprehensive cross-docking study.[129]
They used a MAXDo (molecular association via cross-docking) algorithm and found that by combining a
simple docking algorithm with evolutionary information, they could discriminate interacting from non-
interacting proteins, while not being able to correctly predict the exact interactions between the two
proteins. This method is a significant improvement over other methods reported thus far, which
have been based purely upon shape complementarity.

Targeting protein–protein interfaces

From a docking point of view, PPIs are in principle similar to traditional drug targets and were shown to
be amenable to docking.[130] However, compared with traditional target inhibitors, PPI inhibitors are
approximately 30% less likely to be identified, postulated to be because of differences in properties,
such as a greater exposed surface area compared with traditional ligands.[131] Villoutreix et al. examined
115 PPI inhibitors and found that these compounds tended to have increased hydrophobicity,
aromaticity, and molecular weight.[132] Such dissimilar inhibitor profiles prompted a design of specific
PPI-focused libraries such as those found at Asinex, ChemDiv, and Otava.[127] Finding druggable
pockets on PPIs is also a challenging task because they tend to be more flexible, more transient, and
often do not have a known ligand.[133] MD simulations can be used for tasks such as the prediction of
hotspots and transient pockets.[134] However, Metz et al. found that computationally cheaper
algorithms such as Framework Rigidity Optimized Dynamics Algorithm outperformed MD when
predicting hotspots and transient pockets albeit, at this stage, only on the IL2:IL2 receptor interface.
[135] It will be interesting to see how this program performs on alternate targets in the future. Another
alternative to MD is a combination of ligand-based pharmacophores with hotspot prediction, which was
shown to outperform either method in over 50% of the tested systems.[136] A variation on this method
was also used to find inhibitors of the p53-MDM2 interaction with an outstanding reported hit rate of
40%, which was partly due to the novel library design strategy, termed “Rule-of-3-for-PPIs.”[137]

Reverse docking

Reverse docking is used to predict the target(s) of a compound by virtually screening a library of
receptor structures. While it is conceptually valid, the success of reverse docking is significantly
undermined by the strong tendency of the existing scoring functions to be target dependent (see in the
“Target dependence” subsection discussed earlier), arguably even more so than the standard “forward”
docking, where a scoring function can in principle be tailored for a particular target or target type. For
reverse docking, a scoring function must be as target independent as possible. Wang et al.
have addressed this issue by testing Glide in the reverse docking scenario.[71] Using 58 complexes from
refined Astex diverse dataset, they found that Glide (with GlideScore, Emodel, or EnergyScore) was only
57% successful in identifying a correct protein–ligand pair. This poor performance was ascribed to
“interprotein noises”, that is, to overestimation and underestimation of the scores for specific proteins.
The uneven performance was attributed to a disproportionate distribution of binding site features
among the proteins of the dataset. Using the decision tree algorithm, binding site features of the
proteins were analyzed and related to the successes and failures of these proteins when used in reverse
screening. Following this analysis, a correction term for GlideScore was derived, based on the ratio of
the relative hydrophobic and hydrophilic character of the binding site. Introduction of this correction
term led to the increase of target prediction accuracy to 72%. While a specific correction term was
developed in this study for GlideScore, it is not likely that the same term will work for other functions.
Therefore, function-specific evaluations would be required to correct those functions for interprotein
noises. However, the general outcome of this study is significant: an effective pathway to docking
improvement may require the tailoring of a scoring function for a specific objective, here – reverse
docking. This conclusion is similar to that reached for the revised HYDE scoring function (Table 3) and its applications to binding
affinity prediction versus VS.[72] Other inverse docking investigations included the following: the virtual
target screening method calibrating a set of small molecules against a protein library[138] and
prediction of activity of 656 marketed drugs on 73 unintended “side effect” targets.[139]

Consideration of multiple docking solutions

Consideration of multiple docking solutions can significantly increase the likelihood of determining the
native binding pose, usually represented by an experimental pose. However, similar to the conundrum
of using receptor ensembles for docking (refer to subsection “Ensemble-based methods” discussed
earlier), the justification of the exact range of poses to be used is still unclear. To address this issue,
Grigoryan et al. proposed to use the energy gap as the metric allowing the consideration of the
ensemble of docked poses.[140] Energy gap, which is the difference between the energy of the best
ranked docking pose and the average energy of all docking poses, is inferred to be a measure of the
binding energy landscape sharpness. In retrospective VS against 38 protein targets, using ICM and DUD,
the energy gap outperformed the docking score in its ability to discriminate true binders from decoys
and led to a significant increase in the success rate of binding pose prediction. An alternative,
interaction-based approach to the exploitation of the pose ensemble was developed by Yuriev, Ramsland,
and co-workers.[141–143] In this method, AutoMap[142] – the site mapping technique – drives LigPlot to
determine the hydrogen bonding and van der Waals interactions taking place between a target protein
and each pose of a ligand ensemble. It then tallies these interactions per protein residue, normalizes the
tallies, and maps them to the protein surface. The residues involved in the interactions are selected
according to specific cutoffs. The procedure has been demonstrated to perform well in studying
carbohydrate–protein and peptide–antibody recognition.
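The energy gap metric of Grigoryan et al. has a compact definition: the energy of the best-ranked pose minus the mean energy over the whole pose ensemble. A minimal sketch of that arithmetic (toy energies, not ICM output):

```python
def energy_gap(pose_energies):
    """Energy gap of a docked pose ensemble: energy of the best (lowest-energy)
    pose minus the mean energy of all poses. A more negative gap indicates a
    sharper binding energy landscape, which the study above found to
    discriminate true binders from decoys better than the raw docking score."""
    best = min(pose_energies)
    mean = sum(pose_energies) / len(pose_energies)
    return best - mean

# One pose well below the rest of the ensemble gives a large negative gap.
gap = energy_gap([-10.0, -6.0, -5.0, -4.0])
```

In a VS setting, compounds would then be ranked by this gap instead of (or alongside) the best-pose score.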

Docking into homology models

Docking or screening can be undertaken against a homology model of a target protein. While this is not
uncommon, the additional uncertainties, thus introduced into the docking protocol or a VS workflow,
decrease the confidence in the identified hits. Several research directions are being pursued to increase
the likelihood of identifying reliable drug candidates against the model targets. First, scoring functions
can be developed, which are less sensitive to distortions within target structures. One such scoring
function developed recently is eSimDock[64] (Table 3). It shows remarkably high tolerance to structural
deformations in target structures and thus represents a practical strategy for proteome-wide VS.
Second, numerous studies have been carried out to investigate docking/VS outcomes for a range
of protein targets by comparing results for crystal structures and homology models. For example,
Kaufmann and Meiler developed the RosettaLigand protocol[7,144] and benchmarked it for docking into
homology models of nine proteins from the eighth round of the Critical Assessment of protein Structure
Prediction.[145] Twenty-one additional complexes were selected
to cover a wider space of chemotypes. In 21 of the 30 cases, RosettaLigand successfully found a correct
pose among the top 10 poses. Careful template selection based on ligand occupancy provided the best
chance of success while overall sequence identity between template and target did not appear to
improve the results. Interestingly, a related conclusion with respect to template selection was reached
in a GPCR modeling study, where homology models were evaluated via VS.[32] Given the recent rise in
the number of crystal structures of GPCRs, it has been questioned whether “crystal structures obviate
the need for theoretical models of GPCRs” for structure-based VS.[146] Nguyen et al.[147] and Tang et
al.[146] investigated this query thoroughly and systematically. On average, top ranked receptor
homology models (using templates with ≥50% sequence identity) were found within 2.9 Å of the target
structures.[147] The application of knowledge-based and energy-based filters significantly improved the
quality of pose prediction. Particularly influential were the following:
structure–activity relationships of known actives, mutational data as experimental constraints, and the
knowledge of specific ligand–protein contacts. Tang et al. compared the historical models (i.e., built
prior to the release of the crystal structure) of β2 adrenoreceptor to the crystal structure, in terms of
their VS performance.[146] Several models gave VS success rates exceeding those using X-ray structures,
suggesting that knowledge-based and carefully built models capture critical chemical and structural
binding site features, and thus may be even more useful for practical structure-based drug discovery
than X-ray structures. Examples of systems where homology models were used as targets included the
following: nicotinic acetylcholine receptors;[148] p-glycoprotein;[149] protein kinase R-like endoplasmic
reticulum-localized eIF2α kinase;[140] hERG K+ channels;[150] GPCRs;[151–156] and virulence regulator

Hybrid approaches


To address target-dependent performance (discussed in detail in the “Target dependence” subsection),
the use of filtering out or penalizing pose decoys (non-native docking poses) was proposed in order to
reduce the number of false positives in VS.[158] Tropsha and co-workers demonstrated that this hybrid
of the force field-based scoring function (MedusaScore) and a target-specific pose filter gave significant
improvements in VS hit rates in 12 out of 13 benchmark sets from DUD. Consensus docking – a concept
similar to consensus scoring – was devised by Houston and Walkinshaw.[49] They used a test set of 228
complexes from the PDBbind-CN database and redocked ligands cognately using AutoDock 4 and
AutoDock Vina. The success rates of the two programs were found to be 55% and 64%, respectively.
However, combining the outputs of the two programs and selecting only concordant poses (i.e., differing
by ≤2 Å) increased the success rate to 82%. To reduce the risk of false positives, particularly undesirable
in a VS context, the authors optimized the rmsd cutoff for selecting the matching poses from the two
programs. While combining outputs from multiple programs is not an entirely new approach, using rmsd
matching as a prescoring filter is novel and acts as an amplification of pose prediction accuracy.
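The concordance test at the core of consensus docking reduces to an rmsd comparison between the top poses from two programs. A minimal sketch, assuming matched atom ordering between the two pose files and using the 2 Å cutoff from the study (the coordinates are toy values, and no fitting/superposition is performed):

```python
import math

def rmsd(pose_a, pose_b):
    """Root-mean-square deviation between two equal-length lists of
    (x, y, z) atom coordinates, assuming identical atom ordering."""
    squared = sum((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
                  for (ax, ay, az), (bx, by, bz) in zip(pose_a, pose_b))
    return math.sqrt(squared / len(pose_a))

def concordant(pose_a, pose_b, cutoff=2.0):
    """Accept a pose pair only if the two programs agree to within the
    rmsd cutoff (2 Angstrom in the Houston and Walkinshaw study)."""
    return rmsd(pose_a, pose_b) <= cutoff

# Two near-identical poses pass the filter; a translated pose does not.
close = concordant([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
                   [(0.5, 0.0, 0.0), (1.5, 0.0, 0.0)])
```

Real implementations typically also handle symmetry-equivalent atom mappings, which this sketch omits.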

Combination of docking with ligand-based approaches

Docking outcomes can be improved by combining docking with ligand-based methods. In a range of studies,
docking was combined with electronic reactivity and multiple-instance learning,[159] molecular
similarity-based protocol,[160] shape matching with Rapid Overlay of Chemical Structures (ROCS)[25]
and Shape Signatures,[161] pharmacophore mapping and comparative binding energy analysis,[162] and
2D-QSAR.[163] While in these studies investigators demonstrated an improved performance for affinity
and/or pose prediction, the clarity of which protocols should be combined and via which algorithms was
lacking. To address this gap in guidelines, Svensson et al. evaluated five different data fusion algorithms
(sum rank, rank vote, sum score, Pareto ranking, and parallel selection), combining data from docking
(Glide), pharmacophore searching (Phase), shape similarity (ROCS), and electrostatic similarity (EON).
[164] For the 16 datasets, the best performing was parallel selection, but both rank voting and Pareto
ranking also showed good performance. While this evaluation revealed that structure-based and
ligand-based data fusion improves the quality of compound ranking in VS compared with using single
methods, no single fusion algorithm consistently outperformed the others. Instead, large
differences between the datasets were observed, pointing yet again to target dependence as an Achilles
heel of docking and/or scoring (discussed in more detail in the “Target dependence” subsection). The
systems, for which docking was combined with ligand-based methods, were the following: hERG K+
channels (QSAR);[150] microtubules (Comparative Molecular Field Analysis (CoMFA) and Comparative
Molecular Similarity Indices Analysis (CoMSIA));[165] fragment-sized ligands;[112] GPCRs;[154,155] and
protein–protein interaction inhibitors.[137]
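Of the fusion algorithms evaluated by Svensson et al., sum rank is the simplest to state: re-order compounds by the sum of their per-method ranks. A minimal sketch of that idea (the compound labels and ranks are illustrative, and ties are broken alphabetically for determinism; this is not the Svensson et al. code):

```python
def sum_rank_fusion(rank_lists):
    """Sum-rank data fusion: each method contributes a dict mapping a
    compound id to its rank (1 = best); compounds are re-ordered by the
    sum of their ranks across all methods, best (lowest total) first."""
    totals = {}
    for ranks in rank_lists:
        for compound, rank in ranks.items():
            totals[compound] = totals.get(compound, 0) + rank
    return sorted(totals, key=lambda c: (totals[c], c))

# Hypothetical ranks from a docking run and a shape-similarity run.
docking_ranks = {"A": 1, "B": 3, "C": 2}
shape_ranks = {"A": 2, "B": 1, "C": 3}
fused = sum_rank_fusion([docking_ranks, shape_ranks])
```

The other algorithms in the study (rank vote, sum score, Pareto ranking, parallel selection) differ only in how the per-method lists are merged, not in this overall shape of the computation.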

Combination of docking with experimental data/knowledge

Docking outcomes can be significantly improved if combined with knowledge from experiment. For
example, constrained Glide docking led to improved prediction of binding affinity of nicotinic ligands
docked to nicotinic acetylcholine receptors.[148] The specific constraint was a hydrogen bond between
the cationic center of the ligand and the backbone carbonyl of the conserved Trp-149. Another way to
complement docking was developed by Orts et al. who efficiently used interligand NOEs to filter and
rank docked poses.[166] These nuclear magnetic resonance-derived experimental restraints improved
the accuracy of docking by two orders of magnitude. Other systems, for which docking was combined
with experimental methods, included the following: an enzyme essential for Pseudomonas aeruginosa
quorum sensing apparatus PqsD (isothermal titration calorimetry, surface plasmon resonance, and STD-
NMR);[167] standardized datasets (X-ray);[26] kinesin-5 (STD-NMR); protein kinase A (NMR-derived
interligand NOEs);[168] protein–protein complexes;[128] and GPCRs.[169]

Open access

Numerous docking and VS programs and other resources are publicly available. Many such tools are
briefly reviewed in the studies of Jacob et al.[170] and Villoutreix et al.,[171] and the recent additions to
this arsenal are listed in Table 4. This list is by no means comprehensive or complete; it is merely illustrative of the recent developments.


Several excellent analytical and critical reviews and surveys of VS appeared in 2012–2013. The
state of the art in structure-based VS was reviewed[172] and systematically evaluated by surveying 279
prospective case studies published in July 2011.[173] It was observed that neither high resolution of the receptor structures
nor the sophistication of the scoring functions was actually a pivotal factor determining the success of VS.
[173] Instead, other factors played more important roles in compound selection for testing: scientific
expertise, chemical intuition, and subjective compound rankings. Scior et al. also surveyed the recent VS
literature and cataloged the problems, failures, and technical traps of VS methods.[174] Zhu et al. critically
analyzed VS results published between 2007 and 2011 and made several practical recommendations for
the selection of compounds for hit identification, optimization, and experimental testing. Their key advice
was to use size-targeted ligand efficiency values as a hit-identification criterion.
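The ligand efficiency metric behind this recommendation can be illustrated numerically. This is a minimal sketch assuming the standard definition LE = −ΔG/N_heavy at 298 K; the affinities and heavy-atom counts below are invented:

```python
# Sketch of the ligand efficiency (LE) metric: binding free energy
# normalized per heavy atom. Example Kd values and atom counts are invented.
import math

R_KCAL = 0.001987  # gas constant in kcal/(mol*K)

def ligand_efficiency(kd_molar, n_heavy_atoms, temp=298.15):
    """LE in kcal/mol per heavy atom; larger is better."""
    dg = R_KCAL * temp * math.log(kd_molar)  # negative for Kd < 1 M
    return -dg / n_heavy_atoms

# A 1 uM fragment with 14 heavy atoms vs a 10 nM lead with 38 heavy atoms:
le_fragment = ligand_efficiency(1e-6, 14)
le_lead = ligand_efficiency(1e-8, 38)
print(round(le_fragment, 2), round(le_lead, 2))  # → 0.58 0.29
```

Despite its 100-fold weaker affinity, the fragment has roughly twice the ligand efficiency of the lead, which is exactly the kind of size-aware comparison the recommendation above is aimed at.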

Automation/integration/parallelization/distributed computing

The focus of VS research in recent years has been the development of automated integrated approaches
for seamless progression through the various steps of VS workflows. Such workflows are highly valued in
the pharmaceutical industry because they allow medicinal chemists, who are usually not computational
chemistry experts, to implement many of the steps involved in VS. Therrien et al. have developed
FORECASTER, a web-based platform (Table 4).[175] FORECASTER integrates programs for the following:
(i) processing ligand molecules, including structure generation, filtering, and building combinatorial and
focused libraries; (ii) processing macromolecules, including file conversion and setup for docking; and
(iii) docking with Flexibility Induced Through Targeted Evolutionary Description (FITTED). FORECASTER
was very successful at the lead identification stage for finding selective estrogen receptor modulators.
Since VS uses ever-increasing database inputs, computational capability has always been a limiting
factor. Cloud computing has the potential to reduce, if not eliminate, this bottleneck.[176] Several
algorithm designs and benchmarking exercises have recently used cloud computing and GPUs for VS.
Versions of AutoDock were implemented for intensive simulations on the Venus-C platform (Table 4),
which made use of Windows Azure-based cloud resources,[177] and in the new workflow data pattern
called self-adaptive multiple instances, which made use of a middleware built on Amazon EC2
instances.[37] Another VS application implemented in the cloud is BINDSURF (Table 4),[178,179] an
efficient and fast blind method for the determination of protein binding sites, which uses the massively
parallel architecture of GPUs for unbiased prescreening of large ligand databases. Other algorithms
taking advantage of the speedup offered by GPUs were developed by Khar et al.[180] and Heinzerling et
al.[181] Several recently designed workflows are particularly aimed at bringing together structure-based
approaches with ligand-based in silico methods (Table 5).

Screening databases and decoy sets for virtual screening benchmarking

ZINC remains the key resource for VS.[182] At the time of writing, the ZINC website contains >35 million
purchasable compounds in downloadable, ready-to-dock 3D formats. The ZINC website enables
searching by structure, name, Chemical Abstracts Service number, biological activity, physical
properties, vendor name, and catalog number. ZINC allows the creation of small custom subsets as well
as their editing, downloading, sharing, docking, and conveying to a vendor for purchase. To address the
issue of rational filtering and determining the optimal size of the database, Baell presented an extensive
battery of functional group filters – pan-assay interference compounds (PAINS).[183] He showed the
rationale of using functional group filters in constructing high-quality, general-purpose high-throughput
screening libraries. These filters reduced the database of commercially available compounds
(representing ~80% of the available lead-like space) to fewer than 100000 compounds remaining in the
selection pool. Several new advances occurred in benchmarks for VS testing, particularly in new and
improved decoy libraries. With DUD being widely used for VS evaluation, the DUD-Enhanced (DUD-E)
benchmarking set was generated, which includes 102 diverse targets with 22886 clustered ligands from
ChEMBL09, each with 50 property-matched decoys drawn from ZINC.[184] For chemotype diversity,
each target’s ligands were clustered, and net charge was added to the matched physicochemical
properties. Improved matched decoys can also be generated for user-supplied ligands. DUD-E was
tested by docking to all 102 targets, and the results were used to improve the balance between ligand
desolvation and electrostatics in DOCK 3.6. The developers of DEKOIS 2.0, based on BindingDB
bioactivity data, provided 81 new and structurally diverse benchmark sets for a wide variety of target
classes.[185] The original DEKOIS library was improved with enhanced physicochemical matching (now
considering molecular charges) and with a more sophisticated elimination of latent actives in the decoy
set. Physicochemical matching was also implemented in the GPCR ligand library for 147 targets, with 39
decoy molecules matched to each ligand, producing the GPCR Decoy Database (GDD).[186] Decoys
were matched to ligands based on six physical properties (molecular weight, number of rotatable
bonds, hydrogen bond donors and acceptors, logP, and formal charge), while ensuring ligand–decoy
chemical dissimilarity. The docking performance of the GDD was evaluated on 19 GPCR structures,
demonstrating a marked decrease in bias compared with uncorrected decoy sets. Several of the
screening libraries, decoy sets, and associated tools are freely available (Table 4).
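The property-matching idea underlying decoy libraries such as DUD-E and the GDD can be sketched as a nearest-neighbor search in a normalized property space. This is only an illustration: the per-property scale factors, property values, and candidate pool below are invented, and real protocols additionally enforce 2D chemical dissimilarity between decoys and actives:

```python
# Sketch: pick decoys whose physicochemical profile best matches a ligand,
# using the six DUD-E/GDD-style properties. All values below are invented.

PROPS = ("mw", "rot_bonds", "hbd", "hba", "logp", "charge")
SCALE = {"mw": 100.0, "rot_bonds": 2.0, "hbd": 1.0,
         "hba": 2.0, "logp": 1.0, "charge": 1.0}  # rough per-property scales

def prop_distance(a, b):
    """Scaled Euclidean distance in the six-property space."""
    return sum(((a[p] - b[p]) / SCALE[p]) ** 2 for p in PROPS) ** 0.5

def pick_decoys(ligand_props, pool, n=2):
    """Return the n pool members with the most similar property profiles."""
    return sorted(pool, key=lambda c: prop_distance(ligand_props, c["props"]))[:n]

ligand = {"mw": 320.0, "rot_bonds": 4, "hbd": 2, "hba": 5, "logp": 2.8, "charge": 0}
pool = [
    {"id": "d1", "props": {"mw": 310.0, "rot_bonds": 5, "hbd": 2, "hba": 5, "logp": 2.5, "charge": 0}},
    {"id": "d2", "props": {"mw": 480.0, "rot_bonds": 9, "hbd": 4, "hba": 9, "logp": 5.1, "charge": 1}},
    {"id": "d3", "props": {"mw": 330.0, "rot_bonds": 4, "hbd": 1, "hba": 6, "logp": 3.0, "charge": 0}},
]
print([d["id"] for d in pick_decoys(ligand, pool)])  # → ['d1', 'd3']
```

Matching decoys to actives in this way removes trivial physicochemical cues (e.g., molecular weight or charge differences) so that enrichment reflects the scoring function's ability to recognize specific interactions rather than bulk properties.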

the views of the academic milieu, and only a scarce fraction of the protein–ligand docking applications in
the pharmaceutical industry, as most of the research conducted in the latter is not publicly
available and does not get published. Several factors can be associated with this popularity. The price of
the program is naturally an important issue. Open-source alternatives and programs that are made
freely available to academic institutions tend to receive a higher number of citations than those that
require a paid license. Even among the latter, there can be large differences in price between
software alternatives, which are reflected in the number of users, although this can also be affected by
marketing efforts. Another set of issues bearing on the number of citations associated with a
given program involves its ease of installation and use, the existence of support, and the availability of
adequate tutorials that help a user make the most of the program. On top of all
these issues there is, of course, the quality of the program, its range of application, the variety and
quality of the available scoring functions and search algorithms, the associated computational times, etc.
Despite these potential limitations, the number of citations, when used with care, presents a useful way
to identify and track emerging trends within the rapidly evolving field of protein–ligand docking.

Evolution in the Last 10 Years

Figure 2 shows the evolution of the number of citations per year of the seven
most cited protein–ligand docking programs over the last 10 years, together with their relative percentages
in terms of citations per year. The results show that AutoDock was the top-cited protein–ligand docking
software throughout the last decade, reaching a level of around 500 citations per year. In addition, the
results show that while in 2001 its lead over the second most cited alternative, DOCK, was
only a few citations, by 2010 its lead over the second most cited docking program, GOLD,
grew to close to 200 citations per year. In the past five years, its relative share of citations among the
top-cited alternatives was maintained at 36–37%, indicating a stable and very significant “market
share”. Between 2001 and 2011, DOCK went from being the second most cited program to fourth
place, behind GOLD and Glide, while keeping close to an average of 150 citations per year.
GOLD has been, throughout this period, the most cited commercially available docking program. While
between 2001 and 2007 GOLD’s main competitor among paid alternatives was FlexX, Schrödinger’s
Glide has since emerged as its most cited competitor. Nevertheless, GOLD has been able to secure over the
past five years a “market share” of 20–23% among all the most cited alternatives, while Glide is currently
at 17% and FlexX at 9%. FTDOCK and QXP represent only 3% and 1%, respectively, of the total number of
citations per year of the seven most cited docking alternatives. Globally, these results show that
AutoDock has been dominating the competition, in terms of number of citations,