FIGARO: A Fast and Interpopulational Genetic Algorithm For Receptor Optimization

FIGARO - A new evolutionary strategy for receptor optimization
Dissertation presented in
Fulfillment of the requirements
for the degree of Master of Science in
Bioinformatics
Promotors:
Prof. Bart De Moor
Prof. Yves Moreau
ESAT - STADIUS. Stadius Centre for Dynamical Systems.
Signal Processing and Data Analytics
January 2017
Pieter Noyens
"This dissertation is part of the examination and has not been corrected after defence for
eventual errors. Use as a reference is permitted subject to written approval of the promotor
stated on the front page."
Pieter Noyens
pieter.noyens@student.kuleuven.be
Foreword
More than any other project I came across during my studies, this work
taught me what it's like to work on something really big and how to organize
the process from end to end. Key to this is trying to keep a healthy pace in
the development process without closing your eyes to upcoming problems and
concerns. At first, I did not manage to find this balance. While I was very
happy to be able to come up with an idea myself, I also found myself floating
in a vast space of naive beginner's optimism, which was soon followed by a
darker feeling of uncertainty and decreased confidence in success and in
myself in general. Looking back on that time now, I never thought this would
have such a big impact on me. But whenever I met with my daily supervisor
Dusan Popovic, he told me that he was deeply surprised by what I had been
able to implement in the time that had passed. For this I would like to thank
him in particular, as he created a whole new wave of courage by just speaking
these few words. So here you have it. Thank you, Dusan. I hope you will
continue to motivate other people around you and I wish there were more
people doing the same thing. It's often not the ability to do things but the
lack of motivation that stands in the way of progress.
Getting myself back together took a while, as there were other unexpected
things happening in my life and that of my closest relatives. But no matter
what, the amount of help and support was incredible. I would like to thank my
great friends and family for being there when I needed them the most. Besides
those days, I can also look back with joy on the times we worked together in
the library. Those moments meant a lot to me and are the reason I finally
figured out that work and play can ultimately be combined. We called
ourselves The Colleagues and found support in each other during the year
and the examination periods. Marlies, Birger, Stijn, Dani, Bram, Daan and
others, thank you for that!
2016 was also the year I met Eva. On the other side of the world, amid
the heights of the Argentine mountains, she walked into my life. To this
day, I'm honored to experience the positivity that radiates from her at all
times, which is a major drive in my work.
As a last word of appreciation I want to thank my promotor Bart De
Moor and co-promotor Yves Moreau. They showed the flexibility to let me work
out a strategy on my own and gave me the opportunity to implement it in
complete freedom. This would later turn out to be the most educational
experience I've had in my life. I would fall, but also rise again. This is
the work that I am proud to present to you.
Abstract
With a better understanding of biochemical processes, recent advances in
artificial intelligence and the availability of tremendously growing computational power, the field of biological engineering is currently facing an era of
major breakthroughs. Full simulations of biochemical systems have made it
possible to fine-tune metabolic networks to a high degree, which has been
shown to yield highly optimized microbial strains with maximized
production yields of the desired industrial compounds. [1, 2] An important
domain of research in this field is the development of new artificial enzymes
to speed up non-native reactions in this process; even though nature found
a wide range of extremely efficient biomolecular catalytic machines, not all
industrially relevant reactions have a known natural catalyst to increase the
flux of the chemical process. Several attempts to develop de novo artificial enzymes for non-native reactions have already been made in previous
studies, with a functional Kemp-eliminase and retro-aldolase as the most
representative examples of success. [3, 4] For now, however, these techniques
have been largely driven by intuition based on known biochemical reactions
and active site coordination. At best, these enzymes are comparable with
catalytic antibodies regarding catalytic performance, while natural enzymes
surpass them by several orders of magnitude. [5, 6, 7, 8] Their level of
success therefore remains a subject of discussion. Many of the factors
contributing to enzyme catalytic activity are indeed yet to be discovered,
which makes this kind of biased rational design unlikely to yield enzymes of
competing activity in the near future.

Another approach that has been applied so far is directed evolution,
which succeeds in eliminating bias to a large extent but requires a lot of
practical resources and lab hours. [9, 10] This iterative technique consists of
routinely evaluating the effects of mutations introduced to known protein
sequences and eventually selecting the best mutants for further optimization,
but none or only part of this process is carried out computationally at the
moment. Therefore, this method is often combined with rational design in
order to speed up the process, which again increases bias.
In this work we propose a fully automated and less biased evolutionary
strategy to design and optimize a binding site for arbitrary substrate
molecules without a known natural binding pocket. Based on an efficient
genetic algorithm dubbed FIGARO (a Fast and Interpopulational Genetic
Algorithm for Receptor Optimization), we show that this strategy can be
a promising new approach to finding valuable protein structures that can
serve either as starting points in the task of artificial enzyme design or
as general optimized receptors in important domains of research like the
antibody treatment of cancer.
List of Code Snippets
3.1 Program parameters
3.2 Application backbone
3.3 Distributed computing support
3.4 PDB mining implementation
3.5 Response XML
3.6 Binding site detection
3.7 The Protein class
3.8 The mutation operator
3.9 Backrub modeling implementation
3.10 Ligand docking implementation
3.11 The crossover operator
3.12 Homology modeling implementation
3.13 The selection operator
B.1 main.py
B.2 functions.py
B.3 protein.py
List of Figures
1.1 Graphical presentation of torsion angles
1.2 Ramachandran plot example
1.3 Folding funnel representation
1.4 α-helices and β-sheets
1.5 The leucine zipper
1.6 Hemoglobin T and R state
1.7 Lock and key model
1.8 Induced fit model
1.9 Ab initio modeled structures
1.10 Chimera GUI for MODELLER
1.11 The backrub move
1.12 Side chain predictions based on backrub motion
2.1 Genetic algorithm pipeline
2.2 Stochastic Universal Sampling
Contents
Foreword
Abstract
I Literature study
1 Protein structure
  1.1 The protein backbone
  1.2 Protein folding
  1.3 The structure-function relationship
  1.4 Binding pockets, enzymes and active sites
  1.5 Protein structure prediction
    1.5.1 The backrub move
  1.6 Molecular docking
2 Genetic algorithms
  2.1 Introduction
  2.2 Parameters
  2.3 Operators
  2.4 Methods of selection
II FIGARO
3 Application overview
  3.1 General outline
  3.2 Mining the PDB
  3.3 Screening for binding sites
  3.4 The Protein class
  3.5 Modeling point mutations
  3.6 Ligand docking
  3.7 Modeling recombined sequences
  3.8 The selection operator
4 Discussion
5 Conclusion
III Appendix
A Bibliography
B Attachments
  B.1 main.py
  B.2 functions.py
  B.3 protein.py
C End summary
Part I
Literature study
Chapter 1
Protein structure
1.1 The protein backbone
Protein structure exists in several superposed levels of hierarchy. To arrive
at a final set of biologically relevant capabilities like catalysis, ligand
binding and membrane association, the structural properties of proteins all
have to be fine-tuned to a high degree. This is known as the
structure-function relationship, which will be discussed further later on. It
also implies that altering the structure will likely result in reduction or
loss of function. In short, four levels of hierarchy exist regarding protein
structure, classified as the primary, secondary, tertiary and quaternary
structure. The primary structure is the most fundamental one and consists of
the amino acid sequence in its most basic, unfolded form. It contains all
information needed for further folding processes (known as Anfinsen's dogma
[11]) and is the backbone for every protein. The 20 amino acids all have their
specific physicochemical properties, which accounts for an enormous range of
possibilities regarding final sequence composition and protein conformation.
This final conformation is based on both strong and weak interaction forces
between the reactive groups of the individual amino acids. [12] Strong
covalent bonds can be found between cysteine residues, where oxidation of
the thiol groups can result in cystine bonds between two sulfur atoms, also
known as disulfide bridges. Weaker interactions are in particular van der
Waals forces, electrostatic interactions, hydrophobic interactions and
hydrogen bonds. The total of these factors will eventually lead to a final
structure of the polypeptide chain, which can be reached at any one of the
possible levels. Keratin, for example, has a final structure at the secondary
level. It is therefore called a fibrous protein and is extended in length,
serving a structural role in biological systems. Some proteins even lack any
form of regular secondary structure. Globular proteins, however, will have
tertiary or quaternary structure and can be catalytically active. More
information about these different levels of structure development is given in
the following sections.
Figure 1.1: Graphical presentation of the φ and ψ torsion angles. Included are
Ramachandran plots for several important amino acids.1
1 Picture taken from http://www.ym.edu.tw/jierongh/research_e.html.
Figure 1.2: Ramachandran plot based on about 100,000 data points for
general amino acids (glycine, proline and pre-proline excluded).1
1 Picture taken from [13].
A protein conformation can be described by a set of torsion angles of
certain bonds in the protein backbone. For every amino acid, two torsion
angles are essential to describe the relative geometrical position in the
chain. These are called the phi (φ, between the N and Cα atoms in every
chained amino acid) and psi (ψ, between Cα and C) dihedral angles and are
visually represented in figure 1.1. Interresidual attractions and repulsions
force rotation around the single bonds between stable planar amide units.
These amide or peptide bonds have π-bonding characteristics and are therefore
geometrically fixed. [12] All empirically observed sets of φ and ψ angles can
be visualized in a two-dimensional plot, which is called a Ramachandran
plot or diagram. An example is given in figure 1.2. Three main regions
are clearly distinguishable. As is discussed later, two of these regions
represent the conformational tendency to form what are called α-helices
(lower left quadrant) and β-strands (upper left quadrant). [13] Clearly, many
combinations are never observed in nature due to unfavorable steric hindrance
between residues. A notable exception is glycine, which is the smallest amino
acid. With a side chain of only one hydrogen atom, glycine is the most
flexible amino acid and can take on many conformations in a polypeptide chain,
causing the protein to be more dynamic and adaptable. The Ramachandran plot
for glycine as shown in figure 1.1 confirms this. This is an essential
property in enzyme-mediated catalysis, as induced fit and other conformational
changes during transition states are indispensable for functionally active
proteins.
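The torsion angles described above can be computed directly from atom
coordinates. The following is a minimal sketch in Python, assuming that
coordinates are available as plain (x, y, z) tuples; the function name
`dihedral` and the toy coordinates are illustrative and not part of FIGARO.
For residue i, φ would be obtained from the atoms C(i-1), N(i), Cα(i), C(i)
and ψ from N(i), Cα(i), C(i), N(i+1).

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed torsion angle (degrees) around the p1-p2 bond."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    # Unit vector along the central bond.
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)
    # Components of the outer bonds perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    # Angle between the two projections, with sign.
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


# Toy coordinates (hypothetical, not taken from a real structure).
cis = dihedral((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, 1, 0))
trans = dihedral((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, -1, 0))
quarter = dihedral((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, 0, 1))
print(cis, trans, quarter)
```

Collecting the (φ, ψ) pairs of all residues in a structure and plotting them
against each other would yield a Ramachandran plot like the one in figure 1.2.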
1.2 Protein folding
The central dogma in molecular biology states that all genetic information
is contained within the DNA or the genome of an organism, which can be
either replicated by DNA polymerase or transcribed by RNA polymerase
to form rRNA, tRNA or mRNA transcripts. [14] The latter will eventually be translated by ribosomes to form polypeptide chains, which in turn
give rise to functional proteins. All biological systems, both eukaryotic and
prokaryotic, are based on this principle. While mRNA is being translated
and amino acid building blocks are chained together by the ribosome, inter-
and intraresidual attractions and repulsions force the nascent polypeptide
coil to adopt a certain conformation at the N-terminus and folding is
started. [15] Initial research by Christian Anfinsen on RNase A or bovine
pancreatic ribonuclease showed that this protein did indeed fold
spontaneously into a three-dimensional conformation. Anfinsen later concluded
that this three-dimensional structure can be lost when the environmental
conditions are altered, for example when pH is increased, temperature is
increased or salt is added. This process is called denaturation. After
restoring the initial conditions, Anfinsen observed that RNase A would fold
back spontaneously into its native conformation. This, however, was later
found to be true for only a minority of cases. [16, 17] Spontaneous folding
only occurs in optimal conditions during and after biosynthesis. In the
crowded intracellular environment, other proteins are therefore needed to
prevent aggregation and misfolding. These proteins are called chaperones and
assist in the folding process. Heat-shock proteins are a well-known example
and prevent misfolding of proteins due to increased temperature.
Figure 1.3: Graphical representation of the folding funnel model.1
The Levinthal paradox states that if proteins were to fold by random
sampling of all possible conformations, it would take longer than the lifetime
of the universe for a single protein to attain its native fold. As real-life
folding takes a few milliseconds or even just a few microseconds, this clearly
cannot be the case and folding has to be guided along a certain pathway or
multiple possible pathways. [18] This hypothesis is based on the fact that
more compact protein conformations, as established by research in polymer
thermodynamics, have fewer sampling possibilities with respect to their
potential energy surface. Each step in the folding process yields a more
compact protein structure with less potential energy, thus reducing the search
space to be sampled. This can be represented by a folding funnel, shown in
figure 1.3, in which the native state is attained by several traces going
through intermediate semi-stable states along the funnel. [19] Eventually a
unique native conformation can be reached efficiently this way. Moreover, it
is assumed that folding is further accelerated by parallelizing the process.
[20] Secondary structures are first formed locally and later associated to
form a more global conformation. However, the exact mechanism of folding
remains to be unraveled, as there is no generally applicable theory that
explains why some proteins fold faster than others and take different routes
to their native conformation.
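To get a feeling for the numbers behind the Levinthal paradox, the estimate
can be reproduced with a few lines of Python. The values used below (100
residues, three conformations per residue, 10^13 conformations sampled per
second) are illustrative assumptions often used for back-of-the-envelope
calculations; they are not figures taken from [18].

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600
AGE_OF_UNIVERSE_YEARS = 1.38e10  # roughly 13.8 billion years


def levinthal_years(n_residues, states_per_residue=3, samples_per_second=1e13):
    """Years needed to visit every conformation once by random search."""
    conformations = states_per_residue ** n_residues
    return conformations / samples_per_second / SECONDS_PER_YEAR


years = levinthal_years(100)
print(f"exhaustive search: {years:.2e} years")
print(f"ratio to age of universe: {years / AGE_OF_UNIVERSE_YEARS:.2e}")
```

Under these assumptions the exhaustive search takes on the order of 10^27
years, many orders of magnitude longer than the age of the universe, which is
why a guided, funnel-like search is required.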
1 Picture in figure 1.3 taken from http://science.sciencemag.org/content/338/6110/1042/F3.
What we do know is that folding is predominantly driven by a small
set of physical interactions between the amino acids and their side chains and
constraints with respect to the dihedral torsion angles of the protein
backbone. Also, constraints exist on the geometric orientation of the atoms
in the side chains of the amino acids. Single bonds are in general freely
rotatable, but small differences in potential energy between configurations
cause energetically more favorable rotamers to be preferred in final protein
structures. Some combinations are even completely excluded from the set of
possible rotamers, resulting in a rotamer library which is frequently used in
computational methods to predict protein three-dimensional structure. [15]
As discussed earlier, protein structure can be described from four different
perspectives. The primary structure in essence just represents the amino
acid sequence of the protein, being held together by peptide bonds allowing
for the φ and ψ torsion angles as discussed above. We therefore won't
go into further chemical detail about this structural level, as we assume the
reader to be familiar with the basic organic chemistry behind macromolecular
structures.
Figure 1.4: α-helices (in green) and β-sheets (in red).1
1 Picture taken from http://oregonstate.edu/instruction/bi314/fall12/cellchemistrycontd.html.
The main driving forces causing secondary structure elements to form
are hydrogen bonds between atoms in the protein backbone. These occur
between the hydrogen atoms bound to electronegative nitrogen atoms and a
double-bonded oxygen atom at another site in the backbone. In 1951, Linus
Pauling already predicted this to give rise to two main structural patterns:
α-helices and β-sheets. [21] These structural units are visualized in figure
1.4. As discussed earlier, these patterns can also be inferred from the
Ramachandran plots of thousands of proteins. The third group visible in
these plots (upper right quadrant in figure 1.2) represents the left-handed
α-helices. This is a rather small group though; generally speaking,
left-handed α-helices only occur in nature as short loop regions or
single-turn α-helices. [22]
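The quadrant assignment described above can be written down as a small
lookup. This is a deliberately coarse sketch that only follows the quadrants
named in the text; practical secondary structure assignment relies on finer
regions and on hydrogen bonding patterns. The function name, the region
labels and the example angles (typical textbook values) are illustrative
assumptions.

```python
def ramachandran_region(phi, psi):
    """Coarse region label for a (phi, psi) pair given in degrees."""
    if phi < 0 and psi < 0:
        return "right-handed alpha-helix region"  # lower left quadrant
    if phi < 0 and psi >= 0:
        return "beta-strand region"               # upper left quadrant
    if phi >= 0 and psi >= 0:
        return "left-handed alpha-helix region"   # upper right quadrant
    return "rarely populated region"              # lower right quadrant


# Typical textbook angles, used here only as examples.
examples = {
    "ideal alpha-helix": (-57.0, -47.0),
    "beta-strand": (-120.0, 130.0),
    "left-handed helix": (57.0, 47.0),
}
for name, (phi, psi) in examples.items():
    print(f"{name}: {ramachandran_region(phi, psi)}")
```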
Once a polypeptide chain is completely folded into a globular protein
unit, tertiary structure is reached. This structural level is most common
for enzymes and will be treated exclusively in this thesis. Separate units
of tertiary structure can function on their own, but are also seen to be
associated with each other, serving as identical or deviating subunits in a
polymeric complex. This final complex then constitutes the functional and
biologically relevant protein. In this work, however, we restrict ourselves
to single polypeptide chains for simplicity's sake.
When visualizing the three-dimensional structure of folded proteins,
secondary structure elements are often accentuated by special illustrative
patterns and colors like arrows, spirals and cylinders, also known as the
cartoon representation of proteins. This makes it possible to get better
insight into a protein's general composition and biological role. Indeed,
plotting the absolute coordinates of all atoms should in theory be enough to
deduce the general shape of the protein, but at this low-level representation,
interesting higher-level motifs are not directly revealed to the biomolecular
analyst. This is important because a wide range of short motifs can be
associated with specific biological roles. An example is the well-known
leucine zipper, which is shown in figure 1.5 and has strong DNA-binding
characteristics. A dipole moment exists along the length of the α-helices,
which is caused by the accumulation of individual dipoles in every amide bond
of the helix. This results in a more positively charged N-terminus, while the
C-terminus of the helix gets more negatively charged. This, in combination
with a high abundance of basic amino acids at the N-terminal side of the
helix further increasing the positive charge on this side of the chain, causes
it to have high affinity for the negatively charged strands of DNA. Finally,
hydrophobic amino acids at the center of the helix make the helices associate
with each other and a zipper-like structure is formed. [23]
Figure 1.5: Leucine zipper structure interacting with a DNA double helix.1
1 Picture taken from https://commons.wikimedia.org/wiki/File:Leucine_zipper.png.
Motifs like the leucine zipper can be combined to form higher-order
functional modules called domains. Domains are independently folding units in
a global protein structure and are often duplicated and exchanged between
genetic sequences, introducing new functionality in other genes. Evolutionary
processes like mutation, crossover and conservation allow protein domains
to be efficiently optimized and transported to other places in the genome,
serving as fundamental modular building blocks for protein structure and
function. [15]
In order for scientists to keep a clear overview of all these known domains and experimentally determined structures, bioinformaticians all over
the world created several databases to store this knowledge online. The three
main reference databases for protein domain classification are SCOP [24],
CATH [25] and Pfam [26]. While Pfam is a sequence-based database inferring domain characteristics from Hidden Markov Model (HMM) analysis of
new sequences, SCOP and CATH share the same structure-based nature and
derive their information from experimentally determined protein structures,
which means that they need an underlying database of verified structures
to do their analysis on. In 1971, scientists at Texas A&M University began
a small database and molecular visualization project eventually leading to
what's now known as the Protein Data Bank (PDB), serving as the sole
reference in protein structure storage and consultation today. [27] The PDB
currently contains more than 100,000 protein structures and also supports
geometrical information about other biological macromolecules like nucleic
acids and protein-nucleic acid complexes. Thankfully, it also offers a RESTful Web Services API to allow for automated structure mining based on
extended querying possibilities.
The PDB requires protein structure coordinates to be stored in a specific
file format defined by the wwPDB, a central organization managing
the PDB. [28] In order to maintain a coherent framework for structural and
functional analysis, strict rules have been established for this format, but
unfortunately these are not always respected. Also, not all PDB structural
information is of high quality. This is mainly due to differences in the
methods used to infer the atom coordinates. X-ray crystallography is generally
known to deliver better quality structures with higher resolution than NMR,
but even in this category structural quality can differ significantly. In
general, structures with a resolution value below 1 Å are considered to be of
very high quality, while values of 3 Å and above indicate rather poor
quality. Also, it has to be mentioned that membrane proteins do not tend to
crystallize easily, so high quality structures of these proteins are rather
hard to find. Altogether, there is no doubt that it is very important to
handle the PDB and its format with care in order to get reliable results.
Thankfully, several software packages like Biopython do a good job accounting
for these imperfections. [29]
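The resolution rule of thumb above can be applied automatically when screening
structure files. The sketch below uses only the Python standard library and
assumes that the resolution is reported in a `REMARK   2 RESOLUTION.` line, as
is customary in PDB files; libraries like Biopython handle such details more
robustly. The function names, the quality labels and the sample header are
made up for illustration.

```python
def parse_resolution(pdb_text):
    """Return the resolution in Angstrom from REMARK 2, or None if absent."""
    for line in pdb_text.splitlines():
        fields = line.split()
        if fields[:3] == ["REMARK", "2", "RESOLUTION."]:
            try:
                return float(fields[3])
            except (IndexError, ValueError):
                # NMR entries typically state "NOT APPLICABLE." here.
                return None
    return None


def quality_label(resolution):
    """Map a resolution value to the coarse categories used in the text."""
    if resolution is None:
        return "unknown"
    if resolution < 1.0:
        return "very high"
    if resolution >= 3.0:
        return "poor"
    return "intermediate"


# Hypothetical header fragment, not taken from a real PDB entry.
SAMPLE = (
    "HEADER    EXAMPLE STRUCTURE\n"
    "REMARK   2\n"
    "REMARK   2 RESOLUTION.    1.50 ANGSTROMS.\n"
)

resolution = parse_resolution(SAMPLE)
print(resolution, quality_label(resolution))
```

A filter of this kind could be used to discard low-quality entries before any
further structural analysis is attempted.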
1.3 The structure-function relationship
A protein is nothing more than its own sequence of amino acids and the
biochemical implications that come with it. That is, when one describes a
protein and its biological role, be it an enzyme, membrane or transport
protein, it should always be kept in mind that only its amino acid composition
is responsible for that. As described earlier, Anfinsen's dogma states that
the amino acid sequence contains all information needed for correct folding
inside a cellular environment. As an example, membrane proteins often
fold into two main regions with different physicochemical properties. Often a
highly hydrophobic and a more hydrophilic region can be clearly distinguished.
This hydrophobic region is thermodynamically predisposed to associate with
lipid compounds, so the protein will be fixed to the double-layered
phospholipid cellular membrane at this site. This property is what gives the
protein its correct localization and makes it function as a membrane protein.
The fact that this protein is located at the cellular membrane can thus be
reduced to the fact of having a fold that separates a hydrophobic and a
hydrophilic environment, which in turn is a direct result of its unique amino
acid sequence that has evolved to fold in this way. If we were to introduce
enough mutations into this sequence to alter this fold dramatically, it should
be clear that its function and correct localization would be lost, too. [12]
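One common way to make such hydrophobic regions visible from the sequence
alone is a sliding-window hydropathy profile. The sketch below uses the
Kyte-Doolittle hydropathy scale, which is not part of this thesis and is
brought in here purely for illustration; the window size, the function names
and the toy sequence are likewise assumptions.

```python
# Kyte-Doolittle hydropathy values (positive means hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=9):
    """Average hydropathy for every window of the given length."""
    scores = [KYTE_DOOLITTLE[aa] for aa in sequence.upper()]
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]


def most_hydrophobic_window(sequence, window=9):
    """Start index and score of the most hydrophobic window."""
    profile = hydropathy_profile(sequence, window)
    start = max(range(len(profile)), key=profile.__getitem__)
    return start, profile[start]


# Toy sequence: charged flanks around a hydrophobic stretch (hypothetical).
toy = "KKDEESRK" + "LLIVALLVIAL" + "KRDEESKK"
start, score = most_hydrophobic_window(toy)
print(start, round(score, 2), toy[start:start + 9])
```

A pronounced positive peak in such a profile hints at a segment that would
prefer a lipid environment, while strongly negative stretches point to
hydrophilic, solvent-exposed regions.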
Another well-studied example representing the fine-tuned interplay between
structure and function is hemoglobin and its variable structural properties.
Hemoglobin can exist in two states, which are shown in figure 1.6.
When the structure resides in its taut or tensed (T) state, it exhibits a
smaller affinity for oxygen. However, upon environmental changes like
decreased CO2 partial pressure and the accompanying pH increase, the
conformation of hemoglobin changes to a more relaxed (R) state which shows
much higher affinity for oxygen. When aerobic organisms breathe, CO2 is removed
from the environment, causing hemoglobin to transform into its R state, being
optimized for oxygen uptake and transport. Also, oxygen binding further
stabilizes the R state, causing more oxygen to be taken up easily. When
hemoglobin enters an environment where the release of oxygen is desired,
the structure again switches to the T conformational state. The TCA cycle in
tissue cells, happening in- and outside the membranes of the mitochondria,
needs oxygen to produce energy and is preceded by glycolysis. Glycolysis
implies the co-production of 2,3-BPG, which can exit the cell into the blood
stream, bind to hemoglobin and act as an allosteric regulator inducing a
conformational transition to the T state. This disfavors oxygen binding, upon
which oxygen is released, taken up by the cells and exposed to the
intracellular mitochondria. [30]
Figure 1.6: Hemoglobin structure conformation in its T (left) and R (right)
state. The R state brings two histidine residues close to each other, enabling
a more favoured binding of O2.1
1 Picture taken from http://cbc.arizona.edu/classes/bioc462/462a/NOTES/hemoglobin/hemoglobin_function.htm.
It should be clear that it doesn't end with these two examples. All
proteins need their unique structures to be functional. It has to be mentioned,
though, that this applies at different levels of detail. For example,
structural proteins like keratin, having only secondary structure, do need a
sturdy and extended conformation for their biological role, but single point
mutations are not likely to have any direct effects on functional properties.
By contrast, enzymes and other proteins with globular conformation often have
a highly optimized and flexible fold. A single point mutation around the
active site of an enzyme can already cause far less affinity for its
substrate, which in some cases can have a disastrous impact on the rate of
catalysis, in turn negatively influencing complete metabolic pathways with
possibly severe complications as a result. In other cases, though, point
mutations like this can in fact be desirable and increase enzymatic activity
to a certain extent.
1.4 Binding pockets, enzymes and active sites
Proteins can be specialized in all kinds of tasks. A main differentiator
between all protein functional classes is the ability to bind smaller
molecules, also called ligands. We just described a representative of this
family, as hemoglobin can be seen as a receptor protein binding to a dioxygen
ligand. As such, transport proteins belong to a higher-level class of proteins
with ligand-binding capabilities. Many membrane proteins and in particular
enzymes share this classification. In addition to having possibly multiple
binding pockets for ligands, enzymes also possess an active site which can
be situated at any one of the available binding pockets and carries out
the actual catalysis. This particular binding site exhibits high affinity for a
transition state structure of the reaction to be catalyzed. An often
demonstrated but overly simplified model of the enzyme working mechanism is
the lock and key representation, illustrated in figure 1.7. This, however, is
an outdated interpretation of how many enzymes work. According to this model,
an enzyme readily possesses high affinity for every substrate molecule
involved in the reaction and is rigidly shaped to bind these ligands
exclusively. Indeed, enzymes are known to have high specificity for a certain
range of substrates, like a lock has for its keys. Upon tight binding of these
molecules, the reaction is facilitated by the enzyme and a product for which
the enzyme has much less affinity is released. This theory is nowadays
generally considered to be inaccurate and, instead, an induced fit model is
preferred. This model assumes that initial binding of a substrate molecule
triggers a conformational change in the enzyme, causing the transition state
of the reacting compounds to be thermodynamically favored and stabilized,
rather than the substrates as such. As a result, activation energy is
significantly lowered and the reaction rate is increased. The main difference
with regard to the lock and key model is that, in this model, the enzyme is a
flexible entity that dynamically adapts to the shape of its substrates and
transition state. A schematic overview of this is given in figure 1.8. [31, 32]
Figure 1.7: Graphical representation of the lock and key model.1
Figure 1.8: Graphical representation of the induced fit model.1
1 Picture taken from https://aberdeenc.wordpress.com/tag/induced-fit-hypothesis/ and edited.
Enzyme dynamics and especially the chemical kinetics of enzyme-mediated catalysis have been a major domain of research over the last century. Two main concepts should be highlighted in this regard. Introduced in 1913 by Leonor Michaelis and Maud Menten, the rate constant kcat and the affinity constant KM characterize enzyme kinetics based on a few simplifying assumptions. [33] In short, the rate constant describes the turnover rate at which an enzyme is able to catalyze the reaction under the assumption of complete saturation of the available active sites and fixed enzyme concentration, thereby defining the maximum speed or Vmax of the catalyzed reaction. The affinity constant defines the substrate concentration at which half of that Vmax is reached. When an enzyme has higher affinity for a certain substrate, this point is reached at a lower substrate concentration. Thus, a lower KM indicates higher enzyme affinity for the substrate. If we want to increase enzyme activity, either of these two parameters can be targeted. An often used value for comparing enzyme efficiency is the ratio of kcat to KM. Optimizing enzyme dynamics and stabilization of the transition state would yield a better kcat value, while optimizing substrate binding in general should lower KM. [33]
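The relation between these quantities can be made concrete with a small numerical sketch. It uses the standard Michaelis-Menten rate law, v = Vmax [S] / (KM + [S]) with Vmax = kcat [E], which is not written out in the text above; the function names and the numerical values are hypothetical and only serve as an illustration.

```python
def michaelis_menten_rate(substrate_conc, k_cat, k_m, enzyme_conc):
    """Reaction rate v = Vmax * [S] / (KM + [S]), with Vmax = kcat * [E]."""
    v_max = k_cat * enzyme_conc
    return v_max * substrate_conc / (k_m + substrate_conc)


def catalytic_efficiency(k_cat, k_m):
    """Ratio kcat / KM, often used to compare enzymes."""
    return k_cat / k_m


# Hypothetical parameter values, chosen only for illustration.
K_CAT = 100.0      # turnovers per second
K_M = 2.0          # substrate concentration units
ENZYME = 0.01      # total enzyme concentration

# At [S] = KM the rate equals half of Vmax.
half_rate = michaelis_menten_rate(K_M, K_CAT, K_M, ENZYME)

# A lower KM gives a higher rate at the same substrate concentration.
rate_low_km = michaelis_menten_rate(1.0, K_CAT, 1.0, ENZYME)
rate_high_km = michaelis_menten_rate(1.0, K_CAT, 2.0, ENZYME)
```

In this sketch, lowering KM or raising kcat both increase the ratio kcat/KM, which matches the two optimization targets mentioned above.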
In our project, we will mainly focus on the induced fit binding of a given ligand rather than on enzyme dynamics and transition state stabilization; the latter did not fit within the scope of this master's thesis. In order to optimize enzyme catalytic efficiency, a first step should be to increase affinity for the substrate, which is comparatively straightforward. Transition states can exist in multiple intermediate steps that eventually yield the product, so modeling them is more complex. Later on, however, it should be within reach to extend this work and integrate enzyme mechanics. For example, the Rosetta suite also offers active site design and evaluation functionality, which could be a valuable extension to the fitness function in our genetic algorithm. [34] Moreover, the same docking procedure as applied in this work could be used to optimize the binding, and thus stabilization, of transition state analogs. A multi-objective adaptation of our genetic algorithm should support that.
Binding of a ligand to its binding pocket is based on a wide range of weak interactions, leading to many degrees of freedom regarding optimization. All these interactions can be computationally quantified based on physical laws. Therefore, a summation over all relevant forces gives us a quantitative measure for goodness of fit. Several software packages exist to perform what is called molecular docking, but we will come back to this later and discuss it in more detail.
1.5 Protein structure prediction
The protein folding problem has come a long way since its first formulation by Kendrew and co-workers in 1958 [35], after they had realized that the three-dimensional structure of globular proteins was by no means as regular and symmetrical as they had expected. The aim of predicting a protein structure from its amino acid sequence can mainly be seen as a direct consequence of the complications of experimental structure determination, which is generally carried out by X-ray crystallography or NMR. As protein sequences keep being discovered in all kinds of organisms, assigning functions to these gene products is far from straightforward without detailed knowledge about their geometrical conformations. As discussed above, protein function is inseparably associated with structure, so determining these structures is a very important task in order to understand their biological relevance, eventually leading to industrial and in particular medical applicability.
Unfortunately, experimental structure determination cannot keep up with this fast-paced sequence discovery, and the gap between known sequences and experimentally derived corresponding structures keeps growing. [36] X-ray crystallography is very expensive and time-consuming but can yield very high resolution structures. Moreover, the task of deriving protein structures from NMR measurements is far from straightforward and requires significant specialization in the field, while being suitable only for smaller proteins and delivering structures with rather poor resolution. The technique can, however, be used to determine structures in solution, in contrast to diffraction-based analysis. Nowadays, it takes several months to years to experimentally derive a single structure; in fact, growing a protein crystal big enough for X-ray analysis (about 0.5 mm) can already require months. Additionally, membrane proteins are almost impossible to crystallize, which leads to extra complications. [37]
These facts taken into consideration, it should come as no surprise that the domain of computational structure prediction has drawn a lot of attention ever since the problem was formulated. Yet, after 50 years of intensive research, there is still a lot to be done. It is generally considered to be one of the most important unsolved problems in biological engineering today. [38] Many goals have already been reached in the field and, in the search for a single best method to predict the three-dimensional structure of a protein from its amino acid sequence, two main promising strategies have evolved. This in itself makes clear that a single best method does not exist. While Anfinsen's dogma states that all information for protein folding is contained within the amino acid sequence, it has unfortunately proven to be extremely challenging to make predictions based on this sequence alone. The approach still deserves dedicated research, however, as it has great potential once sufficient computational power and efficient methods become available. This domain in structural biology is called ab initio protein structure prediction and aims at making high quality predictions based on the physical laws of molecular mechanics and dynamics. In theory, this means that no statistics need to be applied and that the calculation is completely without bias. In practice, with currently available computational resources, approximations are inevitable. Full molecular dynamics simulations can actually be carried out, but currently only on small proteins ranging from about 50 to 100 amino acids for periods up to 1 ms. [39] These computations are facilitated by the state-of-the-art Anton supercomputer, which is specifically designed for particle-particle interaction calculations. [40] A second revision of this device, Anton 2, was announced some time ago, but is still in development at the time of this writing (January 2017).1 The most impressive results achieved with the Anton supercomputer have elucidated many interesting facts about the folding process that had previously been impossible to observe. Figure 1.9 shows some of the results achieved so far and convincingly confirms that ab initio structure prediction is one of the most reliable prediction strategies available today. [39]
Figure 1.9: Some ab initio modeled structures from a recent publication. In red are the experimentally determined high resolution structures. The conformations in blue were rendered with the use of the Anton supercomputer.
1 Anton 2 will have as much as 5 times the number of CPU cores running at 1650 MHz each, compared to 485 MHz of the original Anton supercomputer. It will also sport 152 interaction pipelines running at 1650 MHz per pipeline, a peak throughput of 12.7 TFXOPS, 4096 KB of RAM, 8000 rendered atoms per ASIC and up to 2.7 Tb/s of channel bandwidth.
Unfortunately, most proteins of biological and medicinal importance are significantly bigger than the ones that can be predicted ab initio right now. Therefore, scientists have been focusing over the past decades on another approach, which aims to predict the structure of new proteins based on the coordinates of previously derived conformations. This strategy is called homology modeling and mostly uses statistical inference combined with small molecular dynamics calculations to carry out structure predictions. It has many advantages over ab initio modeling, in particular that it can be used for proteins of over 100 amino acids and that it has much faster runtimes. As sequence identity between two proteins increases, model quality can be equal to or even better than that of ab initio rendered structures. The method already produces reliable results with sequence identities starting at about 30%. If sequences show 80% to 90% similarity, model resolutions even better than 1.5 Å can be obtained. Note that this guideline is only valid for proteins that have evolved naturally. Sequences that show similarity between proteins are also called conserved regions and are most likely situated in similar conformational folds. As the structure-function relationship dictates, a protein is assumed to lose its function if its fold is distorted by mutation. Therefore, even proteins with up to 70% deviation in sequence usually still retain the same fold. This principle is key to homology modeling and turns out to be the main reason for its success and popularity among researchers. One should be careful though; not all parts of protein sequences are conserved. Flexible regions exist, mostly acting as connectors between the functional and conserved domains. These regions are known as loops and turns and can be highly variable, even between proteins with high sequence identity. Review and model evaluation by an expert are therefore appropriate. [41, 42, 43]
Figure 1.10: Graphical User Interface for MODELLER provided by the Chimera molecular graphics suite.
From the formulation of the problem in 1958 up until now, many software packages have been developed to perform comparative modeling. Traditionally, one of the most popular and fastest solutions available is MODELLER, developed by the lab of Andrej Sali at the University of California. MODELLER provides many useful tutorials and comprehensive documentation and is the de facto standard among students trying to develop their skills in homology modeling. It should therefore be no surprise that we opted for this package, too. Other popular solutions are I-TASSER, ORCHESTRAR, Prime, MOE, SWISS-MODEL, Composer and the comparative modeling tool of the Rosetta suite. Figure 1.10 shows a third-party graphical interface to the MODELLER program, in which model reliability is indicated on the produced structure. We can clearly see that the regions with the highest error probability are situated in the flexible coils without secondary structure. [41]
In this work, we will combine elements from ab initio and homology modeling; the exact procedure followed will be discussed later on. Note that we will not perform extended molecular dynamics calculations on complete protein folds but instead focus on the small-scale effects of single point mutations. For this, we gratefully make use of what is called the backrub move. A detailed description of this phenomenon is given in the next section.
Figure 1.11: Schematic representation of the backrub move. Rotation 1,3 is visualized by the red dotted circle. This is the primary rotation and is traced out by the central Cα. As a rigid body, the central Cα and its surrounding peptides rotate about the red axis going from Cα(i-1) to Cα(i+1). Secondary rotations 1,2 and 2,3 are represented by the blue dotted circles and move the individual peptides as rigid bodies about the blue Cα-Cα axes.
1.5.1 The backrub move
In the previous section, two main approaches to protein structure prediction were highlighted and briefly explained. Most of the time, sequences that have a high priority to be modeled originate from genome databases. This is explained by the fact that there still exists a giant gap between the available conformational information and the number of known sequences found in all kinds of organisms. The solution for these sequences is rather straightforward. They are first aligned to a sequence database with tools like BLAST or HMMER to find homologs that have already been structurally annotated. Based on these structures, and if sequence identity is high enough, homology modeling is performed and in the best case a highly reliable model can be obtained. If no homologous structures are found, as would be the case for many membrane proteins, there is not much left to do.
But what strategy should be used if a template with known structure is available and has very high sequence identity with the sequence to be modeled, but that sequence is artificially designed and counts more than 300 amino acids? Clearly, homology modeling cannot be used, as it only works under the assumption of natural evolution and fold conservation. Ab initio modeling of the complete sequence is also ruled out, because it is currently intractable to perform molecular dynamics simulations on over 300 amino acids in reasonable time. Fortunately, a study by Davis and co-workers in 2006 elucidated a highly predictable pattern in the effects of point mutations on global protein conformation. This pattern is called the backrub move and is visualized in figure 1.11. [44]
Figure 1.12: Some correct side chain predictions based on the backrub motion. Fixed backbone prediction is shown in red while backrub predictions are shown in blue. Starting PDB structure is shown in green and the target point-mutated PDB structure is shown in purple.
The backrub motion describes subtle changes in a protein's backbone triggered by much larger changes in side chain conformation. These side chain movements are often observed due to impacts of other molecules in the environment, like H2O, but after collision the original backbone conformation is usually restored. Naturally occurring point mutations also cause these side chain rearrangements, but in contrast to random impacts these changes are permanent. While a point mutation itself usually does not directly influence backbone conformation, with cysteine and proline being the only exceptions to this rule, the backbone is again tilted locally due to the altered side chains. Over a much longer evolutionary timescale, these local backbone shifts accumulate and cause a protein backbone to change drastically and take on new folds. In the task of protein modeling, knowledge of these backrub moves can provide realistic predictions of backbone perturbations while maintaining a valid global geometry. [45]
While these backbone shifts cause some strain on the angles 1 and 3 as seen in figure 1.11, these usually remain well within the range of allowed values and practically never deviate by more than one standard deviation. Secondary movements of the surrounding peptides, triggered by the primary rotation of the central Cα, also help to accommodate this strain. [45] In figure 1.12, some correct side chain predictions based on this backrub pattern upon point mutation are illustrated.
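Geometrically, the primary backrub rotation is a rigid-body rotation of a few atoms about the axis that connects two flanking Cα atoms. The following minimal sketch shows only this geometric step, using Rodrigues' rotation formula. It is not the implementation used by Rosetta or by this work; the function name and the coordinates are hypothetical, and the secondary peptide rotations are left out.

```python
import math


def rotate_about_axis(point, axis_start, axis_end, angle_rad):
    """Rotate a 3D point about the axis through axis_start and axis_end.

    Uses Rodrigues' rotation formula. In a backrub-like move, axis_start
    and axis_end would be the flanking C-alpha positions (assumption).
    """
    axis = [e - s for s, e in zip(axis_start, axis_end)]
    norm = math.sqrt(sum(c * c for c in axis))
    k = [c / norm for c in axis]                       # unit axis vector
    v = [p - s for p, s in zip(point, axis_start)]     # point relative to axis
    cos_a, sin_a = math.cos(angle_rad), math.sin(angle_rad)
    k_cross_v = [
        k[1] * v[2] - k[2] * v[1],
        k[2] * v[0] - k[0] * v[2],
        k[0] * v[1] - k[1] * v[0],
    ]
    k_dot_v = sum(a * b for a, b in zip(k, v))
    rotated = [
        v[i] * cos_a + k_cross_v[i] * sin_a + k[i] * k_dot_v * (1.0 - cos_a)
        for i in range(3)
    ]
    return tuple(r + s for r, s in zip(rotated, axis_start))


def primary_rotation(atoms, ca_prev, ca_next, angle_rad):
    """Apply the same rigid-body rotation to every atom between two C-alphas."""
    return [rotate_about_axis(a, ca_prev, ca_next, angle_rad) for a in atoms]


# Hypothetical coordinates: the axis runs along x, the central atom sits on y.
moved = primary_rotation([(0.5, 1.0, 0.0)], (0.0, 0.0, 0.0), (1.0, 0.0, 0.0),
                         math.radians(10.0))
```

Because the rotation is rigid, the distances from each moved atom to both axis end points are preserved, which is why the move keeps the surrounding geometry valid apart from the small angular strain discussed above.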
1.6 Molecular docking
In section 1.4, we already mentioned the concept of molecular docking. In essence, molecular docking returns a docking score, which acts as a quantitative measure for goodness of fit between a ligand and a receptor structure. These names can be confusing though, as a ligand suggests a smaller molecule binding to a larger receptor. In reality, the terms are relative and often used interchangeably. Molecular docking can be applied to all kinds of (macro-)molecular structures. For example, the modeling of protein-protein interactions and that of protein-DNA complexes are both carried out by specialized docking algorithms. Each specific docking situation often has its own range of specialized software implementations.
A docking score is calculated based on the total change in free energy
upon binding using physics-based force fields. This is a global energy summation function that takes into account all interaction forces that come into
play. The flexibility of the two actors is another important factor to be
accounted for as described in section 1.1 and section 1.4, which leads to an
even bigger solution space for this particular optimization task. Global optimization techniques such as Monte Carlo simulations and genetic algorithms
are therefore appropriate to come to an acceptable estimate. As discussed
in chapter 2, this implies a wide range of variables to be set when running
docking jobs. [43]
The application domains of molecular docking are countless. One of the most important domains of research being facilitated by molecular docking is drug discovery. Nowadays, virtual screening against compound databases often serves as a starting point in drug development processes. For example, viral proteins can be easily and directly shut down by binding a target-specific allosteric effector. More indirectly, monoclonal antibodies can be designed to target microbial compounds, upon which they can be neutralized by the immune system of the host. With the rise of computational docking, all these processes can be reliably carried out at a much faster pace. [46]
However, molecular docking is not only used to calculate docking scores.
In many research domains it is actually more important to get a prediction
of the relative conformational positions between ligand and receptor. As an
illustration, many metabolic pathways make use of feedback systems to regulate the accumulation of end compounds, which can be acting as allosteric
effectors on enzymes facilitating their own production. Phosphofructokinase
or PFK is a well-known enzyme in glycolysis catalyzing the phosphorylation
of fructose-6-phosphate into fructose-1,6-bisphosphate. High levels of ATP,
one of the products of glycolysis, will cause PFK to have less affinity for
its substrate by altering its three-dimensional conformation upon binding.
ATP is said to be an allosteric inhibitor of PFK. [47] Gaining insight into such allosteric mechanisms can be significantly accelerated by molecular docking applications that visualize the three-dimensional conformation of complexes.
Like in protein structure prediction, many software applications have
been developed for molecular docking analysis. Some of them are specialized in virtual high-throughput screening of compounds targeting the pharmaceutical industry and often come with a high price tag, while others are
open source and freely accessible on the internet. The most popular one
in this area is AutoDock Vina, but we will use the more configurable rDock
package. [43]
Chapter 2
Genetic algorithms
Figure 2.1: General flowchart of a genetic algorithm.1
2.1 Introduction
The most powerful problem solver in the Universe as we know it is without a doubt natural evolution. Since the formation of Earth more than 4 billion years ago, we have evolved into walking creatures with clear vision and a wide range of other biological sensors to find our way in this complex world. While our brain size has increased significantly since the beginning of the genus Homo, many things are still intractable for us to design or even understand. Evolution does not have any problems with that and simply exploits all resources it can access to build whatever is appropriate in a given context. This universal problem solver was the inspiration for scientists to come up with new global optimization strategies now known as evolutionary programming and genetic algorithms. In this work, we will limit our focus to genetic algorithms and their applications.
1 Picture taken and adapted from https://www.hindawi.com/journals/mpe/2014/708275/fig4/.
The working mechanism of genetic algorithms is in essence very simple; in the simplest case, it does not use any prior knowledge to approach a global optimum for a given problem. The general pipeline is depicted in figure 2.1 and will be briefly explained here. First, an initial population of solution candidates is generated. These are often just randomly generated, but valid, representations of members of the solution space. A main requirement in setting up a genetic algorithm is defining the fitness function. As in natural evolution and Darwin's principle of survival of the fittest, this measure is used to compare the individuals in the population. These individuals are also called chromosomes, analogous to the biological model for genetics in which chromosomes are composed of genes. As in the example of figure 2.1, genetic algorithms usually do not go into more detail concerning real-life genetics and simply assume a gene to be the most fundamental unit of information from which chromosomes are constructed. For every chromosome in the population at time t, the fitness is calculated. Based on these values, several methods exist for selection, which are discussed later on. Simply put, the better performing solution candidates have a higher chance of reproducing and proceeding to the next generation. This process of reproduction happens in two phases: crossover and mutation. The selection operator filters out individuals eligible for mating and the crossover operator exchanges bits of information between these candidates. After that, the mutation operator potentially introduces random mutations in the sequences. This aims to eliminate bias by exploring new parts of the solution space. The result is a new population at time t + 1, and the same steps are performed over and over again until the stopping criterion is fulfilled. After a sufficient period of time, an acceptable estimate of the global solution should be returned. [48]
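The pipeline described above can be sketched in a few lines. This is a minimal illustration and not the algorithm developed in this work: it assumes bit-string chromosomes, a toy fitness function (the number of ones), fitness proportional selection, single-point crossover and a per-chromosome mutation probability. All names and default values are hypothetical.

```python
import random


def one_max(chromosome):
    """Toy fitness function (assumption): the number of ones in the bit string."""
    return sum(chromosome)


def run_ga(fitness, n_genes=20, pop_size=30, generations=40,
           p_crossover=0.9, p_mutation=0.2, seed=0):
    """Return the best chromosome seen over all generations."""
    rng = random.Random(seed)
    # Initial population: random but valid members of the solution space.
    population = [[rng.randint(0, 1) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        # Fitness of every chromosome at time t (+1 keeps all weights positive).
        weights = [fitness(c) + 1 for c in population]
        offspring = []
        while len(offspring) < pop_size:
            # Selection: fitter candidates are more likely to mate.
            mum, dad = rng.choices(population, weights=weights, k=2)
            child_a, child_b = mum[:], dad[:]
            # Crossover: exchange information between the two parents.
            if rng.random() < p_crossover:
                cut = rng.randint(1, n_genes - 1)
                child_a = mum[:cut] + dad[cut:]
                child_b = dad[:cut] + mum[cut:]
            # Mutation: occasionally flip one random gene of a child.
            for child in (child_a, child_b):
                if rng.random() < p_mutation:
                    i = rng.randrange(n_genes)
                    child[i] = 1 - child[i]
                offspring.append(child)
        population = offspring[:pop_size]   # population at time t + 1
        best = max(population + [best], key=fitness)
    return best


best_solution = run_ga(one_max)
```

The stopping criterion in this sketch is simply a fixed number of generations; other criteria, such as a fitness threshold, fit the same loop.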
Keep in mind that genetic algorithms are by no means the holy grail that solves every problem. They belong to the same category of metaheuristic global search algorithms as ant colony search and particle swarm optimization algorithms. These algorithms tend to perform very well in complex situations with enormous solution spaces that would otherwise be intractable to explore. Therefore, a genetic algorithm is an appropriate strategy for this work. In the next few sections, several considerations in the process of implementing efficient genetic algorithms are highlighted.
2.2 Parameters
There is no such thing as a general genetic algorithm. It is more of an umbrella term for many specialized variations. Just as Dijkstra's algorithm is a special case of A* search, genetic algorithms consist of many variables that can be heavily modified and optimized for specific situations. Yet, a reference algorithm does exist and is better known as the canonical genetic algorithm (CGA). It serves as a basis, a guideline for more optimized and efficient variants. Upon this basic algorithm, the schema theorem and building block hypothesis were built, which act as mathematical support for algorithm efficiency. [48, 49] The main parameters of every genetic algorithm are discussed below.
Probability of mutation The first parameter that can be optimized to increase algorithm efficiency is the probability of mutation (%M). This parameter describes the chance of mutation for each chromosome or solution candidate in the selected part of the (sub-)population, going from one generation to another. In genetic algorithms, the mutation operator is of less importance than the crossover operator. Yet, values of this parameter can heavily affect the efficiency of the algorithm. Too high a probability of mutation leads to an excess of genetic diversity, so that the algorithm cannot reach convergence. On the other hand, too low a value can easily cause premature convergence, and a very poor, suboptimal solution may be returned.
Probability of crossover Secondly, we should consider the probability of crossover (%C). Crossover or recombination is the main driving force in genetic algorithms and natural evolution in general. It tries to exploit the good elements of solution candidates while maintaining genetic diversity over different generations by mixing them up from one generation to another. This way, good elements are kept in the population and new chromosomes based on the previous ones are generated and evaluated. Like the mutation probability, this parameter is sensitive to extreme values. Too high values cause too much diversity, while too low values lead to premature convergence.
Percentage elitism Another important parameter is the percentage of
elitism (%E). This represents the percentage of the population that
is to be kept when moving from one generation to another, chosen
from the upper part of the solution candidate pool with respect to
their relative fitness values. This part is moved without changes from
parent population to offspring and thus has to be kept rather limited
in order to achieve enough genetic diversity.
Population to generation size ratio When deciding on an optimal algorithm strategy, a fixed number of chromosomes to be generated in the experiments has to be defined. Indeed, if this is not the case, it is easy to obtain much better results simply by letting the algorithm run for more generations or with bigger populations, as genetic algorithms are iterative global optimization algorithms. This obviously requires a lot more computational effort and leads us to another parameter that has to be considered: the population size to generation number ratio (P/G). This ratio can be modified while keeping the total number of rendered chromosomes constant. The smaller the population, the higher the number of computed generations but the higher the chance of premature convergence. With bigger populations, fewer generations can be calculated and it is less likely that convergence will be reached.
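As a small worked illustration of the P/G trade-off, the sketch below lists how many generations fit into a fixed budget of evaluated chromosomes for a few population sizes. The budget and the population sizes are hypothetical values chosen only for this example.

```python
def population_generation_splits(total_chromosomes, population_sizes):
    """Pair each population size with the number of generations the budget allows."""
    return [
        (pop, total_chromosomes // pop)
        for pop in population_sizes
        if 0 < pop <= total_chromosomes
    ]


# Hypothetical budget of 10000 evaluated chromosomes.
splits = population_generation_splits(10000, [50, 100, 500])
ratios = [pop / gens for pop, gens in splits]   # the P/G ratio of each setting
```

Each pair uses the same total computational effort, so results obtained with different P/G ratios remain comparable.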
2.3 Operators
This part of the algorithm contains the actual drivers of the optimization process. The crossover operator (COP) in particular is of main importance. It acts as a driving and regulating force, trying to reach a balance between the two extremes of exploitation and exploration. On the one hand, crossover brings new permutations from different parts of the solution space into the population, while on the other hand it tries to maintain the good elements that have already been found. This way, an optimal solution may be achieved after a certain number of generations based on selection and recombination. Obviously, it is essential for the crossover operator to behave in an efficient way. For example, the crossover operator of the canonical genetic algorithm should not be used for something like the traveling salesman problem (TSP): the single point crossover operator would deliver solutions that are not valid and, when using an adjacency representation, makes no attempt to pass good subpaths that already exist in the population on to the next generation. The representation of solution candidates is closely tied to specific recombination operators. By way of illustration, we use the TSP case to underscore the importance of a good crossover operator. [50, 49]
Two representations that can be applied to the TSP are considered: path representation and adjacency representation. In path representation, each permutation simply lists the order in which cities are visited. The adjacency representation instead uses a permutation to describe the visiting order in a very different way. The location of a gene in its chromosome, i.e. the index (starting at 1) of a number in the permutation, defines a direct link to this number or gene. For example, in permutation [7 6 8 5 3 4 2 1] the value 4 sits at index 6, so city 4 is connected to city 6. Crossover operators are often not interchangeable between representations. To find and implement the optimal crossover operator, it is therefore important to use the right representation. Two crossover operators that can be used on these different representations are briefly discussed below.
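To make the adjacency representation concrete, the following sketch converts an adjacency permutation into a path by following the links from index to value. It also reports when a permutation does not describe one single tour. The function name is hypothetical and the second example permutation is made up for illustration.

```python
def adjacency_to_path(adjacency, start=1):
    """Follow the links of an adjacency permutation (cities numbered from 1).

    Returns the tour in path representation, or None when the links
    fall apart into more than one cycle.
    """
    path = []
    seen = set()
    city = start
    while city not in seen:
        seen.add(city)
        path.append(city)
        city = adjacency[city - 1]   # the value at index `city` is the next city
    if len(path) == len(adjacency) and city == start:
        return path
    return None


# The permutation from the text: index 6 holds value 4, so city 6 links to city 4.
tour = adjacency_to_path([7, 6, 8, 5, 3, 4, 2, 1])

# A made-up permutation with two separate cycles (1-2 and 3-4): not a valid tour.
broken = adjacency_to_path([2, 1, 4, 3])
```

The second case shows why operators for this representation need an explicit validity check, even though every city number occurs exactly once.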
Alternating Edge Recombination This recombination method goes along
with the adjacency representation. It takes into account that although
all elements are unique, i.e. each city number only exists once in a
permutation, such tours can still be illegal as they can consist of more
than one cycle. Alternating Edge Recombination (AER) works as follows. First, a random starting edge is chosen from parent 1. This
subtour is then extended with the corresponding edge from parent 2, and so on, alternating between the parents.
In the end, a valid permutation can be reached which combines the
edges from the parents in their offspring. However, subpaths that are
passed from parents to offspring contain only one edge at a time. Good
subpaths are therefore disrupted by the operator which suggests lower
performance. It is essential for crossover operators to keep subpaths
of good solution candidates in the population. [49]
Edge Recombination Crossover This crossover operator works together with path representation, the most intuitive yet very potent representation. As described above, crossover operators need to be conservative regarding subtours of solution candidates. Edge Recombination Crossover (ERX) tries to achieve this goal while keeping the number of new edges at a minimum. Other path representation crossover operators like Order Crossover (OX) also keep subtours in the population as much as possible but introduce a lot of new edges while doing so. In fact, crossover operators only have to implement recombination; introducing new edges is already the task of the mutation operator. Therefore, ERX tries to introduce as few mutations as possible, rendering offspring chromosomes that are practically solely based upon recombination, i.e. practically every edge in the offspring chromosome comes from at least one of the two parents. ERX has a very high time and space complexity. This is explained by its working procedure. First, an edge map of the parents is constructed. This map describes which cities are connected to each city, looking at the edges in both parents. So, for example, parents [5 8 2 6 4 7 9 1 3] and [2 1 3 5 8 6 4 9 7] would give an edge map in which city 1 is linked with cities 9, 3 and 2. As we can see, one of the edges of city 1 (1-3) is common to both parents. Common edges lead to fewer connected cities in the edge map. ERX first chooses a random city from one of the two parents. After that, the city cannot be chosen anymore and is deleted from the left column of the edge map. All cities connected to it are then evaluated in the edge map, giving the city with the lowest number of connected cities the highest priority. If two cities have equal numbers of connected cities, one of them is chosen at random. If the current city has no connected cities left but there still are unvisited cities, one of these is selected at random as the next visited city. If there are no unvisited cities left, the algorithm terminates. The resulting permutation is the child chromosome. [49]
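The ERX procedure can be sketched as follows. This is a minimal illustration under a few assumptions: tours are treated as closed (the last city connects back to the first), a visited city is removed from all neighbour lists of the edge map (the usual formulation of the operator), and the function names are hypothetical.

```python
import random


def build_edge_map(parent_a, parent_b):
    """Map every city to the set of its neighbours in either parent tour."""
    edge_map = {city: set() for city in parent_a}
    for tour in (parent_a, parent_b):
        n = len(tour)
        for i, city in enumerate(tour):
            edge_map[city].add(tour[i - 1])          # previous city (cyclic)
            edge_map[city].add(tour[(i + 1) % n])    # next city (cyclic)
    return edge_map


def edge_recombination(parent_a, parent_b, rng=random):
    """Build one child tour that reuses parental edges as much as possible."""
    edge_map = build_edge_map(parent_a, parent_b)
    current = rng.choice(parent_a)       # both parents contain the same cities
    child = [current]
    unvisited = set(parent_a) - {current}
    while unvisited:
        # A visited city can no longer be chosen as a neighbour.
        for neighbours in edge_map.values():
            neighbours.discard(current)
        candidates = edge_map[current]
        if candidates:
            # Prefer the neighbour with the fewest remaining connections.
            fewest = min(len(edge_map[c]) for c in candidates)
            ties = sorted(c for c in candidates if len(edge_map[c]) == fewest)
            current = rng.choice(ties)
        else:
            # Dead end: continue from a random unvisited city.
            current = rng.choice(sorted(unvisited))
        child.append(current)
        unvisited.discard(current)
    return child


parent_1 = [5, 8, 2, 6, 4, 7, 9, 1, 3]
parent_2 = [2, 1, 3, 5, 8, 6, 4, 9, 7]
edges = build_edge_map(parent_1, parent_2)
child = edge_recombination(parent_1, parent_2, random.Random(1))
```

For the example parents, the entry for city 1 holds three neighbours instead of four because the edge 1-3 occurs in both tours. Building and updating the full edge map for every child is what makes the operator comparatively expensive.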
The other class of operators needed in genetic algorithms, albeit of less importance, are the mutation operators (MOP). They are used to introduce new, random edges in the population in order to increase genetic diversity and explore other parts of the solution space. The operators considered in this description are the Reciprocal Exchange and Inversion operators. Reciprocal exchange, also called swap mutation, point mutation, order-based mutation or simply exchange mutation, chooses two random cities in a tour and exchanges them. Inversion works by selecting a random subtour of a solution candidate, inverting it and inserting it back in the chromosome at a random location. It is also called cut-inversion mutation. The effect of using different mutation operators is expected to be rather low, as every operator in fact simply introduces randomness without further relevance or background information. [49]
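Both mutation operators are short enough to sketch directly. The code below is an illustration on path representation; the function names are hypothetical and the example tour is made up.

```python
import random


def reciprocal_exchange(tour, rng=random):
    """Swap two randomly chosen cities and return the mutated copy."""
    i, j = rng.sample(range(len(tour)), 2)
    mutated = tour[:]
    mutated[i], mutated[j] = mutated[j], mutated[i]
    return mutated


def cut_inversion(tour, rng=random):
    """Cut out a random subtour, invert it and reinsert it at a random place."""
    i, j = sorted(rng.sample(range(len(tour) + 1), 2))
    segment = tour[i:j][::-1]            # the inverted subtour
    rest = tour[:i] + tour[j:]           # the tour without that subtour
    k = rng.randint(0, len(rest))        # random insertion point
    return rest[:k] + segment + rest[k:]


example_tour = [1, 2, 3, 4, 5, 6, 7, 8]
swapped = reciprocal_exchange(example_tour, random.Random(0))
inverted = cut_inversion(example_tour, random.Random(0))
```

Both functions return a new list and leave the parent tour untouched, so the result is always a valid permutation of the same cities.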
Figure 2.2: Schematic representation of the Stochastic Universal Sampling selection method.
2.4 Methods of selection
Efficient methods of selection are necessary in every genetic algorithm. The
selection part of the algorithm determines which individuals from a start
population are allowed to mate and generate offspring chromosomes, which are
then reinserted in the population, leading to a next-generation set of solution
candidates. By way of illustration, three different methods of selection are
explained. The exact method of selection can be based on different rules
and conditions, in which individual fitness plays an important role.
Roulette Wheel Selection One selection method is Roulette Wheel
Selection (RWS), also called Fitness Proportional Selection. It is
analogous to SUS, described in the next item, in that the same bar
representation is used. However, every selected slice is chosen by an
independent random draw, in contrast to
the evenly spaced intervals used in SUS. This implies purely fitness-based
selection, with the selection probability of an individual being exactly the
same as its relative fitness, or its relative copy number in the case of linear ranking. In this method, individuals with much higher fitness values are
greatly favoured compared to others and may dominate next-generation populations due to high selection pressure. Rank-based Roulette
Wheel Selection may reduce this effect. [49]
Stochastic Universal Sampling Another possible method of selecting parent
chromosomes is Stochastic Universal Sampling (SUS). The total fitness
is pictured as a bar (from 0 to F) divided into slices with
lengths proportional to the fitness value of every individual. In this method,
the total fitness (F) is divided by the number of individuals that have to
be selected, resulting in evenly spaced fitness intervals. In contrast
to RWS, only one random number between 0 and the interval length
is generated at the beginning. This is the starting point. The slice
that contains the starting point is the first selected individual. After
that, the starting point is moved over one interval length and again
the relevant slice or individual is selected. This process is repeated
until the end of the bar is reached. See figure 2.2 for a more graphical
representation of this methodology.
There are two possible ways of representing this bar. The most common one is described above. Another way is linear ranking. For this,
individual fitness is ranked, with lower fitnesses gaining a lower integer
ranking number than higher ones. Each individual is copied a specific
number of times according to its ranking value. Copies with the same
rank then combine to form a slice. This way, slice lengths are defined
by copy number. [49]
Tournament Selection The last widely applied selection method is Tournament Selection (TS). In this case, a subpopulation of k randomly
chosen individuals is picked from the population. From this subgroup,
the individual with highest fitness is selected. This is repeated until the desired number of selected individuals is reached. The value
of k has to be defined in the algorithm. With high values of k, a
high selection pressure is expected as many individuals will compete
against each other and only the fittest wins. Lowering k will reduce
the selection pressure and the chance of premature convergence. [49]
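The three selection methods can be compared side by side in a short sketch. The code below is illustrative only and is not FIGARO's implementation. Function names are hypothetical, fitness values are assumed to be non-negative with higher values being better, and each function returns the indices of the selected individuals.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def _slice_index(bounds, pointer):
    """Index of the bar slice that contains the pointer position."""
    return min(bisect_right(bounds, pointer), len(bounds) - 1)


def roulette_wheel_selection(fitnesses, n, rng=random):
    """RWS: every pick is an independent random position on the bar."""
    bounds = list(accumulate(fitnesses))
    return [_slice_index(bounds, rng.uniform(0, bounds[-1])) for _ in range(n)]


def stochastic_universal_sampling(fitnesses, n, rng=random):
    """SUS: one random start, then n evenly spaced pointers along the bar."""
    bounds = list(accumulate(fitnesses))
    step = bounds[-1] / n
    start = rng.uniform(0, step)
    return [_slice_index(bounds, start + i * step) for i in range(n)]


def tournament_selection(fitnesses, n, k, rng=random):
    """TS: the fittest of k randomly drawn individuals wins each round."""
    selected = []
    for _ in range(n):
        contestants = rng.sample(range(len(fitnesses)), k)
        selected.append(max(contestants, key=lambda i: fitnesses[i]))
    return selected
```

With SUS, an individual holding half of the total fitness receives half of the pointers in a single run. RWS only guarantees this proportion on average.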
Part II
FIGARO
Chapter 3
Application overview
Code Snippet 3.1: Currently hard-coded program parameters.

    ## Percentage elitism
    pctE = 0.25
    ## Percentage crossover
    pctX = 0.6
    ## Percentage mutation
    pctM = 0.005
    ## Population size
    popSize = 100
    ## Subpopulation size
    subpopSize = 10
    ## Number of putative binding sites to be inspected
    nrBindingSites = 5
    ## Number of generations
    genSize = 50
    ## Maximum resolution of structures in initial population
    maxRes = 2.0
    ## Maximum number of entities in structure
    maxEnt = 1
    ## Minimum similarity to query ligand
    minSim = 0.3
3.1 General outline
In this section, we give a bird's-eye view of the application and describe its global workflow. Over the next few sections, every aspect of
the application is presented and clarified on the basis of the
corresponding code snippets.
The code in snippet 3.2 represents the backbone of the program.
Note that the program in its current state still has many hard-coded parameters, which future releases could more elegantly query as user input at
runtime. These parameters are given in code snippet 3.1,
accompanied by short explanations in the code comments. The values given
to them here serve as an example for the program to be run with. However,
the main input for the program still needs to be defined. This is the ligand
for which we aim to generate an optimal receptor structure. It can be
any small molecule, such as a designed compound or a natural organic
molecule like caffeine. For our tests we used a web service that picks random
molecules from the ZINC database, hosted by the BIRC at the University
of California.¹ When the program is run, the first task is
mining the PDB for high-quality receptor structures that are known
to bind ligands similar to our input molecule. When enough templates are
collected, i.e. the defined population size is reached, the algorithm proceeds
by replicating every individual a number of times according to the defined
subpopulation size. While doing this, point mutations are introduced with
a predefined probability (pctM as shown in code snippet 3.1).
Code Snippet 3.2: Backbone of the application.

    ## Generate random small molecule from http://bcirc.docking.org/random.shtml
    ## and get SMILES (currently hardcoded).
    randMol = "CCOc1cccc(n1)NC(=O)c2cc[nH]n2"

    ## Initialize the population.
    population = initialize_pop(randMol, popSize, subpopSize, maxRes,
                                maxEnt, minSim, nrBindingSites, pctM)

    ## Write initial information for result evaluation.
    write_initial_info(population)

    ## Perform GA on population for specified number of generations.
    for generation in range(0, genSize):
        for index, subpopulation in enumerate(population):
            ## Predict initial structures.
            population[index] = workers.map(predict_structure_backrub, subpopulation)
            ## Dock ligand to all individuals in subpopulation.
            population[index] = workers.map(dock, population[index])
            ## Select individuals for mating in subpopulations.
            population[index] = roulette_wheel_selection(population[index], pctE)
            ## Perform crossover between individuals in subpopulations.
            two_points_crossover(population[index], pctX)
            #single_point_crossover(newPopulation, pctX)
            ## Predict structure of recombined sequences.
            population[index] = workers.map(predict_structure_homology_fast, population[index])
            ## Dock ligand to all recombinations in subpopulation.
            population[index] = workers.map(dock, population[index])

        if generation < genSize - 1:
            ## Select mother proteins for new population based on competition
            ## between each subpopulation's best competitor.
            motherSelection = roulette_wheel_selection(
                get_candidates_list(population), replace=False)
            ## Generate new population and mutate mother proteins.
            population = prepare_next_gen(population, motherSelection,
                                          subpopSize, pctM, generation)
        else:
            bestProtein = report_best(population)
            print "Program finished. The best protein is " + bestProtein.id + \
                  " with a score of " + str(bestProtein.score) + \
                  " on binding site " + bestProtein.bestBindingSite + "."

¹ This website can be consulted at http://bcirc.docking.org/random.shtml
After all sequences have been generated, their structures are modeled by
the backrub application from the Rosetta suite. The exact mechanism for
this is described in later sections. A molecular docking job is then performed on all of these individuals and fitness scores are collected. Based
on these scores, proteins eligible for mating are selected in each subpopulation and sequences are recombined with a given crossover probability
(pctX as shown in code snippet 3.1). Next, a comparative modeling step returns the conformational models for all recombined structures based on a fast homology
modeling algorithm, and molecular docking is performed again to evaluate
the recombined child sequences. Eventually the best candidates are picked
from all subpopulations. A final selection round using RWS, as described
in section 2.4 on page 40, chooses a new set of mother proteins to form the
subpopulations of the next generation. The process is then repeated until
the defined number of generations is reached. The program finishes by
reporting information about the best protein in the final population.
FIGARO needs a lot of CPU power to return acceptable
results. As is the case for all genetic algorithms, a larger population size and number of generations generally improve the results, but
also require considerably more computational resources. To reduce computation
time significantly, FIGARO supports multiprocessing to distribute the work
over multiple processor cores. The implementation we have opted for is the
multiprocessing package for Python, illustrated in code snippet 3.3.
Depending on the machine the program is running on, the number of worker
processes can be adjusted to the number of available CPU cores so that the
workload is parallelized efficiently.

Code Snippet 3.3: Support for distributed computing.

    import multiprocessing

    ## Set number of processor cores
    workers = multiprocessing.Pool(100)
Code Snippet 3.4: Implementation of the PDB mining process.

    ## Creates initial PDB population and download to folder pdb.
    ## Receptors for drug-like molecules similar to the random molecule are collected.
    def initialize_pop(randMol, popSize, subpopSize, maxRes, maxEnt,
                       minSim, nrBindingSites, pctM):
        population = []
        usedPdb = []
        similarity = 1
        while similarity > minSim and len(population) != popSize:
            pdbOut = urllib2.urlopen(
                "http://www.rcsb.org/pdb/rest/smilesQuery?smiles=" + randMol +
                "&search_type=similarity&similarity=" + str(similarity)
            ).read().splitlines()
            for line in pdbOut:
                if len(population) != popSize:
                    hit = re.search('(?<=structureId=").*?(?=")', line)
                    if hit:
                        pdb = hit.group()
                        chain = re.search(
                            '(?<=chain id=")\w(?=")',
                            urllib2.urlopen(
                                "http://www.rcsb.org/pdb/rest/describeMol?structureId=" + pdb
                            ).read()).group()
                        if pdb not in usedPdb + exceptionList and \
                                quality_check(pdb, maxRes, maxEnt):
                            ligand = re.search('(?<=chemicalID=").*?(?=")', line).group()
                            write_pdb(pdb, ligand)
                            fix_pdb(pdb)
                            extract_chain(pdb, chain)
                            seq = get_seq(pdb)
                            bindingSites = get_binding_sites(pdb, nrBindingSites)
                            motherProtein = Protein(pdb, seq, chain, bindingSites)
                            subpop = create_subpopulation(motherProtein, subpopSize, pctM)
                            population.append(subpop)
                            usedPdb.append(pdb)
            similarity = similarity - 0.1

        if len(population) == popSize:
            print "Population initialization finished."
            return population
        else:
            sys.exit("Not enough available templates for these parameters.")
3.2 Mining the PDB
Mining the PDB is the first step of the algorithm and is crucial to its further
progress. The complete code responsible for this is given in code snippet
3.4. As described in the previous section, the program tries to find existing
templates in the Protein Data Bank that bind to compounds showing high
similarity to our input molecule. This, however, is not always straightforward, as any
input molecule can be given. Moreover, it is likely that the given compound
does not act as a known natural ligand.
For this reason, the program looks for structures in an iterative manner.
It starts by screening the PDB for structures binding ligands that have 100%
similarity to the query compound. If this does not fill up the total population,
which is very likely, the similarity threshold drops by 10 percentage points. This process is repeated
until the population is saturated with templates or the threshold reaches the
minimum similarity (minSim). In the latter
case, the program aborts and notifies the user that not enough templates were
found for the given set of parameters.
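The iterative relaxation of the similarity threshold can be isolated from the web queries as follows. This is a minimal sketch rather than the code of snippet 3.4. `collect_templates` and its `fetch_hits` argument are hypothetical, with `fetch_hits` standing in for the PDB query at a given similarity. The threshold is tracked in integer percent, an assumption made here to avoid floating-point drift from repeatedly subtracting 0.1.

```python
def collect_templates(fetch_hits, pop_size, min_sim, step_pct=10):
    """Lower the similarity threshold step by step until enough templates are found.

    fetch_hits(similarity) is a hypothetical callable returning the PDB ids
    of structures that bind ligands with at least the given similarity.
    """
    templates = []
    seen = set()
    sim_pct = 100
    min_pct = int(round(min_sim * 100))
    while sim_pct > min_pct and len(templates) < pop_size:
        for pdb_id in fetch_hits(sim_pct / 100.0):
            if len(templates) == pop_size:
                break
            # Skip templates that were already mined at a higher threshold.
            if pdb_id not in seen:
                seen.add(pdb_id)
                templates.append(pdb_id)
        sim_pct -= step_pct
    if len(templates) < pop_size:
        raise RuntimeError("Not enough available templates for these parameters.")
    return templates
```

As in snippet 3.4, the loop stops before querying at the minimum similarity itself, so with minSim = 0.3 the last query uses a threshold of 0.4.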
Thankfully, mining the PDB is greatly facilitated by the available RESTful
Web Services API. This API supports queries based on ligands that can
be given by their corresponding SMILES representation. The similarity percentage can be included in the URL. Upon request, carried out by the urllib2
Python package, the server returns an XML file which is parsed manually.
An example XML file returned this way is given in snippet 3.5. We are particularly interested in the structureId attribute of the ligand element of this
file. This is the identifier of the PDB template we are looking for. A regular
expression extracts this id and stores it as a variable. If this identifier is not
contained in the list of already mined PDB files, the procedure continues
and another regular expression extracts the id of the ligand. This needs to
be done because the downloaded PDB file still contains the ligand coordinates. These have to be filtered out for the molecular docking process. The
write_pdb function takes care of this, again using a custom regex, and saves
the PDB file in the appropriate folder. The PDB file still needs further preparation
though. By filtering out the ligand, missing residues are introduced.
Moreover, a PDB file can contain more than one chain, which is not compatible
with our algorithm. The fix_pdb and extract_chain functions were implemented for that reason. These are not given here but can be consulted in
the complete source code disclosed in the attachments (appendix B) starting
at page 86. After these steps, the PDB file is ready for use and is passed
to the get_binding_sites function described in the next section. An array of
binding site centers is returned and stored in a Protein object together with
other metadata like sequence and chain id. This class is described in section
3.4.

Code Snippet 3.5: Example XML file returned by web server.

    <?xml version="1.0" standalone="no"?>
    <smilesQueryResult smiles="CCOc1cccc(n1)NC(=O)c2cc[nH]n2" search_type="3" similarity="0.55">
      <ligandInfo>
        <ligand structureId="4N9B" chemicalID="2HH" type="non-polymer" molecularWeight="202.213">
          <chemicalName>1-METHYL-N-(PYRIDIN-3-YL)-1H-PYRAZOLE-5-CARBOXAMIDE</chemicalName>
          <formula>C10 H10 N4 O</formula>
          <InChIKey>LSEUGEFZKAZGSI-UHFFFAOYSA-N</InChIKey>
          <InChI>InChI=1S/C10H10N4O/c1-14-9(4-6-12-14)10(15)13-8-3-2-5-11-7-8/h2-7H,1H3,(H,13,15)</InChI>
          <smiles>Cn1c(ccn1)C(=O)Nc2cccnc2</smiles>
        </ligand>
      </ligandInfo>
    </smilesQueryResult>
Code Snippet 3.6: Program implementation for binding site detection.

    ## Identifies putative binding pockets using the SiteHound package
    ## (no installation needed, i386 libraries have to be installed for pdb2gmx).
    def get_binding_sites(pdb, nrBindingSites):
        bsCenterList = []
        startingDir = os.getcwd()
        os.chdir(startingDir + "/pdb")
        os.system("./auto.py -i " + pdb + ".pdb -p CMET -k")
        with open(pdb + "_CMET_summary.dat", "r") as summary:
            for i in range(0, nrBindingSites):
                line = summary.readline()
                x = line.split()[-3]
                y = line.split()[-2]
                z = line.split()[-1]
                bsCenterList.append("(" + x + "," + y + "," + z + ")")
        ## Remove temporary files
        os.system("rm " + pdb + "_* " + pdb + ".easymifs")
        ## Return to main directory.
        os.chdir(startingDir)
        print "Binding sites found"
        return bsCenterList
3.3 Screening for binding sites
The mining of PDB files based on a query ligand only serves as a guideline for our program. To keep bias low, the whole protein conformation is
screened for a range of putative binding sites after being prepared for use.
Eventually our compound will be docked to all of these sites, an approach
better known as blind docking. The SiteHound software package is used to
screen for possible binding sites, up to the maximum number of binding pockets defined in the global program parameters. [52]
The implementation of this part of the application is shown in code snippet
3.6.
The SiteHound algorithm is in some respects comparable to general molecular docking. It uses a small methyl carbon or phosphate probe to screen
the surface of a protein for favourable weak interactions. The resulting points
are clustered according to their spatial proximity by an agglomerative hierarchical
clustering algorithm. A list of these clusters is returned in the end, corresponding to the identified binding pockets. These are ranked based on their
total interaction energy and are stored by FIGARO. An array of binding
site centers is returned, with maximum length nrBindingSites as defined in
code snippet 3.1. According to a test study based on 77 experimentally derived protein
structures linked to known protein-ligand complexes, the actual binding site
was among the top three binding pockets recognized by SiteHound in 95
percent of the cases. [53]
SiteHound offers two possibilities concerning probe type. We opted for
the CMET or methyl carbon probe. This is the most appropriate one for
docking general drug-like compounds. [53]
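The way binding site centers are collected can be shown independently of SiteHound itself. The sketch below assumes, as snippet 3.6 appears to do, that every line of the summary file describes one ranked pocket and that its last three whitespace-separated fields are the x, y and z coordinates of the pocket center. The function name and the example lines in the usage note are hypothetical, and the real file layout may differ.

```python
def parse_site_centers(summary_lines, max_sites):
    """Return up to max_sites center strings of the form "(x,y,z)".

    Assumption: each line ends with the x, y and z coordinates of a pocket
    center, and lines are already ranked by interaction energy.
    """
    centers = []
    for line in summary_lines[:max_sites]:
        fields = line.split()
        if len(fields) < 3:
            # Skip blank or truncated lines instead of failing on them.
            continue
        x, y, z = fields[-3:]
        centers.append("(" + x + "," + y + "," + z + ")")
    return centers
```

Called on the made-up lines `["1 -250.1 30 12.5 -3.0 7.25", "2 -180.0 21 0.0 1.0 2.0"]`, the function returns one center string per line, in the format that is later written to the CENTER entry of the docking system file.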
Code Snippet 3.7: Implementation of the Protein class.

    class Protein():

        def __init__(self, id, seq, chain, bindingSites):
            self.id = id
            self.seq = seq
            self.chain = chain
            self.bindingSites = bindingSites
            self.score = None
            self.parents = []
            self.crossoverPoints = []
            self.bestBindingSite = None
            self.pointMutations = {}

        def set_score(self, score):
            self.score = score

        def set_parents(self, protA, protB=None, crossoverPoint1=None,
                        crossoverPoint2=None):
            self.parents.append((protA.id, protA.seq))
            if protB and re.search("recombined", protB.id):
                self.parents.append((protB.parents[0][0], protB.parents[0][1]))
            if protB and not re.search("recombined", protB.id):
                self.parents.append((protB.id, protB.seq))

            if crossoverPoint1:
                self.crossoverPoints = [crossoverPoint1]
                self.crossoverPoints.append(crossoverPoint2)

        def set_best_binding_site(self, bindingSite):
            self.bestBindingSite = bindingSite

        def set_id(self, id):
            self.id = id

        def update_seq(self, seq):
            self.seq = seq

        def set_point_mutations(self, pointMutations):
            self.pointMutations = pointMutations
3.4 The Protein class
The Protein class comprises all metadata about a protein structure, such
as its sequence, chain id, identified binding sites and parental sequences. It
was made to centralize important parts of the code at the level of individual
proteins. The implementation of this class is presented in code snippet 3.7.
The class provides functionality to manage all protein data efficiently.
When protein sequences are mutated, it is important to keep data about
these mutations in memory because they are used later to perform structure prediction based on template structures. To manage this, every protein
object holds a pointMutations map as shown in snippet 3.7. It stores
the index of the mutated amino acid in the sequence along with its original and new value, and can be set from an outside call to the set_point_mutations
function. Furthermore, setters are available for storing docking scores, best
binding site coordinates, updated sequences, new protein ids and parental
sequences. When setting parental sequences, the exact crossover points are
also stored in the appropriate variable, as they are crucial for the homology modeling processes later on.
Code Snippet 3.8: Implementation of the mutation operator.

    ## Point mutation operator
    def point_mutation(protein, pctM):
        mutable = list(protein.seq)
        pointMutations = {}
        for index, aminoAcid in enumerate(mutable):
            if random.random() <= pctM and not (protein.seq[index] == "C"
                                                or protein.seq[index] == "P"):
                mutable[index] = random.choice(aminoAcids)
                if protein.seq[index] != mutable[index]:
                    pointMutations[index+1] = protein.seq[index] + mutable[index]
        protein.update_seq("".join(mutable))
        protein.set_point_mutations(pointMutations)
        print "protein " + protein.id + " mutated."
        return protein
3.5 Modeling point mutations
When an initial population and its subpopulations are formed, new sequences are created through the introduction of point mutations to the
mother or template proteins. Although a single point mutation is not likely
to change the protein fold significantly, mutations accumulate quickly over the
course of the program and incrementally alter protein conformation by triggering backrub moves on the protein backbone. This phenomenon is described earlier in section 1.5.1 on page 25.
However, not all point mutations can be treated equally; there are a few
notable exceptions. The first one is proline, which is known to force the protein
backbone into certain conformations due to its uncommon and constraining
way of chaining in a polypeptide. To suit our purpose, this amino acid is
therefore never introduced nor replaced by the mutation operator, as predictions would likely become unreliable. Moreover, cysteine is known to form
disulfide bridges or cystine bonds between distant parts of the polypeptide
chain. These are strong bonds and if broken, protein conformation is not
guaranteed to be stable anymore. For this reason, cysteine is excluded from
the set of available amino acids, too.
The mutation operator loops over a given protein sequence one amino
acid at a time. At every position, it produces a random float between 0 and 1. If this
value does not exceed the pctM parameter, the residue is exchanged by random
selection from the set of available amino acids. This continues until the end
of the sequence is reached. In the meantime, a mutation map is kept in
memory which stores the 1-based sequence indices as keys and the original
and new residues as values.
Eventually, this mutation map is stored in the appropriate Protein object
and the sequence field is updated.
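The behaviour of the mutation operator can be reproduced on a plain sequence string. The sketch below is a simplified stand-in for snippet 3.8. It works on strings instead of Protein objects, `point_mutate` and `AMINO_ACIDS` are names chosen here, and the set of available amino acids is assumed to be the twenty standard residues minus cysteine and proline.

```python
import random

# Assumption: the available residues are the 20 standard amino acids
# without cysteine (C) and proline (P).
AMINO_ACIDS = "ADEFGHIKLMNQRSTVWY"


def point_mutate(seq, pct_m, rng=random):
    """Return the mutated sequence and a map {1-based index: original + new}."""
    mutable = list(seq)
    mutations = {}
    for index, original in enumerate(seq):
        # Cysteine and proline are never replaced.
        if original in "CP":
            continue
        if rng.random() <= pct_m:
            replacement = rng.choice(AMINO_ACIDS)
            # Drawing the same residue again is not recorded as a mutation.
            if replacement != original:
                mutable[index] = replacement
                mutations[index + 1] = original + replacement
    return "".join(mutable), mutations
```

The two-character value keeps the original residue at position 0 and the new residue at position 1, which is the character that is later written to the resfile.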
Code Snippet 3.9: Implementation of the point mutation modeling process
based on RosettaBackrub [54] for flexible backbone sampling.

    def predict_structure_backrub(protein):
        parser = PDBParser()
        structure = parser.get_structure(protein.id[0:4],
                                         "pdb/" + protein.id[0:4] + ".pdb")
        ## Give every thread its own input and output files.
        os.system("cp pdb/" + protein.id[0:4] + ".pdb pdb/" + protein.id + ".pdb")
        atomList = []
        for atom in structure.get_atoms():
            atomList.append(atom)
        nbs = NeighborSearch(atomList)

        for mutation in protein.pointMutations:
            ## Write resfile and get all residues within 6 Angstrom of mutated
            ## residues using Biopython package.
            with open("resfile" + protein.id, "w") as resfile:
                resfile.write("NATRO\nstart\n")
                affectedList = []
                residue = structure.get_chains().next()[mutation]
                for atom in Selection.unfold_entities(residue, "A"):
                    for neighbor in nbs.search(atom.get_coord(), 6, level="R"):
                        affectedList.append(get_resi(neighbor))
                resfile.write(str(mutation) + " " + protein.chain + " PIKAA " +
                              protein.pointMutations[mutation][1] + "\n")
                ## Make affectedList unique.
                affectedList = list(set(affectedList))
                pivotList = []
                for res in affectedList:
                    if res != mutation:
                        resfile.write(str(res) + " " + protein.chain + " NATAA\n")
                    pivotList.append(res)
                    if res != 1:
                        pivotList.append(res - 1)
                    if res != len(protein.seq):
                        pivotList.append(res + 1)
                pivotList = list(set(pivotList))
            ## Run backrub sampling with 10.000 Monte Carlo iterations.
            command = ROSETTADIR + "/main/source/bin/backrub.linuxgccrelease -database " + ROSETTADIR + \
                      "/main/database -s pdb/" + protein.id + ".pdb -nstruct 1 -backrub:ntrials " + \
                      str(10000) + " -resfile resfile" + protein.id + " -pivot_residues "
            for pivot in pivotList:
                command += str(pivot) + " "
            os.system(command)
            ## Refine side chains using rotamer libraries (SCWRL package),
            ## switch hashes to activate. Deactivation preferred.
            # os.system("./scwrl4/Scwrl4 -i " + protein.id + "_0001.pdb -o pdb/" + protein.id + ".pdb")
            os.system("mv " + protein.id + "_0001.pdb pdb/" + protein.id + ".pdb")
            ## Remove temporary files.
            os.system("rm " + protein.id + "_* resfile*")
            ## Fix PDB (only needed when side chain refinement is not used)
            fix_pdb(protein.id)
        print "Protein " + protein.id + " modeled using RosettaBackrub and SCWRL4."
        return protein
Creating reliable models of point mutated structures is one of the most
crucial parts of FIGARO. A well-validated strategy is therefore indispensable and several points of attention should be taken into consideration.
The complete implementation of all steps is disclosed in code snippet 3.9.
Using the Biopython bioinformatics software package, the template PDB
file is parsed and all atoms are loaded into memory. After that, all residues
having atoms within a 6 Å radius of the mutated amino acid are selected.
These are later added to the job command together with their neighbouring
residues to define which points in the sequence will be used as backrub pivots; all these residues are assumed to be affected by the point mutation. The
RosettaBackrub algorithm also requires a so-called resfile to work. This file
defines the exact location of the mutated residue and its new value. Moreover, it contains lines specifying all affected residues as native amino acids
with repacked side chains (using the NATAA flag in code snippet
3.9), enabling their side chain rotamers to be sampled while preserving the
native amino acid type. The third and last required input is the number of
Monte Carlo iterations. By default, this is set to 10 000 as suggested in the
literature. [54] Increasing this number is expected to make the predictions
more reliable, leading to another trade-off between application speed and
precision. Functionality to refine side chain conformations by SCWRL4 [55]
is also available, but is not activated by default. After the model has been
generated, all temporary files are deleted and the new PDB file is stored in
the appropriate folder. The id field in the Protein object links to this file by
name.
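The construction of the resfile and the pivot list can be separated from the Biopython and Rosetta calls. The function below is an illustrative sketch, not part of FIGARO. Its name and signature are hypothetical, and it takes the residue numbers found within 6 Å as a ready-made list instead of running a neighbour search.

```python
def build_resfile_and_pivots(position, new_residue, chain, affected, seq_len):
    """Return the resfile text and the sorted backrub pivot residues.

    position    -- 1-based index of the mutated residue
    new_residue -- one-letter code of the residue to introduce
    affected    -- residue numbers with atoms near the mutated residue
    seq_len     -- sequence length, used to stay inside the chain
    """
    # NATRO is the default for all residues that are not listed explicitly.
    lines = ["NATRO", "start", "%d %s PIKAA %s" % (position, chain, new_residue)]
    pivots = set()
    for res in sorted(set(affected)):
        if res != position:
            # Keep the native amino acid but allow its side chain to repack.
            lines.append("%d %s NATAA" % (res, chain))
        # Every affected residue and its direct sequence neighbours are pivots.
        pivots.add(res)
        if res != 1:
            pivots.add(res - 1)
        if res != seq_len:
            pivots.add(res + 1)
    return "\n".join(lines) + "\n", sorted(pivots)
```

For a mutation to alanine at position 5 of a six-residue chain A with affected residues 4, 5 and 6, the resfile contains one PIKAA line for residue 5, NATAA lines for residues 4 and 6, and the pivot list is [3, 4, 5, 6].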
Code Snippet 3.10: Implementation of the ligand docking process using
rDock [56].

    def dock(protein):
        print "Docking started for protein " + protein.id + "."
        bestScore = None
        bestBindingSite = None
        ## Convert pdb to mol2 file using OpenBabel package.
        os.system("babel -h ./pdb/" + protein.id + ".pdb " + protein.id + ".mol2")
        for bindingSite in protein.bindingSites:
            ## First write system file.
            with open("rDockSystem" + protein.id + ".prm", "w") as systemFile:
                systemFile.write(
                    "RBT_PARAMETER_FILE_V1.00\nTITLE gart_DUD\nRECEPTOR_FILE " +
                    protein.id + ".mol2\nRECEPTOR_FLEX 3.0\nSECTION MAPPER\n"
                    "SITE_MAPPER RbtSphereSiteMapper\nCENTER " + bindingSite +
                    "\nRADIUS 15.0\nSMALL_SPHERE 1.5\nMIN_VOLUME 100\n"
                    "MAX_CAVITIES 1\nVOL_INCR 0.0\nGRIDSTEP 0.5\nEND_SECTION\n"
                    "SECTION CAVITY\nSCORING_FUNCTION RbtCavityGridSF\n"
                    "WEIGHT 1.0\nEND_SECTION")
            ## Generate cavity for docking.
            os.system("rbcavity -was -d -r rDockSystem" + protein.id + ".prm")
            ## Perform docking.
            os.system("rbdock -i ligand.sd -o output" + protein.id +
                      " -r rDockSystem" + protein.id + ".prm -p dock.prm -n 1")
            ## Return docking score.
            with open("output" + protein.id + ".sd", "r") as dockingResult:
                for line in dockingResult:
                    if line == ">  <SCORE>\n":
                        dockingScore = float(dockingResult.next().strip())
            if bestScore is None or dockingScore < bestScore:
                bestScore = dockingScore
                bestBindingSite = bindingSite
        protein.set_score(bestScore)
        protein.set_best_binding_site(bestBindingSite)

        os.system("rm " + protein.id + ".mol2 output" + protein.id +
                  ".sd rDockSystem" + protein.id +
                  "_cav1.grd rDockSystem" + protein.id + ".as rDockSystem" +
                  protein.id + ".prm")
        print "Docking finished for protein " + protein.id + "."
        return protein
3.6 Ligand docking
In this section, we discuss the ligand docking process and its implementation as shown in snippet 3.10. This blind docking procedure mainly
consists of four parts: converting the PDB file to a mol2 file, generating the
receptor cavity at the targeted binding site, performing the actual ligand
docking job and selecting the best docking score across the
different binding sites.
Again, several parameters need to be set adequately for reliable results.
Based on suggestions made in rDocks documentation and some own intuition about the given ligand molecule, setting up these parameters can be
done in a fairly straightforward way by generating a .prm rDock system file
satisfying the format constraints given in rDocks documentation1 . The con1
The full documentation for rDock can be downloaded from the projects website at
http://rdock.sourceforge.net/
63
tent we used for our molecule is shown in lines 10 to 12 of code snippet 3.10.
First, the receptor mol2 file and the tolerated backbone flexibility parameter are defined. We opted for a 3
A flexibility range which should provide a
substantial degree of conformational adaptability. Next, the method of cavity grid construction is specified. For our purpose, it is encouraged to opt
for the RbtSphereSiteMapper algorithm provided by rDock. This method
uses a small sphere with adjustable radius to explore the surface around
the given center coordinates. Moreover, a maximum binding site radius is
defined which should be chosen according to the structural information of
the ligand compound. The maximum number of binding pockets to be constructed was set to 1. For the other parameters, we used the default values
as they should not alter results significantly. However, all parameters mentioned in this paragraph should be carefully chosen and evaluated for every
single docking situation.
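The parameter file described above can be sketched as a small template function. This is an illustrative Python 3 sketch, not the code used in FIGARO: the function name build_rdock_prm and its keyword arguments are hypothetical, the keywords and default values mirror those written in snippet 3.10, and the CENTER value is passed through verbatim because its exact formatting should be checked against rDock's documentation.

```python
def build_rdock_prm(receptor_mol2, center, radius=15.0, small_sphere=1.5,
                    min_volume=100, max_cavities=1, receptor_flex=3.0):
    """Return the text of an rDock system (.prm) file for one binding site.

    `center` is inserted as-is, so it must already be formatted the way
    rDock expects for the CENTER keyword.
    """
    lines = [
        "RBT_PARAMETER_FILE_V1.00",
        "TITLE gart_DUD",
        "RECEPTOR_FILE {}".format(receptor_mol2),
        "RECEPTOR_FLEX {:.1f}".format(receptor_flex),
        "SECTION MAPPER",
        "    SITE_MAPPER RbtSphereSiteMapper",
        "    CENTER {}".format(center),
        "    RADIUS {:.1f}".format(radius),
        "    SMALL_SPHERE {:.1f}".format(small_sphere),
        "    MIN_VOLUME {}".format(min_volume),
        "    MAX_CAVITIES {}".format(max_cavities),
        "    VOL_INCR 0.0",
        "    GRIDSTEP 0.5",
        "END_SECTION",
        "SECTION CAVITY",
        "    SCORING_FUNCTION RbtCavityGridSF",
        "    WEIGHT 1.0",
        "END_SECTION",
    ]
    return "\n".join(lines) + "\n"
```

Keeping the values as named arguments makes it easier to re-evaluate them per docking situation, as recommended above, without editing a long string literal.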
To be able to use our models, they need to be converted to a different file format. The OpenBabel
package takes care of that and generates the appropriate mol2 files. It is important to note that FIGARO
has many dependencies that all need to be installed and set up correctly in order to work. In case of
problems with this in further research, please contact the author.
The docking scores produced by rDock estimate changes in free energy upon binding. For thermodynamically
favourable binding situations, these will be negative. Absolute values are therefore used as a fitness
measure in our application. Also note that the docking scores themselves are the result of a genetic
algorithm within rDock. They are estimates that can be made more precise by increasing the number of
generations or by running multiple docking jobs and taking averages. To shorten the runtime of our
application, we perform each docking job only once. In the end, a fitness score only serves as a
guideline and does not need to be too precise.
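The score handling described above can be illustrated with a short Python 3 sketch. It assumes that the docking output is an SD file in which each pose stores its total score on the line following a ">  <SCORE>" data header, as read in snippet 3.10. The helper names parse_sd_scores and best_site are hypothetical and not part of FIGARO. Averaging over several runs is included only to show how the estimate could be refined.

```python
def parse_sd_scores(sd_text):
    """Collect every value stored under the SCORE data field of an SD file."""
    scores = []
    lines = sd_text.splitlines()
    for i, line in enumerate(lines[:-1]):
        # A data header looks like ">  <NAME>"; the value is on the next line.
        if line.startswith(">") and "<SCORE>" in line:
            scores.append(float(lines[i + 1].strip()))
    return scores


def best_site(scores_by_site):
    """Return (site, mean score, fitness) for the most favourable site.

    The mean of all available runs is used per site; the most negative
    mean wins and its absolute value serves as the fitness.
    """
    means = {site: sum(vals) / len(vals)
             for site, vals in scores_by_site.items() if vals}
    site = min(means, key=means.get)
    return site, means[site], abs(means[site])
```

With a single docking run per site, as in our application, the mean simply equals that one score.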
Code Snippet 3.11: Implementation of the two-points crossover operator.

def two_points_crossover(population, pctX):
    if not len(population) > 1:
        population[0].set_parents(protA=population[0])
        population[0].set_id(population[0].id + 'recombined')
        print 'No crossover possible.'
    else:
        random.shuffle(population)
        for index, protein in enumerate(population):
            if random.random() <= pctX:
                if index != len(population) - 1:
                    ## Recombine with the next protein in the shuffled list.
                    minLength = min([len(protein.seq), len(population[index+1].seq)])
                    crossoverPoint1 = random.randint(0, minLength)
                    crossoverPoint2 = random.randint(crossoverPoint1, minLength)
                    protein.set_parents(protA=protein, protB=population[index+1],
                                        crossoverPoint1=crossoverPoint1,
                                        crossoverPoint2=crossoverPoint2)
                    protein.update_seq(protein.seq[0:crossoverPoint1] +
                                       population[index+1].seq[crossoverPoint1:crossoverPoint2] +
                                       protein.seq[crossoverPoint2:])
                    protein.set_id(protein.id + 'recombined')
                else:
                    ## Last protein: recombine with the original sequence of the previous one.
                    minLength = min([len(protein.seq), len(population[index-1].parents[0][1])])
                    crossoverPoint1 = random.randint(0, minLength)
                    crossoverPoint2 = random.randint(crossoverPoint1, minLength)
                    protein.set_parents(protA=protein, protB=population[index-1],
                                        crossoverPoint1=crossoverPoint1,
                                        crossoverPoint2=crossoverPoint2)
                    protein.update_seq(protein.seq[0:crossoverPoint1] +
                                       population[index-1].parents[0][1][crossoverPoint1:crossoverPoint2] +
                                       protein.seq[crossoverPoint2:])
                    protein.set_id(protein.id + 'recombined')
                print 'Crossover finished.'
            else:
                protein.set_parents(protA=protein)
                print 'No crossover.'
3.7 Modeling recombined sequences
Modeling recombined sequences can be a very tricky task. When large parts of random protein sequences
are exchanged, the conformational topology is likely to be altered dramatically and it is not guaranteed
that the resulting structure will even be stable. Moreover, a full ab initio remodeling of the generated
protein sequences would be intractable with current computational resources. However, homologous
sequences are far less likely to introduce large conformational changes upon recombination, a fact that
enables crossover to be a major driving force in natural evolution. By splitting up our genetic
algorithm into subpopulations, we can group sequences by homology and have them mate in isolated islands;
this way, recombined sequences can be modeled with high precision and reliability using homology
modeling solutions.
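To make the idea of homology-based islands concrete, the following Python 3 sketch computes a simple position-wise sequence identity and groups sequences greedily around founder sequences. It is only an illustration under stated assumptions: sequences are compared without alignment, the function names and the threshold default are hypothetical, and FIGARO itself forms its subpopulations differently (see the main script in appendix B).

```python
def sequence_identity(seq_a, seq_b):
    """Fraction of identical residues, compared position by position.

    Positions beyond the end of the shorter sequence count as mismatches.
    """
    longest = max(len(seq_a), len(seq_b))
    if longest == 0:
        return 1.0
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return matches / longest


def group_into_islands(sequences, threshold=0.95):
    """Assign each sequence to the first island whose founder it resembles."""
    islands = []
    for seq in sequences:
        for island in islands:
            if sequence_identity(seq, island[0]) >= threshold:
                island.append(seq)
                break
        else:
            # No sufficiently similar founder found: start a new island.
            islands.append([seq])
    return islands
```

A high threshold keeps every island close to its founder, which is the property that makes homology modeling of the recombined sequences plausible.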

To generate recombined sequences, two crossover operators were implemented. One of them, the two-points
crossover operator, is shown in snippet 3.11. For every individual in the subpopulation, a random float
is generated and compared to the fixed pctX parameter defined in snippet 3.1. If the value does not
exceed pctX, a subsequence of arbitrary length is selected between two points and copied into the
individual from a randomly assigned partner, namely its neighbour in the shuffled subpopulation.
Afterwards, metadata about the parental chains and the crossover points is saved in the Protein object.
If the value is bigger than pctX, the sequence remains unaltered, the protein is registered as its own
single parent and no crossover points are set.
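Stripped of the Protein bookkeeping, the core of the operator can be sketched on plain strings. This Python 3 sketch is illustrative only; the names splice and two_point_crossover are hypothetical, and the cut points are drawn in the same way as in snippet 3.11 (the second point is never smaller than the first, and both are limited by the shorter sequence).

```python
import random


def splice(seq_a, seq_b, point1, point2):
    """Return seq_a with the segment [point1, point2) taken from seq_b."""
    return seq_a[:point1] + seq_b[point1:point2] + seq_a[point2:]


def two_point_crossover(seq_a, seq_b, rng=random):
    """Draw two ordered cut points and splice seq_b's segment into seq_a."""
    min_length = min(len(seq_a), len(seq_b))
    point1 = rng.randint(0, min_length)
    point2 = rng.randint(point1, min_length)
    return splice(seq_a, seq_b, point1, point2), (point1, point2)
```

Because the segment is exchanged at identical positions, the child always keeps the length of the first parent, which keeps the alignment to both parents trivial.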
Code Snippet 3.12: Implementation of the homology modeling procedure using MODELLER. [42]

def predict_structure_homology_fast(protein):
    if len(protein.parents) > 1:
        ## Write the alignment of the recombined sequence to its parents.
        write_aln(protein)
        env = modeller.environ()
        env.io.atom_files_directory = './pdb'
        templates = []
        for parent in protein.parents:
            templates.append(parent[0])
        a = modeller.automodel.automodel(env, alnfile=protein.id + 'total.ali',
                                         knowns=tuple(templates), sequence=protein.id)
        ## Build a single model.
        a.starting_model = 1
        a.ending_model = 1
        a.make()
        ## [...] preferred.
        # os.system('./scwrl4/Scwrl4 -i ' + protein.id + '.B99990001.pdb -o final' + protein.id + '.pdb')
        os.system('cp ' + protein.id + '.B99990001.pdb final' + protein.id + '.pdb')
        ## Replace previous PDB with model.
        os.system('mv final' + protein.id + '.pdb ./pdb/' + protein.id + '.pdb')
        ## Fix PDB (only needed when side chain refinement is not used)
        ## [...]
        os.system('rm ' + protein.id + '[...]')
        ## Modeller does not add chain ids by itself.
        add_chain_id(protein.id, protein.chain)
        print 'Protein ' + protein.id + ' modeled using MODELLER.'
    else:
        ## No recombination took place: reuse the structure of the single parent.
        os.system('cp pdb/' + protein.parents[0][0] + '.pdb pdb/' + protein.id + '.pdb')
With sequence identities of 95% and above, comparative modeling algorithms should be capable of
returning highly reliable models of the recombined protein sequences. [42, 41] Another asset is that the
program knows exactly how sequences have evolved over the course of a run. It is therefore very
straightforward to supply correct alignments to MODELLER, the comparative modeling package we opted for.
We implemented a very fast strategy to do this, as shown in code snippet 3.12. Usually, sequence
databases are first screened for homologous structures and user interaction is needed to select a
suitable template. This takes a lot of time and can fortunately be skipped in our implementation.
Alignments are constructed directly by the write_aln function, which uses the sequence metadata stored
in the Protein objects. The implementation of this function can be found in the attachments chapter
(appendix B). Again, functionality to refine side chains using SCWRL4 is available but not enabled by
default.
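As an illustration of how such an alignment could be written without any database search, the Python 3 sketch below builds a PIR-style alignment text for a recombined sequence and its parents. It is not the write_aln function from appendix B: the function names are hypothetical, residue numbering is assumed to start at 1 and to be contiguous, all templates are assumed to share one chain id, and the record layout follows the PIR convention used by MODELLER as we understand it (a ">P1;code" line, a ten-field description line, and a sequence terminated by "*"). Since crossover happens at identical positions, the sequences are simply padded with gap characters to a common length.

```python
def pir_entry(code, sequence, is_template, chain="A", first=1, last=None):
    """Format one PIR record; templates carry a residue range and chain id."""
    if is_template:
        if last is None:
            # Gap characters do not correspond to residues in the structure.
            last = first + len(sequence.replace("-", "")) - 1
        header = "structureX:{0}:{1}:{2}:{3}:{2}::::".format(code, first, chain, last)
    else:
        header = "sequence:{0}::::::::".format(code)
    return ">P1;{0}\n{1}\n{2}*\n".format(code, header, sequence)


def build_alignment(child_id, child_seq, parents, chain="A"):
    """Align a recombined sequence to its parents position by position.

    `parents` is a list of (identifier, sequence) tuples.
    """
    width = max([len(child_seq)] + [len(seq) for _, seq in parents])
    records = [pir_entry(pid, seq.ljust(width, "-"), True, chain)
               for pid, seq in parents]
    records.append(pir_entry(child_id, child_seq.ljust(width, "-"), False))
    return "\n".join(records)
```

The resulting text would be saved to the alignment file that is passed to MODELLER, with the parent identifiers matching the PDB files in the template directory.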
3.8 The selection operator
In the current revision of FIGARO, only one selection operator is implemented; it is shown in code
snippet 3.13. The roulette wheel selection procedure used is completely analogous to the description
given in section 2.4 starting at page 40. In addition, our roulette wheel selection function takes two
extra arguments, the first being pctE, which defines the percentage of elitism. This parameter is part
of the global program parameters shown in snippet 3.1. It should be noted, however, that the current
implementation of the selection operator does not support elitism yet: pctE is accepted but not used.
To extend the capabilities of our genetic algorithm, this needs to be added in a next release.
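One possible way to add elitism is sketched below in Python 3. This is not part of FIGARO: the function name and signature are hypothetical, and fitness values are assumed to be non-negative (for docking scores, the absolute value could be used). The fittest fraction of the population is copied unchanged and the remaining slots are filled by ordinary roulette wheel selection.

```python
import random


def roulette_wheel_with_elitism(population, fitness, pct_elite, rng=random):
    """Select len(population) individuals: elites first, roulette for the rest."""
    size = len(population)
    n_elite = int(round(pct_elite * size))
    ranked = sorted(population, key=fitness, reverse=True)
    selected = ranked[:n_elite]

    weights = [fitness(ind) for ind in population]
    total = sum(weights)
    while len(selected) < size:
        spin = rng.uniform(0, total)
        running = 0.0
        for individual, weight in zip(population, weights):
            running += weight
            if running >= spin:
                selected.append(individual)
                break
        else:
            # Guard against floating point round-off on the last slot.
            selected.append(population[-1])
    return selected
```

With pct_elite set to 0 the function reduces to plain roulette wheel selection, so existing behaviour could be kept as the default.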
The other additional argument is a boolean specifying whether PDB files need to be copied and replaced
for the selected proteins. This is only desirable for selection within the subpopulations before
recombination. The second selection step happens at the population level, when new template proteins
need to be chosen for the next generation, and is followed by a part of the program with different
requirements. The boolean is therefore used to switch between the two levels of selection. Tournament
selection (also described in section 2.4) is another approach that comes to mind for the latter
selection round and could be implemented in future releases of the application. It provides more
flexibility regarding selection pressure: for example, a larger tournament size increases selection
pressure and thereby gives more weight to the best individuals from every subpopulation.
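A minimal Python 3 sketch of tournament selection is given below to show how the tournament size controls selection pressure. The function is hypothetical and not part of FIGARO; higher fitness is assumed to be better.

```python
import random


def tournament_selection(population, fitness, n_select, tournament_size, rng=random):
    """Run n_select tournaments; each picks the fittest of a random subset."""
    size = min(tournament_size, len(population))
    winners = []
    for _ in range(n_select):
        # Contestants are drawn without replacement within one tournament.
        contestants = rng.sample(population, size)
        winners.append(max(contestants, key=fitness))
    return winners
```

A tournament size of 1 amounts to uniform random selection, whereas a tournament that includes the whole population always returns the single best individual.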
Code Snippet 3.13: Implementation of the selection operator.

def roulette_wheel_selection(population, pctE, replace=True):
    newPopulation = []
    for i in range(0, len(population)):
        ## Sum of all scores defines the size of the wheel.
        totalScore = 0
        for protein in population:
            totalScore += protein.score
        rouletteResult = random.uniform(0, totalScore)
        ## Walk along the wheel until the drawn value is reached.
        index = 0
        currentTotal = population[index].score
        while currentTotal < rouletteResult and index < len(population) - 1:
            index += 1
            currentTotal += population[index].score
        if replace:
            newId = population[index].id[0:4] + 'selected' + str(i+1)
            newPopulation.append(Protein(newId, population[index].seq,
                                         population[index].chain,
                                         population[index].bindingSites))
            ## Rename PDBs to link with new ids.
            os.system('cp pdb/' + population[index].id + '.pdb pdb/' + newId + '.pdb')
        else:
            newPopulation.append(population[index])
    ## Clean up PDB folder.
    for protein in population:
        os.system('rm pdb/' + protein.id + '.pdb')
    print 'Roulette Wheel Selection succeeded.'
    return newPopulation
Chapter 4
Discussion
A program like FIGARO is of no use unless it has been experimentally validated. Considering the scope
of this master's thesis, this was not feasible. For the same reason, no results of the application are
given; these would be meaningless without experimental verification. This thesis was written with the
aim of providing a new perspective on computational receptor optimization, taking into consideration as
many biochemical and technical points of attention as possible. For every aspect of the application,
strong scientific documentation is available to give it a foundation to build on. However, this is not
enough to guarantee flawless behaviour. To be more precise, I would not even recommend it for use right
now.
The first thing that obviously needs to be done is verifying the generated structures. It is very
likely that they differ significantly from experimentally derived structures. On the other hand, FIGARO
offers many parameters that can be fine-tuned in order to get more reliable results. A pipeline based
on experimental feedback could greatly improve FIGARO's performance. For now, however, it is largely a
matter of guessing whether results will even approach the real structures. The evaluation of this
application by fellow researchers is another important source of improvement. As I was completely on my
own implementing this program, it could be greatly improved by independent evaluation. Having said
this, I hereby kindly request any possible support from the scientific community.
In addition, a lot of computational functionality that was not implemented yet could easily be added in
future releases. This will give the program even more flexibility and potential, as was noted in
section 3.8. For example, the number of implemented evolutionary operators is rather limited due to the
restricted time frame. Another notable example in this regard is the distributed computing support.
Although this was implemented to a certain degree, much larger improvements with respect to runtime can
be achieved. A main suggestion for this is to integrate support for Apache Hadoop and MapReduce to make
use of as much computational power as possible and lift the application to a whole new level of
computational abilities.
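The map and reduce steps that such a setup would distribute can be sketched on a single machine. The Python 3 sketch below is only an analogy and does not use Hadoop: a thread pool from the standard library stands in for the distributed workers (reasonable here because docking runs as an external process), and the function names as well as the score_fn argument are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_population(population, score_fn, max_workers=4):
    """Map step: score every individual concurrently, preserving order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(score_fn, population))
    return list(zip(population, scores))


def best_individual(scored):
    """Reduce step: keep the individual with the lowest (most negative) score."""
    return min(scored, key=lambda pair: pair[1])
```

Because every individual is scored independently, the map step parallelizes without coordination between workers; only the reduce step needs to see all results.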
Chapter 5
Conclusion
FIGARO is a promising new approach to optimize artificially designed receptor structures for a given
ligand. While computational optimization is often seen in the literature for ligand molecules, this
work tackles the problem the other way around: optimizing receptor structures for random ligand
compounds. Although the program's reliability has not yet been tested against experimentally
determined structures, it holds great potential in several fields of research. The one particularly
capturing our imagination is without a doubt the application in immunotherapy treatment of cancer.
FIGARO offers a new bridge between very specific, even artificially designed chemical structures and
proteins that can vary widely in function. Unique compounds on the surface of cancer cells could act
as input structures for FIGARO to target.
Eventually, not only receptors could be optimized. Ultimately, this project aims at providing a
starting point in the process of designing new enzymes involved in all kinds of cellular processes.
Binding substrates based on an induced fit mechanism is one of the key concepts in enzyme dynamics.
FIGARO currently tries to optimize just that, but further additions concerning enzyme mechanics are
indispensable in order to be useful in the task of artificial enzyme design. The most important
consideration in this regard is the stabilization of the transition state complex of a reaction.
Integrating this functionality is not straightforward and certainly requires more than one full-time
equivalent, but FIGARO's backbone should nevertheless support it through an extension of the fitness
score calculation.
As a concluding remark, FIGARO was not made with the intention of being the holy grail for cancer
treatment or artificial enzyme design, but aims to provide a new and promising approach to
computational receptor optimization. With the 50-year-old problem of protein structure prediction, the
many advances already achieved in this field, the availability of enormous distributed computing
capacity and the proven usability of very sophisticated evolutionary algorithms, FIGARO follows a path
with potential for the future.
Part III
Appendix
Appendix A
Bibliography
[1] I. Rocha, P. Maia, P. Evangelista, P. Vilaca, S. Soares, J. P. Pinto, J. Nielsen, K. R. Patil,
E. C. Ferreira, and M. Rocha. OptFlux: an open-source software platform for in silico metabolic
engineering. BMC Syst Biol, 4:45, Apr 2010. [PubMed Central:PMC2864236]
[DOI:10.1186/1752-0509-4-45] [PubMed:20403172].
[2] P. Pharkya, A. P. Burgard, and C. D. Maranas. OptStrain: a computational framework for redesign
of microbial production systems. Genome Res., 14(11):2367-2376, Nov 2004.
[DOI:10.1101/gr.2872004] [PubMed:15520298].
[3] O. Khersonsky, D. Rothlisberger, O. Dym, S. Albeck, C. J. Jackson, D. Baker, and D. S. Tawfik.
Evolutionary optimization of computationally designed enzymes: Kemp eliminases of the KE07 series.
J. Mol. Biol., 396(4):1025-1042, Mar 2010. [DOI:10.1016/j.jmb.2009.12.031] [PubMed:20036254].
[4] L. Giger, S. Caner, R. Obexer, P. Kast, D. Baker, N. Ban, and D. Hilvert. Evolution of a designed
retro-aldolase leads to complete active site remodeling. Nat. Chem. Biol., 9(8):494-498, Aug 2013.
[PubMed Central:PMC3720730] [DOI:10.1038/nchembio.1276] [PubMed:23748672].
[5] V. Nanda and R. L. Koder. Designing artificial enzymes by intuition and computation. Nat Chem,
2(1):15-24, Jan 2010. [PubMed Central:PMC3443871] [DOI:10.1038/nchem.473] [PubMed:21124375].
[6] S. Paul, S. A. Planque, Y. Nishiyama, C. V. Hanson, and R. J. Massey. Nature and nurture of
catalytic antibodies. Adv. Exp. Med. Biol., 750:56-75, 2012. [DOI:10.1007/978-1-4614-3461-0_5]
[PubMed:22903666].
[7] Y. Xu, N. Yamamoto, and K. D. Janda. Catalytic antibodies: hapten design strategies and screening
methods. Bioorg. Med. Chem., 12(20):5247-5268, Oct 2004. [DOI:10.1016/j.bmc.2004.03.077]
[PubMed:15388154].
[8] D. Hilvert. Critical analysis of antibody catalysis. Annu. Rev. Biochem., 69:751-793, 2000.
[DOI:10.1146/annurev.biochem.69.1.751] [PubMed:10966475].
[9] C. Jackel, P. Kast, and D. Hilvert. Protein design by directed evolution. Annu Rev Biophys,
37:153-173, 2008. [DOI:10.1146/annurev.biophys.37.032807.125832] [PubMed:18573077].
[10] P. Molina-Espeja, E. Garcia-Ruiz, D. Gonzalez-Perez, R. Ullrich, M. Hofrichter, and M. Alcalde.
Directed evolution of unspecific peroxygenase from Agrocybe aegerita. Appl. Environ. Microbiol.,
80(11):3496-3507, Jun 2014. [DOI:10.1128/AEM.00490-14] [PubMed:24682297].
[11] M. Sela, F. H. White, and C. B. Anfinsen. Reductive cleavage of disulfide bridges in
ribonuclease. Science, 125(3250):691-692, Apr 1957. [PubMed:13421663].
[12] Carl Branden and John Tooze. Introduction to Protein Structure. Garland Science, 1999.
[13] S. C. Lovell, I. W. Davis, W. B. Arendall, P. I. de Bakker, J. M. Word, M. G. Prisant,
J. S. Richardson, and D. C. Richardson. Structure validation by Calpha geometry: phi,psi and Cbeta
deviation. Proteins, 50(3):437-450, Feb 2003. [DOI:10.1002/prot.10286] [PubMed:12557186].
[14] F. Crick. Central dogma of molecular biology. Nature, 227(5258):561-563, Aug 1970.
[PubMed:4913914].
[15] Alka Dwevedi. Protein Folding: Examining the Challenges from Synthesis to Folded Form
(SpringerBriefs in Biochemistry and Molecular Biology). Springer, 2014.
[16] C. B. Anfinsen. Principles that govern the folding of protein chains. Science,
181(4096):223-230, Jul 1973. [PubMed:4124164].
[17] C. B. Anfinsen and H. A. Scheraga. Experimental and theoretical aspects of protein folding.
Adv. Protein Chem., 29:205-300, 1975. [PubMed:237413].
[18] C. Levinthal. How to Fold Graciously. Mossbauer Spectroscopy in Biological Systems: Proceedings
of a meeting held at Allerton House, Monticello, Illinois, 1969.
[19] K. A. Dill. Polymer principles and protein folding. Protein Sci., 8(6):1166-1180, Jun 1999.
[DOI:10.1110/ps.8.6.1166] [PubMed:10386867].
[20] Louise A Wallace and C Robert Matthews. Sequential vs. parallel protein-folding mechanisms:
experimental tests for complex folding reactions. Biophysical Chemistry, 101-102:113-131, 2002.
Special issue in honour of John A Schellman.
[21] L. Pauling, R. B. Corey, and H. R. Branson. The structure of proteins; two hydrogen-bonded
helical configurations of the polypeptide chain. Proc. Natl. Acad. Sci. U.S.A., 37(4):205-211,
Apr 1951. [PubMed Central:PMC1063337] [PubMed:14816373].
[22] M. Novotny and G. J. Kleywegt. A survey of left-handed helices in protein structures.
J. Mol. Biol., 347(2):231-241, Mar 2005. [DOI:10.1016/j.jmb.2005.01.037] [PubMed:15740737].
[23] W. J. van Heeckeren, J. W. Sellers, and K. Struhl. Role of the conserved leucines in the leucine
zipper dimerization motif of yeast GCN4. Nucleic Acids Res., 20(14):3721-3724, Jul 1992.
[PubMed Central:PMC334023] [PubMed:1641337].
[24] L. Lo Conte, B. Ailey, T. J. Hubbard, S. E. Brenner, A. G. Murzin, and C. Chothia. SCOP: a
structural classification of proteins database. Nucleic Acids Res., 28(1):257-259, Jan 2000.
[PubMed Central:PMC102479] [PubMed:10592240].
[25] F. M. Pearl, C. F. Bennett, J. E. Bray, A. P. Harrison, N. Martin, A. Shepherd, I. Sillitoe,
J. Thornton, and C. A. Orengo. The CATH database: an extended protein family resource for structural
and functional genomics. Nucleic Acids Res., 31(1):452-455, Jan 2003. [PubMed Central:PMC165509]
[PubMed:12520050].
[26] R. D. Finn, A. Bateman, J. Clements, P. Coggill, R. Y. Eberhardt, S. R. Eddy, A. Heger,
K. Hetherington, L. Holm, J. Mistry, E. L. Sonnhammer, J. Tate, and M. Punta. Pfam: the protein
families database. Nucleic Acids Res., 42(Database issue):D222-230, Jan 2014.
[PubMed Central:PMC3965110] [DOI:10.1093/nar/gkt1223] [PubMed:24288371].
[27] H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov, and
P. E. Bourne. The Protein Data Bank. Nucleic Acids Res., 28(1):235-242, Jan 2000.
[PubMed Central:PMC102472] [PubMed:10592235].
[28] H. Berman, K. Henrick, H. Nakamura, and J. L. Markley. The worldwide Protein Data Bank (wwPDB):
ensuring a single, uniform archive of PDB data. Nucleic Acids Res., 35(Database issue):D301-303,
Jan 2007. [PubMed Central:PMC1669775] [DOI:10.1093/nar/gkl971] [PubMed:17142228].
[29] P. J. Cock, T. Antao, J. T. Chang, B. A. Chapman, C. J. Cox, A. Dalke, I. Friedberg,
T. Hamelryck, F. Kauff, B. Wilczynski, and M. J. de Hoon. Biopython: freely available Python tools
for computational molecular biology and bioinformatics. Bioinformatics, 25(11):1422-1423, Jun 2009.
[DOI:10.1093/bioinformatics/btp163] [PubMed:19304878].
[30] David L. Nelson and Michael M. Cox. Lehninger Principles of Biochemistry. W. H. Freeman, 2008.
[31] H. J. Schneider. Limitations and extensions of the lock-and-key principle: differences between
gas state, solution and solid state structures. Int J Mol Sci, 16(4):6694-6717, Mar 2015.
[PubMed Central:PMC4424984] [DOI:10.3390/ijms16046694] [PubMed:25815592].
[32] W. L. Jorgensen. Rusting of the lock and key model for protein-ligand binding. Science,
254(5034):954-955, Nov 1991. [PubMed:1719636].
[33] T. Keleti. Two rules of enzyme kinetics for reversible Michaelis-Menten mechanisms. FEBS Lett.,
208(1):109-112, Nov 1986. [PubMed:3770204].
[34] Y. Liu and B. Kuhlman. RosettaDesign server for protein design. Nucleic Acids Res., 34(Web
Server issue):W235-238, Jul 2006. [PubMed Central:PMC1538902] [DOI:10.1093/nar/gkl163]
[PubMed:16845000].
[35] J. C. Kendrew, G. Bodo, H. M. Dintzis, R. G. Parrish, H. Wyckoff, and D. C. Phillips. A
three-dimensional model of the myoglobin molecule obtained by x-ray analysis. Nature,
181(4610):662-666, Mar 1958. [PubMed:13517261].
[36] B. Rost and C. Sander. Bridging the protein sequence-structure gap by structure predictions.
Annu Rev Biophys Biomol Struct, 25:113-136, 1996. [DOI:10.1146/annurev.bb.25.060196.000553]
[PubMed:8800466].
[37] A. Ilari and C. Savino. Protein structure determination by x-ray crystallography. Methods Mol.
Biol., 452:63-87, 2008. [DOI:10.1007/978-1-60327-159-2_3] [PubMed:18563369].
[38] B. Montgomery Pettitt. The unsolved solved-problem of protein folding. J. Biomol. Struct. Dyn.,
31(9):1024-1027, 2013. [PubMed Central:PMC4497552] [DOI:10.1080/07391102.2012.748547]
[PubMed:23384146].
[39] K. Lindorff-Larsen, S. Piana, R. O. Dror, and D. E. Shaw. How fast-folding proteins fold.
Science, 334(6055):517-520, Oct 2011. [DOI:10.1126/science.1208351] [PubMed:22034434].
[40] David E. Shaw, Martin M. Deneroff, Ron O. Dror, Jeffrey S. Kuskin, Richard H. Larson, John K.
Salmon, Cliff Young, Brannon Batson, Kevin J. Bowers, Jack C. Chao, Michael P. Eastwood, Joseph
Gagliardo, J. P. Grossman, C. Richard Ho, Douglas J. Ierardi, Istvan Kolossváry, John L. Klepeis,
Timothy Layman, Christine McLeavey, Mark A. Moraes, Rolf Mueller, Edward C. Priest, Yibing Shan,
Jochen Spengler, Michael Theobald, Brian Towles, and Stanley C. Wang. Anton, a special-purpose
machine for molecular dynamics simulation. SIGARCH Comput. Archit. News, 35(2):1-12, Jun 2007.
[41] Andrew J. W. Orry and Ruben Abagyan. Homology Modeling: Methods and Protocols (Methods in
Molecular Biology). Humana Press, 2012.
[42] B. Webb and A. Sali. Protein structure modeling with MODELLER. Methods Mol. Biol., 1137:1-15,
2014. [DOI:10.1007/978-1-4939-0366-5_1] [PubMed:24573470].
[43] Andreas Kukol. Molecular Modeling of Proteins (Methods in Molecular Biology). Humana Press, 2014.
[44] I. W. Davis, W. B. Arendall, D. C. Richardson, and J. S. Richardson. The backrub motion: how
protein backbone shrugs when a sidechain dances. Structure, 14(2):265-274, Feb 2006.
[DOI:10.1016/j.str.2005.10.007] [PubMed:16472746].
[45] C. A. Smith and T. Kortemme. Backrub-like backbone simulation recapitulates natural protein
conformational variability and improves mutant side-chain prediction. J. Mol. Biol.,
380(4):742-756, Jul 2008. [PubMed Central:PMC2603262] [DOI:10.1016/j.jmb.2008.05.023]
[PubMed:18547585].
[46] X. Y. Meng, H. X. Zhang, M. Mezei, and M. Cui. Molecular docking: a powerful approach for
structure-based drug discovery. Curr Comput Aided Drug Des, 7(2):146-157, Jun 2011.
[PubMed Central:PMC3151162] [PubMed:21534921].
[47] M. Cloutier and P. Wellstead. The control systems structures of energy metabolism. J R Soc
Interface, 7(45):651-665, Apr 2010. [PubMed Central:PMC2842784] [DOI:10.1098/rsif.2009.0371]
[PubMed:19828503].
[48] Michael Affenzeller, Stephan Winkler, Stefan Wagner, and Andreas Beham. Genetic Algorithms and
Genetic Programming: Modern Concepts and Practical Applications. Chapman & Hall/CRC, 1st edition,
2009.
[49] Lawrence David Davis. Evolutionary Computation in Practice (Studies in Computational
Intelligence). Springer, 2008.
[50] David L. Applegate, Robert E. Bixby, Vasek Chvatal, and William J. Cook. The Traveling Salesman
Problem: A Computational Study (Princeton Series in Applied Mathematics). Princeton University
Press, Princeton, NJ, USA, 2007.
[51] Hartmut Pohlheim. Evolutionäre Algorithmen. Verfahren, Operatoren und Hinweise für die Praxis.
Springer, 1999.
[52] D. Ghersi and R. Sanchez. EasyMIFS and SiteHound: a toolkit for the identification of
ligand-binding sites in protein structures. Bioinformatics, 25(23):3185-3186, Dec 2009.
[PubMed Central:PMC2913663] [DOI:10.1093/bioinformatics/btp562] [PubMed:19789268].
[53] M. Hernandez, D. Ghersi, and R. Sanchez. SITEHOUND-web: a server for ligand binding site
identification in protein structures. Nucleic Acids Res., 37(Web Server issue):W413-416, Jul 2009.
[PubMed Central:PMC2703923] [DOI:10.1093/nar/gkp281] [PubMed:19398430].
[54] F. Lauck, C. A. Smith, G. F. Friedland, E. L. Humphris, and T. Kortemme. RosettaBackrub: a web
server for flexible backbone protein structure modeling and design. Nucleic Acids Res., 38(Web
Server issue):W569-575, Jul 2010. [DOI:10.1093/nar/gkq369] [PubMed:20462859].
[55] G. G. Krivov, M. V. Shapovalov, and R. L. Dunbrack. Improved prediction of protein side-chain
conformations with SCWRL4. Proteins, 77(4):778-795, Dec 2009. [DOI:10.1002/prot.22488]
[PubMed:19603484].
[56] S. Ruiz-Carmona, D. Alvarez-Garcia, N. Foloppe, A. B. Garmendia-Doval, S. Juhos, P. Schmidtke,
X. Barril, R. E. Hubbard, and S. D. Morley. rDock: a fast, versatile and open source program for
docking ligands to proteins and nucleic acids. PLoS Comput. Biol., 10(4):e1003571, Apr 2014.
[DOI:10.1371/journal.pcbi.1003571] [PubMed:24722481].
Appendix B
Attachments
B.1 main.py
Code Snippet B.1: Complete source code: main.py.
1 ########################################
2 ####
FIGARO
####
3 #### Author : P i e t e r Noyens
####
4 #### Academic y e a r : 2015 2016
####
5 #### S t u d e n t nr : r0307453
####
6 #### 2MSc B i o i n f o r m a t i c s , KU Leuven ####

7 ########################################
8
9 ## Main s c r i p t c o n t a i n i n g h i g h l e v e l GA b a c k b o n e .
10
11
from f u n c t i o n s import
12
import m u l t i p r o c e s s i n g
13
14 ## S e t number o f p r o c e s s o r c o r e s
15
w o r k e r s = m u l t i p r o c e s s i n g . Pool ( 1 0 0 )
16
17 ## Parameters :
18
19 ## P e r c e n t a g e e l i t i s m
20
pctE = 0 . 2 5
21 ## P e r c e n t a g e c r o s s o v e r
22
pctX = 0 . 6
23 ## P e r c e n t a g e m u t a t i o n
24
pctM = 0 . 0 0 5
25 ## P o p u l a t i o n s i z e
26
p o p S i z e = 100
27 ## S u b p o p u l a t i o n s i z e
28
s u b p o p S i z e = 10
29 ## Number o f p u t a t i v e b i n d i n g s i t e s t o be i n s p e c t e d
30
nrBindingSites = 5
31 ## Number o f g e n e r a t i o n s
32
g e n S i z e = 50
33 ## Maximum r e s o l u t i o n o f s t r u c t u r e s i n i n i t i a l p o p u l a t i o n
34
88
maxRes = 2 . 0
35 ## Maximum number o f e n t i t i e s i n s t r u c t u r e
36
maxEnt = 1
## Minimum similarity to query ligand
minSim = 0.3

## Generate random small molecule from http://bcirc.docking.org/random.shtml and get SMILES (currently hardcoded).
randMol = 'CCOc1cccc(n1)NC(=O)c2cc[nH]n2'

## Initialize the population.
population = initialize_pop(randMol, popSize, subpopSize, maxRes, maxEnt, minSim, nrBindingSites, pctM)

## Write initial information for result evaluation.
write_initial_info(population)

## Perform GA on population for specified number of generations.
for generation in range(0, genSize):
    for index, subpopulation in enumerate(population):
        ## Predict initial structures.
        [...]predict_structure_backrub, subpopulation)
        ## Dock ligand to all individuals in subpopulation.
        [...]
        ## Select individuals for mating in subpopulations.
        population[index] = roulette_wheel_selection(population[index], pctE)
        ## Perform crossover between individuals in subpopulations.
        two_points_crossover(population[index], pctX)
        #single_point_crossover(newPopulation, pctX)
        ## Predict structure of recombined sequences.
        [...]predict_structure_homology_fast, population[index])
        ## Dock ligand to all recombinations in subpopulation.
        [...]
    if generation < genSize - 1:
        ## Select mother proteins for new population based on competition between each subpopulation's best competitor.
        motherSelection = roulette_wheel_selection(get_candidates_list(population), replace=False)
        ## Generate new population and mutate mother proteins.
        population = prepare_next_gen(population, motherSelection, subpopSize, pctM, generation)
    else:
        [...]
        print 'Program finished. The best protein is ' + bestProtein.id + ' with a score of ' + str(bestProtein.score) + \
              ' on binding site ' + bestProtein.bestBindingSite + '.'

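The main loop above relies on roulette wheel selection twice per generation: once inside every subpopulation and once among the best candidates of all subpopulations. Because the docking routine keeps the lowest score as the best one, a fitness-proportionate draw needs a transform so that lower scores receive larger slices of the wheel. The following is only a minimal sketch of that idea in Python 3, not the thesis implementation: the helper names `selection_weights` and `roulette_select` and the offset-based weighting are assumptions made for illustration.

```python
import random


def selection_weights(scores):
    """Turn docking-style scores (lower is better) into positive weights.

    Assumption for this sketch: each weight is the distance to the worst
    score, plus a tiny epsilon so the worst individual keeps a nonzero slice.
    """
    worst = max(scores)
    epsilon = 1e-6
    return [(worst - score) + epsilon for score in scores]


def roulette_select(scores, n, rng=random):
    """Draw n indices (with replacement), proportional to the weights."""
    weights = selection_weights(scores)
    total = sum(weights)
    chosen = []
    for _ in range(n):
        spin = rng.uniform(0, total)
        running = 0.0
        for index, weight in enumerate(weights):
            running += weight
            if spin <= running:
                chosen.append(index)
                break
        else:
            # Guard against floating-point round-off at the upper edge.
            chosen.append(len(weights) - 1)
    return chosen


if __name__ == "__main__":
    example_scores = [-25.0, -10.0, -5.0]
    picks = roulette_select(example_scores, 1000, random.Random(0))
    # Count how often each individual was selected.
    print([picks.count(i) for i in range(len(example_scores))])
```

With this weighting the individual with the most negative score is drawn most often, while the worst individual is almost never drawn; other transforms (for example rank-based weights) would give a gentler selection pressure.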
B.2 functions.py
Code Snippet B.2: Complete source code: functions.py.
########################################
####            FIGARO              ####
#### Academic year: 2015-2016       ####
#### Student nr: r0307453           ####
########################################

## Script containing all low-level GA functions.

from protein import Protein
from Bio.PDB import *
import modeller
import modeller.automodel
import modeller.scripts.complete_pdb
import re
import urllib2
import os
import sys
import random
import copy

ROSETTADIR = 'Rosetta'

exceptionList = ['2OGM', '4Y5G']
aminoAcids = ['A', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M',
              'N', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y']

## Filling up shortcoming of Biopython package: no method for retrieving residue index from residue object available.
def get_resi(res):
    return int(str(res).split()[3][7:])

## Creates initial PDB population and download to folder pdb.
## Receptors for drug-like molecules similar to the random molecule are collected.
def initialize_pop(randMol, popSize, subpopSize, maxRes, maxEnt, minSim, nrBindingSites, pctM):
    population = []
    usedPdb = []
    similarity = 1
    while similarity > minSim and len(population) != popSize:
        pdbOut = urllib2.urlopen('http://www.rcsb.org/pdb/rest/smilesQuery?smiles=' + randMol +
                                 '&search_type=similarity&similarity=' + str(similarity)).read().splitlines()
        for line in pdbOut:
            if len(population) != popSize:
                hit = re.search('(?<=structureId=").*?(?=")', line)
                if hit:
                    pdb = hit.group()
                    chain = re.search('(?<=chain id=")\w(?=")', urllib2.urlopen(
                        'http://www.rcsb.org/pdb/rest/describeMol?structureId=' + pdb).read()).group()
                    if pdb not in usedPdb + exceptionList and quality_check(pdb, maxRes, maxEnt):
                        ligand = re.search('(?<=chemicalID=").*?(?=")', line).group()
                        write_pdb(pdb, ligand)
                        fix_pdb(pdb)
                        extract_chain(pdb, chain)
                        seq = get_seq(pdb)
                        bindingSites = get_binding_sites(pdb, nrBindingSites)
                        motherProtein = Protein(pdb, seq, chain, bindingSites)
                        subpop = create_subpopulation(motherProtein, subpopSize, pctM)
                        population.append(subpop)
                        usedPdb.append(pdb)
        similarity = similarity - 0.1
    if len(population) == popSize:
        print 'Population initialization finished.'
        return population
    else:
        sys.exit('Not enough available templates for these parameters.')

## Writes initial info to file.
def write_initial_info(population):
    with open('initialinfo', 'w') as initialFile:
        for subpop in population:
            for protein in subpop:
                initialFile.write(protein.id + '\t' + protein.seq + '\t' + str(protein.bindingSites) + '\n')
    print 'Initial info written.'

## Extracts single chain from PDB and writes as PDB.
def extract_chain(pdb, chain):
    [...]
    structure = parser.get_structure(pdb, 'pdb/' + pdb + '.pdb')
    Dice.extract(structure, chain, start=0, end=sys.maxint, filename='pdb/' + pdb + '.pdb')

## Extracts sequence from PDB.
def get_seq(pdb):
    [...]
    builder = PPBuilder()
    structure = parser.get_structure(pdb, 'pdb/' + pdb + '.pdb')
    sequenceList = []
    for polypeptide in builder.build_peptides(structure):
        sequenceList.append(str(polypeptide.get_sequence()))
    return ''.join(sequenceList)

## Checks if entity number and resolution are tolerated.
def quality_check(pdb, maxRes, maxEnt):
    description = urllib2.urlopen('http://www.rcsb.org/pdb/rest/describePDB?structureId=' + pdb).read().splitlines()
    eTest = False
    rTest = False
    for line in description:
        e = re.search('(?<=nr_entities=")\w(?=")', line)
        r = re.search('(?<=resolution=").*?(?=")', line)
        if e and int(e.group()) <= maxEnt:
            eTest = True
        if r and float(r.group()) <= maxRes:
            rTest = True
    if eTest and rTest:
        return True
    else:
        return False

## Creates subpopulation.
def create_subpopulation(motherProtein, size, pctM):
    subpop = []
    for i in range(0, size):
        subpop.append(Protein(motherProtein.id, motherProtein.seq, motherProtein.chain, motherProtein.bindingSites))
        point_mutation(subpop[i], pctM)
        subpop[i].set_id(motherProtein.id + [...] + str(i+1))
    return subpop

## Identifies putative binding pockets using the SiteHound package
## (no installation needed, i386 libraries have to be installed for pdb2gmx).
def get_binding_sites(pdb, nrBindingSites):
    bsCenterList = []
    startingDir = os.getcwd()
    os.chdir(startingDir + '/pdb')
    os.system('./auto.py -i ' + pdb + '.pdb -p CMET -k')
    with open(pdb + '_CMET_summary.dat', 'r') as summary:
        for i in range(0, nrBindingSites):
            line = summary.readline()
            x = line.split()[-3]
            y = line.split()[-2]
            z = line.split()[-1]
            bsCenterList.append('(' + x + ',' + y + ',' + z + ')')
    os.system('rm ' + pdb + [...] + pdb + '.easymifs')
    ## Return to main directory.
    os.chdir(startingDir)
    print 'Binding sites found'
    return bsCenterList

## Performs molecular docking of ligand to binding sites (BS) of receptor structures using the rDock package.
## The global docking score function serves as a fitness measure.
## Don't forget to first compile and set up rDock and Open Babel correctly.
def dock(protein):
    print 'Docking started for protein ' + protein.id + '.'
    bestScore = None
    bestBindingSite = None
    ## Convert pdb to mol2 file using OpenBabel package.
    os.system('babel -h ./pdb/' + protein.id + '.pdb ' + protein.id + '.mol2')
    for bindingSite in protein.bindingSites:
        ## First write system file.
        with open('rDockSystem' + protein.id + '.prm', 'w') as systemFile:
            systemFile.write('RBT_PARAMETER_FILE_V1.00\nTITLE gart_DUD\nRECEPTOR_FILE ' + protein.id +
                             '.mol2\nRECEPTOR_FLEX 3.0\nSECTION MAPPER\nSITE_MAPPER RbtSphereSiteMapper\nCENTER ' +
                             bindingSite + '\nRADIUS 15.0\nSMALL_SPHERE 1.5\nMIN_VOLUME 100\nMAX_CAVITIES 1\n' +
                             'VOL_INCR 0.0\nGRIDSTEP 0.5\nEND_SECTION\nSECTION CAVITY\n' +
                             'SCORING_FUNCTION RbtCavityGridSF\nWEIGHT 1.0\nEND_SECTION')
        ## Generate cavity for docking.
        os.system('rbcavity -was -d -r rDockSystem' + protein.id + '.prm')
        ## Perform docking.
        os.system('rbdock -i ligand.sd -o output' + protein.id + ' -r rDockSystem' + protein.id +
                  '.prm -p dock.prm -n 1')
        ## Return docking score.
        with open('output' + protein.id + '.sd', 'r') as dockingResult:
            for line in dockingResult:
                if line == '>  <SCORE>\n':
                    dockingScore = float(dockingResult.next().strip())
        if bestScore is None or dockingScore < bestScore:
            bestScore = dockingScore
            bestBindingSite = bindingSite
    protein.set_score(bestScore)
    protein.set_best_binding_site(bestBindingSite)
    os.system('rm ' + protein.id + '.mol2 output' + protein.id + '.sd rDockSystem' + protein.id +
              '_cav1.grd rDockSystem' + protein.id + '.as rDockSystem' + protein.id + '.prm')
    print 'Docking finished for protein ' + protein.id + '.'

## Reports best individual from population.
def report_best(population):
    currentBest = None
    for subpop in population:
        for protein in subpop:
            if currentBest is None or protein.score < currentBest.score:
                currentBest = protein
    return currentBest

## Evaluates subpopulations and returns list of best candidates.
def get_candidates_list(population):
    candidatesList = []
    for subpop in population:
        currentBest = None
        for protein in subpop:
            if currentBest is None or protein.score < currentBest.score:
                currentBest = protein
        candidatesList.append(currentBest)
    return candidatesList

## Roulette Wheel Selection operator
def roulette_wheel_selection(population, pctE, replace=True):
    newPopulation = []
    for i in range(0, len(population)):
        totalScore = 0
        for protein in population:
            totalScore += protein.score
        rouletteResult = random.uniform(0, totalScore)
        index = 0
        currentTotal = population[index].score
        while currentTotal < rouletteResult and index < len(population) - 1:
            index += 1
            currentTotal += population[index].score
        [...]
        newId = population[index].id[0:4] + 'selected' + str(i+1)
        newPopulation.append(Protein(newId, population[index].seq,
                                     population[index].chain, population[index].bindingSites))
        ## Rename PDBs to link with new ids.
        os.system('cp pdb/' + population[index].id + '.pdb pdb/' + newId + '.pdb')
        if replace == False:
            [...]
            newPopulation.append(population[index])
    ## Clean up PDB folder.
    [...]
        os.system('rm pdb/' + protein.id + '.pdb')
    print 'Roulette Wheel Selection succeeded.'
    [...]

## Two-Points Crossover operator
def two_points_crossover(population, pctX):
    [...]
    else:
        [...]
        [...] minLength)
        [...] population[index+1], crossoverPoint1, crossoverPoint2)
        [...] crossoverPoint1] + population[index+1].seq[crossoverPoint1:crossoverPoint2] + protein.seq[ [...]
    else:
        [...] minLength)
        [...] population[index-1], crossoverPoint1, crossoverPoint2)
        [...] crossoverPoint1] + population[index-1].parents[0][1][crossoverPoint1:crossoverPoint2] + protein.seq[ [...]
    else:
        [...]

## Single-Point Crossover operator
def single_point_crossover(population, pctX):
    [...]
    else:
        [...]
        crossoverPoint = random.randint(0, minLength)
        [...] population[index+1], crossoverPoint1=crossoverPoint)
        [...] crossoverPoint] + population[index+1].seq[crossoverPoint:])
    else:
        crossoverPoint = random.randint(0, minLength)
        [...] population[index-1], crossoverPoint1=crossoverPoint)
        [...] crossoverPoint] + population[index-1].parents[0][1][crossoverPoint:])
    else:
        [...]

## Point mutation operator
def point_mutation(protein, pctM):
    mutable = list(protein.seq)
    pointMutations = {}
    for index, aminoAcid in enumerate(mutable):
        if random.random() <= pctM and not (protein.seq[index] == 'C' or protein.seq[index] == 'P'):
            mutable[index] = random.choice(aminoAcids)
            if protein.seq[index] != mutable[index]:
                pointMutations[index+1] = protein.seq[index] + mutable[index]
    protein.update_seq(''.join(mutable))
    protein.set_point_mutations(pointMutations)
    print 'protein ' + protein.id + ' mutated.'

## Writes PDB structure file with or without ligand.
def write_pdb(pdb, ligand=None, folder='./pdb/'):
    with open(folder + pdb + '.pdb', 'w') as pdbFile:
        pdbOrig = urllib2.urlopen('http://www.rcsb.org/pdb/files/' + pdb + '.pdb').read().splitlines()
        if ligand:
            for line in pdbOrig:
                if not re.match('HETATM\s.*?\s.*?\s' + ligand, line):
                    pdbFile.write(line + '\n')
            print 'PDB file for ' + pdb + ' is written without ligand ' + ligand + '.'
        else:
            for line in pdbOrig:
                pdbFile.write(line + '\n')
            print 'PDB file for ' + pdb + ' is written.'

## Writes PIR alignment file.
def write_pir(protein):
    with open(protein.id + '.ali', 'w') as pirFile:
        pirFile.write('>P1;' + protein.id + '\n' + 'sequence:' + protein.id + ':::::::0.00:0.00\n' + protein.seq + '*')
    print 'PIR file written.'

## Writes alignment file.
def write_aln(protein):
    with open(protein.id + 'total.ali', 'w') as alnFile:
        for parent in protein.parents:
            alnFile.write('>P1;' + parent[0] + '\n' + 'structure:' + parent[0] + ':.:.:.:.::::\n' +
                          parent[1] + '*\n')
        alnFile.write('>P1;' + protein.id + '\n' + 'sequence:' + protein.id + '::::::::\n' +
                      protein.seq + '*\n')
    print 'Alignment file written.'

## Returns list of templates.
def get_templates(prf, cutoff, amount):
    prf.write(file='build_profile.prf', profile_format='TEXT')
    templates = []
    identity = 95
    while identity >= cutoff and len(templates) < amount:
        with open('build_profile.prf', 'r') as profile:
            for line in profile:
                if len(templates) < 2 and re.match([...], line) \
                        and int(line.split()[10].strip('.')) >= identity \
                        and line.split()[1][0:5] not in templates:
                    templates.append(line.split()[1][0:5])
        identity -= 10
    if len(templates) == amount:
        print 'Templates gathered.'
        return templates
    else:
        sys.exit('Not enough templates.')

## Fixes missing residues in PDB file.
def fix_pdb(pdb):
    [...]
    env.libs.topology.read('${LIB}/top_heav.lib')
    env.libs.parameters.read('${LIB}/par.lib')
    m = modeller.scripts.complete_pdb(env, 'pdb/' + pdb + '.pdb')
    m.write(file='pdb/' + pdb + '.pdb')
    print 'PDB fixed.'

## Adds chain identifier to lines in PDB.
def add_chain_id(pdb, chain):
    fixedLines = []
    with open('pdb/' + pdb + '.pdb', 'r') as pdbFile:
        for line in pdbFile:
            if line[0:4] == 'ATOM':
                mutable = list(line)
                mutable[21] = chain
                fixedLines.append(''.join(mutable))
            else:
                fixedLines.append(line)
    with open('pdb/' + pdb + '.pdb', 'w') as pdbFile:
        for line in fixedLines:
            pdbFile.write(line)
    print 'Chain identifier added.'

## Predicts protein structures with homology modeling (classical way, not used).
def predict_structure_homology(protein):
    ## Select templates from database.
    write_pir(protein)
    [...]
    sdb = modeller.sequence_db(env)
    sdb.read(seq_database_file='20160310_pdb95.bin', seq_database_format='BINARY', chains_list='ALL')
    aln = modeller.alignment(env)
    aln.append(file=protein.id + '.ali', alignment_format='PIR', align_codes='ALL')
    prf = aln.to_profile()
    prf.build(sdb, matrix_offset=-450, rr_file='${LIB}/blosum62.sim.mat',
              gap_penalties_1d=(-500, -50), n_prof_iterations=1,
              check_profile=False, max_aln_evalue=0.01)
    templates = get_templates(prf, 35, 2)
    ## Align templates.
    aln = modeller.alignment(env)
    for template in templates:
        write_pdb(template[0:4])
        m = modeller.model(env, file=template[0:4],
                           model_segment=('FIRST:' + template[4:], 'LAST:' + template[4:]))
        aln.append_model(m, atom_files=template[0:4], align_codes=template)
    for (weights, write_fit, whole) in (((1., 0., 0., 0., 1., 0.), False, True),
                                        ((1., 0.5, 1., 1., 1., 0.), False, True),
                                        ((1., 1., 1., 1., 1., 0.), True, False)):
        aln.salign(rms_cutoff=3.5, normalize_pp_scores=False,
                   rr_file='$(LIB)/as1.sim.mat', overhang=30,
                   gap_penalties_1d=(-450, -50),
                   gap_penalties_3d=(0, 3), gap_gap_score=0, gap_residue_score=0,
                   dendrogram_file=protein.id + 'templates.tree',
                   alignment_type='tree',
                   feature_weights=weights,
                   improve_alignment=True, fit=True, write_fit=write_fit,
                   write_whole_pdb=whole, output='ALIGNMENT QUALITY')
    aln.write(file=protein.id + 'templates.ali', alignment_format='PIR')
    ## Align sequence to templates.
    env.libs.topology.read(file='$(LIB)/top_heav.lib')
    aln_block = len(aln)
    aln.append(file=protein.id + '.ali', alignment_format='PIR', align_codes=protein.id)
    aln.salign(output='', max_gap_length=20,
               gap_function=True,
               alignment_type='PAIRWISE', align_block=aln_block,
               feature_weights=(1., 0., 0., 0., 0., 0.), overhang=0,
               gap_penalties_1d=(-450, 0),
               gap_penalties_2d=(0.35, 1.2, 0.9, 1.2, 0.6, 8.6, 1.2, 0., 0.),
               similarity_flag=True)
    aln.write(file=protein.id + 'total.ali', alignment_format='PIR')
    a = modeller.automodel.automodel(env, alnfile=protein.id + 'total.ali',
                                     knowns=tuple(templates), sequence=protein.id)
    [...]
    a.make()
    ## Refine side chains using rotamer libraries (SCWRL package), switch hashes to activate. Deactivation preferred.
    # os.system('./scwrl4/Scwrl4 -i ' + protein.id + '.B99990001.pdb -o final' + protein.id + '.pdb')
    [...]
    os.system('mv final' + protein.id + '.pdb ./pdb/' + protein.id + '.pdb')
    ## Fix PDB (only needed when side chain refinement is not used)
    [...]
    print 'Protein ' + protein.id + ' modeled using Modeller and SCWRL4.'

## Predicts protein structures with homology modeling using intermediate models (manual alignment, faster).
def predict_structure_homology_fast(protein):
    if len(protein.parents) > 1:
        write_aln(protein)
        [...]
        templates = []
        [...]
            templates.append(parent[0])
        a = modeller.automodel.automodel(env, alnfile=protein.id + 'total.ali',
                                         knowns=tuple(templates), sequence=protein.id)
        [...]
        a.make()
        ## Refine side chains using rotamer libraries (SCWRL package), switch hashes to activate. Deactivation preferred.
        # os.system('./scwrl4/Scwrl4 -i ' + protein.id + '.B99990001.pdb -o final' + protein.id + '.pdb')
        [...]
        os.system('mv final' + protein.id + '.pdb ./pdb/' + [...]
        ## Fix PDB (only needed when side chain refinement is not used)
        [...]
        print 'Protein ' + protein.id + ' modeled using MODELLER.'
    else:
        os.system('cp pdb/' + protein.parents[0][0] + '.pdb pdb/' + protein.id + '.pdb')

## Models point mutations using the Rosetta Backrub application.
def predict_structure_backrub(protein):
    [...]
    structure = parser.get_structure(protein.id[0:4], 'pdb/' + protein.id[0:4] + '.pdb')
    ## Give every thread its own input and output files.
    os.system('cp pdb/' + protein.id[0:4] + '.pdb pdb/' + [...]
    atomList = []
    for atom in structure.get_atoms():
        atomList.append(atom)
    nbs = NeighborSearch(atomList)
    for mutation in protein.pointMutations:
        ## Write resfile and get all residues within 6 Angstrom of mutated residues using Biopython package.
        with open('resfile' + protein.id, 'w') as resfile:
            resfile.write('NATRO\nstart\n')
            affectedList = []
            residue = structure.get_chains().next()[mutation]
            for atom in Selection.unfold_entities(residue, 'A'):
                for neighbor in nbs.search(atom.get_coord(), 6, level='R'):
                    affectedList.append(get_resi(neighbor))
            resfile.write(str(mutation) + ' ' + protein.chain + ' PIKAA ' + protein.pointMutations[mutation][1] + '\n')
            ## Make affectedList unique.
            affectedList = list(set(affectedList))
            pivotList = []
            for res in affectedList:
                if res != mutation:
                    resfile.write(str(res) + ' ' + protein.chain + ' NATAA\n')
                pivotList.append(res)
                if res != 1:
                    pivotList.append(res - 1)
                if res != len(protein.seq):
                    pivotList.append(res + 1)
            pivotList = list(set(pivotList))
        ## Run backrub sampling with 10,000 Monte Carlo iterations.
        command = ROSETTADIR + '/main/source/bin/backrub.linuxgccrelease -database ' + ROSETTADIR + \
                  '/main/database -s pdb/' + protein.id + '.pdb -nstruct 1 -backrub:ntrials ' + \
                  str(10000) + ' -resfile resfile' + [...] + ' -pivot_residues '
        for pivot in pivotList:
            command += str(pivot) + ' '
        os.system(command)
        ## Refine side chains using rotamer libraries (SCWRL package), switch hashes to activate. Deactivation preferred.
        # os.system('./scwrl4/Scwrl4 -i ' + protein.id + '_0001.pdb -o pdb/' + protein.id + '.pdb')
        os.system('mv ' + protein.id + '_0001.pdb pdb/' + [...]
        os.system('rm ' + protein.id + [...] 'resfile' [...])
        ## Fix PDB (only needed when side chain refinement is not used)
        [...]
    print 'Protein ' + protein.id + ' modeled using RosettaBackrub and SCWRL4.'

def get_unique_ids(number, motherSelection):
    list = []
    while len(list) < number:
        id = ''
        for digit in range(0, 4):
            id += str(random.randint(0, 9))
        if id not in list + [protein.id for protein in motherSelection]:
            list.append(id)
    return list

def prepare_next_gen(population, motherSelection, subpopSize, pctM, generation):
    newPopulation = []
    ids = get_unique_ids(len(motherSelection), motherSelection)
    for index, protein in enumerate(motherSelection):
        os.system('cp pdb/' + protein.id + '.pdb pdb/' + ids[index] + '.pdb')
        newPopulation.append(create_subpopulation(Protein(ids[index], protein.seq, protein.chain,
                                                          [protein.bestBindingSite]), subpopSize, pctM))
    ## Write best pdb to folder bestpdbs and print report.
    [...]
    os.system('mkdir -p bestpdbs && cp pdb/' + bestProtein.id + '.pdb bestpdbs/' + str(generation + 1) +
              [...] + bestProtein.id + '.pdb')
    with open('report', 'a') as report:
        report.write(bestProtein.id + '\t' + str(bestProtein.bindingSites) + '\t' + bestProtein.bestBindingSite +
                     '\t' + str(bestProtein.score) + '\n')
    ## Remove PDBs from previous generation.
    [...]
        os.system('rm pdb/' + subpop[0].id[0:4] + '.pdb')
    [...]
        os.system('rm pdb/' + protein.id[0:14] + [...])
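
Only fragments of the crossover operators survive in the listing above: two cut points bounded by `minLength`, and a child sequence assembled from a head, a middle segment taken from the neighbouring individual, and a tail. The sketch below, written in Python 3, illustrates that general two-point scheme on plain strings. It is an assumption-laden illustration rather than the thesis code: the function name `two_point_crossover`, the sorting of the cut points and the returned tuple are all hypothetical.

```python
import random


def two_point_crossover(seq_a, seq_b, rng=random):
    """Exchange the segment between two cut points of two sequences.

    Both cut points are drawn within the shorter sequence, so each child
    keeps the length of the parent that contributes its head and tail.
    """
    min_length = min(len(seq_a), len(seq_b))
    point1 = rng.randint(0, min_length)
    point2 = rng.randint(0, min_length)
    if point1 > point2:
        point1, point2 = point2, point1
    # Head and tail from one parent, middle segment from the other.
    child_a = seq_a[:point1] + seq_b[point1:point2] + seq_a[point2:]
    child_b = seq_b[:point1] + seq_a[point1:point2] + seq_b[point2:]
    return child_a, child_b, (point1, point2)


if __name__ == "__main__":
    children = two_point_crossover("AAAAAAAAAA", "GGGGGGGG", random.Random(42))
    print(children)
```

Returning the cut points alongside the children mirrors the bookkeeping in the Protein class, which stores parents and crossover points so that a recombined sequence can later be aligned against its parent structures.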
B.3 protein.py
Code Snippet B.3: Complete source code: protein.py.
########################################
####            FIGARO              ####
#### Academic year: 2015-2016       ####
#### Student nr: r0307453           ####
########################################

# Class defining the protein object.

import re

class Protein():

    def __init__(self, id, seq, chain, bindingSites):
        self.id = id
        self.seq = seq
        self.chain = chain
        self.bindingSites = bindingSites
        self.score = None
        self.parents = []
        self.crossoverPoints = []
        self.bestBindingSite = None
        self.pointMutations = {}

    def set_score(self, score):
        self.score = score

    def set_parents(self, protA, protB=None, crossoverPoint1=None, crossoverPoint2=None):
        self.parents.append((protA.id, protA.seq))
        if protB and re.search('recombined', protB.id):
            self.parents.append((protB.parents[0][0], protB.parents[0][1]))
        if protB and not re.search('recombined', protB.id):
            self.parents.append((protB.id, protB.seq))
        [...]
        self.crossoverPoints = [crossoverPoint1]
        [...]
        self.crossoverPoints.append(crossoverPoint2)

    def set_best_binding_site(self, bindingSite):
        self.bestBindingSite = bindingSite

    def set_id(self, id):
        self.id = id

    def update_seq(self, seq):
        self.seq = seq

    def set_point_mutations(self, pointMutations):
        self.pointMutations = pointMutations
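
The `pointMutations` attribute of the Protein class is filled by the point mutation operator in functions.py with entries of the form `{position: original + replacement}`, where the position is 1-based, and the backrub routine later reads the second letter of each entry. The following Python 3 sketch illustrates that record format on plain strings and shows that a record is enough to replay the mutations on the unmutated sequence. The names `mutate_sequence` and `apply_mutations` are hypothetical, and the order in which the random draw and the cysteine/proline check are evaluated is a simplification of this sketch.

```python
import random

# The same 18 letters as the aminoAcids list in functions.py (no C, no P).
AMINO_ACIDS = list("ADEFGHIKLMNQRSTVWY")
# Cysteine and proline positions are left untouched by the operator.
PROTECTED = {"C", "P"}


def mutate_sequence(seq, pct_m, rng=random):
    """Return (new_seq, mutations); mutations maps 1-based position -> 'XY'."""
    residues = list(seq)
    mutations = {}
    for index, original in enumerate(seq):
        if original in PROTECTED or rng.random() > pct_m:
            continue
        replacement = rng.choice(AMINO_ACIDS)
        if replacement != original:
            residues[index] = replacement
            mutations[index + 1] = original + replacement
    return "".join(residues), mutations


def apply_mutations(seq, mutations):
    """Replay a mutation record on the unmutated sequence."""
    residues = list(seq)
    for position, change in mutations.items():
        if residues[position - 1] != change[0]:
            raise ValueError("mutation record does not match the sequence")
        residues[position - 1] = change[1]
    return "".join(residues)


if __name__ == "__main__":
    mutated, record = mutate_sequence("ACDPEFGHIK", 0.5, random.Random(7))
    print(mutated, record)
```

Keeping the original residue in the record makes it possible to check that a record belongs to a given parent sequence before it is applied.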

Appendix C
End summary
In this work we propose a fully automated and less biased evolutionary strategy to design and optimize a binding site for random substrate molecules without a known natural binding pocket. Based on an efficient genetic algorithm dubbed FIGARO (a Fast and Interpopulational Genetic Algorithm for Receptor Optimization), we show that this strategy can be a promising new approach for finding protein structures that can serve as starting points for artificial enzyme design, to speed up reactions that would otherwise be too slow to have practical relevance. However, the most foreseeable use case, and the one that particularly captures our imagination, is its applicability to cancer immunotherapy. FIGARO offers a new bridge between very specific, even artificially designed chemical structures and proteins that can vary widely in function. Unique compounds on the surface of cancer cells could act as input structures for FIGARO to target. With the many advances already achieved on the 50-year-old problem of protein structure prediction, the availability of enormous distributed computing resources and the proven usability of very sophisticated evolutionary algorithms, FIGARO is well placed for future development.

