Dissertation presented in
Fulfillment of the requirements
for the degree of Master of Science in
Bioinformatics
Promotors:
Prof. Bart De Moor
Prof. Yves Moreau
ESAT - STADIUS. Stadius Centre for Dynamical Systems.
Signal Processing and Data Analytics
January 2017
Pieter Noyens
"This dissertation is part of the examination and has not been corrected after defence for
eventual errors. Use as a reference is permitted subject to written approval of the promotor
stated on the front page."
Foreword
More than any previous project I encountered during my studies,
this work taught me what it's like to work on something really
big and how to organize the process from end to end. Key to this is trying
to keep a healthy pace in the development process without closing your eyes
to upcoming problems and concerns. At first, I did not manage to
find this balance. While I was very happy to be able to come up with an
idea myself, I also found myself floating in a vast space of naive beginner's
optimism, which was soon followed by a darker feeling
of uncertainty and decreased confidence in success and in myself in general.
Looking back on that time now, I never thought this would have such
a big impact on me. But whenever I met with my daily supervisor Dusan
Popovic, he told me that he was deeply surprised by what I had been able
to implement in the time that had passed. For this I would like to thank him in
particular, as he created a whole new wave of courage by just speaking out
these few words. So here you have it. Thank you, Dusan. I hope you will
continue to motivate other people around you and I wish there were more
people doing the same thing. It's often not the ability to do things but the
motivation that stands in the way of progress.
Getting back on track took a while as there were other unexpected things
happening in my life and that of my closest relatives. But no matter what,
the amount of help and support was incredible. I would like to thank my
great friends and family for being there when I needed them the most. Apart
from those days, I can also look back with joy on the times we worked together in
the library. Those moments meant a lot to me and are the reason I finally
figured out that work and play can ultimately be combined. We called
ourselves The Colleagues and helped each other throughout the year
and the examination periods. Marlies, Birger, Stijn, Dani, Bram, Daan and
others, thank you for that!
2016 was also the year I met Eva. On the other side of the world, amid
the heights of the Argentine mountains, she walked into my life. To this
day, I'm honored to experience the positivity that radiates from her at all
times, which is a major drive in my work.
As a last word of appreciation I want to thank my promotor Bart De
Moor and co-promotor Yves Moreau. They showed the flexibility to let me work
out a strategy on my own and gave me the opportunity to implement it in all
freedom. This would later turn out to be the most educational experience
I've had in my life. I would fall, but also rise again. This is the work that I
am proud to present to you.
Abstract
With a better understanding of biochemical processes, recent advances in
artificial intelligence and the availability of tremendously growing computational power, the field of biological engineering is currently facing an era of
major breakthroughs. Full simulations of biochemical systems have made it
possible to fine-tune metabolic networks to a high degree, which has been shown
to successfully yield highly optimized microbial strains with maximized
production yields of the desired industrial compounds. [1, 2] An important
domain of research in this field is the development of new artificial enzymes
to speed up non-native reactions in this process; even though nature found
a wide range of extremely efficient biomolecular catalytic machines, not all
industrially relevant reactions have a known natural catalyst to increase the
flux of the chemical process. Several attempts to develop de novo artificial enzymes for non-native reactions have already been made in previous
studies, with a functional Kemp-eliminase and retro-aldolase as the most
representative examples of success. [3, 4] For now, however, these techniques
have been largely driven by intuition based on known biochemical reactions
and active site coordination. At best, these enzymes are comparable with
catalytic antibodies regarding catalytic performance, while natural enzymes
surpass them by several orders of magnitude. [5, 6, 7, 8] The level of their
success therefore remains a subject of discussion. Many of the factors contributing to enzyme catalytic activity are indeed yet to be discovered, which
makes this kind of biased rational design unlikely to yield enzymes of com-
Program parameters
3.2 Application backbone
3.5 Response XML
B.1 main.py
B.2 functions.py
Contents
Foreword
Abstract
I Literature study
1 Protein structure
1.1
1.2 Protein folding
1.3
1.4
1.5
1.5.1
1.6 Molecular docking
2 Genetic algorithms
2.1 Introduction
2.2 Parameters
2.3 Operators
2.4 Methods of selection
II FIGARO
3 Application overview
3.1 General outline
3.2
3.3
3.4
3.5
3.6 Ligand docking
3.7
3.8
4 Discussion
5 Conclusion
III Appendix
A Bibliography
B Attachments
B.1 main.py
B.2 functions.py
Part I
Literature study
Chapter 1
Protein structure
1.1
Figure 1.2: Ramachandran plot based on about 100,000 data points for
general amino acids (glycine, proline and pre-proline excluded).
between the reactive groups of the individual amino acids. [12] Strong covalent bonds can be found between cysteine residues, where oxidation of
the thiol groups can result in cystine bonds between two sulfur atoms also
known as disulfide bridges. Weaker interactions are in particular van der
Waals forces, electrostatic interactions, hydrophobic interactions and hydrogen bonds. The total of these factors will eventually lead to a final structure
of the polypeptide chain, which can be reached at either one of all possible
levels. Keratin, for example, has a final structure at the secondary level.
It is therefore called a fibrous protein and is extended in length, serving
a structural role in biological systems. Some proteins even lack any form
of secondary structure like insulin. Globular proteins, however, will have
tertiary or quaternary structure and can be catalytically active. More information about these different levels of structure development is given in
the following sections.
A protein conformation can be described by a set of torsion angles of
certain bonds in the protein backbone. For every amino acid, two torsion
angles are essential to describe the relative geometrical position in the chain.
These are called the phi (φ, between the N and Cα atoms in every chained
amino acid) and psi (ψ, between Cα and C) dihedral angles and are visually
represented in figure 1.1. Interresidual attractions and repulsions force rotation around the single bonds between stable planar amide units. These
amide or peptide bonds have π-bonding characteristics and are therefore geometrically fixed. [12] All empirically observable sets of φ and ψ angles can
be visualized in a two-dimensional plot, which is called a Ramachandran
plot or diagram. An example is given in figure 1.2. Three main regions
are clearly distinguishable. As discussed later, these regions represent the
conformational tendency to form what are called α-helices (lower left quadrant) and β-strands (upper left quadrant). [13] Clearly, many combinations
are never observed in nature due to unfavorable steric hindrance between
residues. A notable exception is glycine, which is the smallest amino acid.
With a side chain of only one hydrogen atom, glycine is the most flexible
amino acid and can take on many conformations in a polypeptide chain,
causing the protein to be more dynamic and adaptable. The Ramachandran plot for glycine as shown in figure 1.1 confirms this hypothesis. This
is an essential property in enzyme-mediated catalysis, as induced fit and
other conformational changes during transition states are indispensable for
functionally active proteins.
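To make the definition of the φ and ψ angles concrete, the following minimal sketch computes a dihedral angle from four atom positions. It is an illustration only and not code from this work: the function name `dihedral` and the use of plain coordinate tuples are assumptions, and the atom order in the comments follows the usual backbone convention (φ: C of the previous residue, N, Cα, C; ψ: N, Cα, C, N of the next residue).

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees around the bond p1-p2.

    For phi, pass C(i-1), N(i), CA(i), C(i).
    For psi, pass N(i), CA(i), C(i), N(i+1).
    """
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)

    # Unit vector along the central bond.
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)

    # Components of the outer bonds perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))

    # The angle between v and w, with its sign taken from the central bond.
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))
```

Applying such a function to every residue of a structure yields the (φ, ψ) pairs that a Ramachandran plot displays.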
1.2
Protein folding
The central dogma in molecular biology states that all genetic information
is contained within the DNA or the genome of an organism, which can be
either replicated by DNA polymerase or transcribed by RNA polymerase
to form rRNA, tRNA or mRNA transcripts. [14] The latter will eventually be translated by ribosomes to form polypeptide chains, which in turn
give rise to functional proteins. All biological systems, both eukaryotic and
prokaryotic, are based on this principle. While mRNA is being translated
and amino acid building blocks are chained together by the ribosome, inter- and intraresidual attractions and repulsions force the nascent polypeptide coil to adopt a certain conformation at the N-terminus and folding is
started. [15] Initial research by Christian Anfinsen on RNase A or bovine
pancreatic ribonuclease showed that this protein did indeed fold spontaneously into a three-dimensional conformation. Anfinsen later concluded
that this three-dimensional structure can be lost when the environmental
conditions are altered, for example when pH is increased, temperature is
increased or salt is added. This process is called denaturation. After restoring the initial conditions, Anfinsen observed that RNase A would fold back
spontaneously into its native conformation. This, however, was later found to
be only true for a minority of observations. [16, 17] Spontaneous folding only
occurs in optimal conditions during and after biosynthesis. In the crowded
intracellular environment, other proteins are therefore needed to prevent aggregation and misfolding. These proteins are called chaperones and assist
in the folding process. Heat-shock proteins are a well-known example and
prevent misfolding of proteins due to increased temperature.
The Levinthal paradox states that if proteins were to be folded by random sampling of all possible conformations, it would take about the lifetime
of the universe for a single protein to attain its native fold. As real-life folding takes about a few milliseconds to even a few microseconds, this clearly
Picture taken from http://oregonstate.edu/instruction/bi314/fall12/cellchemistrycontd.html.
are hydrogen bonds between atoms in the protein backbone. These occur
between the hydrogen atoms bound to electronegative nitrogen atoms and a
double-bonded oxygen atom at another site in the backbone. In 1951, Linus
Pauling already predicted this to give rise to two main structural patterns:
α-helices and β-sheets. [21] These structural units are visualized in figure
1.4. As discussed earlier, these patterns can also be inferred from the
Ramachandran plots of thousands of proteins. The third group visible in
these plots (upper right quadrant in figure 1.2) represents the left-handed
α-helices. This is a rather small group though; generally speaking, left-handed α-helices only occur in nature as short loop regions or single-turn
α-helices. [22]
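The quadrant assignment described in the text can be written down as a very coarse classifier. This is only a sketch of the rule of thumb stated above, not a real secondary structure assignment method; the function name `ramachandran_region` and the hard quadrant boundaries at 0° are assumptions made for illustration.

```python
def ramachandran_region(phi, psi):
    """Coarse region label for a (phi, psi) pair given in degrees.

    Uses only the quadrants named in the text:
    lower left  -> alpha-helix
    upper left  -> beta-strand
    upper right -> left-handed alpha-helix
    """
    if phi < 0 and psi < 0:
        return "alpha-helix"
    if phi < 0 and psi >= 0:
        return "beta-strand"
    if phi >= 0 and psi >= 0:
        return "left-handed alpha-helix"
    # Lower right quadrant: not assigned to any of the three main regions.
    return "other"
```

Real Ramachandran regions are irregularly shaped, so a quadrant test like this one misclassifies points near the borders.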
Once a polypeptide chain is completely folded into a globular protein
unit, tertiary structure is reached. This structural level is most common
for enzymes and will be treated exclusively in this thesis. Separate units
of tertiary structure can function on their own, but are also seen to be
associated with each other, serving as identical or differing subunits in a
polymeric complex. This final complex then constitutes the functional and
biologically relevant protein. In this work, however, we restrict ourselves
to single polypeptide chains for simplicity's sake.
When visualizing the three-dimensional structure of folded proteins, secondary structure elements are often accented by special illustrative patterns
and colors like arrows, spirals and cylindrical boxes, also known as the cartoon representation of proteins. This makes it possible to get better insight
into its general composition and biological role. Indeed, plotting the absolute coordinates of all atoms should in theory be enough to deduce the
general shape of the protein but at this low-level representation, interesting higher-level motifs are not directly revealed to the biomolecular analyst.
This is important because a wide range of short motifs can be associated
with specific biological roles. An example is the well-known leucine zipper,
which is shown in figure 1.5 and has strong DNA-binding characteristics.
A dipole moment exists along the length of the α-helices, which is caused
by the accumulation of individual dipoles in every amide bond of the helix.
This results in a more positively charged N-terminus, while the C-terminus
of the helix gets more negatively charged. This, in combination with a high
abundance of basic amino acids at the N-terminal side of the helix, which further increases the positive charge on this side of the chain, causes it to have
high affinity for the negatively charged strands of DNA. Finally, hydrophobic amino acids at the center of the helix make the helices associate with
each other and a zipper-like structure is formed. [23]
Figure 1.5: Leucine zipper structure interacting with a DNA double helix.
Motifs like the leucine zipper can be combined to form higher-order functional modules called domains. Domains are independently folding units in a
global protein structure and are often duplicated and exchanged between ge-
1.3
A protein is nothing more than its own sequence of amino acids and the
biochemical implications going with that. That is, when one describes a
protein and its biological role, be it an enzyme, membrane or transport protein, it should always be kept in mind that only its amino acid composition
is responsible for that. As described earlier, Anfinsen's dogma states that
the amino acid sequence contains all information needed for correct folding inside a cellular environment. As an example, membrane proteins often
fold into two main regions with different physicochemical properties. Often, a
highly hydrophobic and a more hydrophilic region can be clearly distinguished.
This hydrophobic region is thermodynamically predefined to associate with
lipid compounds, so the protein will be fixed to the double-layered phospholipid cellular membrane at this site. This property is what gives the protein
its correct localization and makes it function as a membrane protein. The
fact that this protein is located at the cellular membrane can thus be reduced
to the fact of having a fold that divides a hydrophobic and hydrophilic environment, which in turn is a direct result of its unique amino acid sequence
that has evolved to fold in this way. If we were to introduce enough mutations
into this sequence to alter this fold dramatically, it should be clear
that its function and correct localization would be lost, too. [12]
Another well-studied example of the fine-tuned interplay between structure and function is hemoglobin and its variable structural properties. Hemoglobin can exist in two states, which are shown in figure 1.6.
When the structure resides in its taut or tense (T) state, it exhibits a
lower affinity for oxygen. However, upon environmental changes like decreased CO2 partial pressure and the accompanying pH increase, the conformation of hemoglobin changes to a more relaxed (R) state, which shows much
higher affinity for oxygen. When aerobic organisms breathe, CO2 is removed
from the environment, causing hemoglobin to transform into its R state, being
Picture taken from http://cbc.arizona.edu/classes/bioc462/462a/NOTES/hemoglobin/hemoglobin_function.htm.
extended conformation for their biological role, but single point mutations
are not likely to have any direct effects on functional properties. By contrast, enzymes and other proteins with globular conformation often have
a highly optimized and flexible fold. A single point mutation around the
active site of an enzyme can already greatly reduce its affinity for its substrate,
which in some cases can have a disastrous impact on the rate of catalysis,
in turn negatively influencing complete metabolic pathways with possibly
severe complications as a result. In other cases though, point mutations like
this can in fact be desirable and increase enzymatic activity to a certain
extent.
1.4
Proteins can be specialized in all kinds of tasks. A main differentiator between all protein functional classes is the ability to bind smaller molecules,
also called ligands. We have just described a representative of this family,
as hemoglobin can be seen as a receptor protein binding to a dioxygen ligand. As such, transport proteins belong to a higher-level class of proteins
with ligand-binding capabilities. Many membrane proteins and in particular enzymes share this classification. In addition to having possibly multiple
binding pockets for ligands, enzymes also possess an active site, which can
be situated at any one of the available binding pockets and carries out
the actual catalysis. This particular binding site exhibits high affinity for a
transition state structure of the reaction to be catalyzed. A frequently shown
but overly simplified model of how enzymes work is the lock and
key representation, illustrated in figure 1.7. This, however, is an outdated
interpretation of how many enzymes work. According to this model, an enzyme readily possesses high affinity for every substrate molecule involved in
the reaction and is rigidly shaped to bind these ligands exclusively. Indeed,
enzymes are known to have high specificity for a certain range of substrates,
like a lock has for its keys. Upon tight binding of these molecules, the
reaction is facilitated by the enzyme and a product for which the enzyme
has much less affinity is released. This theory is nowadays generally considered inaccurate and, instead, an induced fit model is preferred.
This model assumes that initial binding of a substrate molecule triggers a
conformational change in the enzyme, causing the transition state of the
reacting compounds to be thermodynamically favored and stabilized, rather
than the substrates as such. As a result, activation energy is significantly
lowered and the reaction rate is increased. The main difference with regard
to the lock and key model is that, in this model, the enzyme is a flexible
entity that dynamically adapts to the shape of its substrates and transition
state. A schematic overview of this is given in figure 1.8. [31, 32]
Picture taken from https://aberdeenc.wordpress.com/tag/
Enzyme dynamics and especially the chemical kinetics of enzyme-mediated
catalysis have been a main domain of research over the last century. Two main
concepts should be highlighted in this regard. Introduced in 1913 by Leonor
Michaelis and Maud Menten, the rate constant k_cat and the affinity constant K_M succeeded in characterizing enzyme kinetics based on a few made
physical laws. Therefore, a total summation over all relevant forces would
give us a quantitative measure for goodness of fit. Several software packages
exist to perform what is called molecular docking, but we will come back to
this later and discuss it in more detail.
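As a rough illustration of what such a summation over forces can look like, the sketch below adds up a van der Waals term and an electrostatic term over all ligand-receptor atom pairs. This is a toy model under stated assumptions and not the scoring function of any docking package: the Lennard-Jones 12-6 and Coulomb forms, the parameter values `epsilon` and `sigma`, the atom representation `(x, y, z, charge)` and the function names are all chosen here for illustration.

```python
import math

# Coulomb constant in kcal*Angstrom/(mol*e^2), a common molecular mechanics unit.
COULOMB_K = 332.06


def pair_energy(r, q1, q2, epsilon=0.1, sigma=3.4):
    """Toy interaction energy of two atoms at distance r (Angstrom).

    epsilon and sigma are illustrative Lennard-Jones parameters,
    not values taken from a real force field.
    """
    lennard_jones = 4.0 * epsilon * ((sigma / r) ** 12 - (sigma / r) ** 6)
    coulomb = COULOMB_K * q1 * q2 / r
    return lennard_jones + coulomb


def interaction_score(ligand, receptor):
    """Sum pair energies over all ligand-receptor atom pairs.

    Each atom is a tuple (x, y, z, charge). Lower scores mean a
    more favorable fit in this toy model.
    """
    total = 0.0
    for x1, y1, z1, q1 in ligand:
        for x2, y2, z2, q2 in receptor:
            r = math.dist((x1, y1, z1), (x2, y2, z2))
            total += pair_energy(r, q1, q2)
    return total
```

Real scoring functions add further terms, such as hydrogen bonding and desolvation, and use carefully fitted parameters.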
1.5
The protein folding problem has come a long way since its first formulation
by Kendrew and co-workers in 1958 [35], after they had realized the three-dimensional structure of globular proteins was not by any means regular and
symmetrical as they had expected. Mainly, the aim of predicting a protein
structure from its amino acid sequence can be seen as a direct consequence
of the difficulties of experimental structure determination, generally
carried out by X-ray crystallography or NMR. As protein sequences keep
getting discovered in all kinds of organisms, assigning functions to these
gene products is far from straightforward without detailed knowledge about their
geometrical conformations. As discussed above, protein function is inseparably associated with its structure, so determining these structures is a very
important task in order to understand their biological relevance, eventually
leading to industrial and in particular medical applicability.
Unfortunately, experimental structure determination cannot keep up with
this fast-paced sequence discovery and the gap between known sequences
and experimentally derived corresponding structures is getting bigger and
bigger. [36] X-ray crystallography is very expensive and time-consuming but
can yield very high resolution structures. Moreover, the task of deriving
protein structures from NMR measurements is far from straightforward and requires
significant specialization in the field, while being suitable only for
smaller proteins and delivering structures with rather poor resolution. The
technique can, however, be used to determine structures in solution, in contrast to diffraction-based analysis. Nowadays, it takes several months to
years to experimentally derive a single structure; in fact, growing a protein
crystal big enough for X-ray analysis (about 0.5 mm) can already require
months. Additionally, membrane proteins are almost impossible to crystallize, which leads to extra complications. [37]
These facts taken into consideration, it should come as no surprise that
about 50 to 100 amino acids for periods up to 1 ms. [39] These computations
are facilitated by the state-of-the-art Anton supercomputer which is specifically designed for particle-particle interaction calculations. [40] A second
revision of this device, Anton 2, was announced some time ago, but is
still in development at the time of this writing (January 2017).¹ The most
impressive results achieved with the Anton supercomputer have elucidated
many interesting facts about the folding process that previously had been
impossible to observe. Figure 1.9 shows some of the realizations achieved
so far and convincingly confirms ab initio structure prediction is one of the
most reliable prediction strategies available today. [39]
Unfortunately, most proteins of biological and medicinal importance
are significantly bigger than the ones that can be predicted ab initio right
now. Therefore, scientists have been focusing on another approach over the past
decades which aims to predict the structure of new proteins based on the
coordinates of previously derived conformations. This strategy is called homology modeling and mostly uses statistical inference combined with small
molecular dynamics calculations to carry out structure predictions. It has
many advantages over ab initio modeling, in particular that it can
be used for proteins of over 100 amino acids and that it has much faster
runtimes. As sequence identity increases between two proteins, model quality can be equal to or even better than that of ab initio rendered structures. The
method already produces reliable results with sequence identities starting
at about 30%. If sequences show 80% to 90% similarity, model resolutions
even better than 1.5 Å can be obtained. Note that this guideline is only
valid for proteins that have evolved naturally. Sequences that show similarity between proteins are also called conserved regions and are most likely
¹ Anton 2 will have as many as 5 times the number of CPU cores running at 1650 MHz
each, compared to 485 MHz for the original Anton supercomputer. It will also sport 152 interaction pipelines running at 1650 MHz per pipeline, a peak throughput of 12.7 TFXOPS,
4096 KB of RAM, 8000 rendered atoms per ASIC and up to 2.7 Tb/s of channel bandwidth.
1.5.1
Figure 1.12: Some correct side chain predictions based on the backrub motion. Fixed backbone prediction is shown in red while backrub predictions
are shown in blue. Starting PDB structure is shown in green and the target
point-mutated PDB structure is shown in purple.
available, has very high sequence identity with a given sequence to be modeled that is artificially designed and counting more than 300 amino acids?
Clearly, homology modeling cannot be used as it only works under the assumption of natural evolution and fold conservation. Ab initio modeling of
the complete sequence is also ruled out because it is currently intractable to perform molecular dynamics simulation on over 300 amino acids in reasonable
time. Thankfully, a study by Davis and co-workers in 2006 elucidated a
highly predictable pattern in the effects of point mutations on global protein
conformation. This pattern is called the backrub move and is visualized in
figure 1.11. [44]
The backrub motion describes subtle changes in a protein's backbone
triggered by much larger changes in side chain conformation. These side chain
movements are often observed due to impacts of other molecules in the
environment like H2O, but after collision the original backbone conformation
is usually restored. Naturally occurring point mutations also cause these
side chain rearrangements, but in contrast to random impacts these changes
are permanent. While a point mutation itself usually does not directly
influence backbone conformation, with cysteine and proline being the only
exceptions to this rule, again backbone is tilted locally due to altering side
1.6
Molecular docking
all these processes can be reliably carried out at a much faster pace. [46]
However, molecular docking is not only used to calculate docking scores.
In many research domains it is actually more important to get a prediction
of the relative conformational positions between ligand and receptor. As an
illustration, many metabolic pathways make use of feedback systems to regulate the accumulation of end compounds, which can act as allosteric
effectors on enzymes facilitating their own production. Phosphofructokinase
or PFK is a well-known enzyme in glycolysis catalyzing the phosphorylation
of fructose-6-phosphate into fructose-1,6-bisphosphate. High levels of ATP,
one of the products of glycolysis, will cause PFK to have less affinity for
its substrate by altering its three-dimensional conformation upon binding.
ATP is said to be an allosteric inhibitor of PFK. [47] Gaining better insight into the allosteric mechanism can be significantly accelerated by using
molecular docking applications visualizing three-dimensional conformation
of complexes.
Like in protein structure prediction, many software applications have
been developed for molecular docking analysis. Some of them are specialized in virtual high-throughput screening of compounds targeting the pharmaceutical industry and often come with a high price tag, while others are
open source and freely accessible on the internet. The most popular one
in this area is AutoDock Vina, but we will use the more configurable rDock
package. [43]
Chapter 2
Genetic algorithms
2.1
Introduction
programming and genetic algorithms. In this work, we will limit our focus
to genetic algorithms and their applications.
The working mechanism of genetic algorithms is in essence very simple;
in the simplest case, it does not use any prior knowledge to reach a global
optimum for a given problem. The general pipeline is depicted in figure 2.1
and will be briefly explained here. First, an initial population of solution
candidates is generated. These are often just randomly generated, but valid
representations of members of the solution space. A main requirement in
setting up a genetic algorithm is defining the fitness function. As in natural evolution and Darwin's principle of survival of the fittest, this measure is
used to compare the individuals in the population. These individuals are
also called chromosomes, analogous to the biological model for genetics in
which chromosomes are composed of genes. Like in the example of figure
2.1, genetic algorithms usually don't go into more detail concerning real-life genetics and simply assume a gene to be the most fundamental unit of
information to construct chromosomes. For every chromosome in the population at time t, the fitness is calculated. Based on these values, several
methods exist for selection, which are discussed later on. Simply put, the
better performing solution candidates have a higher chance of reproducing
and proceeding to the next generation. This process of reproduction happens in two phases: crossover and mutation. The selection operator filters out
individuals eligible for mating and the crossover operator exchanges bits of
information between these candidates. After that, the mutation operator
potentially introduces random mutations in the sequences. This aims to
eliminate bias by exploring new parts of the solution space. A new population is the result at time t + 1 and the same steps are performed over and
over again until the stopping criterion is fulfilled. After a sufficient period
of time, an acceptable global solution estimate should be returned. [48]
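The pipeline described above can be sketched in a few lines. The following is a minimal illustration under stated assumptions and not the implementation used in this work: the toy fitness function (counting ones in a bit string), tournament selection, single-point crossover, the fixed generation budget as stopping criterion and all names and default values are chosen here purely for the example.

```python
import random


def fitness(chromosome):
    """Toy fitness: the number of genes equal to 1."""
    return sum(chromosome)


def tournament(population, rng, k=3):
    """Selection: return the fittest of k randomly drawn individuals."""
    return max(rng.sample(population, k), key=fitness)


def crossover(a, b, rng):
    """Single-point crossover: swap the tails of two parents."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]


def mutate(chromosome, rng):
    """Mutation: flip one randomly chosen gene, returning a new list."""
    i = rng.randrange(len(chromosome))
    return chromosome[:i] + [1 - chromosome[i]] + chromosome[i + 1:]


def run_ga(n_genes=20, pop_size=30, generations=40,
           p_crossover=0.8, p_mutation=0.1, seed=0):
    rng = random.Random(seed)
    # Initial population: random but valid members of the solution space.
    population = [[rng.randint(0, 1) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):  # stopping criterion: fixed budget
        offspring = []
        while len(offspring) < pop_size:
            a = tournament(population, rng)
            b = tournament(population, rng)
            if rng.random() < p_crossover:
                a, b = crossover(a, b, rng)
            for child in (a, b):
                if rng.random() < p_mutation:
                    child = mutate(child, rng)
                offspring.append(child)
        population = offspring[:pop_size]  # population at time t + 1
        best = max(population + [best], key=fitness)
    return best
```

Calling `run_ga()` returns the fittest chromosome seen in any generation; passing a different `seed` gives a different but reproducible run.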
Keep in mind that genetic algorithms are by no means the holy grail
that solves every problem. They belong to the same category of metaheuristic global search algorithms as ant colony search and particle swarm
optimization algorithms. These algorithms tend to perform very well in
complex situations with enormous solution spaces that would otherwise be
intractable to explore. Therefore, they are an appropriate strategy
for this work. In the next few sections, several considerations are
highlighted in the process of implementing efficient genetic algorithms.
2.2
Parameters
There is no such thing as a general genetic algorithm. It is more of an umbrella term for many specialized variations. Like Dijkstra's algorithm being
a special case of A* search, genetic algorithms consist of many variables
that can be heavily modified and optimized for specific situations. Yet,
a reference algorithm does exist and is better known as the canonical genetic algorithm (CGA). It serves as a basis, a guideline for more optimized
and efficient variants. The schema theorem and
building block hypothesis were built upon this basic algorithm and act as mathematical support for
algorithm efficiency. [48, 49] The main parameters of every genetic algorithm
are discussed below.
Probability of mutation The first parameter that can be optimized to
increase algorithm efficiency is the probability of mutation (%M). This
parameter describes the chance of mutation for each chromosome or
solution candidate in the selected part of the (sub-)population, going
from one generation to another. In genetic algorithms, the mutational
operator is of less importance than the crossover operator. Yet, values
of this parameter can heavily affect the efficiency of the algorithm. Too
high a probability of mutation leads to an excess of genetic diversity,
which prevents the algorithm from converging.
On the other hand, too low values can easily cause
premature convergence and a very poor, suboptimal solution may be
returned.
Probability of crossover Secondly, we should consider the probability of
crossover (%C). Crossover, or recombination, is the main driving force
in genetic algorithms and in natural evolution in general. It tries to
exploit the good elements of solution candidates while maintaining
genetic diversity by mixing them up from one generation to the next.
This way, good elements are kept in the population while new
chromosomes based on the previous ones are generated and evaluated.
Like the mutation probability, this parameter is sensitive to extreme
values. Too high values cause too much diversity, while too low values
lead to premature convergence.
Percentage elitism Another important parameter is the percentage of
elitism (%E). This represents the percentage of the population that
is to be kept when moving from one generation to another, chosen
from the upper part of the solution candidate pool with respect to
their relative fitness values. This part is moved without changes from
parent population to offspring and thus has to be kept rather limited
in order to achieve enough genetic diversity.
Population to generation size ratio When comparing algorithm strategies,
the total number of chromosomes generated in an experiment has to be
fixed. If it is not, better results are easily obtained simply by
letting the algorithm run for more generations or with bigger
populations, since genetic algorithms are iterative global optimization
algorithms. This obviously requires a lot more computational effort and
leads us to another parameter that has to be considered: the population
size to generation number ratio (P/G). This ratio can be modified while
keeping the total number of evaluated chromosomes constant. The smaller
the population, the higher the number of generations that can be
computed, but also the higher the chance of premature convergence. With
bigger populations, fewer generations can be calculated and it is less
likely that convergence will be reached.
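The trade-off between population size and number of generations can be made concrete with a little arithmetic. The following is a minimal sketch and not part of the original implementation; the function names and the budget of 5000 chromosomes are illustrative assumptions.

```python
def generations_for_budget(total_chromosomes, pop_size):
    """Number of full generations that fit in a fixed chromosome budget."""
    if pop_size <= 0:
        raise ValueError("pop_size must be positive")
    return total_chromosomes // pop_size


def elite_count(pop_size, pct_e):
    """Number of top-ranked chromosomes copied unchanged (%E given as a fraction)."""
    return int(round(pop_size * pct_e))


# Hypothetical budget of 5000 chromosome evaluations (an assumption for illustration).
budget = 5000
for pop_size in (50, 100, 500):
    # Smaller populations leave room for more generations within the same budget.
    print(pop_size, generations_for_budget(budget, pop_size))
```

Under this fixed budget, halving the population doubles the number of generations, which is the P/G trade-off described above.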
2.3
Operators
This part of the algorithm contains the actual drivers of the optimization
process. The crossover operator (COP) in particular is of main importance.
It acts as a driving and regulating force that tries to balance the two
extremes of exploitation and exploration. On the one hand, crossover
brings new permutations from different parts of the solution space into the
population; on the other hand, it tries to maintain the good elements that
have already been reached. This way, an optimal solution may be reached
after a certain number of generations of selection and recombination.
Obviously, it is essential for the crossover operator to behave
efficiently. For example, the single point crossover operator of the
canonical genetic algorithm should not be used for something like the
traveling salesman problem (TSP): it would deliver offspring that are not
valid tours and, when using an adjacency representation, makes no attempt
to pass good subpaths that already exist in the population on to the next
generation. The representation of solution candidates is tightly coupled to
the recombination operators that can be used with it. By way of
illustration, we use the TSP to underscore the importance of a good
crossover operator. [50, 49]
Two representations that can be applied to the TSP are considered: path
representation and adjacency representation. In path representation, each
permutation simply lists the cities in the order in which they are visited.
The adjacency representation uses permutations to describe the visiting
order in a very different way. The position of a gene in its chromosome,
i.e. the index (starting at 1) of a number in the permutation, denotes the
city from which a direct link leads to the city given by that number. For
example, in permutation [7 6 8 5 3 4 2 1], city 4 sits at position 6, so
city 6 is connected to city 4. Often, crossover operators are not
interchangeable between representations. To find and implement the optimal
crossover operator, it is therefore important to use the right
representation. Two crossover operators that can be used on these
different representations are briefly discussed below.
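To make the adjacency representation tangible, the sketch below decodes the example permutation into a path and checks whether it forms a single cycle. It is an illustration only; the function names are assumptions and do not come from the original work.

```python
def adjacency_to_tour(adjacency, start=1):
    """Follow the links from `start`.

    Position i (1-based) of `adjacency` holds the city visited right after city i.
    """
    tour = [start]
    city = adjacency[start - 1]
    while city != start and len(tour) < len(adjacency):
        tour.append(city)
        city = adjacency[city - 1]
    return tour


def is_single_cycle(adjacency):
    """True if `adjacency` is a permutation whose links form one closed tour."""
    n = len(adjacency)
    if sorted(adjacency) != list(range(1, n + 1)):
        return False
    return len(adjacency_to_tour(adjacency)) == n


example = [7, 6, 8, 5, 3, 4, 2, 1]
print(adjacency_to_tour(example))   # path representation of the same tour
print(is_single_cycle(example))     # one cycle through all eight cities
print(is_single_cycle([2, 1, 4, 3]))  # two separate 2-cycles, hence an illegal tour
```

The last call shows why a permutation with unique elements can still be an illegal tour in this representation.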
Alternating Edge Recombination This recombination method goes along
with the adjacency representation. It takes into account that, although
all elements are unique, i.e. each city number occurs only once in a
permutation, such tours can still be illegal because they can consist
of more than one cycle. Alternating Edge Recombination (AER) works as
follows. First, a random starting edge is chosen from parent 1. This
subtour is extended with the corresponding edge from parent 2, and so
on, alternating between the parents. In the end, a valid permutation is
reached that combines edges from both parents in their offspring.
However, the subpaths passed from parents to offspring contain only one
edge at a time. Good subpaths are therefore disrupted by the operator,
which suggests lower performance. It is essential for crossover
operators to keep subpaths of good solution candidates in the
population. [49]
Edge Recombination Crossover This crossover operator works together
with path representation, the most intuitive yet very potent
representation. As described above, crossover operators need to be
conservative regarding subtours of solution candidates. Edge
Recombination Crossover (ERX) tries to achieve this goal while keeping
the number of new edges at a minimum. Other path representation
crossover operators, like Order Crossover (OX), also keep subtours in
the population as much as possible but introduce a lot of new edges
while doing so. Crossover operators only have to implement
recombination; mutation is already the task of the mutation operator.
Therefore, ERX introduces as few mutations as possible, rendering
offspring chromosomes that are based almost solely upon recombination,
i.e. practically every edge in the offspring chromosome comes from at
least one of the two parents. ERX has a very high time
2.4
Methods of selection
Part II
FIGARO
Chapter 3
Application overview
pctE = 0.25
## Percentage crossover
pctX = 0.6
## Percentage mutation
pctM = 0.005
## Population size
popSize = 100
## Subpopulation size
subpopSize = 10
## Number of putative binding sites to be inspected
nrBindingSites = 5
## Number of generations
genSize = 50
## Maximum resolution of structures in initial population
maxRes = 2.0
## Maximum number of entities in structure
maxEnt = 1
## Minimum similarity to query ligand
minSim = 0.3
3.1
General outline
In this section, we give a bird's-eye view of the application and describe
its global workflow. Over the next few sections, every aspect of the
application is presented separately and clarified on the basis of the
corresponding code snippets.
The code in snippet 3.2 represents the main backbone of the program.
Note that the program in its current state still has many hard-coded
parameters, which could more elegantly be queried as user input at runtime
in future releases. These parameters are given in code snippet 3.1,
accompanied by short explanations in the code comments. The values given
to them here serve as an example for the program to be run with. However,
the main input for the program still needs to be defined. This is the ligand
for which we aim to generate an optimal receptor structure. It can be any
small molecule, such as a designed compound or a natural organic molecule
like caffeine. For our tests we used a web service that picks random
molecules from the ZINC database, hosted by the BIRC at the University
of California. When the program is run, the first task is mining the PDB
for high quality receptor structures that are known to bind ligands similar
to our input molecule. When enough templates are collected, i.e. the
defined population size is reached, the algorithm proceeds by replicating
every individual a number of times according to the defined subpopulation
size. While doing this, point mutations are introduced with a predefined
probability (pctM as shown in code snippet 3.1).
Code Snippet 3.2: Backbone of the application.
## Generate random small molecule from http://bcirc.docking.org/random.shtml and get SMILES (currently hardcoded).
[...]

## Initialize the population.
[...]

## Write initial information for result evaluation.
write_initial_info(population)

## Perform GA on population for specified number of generations.
for generation in range(0, genSize):
    for index, subpopulation in enumerate(population):
        ## Predict initial structures.
        population[index] = workers.map(predict_structure_backrub, subpopulation)
        ## Dock ligand to all individuals in subpopulation.
        population[index] = workers.map(dock, population[index])
        ## Select individuals for mating in subpopulations.
        [...]
        ## Perform crossover between individuals in subpopulations.
        two_points_crossover(population[index], pctX)
        #single_point_crossover(newPopulation, pctX)
        ## Predict structure of recombined sequences.
        population[index] = workers.map(predict_structure_homology_fast, population[index])
        ## Dock ligand to all recombinations in subpopulation.
        population[index] = workers.map(dock, population[index])

    if generation < genSize - 1:
        ## Select mother proteins for new population based on competition between each subpopulation's best competitor.
        motherSelection = roulette_wheel_selection(get_candidates_list(population), replace=False)
        [...]
    else:
        [...] bestProtein.id + " with a score of " + str(bestProtein.score) + \
            " on binding site " + bestProtein.bestBindingSite + "."
After all sequences have been generated, their structures are modeled by
the backrub application from the Rosetta suite. The exact mechanism for
this is described in later sections. A molecular docking job is then
performed on all of these individuals and fitness scores are collected.
Based on these scores, proteins eligible for mating are selected in each
subpopulation and sequences are recombined with a given crossover
probability (pctX as shown in 3.1). Next, a comparative modeling step
returns conformational models for all recombined sequences using a fast
homology modeling algorithm, and molecular docking is performed again to
evaluate the recombined child sequences. Eventually the best candidates are
picked from all subpopulations. A final selection round using RWS, as
described in section 2.4 on page 40, chooses a new set of mother proteins to
form the subpopulations of the next generation. The process is repeated
until the defined number of generations is reached. The program then
finishes and reports information about the best protein in the final
population.
FIGARO needs a lot of CPU power to return acceptable results. As is the
case for all genetic algorithms, bigger populations and more generations
tend to give better results, but they also require a lot more computational
resources. To reduce computation time significantly, FIGARO supports
multiprocessing to distribute the work over multiple processor cores. The
implementation we have opted for is the multiprocessing package for Python,
illustrated in code snippet 3.3. Depending on the machine the program is
running on, the number of CPU cores can be adjusted and the workload will be
parallelized accordingly.
import multiprocessing

## Set number of processor cores
workers = multiprocessing.Pool(100)
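One way to adapt the pool size to the machine, instead of hard-coding it, is sketched below. This is an assumption-laden illustration written for Python 3, whereas the snippets in this chapter are Python 2; the helper names worker_count and map_parallel are hypothetical and not part of FIGARO.

```python
import multiprocessing
import os


def worker_count(requested=None):
    """Clamp the requested number of workers to the cores that are available."""
    available = os.cpu_count() or 1
    if requested is None:
        return available
    return max(1, min(requested, available))


def map_parallel(func, items, requested=None):
    """Apply `func` to every item in a process pool and return results in order.

    `func` must be picklable, i.e. defined at module level.
    """
    with multiprocessing.Pool(worker_count(requested)) as pool:
        return pool.map(func, items)
```

A call such as map_parallel(dock, subpopulation) would then play the role of workers.map(dock, subpopulation) in snippet 3.2, with the pool sized to the host rather than fixed at 100 processes.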
[...]
population = []
usedPdb = []
similarity = 1
[...]
for line in pdbOut:
    if len(population) != popSize:
        hit = re.search('(?<=structureId=").*?(?=")', line)
        if hit:
            pdb = hit.group()
            chain = re.search('(?<=chain id=")\w(?=")', urllib2.urlopen(
                'http://www.rcsb.org/pdb/rest/describeMol?structureId=' + pdb).read()).group()
            [...]
            ligand = re.search('(?<=chemicalID=").*?(?=")', line).group()
            write_pdb(pdb, ligand)
            fix_pdb(pdb)
            extract_chain(pdb, chain)
            seq = get_seq(pdb)
            bindingSites = get_binding_sites(pdb, nrBindingSites)
            [...]
            subpop = create_subpopulation(motherProtein, subpopSize, pctM)
            population.append(subpop)
[...]
if len(population) == popSize:
    print "Population initialization finished."
    return population
else:
    sys.exit("Not enough available templates for these parameters.")
3.2
Mining the PDB is the first step of the algorithm and is crucial to its
further progress. The complete code responsible for this is given in code
snippet 3.4. As described in the previous section, the program tries to find
existing templates in the Protein Data Bank that bind to compounds showing
high similarity to our input molecule. This is not always straightforward,
however, as any input molecule can be given. Moreover, it is likely that the
given compound does not act as a known natural ligand.
For this reason, the program looks for structures in an iterative manner.
It starts by screening the PDB for structures binding ligands that have 100%
similarity to the query compound. If this does not fill up the total
population, which is very likely, the similarity threshold drops by 10
percent. This process is repeated until the population is saturated with
templates. If that cannot be achieved, the program aborts and notifies the
user that not enough templates were found for the given set of parameters.
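The iterative relaxation of the similarity threshold can be sketched as follows. This is a simplified illustration under stated assumptions, not the code of snippet 3.4: the query argument is a hypothetical callable standing in for the PDB request, and stopping at a minimum similarity of 30 percent is assumed from the minSim parameter in snippet 3.1.

```python
def collect_templates(query, pop_size, min_sim_pct=30, step_pct=10):
    """Gather unique template ids, lowering the similarity threshold step by step.

    `query(similarity_pct)` is a hypothetical callable that returns an iterable
    of template ids found at the given similarity percentage.
    """
    templates = []
    seen = set()
    similarity = 100  # integer percentages avoid floating point drift
    while similarity >= min_sim_pct and len(templates) < pop_size:
        for template_id in query(similarity):
            if template_id not in seen:
                seen.add(template_id)
                templates.append(template_id)
                if len(templates) == pop_size:
                    break
        similarity -= step_pct
    if len(templates) < pop_size:
        raise RuntimeError("Not enough available templates for these parameters.")
    return templates


# Made-up query results, for illustration only.
fake_hits = {90: ["1ABC"], 80: ["1ABC", "2DEF"]}
print(collect_templates(lambda pct: fake_hits.get(pct, []), pop_size=2))
```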
Thankfully, mining the PDB is greatly facilitated by the available RESTful
Web Services API. This API supports queries based on ligands, which can be
given by their SMILES representation. The similarity percentage can be
included in the URL. Upon a request, carried out with the urllib2 Python
package, the server returns an XML file which is parsed manually. An example
XML file returned this way is given in snippet 3.5.
[...]
  <ligandInfo>
    <ligand structureId="4N9B" chemicalID="2HH" type="non-polymer" molecularWeight="202.213">
      <chemicalName>1-METHYL-N-(PYRIDIN-3-YL)-1H-PYRAZOLE-5-CARBOXAMIDE</chemicalName>
      <InChIKey>LSEUGEFZKAZGSI-UHFFFAOYSA-N</InChIKey>
      <InChI>InChI=1S/C10H10N4O/c1-14-9(4-6-12-14)10(15)13-8-3-2-5-11-7-8/h2-7H,1H3,(H,13,15)</InChI>
    </ligand>
  </ligandInfo>
</smilesQueryResult>
We are particularly interested in the structureId attribute of the ligand
element of this file. This is the identifier of the PDB template we are
looking for. A regular expression extracts this id and stores it in a
variable. If this identifier is not contained in the list of already mined
PDB files, the procedure continues and another regular expression extracts
the id of the ligand. This is needed because the downloaded PDB file still
contains the ligand coordinates, which have to be filtered out for the
molecular docking process. The write_pdb function takes care of this, again
using a custom regex, and saves the PDB file in the appropriate folder. The
PDB file still needs further preparation though. Filtering out the ligand
introduces missing residues. Moreover, a PDB file can contain more than one
chain, which is not compatible with our algorithm. The fix_pdb and
extract_chain functions were implemented for that reason. These are not
given here but can be consulted in the complete source code disclosed in the
attachments (appendix B) starting at page 86. After these steps, the PDB
file is ready for use and is passed to the get_binding_sites function
described in the next section. An array of binding site centers is returned
and stored in a Protein object together with other metadata like sequence
and chain id. This class is described in section 3.4.
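As an alternative to hand-written regular expressions, the same two attributes could be read with an XML parser. The sketch below uses xml.etree.ElementTree from the Python standard library; it is not the approach taken in FIGARO, and the sample document is a shortened, assumed reconstruction of the response shown in snippet 3.5.

```python
import xml.etree.ElementTree as ET

# Shortened sample response; the exact root attributes of the real service are not shown here.
SAMPLE = """<smilesQueryResult>
  <ligandInfo>
    <ligand structureId="4N9B" chemicalID="2HH" type="non-polymer" molecularWeight="202.213">
      <InChIKey>LSEUGEFZKAZGSI-UHFFFAOYSA-N</InChIKey>
    </ligand>
  </ligandInfo>
</smilesQueryResult>"""


def parse_ligand_hits(xml_text):
    """Return (structureId, chemicalID) pairs for every ligand element."""
    root = ET.fromstring(xml_text)
    return [(ligand.get("structureId"), ligand.get("chemicalID"))
            for ligand in root.iter("ligand")]


print(parse_ligand_hits(SAMPLE))
```

A parser of this kind is less sensitive to attribute order and whitespace than a regex, at the cost of reading the whole response into memory.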
def get_binding_sites(pdb, nrBindingSites):
    bsCenterList = []
    startingDir = os.getcwd()
    os.chdir(startingDir + "/pdb")
    os.system("./auto.py -i [...]
    [...]
    for i in range(0, nrBindingSites):
        line = summary.readline()
        x = line.split()[-3]
        y = line.split()[-2]
        z = line.split()[-1]
        bsCenterList.append("(" + x + "," + y + "," + z + ")")
    ## Remove temporary files
    [...]
    ## Return to main directory.
    os.chdir(startingDir)

    return bsCenterList
3.3
The mining of PDB files based on a query ligand only serves as a guideline
for our program. To keep bias low, the whole protein conformation is
screened for a range of putative binding sites after being prepared for use.
Eventually our compound will be docked to all of these sites, an approach
better known as blind docking. The SiteHound software package is used to
screen for possible binding sites, up to the maximum number of binding
pockets defined in the global program parameters. [52] The implementation of
this part of the application is shown in code snippet 3.6.
The SiteHound algorithm is actually very comparable to general molecular
docking. It uses a small methyl carbon or phosphate probe to screen the
surface of a protein for favourable weak interactions. The resulting points
are clustered according to their spatial proximity by an agglomerative
hierarchical clustering algorithm. A list of these clusters, corresponding
to the identified binding pockets, is returned in the end. The clusters are
ranked by their total interaction energy and stored by FIGARO: an array of
binding site centers is returned, with maximum length nrBindingSites as
defined in 3.1. According to a test study based on 77 experimentally derived
protein structures linked to known protein-ligand complexes, the actual
binding site was among the top three binding pockets recognized by SiteHound
in 95 percent of the cases. [53]
SiteHound offers two possibilities concerning probe type. We opted for
the CMET or methyl carbon probe, which is the most appropriate one for
docking general drug-like compounds. [53]
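The extraction of binding site centers from a ranked summary can be sketched as below. The sketch only assumes what snippet 3.6 relies on, namely that the last three whitespace-separated fields of each line are the x, y and z coordinates; the sample lines are made up and do not reproduce the actual SiteHound output format.

```python
def parse_site_centers(summary_lines, max_sites):
    """Read up to `max_sites` (x, y, z) centers from already ranked summary lines."""
    centers = []
    if max_sites <= 0:
        return centers
    for line in summary_lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        try:
            x, y, z = (float(value) for value in fields[-3:])
        except ValueError:
            continue  # skip header lines without numeric coordinates
        centers.append((x, y, z))
        if len(centers) == max_sites:
            break
    return centers


# Hypothetical summary content, for illustration only.
sample = [
    "rank energy x y z",
    "1 -250.1 12.5 -3.0 8.25",
    "2 -190.4 1.0 2.0 3.0",
    "3 -120.0 -4.5 0.0 7.5",
]
print(parse_site_centers(sample, max_sites=2))
```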
class Protein():

    def __init__(self, id, seq, chain, bindingSites):
        self.id = id
        self.seq = seq
        self.chain = chain
        self.bindingSites = bindingSites
        self.score = None
        self.parents = []
        self.crossoverPoints = []
        self.bestBindingSite = None
        self.pointMutations = {}

    def set_score(self, score):
        self.score = score

    [...]
        if crossoverPoint1:
            self.crossoverPoints = [crossoverPoint1]
        if crossoverPoint2:
            self.crossoverPoints.append(crossoverPoint2)

    def set_best_binding_site(self, bindingSite):
        self.bestBindingSite = bindingSite

    def set_id(self, id):
        self.id = id

    def update_seq(self, seq):
        self.seq = seq

    def set_point_mutations(self, pointMutations):
        self.pointMutations = pointMutations
3.4
The Protein class comprises all metadata about a protein structure, such
as its sequence, chain id, identified binding sites and parental sequences.
It was made to centralize important parts of the code at the level of
individual proteins. The implementation of this class is presented in code
snippet 3.7.
The class provides functionality to manage all protein data efficiently.
When protein sequences are mutated, it is important to keep data about
these mutations in memory because they are used later to perform structure
prediction based on template structures. To manage this, every Protein
object holds a pointMutations map as shown in snippet 3.7. It records the
index of each mutated amino acid in the sequence along with its original and
new residue, and can be set from outside through the set_point_mutations
function. Furthermore, setters are available for storing docking scores,
best binding site coordinates, updated sequences, new protein ids and
parental sequences. When setting parental sequences, the exact crossover
points are also stored in the appropriate variable; these are crucial for
the homology modeling process later on.
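The bookkeeping behind the pointMutations map can be illustrated with a small, self-contained sketch. It is not the point_mutation function of FIGARO: the function name, the rng parameter and the uniform choice of replacement residues are assumptions, and proline and cysteine are simply left untouched, in line with the exceptions the mutation operator makes.

```python
import random

# The 20 standard amino acids without cysteine (C) and proline (P).
AMINO_ACIDS = "ADEFGHIKLMNQRSTVWY"


def mutate_sequence(seq, pct_m, rng=random):
    """Mutate each eligible residue with probability `pct_m`.

    Returns the new sequence and a map {1-based position: original + new residue}.
    """
    mutable = list(seq)
    point_mutations = {}
    for index, residue in enumerate(seq):
        if residue in "CP":
            continue  # never replace cysteine or proline
        if rng.random() < pct_m:
            new_residue = rng.choice(AMINO_ACIDS)
            if new_residue != residue:
                mutable[index] = new_residue
                point_mutations[index + 1] = residue + new_residue
    return "".join(mutable), point_mutations


new_seq, mutations = mutate_sequence("ACDPEFG", 0.5, random.Random(1))
print(new_seq, mutations)
```

Each map entry holds two characters, so the second one can be read back later as the residue to place at that position.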
def point_mutation(protein, pctM):
    mutable = list(protein.seq)
    pointMutations = {}
    [...] == "C" or protein.seq[index] == "P"):
    [...]
        if protein.seq[index] != mutable[index]:
            pointMutations[index+1] = protein.seq[index] + mutable[index]
    protein.update_seq("".join(mutable))
    [...]
    print "protein " + protein.id + " mutated."
    return protein
3.5
When an initial population and its subpopulations are formed, new sequences
are created by introducing point mutations into the mother or template
proteins. Although a single point mutation is not likely to change the
protein fold significantly, mutations accumulate quickly over the course of
the program and incrementally alter protein conformation by triggering
backrub moves on the protein backbone. This phenomenon is described earlier
in section 1.5.1 on page 25.
However, not all point mutations can be treated equally; there are a few
notable exceptions. The first one is proline, which is known to force the
protein backbone into certain conformations due to its uncommon and
constraining way of chaining in a polypeptide. To suit our purpose, this
amino acid is therefore never introduced nor replaced by the mutation
operator, as predictions would likely become unreliable. Moreover, cysteine
is known to form
def predict_structure_backrub(protein):
    parser = PDBParser()
    structure = parser.get_structure(protein.id[0:4], "pdb/" + protein.id[0:4] + ".pdb")
    atomList = []
    for atom in structure.get_atoms():
        atomList.append(atom)
    nbs = NeighborSearch(atomList)
    [...]
    for mutation in protein.pointMutations:
        ## Write resfile and get all residues within 6 Angstrom of mutated residues using Biopython package.
        with open("resfile" + protein.id, "w") as resfile:
            resfile.write("NATRO\nstart\n")
            affectedList = []
            residue = structure.get_chains().next()[mutation]
            for atom in Selection.unfold_entities(residue, "A"):
                for neighbor in nbs.search(atom.get_coord(), 6, level="R"):
                    affectedList.append(get_resi(neighbor))
            resfile.write(str(mutation) + " " + protein.chain + " PIKAA " + protein.pointMutations[mutation][1] + "\n")
            ## Make affectedList unique.
            affectedList = list(set(affectedList))
            pivotList = []
            for res in affectedList:
                if res != mutation:
                    resfile.write(str(res) + " " + protein.chain + " NATAA\n")
                pivotList.append(res)
                if res != 1:
                    pivotList.append(res - 1)
                if res != len(protein.seq):
                    pivotList.append(res + 1)
            pivotList = list(set(pivotList))
        ## Run backrub sampling with 10.000 Monte Carlo iterations.
        [...] str(10000) + [...] resfile [...] + protein.id + [...] pivot residues [...]
        for pivot in pivotList:
            command += str(pivot) + " "
        os.system(command)
        ## Refine side chains using rotamer libraries (SCWRL package), switch hashes to activate. Deactivation preferred.
        # os.system("./scwrl4/Scwrl4 -i " + protein.id + "_0001.pdb -o pdb/" + protein.id + ".pdb")
        [...]
        ## Remove temporary files.
        os.system("rm " + protein.id + [...] "resfile" + protein.id [...])
    [...]
    fix_pdb(protein.id)
    print "Protein " + protein.id + " modeled using RosettaBackrub and SCWRL4."
    return protein
is also available, but is not activated by default. After the model has been
generated, all temporary files are deleted and the new PDB file is stored in
the appropriate folder. The id field in the Protein object links to this file by
name.
def dock(protein):
    print "Docking started for protein " + protein.id + "."
    [...]
    bestScore = None
    bestBindingSite = None
    [...]
    for bindingSite in protein.bindingSites:
        ## First write system file.
        [...] MAPPER\nSITE_MAPPER RbtSphereSiteMapper\nCENTER " + bindingSite + "\nRADIUS 15.0\ [...]
        ## Generate cavity for docking.
        [...]
        ## Perform docking.
        os.system("rbdock -i ligand.sd -o output" + protein.id + " -r rDockSystem" + protein.id + [...]
        [...]
        ## Return docking score.
        with open("output" + protein.id + ".sd", "r") as dockingResult:
            for line in dockingResult:
                if line == ">  <SCORE>\n":
                    dockingScore = float(dockingResult.next().strip())
        if bestScore is None or dockingScore < bestScore:
            bestScore = dockingScore
            bestBindingSite = bindingSite
    [...]
    ## Remove temporary files
    os.system("rm " + protein.id + ".mol2 output" + protein.id + ".sd rDockSystem" + protein.id +
              "_cav1.grd rDockSystem" + protein.id + ".as rDockSystem" + protein.id + ".prm")
    print "Docking finished for protein " + protein.id + "."
    return protein
3.6
Ligand docking
In this section, we discuss the ligand docking process and its
implementation as shown in snippet 3.10. This blind docking procedure
consists of four main parts: converting the PDB file to a mol2 file,
generating the receptor cavity at the targeted binding site, performing the
actual ligand docking job, and selecting the best docking score over the
different binding sites.
Again, several parameters need to be set adequately for reliable results.
Based on suggestions made in rDock's documentation and some intuition about
the given ligand molecule, setting up these parameters is fairly
straightforward: a .prm rDock system file is generated that satisfies the
format constraints given in rDock's documentation (the full documentation
for rDock can be downloaded from the project's website at
http://rdock.sourceforge.net/). The content we used for our molecule is
shown where the system file is written in code snippet 3.10. First, the
receptor mol2 file and the tolerated backbone flexibility parameter are
defined. We opted for a 3 Å flexibility range, which should provide a
substantial degree of conformational adaptability. Next, the method of
cavity grid construction is specified. For our purpose, the
RbtSphereSiteMapper algorithm provided by rDock is the recommended choice.
This method uses a small sphere with adjustable radius to explore the
surface around the given center coordinates. Moreover, a maximum binding
site radius is defined, which should be chosen according to the structural
information of the ligand compound. The maximum number of binding pockets to
be constructed was set to 1. For the other parameters, we used the default
values as they should not alter results significantly. However, all
parameters mentioned in this paragraph should be carefully chosen and
evaluated for every single docking situation.
To be able to use our models, they need to be converted to a new file
format. The OpenBabel package takes care of that and generates the
appropriate mol2 files. It is important to note that FIGARO has many
dependencies that all need to be installed and set up correctly in order to
work. For problems with this in further research, please contact the author.
The docking scores produced by rDock represent changes in free energy.
For thermodynamically favourable binding situations, these will be negative.
Absolute values are therefore used as a fitness measure in our application.
Also note that docking scores are themselves produced by a genetic algorithm
within rDock. They are all estimates that can be made more precise by
increasing the number of generations or by running multiple docking jobs and
taking averages. To shorten the runtime of our application, we perform each
docking job only once. In the end, a fitness score only serves as a
guideline and does not need to be too precise.
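The last step, selecting the best score over the binding sites and turning it into a fitness value, can be sketched as follows. This is an illustration under stated assumptions rather than the code of snippet 3.10: the helper names are hypothetical, the SD data header for the total score is assumed to be a line starting with ">" and ending in "<SCORE>", and the sample text is made up.

```python
def read_scores(sd_text):
    """Collect the value on the line after every SCORE data header in SD text."""
    scores = []
    lines = sd_text.splitlines()
    for i, line in enumerate(lines[:-1]):
        # Tolerate any amount of whitespace between '>' and '<SCORE>'.
        if line.startswith(">") and line.split()[-1] == "<SCORE>":
            scores.append(float(lines[i + 1].strip()))
    return scores


def best_binding_site(score_by_site):
    """Return (site, score, fitness) for the most negative docking score."""
    site = min(score_by_site, key=score_by_site.get)
    score = score_by_site[site]
    return site, score, abs(score)


sample_sd = ">  <SCORE>\n-21.5\n\n>  <SCORE.INTER>\n-10.0\n\n$$$$\n"
print(read_scores(sample_sd))
print(best_binding_site({"(1,2,3)": -12.0, "(4,5,6)": -21.5}))
```

Note that taking the absolute value only makes sense as a fitness measure when the best score is negative; a positive, unfavourable score would need separate handling.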
64
def two_points_crossover(population, pctX):
    if not len(population) > 1:
        population[0].set_parents(protA=population[0])
        population[0].set_id(population[0].id + "recombined")
        print "No crossover possible."
    else:
        random.shuffle(population)
        for index, protein in enumerate(population):
            # [Listing lines 9-11 are missing from the source. They presumably
            #  hold the crossover test against pctX, the check for a following
            #  individual, and the computation of minLength for that case.]
                    crossoverPoint1 = random.randint(0, minLength)
                    crossoverPoint2 = random.randint(crossoverPoint1, minLength)
                    protein.set_parents(protA=protein, protB=population[index + 1],
                                        crossoverPoint1=crossoverPoint1,
                                        crossoverPoint2=crossoverPoint2)
                    # [Start of the sequence-update statement (listing line 16) is missing.]
                    ... seq[crossoverPoint1:crossoverPoint2] + protein.seq[crossoverPoint2:])
                    protein.set_id(protein.id + "recombined")
                else:
                    minLength = min([len(protein.seq), len(population[index - 1].parents[0][1])])
                    crossoverPoint1 = random.randint(0, minLength)
                    crossoverPoint2 = random.randint(crossoverPoint1, minLength)
                    protein.set_parents(protA=protein, protB=population[index - 1],
                                        crossoverPoint1=crossoverPoint1,
                                        crossoverPoint2=crossoverPoint2)
                    # [Start of the sequence-update statement (listing line 25) is missing.]
                    ... parents[0][1][crossoverPoint1:crossoverPoint2] +
                        protein.seq[crossoverPoint2:])
                    protein.set_id(protein.id + "recombined")
                print "Crossover finished."
            else:
                protein.set_parents(protA=protein)
                protein.set_id(protein.id + "recombined")
                print "No crossover."
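The core of the two-points crossover operator can be sketched on plain strings. This is a minimal illustration and not the FIGARO implementation: the function name `two_point_crossover` and the optional `rng` argument are assumptions, and the bookkeeping done by the Protein objects (parents, identifiers) is left out.

```python
import random


def two_point_crossover(seq_a, seq_b, rng=random):
    """Return (child, p1, p2): seq_a with the segment [p1, p2) taken from seq_b."""
    # Crossover points are limited to the shorter of the two sequences.
    min_length = min(len(seq_a), len(seq_b))
    p1 = rng.randint(0, min_length)
    p2 = rng.randint(p1, min_length)
    child = seq_a[:p1] + seq_b[p1:p2] + seq_a[p2:]
    return child, p1, p2
```

Because the exchanged segment keeps its position, the child always has the length of the first parent, which keeps the position-by-position correspondence with both parents intact.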
3.7
Modeling recombined sequences can be a very tricky task. When large parts of
random protein sequences are exchanged, the conformational topology is likely
to be altered dramatically and it is not guaranteed that the structure will
even be stable. Moreover, a full ab initio remodeling of the generated protein
sequences would be intractable with current computational resources.
However, homologous sequences are far less likely to introduce large
conformational changes upon recombination, a fact that enables crossover to
be a major driver of natural evolution. By splitting up the population of our
genetic algorithm into subpopulations, we can group sequences by homology and
have them mate in isolated islands; this way, recombined sequences can be modeled
def predict_structure_homology_fast(protein):
    if len(protein.parents) > 1:
        env = modeller.environ()
        env.io.atom_files_directory = "./pdb"
        templates = []
        for parent in protein.parents:
            templates.append(parent[0])
        a = modeller.automodel.automodel(env, alnfile=protein.id + "total.ali",
                                         knowns=tuple(templates), sequence=protein.id)
        a.starting_model = 1
        a.ending_model = 1
        a.make()
        ## Refine side chains using rotamer libraries (SCWRL package); switch
        ## hashes to activate. Deactivation preferred.
        # os.system("./scwrl4/Scwrl4 -i " + protein.id + ".B99990001.pdb -o final" + protein.id + ".pdb")
        ## Replace previous PDB with model.
        # [Start of this statement (listing line 16) is missing from the source.]
        ... protein.id + ".pdb")
        fix_pdb(protein.id)
        ## Remove temporary files.
        os.system("rm " + protein.id + ...)  # [final string literal lost in extraction]
        ## Modeller does not add chain ids by itself.
        add_chain_id(protein.id, protein.chain)
        return protein
    else:
        os.system("cp pdb/" + protein.parents[0][0] + ".pdb pdb/" + protein.id + ".pdb")
        return protein
With sequence identities of 95% and above, comparative modeling algorithms should be capable of returning highly reliable models of the recombined protein sequences [42, 41]. Another asset is that the program knows exactly how sequences have evolved over the course of a run. It is therefore straightforward to supply correct alignments
to MODELLER, the comparative modeling package we opted for. We implemented a very fast strategy to do this, shown in code snippet 3.12.
Usually, MODELLER screens sequence databases for homologous structures
and needs user interaction to select a suitable template. This takes a lot
of time, and our implementation skips it entirely. Alignments
are constructed directly by the write_aln function, which uses sequence metadata stored in the Protein objects. The implementation of this function can
be found in the attachments chapter (appendix B). Again, functionality to
refine side chains using SCWRL4 is available but not enabled by default.
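Because a recombinant keeps the residue positions of its parents, the alignment can be written without running an alignment program. The sketch below builds such a PIR-style alignment as a string. It follows the header layout used by write_aln in appendix B (ten colon-separated fields, with "." as a default value), while the function names `pir_entry` and `pir_alignment` and the example identifiers are hypothetical. The "*" that terminates each sequence is an assumption based on MODELLER's PIR convention.

```python
def pir_entry(kind, code, sequence):
    """Format one PIR record; kind is 'structure' (template) or 'sequence' (target)."""
    if kind == "structure":
        # Residue ranges and chains are left at their defaults (".").
        header = "structure:" + code + ":.:.:.:.::::"
    else:
        header = "sequence:" + code + "::::::::"
    return ">P1;" + code + "\n" + header + "\n" + sequence + "*\n"


def pir_alignment(target_code, target_seq, parents):
    """Concatenate one record per parent (code, sequence) plus the target record."""
    records = [pir_entry("structure", code, seq) for code, seq in parents]
    records.append(pir_entry("sequence", target_code, target_seq))
    return "".join(records)


if __name__ == "__main__":
    print(pir_alignment("child", "ACDE", [("parentA", "ACDF"), ("parentB", "GCDE")]))
```

No gap characters are inserted here, which is only valid as long as parents and recombinant have the same length and residue numbering.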
3.8
In the current revision of FIGARO, only one selection operator is implemented; it is shown in code snippet 3.13. The roulette wheel
selection procedure used is completely analogous to the description given in section 2.4, starting at page 40. In addition, our roulette wheel selection
function takes two extra arguments. The first is pctE, which defines the
percentage of elitism. This parameter is set in the global program parameters shown in snippet 3.1. Note, however, that
the current implementation of the selection operator does not support elitism
yet. To extend the capabilities of our genetic algorithm, this
needs to be added in a future release.
The other additional argument is a boolean specifying whether PDB files need to be copied and replaced for the selected proteins. This is only
desirable for selection within the subpopulations before recombination. The second selection step happens at the population level, when new
template proteins are chosen for the next generation, and is followed
by a part of the program with different requirements. The boolean
is therefore used to switch between the two levels of selection. Tournament selection (also described in section 2.4) is another approach that
comes to mind for the latter selection round and could be implemented in
future releases of the application. It provides more flexibility regarding selection pressure. For example, it can give more weight to the best individuals
from every subpopulation by increasing selection pressure.
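The two selection schemes mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the FIGARO code: `roulette_wheel_select` expects non-negative weights (how docking scores, where lower is better, are mapped onto such weights is left open here), and `tournament_select` treats the lowest score as the winner. Both function names are hypothetical.

```python
import random


def roulette_wheel_select(weights, rng=random):
    """Return an index with probability proportional to its non-negative weight."""
    total = sum(weights)
    threshold = rng.uniform(0, total)
    running = 0.0
    for index, weight in enumerate(weights):
        running += weight
        if threshold < running:
            return index
    return len(weights) - 1  # guard against floating-point round-off


def tournament_select(scores, k, rng=random):
    """Return the index of the lowest score among k randomly drawn contestants."""
    contestants = rng.sample(range(len(scores)), k)
    return min(contestants, key=lambda i: scores[i])
```

In the tournament variant, the tournament size k controls selection pressure: with k equal to the population size the best individual always wins, while k = 1 reduces to uniform random selection.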
# [The def line of the roulette wheel selection function is missing from the source.]
    newPopulation = []
    for i in range(0, len(population)):
        totalScore = 0
        for protein in population:
            totalScore += protein.score
        # [Listing line missing from the source here.]
        index = 0
        currentTotal = population[index].score
        # [Listing lines 10-11 are missing; the two statements below are the
        #  body of the loop that advances the wheel.]
            index += 1
            currentTotal += population[index].score
        if replace == True:
            newId = population[index].id[0:4] + "selected" + str(i+1)
            # [Start of this statement (listing lines 15-16) is missing.]
            ... population[index].chain, population[index].bindingSites))
            # [Listing lines 17-19 are missing from the source.]
        if replace == False:
            newPopulation.append(population[index])
    if replace == True:
        ## Clean up PDB folder.
        for protein in population:
            # [Listing line 24 is missing from the source.]
    print "Roulette Wheel Selection succeeded."
    return newPopulation
Chapter 4
Discussion
support for Apache Hadoop and MapReduce to make use of as much computational power as possible and lift the application to a whole new level of
computational abilities.
Chapter 5
Conclusion
Part III
Appendix
Appendix A
Bibliography
OptFlux:
[PubMed Central:PMC2864236]
[DOI:10.1186/1752-0509-4-45] [PubMed:20403172].
[2] P. Pharkya, A. P. Burgard, and C. D. Maranas. OptStrain: a computational framework for redesign of microbial production systems. Genome
Res., 14(11):2367-2376, Nov 2004. [PubMed Central:PMC525696]
[DOI:10.1101/gr.2872004] [PubMed:15520298].
[3] O. Khersonsky, D. Rothlisberger, O. Dym, S. Albeck, C. J. Jackson,
D. Baker, and D. S. Tawfik. Evolutionary optimization of computationally designed enzymes: Kemp eliminases of the KE07 series. J. Mol.
Biol., 396(4):1025-1042, Mar 2010. [DOI:10.1016/j.jmb.2009.12.031]
[PubMed:20036254].
[4] L. Giger, S. Caner, R. Obexer, P. Kast, D. Baker, N. Ban, and D. Hilvert. Evolution of a designed retro-aldolase leads to complete active site
remodeling. Nat. Chem. Biol., 9(8):494-498, Aug 2013. [PubMed Central:PMC3720730] [DOI:10.1038/nchembio.1276] [PubMed:23748672].
[5] V. Nanda and R. L. Koder. Designing artificial enzymes by intuition
and computation. Nat Chem, 2(1):15-24, Jan 2010. [PubMed Central:PMC3443871] [DOI:10.1038/nchem.473] [PubMed:21124375].
[6] S. Paul, S. A. Planque, Y. Nishiyama, C. V. Hanson, and R. J.
Massey.
Adv. Exp.
[DOI:10.1007/978-1-4614-3461-0_5]
[PubMed:22903666].
[7] Y. Xu, N. Yamamoto, and K. D. Janda.
Catalytic antibod-
Annu. Rev.
P. Kast,
evolution.
and D. Hilvert.
Annu
Rev
Protein design by
Biophys,
37:153-173,
2008.
[DOI:10.1146/annurev.biophys.37.032807.125832] [PubMed:18573077].
[10] P. Molina-Espeja, E. Garcia-Ruiz, D. Gonzalez-Perez, R. Ullrich,
M. Hofrichter, and M. Alcalde.
[PubMed Central:PMC4018863]
[DOI:10.1128/AEM.00490-14] [PubMed:24682297].
[11] M. Sela, F. H. White, and C. B. Anfinsen. Reductive cleavage
of disulfide bridges in ribonuclease. Science, 125(3250):691-692, Apr
1957. [PubMed:13421663].
[12] Carl Branden and John Tooze. Introduction to Protein Structure. Garland Science, 1999.
[13] S. C. Lovell, I. W. Davis, W. B. Arendall, P. I. de Bakker, J. M.
Word, M. G. Prisant, J. S. Richardson, and D. C. Richardson.
Structure validation by Calpha geometry: phi,psi and Cbeta deviation. Proteins, 50(3):437-450, Feb 2003. [DOI:10.1002/prot.10286]
[PubMed:12557186].
[14] F. Crick. Central dogma of molecular biology. Nature, 227(5258):561-563,
Aug 1970. [PubMed:4913914].
[15] Alka Dwevedi. Protein Folding: Examining the Challenges from Synthesis to Folded Form (SpringerBriefs in Biochemistry and Molecular
Biology). Springer, 2014.
[16] C. B. Anfinsen. Principles that govern the folding of protein chains.
Science, 181(4096):223-230, Jul 1973. [PubMed:4124164].
[17] C. B. Anfinsen and H. A. Scheraga. Experimental and theoretical
aspects of protein folding.
[PubMed:237413].
[18] Levinthal C. How to Fold Graciously. Mossbauer Spectroscopy in
Biological Systems: Proceedings of a meeting held at Allerton House,
Monticello, Illinois, 1969.
[19] K. A. Dill.
Protein
[PubMed Central:PMC2144345]
[DOI:10.1110/ps.8.6.1166] [PubMed:10386867].
[20] Louise A Wallace and C Robert Matthews. Sequential vs. parallel
protein-folding mechanisms: experimental tests for complex folding reactions. Biophysical Chemistry, 101-102:113-131, 2002. Special issue
in honour of John A Schellman.
[21] L. Pauling, R. B. Corey, and H. R. Branson. The structure of
proteins; two hydrogen-bonded helical configurations of the polypeptide chain. Proc. Natl. Acad. Sci. U.S.A., 37(4):205-211, Apr 1951.
[PubMed Central:PMC1063337] [PubMed:14816373].
[22] M. Novotny and G. J. Kleywegt.
in protein structures.
[DOI:10.1016/j.jmb.2005.01.037] [PubMed:15740737].
The
Bioinformat-
[PubMed Central:PMC2682512]
[DOI:10.1093/bioinformatics/btp163] [PubMed:19304878].
[30] David L. Nelson and Michael M. Cox. Lehninger Principles of Biochemistry. W. H. Freeman, 2008.
[31] H. J. Schneider. Limitations and extensions of the lock-and-key principle: differences between gas state, solution and solid state structures.
[PubMed Cen-
Menten mechanisms.
[PubMed:3770204].
[34] Y. Liu and B. Kuhlman. RosettaDesign server for protein design. Nucleic Acids Res., 34(Web Server issue):W235-238, Jul 2006. [PubMed
Central:PMC1538902] [DOI:10.1093/nar/gkl163] [PubMed:16845000].
[35] J. C. Kendrew, G. Bodo, H. M. Dintzis, R. G. Parrish,
H. Wyckoff, and D. C. Phillips. A three-dimensional model of the
myoglobin molecule obtained by x-ray analysis. Nature, 181(4610):662-666,
Mar 1958. [PubMed:13517261].
[36] B. Rost and C. Sander. Bridging the protein sequence-structure gap by
structure predictions. Annu Rev Biophys Biomol Struct, 25:113-136,
1996. [DOI:10.1146/annurev.bb.25.060196.000553] [PubMed:8800466].
[37] A. Ilari and C. Savino. Protein structure determination by x-ray crystallography. Methods Mol. Biol., 452:63-87, 2008. [DOI:10.1007/978-1-60327-159-2_3] [PubMed:18563369].
[38] B. Montgomery Pettitt.
tein folding.
How
[DOI:10.1126/science.1208351] [PubMed:22034434].
[40] David E. Shaw, Martin M. Deneroff, Ron O. Dror, Jeffrey S. Kuskin,
Richard H. Larson, John K. Salmon, Cliff Young, Brannon Batson, Kevin J. Bowers, Jack C. Chao, Michael P. Eastwood, Joseph
Gagliardo, J. P. Grossman, C. Richard Ho, Douglas J. Ierardi, Istvan
Kolossvary, John L. Klepeis, Timothy Layman, Christine McLeavey,
Mark A. Moraes, Rolf Mueller, Edward C. Priest, Yibing Shan, Jochen
Spengler, Michael Theobald, Brian Towles, and Stanley C. Wang. Anton, a special-purpose machine for molecular dynamics simulation.
SIGARCH Comput. Archit. News, 35(2):1-12, Jun 2007.
[41] Ruben Abagyan and Andrew J. W. Orry. Homology Modeling: Methods and
Protocols (Methods in Molecular Biology). Humana Press, 2012.
[42] B. Webb and A. Sali. Protein structure modeling with MODELLER.
Methods Mol. Biol., 1137:1-15, 2014. [DOI:10.1007/978-1-4939-0366-5_1] [PubMed:24573470].
[43] Andreas Kukol. Molecular Modeling of Proteins (Methods in Molecular
Biology). Humana Press, 2014.
[DOI:10.1016/j.str.2005.10.007] [PubMed:16472746].
[45] C. A. Smith and T. Kortemme. Backrub-like backbone simulation
recapitulates natural protein conformational variability and improves
mutant side-chain prediction.
[PubMed Central:PMC3151162] [PubMed:21534921].
[47] M. Cloutier and P. Wellstead. The control systems structures of energy
metabolism. J R Soc Interface, 7(45):651-665, Apr 2010. [PubMed Central:PMC2842784] [DOI:10.1098/rsif.2009.0371] [PubMed:19828503].
[48] Michael Affenzeller, Stephan Winkler, Stefan Wagner, and Andreas Beham. Genetic Algorithms and Genetic Programming: Modern Concepts
and Practical Applications. Chapman & Hall/CRC, 1st edition, 2009.
[49] Lawrence David Davis. Evolutionary Computation in Practice (Studies
in Computational Intelligence). Springer, 2008.
[50] David L. Applegate, Robert E. Bixby, Vasek Chvatal, and William J.
Cook.
[52] D. Ghersi and R. Sanchez. EasyMIFS and SiteHound: a toolkit for the
identification of ligand-binding sites in protein structures. Bioinformatics, 25(23):3185-3186, Dec 2009. [PubMed Central:PMC2913663]
[DOI:10.1093/bioinformatics/btp562] [PubMed:19789268].
[53] M. Hernandez, D. Ghersi, and R. Sanchez. SITEHOUND-web: a server
for ligand binding site identification in protein structures. Nucleic
Acids Res., 37(Web Server issue):W413-416, Jul 2009. [PubMed Central:PMC2703923] [DOI:10.1093/nar/gkp281] [PubMed:19398430].
[54] F. Lauck, C. A. Smith, G. F. Friedland, E. L. Humphris, and T. Kortemme.
[DOI:10.1093/nar/gkq369] [PubMed:20462859].
[55] G. G. Krivov, M. V. Shapovalov, and R. L. Dunbrack. Improved
prediction of protein side-chain conformations with SCWRL4. Proteins, 77(4):778-795, Dec 2009.
[PubMed Central:PMC2885146]
[DOI:10.1002/prot.22488] [PubMed:19603484].
[56] S. Ruiz-Carmona, D. Alvarez-Garcia, N. Foloppe, A. B. Garmendia-Doval, S. Juhos, P. Schmidtke, X. Barril, R. E. Hubbard, and
S. D. Morley.
PLoS Comput.
[PubMed Central:PMC3983074]
[DOI:10.1371/journal.pcbi.1003571] [PubMed:24722481].
Appendix B
Attachments
B.1 main.py
Code Snippet B.1: Complete source code: main.py.
########################################
####              FIGARO            ####
# [Banner lines 3-4 and 6-10 are missing from the source.]
#### Student nr: r0307453           ####

from functions import ...  # [imported names lost in extraction]
import multiprocessing

## Set number of processor cores
workers = multiprocessing.Pool(100)

## Parameters:

## Percentage elitism
pctE = 0.25
## Percentage crossover
pctX = 0.6
## Percentage mutation
pctM = 0.005
## Population size
popSize = 100
## Subpopulation size
subpopSize = 10
## Number of putative binding sites to be inspected
nrBindingSites = 5
## Number of generations
genSize = 50
## Maximum resolution of structures in initial population
maxRes = 2.0
## Maximum number of entities in structure
maxEnt = 1
## Minimum similarity to query ligand
minSim = 0.3

## Generate random small molecule from http://bcirc.docking.org/random.shtml
## and get SMILES (currently hardcoded).
# [Listing lines 41-42 are missing from the source.]

## Initialize the population.
# [Listing lines 44-45 are missing from the source.]

## Write initial information for result evaluation.
write_initial_info(population)

## Perform GA on population for specified number of generations.
for generation in range(0, genSize):
    for index, subpopulation in enumerate(population):
        ## Predict initial structures.
        population[index] = workers.map(predict_structure_backrub, subpopulation)
        ## Dock ligand to all individuals in subpopulation.
        population[index] = workers.map(dock, population[index])
        ## Select individuals for mating in subpopulations.
        # [Listing line 57 is missing from the source.]
        ## Perform crossover between individuals in subpopulations.
        two_points_crossover(population[index], pctX)
        #single_point_crossover(newPopulation, pctX)
        ## Predict structure of recombined sequences.
        population[index] = workers.map(predict_structure_homology_fast, population[index])
        ## Dock ligand to all recombinations in subpopulation.
        population[index] = workers.map(dock, population[index])
    if generation < genSize - 1:
        ## Select mother proteins for new population based on competition
        ## between each subpopulation's best competitor.
        motherSelection = roulette_wheel_selection(get_candidates_list(population), replace=False)
        # [Listing lines 68-69 are missing from the source.]
    else:
        # [Listing line 71 and the start of the final report statement are missing.]
        ... bestProtein.id + " with a score of " + str(bestProtein.score) + \
            " on binding site " + bestProtein.bestBindingSite + "."
B.2 functions.py
Code Snippet B.2: Complete source code: functions.py.
########################################
####              FIGARO            ####
# [Banner lines 3-4 and 6-10 are missing from the source.]
#### Student nr: r0307453           ####

from protein import Protein
# [Listing line 12 is missing from the source.]
import modeller
import modeller.automodel
import modeller.scripts.complete_pdb
import re
import urllib2
import os
import sys
import random
import copy

ROSETTADIR = "Rosetta"

exceptionList = ["2OGM", "4Y5G"]
aminoAcids = ["A", "D", "E", "F", "G", "H", "I", "K", "L", "M",
              "N", "Q", "R", "S", "T", "V", "W", "Y"]

## Filling up shortcoming of Biopython package: no method for
## retrieving residue index from residue object available.
def get_resi(res):
    return int(str(res).split()[3][7:])
35
population = [ ]
36
usedPdb = [ ]
37
similarity = 1
38
39
40
41
42
43
f o r l i n e in pdbOut :
i f len ( p o p u l a t i o n ) != p o p S i z e :
h i t = r e . s e a r c h ( (?<= s t r u c t u r e I d =) . ? ( ? = ) ,
line )
44
if hit :
45
pdb = h i t . group ( )
46
c h a i n = r e . s e a r c h ( (?<= c h a i n i d =)\w(?=) ,
u r l l i b 2 . urlopen (
47
h t t p : / /www. r c s b . o r g /pdb/ r e s t /
d e s c r i b e M o l ? s t r u c t u r e I d= + pdb ) . r e a d
( ) ) . group ( )
48
49
l i g a n d = r e . s e a r c h ( (?<=c h e m i c a l I D =)
. ? ( ? = ) , l i n e ) . group ( )
50
w r i t e p d b ( pdb , l i g a n d )
51
f i x p d b ( pdb )
52
e x t r a c t c h a i n ( pdb , c h a i n )
53
s e q = g e t s e q ( pdb )
54
b i n d i n g S i t e s = g e t b i n d i n g s i t e s ( pdb ,
nrBindingSites )
55
56
subpop = c r e a t e s u b p o p u l a t i o n (
motherProtein , subpopSize , pctM )
57
p o p u l a t i o n . append ( subpop )
58
59
60
s i m i l a r i t y = s i m i l a r i t y 0.1
i f len ( p o p u l a t i o n ) == p o p S i z e :
61
print P o p u l a t i o n i n i t i a l i z a t i o n f i n i s h e d .
62
return p o p u l a t i o n
63
64
else :
s y s . e x i t ( Not enough a v a i l a b l e t e m p l a t e s f o r t h e s e
parameters . )
65
66 ## Writes i n i t i a l i n f o t o f i l e .
67
68
69
70
71
def w r i t e i n i t i a l i n f o ( p o p u l a t i o n ) :
with open ( i n i t i a l i n f o , w ) a s i n i t i a l F i l e :
f o r subpop in p o p u l a t i o n :
f o r p r o t e i n in subpop :
i n i t i a l F i l e . w r i t e ( p r o t e i n . id + \ t + p r o t e i n .
s e q + \ t + s t r ( p r o t e i n . b i n d i n g S i t e s ) + \n
)
72
print I n i t i a l i n f o w r i t t e n .
73
74 ## E x t r a c t s s i n g l e c h a i n from PDB and w r i t e s as PDB.
75
def e x t r a c t c h a i n ( pdb , c h a i n ) :
76
p a r s e r = PDBParser ( )
77
s t r u c t u r e = p a r s e r . g e t s t r u c t u r e ( pdb ,
78
def g e t s e q ( pdb ) :
82
p a r s e r = PDBParser ( )
83
b u i l d e r = PPBuilder ( )
84
s t r u c t u r e = p a r s e r . g e t s t r u c t u r e ( pdb ,
85
sequenceList = [ ]
86
f o r p o l y p e p t i d e in b u i l d e r . b u i l d p e p t i d e s ( s t r u c t u r e ) :
87
88
s e q u e n c e L i s t . append ( s t r ( p o l y p e p t i d e . g e t s e q u e n c e ( ) ) )
return . j o i n ( s e q u e n c e L i s t )
89
90 ## Checks i f e n t i t y number and r e s o l u t i o n a r e t o l e r a t e d .
91
92
93
eTest = False
94
rTest = False
95
f o r l i n e in d e s c r i p t i o n :
96
e = r e . s e a r c h ( (?<= n r e n t i t i e s =)\w(?=) , l i n e )
97
r = r e . s e a r c h ( (?<= r e s o l u t i o n =) . ? ( ? = ) , l i n e )
98
99
100
e T e s t = True
i f r and f l o a t ( r . group ( ) ) <= maxRes :
101
r T e s t = True
102
i f e T e s t and r T e s t :
103
return True
104
105
else :
return F a l s e
106
107 ## C r e a t e s s u b p o p u l a t i o n .
108
109
subpop = [ ]
110
f o r i in range ( 0 , s i z e ) :
111
subpop . append ( P r o t e i n ( m o t h e r P r o t e i n . id , m o t h e r P r o t e i n .
seq , m o t h e r P r o t e i n . chain , m o t h e r P r o t e i n . b i n d i n g S i t e s )
)
112
p o i n t m u t a t i o n ( subpop [ i ] , pctM )
113
subpop [ i ] . s e t i d ( m o t h e r P r o t e i n . id + + s t r ( i +1) )
114
return subpop
115
116 ## I d e n t i f i e s p u t a t i v e b i n d i n g p o c k e t s u s i n g t h e SiteHound
package
def g e t b i n d i n g s i t e s ( pdb , n r B i n d i n g S i t e s ) :
119
bsCenterList = [ ]
120
s t a r t i n g D i r = o s . getcwd ( )
121
o s . c h d i r ( s t a r t i n g D i r + /pdb )
122
o s . system ( . / auto . py i
123
124
f o r i in range ( 0 , n r B i n d i n g S i t e s ) :
125
l i n e = summary . r e a d l i n e ( )
126
x = l i n e . s p l i t ( ) [ 3]
127
y = l i n e . s p l i t ( ) [ 2]
128
z = l i n e . s p l i t ( ) [ 1]
129
b s C e n t e r L i s t . append ( ( + x + , + y + , + z + )
)
130
## Remove temporary f i l e s
131
132
## Return t o main d i r e c t o r y .
133
os . chdir ( s t a r t i n g D i r )
134
135
return b s C e n t e r L i s t
136
137 ## Performs m o l e c u l a r d o c k i n g o f l i g a n d t o b i n d i n g s i t e s (BS) o f
r e c e p t o r s t r u c t u r e s u s i n g t h e rDock p a c k a g e .
138 ## The g l o b a l d o c k i n g s c o r e f u n c t i o n s e r v e s as a f i t n e s s measure
.
139 ## Don t f o r g e t t o f i r s t c o m p i l e and s e t up rDock and Open B a b e l
correctly .
140
def dock ( p r o t e i n ) :
141
print Docking s t a r t e d f o r p r o t e i n + p r o t e i n . id + .
142
b e s t S c o r e = None
143
b e s t B i n d i n g S i t e = None
144
145
146
f o r b i n d i n g S i t e in p r o t e i n . b i n d i n g S i t e s :
147
## F i r s t w r i t e s y s t e m f i l e .
148
149
150
MAPPER\nSITE MAPPER
RbtSphereSiteMapper \nCENTER
+ b i n d i n g S i t e + \nRADIUS 1 5 . 0 \
151
## Generate c a v i t y f o r d o c k i n g .
153
154
## Perform d o c k i n g .
155
o s . system ( rbdock i l i g a n d . sd o o u t p u t + p r o t e i n . id
+ r rDockSystem + p r o t e i n . id +
156
157
## Return d o c k i n g s c o r e .
158
with open ( o u t p u t + p r o t e i n . id + . sd , r ) a s
dockingResult :
159
160
161
f o r l i n e in d o c k i n g R e s u l t :
i f l i n e == >
<SCORE>\n :
d o c k i n g S c o r e = f l o a t ( d o c k i n g R e s u l t . next ( ) .
strip () )
162
i f b e s t S c o r e i s None or d o c k i n g S c o r e < b e s t S c o r e :
163
bestScore = dockingScore
164
bestBindingSite = bindingSite
165
166
167
## Remove temporary f i l e s
168
o s . system ( rm + p r o t e i n . id + . mol2 o u t p u t + p r o t e i n . id
+ . sd rDockSystem + p r o t e i n . id +
169
c a v 1 . grd rDockSystem + p r o t e i n . id + . a s
rDockSystem + p r o t e i n . id + . prm )
170
print Docking f i n i s h e d f o r p r o t e i n + p r o t e i n . id + .
171
return p r o t e i n
172
## Reports best individual from population.
def report_best(population):
    currentBest = None
    for subpop in population:
        for protein in subpop:
            # Compare scores; the original listing compared protein.score
            # with the Protein object itself.
            if currentBest is None or protein.score < currentBest.score:
                currentBest = protein
    return currentBest

## Evaluates subpopulations and returns list of best candidates.
def get_candidates_list(population):
    candidatesList = []
    for subpop in population:
        currentBest = None
        for protein in subpop:
            if currentBest is None or protein.score < currentBest.score:
                currentBest = protein
        candidatesList.append(currentBest)
    return candidatesList
192
193 ## R o u l e t t e Wheel S e l e c t i o n o p e r a t o r
194
195
newPopulation = [ ]
196
f o r i in range ( 0 , len ( p o p u l a t i o n ) ) :
197
totalScore = 0
198
f o r p r o t e i n in p o p u l a t i o n :
199
t o t a l S c o r e += p r o t e i n . s c o r e
200
201
index = 0
202
c u r r e n t T o t a l = p o p u l a t i o n [ i n d e x ] . s c o r e
203
204
i n d e x += 1
205
c u r r e n t T o t a l += p o p u l a t i o n [ i n d e x ] . s c o r e
206
i f r e p l a c e == True :
207
newId = p o p u l a t i o n [ i n d e x ] . id [ 0 : 4 ] + s e l e c t e d +
s t r ( i +1)
208
209
p o p u l a t i o n [ i n d e x ] . chain
, population [ index ] .
bindingSites ) )
210
211
212
i f r e p l a c e == F a l s e :
213
214
newPopulation . append ( p o p u l a t i o n [ i n d e x ] )
i f r e p l a c e == True :
215
## Clean up PDB f o l d e r .
216
f o r p r o t e i n in p o p u l a t i o n :
217
218
print R o u l e t t e Wheel S e l e c t i o n s u c c e e d e d .
219
return newPopulation
220
221 ## TwoP o i n t s C r o s s o v e r o p e r a t o r
222
223
def t w o p o i n t s c r o s s o v e r ( p o p u l a t i o n , pctX ) :
i f not len ( p o p u l a t i o n ) > 1 :
224
p o p u l a t i o n [ 0 ] . s e t p a r e n t s ( protA=p o p u l a t i o n [ 0 ] )
225
p o p u l a t i o n [ 0 ] . s e t i d ( p o p u l a t i o n [ 0 ] . id + recombined )
226
print No c r o s s o v e r p o s s i b l e .
227
else :
228
random . s h u f f l e ( p o p u l a t i o n )
229
f o r index , p r o t e i n in enumerate ( p o p u l a t i o n ) :
230
231
232
233
c r o s s o v e r P o i n t 1 = random . r a n d i n t ( 0 ,
minLength )
234
c r o s s o v e r P o i n t 2 = random . r a n d i n t (
c r o s s o v e r P o i n t 1 , minLength )
235
p r o t e i n . s e t p a r e n t s ( protA=p r o t e i n , protB=
population [ index + 1 ] ,
236
c r o s s o v e r P o i n t 1=
crossoverPoint1 ,
c r o s s o v e r P o i n t 2=
crossoverPoint2 )
237
238
seq [ crossoverPoint1 :
crossoverPoint2 ] +
protein . seq [
crossoverPoint2 : ] )
239
240
241
p r o t e i n . s e t i d ( p r o t e i n . id + recombined )
else :
minLength = min ( [ len ( p r o t e i n . s e q ) , len (
p o p u l a t i o n [ index 1 ] . p a r e n t s [ 0 ] [ 1 ] ) ] )
242
c r o s s o v e r P o i n t 1 = random . r a n d i n t ( 0 ,
minLength )
243
c r o s s o v e r P o i n t 2 = random . r a n d i n t (
c r o s s o v e r P o i n t 1 , minLength )
244
p r o t e i n . s e t p a r e n t s ( protA=p r o t e i n , protB=
population [ index 1 ] ,
245
c r o s s o v e r P o i n t 1=
crossoverPoint1 ,
c r o s s o v e r P o i n t 2=
crossoverPoint2 )
246
247
parents [ 0 ] [ 1 ] [
crossoverPoint1 :
crossoverPoint2 ] +
248
protein . seq [
crossoverPoint2 : ] )
249
p r o t e i n . s e t i d ( p r o t e i n . id + recombined )
250
print C r o s s o v e r f i n i s h e d .
251
else :
252
p r o t e i n . s e t p a r e n t s ( protA=p r o t e i n )
253
p r o t e i n . s e t i d ( p r o t e i n . id + recombined )
254
print No c r o s s o v e r .
255
256 ## S i n g l e P o i n t C r o s s o v e r o p e r a t o r
257
258
def s i n g l e p o i n t c r o s s o v e r ( p o p u l a t i o n , pctX ) :
i f not len ( p o p u l a t i o n ) > 1 :
259
p o p u l a t i o n [ 0 ] . s e t p a r e n t s ( protA=p o p u l a t i o n [ 0 ] )
260
p o p u l a t i o n [ 0 ] . s e t i d ( p o p u l a t i o n [ 0 ] . id + recombined )
261
print No c r o s s o v e r p o s s i b l e .
262
else :
263
random . s h u f f l e ( p o p u l a t i o n )
264
f o r index , p r o t e i n in enumerate ( p o p u l a t i o n ) :
265
266
267
268
c r o s s o v e r P o i n t = random . r a n d i n t ( 0 , minLength
)
269
p r o t e i n . s e t p a r e n t s ( protA=p r o t e i n , protB=
p o p u l a t i o n [ i n d e x + 1 ] , c r o s s o v e r P o i n t 1=
crossoverPoint )
270
271
272
p r o t e i n . s e t i d ( p r o t e i n . id + recombined )
else :
273
274
c r o s s o v e r P o i n t = random . r a n d i n t ( 0 , minLength
)
275
p r o t e i n . s e t p a r e n t s ( protA=p r o t e i n , protB=
p o p u l a t i o n [ i n d e x 1 ] , c r o s s o v e r P o i n t 1=
crossoverPoint )
276
277
parents [ 0 ] [ 1 ] [
crossoverPoint : ] )
278
279
280
p r o t e i n . s e t i d ( p r o t e i n . id + recombined )
print C r o s s o v e r f i n i s h e d .
else :
281
p r o t e i n . s e t p a r e n t s ( protA=p r o t e i n )
282
p r o t e i n . s e t i d ( p r o t e i n . id + recombined )
283
print No c r o s s o v e r .
284
285 ## P o i n t m u t a t i o n o p e r a t o r
286
def p o i n t m u t a t i o n ( p r o t e i n , pctM ) :
287
mutable = l i s t ( p r o t e i n . s e q )
288
p o i n t M u t a t i o n s = {}
289
290
291
292
i f p r o t e i n . s e q [ i n d e x ] != mutable [ i n d e x ] :
293
p o i n t M u t a t i o n s [ i n d e x +1] = p r o t e i n . s e q [ i n d e x ] +
mutable [ i n d e x ]
294
p r o t e i n . u p d a t e s e q ( . j o i n ( mutable ) )
295
296
print p r o t e i n + p r o t e i n . id + mutated .
297
return p r o t e i n
298
299 ## Writes PDB s t r u c t u r e f i l e w i t h or w i t h o u t l i g a n d .
300
301
302
303
304
i f ligand :
f o r l i n e in pdbOrig :
i f not r e . match ( HETATM\ s . ? \ s . ? \ s + l i g a n d ,
305
line ) :
p d b F i l e . w r i t e ( l i n e + \n )
306
307
308
309
else :
f o r l i n e in pdbOrig :
p d b F i l e . w r i t e ( l i n e + \n )
310
311
312
## Writes PIR alignment file.
def write_pir(protein):
    with open(protein.id + ".ali", "w") as pirFile:
        pirFile.write(">P1;" + protein.id + "\n" + "sequence:" +
                      protein.id + ":::::::0.00:0.00\n" + protein.seq +
                      ...)  # [final string literal lost in extraction]
    print "PIR file written."

## Writes alignment file.
def write_aln(protein):
    with open(protein.id + "total.ali", "w") as alnFile:
        for parent in protein.parents:
            alnFile.write(">P1;" + parent[0] + "\n" + "structure:" +
                          parent[0] + ":.:.:.:.::::\n" +
                          parent[1] + "\n")
        alnFile.write(">P1;" + protein.id + "\n" + "sequence:" +
                      protein.id + "::::::::\n" +
                      protein.seq + "\n")
    print "Alignment file written."
328
## Returns list of templates.

def get_templates(prf, cutoff, amount):
    prf.write(file='build_profile.prf', profile_format='TEXT')
    templates = []
    identity = 95
    with open('build_profile.prf', 'r') as profile:
        for line in profile:
            if len(templates) < 2 and re.match([...], line) \
                    and int(line.split()[10].strip('.')) >= identity \
                    and line.split()[1][0:5] not in templates:
                templates.append(line.split()[1][0:5])
                identity = 10
    if len(templates) == amount:
        print 'Templates gathered.'
        return templates
    else:
        sys.exit('Not enough templates.')

## Fixes missing residues in PDB file.

def fix_pdb(pdb):
    env = modeller.environ()
    env.libs.topology.read('${LIB}/top_heav.lib')
    [...]
    m = modeller.scripts.complete_pdb(env,
                                      [...])
    [...]
    print 'PDB fixed.'

## Adds chain identifier to lines in PDB.

def add_chain_id(pdb, chain):
    fixedLines = []
    [...]
        for line in pdbFile:
            if line[0:4] == 'ATOM':
                mutable = list(line)
                mutable[21] = chain
                fixedLines.append(''.join(mutable))
            else:
                fixedLines.append(line)
    with open('pdb/' + pdb + '.pdb', 'w') as pdbFile:
        for line in fixedLines:
            pdbFile.write(line)

## Predicts protein structures with homology modeling (classical way, not used).

def predict_structure_homology(protein):
    ## Select templates from database.
    env = modeller.environ()
    env.io.atom_files_directory = './pdb'
    sdb = modeller.sequence_db(env)
    [...]
    aln = modeller.alignment(env)
    [...]
    prf = aln.to_profile()
    [...]
              check_profile=False, max_aln_evalue=0.01)
    [...]

    ## Align templates.
    aln = modeller.alignment(env)
    for template in templates:
        [...]
    for (weights, write_fit, whole) in (((1., 0., 0., 0., 1., 0.), False, True),
                                        ((1., 0.5, 1., 1., 1., 0.), False, True),
                                        ((1., 1., 1., 1., 1., 0.), True, False)):
        aln.salign(rms_cutoff=3.5, normalize_pp_scores=False,
                   [...]
                   gap_penalties_1d=(-450, -50),
                   gap_penalties_3d=(0, 3), gap_gap_score=0,
                   gap_residue_score=0,
                   dendrogram_file=protein.id + 'templates.tree',
                   alignment_type='tree',
                   feature_weights=weights,
                   improve_alignment=True, fit=True, write_fit=write_fit,
                   [...])
    aln.write(file=protein.id + 'templates.ali', alignment_format='PIR')

    ## Align sequence to templates.
    env.libs.topology.read(file='$(LIB)/top_heav.lib')
    aln_block = len(aln)
    [...]
    aln.salign(output='', max_gap_length=20,
               gap_function=True,
               alignment_type='PAIRWISE', align_block=aln_block,
               feature_weights=(1., 0., 0., 0., 0., 0.), overhang=0,
               gap_penalties_1d=(-450, 0),
               gap_penalties_2d=(0.35, 1.2, 0.9, 1.2, 0.6, 8.6, 1.2, 0., 0.),
               similarity_flag=True)
    aln.write(file=protein.id + 'total.ali', alignment_format='PIR')
    a = modeller.automodel.automodel(env, alnfile=protein.id + 'total.ali',
                                     knowns=tuple(templates),
                                     sequence=protein.id)
    a.starting_model = 1
    a.ending_model = 1
    a.make()

    ## Refine side chains using rotamer libraries (SCWRL package),
    ## switch hashes to activate. Deactivation preferred.
    # os.system('./scwrl4/Scwrl4 -i ' + protein.id + '.B99990001.pdb -o final' + protein.id + '.pdb')

    ## Replace previous PDB with model.
    [...]
    fix_pdb(protein.id)

    ## Remove temporary files.
    os.system('rm ' + protein.id + '*')

    ## Modeller does not add chain ids by itself.
    add_chain_id(protein.id, protein.chain)
    return protein

## Predicts protein structures with homology modeling using
## intermediate models (manual alignment, faster).

def predict_structure_homology_fast(protein):
    if len(protein.parents) > 1:
        [...]
        env = modeller.environ()
        env.io.atom_files_directory = './pdb'
        templates = []
        for parent in protein.parents:
            templates.append(parent[0])
        a = modeller.automodel.automodel(env, alnfile=protein.id + 'total.ali',
                                         knowns=tuple(templates),
                                         sequence=protein.id)
        a.starting_model = 1
        a.ending_model = 1
        a.make()

        ## Refine side chains using rotamer libraries (SCWRL package),
        ## switch hashes to activate. Deactivation preferred.
        # os.system('./scwrl4/Scwrl4 -i ' + protein.id + '.B99990001.pdb -o final' + protein.id + '.pdb')

        ## Replace previous PDB with model.
        [...]
        fix_pdb(protein.id)

        ## Remove temporary files.
        os.system('rm ' + protein.id + '*')

        ## Modeller does not add chain ids by itself.
        add_chain_id(protein.id, protein.chain)
        return protein
    else:
        os.system('cp pdb/' + protein.parents[0][0] + '.pdb pdb/' +
                  protein.id + '.pdb')
        return protein

## Models point mutations using the Rosetta Backrub application.

def predict_structure_backrub(protein):
    parser = PDBParser()
    structure = parser.get_structure(protein.id[0:4],
                                     'pdb/' + protein.id[0:4] + '.pdb')
    atomList = []
    for atom in structure.get_atoms():
        atomList.append(atom)
    nbs = NeighborSearch(atomList)
    for mutation in protein.pointMutations:
        with open('resfile' + protein.id, 'w') as resfile:
            resfile.write('NATRO\nstart\n')
            affectedList = []
            residue = structure.get_chains().next()[mutation]
            for atom in Selection.unfold_entities(residue, 'A'):
                for neighbor in nbs.search(atom.get_coord(), 6, level='R'):
                    affectedList.append(get_resi(neighbor))
            resfile.write(str(mutation) + ' ' + protein.chain + ' PIKAA ' +
                          protein.pointMutations[mutation][1] + '\n')
            ## Make affectedList unique.
            affectedList = list(set(affectedList))
            pivotList = []
            for res in affectedList:
                if res != mutation:
                    resfile.write(str(res) + ' ' + protein.chain + ' NATAA\n')
                pivotList.append(res)
                if res != 1:
                    pivotList.append(res - 1)
                if res != len(protein.seq):
                    pivotList.append(res + 1)
            pivotList = list(set(pivotList))

        ## Run backrub sampling with 10,000 Monte Carlo iterations.
        command = '[...] -nstruct 1 -backrub:ntrials ' + str(10000) + \
                  ' -resfile ' + 'resfile' + protein.id + \
                  ' -pivot_residues '
        for pivot in pivotList:
            command += str(pivot) + ' '
        os.system(command)

        ## Refine side chains using rotamer libraries (SCWRL package),
        ## switch hashes to activate. Deactivation preferred.
        # os.system('./scwrl4/Scwrl4 -i ' + protein.id + '_0001.pdb -o pdb/' + protein.id + '.pdb')
        [...]

        ## Remove temporary files.
        os.system('rm ' + protein.id + '* resfile' + protein.id)
        [...]
        fix_pdb(protein.id)
    print 'Protein ' + protein.id + ' modeled using RosettaBackrub and SCWRL4.'
    return protein

def get_unique_ids(number, motherSelection):
    list = []
    [...]
        id = ''
        for digit in range(0, 4):
            id += str(random.randint(0, 9))
        if id not in list + [protein.id for protein in motherSelection]:
            list.append(id)
    return list

def prepare_next_gen(population, motherSelection, subpopSize, pctM, generation):
    newPopulation = []
    ids = get_unique_ids(len(motherSelection), motherSelection)
    for index, protein in enumerate(motherSelection):
        newPopulation.append(create_subpopulation(
            Protein(ids[index], protein.seq, protein.chain,
                    [protein.bestBindingSite]),
            subpopSize, pctM))
    [...]
    with open('report', 'a') as report:
        report.write(bestProtein.id + '\t' + str(bestProtein.bindingSites) +
                     '\t' + bestProtein.bestBindingSite +
                     '\t' + str(bestProtein.score) + '\n')
    for subpop in population:
        for protein in subpop:
            os.system('rm pdb/' + protein.id[0:14] + '*')
    return newPopulation
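
The point mutation operator in the listing above stores each substitution under its 1-based sequence position as a two-letter string (original residue followed by the new residue). The listing does not show how positions are sampled, so the following is only a minimal Python 3 sketch of such an operator. It assumes that `pctM` is the percentage of positions to mutate and that replacements are drawn uniformly from the 20 standard amino acids; the function name `mutate_sequence` and the `rng` parameter are hypothetical and not part of FIGARO.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def mutate_sequence(seq, pct_m, rng=None):
    """Return (mutated sequence, {1-based position: original + new residue}).

    Assumption: pct_m is the percentage of positions to substitute.
    """
    rng = rng or random.Random()
    mutable = list(seq)
    n_mutations = int(round(len(seq) * pct_m / 100.0))
    for index in rng.sample(range(len(seq)), n_mutations):
        # Draw a replacement that differs from the current residue.
        choices = [aa for aa in AMINO_ACIDS if aa != mutable[index]]
        mutable[index] = rng.choice(choices)
    # Record substitutions in the same layout as the listing's dictionary.
    point_mutations = {}
    for index, (old, new) in enumerate(zip(seq, mutable)):
        if old != new:
            point_mutations[index + 1] = old + new
    return "".join(mutable), point_mutations
```

The second character of each dictionary value is the residue that a Rosetta resfile entry of the form `<position> <chain> PIKAA <residue>` would request, which is how the backrub routine above consumes it.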
B.3 protein.py
Code Snippet B.3: Complete source code: protein.py.

########################################
#### FIGARO ####
[...]
#### Student nr: r0307453 ####
[...]
import re

class Protein():

    def __init__(self, id, seq, chain, bindingSites):
        self.id = id
        self.seq = seq
        self.chain = chain
        self.bindingSites = bindingSites
        self.score = None
        self.parents = []
        self.crossoverPoints = []
        self.bestBindingSite = None
        self.pointMutations = {}

    def set_score(self, score):
        self.score = score

    [...]
        if crossoverPoint1:
            self.crossoverPoints = [crossoverPoint1]
        if crossoverPoint2:
            self.crossoverPoints.append(crossoverPoint2)

    def set_best_binding_site(self, bindingSite):
        self.bestBindingSite = bindingSite

    def set_id(self, id):
        self.id = id

    def update_seq(self, seq):
        self.seq = seq

    def set_point_mutations(self, pointMutations):
        self.pointMutations = pointMutations
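
To illustrate how the attributes of `Protein` feed the alignment files used for modeling, the following minimal Python 3 sketch builds the PIR text for a child protein and its parents, following the record layout of the `write_aln` function listed earlier in this appendix. It assumes that each entry of `parents` is an `(id, sequence)` pair, as the indexing `parent[0]` and `parent[1]` suggests. The `Protein` stand-in below and the helper `pir_alignment` are hypothetical and carry only the fields needed here; the identifiers and sequences are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Protein:
    """Hypothetical stand-in holding only the fields used below."""
    id: str
    seq: str
    parents: list = field(default_factory=list)  # assumed (id, sequence) pairs


def pir_alignment(protein):
    """Build PIR text: one 'structure' record per parent, then the target."""
    records = []
    for parent_id, parent_seq in protein.parents:
        # Template record; every PIR sequence is terminated by '*'.
        records.append(">P1;%s\nstructure:%s:.:.:.:.::::\n%s*\n"
                       % (parent_id, parent_id, parent_seq))
    # Target record for the sequence to be modeled.
    records.append(">P1;%s\nsequence:%s::::::::\n%s*\n"
                   % (protein.id, protein.id, protein.seq))
    return "".join(records)


child = Protein("0042", "MKTAYIAK",
                parents=[("1abc", "MKTAYLAK"), ("2xyz", "MRTAYIAK")])
print(pir_alignment(child))
```

Because parents and child are written with equal-length sequences and no gap characters, the file doubles as a trivial alignment, which is presumably what makes the manual-alignment variant faster than the profile-based one.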
Appendix C
End summary
In this work we propose a fully automated and less biased evolutionary strategy to design and optimize a binding site for random substrate molecules that have no known natural binding pocket. Based on an efficient genetic algorithm dubbed FIGARO (a Fast and Interpopulational Genetic Algorithm for Receptor Optimization), we show that this strategy is a promising new approach to finding protein structures that can serve as starting points for artificial enzyme design, with the aim of speeding up reactions that would otherwise be too slow to have practical relevance. However, the most foreseeable use case, and also the one that particularly captures our imagination, is without a doubt its application in cancer immunotherapy. FIGARO offers a new bridge between very specific, even artificially designed chemical structures and proteins that can vary widely in function. Unique compounds on the surface of cancer cells could serve as input structures for FIGARO to target. Given the 50-year-old problem of protein structure prediction and the many advances already achieved in this field, the availability of enormous distributed computing resources and the proven usability of very sophisticated evolutionary algorithms, FIGARO has clear potential for future development.