
FIGARO - a Fast and Interpopulational Genetic Algorithm for Receptor Optimization

FIGARO - A new evolutionary strategy for receptor optimization

Dissertation presented in
Fulfillment of the requirements
for the degree of Master of Science in
Bioinformatics

Promotors:
Prof. Bart De Moor
Prof. Yves Moreau
ESAT - STADIUS, Stadius Centre for Dynamical Systems,
Signal Processing and Data Analytics
January 2017

Pieter Noyens

"This dissertation is part of the examination and has not been corrected after defence for
eventual errors. Use as a reference is permitted subject to written approval of the promotor
stated on the front page."

pieter.noyens@student.kuleuven.be
January 2017

Foreword
More than any other project I encountered during my studies, this work
taught me what it's like to work on something really big and how to organize
the process from end to end. Key to this is trying to keep a healthy pace in
the development process without closing your eyes to upcoming problems and
concerns. At first, I did not manage to find this balance. While I was very
happy to be able to come up with an idea myself, I also found myself floating
in a vast space of naive beginner's optimism, which was soon followed by a
darker feeling of uncertainty and decreased confidence in success and in
myself in general. Looking back on that time now, I never thought this would
have such a big impact on me. But whenever I met with my daily supervisor
Dusan Popovic, he told me that he was deeply surprised by what I had been
able to implement in the time that had passed. For this I would like to thank
him in particular, as he created a whole new wave of courage by just speaking
these few words. So here you have it. Thank you, Dusan. I hope you will
continue to motivate other people around you and I wish there were more
people doing the same thing. It's often not the ability to do things but the
motivation that stands in the way of progress.
Getting back on track took a while, as there were other unexpected things
happening in my life and that of my closest relatives. But no matter what,
the amount of help and support was incredible. I would like to thank my
great friends and family for being there when I needed them the most. Next
to those days, I can also look back with joy on the times we worked together
in the library. Those moments meant a lot to me and are the reason I finally
figured out that work and play can ultimately be combined. We called
ourselves The Colleagues and found help in each other during the year
and the examination periods. Marlies, Birger, Stijn, Dani, Bram, Daan and
others, thank you for that!
2016 was also the year I met Eva. On the other side of the world, amid
the heights of the Argentine mountains, she walked into my life. To this
day, I'm honored to experience the positivity that radiates from her at all
times, which is a major drive in my work.
As a last word of appreciation I want to thank my promotor Bart De
Moor and co-promotor Yves Moreau. They showed the flexibility to let me work
out a strategy on my own and gave me the opportunity to implement it in all
freedom. This would later turn out to be the most educational experience
I've had in my life. I would fall, but also rise again. This is the work that I
am proud to present to you.

Abstract
With a better understanding of biochemical processes, recent advances in
artificial intelligence and the availability of rapidly growing computational
power, the field of biological engineering is currently facing an era of
major breakthroughs. Full simulations of biochemical systems have made it
possible to fine-tune metabolic networks to a high degree, which has been
shown to yield highly optimized microbial strains with maximized
production yields of the desired industrial compounds. [1, 2] An important
domain of research in this field is the development of new artificial enzymes
to speed up non-native reactions in this process; even though nature has found
a wide range of extremely efficient biomolecular catalytic machines, not all
industrially relevant reactions have a known natural catalyst to increase the
flux of the chemical process. Several attempts to develop de novo artificial
enzymes for non-native reactions have already been made in previous
studies, with a functional Kemp-eliminase and retro-aldolase as the most
representative examples of success. [3, 4] For now, however, these techniques
have been largely driven by intuition based on known biochemical reactions
and active site coordination. At best, these enzymes are comparable with
catalytic antibodies regarding catalytic performance, while natural enzymes
surpass them by several orders of magnitude. [5, 6, 7, 8] The level of their
success therefore remains a subject of discussion. Many of the factors
contributing to enzyme catalytic activity are indeed yet to be discovered, which
makes this kind of biased rational design unlikely to yield enzymes of
competing activity in the near future.


Another approach that has been applied so far is directed evolution,
which succeeds in eliminating bias to a large extent but requires a lot of
practical resources and lab hours. [9, 10] This iterative technique consists of
routinely evaluating the effects of mutations introduced into known protein
sequences and eventually selecting the best mutants for further optimization,
but at the moment none or only part of this process is carried out
computationally. Therefore, this method is often combined with rational design
in order to speed up the process, but this again increases bias.
In this work we propose a fully automated and less biased evolutionary
strategy to design and optimize a binding site for arbitrary substrate
molecules without a known natural binding pocket. Based on an efficient
genetic algorithm dubbed FIGARO (a Fast and Interpopulational Genetic
Algorithm for Receptor Optimization), we show that this strategy can be
a promising new approach to finding valuable protein structures that can
be useful either as starting points in the task of artificial enzyme design, or
as general optimized receptors in important domains of
research like the antibody treatment of cancer.

List of Code Snippets

3.1 Program parameters
3.2 Application backbone
3.3 Distributed computing support
3.4 PDB mining implementation
3.5 Response XML
3.6 Binding site detection
3.7 The Protein class
3.8 The mutation operator
3.9 Backrub modeling implementation
3.10 Ligand docking implementation
3.11 The crossover operator
3.12 Homology modeling implementation
3.13 The selection operator
B.1 main.py
B.2 functions.py
B.3 protein.py

List of Figures

1.1 Graphical presentation of torsion angles
1.2 Ramachandran plot example
1.3 Folding funnel representation
1.4 α-helices and β-sheets
1.5 The leucine zipper
1.6 Hemoglobin T and R state
1.7 Lock and key model
1.8 Induced fit model
1.9 Ab initio modeled structures
1.10 Chimera GUI for MODELLER
1.11 The backrub move
1.12 Side chain predictions based on backrub motion
2.1 Genetic algorithm pipeline
2.2 Stochastic Universal Sampling

Contents

Foreword
Abstract

I Literature study

1 Protein structure
1.1 The protein backbone
1.2 Protein folding
1.3 The structure-function relationship
1.4 Binding pockets, enzymes and active sites
1.5 Protein structure prediction
1.5.1 The backrub move
1.6 Molecular docking

2 Genetic algorithms
2.1 Introduction
2.2 Parameters
2.3 Operators
2.4 Methods of selection

II FIGARO

3 Application overview
3.1 General outline
3.2 Mining the PDB
3.3 Screening for binding sites
3.4 The Protein class
3.5 Modeling point mutations
3.6 Ligand docking
3.7 Modeling recombined sequences
3.8 The selection operator

4 Discussion

5 Conclusion

III Appendix

A Bibliography

B Attachments
B.1 main.py
B.2 functions.py
B.3 protein.py

C End summary

Part I

Literature study

Chapter 1

Protein structure

CHAPTER 1. PROTEIN STRUCTURE

Figure 1.1: Graphical presentation of the φ and ψ torsion angles. Included are
Ramachandran plots for several important amino acids.
Source of figure 1.1: picture taken from http://www.ym.edu.tw/jierongh/research_e.html.

1.1 The protein backbone

Protein structure is organized in several hierarchical levels. To arrive
at a final set of biologically relevant capabilities like catalysis, ligand
binding and membrane association, the structural properties of proteins all
have to be fine-tuned to a high degree. This is known as the structure-function
relationship, which will be further discussed later on. It also implies that
altering the structure will likely result in reduction or loss of function. In
short, four levels of hierarchy exist in protein structure,
classified as the primary, secondary, tertiary and quaternary structure. The
primary structure is the most fundamental one and consists of the amino
acid sequence in its most basic, unfolded form. It contains all information
needed for further folding processes (known as Anfinsen's dogma [11]) and
is the backbone of every protein. The 20 amino acids each have specific
physicochemical properties, which accounts for an enormous range of
possibilities regarding final sequence composition and protein conformation.
This final conformation is based on both strong and weak interaction forces
between the reactive groups of the individual amino acids. [12] Strong
covalent bonds can be found between cysteine residues, where oxidation of
the thiol groups can result in cystine bonds between two sulfur atoms, also
known as disulfide bridges. The weaker interactions are mainly van der
Waals forces, electrostatic interactions, hydrophobic interactions and
hydrogen bonds. Together, these factors eventually lead to the final structure
of the polypeptide chain, which can be reached at any of these
levels. Keratin, for example, has a final structure at the secondary level.
It is therefore called a fibrous protein and is extended in length, serving
a structural role in biological systems. Some proteins even lack any stable
form of secondary structure. Globular proteins, however, will have
tertiary or quaternary structure and can be catalytically active. More
information about these different levels of structure is given in the
following sections.

Figure 1.2: Ramachandran plot based on about 100,000 data points for
general amino acids (glycine, proline and pre-proline excluded).
Source of figure 1.2: picture taken from [13].

A protein conformation can be described by a set of torsion angles of
certain bonds in the protein backbone. For every amino acid, two torsion
angles are essential to describe the relative geometrical position in the chain.
These are called the phi (φ, around the bond between the N and Cα atoms of
every residue) and psi (ψ, around the bond between Cα and C) dihedral angles
and are visually represented in figure 1.1. Interresidue attractions and
repulsions force rotation around the single bonds between stable planar amide
units. These amide or peptide bonds have partial π-bonding character and are
therefore geometrically fixed. [12] All empirically observed pairs of φ and ψ
angles can be visualized in a two-dimensional plot, which is called a
Ramachandran plot or diagram. An example is given in figure 1.2. Three main
regions are clearly distinguishable. As is discussed later, these regions
include the conformational tendencies to form what are called α-helices (lower
left quadrant) and β-strands (upper left quadrant). [13] Clearly, many
combinations are never observed in nature due to unfavorable steric hindrance
between residues. A notable exception is glycine, which is the smallest amino
acid. With a side chain of only one hydrogen atom, glycine is the most flexible
amino acid and can take on many conformations in a polypeptide chain,
causing the protein to be more dynamic and adaptable. The Ramachandran
plot for glycine shown in figure 1.1 confirms this. This flexibility
is an essential property in enzyme-mediated catalysis, as induced fit and
other conformational changes during transition states are indispensable for
functionally active proteins.
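
The φ and ψ angles described above are ordinary dihedral angles, so they can be computed from the coordinates of four consecutive backbone atoms. The following is a minimal sketch in plain Python, not the implementation used in this work. The function names `dihedral` and `ramachandran_quadrant` are illustrative assumptions, and the quadrant labels are only a crude simplification of the regions visible in a Ramachandran plot.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four points.

    For phi of residue i the points are C(i-1), N(i), CA(i), C(i);
    for psi they are N(i), CA(i), C(i), N(i+1).
    """
    b0 = _sub(p1, p0)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    n1 = _cross(b0, b1)  # normal of the plane through p0, p1, p2
    n2 = _cross(b1, b2)  # normal of the plane through p1, p2, p3
    x = _dot(n1, n2)
    y = math.sqrt(_dot(b1, b1)) * _dot(b0, n2)
    return math.degrees(math.atan2(y, x))

def ramachandran_quadrant(phi, psi):
    """Crude quadrant label (assumption: quadrants stand in for real regions)."""
    if phi < 0 and psi < 0:
        return "alpha-helix region (lower left)"
    if phi < 0 and psi >= 0:
        return "beta-strand region (upper left)"
    if phi >= 0 and psi >= 0:
        return "left-handed helix region (upper right)"
    return "sparsely populated region (lower right)"
```

Real Ramachandran regions are irregular areas rather than whole quadrants, so a serious classifier would use empirically derived contours such as those underlying figure 1.2.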

1.2 Protein folding

The central dogma of molecular biology states that all genetic information
is contained within the DNA or the genome of an organism, which can be
either replicated by DNA polymerase or transcribed by RNA polymerase
to form rRNA, tRNA or mRNA transcripts. [14] The latter will eventually
be translated by ribosomes to form polypeptide chains, which in turn
give rise to functional proteins. All biological systems, both eukaryotic and
prokaryotic, are based on this principle. While mRNA is being translated
and amino acid building blocks are chained together by the ribosome, inter-
and intraresidue attractions and repulsions force the nascent polypeptide
chain to adopt a certain conformation at the N-terminus and folding
starts. [15] Initial research by Christian Anfinsen on RNase A, or bovine
pancreatic ribonuclease, showed that this protein did indeed fold
spontaneously into a three-dimensional conformation. Anfinsen later concluded
that this three-dimensional structure can be lost when the environmental
conditions are altered, for example when the pH or temperature is
increased or salt is added. This process is called denaturation. After
restoring the initial conditions, Anfinsen observed that RNase A would
spontaneously fold back into its native conformation. This, however, was later
found to be true in only a minority of cases. [16, 17] Spontaneous folding only
occurs under optimal conditions during and after biosynthesis. In the crowded
intracellular environment, other proteins are therefore needed to prevent
aggregation and misfolding. These proteins are called chaperones and assist
in the folding process. Heat-shock proteins are a well-known example and
prevent misfolding of proteins due to increased temperature.
Figure 1.3: Graphical representation of the folding funnel theorem.

The Levinthal paradox states that if proteins were to fold by random
sampling of all possible conformations, it would take about the lifetime
of the universe for a single protein to attain its native fold. As real-life
folding takes a few microseconds to a few milliseconds, this clearly
cannot be the case and folding has to be guided along a certain pathway or
multiple possible pathways. [18] This hypothesis is based on the fact,
established by research in polymer thermodynamics, that more compact protein
conformations have fewer sampling possibilities on their potential
energy surface. Each step in the folding process yields a more compact
protein structure with less potential energy, thus reducing the search space to
be sampled. This can be represented by a folding funnel, shown in figure 1.3,
in which the native state is reached along several routes going through
intermediate semi-stable states along the funnel. [19] Eventually a unique
native conformation can be reached efficiently this way. Moreover, it is
assumed that folding is further accelerated by parallelizing the process. [20]
Secondary structures are first formed locally and later associate to form
a more global conformation. However, the exact mechanism of folding remains
to be unraveled, as there is no generally applicable theory that explains why
some proteins fold faster than others and take different routes to their native
conformation.
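
The scale of the Levinthal paradox can be illustrated with a back-of-the-envelope calculation. The sketch below is only an illustration: the chain length, the number of conformational states per residue and the sampling rate are assumed round numbers commonly used for this kind of estimate, not values taken from this work, and `levinthal_search_time_years` is a hypothetical helper name.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600
AGE_OF_UNIVERSE_YEARS = 1.38e10  # approximate value, for comparison only

def levinthal_search_time_years(n_residues=100,
                                states_per_residue=3,
                                samples_per_second=1e13):
    """Time needed to visit every conformation once by exhaustive sampling.

    All default values are illustrative assumptions.
    """
    conformations = states_per_residue ** n_residues
    seconds = conformations / samples_per_second
    return conformations, seconds / SECONDS_PER_YEAR

conformations, years = levinthal_search_time_years()
print("conformations to sample: %.2e" % conformations)
print("exhaustive search time:  %.2e years" % years)
print("ratio to age of universe: %.2e" % (years / AGE_OF_UNIVERSE_YEARS))
```

Under these assumptions the exhaustive search time exceeds the age of the universe by many orders of magnitude, which is why a guided search along a funnel-shaped energy landscape is the more plausible picture.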
Source of figure 1.3: picture taken from http://science.sciencemag.org/content/338/6110/1042/F3.

What we do know is that folding is predominantly driven by a small
set of physical interactions between the amino acids and their side chains and
by constraints on the dihedral torsion angles of the protein backbone.
Constraints also exist on the geometrical orientation of the atoms
in the side chains of the amino acids. Single bonds are in general freely
rotatable, but small differences in potential energy between configurations
cause energetically more favorable rotamers to be preferred in final protein
structures. Some combinations are even completely excluded from the set of
possible rotamers, resulting in a rotamer library, which is frequently used in
computational methods to predict three-dimensional protein structure. [15]
As discussed earlier, protein structure can be described at four different
levels. The primary structure in essence just represents the amino
acid sequence of the protein, held together by peptide bonds that allow
for the φ and ψ torsion angles discussed above. We therefore won't
go into further chemical detail about this structural level, as we assume the
reader to be familiar with the basic organic chemistry behind macromolecular
structures.

Figure 1.4: α-helices (in green) and β-sheets (in red).
Source of figure 1.4: picture taken from http://oregonstate.edu/instruction/bi314/fall12/cellchemistrycontd.html.

The main driving forces causing secondary structure elements to form
are hydrogen bonds between atoms in the protein backbone. These occur
between a hydrogen atom bound to an electronegative nitrogen atom and a
double-bonded oxygen atom at another site in the backbone. In 1951, Linus
Pauling already predicted that this gives rise to two main structural patterns:
α-helices and β-sheets. [21] These structural units are visualized in figure
1.4. As discussed earlier, these patterns can also be inferred from the
Ramachandran plots of thousands of proteins. The third group visible in
these plots (upper right quadrant in figure 1.2) represents the left-handed
α-helices. This is a rather small group though; generally speaking,
left-handed α-helices only occur in nature as short loop regions or single-turn
α-helices. [22]
Once a polypeptide chain is completely folded into a globular protein
unit, tertiary structure is reached. This structural level is most common
for enzymes and is the only level treated in this thesis. Separate units
of tertiary structure can function on their own, but are also seen to
associate with each other, serving as identical or differing subunits in a
multimeric complex. This final complex then constitutes the functional and
biologically relevant protein. In this work, however, we restrict ourselves
to single polypeptide chains for simplicity's sake.
When visualizing the three-dimensional structure of folded proteins,
secondary structure elements are often emphasized by special illustrative shapes
and colors like arrows, spirals and cylinders, also known as the
cartoon representation of proteins. This makes it possible to get better insight
into a protein's general composition and biological role. Indeed, plotting the
absolute coordinates of all atoms should in theory be enough to deduce the
general shape of the protein, but in this low-level representation
interesting higher-level motifs are not directly revealed to the biomolecular
analyst. This is important because a wide range of short motifs can be
associated with specific biological roles. An example is the well-known leucine
zipper, which is shown in figure 1.5 and has strong DNA-binding characteristics.
A dipole moment exists along the length of the α-helices, which is caused
by the accumulation of the individual dipoles of every amide bond in the helix.
This results in a more positively charged N-terminus, while the C-terminus
of the helix becomes more negatively charged. This, in combination with a high
abundance of basic amino acids at the N-terminal side of the helix, which
further increases the positive charge on this side of the chain, gives it a
high affinity for the negatively charged strands of DNA. Finally, hydrophobic
amino acids at the center of the helix make the helices associate with
each other and a zipper-like structure is formed. [23]

Figure 1.5: Leucine zipper structure interacting with a DNA double helix.
Source of figure 1.5: picture taken from https://commons.wikimedia.org/wiki/File:Leucine_zipper.png.

Motifs like the leucine zipper can be combined to form higher-order
functional modules called domains. Domains are independently folding units in a
global protein structure and are often duplicated and exchanged between
genetic sequences, introducing new functionality in other genes. Evolutionary
processes like mutation, crossover and conservation allow protein domains
to be efficiently optimized and transported to other places in the genome,
serving as fundamental modular building blocks for protein structure and
function. [15]
In order for scientists to keep a clear overview of all these known
domains and experimentally determined structures, bioinformaticians all over
the world have created several online databases to store this knowledge. The
three main reference databases for protein domain classification are SCOP [24],
CATH [25] and Pfam [26]. While Pfam is a sequence-based database inferring
domain characteristics from Hidden Markov Model (HMM) analysis of
new sequences, SCOP and CATH are both structure-based and
derive their information from experimentally determined protein structures,
which means that they need an underlying database of verified structures
to do their analysis on. In 1971, scientists at Texas A&M University began
a small database and molecular visualization project eventually leading to
what is now known as the Protein Data Bank (PDB), which serves as the sole
reference for protein structure storage and consultation today. [27] The PDB
currently contains more than 100,000 protein structures and also holds
geometrical information about other biological macromolecules like nucleic
acids and protein-nucleic acid complexes. Thankfully, it also offers a
RESTful Web Services API to allow for automated structure mining based on
extensive querying possibilities.
The PDB requires protein structure coordinates to be stored in a specific
file format defined by the wwPDB, a central organization managing
the PDB. [28] In order to maintain a coherent framework for structural and
functional analysis, strict rules have been established for this format, but
unfortunately these are not always respected. Also, not all structural
information in the PDB is of high quality. This is mainly due to differences in
the methods used to infer the atom coordinates. X-ray crystallography is
generally known to deliver better quality structures with higher resolution
than NMR, but even within this category structural quality can differ
significantly. In general, structures with a resolution value smaller than 1 Å
are considered to be of very high quality, while resolution values of 3 Å and
more indicate rather poor quality. Also, it has to be mentioned that membrane
proteins do not tend to crystallize easily, so high quality structures of
these proteins are rather hard to find. All in all, there is no doubt that it
is very important to handle the PDB and its format with care in order to get
reliable results. Thankfully, several software packages like Biopython do a
good job of accounting for these imperfections. [29]
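
To illustrate why the fixed-column PDB format has to be handled with care, the following minimal sketch reads the coordinate records of a PDB file by column position and skips lines that do not respect the layout. It is not the parsing code used in this work: `parse_atom_record` and `read_atoms` are hypothetical helper names, and only a few fields of the ATOM/HETATM record are extracted. In practice a dedicated parser such as the one shipped with Biopython would normally be preferred.

```python
def parse_atom_record(line):
    """Parse one ATOM/HETATM line by fixed column position.

    Returns None for other record types. Raises ValueError or IndexError
    when the line does not follow the fixed-column layout.
    """
    if line[0:6] not in ("ATOM  ", "HETATM"):
        return None
    return {
        "serial": int(line[6:11]),       # columns 7-11
        "name": line[12:16].strip(),     # columns 13-16
        "res_name": line[17:20].strip(), # columns 18-20
        "chain_id": line[21],            # column 22
        "res_seq": int(line[22:26]),     # columns 23-26
        "x": float(line[30:38]),         # columns 31-38
        "y": float(line[38:46]),         # columns 39-46
        "z": float(line[46:54]),         # columns 47-54
    }

def read_atoms(lines):
    """Collect all well-formed atom records, silently skipping malformed ones."""
    atoms = []
    for line in lines:
        try:
            record = parse_atom_record(line)
        except (ValueError, IndexError):
            continue  # the line violates the fixed-column layout
        if record is not None:
            atoms.append(record)
    return atoms
```

Skipping malformed records silently is a deliberate simplification here; a production pipeline would more likely log or count such lines so that low-quality structures can be excluded.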

1.3 The structure-function relationship

A protein is nothing more than its sequence of amino acids and the
biochemical implications that come with it. That is, when one describes a
protein and its biological role, be it an enzyme, membrane or transport
protein, it should always be kept in mind that only its amino acid composition
is responsible for that. As described earlier, Anfinsen's dogma states that
the amino acid sequence contains all information needed for correct folding
inside a cellular environment. As an example, membrane proteins often
fold into two main regions with different physicochemical properties: a
highly hydrophobic and a more hydrophilic region can often be clearly
distinguished. The hydrophobic region is thermodynamically predisposed to
associate with lipid compounds, so the protein will be anchored in the
phospholipid bilayer of the cellular membrane at this site. This property is
what gives the protein its correct localization and makes it function as a
membrane protein. The fact that this protein is located at the cellular
membrane can thus be reduced to the fact of having a fold that separates a
hydrophobic and a hydrophilic region, which in turn is a direct result of its
unique amino acid sequence that has evolved to fold in this way. If we
introduced enough mutations into this sequence to alter this fold dramatically,
it should be clear that its function and correct localization would be lost,
too. [12]
Another well-studied example of the fine-tuned interplay between
structure and function is hemoglobin and its variable structural properties.
Hemoglobin can exist in two states, which are shown in figure 1.6.
When the structure resides in its taut or tense (T) state, it exhibits a
lower affinity for oxygen. However, upon environmental changes like a
decreased CO2 partial pressure and the accompanying pH increase, the
conformation of hemoglobin changes to a more relaxed (R) state, which shows a
much higher affinity for oxygen. When aerobic organisms breathe, CO2 is removed
from the environment, causing hemoglobin to transform into its R state, which
is optimized for oxygen uptake and transport. Also, oxygen binding further
stabilizes the R state, causing more oxygen to be taken up easily. When
hemoglobin enters an environment where the release of oxygen is desired,
the structure switches back to the T conformational state. The TCA cycle in
tissue cells, which takes place in the mitochondria, depends on oxygen to
produce energy and is preceded by glycolysis. Glycolysis in red blood cells
also yields 2,3-BPG as a side product, which binds to hemoglobin and acts as
an allosteric regulator stabilizing the T state. This disfavors oxygen
binding, upon which oxygen is released, taken up by the tissue cells and made
available to the intracellular mitochondria. [30]

Figure 1.6: Hemoglobin structure conformation in its T (left) and R (right)
state. The R state brings two histidine residues close to each other, enabling
a more favoured binding of O2.
Source of figure 1.6: picture taken from http://cbc.arizona.edu/classes/bioc462/462a/NOTES/hemoglobin/hemoglobin_function.htm.

It should be clear that it doesn't end with these two examples. All
proteins need their unique structures to be functional. It has to be mentioned,
though, that this applies at different levels of detail. For example, structural
proteins like keratin, which only have secondary structure, do need a sturdy and
extended conformation for their biological role, but single point mutations
are not likely to have any direct effect on functional properties. By
contrast, enzymes and other proteins with a globular conformation often have
a highly optimized and flexible fold. A single point mutation around the
active site of an enzyme can already cause a much lower affinity for its
substrate, which in some cases can have a disastrous impact on the rate of
catalysis, in turn negatively influencing complete metabolic pathways with
possibly severe complications as a result. In other cases, though, point
mutations like this can in fact be desirable and increase enzymatic activity to
a certain extent.

Figure 1.7: Graphical representation of the lock and key theorem.

Figure 1.8: Graphical representation of the induced fit theorem.
Source of figures 1.7 and 1.8: pictures taken from https://aberdeenc.wordpress.com/tag/induced-fit-hypothesis/ and edited.

1.4 Binding pockets, enzymes and active sites

Proteins can be specialized in all kinds of tasks. A main differentiator
between protein functional classes is the ability to bind smaller molecules,
also called ligands. We have just described a representative of this family,
as hemoglobin can be seen as a receptor protein binding a dioxygen
ligand. As such, transport proteins belong to a higher-level class of proteins
with ligand-binding capabilities. Many membrane proteins and in particular
enzymes share this classification. In addition to possibly having multiple
binding pockets for ligands, enzymes also possess an active site, which can
be situated at any one of the available binding pockets and carries out
the actual catalysis. This particular binding site exhibits high affinity for
the transition state structure of the reaction to be catalyzed. An often
demonstrated but overly simplified model of the enzyme working mechanism is the
lock and key representation, illustrated in figure 1.7. This, however, is an
outdated interpretation of how many enzymes work. According to this model, an
enzyme inherently possesses high affinity for every substrate molecule involved
in the reaction and is rigidly shaped to bind these ligands exclusively. Indeed,
enzymes are known to have high specificity for a certain range of substrates,
like a lock has for its keys. Upon tight binding of these molecules, the
reaction is facilitated by the enzyme and a product for which the enzyme
has much less affinity is released. This theory is nowadays generally
considered to be inaccurate and, instead, an induced fit model is preferred.
This model assumes that initial binding of a substrate molecule triggers a
conformational change in the enzyme, causing the transition state of the
reacting compounds to be thermodynamically favored and stabilized, rather
than the substrates as such. As a result, the activation energy is significantly
lowered and the reaction rate is increased. The main difference with regard
to the lock and key model is that, in this model, the enzyme is a flexible
entity that dynamically adapts to the shape of its substrates and transition
state. A schematic overview of this is given in figure 1.8. [31, 32]
Enzyme dynamics and especially the chemical kinetics of enzyme-mediated catalysis have been a major domain of research over the last century. Two main concepts should be highlighted in this regard. Introduced in 1913 by Leonor Michaelis and Maud Menten, the rate constant kcat and the affinity constant KM (also known as the Michaelis constant) succeeded in characterizing enzyme kinetics based on a few simplifying approximations. [33] In short, the rate constant describes the turnover rate at which an enzyme is able to catalyze the reaction under the assumption of complete saturation of the available active sites. At a fixed enzyme concentration, it thereby defines the maximum speed or Vmax of the catalyzed reaction. The affinity constant defines the substrate concentration at which half of that Vmax is reached. When an enzyme has higher affinity for a certain substrate, this point is reached at a lower substrate concentration. Thus, a lower KM indicates higher enzyme affinity for the substrate. If we want to increase enzyme activity, either of these two parameters can be targeted. An often used value for comparing enzyme efficiency is the ratio of kcat to KM. Optimizing enzyme dynamics and stabilization of the transition state would yield a better kcat value, while optimizing substrate binding in general should lower KM. [33]
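As a small numerical illustration of the two constants, the sketch below evaluates the standard Michaelis-Menten rate law v = kcat [E] [S] / (KM + [S]). The function names and all numerical values are assumptions chosen for illustration only; they are not measurements from this work.

```python
def michaelis_menten_rate(substrate, enzyme_total, kcat, km):
    """Reaction rate v = kcat * [E] * [S] / (KM + [S])."""
    vmax = kcat * enzyme_total  # maximum rate at full saturation
    return vmax * substrate / (km + substrate)


def catalytic_efficiency(kcat, km):
    """Ratio kcat / KM, often used to compare enzymes."""
    return kcat / km


# Hypothetical values, chosen only for illustration.
kcat = 100.0          # turnover number in 1/s
km = 5e-5             # Michaelis constant in mol/L
enzyme_total = 1e-9   # total enzyme concentration in mol/L

vmax = kcat * enzyme_total
rate_at_km = michaelis_menten_rate(km, enzyme_total, kcat, km)
print(rate_at_km / vmax)               # fraction of Vmax reached at [S] = KM
print(catalytic_efficiency(kcat, km))  # in L/(mol*s)
```

At a substrate concentration equal to KM the rate is half of Vmax, and halving KM while keeping kcat fixed raises the rate at any sub-saturating substrate concentration.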
In our project, we will mainly focus on the induced fit binding of a given ligand rather than on enzyme dynamics and transition state stabilization; the latter did not fit within the scope of this master's thesis. A first step in optimizing enzyme catalytic efficiency is to increase affinity for the substrate, which is comparatively straightforward. Transition states can exist in multiple intermediate steps that eventually yield the product, so modeling them is more complex. Later on, however, it should be within reach to extend this work and integrate enzyme mechanics. For example, the Rosetta suite also offers active site design and evaluation functionality, which could be a valuable extension to the fitness function in our genetic algorithm. [34] Moreover, the same docking procedure as applied in this work could be used to optimize the binding, and thus stabilization, of transition state analogs. A multi-objective adaptation of our genetic algorithm should support that.
Binding of a ligand to its binding pocket is based on a wide range of weak interactions, leading to many degrees of freedom regarding optimization. All these interactions can be computationally quantified based on physical laws. Therefore, a summation over all relevant forces gives us a quantitative measure for goodness of fit. Several software packages exist to perform what is called molecular docking, but we will come back to this later and discuss it in more detail.

1.5 Protein structure prediction

The protein folding problem has come a long way since its first formulation by Kendrew and co-workers in 1958 [35], after they had realized that the three-dimensional structure of globular proteins was by no means as regular and symmetrical as they had expected. The aim of predicting a protein structure from its amino acid sequence can mainly be seen as a direct consequence of the complications that accompany experimental structure determination, generally carried out by X-ray crystallography or NMR. As protein sequences keep getting discovered in all kinds of organisms, annotating functions to these gene products is far from straightforward without detailed knowledge about their geometrical conformations. As discussed above, protein function is inseparably associated with structure, so determining these structures is a very important task in order to understand their biological relevance, eventually leading to industrial and in particular medical applicability.

Unfortunately, experimental structure verification cannot keep up with this fast-paced sequence discovery, and the gap between known sequences and experimentally derived corresponding structures is getting bigger and bigger. [36] X-ray crystallography is very expensive and time-consuming but can yield very high resolution structures. Deriving protein structures from NMR measurements, in turn, is far from trivial and requires significant specialization in the field, while being suitable only for smaller proteins and delivering structures with rather poor resolution. The technique can however be used to determine structures in solution, in contrast to diffraction-based analysis. Nowadays, it takes several months to years to experimentally derive a single structure; in fact, growing a protein crystal big enough for X-ray analysis (about 0.5 mm) can already require months. Additionally, membrane proteins are almost impossible to crystallize, which leads to extra complications. [37]
Figure 1.9: Some ab initio modeled structures from a recent publication. In red are the experimentally determined high resolution structures. The conformations in blue were rendered with the use of the Anton supercomputer.1

1 Picture taken from [39].

These facts taken into consideration, it should come as no surprise that the domain of computational structure prediction has drawn a lot of attention from the beginning of the problem. Yet, after more than 50 years of intensive research, there is still a lot to be done. It is generally considered to be one of the most important unsolved problems in biological engineering today. [38] Indeed, many goals have already been reached in the field and, in the process of finding a single best method to predict protein three-dimensional structure from its amino acid sequence, two main promising strategies have evolved. With this in mind, it should be clear that a single best method does not exist.

While Anfinsen's dogma states that all information for protein folding is contained within its amino acid sequence, it unfortunately turned out to be an extremely challenging task to make predictions based on this sequence alone. It still deserves a lot of research dedication, however, as it has great potential once sufficient computational power and efficient methods become available. This domain in structural biology is called ab initio protein structure prediction and aims at making high quality predictions based on the physical laws of molecular mechanics and dynamics. In theory this means no statistics should be applied and the calculation is completely without bias. In practice, with currently available computational resources, approximations are inevitable. Full molecular dynamics simulations can actually be carried out, but currently only on small polypeptides ranging from about 50 to 100 amino acids for periods up to 1 ms. [39] These computations are facilitated by the state-of-the-art Anton supercomputer, which is specifically designed for particle-particle interaction calculations. [40] A second revision of this device, Anton 2, was announced some time ago, but is still in development at the time of this writing (January 2017).1 The most impressive results achieved with the Anton supercomputer have elucidated many interesting facts about the folding process that previously had been impossible to observe. Figure 1.9 shows some of the realizations achieved so far and convincingly confirms that ab initio structure prediction is one of the most reliable prediction strategies available today. [39]
1 Anton 2 will have as much as 5 times the number of CPU cores running at 1650 MHz each, compared to 485 MHz of the original Anton supercomputer. It will also sport 152 interaction pipelines running at 1650 MHz per pipeline, a peak throughput of 12.7 TFXOPS, 4096 KB of RAM, 8000 rendered atoms per ASIC and up to 2.7 Tb/s of channel bandwidth.

Unfortunately, most proteins of biological and medicinal importance are significantly bigger than the ones that can be predicted ab initio right now. Therefore, scientists have been focusing over the past decades on another approach, which aims to predict the structure of new proteins based on the coordinates of previously derived conformations. This strategy is called homology modeling and mostly uses statistical inference combined with small molecular dynamics calculations to carry out structure predictions. It has many advantages over ab initio modeling, in particular that it can be used for proteins of over 100 amino acids and that it has much faster runtimes. As sequence identity between two proteins increases, model quality can equal or even exceed that of ab initio rendered structures. The method already produces reliable results with sequence identities starting at about 30%. If sequences show 80% to 90% similarity, model resolutions even better than 1.5 Å can be obtained. Note that this guideline is only valid for proteins that have evolved naturally.

Sequences that show similarity between proteins are also called conserved regions and are most likely situated in similar conformational folds. As the structure-function relationship dictates, a protein is assumed to lose its function if its fold is distorted by mutation. Therefore, even proteins with up to 70% deviation in sequence usually still retain the same fold. This principle is key to homology modeling and turns out to be the main reason for its success and popularity among researchers. One should be careful though; not all parts of protein sequences are conserved. Flexible regions exist, mostly acting as connectors between the functional and conserved domains. These regions are known as loops and turns and can be highly variable, even between proteins with high sequence identity. Review and model evaluation by an expert are therefore appropriate. [41, 42, 43]

Figure 1.10: Graphical User Interface for MODELLER provided by the Chimera molecular graphics suite.1

1 Picture taken from [41].

From the formulation of the problem in 1958 up until now, a lot of software packages have been developed to perform comparative modeling. Traditionally one of the most popular and fastest solutions available is MODELLER, developed by the lab of Andrej Sali at the University of California, San Francisco. MODELLER provides a lot of useful tutorials and comprehensive documentation and is the de facto standard among students trying to develop their skills in homology modeling. It should therefore be no surprise that we opted for this package, too. Other popular solutions are I-TASSER, ORCHESTRAR, Prime, MOE, SWISS-MODEL, Composer and the comparative modeling tool of the Rosetta suite. Figure 1.10 shows a third-party graphical interface to the MODELLER program, in which model reliability is indicated on the produced structure. We can clearly see that the regions with the highest error probability are situated in the flexible coils without secondary structure. [41]

In this work, we will combine elements from ab initio and homology modeling; the exact procedure followed will be discussed later on. Note that we will not perform extended molecular dynamics calculations on complete protein folds but instead focus on the small-scale effects of point mutations. For this, we gratefully make use of what is called the backrub move. A detailed description of this phenomenon is given in the next section.

Figure 1.11: Schematic representation of the backrub move. Rotation θ1,3 is visualized by the red dotted circle. This is the primary rotation and is traced out by the central Cα. As a rigid body, the central Cα and its surrounding peptides rotate about the red axis going from Cα(i-1) to Cα(i+1). Secondary rotations θ1,2 and θ2,3 are represented by the blue dotted circles and move the individual peptides as rigid bodies about the blue Cα-Cα axes.1

1 Picture taken from [44].

1.5.1 The backrub move

In the previous section, two main approaches to protein structure prediction were highlighted and briefly explained. Most of the time, sequences that have a high priority to be modeled originate from genome databases. This is explained by the fact that there still exists a giant gap between the number of known sequences found in all kinds of organisms and the available conformational information. The approach for these sequences is rather straightforward. They are first aligned against a sequence database with tools like BLAST or HMMER to find homologs that have already been structurally annotated. If sequence identity is high enough, homology modeling is performed based on these structures and in the best case a highly reliable model can be obtained. If no homologous structures are found, as is the case for many membrane proteins, there is not much left to do.

But what strategy should be used if a template with known structure is available and has very high sequence identity with the sequence to be modeled, but that sequence is artificially designed and counts more than 300 amino acids? Clearly, homology modeling cannot be relied upon, as it only works under the assumption of natural evolution and fold conservation. Ab initio modeling of the complete sequence is also ruled out, because it is currently intractable to perform molecular dynamics simulations on over 300 amino acids in reasonable time. Thankfully, a study by Davis and co-workers in 2006 elucidated a highly predictable pattern in the effects of point mutations on protein conformation. This pattern is called the backrub move and is visualized in figure 1.11. [44]

Figure 1.12: Some correct side chain predictions based on the backrub motion. Fixed backbone prediction is shown in red while backrub predictions are shown in blue. Starting PDB structure is shown in green and the target point-mutated PDB structure is shown in purple.1
1 Picture taken from [45].

The backrub motion describes subtle changes in a protein's backbone triggered by much larger changes in side chain conformation. These side chain movements are often observed due to impacts of other molecules in the environment, like H2O, but after such a collision the original backbone conformation is usually restored. Naturally occurring point mutations also cause these side chain rearrangements, but in contrast to random impacts these changes are permanent. While a point mutation itself usually does not directly influence backbone conformation, with cysteine and proline being the only exceptions to this rule, the backbone is again tilted locally due to the altered side chains. Over a much longer evolutionary timescale, these local backbone shifts accumulate and cause a protein backbone to change drastically and take on new folds. In the task of protein modeling, knowledge of these backrub moves can provide realistic predictions of backbone perturbations while maintaining a valid global geometry. [45]

While these backbone shifts cause some strain on the τ1 and τ3 angles, as seen in figure 1.11, the resulting values are usually well within the allowed range and practically never deviate by more than one standard deviation. Secondary movements of the surrounding peptides, triggered by the primary rotation of the central Cα, also help to accommodate this strain. [45] In figure 1.12, some correct side chain predictions based on this backrub pattern upon point mutation are illustrated.
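The geometry of the primary backrub rotation described above can be sketched as a rigid-body rotation about the axis through the two flanking Cα atoms, here computed with Rodrigues' rotation formula. This is only a minimal illustration under stated assumptions: the function names, the coordinates and the rotation angle are hypothetical, the secondary peptide rotations are omitted, and the code is not the implementation used by Rosetta or in this work.

```python
import math


def rotate_about_axis(point, axis_start, axis_end, angle_deg):
    """Rotate `point` about the line axis_start -> axis_end (Rodrigues' formula)."""
    # Unit vector k along the rotation axis.
    k = [e - s for s, e in zip(axis_start, axis_end)]
    length = math.sqrt(sum(c * c for c in k))
    k = [c / length for c in k]
    # Position of the point relative to the axis origin.
    v = [p - s for p, s in zip(point, axis_start)]
    angle = math.radians(angle_deg)
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    k_dot_v = sum(kc * vc for kc, vc in zip(k, v))
    k_cross_v = [k[1] * v[2] - k[2] * v[1],
                 k[2] * v[0] - k[0] * v[2],
                 k[0] * v[1] - k[1] * v[0]]
    rotated = [v[i] * cos_a + k_cross_v[i] * sin_a + k[i] * k_dot_v * (1 - cos_a)
               for i in range(3)]
    return tuple(r + s for r, s in zip(rotated, axis_start))


def primary_backrub(atoms, ca_prev, ca_next, angle_deg):
    """Rotate all atoms between the flanking C-alpha atoms as one rigid body."""
    return [rotate_about_axis(atom, ca_prev, ca_next, angle_deg) for atom in atoms]


# Hypothetical coordinates in angstrom, not taken from a real structure.
ca_prev, ca_central, ca_next = (0.0, 0.0, 0.0), (3.0, 2.0, 0.0), (6.0, 0.0, 0.0)
moved = primary_backrub([ca_central], ca_prev, ca_next, angle_deg=10.0)[0]
print(moved)
```

Because the rotation axis runs through the two flanking Cα atoms, these stay in place and the central Cα keeps its distance to the axis, which is why the move perturbs the backbone only locally.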

1.6 Molecular docking

In section 1.4, we already mentioned the concept of molecular docking. In essence, molecular docking returns a docking score which acts as a quantitative measure for goodness of fit between a ligand and a receptor structure. These names can be confusing though, as a ligand suggests a smaller molecule binding to a larger receptor. In reality, the terms are very relative and often used interchangeably. Molecular docking can be applied to all kinds of (macro-)molecular structures. For example, the modeling of protein-protein interactions and of protein-DNA complexes is carried out by specialized docking algorithms. Each specific docking situation often has its own range of specialized software implementations.
A docking score is calculated based on the total change in free energy upon binding, using physics-based force fields. This is a global energy summation function that takes into account all interaction forces that come into play. The flexibility of the two actors is another important factor to be accounted for, as described in section 1.1 and section 1.4, which leads to an even bigger solution space for this particular optimization task. Global optimization techniques such as Monte Carlo simulations and genetic algorithms are therefore appropriate to arrive at an acceptable estimate. As discussed in chapter 2, this implies a wide range of variables to be set when running docking jobs. [43]
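To make the idea of an energy summation concrete, the sketch below adds up a Lennard-Jones term and a Coulomb term over all ligand-receptor atom pairs. It is a toy model under loud assumptions: the functional form, the single epsilon and sigma shared by all atoms, the charges and the coordinates are all hypothetical, and this is not the scoring function of rDock or of any other docking package.

```python
import math

COULOMB_CONSTANT = 332.06  # kcal*angstrom/(mol*e^2), a commonly used conversion factor


def pair_energy(distance, charge_a, charge_b, epsilon, sigma):
    """Lennard-Jones 12-6 plus Coulomb energy of one atom pair."""
    lennard_jones = 4.0 * epsilon * ((sigma / distance) ** 12 - (sigma / distance) ** 6)
    coulomb = COULOMB_CONSTANT * charge_a * charge_b / distance
    return lennard_jones + coulomb


def toy_docking_score(ligand_atoms, receptor_atoms, epsilon=0.1, sigma=3.5):
    """Sum the pair energies over all ligand-receptor atom pairs (lower is better)."""
    score = 0.0
    for ligand_xyz, ligand_charge in ligand_atoms:
        for receptor_xyz, receptor_charge in receptor_atoms:
            distance = math.dist(ligand_xyz, receptor_xyz)
            score += pair_energy(distance, ligand_charge, receptor_charge, epsilon, sigma)
    return score


# Hypothetical atoms given as (coordinates in angstrom, partial charge in e).
receptor = [((0.0, 0.0, 0.0), -0.5), ((4.0, 0.0, 0.0), 0.0)]
ligand_attracted = [((0.0, 4.0, 0.0), +0.5)]
ligand_repelled = [((0.0, 4.0, 0.0), -0.5)]

print(toy_docking_score(ligand_attracted, receptor))
print(toy_docking_score(ligand_repelled, receptor))
```

In this toy setting the oppositely charged ligand atom obtains the lower, and therefore better, score. A real docking program additionally samples ligand poses and conformations, which is where the global optimization techniques mentioned above come in.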
The application domains of molecular docking are numerous. One of the most important domains of research being facilitated by molecular docking is drug discovery. Nowadays, virtual screening against compound databases often serves as a starting point in drug development processes. For example, viral proteins can be shut down directly by binding a target-specific allosteric effector. More indirectly, monoclonal antibodies can be designed to target microbial compounds, after which these compounds can be neutralized by the immune system of the host. With the rise of computational docking, all these processes can be reliably carried out at a much faster pace. [46]

However, molecular docking is not only used to calculate docking scores. In many research domains it is actually more important to get a prediction of the relative conformational positions of ligand and receptor. As an illustration, many metabolic pathways make use of feedback systems to regulate the accumulation of end compounds, which can act as allosteric effectors on enzymes facilitating their own production. Phosphofructokinase or PFK is a well-known enzyme in glycolysis catalyzing the phosphorylation of fructose-6-phosphate into fructose-1,6-bisphosphate. High levels of ATP, one of the products of glycolysis, cause PFK to have less affinity for its substrate by altering its three-dimensional conformation upon binding. ATP is said to be an allosteric inhibitor of PFK. [47] Gaining better insight into such an allosteric mechanism can be significantly accelerated by using molecular docking applications that visualize the three-dimensional conformation of complexes.

Like in protein structure prediction, many software applications have been developed for molecular docking analysis. Some of them are specialized in virtual high-throughput screening of compounds, target the pharmaceutical industry and often come with a high price tag, while others are open source and freely accessible on the internet. The most popular one in the latter category is AutoDock Vina, but we will use the more configurable rDock package. [43]

Chapter 2

Genetic algorithms


CHAPTER 2. GENETIC ALGORITHMS

Figure 2.1: General flowchart of a genetic algorithm.1

1 Picture taken and adapted from https://www.hindawi.com/journals/mpe/2014/708275/fig4/.

2.1 Introduction

The most powerful problem solver in the Universe as we know it is without a doubt natural evolution. Since the formation of Earth more than 4 billion years ago, we have evolved into walking creatures with clear vision and a wide range of other biological sensors to find our way in this complex world. While our brain size has increased significantly over the years since the beginning of the genus Homo, many things are still intractable for us to design or even understand. Evolution does not have any problems with that and just exploits all resources it can access to build whatever is appropriate in a given context. This universal problem solver was the inspiration for scientists to come up with new global optimization strategies now known as evolutionary programming and genetic algorithms. In this work, we will limit our focus to genetic algorithms and their applications.
The working mechanism of genetic algorithms is in essence very simple; in the simplest case, it does not use any prior knowledge to approach a global optimum for a given problem. The general pipeline is depicted in figure 2.1 and will be briefly explained here. First, an initial population of solution candidates is generated. These are often just randomly generated, but valid, representations of members of the solution space. A main requirement in setting up a genetic algorithm is defining the fitness function. Like in natural evolution and Darwin's principle of survival of the fittest, this measure is used to compare the individuals in the population. These individuals are also called chromosomes, analogous to the biological model for genetics in which chromosomes are composed of genes. Like in the example of figure 2.1, genetic algorithms usually do not go into more detail concerning real-life genetics and simply assume a gene to be the most fundamental unit of information from which chromosomes are constructed.

For every chromosome in the population at time t, the fitness is calculated. Based on these values, several methods exist for selection, which are discussed later on. Simply put, the better performing solution candidates have a higher chance of reproducing and proceeding to the next generation. This process of reproduction happens in two phases: crossover and mutation. The selection operator filters out individuals eligible for mating and the crossover operator exchanges bits of information between these candidates. After that, the mutation operator potentially introduces random mutations in the sequences. This aims to eliminate bias by exploring new parts of the solution space. The result is a new population at time t + 1, and the same steps are performed over and over again until the stopping criterion is fulfilled. After a sufficient period of time, an acceptable estimate of the global solution should be returned. [48]
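The pipeline described above can be sketched in a few lines. The example below is a minimal illustration, not the algorithm developed in this work: it assumes bit-string chromosomes, a tournament of two individuals as the selection method, single point crossover, a single bit flip as mutation, a fixed number of generations as the stopping criterion and a small elite that is copied unchanged. All names and default values are hypothetical.

```python
import random


def run_ga(fitness, n_genes, pop_size=20, generations=30,
           p_crossover=0.9, p_mutation=0.2, n_elite=2, seed=0):
    rng = random.Random(seed)
    # Initial population: random but valid bit strings.
    population = [[rng.randint(0, 1) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    best_per_generation = []
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        best_per_generation.append(fitness(ranked[0]))
        # Elitism (assumption): the best individuals survive unchanged.
        next_population = [chrom[:] for chrom in ranked[:n_elite]]
        while len(next_population) < pop_size:
            # Selection: the fitter of two random individuals becomes a parent.
            parent_a = max(rng.sample(population, 2), key=fitness)
            parent_b = max(rng.sample(population, 2), key=fitness)
            child = parent_a[:]
            # Crossover: exchange information between the two parents.
            if rng.random() < p_crossover:
                cut = rng.randrange(1, n_genes)
                child = parent_a[:cut] + parent_b[cut:]
            # Mutation: flip one random gene.
            if rng.random() < p_mutation:
                position = rng.randrange(n_genes)
                child[position] = 1 - child[position]
            next_population.append(child)
        population = next_population
    return max(population, key=fitness), best_per_generation


# Toy fitness function: the number of ones in the chromosome.
best, history = run_ga(fitness=sum, n_genes=16)
print(best, history[0], history[-1])
```

Because the elite is carried over unchanged, the best fitness recorded per generation can never decrease in this sketch.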
Keep in mind that genetic algorithms are by no means the holy grail that solves every problem. They belong to the same category of metaheuristic global search algorithms as ant colony optimization and particle swarm optimization. These algorithms tend to perform very well in complex situations with enormous solution spaces that would otherwise be intractable to explore. Therefore, a genetic algorithm is an appropriate strategy for this work. In the next few sections, several considerations in the process of implementing efficient genetic algorithms are highlighted.

2.2 Parameters

There is no such thing as a general genetic algorithm. It is more of an umbrella term for many specialized variations. Like Dijkstra's algorithm being a special case of A* search, genetic algorithms consist of many variables that can be heavily modified and optimized for specific situations. Yet, a reference algorithm does exist and is better known as the canonical genetic algorithm (CGA). It serves as a basis, a guideline for more optimized and efficient variants. Upon this basic algorithm, the schema theorem and building block hypothesis were built, which serve as theoretical support for the efficiency of the algorithm. [48, 49] The main parameters of every genetic algorithm are discussed below.
Probability of mutation: The first parameter that can be optimized to increase algorithm efficiency is the probability of mutation (%M). This parameter describes the chance of mutation for each chromosome or solution candidate in the selected part of the (sub-)population, going from one generation to another. In genetic algorithms, the mutation operator is of less importance than the crossover operator. Yet, the value of this parameter can heavily affect the efficiency of the algorithm. Too high a probability of mutation leads to an excess of genetic diversity, so that the search degenerates and no convergence can be reached. On the other hand, too low a value can easily cause premature convergence, and a very poor, suboptimal solution may be returned.

Probability of crossover: Secondly, we should consider the probability of crossover (%C). Crossover or recombination is the main driving force in genetic algorithms and natural evolution in general. It tries to exploit the good elements of solution candidates while maintaining genetic diversity over different generations by mixing them up from one generation to another. This way, good elements are kept in the population and new chromosomes based on the previous ones are generated and evaluated. Like the mutation probability, this parameter is sensitive to extreme values. Too high values cause too much diversity, while too low values lead to premature convergence.

Percentage elitism: Another important parameter is the percentage of elitism (%E). This represents the percentage of the population that is to be kept when moving from one generation to another, chosen from the upper part of the solution candidate pool with respect to their relative fitness values. This part is moved without changes from the parent population to the offspring and thus has to be kept rather limited in order to achieve enough genetic diversity.

Population to generation size ratio: When deciding on an optimal algorithm strategy, a fixed number of chromosomes that are to be generated in the experiments has to be defined. Indeed, if this is not the case, it is easy to obtain much better results simply by letting the algorithm run for more generations or with bigger populations, as genetic algorithms are iterative global optimization algorithms. This obviously requires a lot more computational effort and leads us to another parameter that has to be considered: the population size to generation number ratio (P/G). This ratio can be modified while keeping the total number of rendered chromosomes constant. The smaller the population, the higher the number of computed generations but the higher the chance of premature convergence. With bigger populations, fewer generations can be calculated and it is less likely that convergence will be reached.
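The trade-off between population size and number of generations under a fixed budget is simple arithmetic, sketched below. The budget of 10,000 chromosomes, the population sizes and the helper names are hypothetical values chosen for illustration; they are not the settings used in this work.

```python
def generations_for_budget(budget, pop_size):
    """Number of full generations that fit in a fixed budget of chromosomes."""
    return budget // pop_size


def elite_count(pop_size, pct_elitism):
    """Number of chromosomes copied unchanged for a given %E."""
    return round(pop_size * pct_elitism / 100)


budget = 10_000  # hypothetical total number of chromosomes to evaluate
settings = [(pop_size, generations_for_budget(budget, pop_size))
            for pop_size in (50, 100, 500)]
for pop_size, generations in settings:
    print(pop_size, generations, elite_count(pop_size, pct_elitism=5))
```

Under the same budget, a population of 50 chromosomes can evolve for ten times as many generations as a population of 500, which is exactly the P/G trade-off described above.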

2.3 Operators

This part of the algorithm contains the actual drivers of the optimization process. The crossover operator (COP) in particular is of main importance. It acts as a driving and regulating force, trying to reach a balance between the two extremes of exploitation and exploration. On the one hand, crossover brings new permutations from different parts of the solution space into the population, while on the other hand it tries to maintain the good elements that have already been reached. This way, an optimal solution may be achieved after a certain number of generations based on selection and recombination. Obviously, it is essential for the crossover operator to behave in an efficient way. For example, the crossover operator of the canonical genetic algorithm should never be used to solve something like the traveling salesman problem (TSP): the single point crossover operator would deliver solutions that are not valid and, when using an adjacency representation, in no way tries to pass good subpaths that already exist in the population on to the next generation. The solution candidate representation is closely tied to specific recombination operators. As a matter of illustration, we make use of the TSP case to underscore the importance of a good crossover operator. [50, 49]
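Why single point crossover fails on the TSP can be seen in a tiny example. The two parent tours below are hypothetical and given in path representation; the helper names are assumptions made for this sketch.

```python
def single_point_crossover(parent_a, parent_b, cut):
    """Head of parent_a followed by the tail of parent_b."""
    return parent_a[:cut] + parent_b[cut:]


def is_valid_tour(tour):
    """A valid tour visits every city 1..n exactly once."""
    return sorted(tour) == list(range(1, len(tour) + 1))


parent_a = [1, 2, 3, 4, 5, 6]
parent_b = [4, 6, 2, 1, 5, 3]
child = single_point_crossover(parent_a, parent_b, cut=3)
print(child, is_valid_tour(child))
```

The child visits cities 1 and 3 twice and never reaches cities 4 and 6, so it is not a member of the solution space even though both parents are.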
Two representations that can be applied to the TSP are considered: path representation and adjacency representation. In path representation, a permutation simply lists the cities in the order in which they are visited. This is in contrast to the adjacency representation, which uses permutations to describe the visiting order in a very different way. The location of a gene in its chromosome, that is, the index (starting at 1) of a number in the permutation, defines a direct link to this number or gene. For example, in permutation [7 6 8 5 3 4 2 1] the number 4 is located at index 6, so city 4 is connected to city 6. Often, crossover operators are not interchangeable between different representations. To find and implement the optimal crossover operator, it is therefore important to use the right representation. Two crossover operators that can be used on these different representations are briefly discussed below.
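The relation between the two representations introduced above can be made explicit by converting one into the other. The helper names in this sketch are hypothetical; city numbers and indices start at 1, as in the text.

```python
def adjacency_to_path(adjacency, start=1):
    """Follow the links index -> gene, starting from city `start`."""
    path = [start]
    next_city = adjacency[start - 1]
    while next_city != start and len(path) < len(adjacency):
        path.append(next_city)
        next_city = adjacency[next_city - 1]
    return path


def path_to_adjacency(path):
    """Store the successor of each city at the index of that city."""
    adjacency = [0] * len(path)
    for i, city in enumerate(path):
        adjacency[city - 1] = path[(i + 1) % len(path)]
    return adjacency


adjacency = [7, 6, 8, 5, 3, 4, 2, 1]
path = adjacency_to_path(adjacency)
print(path)                     # city 6 is directly followed by city 4
print(path_to_adjacency(path))  # back to the adjacency representation
```

If the decoded path is shorter than the chromosome, as for the adjacency list [2 1 4 3], the permutation consists of more than one cycle and does not describe a legal tour.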
Alternating Edge Recombination: This recombination method goes along with the adjacency representation. It takes into account that although all elements are unique, i.e. each city number occurs only once in a permutation, such tours can still be illegal as they can consist of more than one cycle. Alternating Edge Recombination (AER) works as follows. First, a random starting edge is chosen from parent 1. This subtour is extended with the corresponding edge from parent 2, and so on. In the end, a valid permutation can be reached which combines the edges from the parents in their offspring. However, subpaths that are passed from parents to offspring contain only one edge at a time. Good subpaths are therefore disrupted by the operator, which suggests lower performance. It is essential for crossover operators to keep subpaths of good solution candidates in the population. [49]
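A minimal sketch of the alternating procedure is given below. One detail is an assumption of this sketch rather than something stated above: when the edge inherited from a parent would close a cycle too early, a random unvisited city is chosen instead. Function names and parent tours are hypothetical.

```python
import random


def alternating_edges(parent_1, parent_2, rng):
    """Build a child in adjacency representation (entry i-1 follows city i)."""
    n = len(parent_1)
    parents = (parent_1, parent_2)
    start = rng.randrange(1, n + 1)
    current, visited = start, {start}
    child = [0] * n
    for step in range(n - 1):
        # Alternate between the parents, starting with parent 1.
        candidate = parents[step % 2][current - 1]
        if candidate in visited:
            # Assumption: repair a premature cycle with a random unvisited city.
            candidate = rng.choice(sorted(set(range(1, n + 1)) - visited))
        child[current - 1] = candidate
        visited.add(candidate)
        current = candidate
    child[current - 1] = start  # close the tour
    return child


def is_single_cycle(adjacency):
    """True if following the links from city 1 visits every city once."""
    seen, city = set(), 1
    while city not in seen:
        seen.add(city)
        city = adjacency[city - 1]
    return len(seen) == len(adjacency) and city == 1


parent_1 = [7, 6, 8, 5, 3, 4, 2, 1]
parent_2 = [2, 3, 4, 5, 6, 7, 8, 1]
child = alternating_edges(parent_1, parent_2, random.Random(0))
print(child, is_single_cycle(child))
```

The sketch also shows the weakness mentioned above: consecutive edges come from different parents, so a good subpath of several edges is rarely inherited intact.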
Edge Recombination Crossover This crossover operator works with path representation, the most intuitive yet very potent representation. As described above, crossover operators need to be conservative regarding subtours of solution candidates. Edge Recombination Crossover (ERX) tries to achieve this goal while keeping the number of new edges at a minimum. Other path representation crossover operators like Order Crossover (OX) also preserve subtours as much as possible but introduce a lot of new edges while doing so. In fact, crossover operators only have to implement recombination; mutation is already attributed to the mutation operator. Therefore, ERX tries to introduce as few mutations as possible, rendering offspring chromosomes that are almost solely based upon recombination, i.e. practically every edge in the offspring chromosome comes from at least one of the two parents. ERX has a very high time and space complexity. This is explained by its working procedure.

First, an edge map of the parents is constructed. This map describes which cities are connected to each city, looking at edges in both parents. So, for example, parents [5 8 2 6 4 7 9 1 3] and [2 1 3 5 8 6 4 9 7] would give an edge map in which city 1 is linked with cities 9, 3 and 2. As we can see, city 1 has one edge (1-3) that is common to both parents. Common edges lead to fewer connected cities in the edge map. ERX first chooses a random city from one of the two parents. After that, the city cannot be chosen anymore and is deleted from the left column of the edge map. All cities connected to it are then evaluated in the edge map, giving the city with the lowest number of connected cities the highest priority. If two cities have equal numbers of connected cities, one of them is chosen at random. If the current city has no connected cities left but there are still unvisited cities, one of these is selected at random as the next visited city. If there are no unvisited cities left, the algorithm terminates. The resulting permutation is the child chromosome. [49]
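The ERX procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation from [49]: the function names are hypothetical, and the sketch assumes that tours are cyclic (the last city neighbours the first) and that "connected cities" are counted among unvisited cities only.

```python
import random

def build_edge_map(parent_a, parent_b):
    """Map every city to the set of its neighbours in either parent."""
    edge_map = {city: set() for city in parent_a}
    for tour in (parent_a, parent_b):
        n = len(tour)
        for i, city in enumerate(tour):
            # Assumption: tours are cyclic, so indices wrap around.
            edge_map[city].add(tour[(i - 1) % n])
            edge_map[city].add(tour[(i + 1) % n])
    return edge_map

def edge_recombination(parent_a, parent_b, rng=random):
    """Return one child tour built from the parents' edge map."""
    edge_map = build_edge_map(parent_a, parent_b)
    unvisited = set(parent_a)
    current = rng.choice(parent_a)  # random start city
    child = []
    while True:
        child.append(current)
        unvisited.discard(current)
        if not unvisited:
            return child
        candidates = [c for c in edge_map[current] if c in unvisited]
        if candidates:
            # Prefer the neighbour with the fewest unvisited neighbours;
            # ties are broken at random.
            def degree(city):
                return sum(1 for x in edge_map[city] if x in unvisited)
            fewest = min(degree(c) for c in candidates)
            current = rng.choice([c for c in candidates if degree(c) == fewest])
        else:
            # Dead end: continue from a random unvisited city.
            current = rng.choice(sorted(unvisited))
```

For the two example parents, `build_edge_map` links city 1 with cities 9, 3 and 2, in line with the description above.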
The other class of operators needed in genetic algorithms, albeit of less importance, are the mutation operators (MOP). They are used to introduce new, random edges in the population in order to increase genetic diversity and explore other parts of the solution space. The operators considered in this description are the Reciprocal Exchange and Inversion operators. Reciprocal exchange, also called swap mutation, point mutation, order-based mutation or simply exchange mutation, chooses two random cities in a tour and exchanges them. Inversion works by selecting a random subtour of a solution candidate, inverting it and inserting it back in the chromosome at a random location. It is also called cut-inversion mutation. The effect of using different mutation operators is expected to be rather low, as every operator in fact simply introduces randomness without further relevance or background information. [49]
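Both mutation operators are short enough to write out. The sketch below follows the description above; the function names are illustrative and the tours are assumed to be Python lists of distinct cities with at least two entries.

```python
import random

def reciprocal_exchange(tour, rng=random):
    """Swap two randomly chosen cities."""
    child = list(tour)
    i, j = rng.sample(range(len(child)), 2)
    child[i], child[j] = child[j], child[i]
    return child

def cut_inversion(tour, rng=random):
    """Cut out a random subtour, reverse it and reinsert it at a random position."""
    child = list(tour)
    # Two distinct slice bounds, so the subtour is never empty.
    i, j = sorted(rng.sample(range(len(child) + 1), 2))
    segment = child[i:j][::-1]
    rest = child[:i] + child[j:]
    k = rng.randint(0, len(rest))
    return rest[:k] + segment + rest[k:]
```

Both functions return a new list, so the parent tour is left untouched.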


Figure 2.2: Schematic representation of the Stochastic Universal Sampling selection method.1

2.4 Methods of selection

Efficient methods of selection are necessary in every genetic algorithm. The selection part of the algorithm determines which individuals from a start population are ready to mate and generate offspring chromosomes, which are then reinserted in the population, leading to the next generation of solution candidates. By way of illustration, three different methods of selection are explained. The exact method of selection can be based on different rules and conditions, in which individual fitness plays an important role.
Roulette Wheel Selection Another selection method is Roulette Wheel Selection (RWS), also called Fitness Proportional Selection. It is analogous to SUS, described below, in that the same bar representation is used. However, every slice is selected by an independent random draw, in contrast to the evenly spaced intervals used in SUS. This implies purely fitness-based selection, with the selection probability of an individual being exactly its relative fitness, or its relative copy number in the case of linear ranking. In this method, individuals with much higher fitness values are greatly favoured over others and may dominate next-generation populations due to high selection pressure. Rank-based Roulette Wheel Selection may reduce this effect. [49]
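A minimal sketch of fitness-proportional selection is given here, assuming non-negative fitness values with a positive sum. The function name and signature are illustrative and do not correspond to the FIGARO source code.

```python
import bisect
import itertools
import random

def roulette_wheel_selection(fitnesses, n_select, rng=random):
    """Return n_select indices, each drawn with probability fitness / total."""
    cumulative = list(itertools.accumulate(fitnesses))
    total = cumulative[-1]
    selected = []
    for _ in range(n_select):
        # Every pick uses its own random position on the bar [0, total).
        pointer = rng.random() * total
        index = bisect.bisect_right(cumulative, pointer)
        # Guard against a pointer that rounds up to the end of the bar.
        selected.append(min(index, len(fitnesses) - 1))
    return selected
```

Because every pick is an independent draw, the same individual can be selected many times, which is the source of the high selection pressure mentioned above.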
1 Picture taken from [51].

Stochastic Universal Sampling One possible method of selecting parent chromosomes is Stochastic Universal Sampling (SUS). In this method, total fitness (F) is divided by the number of individuals that have to be selected, resulting in evenly spaced fitness intervals. In contrast to RWS, only one random number between 0 and the interval length is generated in the beginning. This is the starting point. Picture the total fitness as a bar (from 0 to F) divided into slices with lengths proportional to the fitness value of every individual. The slice that contains the starting point is the first selected individual. After that, the pointer is moved over one interval length and again the slice it lands in is selected. This process is repeated until the end of the bar is reached. See figure 2.2 for a graphical representation of this methodology.
There are two possible ways of representing this bar. The most common one is described above. Another way is linear ranking. For this, individuals are ranked by fitness, with lower fitnesses receiving a lower integer rank than higher ones. Each individual is copied a number of times according to its rank. Copies with the same rank then combine to form a slice. This way, slice lengths are defined by copy number. [49]
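The pointer walk of SUS can be sketched as follows, using the fitness-proportional bar (not the linear ranking variant). The function name is illustrative, and fitness values are assumed to be non-negative with a positive sum.

```python
import random

def stochastic_universal_sampling(fitnesses, n_select, rng=random):
    """Return n_select indices using one random start and evenly spaced pointers."""
    total = sum(fitnesses)
    step = total / n_select
    start = rng.random() * step  # the only random number that is drawn
    selected = []
    index = 0
    cumulative = fitnesses[0]  # right edge of the current slice
    for k in range(n_select):
        pointer = start + k * step
        # Advance to the slice that contains the pointer.
        while cumulative <= pointer and index < len(fitnesses) - 1:
            index += 1
            cumulative += fitnesses[index]
        selected.append(index)
    return selected
```

With four equally fit individuals and four picks, every individual is selected exactly once, whereas independent roulette wheel draws could easily select one individual several times.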
Tournament Selection The last widely applied selection method is Tournament Selection (TS). In this case, a subpopulation of k randomly
chosen individuals is picked from the population. From this subgroup,
the individual with highest fitness is selected. This is repeated until the desired number of selected individuals is reached. The value
of k has to be defined in the algorithm. With high values of k, a
high selection pressure is expected as many individuals will compete
against each other and only the fittest wins. Lowering k will reduce
the selection pressure and the chance of premature convergence. [49]
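Tournament selection is the simplest of the three methods to write down. The sketch below uses illustrative names and assumes that contestants are drawn without replacement within one tournament.

```python
import random

def tournament_selection(fitnesses, n_select, k, rng=random):
    """Return n_select indices, each the fittest of k randomly drawn contestants."""
    selected = []
    for _ in range(n_select):
        contestants = rng.sample(range(len(fitnesses)), k)
        selected.append(max(contestants, key=lambda i: fitnesses[i]))
    return selected
```

Setting k equal to the population size always returns the single fittest individual, which illustrates why large values of k raise the selection pressure.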

Part II

FIGARO

Chapter 3

Application overview

Code Snippet 3.1: Currently hard-coded program parameters.

    ## Percentage elitism
    pctE = 0.25
    ## Percentage crossover
    pctX = 0.6
    ## Percentage mutation
    pctM = 0.005
    ## Population size
    popSize = 100
    ## Subpopulation size
    subpopSize = 10
    ## Number of putative binding sites to be inspected
    nrBindingSites = 5
    ## Number of generations
    genSize = 50
    ## Maximum resolution of structures in initial population
    maxRes = 2.0
    ## Maximum number of entities in structure
    maxEnt = 1
    ## Minimum similarity to query ligand
    minSim = 0.3

3.1 General outline

In this section, we give a bird's-eye view of the application and describe its global workflow. Over the next few sections, every aspect of the application is presented independently and clarified on the basis of the corresponding code snippets.
The code in snippet 3.2 represents the main backbone of the program. Note that the program in its current state still has many hard-coded parameters, which could more elegantly be requested as user input at runtime in future releases. These parameters are given in code snippet 3.1, accompanied by short explanations in the code comments. The values given to them here serve as an example for the program to be run with. However, the main input for the program still needs to be defined. This is the ligand for which we aim to generate an optimal receptor structure. This can be any small molecule, such as a designed compound or a natural organic molecule like caffeine. For our tests we used a web service that picks random molecules from the ZINC database, hosted by the BIRC at the University of California.1 When the program is run, the first task is mining the PDB for high-quality receptor structures that are known to bind ligands similar to our input molecule. When enough templates are collected, i.e. the defined population size is reached, the algorithm proceeds by replicating every individual a number of times according to the defined subpopulation size. While doing this, point mutations are introduced with a predefined probability (pctM as shown in code snippet 3.1).
1 This website can be consulted at http://bcirc.docking.org/random.shtml

Code Snippet 3.2: Backbone of the application.

    ## Generate random small molecule from http://bcirc.docking.org/random.shtml and get SMILES (currently hardcoded).
    randMol = "CCOc1cccc(n1)NC(=O)c2cc[nH]n2"

    ## Initialize the population.
    population = initialize_pop(randMol, popSize, subpopSize, maxRes, maxEnt, minSim, nrBindingSites, pctM)

    ## Write initial information for result evaluation.
    write_initial_info(population)

    ## Perform GA on population for specified number of generations.
    for generation in range(0, genSize):
        for index, subpopulation in enumerate(population):
            ## Predict initial structures.
            population[index] = workers.map(predict_structure_backrub, subpopulation)
            ## Dock ligand to all individuals in subpopulation.
            population[index] = workers.map(dock, population[index])
            ## Select individuals for mating in subpopulations.
            population[index] = roulette_wheel_selection(population[index], pctE)
            ## Perform crossover between individuals in subpopulations.
            two_points_crossover(population[index], pctX)
            #single_point_crossover(newPopulation, pctX)
            ## Predict structure of recombined sequences.
            population[index] = workers.map(predict_structure_homology_fast, population[index])
            ## Dock ligand to all recombinations in subpopulation.
            population[index] = workers.map(dock, population[index])

        if generation < genSize - 1:
            ## Select mother proteins for new population based on competition between each subpopulation's best competitor.
            motherSelection = roulette_wheel_selection(get_candidates_list(population), replace=False)
            ## Generate new population and mutate mother proteins.
            population = prepare_next_gen(population, motherSelection, subpopSize, pctM, generation)
        else:
            bestProtein = report_best(population)
            print "Program finished. The best protein is " + bestProtein.id + " with a score of " + str(bestProtein.score) + \
                " on binding site " + bestProtein.bestBindingSite + "."

After all sequences have been generated, their structures are modeled by the backrub application from the Rosetta suite. The exact mechanism for this is described in later sections. A molecular docking job is then performed on all of these individuals and fitness scores are collected. Based on these scores, proteins eligible for mating are selected in each subpopulation and sequences are recombined based on a given crossover probability (pctX as shown in code snippet 3.1). Next, a comparative modeling step returns the conformational models for all recombined structures based on a fast homology modeling algorithm and, again, molecular docking is performed to evaluate the recombined child sequences. Eventually the best candidates are picked from all subpopulations. A final selection round using RWS as described in section 2.4 on page 40 chooses a new set of mother proteins to form the subpopulations in the next generation. The process is then repeated until the defined number of generations is reached. The program finishes and reports information about the best protein in the final population.
FIGARO needs a lot of CPU power to return acceptable results. As is the case for all genetic algorithms, a bigger population size and a larger number of generations tend to give better results, but they also require considerably more computational resources. To reduce computation time significantly, FIGARO supports multiprocessing to distribute the work over multiple processor cores. The implementation we have opted for is the multiprocessing package for Python, illustrated in code snippet 3.3. Depending on the machine the program is running on, the number of worker processes can be adjusted so that the workload is spread over the available CPU cores.

Code Snippet 3.3: Support for distributed computing.

    import multiprocessing

    ## Set number of processor cores
    workers = multiprocessing.Pool(100)
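Code snippet 3.3 fixes the pool size at 100 worker processes. One possible way to adapt this to the machine at hand is sketched below. The helper names are hypothetical and not part of FIGARO, and capping the pool at the number of available cores is an assumption that suits CPU-bound jobs such as modeling and docking.

```python
import multiprocessing

def pick_worker_count(requested, available=None):
    """Cap the requested number of workers at the number of available cores."""
    if available is None:
        available = multiprocessing.cpu_count()
    return max(1, min(requested, available))

def map_in_parallel(func, items, requested=100):
    """Apply func to every item using a pool sized for this machine.

    func has to be a picklable, module-level function.
    """
    with multiprocessing.Pool(processes=pick_worker_count(requested)) as pool:
        return pool.map(func, items)
```

On platforms that start worker processes by spawning a fresh interpreter, calls to `map_in_parallel` belong under an `if __name__ == "__main__":` guard.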

Code Snippet 3.4: Implementation of the PDB mining process.

    ## Creates initial PDB population and download to folder pdb.
    ## Receptors for drug-like molecules similar to the random molecule are collected.
    def initialize_pop(randMol, popSize, subpopSize, maxRes, maxEnt, minSim, nrBindingSites, pctM):
        population = []
        usedPdb = []
        similarity = 1
        while similarity > minSim and len(population) != popSize:
            pdbOut = urllib2.urlopen("http://www.rcsb.org/pdb/rest/smilesQuery?smiles=" + randMol +
                                     "&search_type=similarity&similarity=" + str(similarity)).read().splitlines()
            for line in pdbOut:
                if len(population) != popSize:
                    hit = re.search('(?<=structureId=").*?(?=")', line)
                    if hit:
                        pdb = hit.group()
                        chain = re.search('(?<=chain id=")\w(?=")', urllib2.urlopen(
                            "http://www.rcsb.org/pdb/rest/describeMol?structureId=" + pdb).read()).group()
                        if pdb not in usedPdb + exceptionList and quality_check(pdb, maxRes, maxEnt):
                            ligand = re.search('(?<=chemicalID=").*?(?=")', line).group()
                            write_pdb(pdb, ligand)
                            fix_pdb(pdb)
                            extract_chain(pdb, chain)
                            seq = get_seq(pdb)
                            bindingSites = get_binding_sites(pdb, nrBindingSites)
                            motherProtein = Protein(pdb, seq, chain, bindingSites)
                            subpop = create_subpopulation(motherProtein, subpopSize, pctM)
                            population.append(subpop)
                            usedPdb.append(pdb)
            similarity = similarity - 0.1
        if len(population) == popSize:
            print "Population initialization finished."
            return population
        else:
            sys.exit("Not enough available templates for these parameters.")

3.2 Mining the PDB

Mining the PDB is the first step of the algorithm and is crucial to its further progress. The complete code responsible for this is given in code snippet 3.4. As described in the previous section, the program tries to find existing templates in the Protein Data Bank that bind to compounds showing high similarity to our input molecule. Finding such templates is not always straightforward, however, as any input molecule can be given. Moreover, it is likely that the given compound does not act as a known natural ligand.
For this reason, the program looks for structures in an iterative manner. It starts by screening the PDB for structures binding ligands that have 100% similarity to the query compound. If this does not fill up the total population, which is very likely, the similarity threshold drops by 10 percentage points. This process is repeated until the population is saturated with templates. If the threshold reaches the minimum similarity (minSim) before that happens, the program aborts and notifies the user that not enough templates were found for the given set of parameters.
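The iterative lowering of the similarity threshold can be isolated from the web requests, as in the following sketch. It is an illustration and not the FIGARO code: `fetch_hits` is a hypothetical callable standing in for the PDB query, and the schedule counts steps with an integer so that repeated floating point subtraction cannot shift the thresholds.

```python
def similarity_schedule(min_sim, start=1.0, step=0.1):
    """Yield start, start - step, ... for as long as the value stays above min_sim."""
    k = 0
    while True:
        # Rounding removes floating point noise such as 0.30000000000000004.
        threshold = round(start - k * step, 10)
        if threshold <= min_sim:
            return
        yield threshold
        k += 1

def collect_templates(fetch_hits, pop_size, min_sim):
    """Lower the threshold until pop_size unique templates have been found."""
    templates = []
    for threshold in similarity_schedule(min_sim):
        for hit in fetch_hits(threshold):
            if hit not in templates:
                templates.append(hit)
            if len(templates) == pop_size:
                return templates
    raise RuntimeError("Not enough available templates for these parameters.")
```

With `min_sim` set to 0.3 the schedule runs from 1.0 down to 0.4, so in this sketch the minimum itself is never queried.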
Thankfully, mining the PDB is greatly facilitated by the available RESTful Web Services API. This API supports queries based on ligands given by their SMILES representation. The similarity threshold can be included in the URL. Upon request, carried out by the urllib2 Python package, the server returns an XML file which is parsed manually. An example XML file returned this way is given in snippet 3.5. We are particularly interested in the structureId attribute of the ligand element of this file. This is the identifier of the PDB template we are looking for. A regular expression extracts this id and stores it as a variable.

Code Snippet 3.5: Example XML file returned by web server.

    <?xml version="1.0" standalone="no"?>
    <smilesQueryResult smiles="CCOc1cccc(n1)NC(=O)c2cc[nH]n2" search_type="3" similarity="0.55">
      <ligandInfo>
        <ligand structureId="4N9B" chemicalID="2HH" type="non-polymer" molecularWeight="202.213">
          <chemicalName>1-METHYL-N-(PYRIDIN-3-YL)-1H-PYRAZOLE-5-CARBOXAMIDE</chemicalName>
          <formula>C10 H10 N4 O</formula>
          <InChIKey>LSEUGEFZKAZGSI-UHFFFAOYSA-N</InChIKey>
          <InChI>InChI=1S/C10H10N4O/c1-14-9(4-6-12-14)10(15)13-8-3-2-5-11-7-8/h2-7H,1H3,(H,13,15)</InChI>
          <smiles>Cn1c(ccn1)C(=O)Nc2cccnc2</smiles>
        </ligand>
      </ligandInfo>
    </smilesQueryResult>

If this identifier is not contained in the list of already mined PDB files, the procedure continues and another regular expression extracts the id of the ligand. This is needed because the downloaded PDB file still contains the ligand coordinates, which have to be filtered out for the molecular docking process. The write_pdb function takes care of this, again using a custom regex, and saves the PDB file in the appropriate folder. The PDB file still needs further preparation, though. By filtering out the ligand, missing residues are introduced. Moreover, a PDB file can contain more than one chain, which is not compatible with our algorithm. The fix_pdb and extract_chain functions were implemented for that reason. These are not given here but can be consulted in the complete source code disclosed in the attachments (appendix B) starting at page 86. After these steps, the PDB file is ready for use and is passed to the get_binding_sites function described in the next section. An array of binding site centers is returned and stored in a Protein object together with other metadata like sequence and chain id. This class is described in section 3.4.
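FIGARO extracts the identifiers with regular expressions. As an alternative illustration only, the same two attributes can be read with the XML parser from the Python standard library. The sample response below is abbreviated from snippet 3.5 and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Abbreviated sample response, modelled on code snippet 3.5.
SAMPLE = (
    '<smilesQueryResult smiles="CCOc1cccc(n1)NC(=O)c2cc[nH]n2" '
    'search_type="3" similarity="0.55">'
    '<ligandInfo>'
    '<ligand structureId="4N9B" chemicalID="2HH" type="non-polymer" '
    'molecularWeight="202.213">'
    '<smiles>Cn1c(ccn1)C(=O)Nc2cccnc2</smiles>'
    '</ligand>'
    '</ligandInfo>'
    '</smilesQueryResult>'
)

def extract_hits(xml_text):
    """Return (structureId, chemicalID) for every ligand element."""
    root = ET.fromstring(xml_text)
    return [(ligand.get("structureId"), ligand.get("chemicalID"))
            for ligand in root.iter("ligand")]

hits = extract_hits(SAMPLE)
```

A parser-based approach does not depend on the order of the attributes or on how the server splits the response over lines, which a line-by-line regex does.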

Code Snippet 3.6: Program implementation for binding site detection.

    ## Identifies putative binding pockets using the SiteHound package
    ## (no installation needed, i386 libraries have to be installed for pdb2gmx).
    def get_binding_sites(pdb, nrBindingSites):
        bsCenterList = []
        startingDir = os.getcwd()
        os.chdir(startingDir + "/pdb")
        os.system("./auto.py -i " + pdb + ".pdb -p CMET -k")
        with open(pdb + "_CMET_summary.dat", "r") as summary:
            for i in range(0, nrBindingSites):
                line = summary.readline()
                x = line.split()[-3]
                y = line.split()[-2]
                z = line.split()[-1]
                bsCenterList.append("(" + x + "," + y + "," + z + ")")
        ## Remove temporary files
        os.system("rm " + pdb + "_* " + pdb + ".easymifs")
        ## Return to main directory.
        os.chdir(startingDir)
        print "Binding sites found"
        return bsCenterList

3.3 Screening for binding sites

The mining of PDB files based on a query ligand only serves as a guideline for our program. To keep bias low, the whole protein conformation is screened for a range of putative binding sites after being prepared for use. Eventually our compound will be docked to all of these sites, an approach better known as blind docking. The SiteHound software package is used to screen for possible binding sites, up to the maximum number of binding pockets defined in the global program parameters. [52] The implementation of this part of the application is shown in code snippet 3.6.
The SiteHound algorithm is actually very comparable to general molecular docking. It uses a small methyl carbon or phosphate probe to screen the surface of a protein for favourable weak interactions. The resulting interaction points are clustered according to their spatial proximity by an agglomerative hierarchical clustering algorithm. A list of these clusters is returned in the end, corresponding to the identified binding pockets. These are ranked based on their total interaction energy and are stored by FIGARO. An array of binding site centers is returned, with maximum length nrBindingSites as defined in code snippet 3.1. According to a test study based on 77 experimentally derived protein structures linked to known protein-ligand complexes, the actual binding site was among the top three binding pockets recognized by SiteHound in 95 percent of the cases. [53]
SiteHound offers two possibilities concerning probe type. We opted for the CMET or methyl carbon probe. This is the most appropriate one for docking general drug-like compounds. [53]
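The step that turns ranked clusters into binding site centers can be sketched on its own. The layout of the summary lines is an assumption here: as in code snippet 3.6, only the last three whitespace-separated fields are read as x, y and z, and the sample line is made up for illustration.

```python
def parse_centers(summary_lines, max_sites):
    """Format the first max_sites cluster centers as "(x,y,z)" strings."""
    centers = []
    for line in summary_lines[:max_sites]:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or truncated lines
        x, y, z = (float(value) for value in fields[-3:])
        centers.append("({:.3f},{:.3f},{:.3f})".format(x, y, z))
    return centers

# Hypothetical summary line: rank, energy, size, then the center coordinates.
example = parse_centers(["1 -250.5 12 10.0 -4.5 7.25"], 5)
```

Converting the fields to floats before formatting makes malformed lines fail early instead of reaching the docking step.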

Code Snippet 3.7: Implementation of the Protein class.

    class Protein():

        def __init__(self, id, seq, chain, bindingSites):
            self.id = id
            self.seq = seq
            self.chain = chain
            self.bindingSites = bindingSites
            self.score = None
            self.parents = []
            self.crossoverPoints = []
            self.bestBindingSite = None
            self.pointMutations = {}

        def set_score(self, score):
            self.score = score

        def set_parents(self, protA, protB = None, crossoverPoint1 = None, crossoverPoint2 = None):
            self.parents.append((protA.id, protA.seq))
            if protB and re.search("recombined", protB.id):
                self.parents.append((protB.parents[0][0], protB.parents[0][1]))
            if protB and not re.search("recombined", protB.id):
                self.parents.append((protB.id, protB.seq))
            if crossoverPoint1:
                self.crossoverPoints = [crossoverPoint1]
            if crossoverPoint2:
                self.crossoverPoints.append(crossoverPoint2)

        def set_best_binding_site(self, bindingSite):
            self.bestBindingSite = bindingSite

        def set_id(self, id):
            self.id = id

        def update_seq(self, seq):
            self.seq = seq

        def set_point_mutations(self, pointMutations):
            self.pointMutations = pointMutations

3.4 The Protein class

The Protein class holds all metadata about a protein structure, such as its sequence, chain id, identified binding sites and parental sequences. It was made to centralize important parts of the code on the level of individual proteins. The implementation of this class is presented in code snippet 3.7.
The class provides functionality to manage all protein data efficiently. When protein sequences are mutated, it is important to keep data about these mutations in memory because it will be used later to perform structure prediction based on template structures. To manage this, every protein object holds a pointMutations map as shown in snippet 3.7. It records the position of each mutated amino acid in the sequence along with its original and new residue, and can be set from an outside call to the set_point_mutations function. Furthermore, setters are available for storing docking scores, best binding site coordinates, updated sequences, new protein ids and parental sequences. When setting parental sequences, the exact crossover points are also stored in the appropriate variable; these are crucial for the homology modeling processes later on.

Code Snippet 3.8: Implementation of the mutation operator.

    ## Point mutation operator
    def point_mutation(protein, pctM):
        mutable = list(protein.seq)
        pointMutations = {}
        for index, aminoAcid in enumerate(mutable):
            if random.random() <= pctM and not (protein.seq[index] == "C" or protein.seq[index] == "P"):
                mutable[index] = random.choice(aminoAcids)
                if protein.seq[index] != mutable[index]:
                    pointMutations[index+1] = protein.seq[index] + mutable[index]
        protein.update_seq("".join(mutable))
        protein.set_point_mutations(pointMutations)
        print "protein " + protein.id + " mutated."
        return protein

3.5 Modeling point mutations

When an initial population and its subpopulations are formed, new sequences are created through the introduction of point mutations to the mother or template proteins. Although a single point mutation is not likely to change the protein fold significantly, mutations accumulate quickly over the course of the program and incrementally alter protein conformation by triggering backrub moves on the protein backbone. This phenomenon is described earlier in section 1.5.1 on page 25.
However, not all point mutations can be treated equally; there are a few notable exceptions. The first one is proline, which is known to force the protein backbone into certain conformations due to its uncommon and constraining way of chaining in a polypeptide. To suit our purpose, this amino acid is therefore never introduced nor replaced by the mutation operator, as predictions would likely become unreliable. Moreover, cysteine is known to form disulfide bridges or cystine bonds between distant parts of the polypeptide chain. These are strong bonds and, if broken, protein conformation is not guaranteed to be stable anymore. For this reason, cysteine is excluded from the set of available amino acids, too.
The mutation operator loops over a given protein sequence one amino acid at a time. At every position, it draws a random float between 0 and 1. If this value does not exceed the pctM parameter, the residue is exchanged by random selection from the set of available amino acids. This continues until the end of the sequence is reached. In the meantime, a mutation map is kept in memory which stores, for every mutated position, the 1-based sequence index as key and the original and new residue as value. Eventually, this mutation map is stored in the appropriate Protein object and the sequence field is updated.
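The behaviour of the operator can be tried out on plain strings with the sketch below. It is a simplified stand-in for code snippet 3.8 and makes two assumptions: the set of available amino acids consists of the 18 standard residues that remain after excluding cysteine and proline, and a random generator can be passed in to make runs reproducible.

```python
import random

# Assumption: the 20 standard amino acids minus cysteine (C) and proline (P).
AVAILABLE = "ADEFGHIKLMNQRSTVWY"

def point_mutate(seq, pct_m, rng=random):
    """Return the mutated sequence and a map {1-based index: original + new residue}."""
    mutable = list(seq)
    mutations = {}
    for index, original in enumerate(seq):
        if original in "CP":
            continue  # cysteine and proline are never replaced
        if rng.random() <= pct_m:
            new = rng.choice(AVAILABLE)
            if new != original:
                mutable[index] = new
                mutations[index + 1] = original + new
    return "".join(mutable), mutations
```

A call such as `point_mutate("ACPKLM", 0.5, random.Random(42))` returns the new sequence together with the mutation map that the structure prediction step needs later on.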
Code Snippet 3.9: Implementation of the point mutation modeling process based on RosettaBackrub [54] for flexible backbone sampling.

     1  def predict_structure_backrub(protein):
     2      parser = PDBParser()
     3      structure = parser.get_structure(protein.id[0:4], "pdb/" + protein.id[0:4] + ".pdb")
     4      ## Give every thread its own input and output files.
     5      os.system("cp pdb/" + protein.id[0:4] + ".pdb pdb/" + protein.id + ".pdb")
     6      atomList = []
     7      for atom in structure.get_atoms():
     8          atomList.append(atom)
     9      nbs = NeighborSearch(atomList)
    10      for mutation in protein.pointMutations:
    11          ## Write resfile and get all residues within 6 Angstrom of mutated residues using Biopython package.
    12          with open("resfile" + protein.id, "w") as resfile:
    13              resfile.write("NATRO\nstart\n")
    14              affectedList = []
    15              residue = structure.get_chains().next()[mutation]
    16              for atom in Selection.unfold_entities(residue, "A"):
    17                  for neighbor in nbs.search(atom.get_coord(), 6, level="R"):
    18                      affectedList.append(get_resi(neighbor))
    19              resfile.write(str(mutation) + " " + protein.chain + " PIKAA " + protein.pointMutations[mutation][1] + "\n")
    20              ## Make affectedList unique.
    21              affectedList = list(set(affectedList))
    22              pivotList = []
    23              for res in affectedList:
    24                  if res != mutation:
    25                      resfile.write(str(res) + " " + protein.chain + " NATAA\n")
    26                  pivotList.append(res)
    27                  if res != 1:
    28                      pivotList.append(res - 1)
    29                  if res != len(protein.seq):
    30                      pivotList.append(res + 1)
    31              pivotList = list(set(pivotList))
    32          ## Run backrub sampling with 10.000 Monte Carlo iterations.
    33          command = ROSETTADIR + "/main/source/bin/backrub.linuxgccrelease -database " + ROSETTADIR + \
    34              "/main/database -s pdb/" + protein.id + ".pdb -nstruct 1 -backrub:ntrials " + \
    35              str(10000) + " -resfile resfile" + protein.id + " -pivot_residues "
    36          for pivot in pivotList:
    37              command += str(pivot) + " "
    38          os.system(command)
    39          ## Refine side chains using rotamer libraries (SCWRL package), switch hashes to activate. Deactivation prefered.
    40          # os.system("./scwrl4/Scwrl4 -i " + protein.id + "_0001.pdb -o pdb/" + protein.id + ".pdb")
    41          os.system("mv " + protein.id + "_0001.pdb pdb/" + protein.id + ".pdb")
    42          ## Remove temporary files.
    43          os.system("rm " + protein.id + "_* resfile" + protein.id)
    44      ## Fix PDB (only needed when side chain refinement is not used)
    45      fix_pdb(protein.id)
    46      print "Protein " + protein.id + " modeled using RosettaBackrub and SCWRL4."
    47      return protein

Creating reliable models of point-mutated structures is one of the most crucial parts of FIGARO. A scientifically validated strategy is therefore indispensable, and several points of attention should be taken into consideration. The complete implementation of all steps is disclosed in code snippet 3.9. Using the Biopython bioinformatics software package, the template PDB file is parsed and all atoms are loaded into memory. After that, all residues having atoms within a 6 Å radius of the mutated amino acid are selected. These are later added to the job command together with their neighbouring residues to define which points in the sequence will be used as backrub pivots; all these residues are assumed to be affected by the point mutation. The RosettaBackrub algorithm also requires a so-called resfile to work. This file defines the exact location of the mutated residue and its new value. Moreover, it contains lines specifying all affected residues as native amino acids with repacked side chains (using the NATAA flag on line 26 in code snippet 3.9), enabling their side chain rotamers to be sampled while preserving the native amino acid type. The third and last required input is the number of Monte Carlo iterations. By default, this is set to 10 000, as suggested in the literature [54]. Increasing this number should make the predictions more reliable at the cost of longer runtimes, leading to another trade-off between application speed and precision. Functionality to refine side chain conformations with SCWRL4 [55] is also available, but is not activated by default. After the model has been generated, all temporary files are deleted and the new PDB file is stored in the appropriate folder. The id field in the Protein object links to this file by name.
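
The neighbour selection and the resfile layout described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the code of snippet 3.9: the helper names residues_within and build_resfile, the plain coordinate dictionary standing in for a structure parsed with Biopython, and the NATRO default line are all hypothetical choices. PIKAA and NATAA are the resfile commands for forcing a residue type and for repacking a native residue.

```python
import math

def residues_within(atoms, mutated_resnum, cutoff=6.0):
    """Return the sorted residue numbers that have at least one atom within
    `cutoff` angstrom of any atom of the mutated residue.

    `atoms` maps residue number -> list of (x, y, z) atom coordinates
    (a hypothetical stand-in for a parsed PDB structure).
    """
    target = atoms[mutated_resnum]
    selected = set()
    for resnum, coords in atoms.items():
        if resnum == mutated_resnum:
            continue
        if any(math.dist(a, b) <= cutoff for a in coords for b in target):
            selected.add(resnum)
    return sorted(selected)

def build_resfile(mutated_resnum, new_aa, affected, chain="A"):
    """Compose resfile text: mutate one position, repack the affected ones."""
    lines = ["NATRO", "start",
             "%d %s PIKAA %s" % (mutated_resnum, chain, new_aa)]
    for resnum in affected:
        # NATAA keeps the native amino acid but allows rotamer sampling.
        lines.append("%d %s NATAA" % (resnum, chain))
    return "\n".join(lines) + "\n"

# Toy coordinates: residues 11 and 40 lie within 6 angstrom of residue 10.
atoms = {
    10: [(0.0, 0.0, 0.0)],
    11: [(3.8, 0.0, 0.0)],
    12: [(7.6, 0.0, 0.0)],
    40: [(0.0, 5.0, 0.0)],
}
affected = residues_within(atoms, 10)
resfile_text = build_resfile(10, "W", affected)
```

In a real run the coordinates would come from the parsed template structure, and the chain identifier would be taken from the Protein object rather than defaulting to "A".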

Code Snippet 3.10: Implementation of the ligand docking process using rDock [56].

 1  def dock(protein):
 2      print "Docking started for protein " + protein.id + "."
 3      bestScore = None
 4      bestBindingSite = None
 5      ## Convert pdb to mol2 file using OpenBabel package.
 6      os.system("babel -h ./pdb/" + protein.id + ".pdb " + protein.id + ".mol2")
 7      for bindingSite in protein.bindingSites:
 8          ## First write system file.
 9          with open("rDockSystem_" + protein.id + ".prm", "w") as systemFile:
10              systemFile.write("RBT_PARAMETER_FILE_V1.00\nTITLE gart_DUD\nRECEPTOR_FILE " + protein.id + ".mol2\nRECEPTOR_FLEX 3.0\n" +
11                               "SECTION MAPPER\nSITE_MAPPER RbtSphereSiteMapper\nCENTER " + bindingSite + "\nRADIUS 15.0\n" +
12                               "SMALL_SPHERE 1.5\nMIN_VOLUME 100\nMAX_CAVITIES 1\nVOL_INCR 0.0\nGRIDSTEP 0.5\nEND_SECTION\nSECTION CAVITY\nSCORING_FUNCTION RbtCavityGridSF\nWEIGHT 1.0\nEND_SECTION")
13          ## Generate cavity for docking.
14          os.system("rbcavity -was -d -r rDockSystem_" + protein.id + ".prm")
15          ## Perform docking.
16          os.system("rbdock -i ligand.sd -o output_" + protein.id + " -r rDockSystem_" + protein.id +
17                    ".prm -p dock.prm -n 1")
18          ## Return docking score.
19          with open("output_" + protein.id + ".sd", "r") as dockingResult:
20              for line in dockingResult:
21                  if line == ">  <SCORE>\n":
22                      dockingScore = float(dockingResult.next().strip())
23                      if bestScore is None or dockingScore < bestScore:
24                          bestScore = dockingScore
25                          bestBindingSite = bindingSite
26      protein.set_score(bestScore)
27      protein.set_best_binding_site(bestBindingSite)
28      ## Remove temporary files
29      os.system("rm " + protein.id + ".mol2 output_" + protein.id + ".sd rDockSystem_" + protein.id +
30                "_cav1.grd rDockSystem_" + protein.id + ".as rDockSystem_" + protein.id + ".prm")
31      print "Docking finished for protein " + protein.id + "."
32      return protein

3.6 Ligand docking

In this section, we discuss the ligand docking process and its implementation as shown in snippet 3.10. This blind docking procedure consists of four main parts: converting the PDB file to a mol2 file, generating the receptor cavity at the targeted binding site, performing the actual ligand docking job, and selecting the best docking score across the different binding sites.
Again, several parameters need to be set adequately for reliable results. Based on suggestions made in rDock's documentation and some intuition about the given ligand molecule, these parameters can be set up in a fairly straightforward way by generating a .prm rDock system file that satisfies the format constraints given in rDock's documentation (see footnote 1). The content we used for our molecule is shown in lines 10 to 12 of code snippet 3.10. First, the receptor mol2 file and the tolerated backbone flexibility parameter are defined. We opted for a 3 Å flexibility range, which should provide a substantial degree of conformational adaptability. Next, the method of cavity grid construction is specified. For our purpose, we recommend the RbtSphereSiteMapper algorithm provided by rDock. This method uses a small sphere with adjustable radius to explore the surface around the given center coordinates. Moreover, a maximum binding site radius is defined, which should be chosen according to the structural information of the ligand compound. The maximum number of binding pockets to be constructed was set to 1. For the other parameters, we used the default values, as they should not alter results significantly. However, all parameters mentioned in this paragraph should be carefully chosen and evaluated for every single docking situation.

Footnote 1: The full documentation for rDock can be downloaded from the project's website at http://rdock.sourceforge.net/
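
To make the role of the individual parameters easier to see, the system file can be composed from named arguments instead of one long string. The sketch below is an illustration only: the function build_rdock_prm, its argument names, the example receptor file name and the way the center is formatted as "(x,y,z)" are assumptions, while the keywords and values mirror those written in snippet 3.10.

```python
def build_rdock_prm(receptor_mol2, center, radius=15.0, flex=3.0,
                    small_sphere=1.5, max_cavities=1):
    """Return the text of an rDock system definition (.prm) file.

    `center` is an (x, y, z) tuple for the sphere site mapper; how FIGARO
    stores binding site coordinates is not shown here, so this is assumed.
    """
    lines = [
        "RBT_PARAMETER_FILE_V1.00",
        "TITLE gart_DUD",
        "RECEPTOR_FILE %s" % receptor_mol2,
        "RECEPTOR_FLEX %.1f" % flex,
        "SECTION MAPPER",
        "    SITE_MAPPER RbtSphereSiteMapper",
        "    CENTER (%.3f,%.3f,%.3f)" % tuple(center),
        "    RADIUS %.1f" % radius,
        "    SMALL_SPHERE %.1f" % small_sphere,
        "    MIN_VOLUME 100",
        "    MAX_CAVITIES %d" % max_cavities,
        "    VOL_INCR 0.0",
        "    GRIDSTEP 0.5",
        "END_SECTION",
        "SECTION CAVITY",
        "    SCORING_FUNCTION RbtCavityGridSF",
        "    WEIGHT 1.0",
        "END_SECTION",
    ]
    return "\n".join(lines) + "\n"

# Hypothetical receptor file and binding site center.
prm_text = build_rdock_prm("1abc.mol2", (12.5, -3.0, 7.25))
```

Exposing RADIUS, RECEPTOR_FLEX and MAX_CAVITIES as arguments makes it easier to re-evaluate them for every docking situation, as recommended above.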
To be able to use our models, they need to be converted to a different file format. The OpenBabel package takes care of that and generates the appropriate mol2 files. It is important to note that FIGARO has many dependencies that all need to be installed and set up correctly in order for it to work. For problems with this in further research, please contact the author.
The docking scores produced by rDock represent changes in free energy. For thermodynamically favourable binding situations, these will be negative. Absolute values are therefore used as a fitness measure in our application. Also note that the docking scores themselves result from a genetic algorithm within rDock. They are estimates that can be made more precise by increasing the number of generations or by running multiple docking jobs and taking averages. To shorten the runtime of our application, we perform each docking job only once. In the end, a fitness score only serves as a guideline and does not need to be too precise.
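
The step from docking output to fitness value can be illustrated as follows. This is a hedged sketch, not FIGARO's implementation: the helper names read_scores and fitness are hypothetical, the sample text only imitates the SCORE field of an SD file, and clamping non-negative averages to zero is an extra assumption that the text above does not make.

```python
def read_scores(sd_text):
    """Collect every value stored under the '>  <SCORE>' field of SD text."""
    scores = []
    lines = sd_text.splitlines()
    for i, line in enumerate(lines[:-1]):
        if line.strip() == ">  <SCORE>":
            # The value sits on the line directly below the field header.
            scores.append(float(lines[i + 1].strip()))
    return scores

def fitness(scores):
    """Average several docking scores and return the magnitude as fitness.

    Favourable (negative) averages yield a positive fitness; averages that
    are zero or positive yield 0.0 (an assumption made for this sketch).
    """
    mean = sum(scores) / len(scores)
    return abs(mean) if mean < 0 else 0.0

# Two imitated docking records with one SCORE field each.
sample = (
    "pose1\n>  <SCORE>\n-20.5\n\n$$$$\n"
    "pose2\n>  <SCORE>\n-19.5\n\n$$$$\n"
)
scores = read_scores(sample)
value = fitness(scores)
```

Averaging over several runs trades runtime for precision; with a single docking job per protein, as in the current setup, the list simply contains one score.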

Code Snippet 3.11: Implementation of the two-points crossover operator.

    def two_points_crossover(population, pctX):
        if not len(population) > 1:
            population[0].set_parents(protA=population[0])
            population[0].set_id(population[0].id + "_recombined")
            print "No crossover possible."
        else:
            random.shuffle(population)
            for index, protein in enumerate(population):
                if random.random() <= pctX:
                    if index != len(population) - 1:
                        minLength = min([len(protein.seq), len(population[index+1].seq)])
                        crossoverPoint1 = random.randint(0, minLength)
                        crossoverPoint2 = random.randint(crossoverPoint1, minLength)
                        protein.set_parents(protA=protein, protB=population[index+1],
                                            crossoverPoint1=crossoverPoint1,
                                            crossoverPoint2=crossoverPoint2)
                        protein.update_seq(protein.seq[0:crossoverPoint1] +
                                           population[index+1].seq[crossoverPoint1:crossoverPoint2] +
                                           protein.seq[crossoverPoint2:])
                        protein.set_id(protein.id + "_recombined")
                    else:
                        minLength = min([len(protein.seq), len(population[index-1].parents[0][1])])
                        crossoverPoint1 = random.randint(0, minLength)
                        crossoverPoint2 = random.randint(crossoverPoint1, minLength)
                        protein.set_parents(protA=protein, protB=population[index-1],
                                            crossoverPoint1=crossoverPoint1,
                                            crossoverPoint2=crossoverPoint2)
                        protein.update_seq(protein.seq[0:crossoverPoint1] +
                                           population[index-1].parents[0][1][crossoverPoint1:crossoverPoint2] +
                                           protein.seq[crossoverPoint2:])
                        protein.set_id(protein.id + "_recombined")
                    print "Crossover finished."
                else:
                    protein.set_parents(protA=protein)
                    protein.set_id(protein.id + "_recombined")
                    print "No crossover."

3.7 Modeling recombined sequences

Modeling recombined sequences can be a very tricky task. When large parts of random protein sequences are exchanged, the conformational topology is likely to be altered dramatically and it is not guaranteed that the structure will even be stable. Moreover, a full ab initio remodeling of the generated protein sequences would be intractable with current computational resources. However, homologous sequences are far less likely to introduce large conformational changes upon recombination, a fact that enables crossover to be a major driver of natural evolution. By splitting up our genetic algorithm into subpopulations, we can group proteins by sequence homology and have them mate in isolated islands; this way, recombined sequences can be modeled with high precision and reliability using homology modeling solutions.


To generate recombined sequences, two crossover operators were implemented. One of them, the two-points crossover operator, is shown in snippet 3.11. The procedure is repeated for all proteins in the subpopulation. For every individual, a random float is generated. If the value is smaller than or equal to the fixed pctX parameter defined in snippet 3.1, a subsequence of arbitrary length is selected between two points and copied into the individual from its neighbour in the shuffled subpopulation. Afterwards, metadata about the parental chains and the crossover points is saved in the Protein object. If the value is bigger than pctX, the sequence remains unaltered; the protein is registered as its own single parent and no crossover points are set.
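
Stripped of the bookkeeping on Protein objects, the core of the two-points crossover reduces to slicing two strings. The following sketch shows only that core; the function name, the returned tuple and the toy sequences are hypothetical and chosen for illustration.

```python
import random

def two_point_crossover(seq_a, seq_b, rng=random):
    """Return (child, point1, point2), where child is seq_a with the slice
    [point1:point2] replaced by the same slice of seq_b."""
    min_length = min(len(seq_a), len(seq_b))
    point1 = rng.randint(0, min_length)
    point2 = rng.randint(point1, min_length)
    child = seq_a[:point1] + seq_b[point1:point2] + seq_a[point2:]
    return child, point1, point2

# Toy sequences make the exchanged segment easy to spot.
rng = random.Random(42)
child, p1, p2 = two_point_crossover("AAAAAAAAAA", "CCCCCCCCCC", rng)
```

Because both crossover points are drawn within the length of the shorter sequence, the child always keeps the length of the first parent, which is what allows the recombinant to be aligned position by position to its parents later on.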
Code Snippet 3.12: Implementation of the homology modeling procedure using MODELLER [42].

    def predict_structure_homology_fast(protein):
        if len(protein.parents) > 1:
            write_aln(protein)
            env = modeller.environ()
            env.io.atom_files_directory = "./pdb"
            templates = []
            for parent in protein.parents:
                templates.append(parent[0])
            a = modeller.automodel.automodel(env, alnfile=protein.id + "_total.ali",
                                             knowns=tuple(templates), sequence=protein.id)
            a.starting_model = 1
            a.ending_model = 1
            a.make()
            ## Refine side chains using rotamer libraries (SCWRL package), switch hashes to activate. Deactivation preferred.
            # os.system("./scwrl4/Scwrl4 -i " + protein.id + ".B99990001.pdb -o final_" + protein.id + ".pdb")
            os.system("cp " + protein.id + ".B99990001.pdb final_" + protein.id + ".pdb")
            ## Replace previous PDB with model.
            os.system("mv final_" + protein.id + ".pdb ./pdb/" + protein.id + ".pdb")
            ## Fix PDB (only needed when side chain refinement is not used)
            fix_pdb(protein.id)
            ## Remove temporary files.
            os.system("rm " + protein.id + "*")
            ## Modeller does not add chain ids by itself.
            add_chain_id(protein.id, protein.chain)
            print "Protein " + protein.id + " modeled using MODELLER."
            return protein
        else:
            os.system("cp pdb/" + protein.parents[0][0] + ".pdb pdb/" + protein.id + ".pdb")
            return protein

With sequence identities of 95% and above, comparative modeling algorithms should be capable of returning highly reliable models of the recombined protein sequences [41, 42]. Another asset is the fact that the program knows exactly how sequences have evolved through the course of the application. It is therefore very straightforward to supply correct alignments to MODELLER, the comparative modeling package we opted for. We implemented a very fast strategy to do this, as disclosed in code snippet 3.12. In a typical workflow, sequence databases are screened for homologous structures and user interaction is needed to select a suitable template. This takes a lot of time and can be skipped entirely in our implementation. Alignments are constructed directly by the write_aln function, which uses the sequence metadata stored in the Protein objects. The implementation of this function can be found in the attachments chapter (appendix B). Again, functionality to refine side chains using SCWRL4 is available but not enabled by default.
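
Whether a recombined sequence actually stays above the 95% identity mark with respect to its parents can be checked with a simple ungapped comparison. The sketch below is an illustration under assumptions: the helper percent_identity and the toy sequences are hypothetical, and the position-by-position comparison is only valid because the crossover keeps sequence positions in register.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of identical positions in an ungapped, position-by-position
    comparison; the shorter sequence determines the compared length."""
    length = min(len(seq_a), len(seq_b))
    if length == 0:
        return 0.0
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return 100.0 * matches / length

# Two toy parents that differ at a single position (index 16).
parent_a = "MKTAYIAKQRQISFVKSHFS"
parent_b = "MKTAYIAKQRQISFVKAHFS"
# Recombinant taking positions 5 to 17 from parent_b.
child = parent_a[:5] + parent_b[5:18] + parent_a[18:]

identity_to_a = percent_identity(child, parent_a)
identity_to_b = percent_identity(child, parent_b)
```

A check of this kind could serve as a guard before homology modeling, for example by skipping or flagging recombinants whose identity to either parent drops below a chosen threshold.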

3.8 The selection operator

In the current revision of FIGARO, only one selection operator is implemented; it is shown in code snippet 3.13. The roulette wheel selection procedure used is completely analogous to the description given in section 2.4, starting at page 40. In addition, our roulette wheel selection function takes two extra arguments, the first being pctE, which defines the percentage of elitism. This parameter is one of the global program parameters shown in snippet 3.1. Having said that, it should be noted that the current implementation of the selection operator does not support elitism yet: pctE is accepted but not used. To extend the capabilities of our genetic algorithm, this needs to be fixed in a next release.
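
One possible way to add elitism is to copy the best fraction of the population unchanged and fill the remaining slots by roulette wheel selection. The following is a sketch of that idea, not part of FIGARO: the function name, the representation of individuals as (identifier, fitness) pairs and the rounding of the elite count are assumptions.

```python
import random

def select_with_elitism(population, pct_e, rng=random):
    """Select len(population) individuals: the top pct_e fraction is kept
    as-is, the rest is drawn fitness-proportionally with replacement.

    `population` is a list of (identifier, fitness) pairs with
    non-negative fitness values, at least one of them positive.
    """
    size = len(population)
    n_elite = int(round(pct_e * size))
    ranked = sorted(population, key=lambda p: p[1], reverse=True)
    selected = ranked[:n_elite]
    if size > n_elite:
        weights = [p[1] for p in population]
        selected += rng.choices(population, weights=weights, k=size - n_elite)
    return selected

population = [("a", 5.0), ("b", 20.0), ("c", 10.0), ("d", 1.0)]
selected = select_with_elitism(population, 0.25, random.Random(0))
```

With pctE = 0.25 and four individuals, exactly one elite individual survives unconditionally; the other three slots remain subject to the usual roulette wheel.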
The other additional argument is a boolean specifying whether PDB files need to be copied and replaced for the selected proteins. This is only desirable for selection within the subpopulations before recombination. The second selection step happens at the population level, when new template proteins need to be chosen for the next generation, and is followed by another section of the program with other requirements. Therefore, the boolean is used to switch between the two levels of selection. Tournament selection (also described in section 2.4) is another approach that comes to mind for the latter selection round and could be implemented in future releases of the application. It provides more flexibility regarding selection pressure. For example, it can give more weight to the best individuals from every subpopulation by increasing selection pressure.
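
How tournament selection controls selection pressure can be seen in a short sketch. It is an illustration only and not implemented in FIGARO: the function name and the (identifier, fitness) representation are assumptions. A larger tournament size means stronger pressure towards the best individuals.

```python
import random

def tournament_selection(population, tournament_size, rng=random):
    """Select len(population) individuals; each one is the fittest of
    `tournament_size` contestants sampled without replacement.

    `population` is a list of (identifier, fitness) pairs.
    """
    selected = []
    for _ in range(len(population)):
        contestants = rng.sample(population, tournament_size)
        selected.append(max(contestants, key=lambda p: p[1]))
    return selected

population = [("a", 5.0), ("b", 20.0), ("c", 10.0), ("d", 1.0)]
# Tournament size 1 is plain random sampling (no selection pressure).
weak = tournament_selection(population, 1, random.Random(1))
# Tournament size 4 always returns the single best individual.
strong = tournament_selection(population, 4, random.Random(1))
```

Intermediate tournament sizes interpolate between these two extremes, which is the flexibility referred to above.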

Code Snippet 3.13: Implementation of the selection operator.

    def roulette_wheel_selection(population, pctE, replace=True):
        newPopulation = []
        for i in range(0, len(population)):
            totalScore = 0
            for protein in population:
                totalScore += protein.score
            rouletteResult = random.uniform(0, totalScore)
            index = 0
            currentTotal = population[index].score
            while currentTotal < rouletteResult and index < len(population) - 1:
                index += 1
                currentTotal += population[index].score
            if replace == True:
                newId = population[index].id[0:4] + "_selected_" + str(i+1)
                newPopulation.append(Protein(newId, population[index].seq,
                                             population[index].chain,
                                             population[index].bindingSites))
                ## Rename PDBs to link with new ids.
                os.system("cp pdb/" + population[index].id + ".pdb pdb/" + newId + ".pdb")
            if replace == False:
                newPopulation.append(population[index])
        if replace == True:
            ## Clean up PDB folder.
            for protein in population:
                os.system("rm pdb/" + protein.id + ".pdb")
        print "Roulette Wheel Selection succeeded."
        return newPopulation

Chapter 4

Discussion


A program like FIGARO is of little use until it has been experimentally validated. Within the scope of this master's thesis, this was not feasible. For the same reason, no results of the application are given; these would be meaningless without experimental verification. This thesis was written with the aim of providing a new perspective on computational receptor optimization, taking into consideration as many biochemical and technical points of attention as possible. For every aspect of the application, strong scientific documentation is available to give it a foundation to build on. However, this is not enough to guarantee flawless behaviour. To be more precise, I would not even recommend it for use right now.
The first thing that obviously needs to be done is verifying the generated structures. It is very likely that they differ significantly from experimentally derived structures. On the other hand, FIGARO offers many parameters that can be fine-tuned in order to get more reliable results. A pipeline based on experimental feedback could greatly improve FIGARO's performance. For now, however, it is largely a matter of guesswork whether results will even approach the real structures. The evaluation of this application by fellow researchers is another important source of improvement. As I implemented this program completely on my own, it could be greatly improved by independent evaluation. Having said this, I hereby kindly request any possible support from the scientific community.
Furthermore, a lot of computational functionality that has not been implemented yet could easily be added in future releases. This would give the program even more flexibility and potential, as was noted in section 3.8. For example, the number of implemented evolutionary operators is rather limited due to the restricted time frame. Another notable example in this regard is the distributed computing support. Although this was implemented to a certain degree, much larger improvements in runtime can be achieved. A main suggestion is to integrate support for Apache Hadoop and MapReduce to make use of as much computational power as possible and lift the application to a whole new level of computational ability.

Chapter 5

Conclusion


FIGARO is a promising new approach to optimizing artificially designed receptor structures for a given ligand. While computational optimization of ligand molecules is often seen in the literature, this work tackles the problem the other way around, optimizing receptor structures for arbitrary ligand compounds. Although the program's reliability has not yet been tested against experimentally determined structures, it shows potential in several fields of research. The one that particularly captures our imagination is without a doubt its application in cancer immunotherapy. FIGARO offers a new bridge between very specific, even artificially designed chemical structures and proteins that can vary widely in function. Unique compounds on the surface of cancer cells could act as input structures for FIGARO to target.
Eventually, not only receptors could be optimized. Ultimately, this project aims at providing a starting point for the design of new enzymes involved in all kinds of cellular processes. Binding substrates through an induced fit mechanism is one of the key concepts in enzyme dynamics. FIGARO currently tries to optimize just that, but further additions concerning enzyme mechanics are indispensable for it to be useful in artificial enzyme design. The most important consideration in this regard is the stabilization of the transition state complex of a reaction. Integrating this functionality is not trivial and certainly requires more than one full-time equivalent, but FIGARO's backbone should nevertheless support it by extending the implementation of the fitness score calculation.
As a concluding remark, FIGARO was not made with the intention of being the holy grail for cancer treatment or artificial enzyme design, but aims to provide a new and promising approach to computational receptor optimization. With the 50-year-old problem of protein structure prediction and the many advances already achieved in this field, the availability of enormous distributed computing resources and the proven usability of very sophisticated evolutionary algorithms, FIGARO goes down a path of future potential.

Part III

Appendix


Appendix A

Bibliography

[1] I. Rocha, P. Maia, P. Evangelista, P. Vilaca, S. Soares, J. P. Pinto, J. Nielsen, K. R. Patil, E. C. Ferreira, and M. Rocha. OptFlux: an open-source software platform for in silico metabolic engineering. BMC Syst Biol, 4:45, Apr 2010. [PubMed Central:PMC2864236] [DOI:10.1186/1752-0509-4-45] [PubMed:20403172].
[2] P. Pharkya, A. P. Burgard, and C. D. Maranas. OptStrain: a computational framework for redesign of microbial production systems. Genome Res., 14(11):2367-2376, Nov 2004. [PubMed Central:PMC525696] [DOI:10.1101/gr.2872004] [PubMed:15520298].
[3] O. Khersonsky, D. Rothlisberger, O. Dym, S. Albeck, C. J. Jackson, D. Baker, and D. S. Tawfik. Evolutionary optimization of computationally designed enzymes: Kemp eliminases of the KE07 series. J. Mol. Biol., 396(4):1025-1042, Mar 2010. [DOI:10.1016/j.jmb.2009.12.031] [PubMed:20036254].
[4] L. Giger, S. Caner, R. Obexer, P. Kast, D. Baker, N. Ban, and D. Hilvert. Evolution of a designed retro-aldolase leads to complete active site remodeling. Nat. Chem. Biol., 9(8):494-498, Aug 2013. [PubMed Central:PMC3720730] [DOI:10.1038/nchembio.1276] [PubMed:23748672].
[5] V. Nanda and R. L. Koder. Designing artificial enzymes by intuition and computation. Nat Chem, 2(1):15-24, Jan 2010. [PubMed Central:PMC3443871] [DOI:10.1038/nchem.473] [PubMed:21124375].
[6] S. Paul, S. A. Planque, Y. Nishiyama, C. V. Hanson, and R. J. Massey. Nature and nurture of catalytic antibodies. Adv. Exp. Med. Biol., 750:56-75, 2012. [DOI:10.1007/978-1-4614-3461-0_5] [PubMed:22903666].
[7] Y. Xu, N. Yamamoto, and K. D. Janda. Catalytic antibodies: hapten design strategies and screening methods. Bioorg. Med. Chem., 12(20):5247-5268, Oct 2004. [DOI:10.1016/j.bmc.2004.03.077] [PubMed:15388154].
[8] D. Hilvert. Critical analysis of antibody catalysis. Annu. Rev. Biochem., 69:751-793, 2000. [DOI:10.1146/annurev.biochem.69.1.751] [PubMed:10966475].
[9] C. Jackel, P. Kast, and D. Hilvert. Protein design by directed evolution. Annu Rev Biophys, 37:153-173, 2008. [DOI:10.1146/annurev.biophys.37.032807.125832] [PubMed:18573077].
[10] P. Molina-Espeja, E. Garcia-Ruiz, D. Gonzalez-Perez, R. Ullrich, M. Hofrichter, and M. Alcalde. Directed evolution of unspecific peroxygenase from Agrocybe aegerita. Appl. Environ. Microbiol., 80(11):3496-3507, Jun 2014. [PubMed Central:PMC4018863] [DOI:10.1128/AEM.00490-14] [PubMed:24682297].
[11] M. Sela, F. H. White, and C. B. Anfinsen. Reductive cleavage of disulfide bridges in ribonuclease. Science, 125(3250):691-692, Apr 1957. [PubMed:13421663].
[12] Carl Branden and John Tooze. Introduction to Protein Structure. Garland Science, 1999.
[13] S. C. Lovell, I. W. Davis, W. B. Arendall, P. I. de Bakker, J. M. Word, M. G. Prisant, J. S. Richardson, and D. C. Richardson. Structure validation by Calpha geometry: phi,psi and Cbeta deviation. Proteins, 50(3):437-450, Feb 2003. [DOI:10.1002/prot.10286] [PubMed:12557186].
[14] F. Crick. Central dogma of molecular biology. Nature, 227(5258):561-563, Aug 1970. [PubMed:4913914].

[15] Alka Dwevedi. Protein Folding: Examining the Challenges from Synthesis to Folded Form (SpringerBriefs in Biochemistry and Molecular Biology). Springer, 2014.
[16] C. B. Anfinsen. Principles that govern the folding of protein chains. Science, 181(4096):223-230, Jul 1973. [PubMed:4124164].
[17] C. B. Anfinsen and H. A. Scheraga. Experimental and theoretical aspects of protein folding. Adv. Protein Chem., 29:205-300, 1975. [PubMed:237413].
[18] C. Levinthal. How to Fold Graciously. Mossbauer Spectroscopy in Biological Systems: Proceedings of a meeting held at Allerton House, Monticello, Illinois, 1969.
[19] K. A. Dill. Polymer principles and protein folding. Protein Sci., 8(6):1166-1180, Jun 1999. [PubMed Central:PMC2144345] [DOI:10.1110/ps.8.6.1166] [PubMed:10386867].
[20] Louise A Wallace and C Robert Matthews. Sequential vs. parallel protein-folding mechanisms: experimental tests for complex folding reactions. Biophysical Chemistry, 101-102:113-131, 2002. Special issue in honour of John A Schellman.
[21] L. Pauling, R. B. Corey, and H. R. Branson. The structure of proteins; two hydrogen-bonded helical configurations of the polypeptide chain. Proc. Natl. Acad. Sci. U.S.A., 37(4):205-211, Apr 1951. [PubMed Central:PMC1063337] [PubMed:14816373].
[22] M. Novotny and G. J. Kleywegt. A survey of left-handed helices in protein structures. J. Mol. Biol., 347(2):231-241, Mar 2005. [DOI:10.1016/j.jmb.2005.01.037] [PubMed:15740737].

[23] W. J. van Heeckeren, J. W. Sellers, and K. Struhl. Role of the conserved leucines in the leucine zipper dimerization motif of yeast GCN4. Nucleic Acids Res., 20(14):3721-3724, Jul 1992. [PubMed Central:PMC334023] [PubMed:1641337].
[24] L. Lo Conte, B. Ailey, T. J. Hubbard, S. E. Brenner, A. G. Murzin, and C. Chothia. SCOP: a structural classification of proteins database. Nucleic Acids Res., 28(1):257-259, Jan 2000. [PubMed Central:PMC102479] [PubMed:10592240].
[25] F. M. Pearl, C. F. Bennett, J. E. Bray, A. P. Harrison, N. Martin, A. Shepherd, I. Sillitoe, J. Thornton, and C. A. Orengo. The CATH database: an extended protein family resource for structural and functional genomics. Nucleic Acids Res., 31(1):452-455, Jan 2003. [PubMed Central:PMC165509] [PubMed:12520050].
[26] R. D. Finn, A. Bateman, J. Clements, P. Coggill, R. Y. Eberhardt, S. R. Eddy, A. Heger, K. Hetherington, L. Holm, J. Mistry, E. L. Sonnhammer, J. Tate, and M. Punta. Pfam: the protein families database. Nucleic Acids Res., 42(Database issue):D222-230, Jan 2014. [PubMed Central:PMC3965110] [DOI:10.1093/nar/gkt1223] [PubMed:24288371].
[27] H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov, and P. E. Bourne. The Protein Data Bank. Nucleic Acids Res., 28(1):235-242, Jan 2000. [PubMed Central:PMC102472] [PubMed:10592235].
[28] H. Berman, K. Henrick, H. Nakamura, and J. L. Markley. The worldwide Protein Data Bank (wwPDB): ensuring a single, uniform archive of PDB data. Nucleic Acids Res., 35(Database issue):D301-303, Jan 2007. [PubMed Central:PMC1669775] [DOI:10.1093/nar/gkl971] [PubMed:17142228].

[29] P. J. Cock, T. Antao, J. T. Chang, B. A. Chapman, C. J. Cox, A. Dalke, I. Friedberg, T. Hamelryck, F. Kauff, B. Wilczynski, and M. J. de Hoon. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics, 25(11):1422-1423, Jun 2009. [PubMed Central:PMC2682512] [DOI:10.1093/bioinformatics/btp163] [PubMed:19304878].
[30] David L. Nelson and Michael M. Cox. Lehninger Principles of Biochemistry. W. H. Freeman, 2008.
[31] H. J. Schneider. Limitations and extensions of the lock-and-key principle: differences between gas state, solution and solid state structures. Int J Mol Sci, 16(4):6694-6717, Mar 2015. [PubMed Central:PMC4424984] [DOI:10.3390/ijms16046694] [PubMed:25815592].
[32] W. L. Jorgensen. Rusting of the lock and key model for protein-ligand binding. Science, 254(5034):954-955, Nov 1991. [PubMed:1719636].
[33] T. Keleti. Two rules of enzyme kinetics for reversible Michaelis-Menten mechanisms. FEBS Lett., 208(1):109-112, Nov 1986. [PubMed:3770204].
[34] Y. Liu and B. Kuhlman. RosettaDesign server for protein design. Nucleic Acids Res., 34(Web Server issue):W235-238, Jul 2006. [PubMed Central:PMC1538902] [DOI:10.1093/nar/gkl163] [PubMed:16845000].
[35] J. C. Kendrew, G. Bodo, H. M. Dintzis, R. G. Parrish, H. Wyckoff, and D. C. Phillips. A three-dimensional model of the myoglobin molecule obtained by x-ray analysis. Nature, 181(4610):662-666, Mar 1958. [PubMed:13517261].
[36] B. Rost and C. Sander. Bridging the protein sequence-structure gap by structure predictions. Annu Rev Biophys Biomol Struct, 25:113-136, 1996. [DOI:10.1146/annurev.bb.25.060196.000553] [PubMed:8800466].

[37] A. Ilari and C. Savino. Protein structure determination by x-ray crystallography. Methods Mol. Biol., 452:63-87, 2008. [DOI:10.1007/978-1-60327-159-2_3] [PubMed:18563369].
[38] B. Montgomery Pettitt. The unsolved solved-problem of protein folding. J. Biomol. Struct. Dyn., 31(9):1024-1027, 2013. [PubMed Central:PMC4497552] [DOI:10.1080/07391102.2012.748547] [PubMed:23384146].
[39] K. Lindorff-Larsen, S. Piana, R. O. Dror, and D. E. Shaw. How fast-folding proteins fold. Science, 334(6055):517-520, Oct 2011. [DOI:10.1126/science.1208351] [PubMed:22034434].
[40] David E. Shaw, Martin M. Deneroff, Ron O. Dror, Jeffrey S. Kuskin, Richard H. Larson, John K. Salmon, Cliff Young, Brannon Batson, Kevin J. Bowers, Jack C. Chao, Michael P. Eastwood, Joseph Gagliardo, J. P. Grossman, C. Richard Ho, Douglas J. Ierardi, István Kolossváry, John L. Klepeis, Timothy Layman, Christine McLeavey, Mark A. Moraes, Rolf Mueller, Edward C. Priest, Yibing Shan, Jochen Spengler, Michael Theobald, Brian Towles, and Stanley C. Wang. Anton, a special-purpose machine for molecular dynamics simulation. SIGARCH Comput. Archit. News, 35(2):1-12, Jun 2007.
[41] Andrew J. W. Orry and Ruben Abagyan. Homology Modeling: Methods and Protocols (Methods in Molecular Biology). Humana Press, 2012.
[42] B. Webb and A. Sali. Protein structure modeling with MODELLER. Methods Mol. Biol., 1137:1-15, 2014. [DOI:10.1007/978-1-4939-0366-5_1] [PubMed:24573470].
[43] Andreas Kukol. Molecular Modeling of Proteins (Methods in Molecular Biology). Humana Press, 2014.

[44] I. W. Davis, W. B. Arendall, D. C. Richardson, and J. S. Richardson. The backrub motion: how protein backbone shrugs when a sidechain dances. Structure, 14(2):265-274, Feb 2006. [DOI:10.1016/j.str.2005.10.007] [PubMed:16472746].
[45] C. A. Smith and T. Kortemme. Backrub-like backbone simulation recapitulates natural protein conformational variability and improves mutant side-chain prediction. J. Mol. Biol., 380(4):742-756, Jul 2008. [PubMed Central:PMC2603262] [DOI:10.1016/j.jmb.2008.05.023] [PubMed:18547585].
[46] X. Y. Meng, H. X. Zhang, M. Mezei, and M. Cui. Molecular docking: a powerful approach for structure-based drug discovery. Curr Comput Aided Drug Des, 7(2):146-157, Jun 2011. [PubMed Central:PMC3151162] [PubMed:21534921].
[47] M. Cloutier and P. Wellstead. The control systems structures of energy metabolism. J R Soc Interface, 7(45):651-665, Apr 2010. [PubMed Central:PMC2842784] [DOI:10.1098/rsif.2009.0371] [PubMed:19828503].
[48] Michael Affenzeller, Stephan Winkler, Stefan Wagner, and Andreas Beham. Genetic Algorithms and Genetic Programming: Modern Concepts and Practical Applications. Chapman & Hall/CRC, 1st edition, 2009.
[49] Lawrence David Davis. Evolutionary Computation in Practice (Studies in Computational Intelligence). Springer, 2008.
[50] David L. Applegate, Robert E. Bixby, Vasek Chvatal, and William J. Cook. The Traveling Salesman Problem: A Computational Study (Princeton Series in Applied Mathematics). Princeton University Press, Princeton, NJ, USA, 2007.
[51] Hartmut Pohlheim. Evolutionäre Algorithmen. Verfahren, Operatoren und Hinweise für die Praxis. Springer, 1999.

BIBLIOGRAPHY

85

[52] D. Ghersi and R. Sanchez. EasyMIFS and SiteHound: a toolkit for the
identification of ligand-binding sites in protein structures. Bioinformatics, 25(23):31853186, Dec 2009. [PubMed Central:PMC2913663]
[DOI:10.1093/bioinformatics/btp562] [PubMed:19789268].
[53] M. Hernandez, D. Ghersi, and R. Sanchez. SITEHOUND-web: a server for ligand binding site identification in protein structures. Nucleic Acids Res., 37(Web Server issue):W413-416, Jul 2009. [PubMed Central:PMC2703923] [DOI:10.1093/nar/gkp281] [PubMed:19398430].
[54] F. Lauck, C. A. Smith, G. F. Friedland, E. L. Humphris, and T. Kortemme. RosettaBackrub - a web server for flexible backbone protein structure modeling and design. Nucleic Acids Res., 38(Web Server issue):W569-575, Jul 2010. [PubMed Central:PMC2896185] [DOI:10.1093/nar/gkq369] [PubMed:20462859].
[55] G. G. Krivov, M. V. Shapovalov, and R. L. Dunbrack. Improved prediction of protein side-chain conformations with SCWRL4. Proteins, 77(4):778-795, Dec 2009. [PubMed Central:PMC2885146] [DOI:10.1002/prot.22488] [PubMed:19603484].
[56] S. Ruiz-Carmona, D. Alvarez-Garcia, N. Foloppe, A. B. Garmendia-Doval, S. Juhos, P. Schmidtke, X. Barril, R. E. Hubbard, and S. D. Morley. rDock: a fast, versatile and open source program for docking ligands to proteins and nucleic acids. PLoS Comput. Biol., 10(4):e1003571, Apr 2014. [PubMed Central:PMC3983074] [DOI:10.1371/journal.pcbi.1003571] [PubMed:24722481].

Appendix B

Attachments

B.1 main.py
Code Snippet B.1: Complete source code: main.py.

########################################
####             FIGARO             ####
#### Author: Pieter Noyens          ####
#### Academic year: 2015-2016       ####
#### Student nr: r0307453           ####
#### 2MSc Bioinformatics, KU Leuven ####
########################################

## Main script containing high-level GA backbone.

from functions import *
import multiprocessing

## Set number of processor cores
workers = multiprocessing.Pool(100)

## Parameters:

## Percentage elitism
pctE = 0.25
## Percentage crossover
pctX = 0.6
## Percentage mutation
pctM = 0.005
## Population size
popSize = 100
## Subpopulation size
subpopSize = 10
## Number of putative binding sites to be inspected
nrBindingSites = 5
## Number of generations
genSize = 50
## Maximum resolution of structures in initial population
maxRes = 2.0
## Maximum number of entities in structure
maxEnt = 1
## Minimum similarity to query ligand
minSim = 0.3

## Generate random small molecule from http://bcirc.docking.org/random.shtml and get SMILES (currently hardcoded).
randMol = "CCOc1cccc(n1)NC(=O)c2cc[nH]n2"

## Initialize the population.
population = initialize_pop(randMol, popSize, subpopSize, maxRes, maxEnt, minSim, nrBindingSites, pctM)

## Write initial information for result evaluation.
write_initial_info(population)

## Perform GA on population for specified number of generations.
for generation in range(0, genSize):
    for index, subpopulation in enumerate(population):
        ## Predict initial structures.
        population[index] = workers.map(predict_structure_backrub, subpopulation)
        ## Dock ligand to all individuals in subpopulation.
        population[index] = workers.map(dock, population[index])
        ## Select individuals for mating in subpopulations.
        population[index] = roulette_wheel_selection(population[index], pctE)
        ## Perform crossover between individuals in subpopulations.
        two_points_crossover(population[index], pctX)
        #single_point_crossover(newPopulation, pctX)
        ## Predict structure of recombined sequences.
        population[index] = workers.map(predict_structure_homology_fast, population[index])
        ## Dock ligand to all recombinations in subpopulation.
        population[index] = workers.map(dock, population[index])
    if generation < genSize - 1:
        ## Select mother proteins for new population based on competition
        ## between each subpopulation's best competitor.
        motherSelection = roulette_wheel_selection(get_candidates_list(population), pctE, replace=False)
        ## Generate new population and mutate mother proteins.
        population = prepare_next_gen(population, motherSelection, subpopSize, pctM, generation)
    else:
        bestProtein = report_best(population)
        print "Program finished. The best protein is " + bestProtein.id + " with a score of " + \
              str(bestProtein.score) + " on binding site " + bestProtein.bestBindingSite + "."
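
The control flow of main.py (evolve every subpopulation on its own, then let the best individual of each subpopulation found the next generation) can be illustrated without any of the external tools. The following is only a toy sketch in Python 3, not the FIGARO implementation: individuals are plain numbers, `evolve_islands`, `fitness` and `mutate` are hypothetical names, and the roulette step and crossover of the listing are deliberately left out so that every champion simply seeds one new subpopulation.

```python
import random

def evolve_islands(islands, fitness, mutate, generations, rng):
    """Toy island-model loop; lower fitness is better (as with docking scores)."""
    for _ in range(generations):
        # The best individual of each island is its champion.
        champions = [min(island, key=fitness) for island in islands]
        size = len(islands[0])
        # Each champion founds a new island: itself plus mutated copies.
        islands = [[champ] + [mutate(champ, rng) for _ in range(size - 1)]
                   for champ in champions]
    return min((ind for island in islands for ind in island), key=fitness)

rng = random.Random(0)
islands = [[rng.uniform(-10, 10) for _ in range(5)] for _ in range(4)]
best = evolve_islands(islands,
                      fitness=lambda x: x * x,
                      mutate=lambda x, r: x + r.gauss(0, 0.5),
                      generations=30, rng=rng)
```

Because each champion is carried over unchanged, the best fitness in this sketch can never get worse from one generation to the next.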

B.2 functions.py
Code Snippet B.2: Complete source code: functions.py.
########################################
####             FIGARO             ####
#### Author: Pieter Noyens          ####
#### Academic year: 2015-2016       ####
#### Student nr: r0307453           ####
#### 2MSc Bioinformatics, KU Leuven ####
########################################

## Script containing all low-level GA functions.

from protein import Protein
from Bio.PDB import *
import modeller
import modeller.automodel
import modeller.scripts.complete_pdb
import re
import urllib2
import os
import sys
import random
import copy

ROSETTADIR = "Rosetta"

exceptionList = ["2OGM", "4Y5G"]
aminoAcids = ["A", "D", "E", "F", "G", "H", "I", "K", "L", "M",
              "N", "Q", "R", "S", "T", "V", "W", "Y"]

## Filling up shortcoming of Biopython package - no method for
## retrieving residue index from residue object available.
def get_resi(res):
    return int(str(res).split()[3][7:])

## Creates initial PDB population and downloads it to folder "pdb".
## Receptors for drug-like molecules similar to the random molecule are collected.
def initialize_pop(randMol, popSize, subpopSize, maxRes, maxEnt, minSim, nrBindingSites, pctM):
    population = []
    usedPdb = []
    similarity = 1
    while similarity > minSim and len(population) != popSize:
        pdbOut = urllib2.urlopen("http://www.rcsb.org/pdb/rest/smilesQuery?smiles=" + randMol +
                                 "&search_type=similarity&similarity=" + str(similarity)
                                 ).read().splitlines()
        for line in pdbOut:
            if len(population) != popSize:
                hit = re.search('(?<=structureId=").*?(?=")', line)
                if hit:
                    pdb = hit.group()
                    chain = re.search('(?<=chain id=")\w(?=")', urllib2.urlopen(
                        "http://www.rcsb.org/pdb/rest/describeMol?structureId=" + pdb).read()).group()
                    if pdb not in usedPdb + exceptionList and quality_check(pdb, maxRes, maxEnt):
                        ligand = re.search('(?<=chemicalID=").*?(?=")', line).group()
                        write_pdb(pdb, ligand)
                        fix_pdb(pdb)
                        extract_chain(pdb, chain)
                        seq = get_seq(pdb)
                        bindingSites = get_binding_sites(pdb, nrBindingSites)
                        motherProtein = Protein(pdb, seq, chain, bindingSites)
                        subpop = create_subpopulation(motherProtein, subpopSize, pctM)
                        population.append(subpop)
                        usedPdb.append(pdb)
        similarity = similarity - 0.1
    if len(population) == popSize:
        print "Population initialization finished."
        return population
    else:
        sys.exit("Not enough available templates for these parameters.")

## Writes initial info to file.
def write_initial_info(population):
    with open("initial_info", "w") as initialFile:
        for subpop in population:
            for protein in subpop:
                initialFile.write(protein.id + "\t" + protein.seq + "\t" +
                                  str(protein.bindingSites) + "\n")
    print "Initial info written."

## Extracts single chain from PDB and writes as PDB.
def extract_chain(pdb, chain):
    parser = PDBParser()
    structure = parser.get_structure(pdb, "pdb/" + pdb + ".pdb")
    Dice.extract(structure, chain, start=0, end=sys.maxint, filename="pdb/" + pdb + ".pdb")

## Extracts sequence from PDB.
def get_seq(pdb):
    parser = PDBParser()
    builder = PPBuilder()
    structure = parser.get_structure(pdb, "pdb/" + pdb + ".pdb")
    sequenceList = []
    for polypeptide in builder.build_peptides(structure):
        sequenceList.append(str(polypeptide.get_sequence()))
    return "".join(sequenceList)

## Checks if entity number and resolution are tolerated.
def quality_check(pdb, maxRes, maxEnt):
    description = urllib2.urlopen("http://www.rcsb.org/pdb/rest/describePDB?structureId=" +
                                  pdb).read().splitlines()
    eTest = False
    rTest = False
    for line in description:
        e = re.search('(?<=nr_entities=")\w(?=")', line)
        r = re.search('(?<=resolution=").*?(?=")', line)
        if e and int(e.group()) <= maxEnt:
            eTest = True
        if r and float(r.group()) <= maxRes:
            rTest = True
    if eTest and rTest:
        return True
    else:
        return False
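
The attribute extraction in quality_check can be tried out offline. The sketch below is written for Python 3 (unlike the Python 2 listing) and assumes an input line that carries `nr_entities` and `resolution` attributes, as the listing searches for; the sample line and the names `parse_quality` and `passes_quality` are hypothetical. It uses capture groups with `\d+` instead of a single `\w`, so that an entity count of ten or more is still read as a number rather than silently not matching.

```python
import re

def parse_quality(xml_text):
    """Return (nr_entities, resolution) found in the text, or None for a missing one."""
    entities = re.search(r'nr_entities="(\d+)"', xml_text)
    resolution = re.search(r'resolution="([0-9.]+)"', xml_text)
    return (int(entities.group(1)) if entities else None,
            float(resolution.group(1)) if resolution else None)

def passes_quality(xml_text, max_res, max_ent):
    """True only if both values are present and within the given limits."""
    n_ent, res = parse_quality(xml_text)
    return (n_ent is not None and res is not None
            and n_ent <= max_ent and res <= max_res)

# Hypothetical sample line, modelled on the attributes the listing looks for.
sample = '<PDB structureId="1ABC" resolution="1.80" nr_entities="12" />'
```

With the thresholds of main.py (`maxRes = 2.0`, `maxEnt = 1`) the sample is rejected because of its entity count, not because of its resolution.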

## Creates subpopulation.
def create_subpopulation(motherProtein, size, pctM):
    subpop = []
    for i in range(0, size):
        subpop.append(Protein(motherProtein.id, motherProtein.seq, motherProtein.chain,
                              motherProtein.bindingSites))
        point_mutation(subpop[i], pctM)
        subpop[i].set_id(motherProtein.id + "_" + str(i+1))
    return subpop

## Identifies putative binding pockets using the SiteHound package
## (no installation needed, i386 libraries have to be installed for pdb2gmx).
def get_binding_sites(pdb, nrBindingSites):
    bsCenterList = []
    startingDir = os.getcwd()
    os.chdir(startingDir + "/pdb")
    os.system("./auto.py -i " + pdb + ".pdb -p CMET -k")
    with open(pdb + "_CMET_summary.dat", "r") as summary:
        for i in range(0, nrBindingSites):
            line = summary.readline()
            x = line.split()[-3]
            y = line.split()[-2]
            z = line.split()[-1]
            bsCenterList.append("(" + x + "," + y + "," + z + ")")
    ## Remove temporary files
    os.system("rm " + pdb + "_* " + pdb + ".easymifs")
    ## Return to main directory.
    os.chdir(startingDir)
    print "Binding sites found"
    return bsCenterList
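
The step that turns summary rows into binding-site centres is independent of SiteHound itself. As a minimal Python 3 sketch, assuming (as the listing does) that the last three whitespace-separated columns of each row hold the x, y and z coordinates; the sample rows and the name `site_centers` are hypothetical:

```python
def site_centers(summary_lines, n_sites):
    """Format the last three columns of the first n_sites rows as '(x,y,z)'."""
    centers = []
    for line in summary_lines[:n_sites]:
        x, y, z = line.split()[-3:]
        centers.append("(" + x + "," + y + "," + z + ")")
    return centers

# Hypothetical summary rows; only the last three columns matter here.
rows = ["1 -1523.4 210 12.500 -3.250 40.125",
        "2 -998.1 150 8.000 15.750 -22.500"]
```

Slicing with `[:n_sites]` returns fewer centres when the summary is shorter than requested, whereas repeated `readline()` calls would yield empty strings at end of file.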

## Performs molecular docking of ligand to binding sites (BS) of
## receptor structures using the rDock package.
## The global docking score function serves as a fitness measure.
## Don't forget to first compile and set up rDock and Open Babel correctly.
def dock(protein):
    print "Docking started for protein " + protein.id + "."
    bestScore = None
    bestBindingSite = None
    ## Convert pdb to mol2 file using OpenBabel package.
    os.system("babel -h ./pdb/" + protein.id + ".pdb " + protein.id + ".mol2")
    for bindingSite in protein.bindingSites:
        ## First write system file.
        with open("rDockSystem" + protein.id + ".prm", "w") as systemFile:
            systemFile.write("RBT_PARAMETER_FILE_V1.00\nTITLE gart_DUD\nRECEPTOR_FILE " + protein.id +
                             ".mol2\nRECEPTOR_FLEX 3.0\nSECTION MAPPER\nSITE_MAPPER "
                             "RbtSphereSiteMapper\nCENTER " + bindingSite + "\nRADIUS 15.0\n"
                             "SMALL_SPHERE 1.5\nMIN_VOLUME 100\nMAX_CAVITIES 1\nVOL_INCR 0.0\n"
                             "GRIDSTEP 0.5\nEND_SECTION\nSECTION CAVITY\n"
                             "SCORING_FUNCTION RbtCavityGridSF\nWEIGHT 1.0\nEND_SECTION")
        ## Generate cavity for docking.
        os.system("rbcavity -was -d -r rDockSystem" + protein.id + ".prm")
        ## Perform docking.
        os.system("rbdock -i ligand.sd -o output" + protein.id + " -r rDockSystem" + protein.id +
                  ".prm -p dock.prm -n 1")
        ## Return docking score.
        with open("output" + protein.id + ".sd", "r") as dockingResult:
            for line in dockingResult:
                if line == ">  <SCORE>\n":
                    dockingScore = float(dockingResult.next().strip())
        if bestScore is None or dockingScore < bestScore:
            bestScore = dockingScore
            bestBindingSite = bindingSite
    protein.set_score(bestScore)
    protein.set_best_binding_site(bestBindingSite)
    ## Remove temporary files
    os.system("rm " + protein.id + ".mol2 output" + protein.id + ".sd rDockSystem" + protein.id +
              "_cav1.grd rDockSystem" + protein.id + ".as rDockSystem" + protein.id + ".prm")
    print "Docking finished for protein " + protein.id + "."
    return protein
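
The score extraction and the "lowest score wins" bookkeeping in dock can be separated from the rDock calls. The following Python 3 sketch assumes an SD-style text in which the value follows a `>  <SCORE>` tag line, as in the listing; the sample fragment and the names `read_score` and `best_site` are hypothetical.

```python
def read_score(sd_text):
    """Return the number on the line after the '>  <SCORE>' tag, or None if absent."""
    lines = iter(sd_text.splitlines())
    for line in lines:
        if line.strip() == ">  <SCORE>":
            return float(next(lines).strip())
    return None

def best_site(scores_by_site):
    """Return the (site, score) pair with the lowest (best) score."""
    site = min(scores_by_site, key=scores_by_site.get)
    return site, scores_by_site[site]

# Hypothetical fragment of a docking result file.
sd = "mol\n>  <SCORE>\n-23.417\n\n>  <SCORE.INTER>\n-18.2\n"
```

Returning `None` when the tag is missing makes a failed docking run visible to the caller instead of reusing a score left over from a previous binding site.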

## Reports best individual from population.
def report_best(population):
    currentBest = None
    for subpop in population:
        for protein in subpop:
            if currentBest is None or protein.score < currentBest.score:
                currentBest = protein
    return currentBest

## Evaluates subpopulations and returns list of best candidates.
def get_candidates_list(population):
    candidatesList = []
    for subpop in population:
        currentBest = None
        for protein in subpop:
            if currentBest is None or protein.score < currentBest.score:
                currentBest = protein
        candidatesList.append(currentBest)
    return candidatesList

## Roulette Wheel Selection operator
def roulette_wheel_selection(population, pctE, replace=True):
    newPopulation = []
    for i in range(0, len(population)):
        totalScore = 0
        for protein in population:
            totalScore += protein.score
        rouletteResult = random.uniform(0, totalScore)
        index = 0
        currentTotal = population[index].score
        while currentTotal < rouletteResult and index < len(population) - 1:
            index += 1
            currentTotal += population[index].score
        if replace == True:
            newId = population[index].id[0:4] + "selected" + str(i+1)
            newPopulation.append(Protein(newId, population[index].seq,
                                         population[index].chain,
                                         population[index].bindingSites))
            ## Rename PDBs to link with new ids.
            os.system("cp pdb/" + population[index].id + ".pdb pdb/" + newId + ".pdb")
        if replace == False:
            newPopulation.append(population[index])
    if replace == True:
        ## Clean up PDB folder.
        for protein in population:
            os.system("rm pdb/" + protein.id + ".pdb")
    print "Roulette Wheel Selection succeeded."
    return newPopulation
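
Since docking scores are better when they are lower, a roulette wheel needs slice sizes that grow as the score decreases. One common way to obtain such slices, shown here only as an illustrative Python 3 sketch and not as the procedure of the listing above, is to measure each score against the worst one. The function name `roulette_select_min` and the small offset `eps` (which keeps the worst individual selectable) are assumptions made for this example.

```python
import random

def roulette_select_min(scores, n_picks, rng):
    """Pick n_picks indices with replacement; lower scores get larger slices."""
    worst = max(scores)
    # Small positive offset so the worst individual keeps a non-zero slice.
    eps = 1e-9 + 0.01 * (worst - min(scores))
    weights = [worst - s + eps for s in scores]
    total = sum(weights)
    picks = []
    for _ in range(n_picks):
        spin = rng.uniform(0, total)
        running = 0.0
        for index, weight in enumerate(weights):
            running += weight
            if spin <= running:
                break
        picks.append(index)
    return picks

rng = random.Random(1)
scores = [-30.0, -20.0, -10.0]   # more negative = better
picks = roulette_select_min(scores, 3000, rng)
counts = [picks.count(i) for i in range(len(scores))]
```

With these three scores the weights are 20.2, 10.2 and 0.2, so the best individual is expected to be drawn roughly twice as often as the second best, and the worst only rarely.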

## Two-Points Crossover operator
def two_points_crossover(population, pctX):
    if not len(population) > 1:
        population[0].set_parents(protA=population[0])
        population[0].set_id(population[0].id + "recombined")
        print "No crossover possible."
    else:
        random.shuffle(population)
        for index, protein in enumerate(population):
            if random.random() <= pctX:
                if index != len(population) - 1:
                    minLength = min([len(protein.seq), len(population[index+1].seq)])
                    crossoverPoint1 = random.randint(0, minLength)
                    crossoverPoint2 = random.randint(crossoverPoint1, minLength)
                    protein.set_parents(protA=protein, protB=population[index+1],
                                        crossoverPoint1=crossoverPoint1,
                                        crossoverPoint2=crossoverPoint2)
                    protein.update_seq(protein.seq[0:crossoverPoint1] +
                                       population[index+1].seq[crossoverPoint1:crossoverPoint2] +
                                       protein.seq[crossoverPoint2:])
                    protein.set_id(protein.id + "recombined")
                else:
                    minLength = min([len(protein.seq), len(population[index-1].parents[0][1])])
                    crossoverPoint1 = random.randint(0, minLength)
                    crossoverPoint2 = random.randint(crossoverPoint1, minLength)
                    protein.set_parents(protA=protein, protB=population[index-1],
                                        crossoverPoint1=crossoverPoint1,
                                        crossoverPoint2=crossoverPoint2)
                    protein.update_seq(protein.seq[0:crossoverPoint1] +
                                       population[index-1].parents[0][1][crossoverPoint1:crossoverPoint2] +
                                       protein.seq[crossoverPoint2:])
                    protein.set_id(protein.id + "recombined")
                print "Crossover finished."
            else:
                protein.set_parents(protA=protein)
                protein.set_id(protein.id + "recombined")
                print "No crossover."
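
Stripped of the Protein bookkeeping, the two-point operator replaces one window of the first sequence with the same window of its partner. A minimal Python 3 sketch on plain strings follows; the name `two_point_crossover` and the explicit `rng` argument are choices made for this example, not part of the listing.

```python
import random

def two_point_crossover(seq_a, seq_b, rng):
    """Replace a random window of seq_a with the same window of seq_b."""
    limit = min(len(seq_a), len(seq_b))
    p1 = rng.randint(0, limit)          # window start (inclusive bounds)
    p2 = rng.randint(p1, limit)         # window end, never before the start
    child = seq_a[:p1] + seq_b[p1:p2] + seq_a[p2:]
    return child, (p1, p2)

rng = random.Random(42)
child, (p1, p2) = two_point_crossover("AAAAAAAAAA", "GGGGGGGG", rng)
```

Because both cut points are bounded by the shorter sequence, the child always keeps the length of the first parent, which is what allows sequences of unequal length to be recombined.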

## Single-Point Crossover operator
def single_point_crossover(population, pctX):
    if not len(population) > 1:
        population[0].set_parents(protA=population[0])
        population[0].set_id(population[0].id + "recombined")
        print "No crossover possible."
    else:
        random.shuffle(population)
        for index, protein in enumerate(population):
            if random.random() <= pctX:
                if index != len(population) - 1:
                    minLength = min([len(protein.seq), len(population[index+1].seq)])
                    crossoverPoint = random.randint(0, minLength)
                    protein.set_parents(protA=protein, protB=population[index+1],
                                        crossoverPoint1=crossoverPoint)
                    protein.update_seq(protein.seq[0:crossoverPoint] +
                                       population[index+1].seq[crossoverPoint:])
                    protein.set_id(protein.id + "recombined")
                else:
                    minLength = min([len(protein.seq), len(population[index-1].parents[0][1])])
                    crossoverPoint = random.randint(0, minLength)
                    protein.set_parents(protA=protein, protB=population[index-1],
                                        crossoverPoint1=crossoverPoint)
                    protein.update_seq(protein.seq[0:crossoverPoint] +
                                       population[index-1].parents[0][1][crossoverPoint:])
                    protein.set_id(protein.id + "recombined")
                print "Crossover finished."
            else:
                protein.set_parents(protA=protein)
                protein.set_id(protein.id + "recombined")
                print "No crossover."

## Point mutation operator
def point_mutation(protein, pctM):
    mutable = list(protein.seq)
    pointMutations = {}
    for index, aminoAcid in enumerate(mutable):
        if random.random() <= pctM and not (protein.seq[index] == "C" or protein.seq[index] == "P"):
            mutable[index] = random.choice(aminoAcids)
            if protein.seq[index] != mutable[index]:
                pointMutations[index+1] = protein.seq[index] + mutable[index]
    protein.update_seq("".join(mutable))
    protein.set_point_mutations(pointMutations)
    print "protein " + protein.id + " mutated."
    return protein
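
The mutation rule (mutate each residue with a fixed probability, never touch cysteine or proline, never introduce them, and record each change under its 1-based position) can be exercised on a plain string. This is a Python 3 sketch with hypothetical names (`mutate`, `AMINO_ACIDS`, `PROTECTED`); it mirrors the rule of the listing but is not the listing itself.

```python
import random

AMINO_ACIDS = "ADEFGHIKLMNQRSTVWY"   # the 18 letters of aminoAcids in the listing
PROTECTED = "CP"                      # cysteine and proline are left untouched

def mutate(seq, rate, rng):
    """Return (new_seq, changes); changes maps 1-based position -> old + new letter."""
    residues = list(seq)
    changes = {}
    for i, old in enumerate(seq):
        if old not in PROTECTED and rng.random() <= rate:
            new = rng.choice(AMINO_ACIDS)
            if new != old:
                residues[i] = new
                changes[i + 1] = old + new
    return "".join(residues), changes

rng = random.Random(7)
original = "MKCPLLVAGCP"
mutant, changes = mutate(original, rate=1.0, rng=rng)
```

Even at a rate of 1.0 a position can stay unchanged, because the replacement is drawn from all 18 letters and may coincide with the original residue; such draws are not recorded as mutations.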

## Writes PDB structure file with or without ligand.
def write_pdb(pdb, ligand=None, folder="./pdb/"):
    with open(folder + pdb + ".pdb", "w") as pdbFile:
        pdbOrig = urllib2.urlopen("http://www.rcsb.org/pdb/files/" + pdb + ".pdb").read().splitlines()
        if ligand:
            for line in pdbOrig:
                if not re.match("HETATM\s.*?\s.*?\s" + ligand, line):
                    pdbFile.write(line + "\n")
            print "PDB file for " + pdb + " is written without ligand " + ligand + "."
        else:
            for line in pdbOrig:
                pdbFile.write(line + "\n")
            print "PDB file for " + pdb + " is written."

## Writes PIR alignment file.
def write_pir(protein):
    with open(protein.id + ".ali", "w") as pirFile:
        pirFile.write(">P1;" + protein.id + "\n" + "sequence:" +
                      protein.id + ":::::::0.00: 0.00\n" + protein.seq + "*")
    print "PIR file written."

## Writes alignment file.
def write_aln(protein):
    with open(protein.id + "total.ali", "w") as alnFile:
        for parent in protein.parents:
            alnFile.write(">P1;" + parent[0] + "\n" + "structure:" + parent[0] + ":.:.:.:.::::\n" +
                          parent[1] + "*\n")
        alnFile.write(">P1;" + protein.id + "\n" + "sequence:" + protein.id + "::::::::\n" +
                      protein.seq + "*\n")
    print "Alignment file written."
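
The PIR records written above follow a fixed shape: a `>P1;code` line, a header line of ten colon-separated fields, and the sequence terminated by `*`. A small Python 3 sketch that builds the same text in memory may make that shape easier to see; the helper names `pir_record` and `pir_alignment` are hypothetical.

```python
def pir_record(code, seq, kind="sequence"):
    """Build one PIR entry: '>P1;code', a 10-field header line, sequence + '*'."""
    if kind == "structure":
        # Dots let the modelling program take residue range and chain from the file.
        fields = [kind, code, ".", ".", ".", ".", "", "", "", ""]
    else:
        fields = [kind, code, "", "", "", "", "", "", "", ""]
    return ">P1;" + code + "\n" + ":".join(fields) + "\n" + seq + "*\n"

def pir_alignment(templates, target_code, target_seq):
    """templates is a list of (code, aligned_sequence) pairs; the target comes last."""
    parts = [pir_record(code, seq, "structure") for code, seq in templates]
    parts.append(pir_record(target_code, target_seq))
    return "".join(parts)

text = pir_alignment([("1abc", "MKV-LA")], "child", "MKVGLA")
```

Building the header from a list of ten fields avoids miscounting colons, which is easy to do when the line is written as a literal string.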

## Returns list of templates.
def get_templates(prf, cutoff, amount):
    prf.write(file="build_profile.prf", profile_format="TEXT")
    templates = []
    identity = 95
    while identity >= cutoff and len(templates) < amount:
        with open("build_profile.prf", "r") as profile:
            for line in profile:
                ## (The match pattern is not legible in the source listing.)
                if len(templates) < 2 and re.match("", line) \
                        and int(line.split()[10].strip(".")) >= identity \
                        and line.split()[1][0:5] not in templates:
                    templates.append(line.split()[1][0:5])
        identity -= 10
    if len(templates) == amount:
        print "Templates gathered."
        return templates
    else:
        sys.exit("Not enough templates.")

## Fixes missing residues in PDB file.
def fix_pdb(pdb):
    env = modeller.environ()
    env.libs.topology.read("${LIB}/top_heav.lib")
    env.libs.parameters.read("${LIB}/par.lib")
    m = modeller.scripts.complete_pdb(env, "pdb/" + pdb + ".pdb")
    m.write(file="pdb/" + pdb + ".pdb")
    print "PDB fixed."

## Adds chain identifier to lines in PDB.
def add_chain_id(pdb, chain):
    fixedLines = []
    with open("pdb/" + pdb + ".pdb", "r") as pdbFile:
        for line in pdbFile:
            if line[0:4] == "ATOM":
                mutable = list(line)
                mutable[21] = chain
                fixedLines.append("".join(mutable))
            else:
                fixedLines.append(line)
    with open("pdb/" + pdb + ".pdb", "w") as pdbFile:
        for line in fixedLines:
            pdbFile.write(line)
    print "Chain identifier added."
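
In the fixed-column PDB format the chain identifier of an ATOM record sits in column 22, which is index 21 when counting from zero; this is the position add_chain_id overwrites. A Python 3 sketch of the same operation on a single line follows, with a hypothetical helper name `set_chain_id` and a made-up ATOM record as input.

```python
def set_chain_id(pdb_line, chain):
    """Write a one-letter chain id into column 22 (index 21) of an ATOM record."""
    if not pdb_line.startswith("ATOM"):
        return pdb_line
    # Pad short lines so that index 21 always exists.
    padded = pdb_line.rstrip("\n").ljust(22)
    return padded[:21] + chain + padded[22:] + "\n"

atom = "ATOM      1  N   MET     1      27.340  24.430   2.614  1.00  9.67           N\n"
fixed = set_chain_id(atom, "A")
```

Only ATOM records are changed here, matching the listing; HETATM records keep whatever chain field they had.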

## Predicts protein structures with homology modeling (classical way, not used).
def predict_structure_homology(protein):
    ## Select templates from database.
    write_pir(protein)
    env = modeller.environ()
    env.io.atom_files_directory = "./pdb"
    sdb = modeller.sequence_db(env)
    sdb.read(seq_database_file="20160310_pdb95.bin",
             seq_database_format="BINARY", chains_list="ALL")
    aln = modeller.alignment(env)
    aln.append(file=protein.id + ".ali", alignment_format="PIR", align_codes="ALL")
    prf = aln.to_profile()
    prf.build(sdb, matrix_offset=-450, rr_file="${LIB}/blosum62.sim.mat",
              gap_penalties_1d=(-500, -50), n_prof_iterations=1,
              check_profile=False, max_aln_evalue=0.01)
    templates = get_templates(prf, 35, 2)
    ## Align templates.
    aln = modeller.alignment(env)
    for template in templates:
        write_pdb(template[0:4])
        m = modeller.model(env, file=template[0:4],
                           model_segment=("FIRST:" + template[4:], "LAST:" + template[4:]))
        aln.append_model(m, atom_files=template[0:4], align_codes=template)
    for (weights, write_fit, whole) in (((1., 0., 0., 0., 1., 0.), False, True),
                                        ((1., 0.5, 1., 1., 1., 0.), False, True),
                                        ((1., 1., 1., 1., 1., 0.), True, False)):
        aln.salign(rms_cutoff=3.5, normalize_pp_scores=False,
                   rr_file="$(LIB)/as1.sim.mat", overhang=30,
                   gap_penalties_1d=(-450, -50),
                   gap_penalties_3d=(0, 3), gap_gap_score=0, gap_residue_score=0,
                   dendrogram_file=protein.id + "templates.tree",
                   alignment_type="tree",
                   feature_weights=weights,
                   improve_alignment=True, fit=True, write_fit=write_fit,
                   write_whole_pdb=whole, output="ALIGNMENT QUALITY")
    aln.write(file=protein.id + "templates.ali", alignment_format="PIR")
    ## Align sequence to templates.
    env.libs.topology.read(file="$(LIB)/top_heav.lib")
    aln_block = len(aln)
    aln.append(file=protein.id + ".ali", alignment_format="PIR", align_codes=protein.id)
    aln.salign(output="", max_gap_length=20,
               gap_function=True,
               alignment_type="PAIRWISE", align_block=aln_block,
               feature_weights=(1., 0., 0., 0., 0., 0.), overhang=0,
               gap_penalties_1d=(-450, 0),
               gap_penalties_2d=(0.35, 1.2, 0.9, 1.2, 0.6, 8.6, 1.2, 0., 0.),
               similarity_flag=True)
    aln.write(file=protein.id + "total.ali", alignment_format="PIR")
    a = modeller.automodel.automodel(env, alnfile=protein.id + "total.ali",
                                     knowns=tuple(templates), sequence=protein.id)
    a.starting_model = 1
    a.ending_model = 1
    a.make()
    ## Refine side chains using rotamer libraries (SCWRL package),
    ## switch hashes to activate. Deactivation preferred.
    # os.system("./scwrl4/Scwrl4 -i " + protein.id + ".B99990001.pdb -o final" + protein.id + ".pdb")
    os.system("cp " + protein.id + ".B99990001.pdb final" + protein.id + ".pdb")
    ## Replace previous PDB with model.
    os.system("mv final" + protein.id + ".pdb ./pdb/" + protein.id + ".pdb")
    ## Fix PDB (only needed when side chain refinement is not used)
    fix_pdb(protein.id)
    ## Remove temporary files.
    os.system("rm " + protein.id + "*")
    ## Modeller does not add chain ids by itself.
    add_chain_id(protein.id, protein.chain)
    print "Protein " + protein.id + " modeled using Modeller and SCWRL4."
    return protein

## Predicts protein structures with homology modeling using intermediate models (manual alignment, faster).

def predict_structure_homology_fast(protein):
    if len(protein.parents) > 1:
        write_aln(protein)
        env = modeller.environ()
        env.io.atom_files_directory = './pdb'
        templates = []
        for parent in protein.parents:
            templates.append(parent[0])
        a = modeller.automodel.automodel(env, alnfile=protein.id + 'total.ali',
                                         knowns=tuple(templates),
                                         sequence=protein.id)
        a.starting_model = 1
        a.ending_model = 1
        a.make()
        ## Refine side chains using rotamer libraries (SCWRL package), switch hashes to activate. Deactivation preferred.
        # os.system('./scwrl4/Scwrl4 -i ' + protein.id + '.B99990001.pdb -o final' + protein.id + '.pdb')
        os.system('cp ' + protein.id + '.B99990001.pdb final' + protein.id + '.pdb')
        ## Replace previous PDB with model.
        os.system('mv final' + protein.id + '.pdb ./pdb/' + protein.id + '.pdb')
        ## Fix PDB (only needed when side chain refinement is not used)
        fix_pdb(protein.id)
        ## Remove temporary files.
        os.system('rm ' + protein.id + '*')
        ## Modeller does not add chain ids by itself.
        add_chain_id(protein.id, protein.chain)
        print 'Protein ' + protein.id + ' modeled using MODELLER.'
        return protein
    else:
        os.system('cp pdb/' + protein.parents[0][0] + '.pdb pdb/' + protein.id + '.pdb')
        return protein

## Models point mutations using the Rosetta Backrub application.

def predict_structure_backrub(protein):
    parser = PDBParser()
    structure = parser.get_structure(protein.id[0:4], 'pdb/' + protein.id[0:4] + '.pdb')
    ## Give every thread its own input and output files.
    os.system('cp pdb/' + protein.id[0:4] + '.pdb pdb/' + protein.id + '.pdb')
    atomList = []
    for atom in structure.get_atoms():
        atomList.append(atom)
    nbs = NeighborSearch(atomList)
    for mutation in protein.pointMutations:
        ## Write resfile and get all residues within 6 Angstrom of mutated residues using Biopython package.
        with open('resfile' + protein.id, 'w') as resfile:
            resfile.write('NATRO\nstart\n')
            affectedList = []
            residue = structure.get_chains().next()[mutation]
            for atom in Selection.unfold_entities(residue, 'A'):
                for neighbor in nbs.search(atom.get_coord(), 6, level='R'):
                    affectedList.append(get_resi(neighbor))
            resfile.write(str(mutation) + ' ' + protein.chain + ' PIKAA ' + protein.pointMutations[mutation][1] + '\n')
            ## Make affectedList unique.
            affectedList = list(set(affectedList))
            pivotList = []
            for res in affectedList:
                if res != mutation:
                    resfile.write(str(res) + ' ' + protein.chain + ' NATAA\n')
                pivotList.append(res)
                if res != 1:
                    pivotList.append(res - 1)
                if res != len(protein.seq):
                    pivotList.append(res + 1)
            pivotList = list(set(pivotList))
        ## Run backrub sampling with 10,000 Monte Carlo iterations.
        command = ROSETTADIR + '/main/source/bin/backrub.linuxgccrelease -database ' + ROSETTADIR + \
                  '/main/database -s pdb/' + protein.id + '.pdb -nstruct 1 -backrub:ntrials ' + \
                  str(10000) + ' -resfile resfile' + protein.id + ' -pivot_residues '
        for pivot in pivotList:
            command += str(pivot) + ' '
        os.system(command)
        ## Refine side chains using rotamer libraries (SCWRL package), switch hashes to activate. Deactivation preferred.
        # os.system('./scwrl4/Scwrl4 -i ' + protein.id + '_0001.pdb -o pdb/' + protein.id + '.pdb')
        os.system('mv ' + protein.id + '_0001.pdb pdb/' + protein.id + '.pdb')
    ## Remove temporary files.
    os.system('rm ' + protein.id + '* resfile' + protein.id)
    ## Fix PDB (only needed when side chain refinement is not used)
    fix_pdb(protein.id)
    print 'Protein ' + protein.id + ' modeled using RosettaBackrub and SCWRL4.'
    return protein

def get_unique_ids(number, motherSelection):
    list = []
    while len(list) < number:
        id = ''
        for digit in range(0, 4):
            id += str(random.randint(0, 9))
        if id not in list + [protein.id for protein in motherSelection]:
            list.append(id)
    return list

def prepare_next_gen(population, motherSelection, subpopSize, pctM, generation):
    newPopulation = []
    ids = get_unique_ids(len(motherSelection), motherSelection)
    for index, protein in enumerate(motherSelection):
        os.system('cp pdb/' + protein.id + '.pdb pdb/' + ids[index] + '.pdb')
        newPopulation.append(create_subpopulation(Protein(ids[index], protein.seq, protein.chain,
                                                          [protein.bestBindingSite]), subpopSize, pctM))
    ## Write best pdb to folder bestpdbs and print report.
    bestProtein = report_best(population)
    os.system('mkdir -p bestpdbs && cp pdb/' + bestProtein.id + '.pdb bestpdbs/' +
              str(generation + 1) + '_' + bestProtein.id + '.pdb')
    with open('report', 'a') as report:
        report.write(bestProtein.id + '\t' + str(bestProtein.bindingSites) + '\t' +
                     bestProtein.bestBindingSite + '\t' + str(bestProtein.score) + '\n')
    for subpop in population:
        ## Remove PDBs from previous generation.
        os.system('rm pdb/' + subpop[0].id[0:4] + '.pdb')
        for protein in subpop:
            os.system('rm pdb/' + protein.id[0:14] + '*')
    return newPopulation
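
The resfile and pivot-residue bookkeeping in predict_structure_backrub can be hard to follow because it is interleaved with file and shell calls. The following is a minimal Python 3 sketch of that logic only, not the thesis implementation: the function name build_resfile_and_pivots and its parameters are hypothetical, the neighbour list is passed in directly instead of being computed with Biopython, and it assumes that every residue near the mutation (including the mutated residue itself) and its direct sequence neighbours become backrub pivots.

```python
def build_resfile_and_pivots(mutation, target_aa, chain, neighbours, seq_len):
    """Return (resfile_text, pivot_residues) for a single point mutation.

    mutation   -- residue number that is mutated (1-based)
    target_aa  -- one-letter code of the amino acid to introduce
    chain      -- chain identifier written on every resfile line
    neighbours -- residue numbers within the distance cutoff of the mutation
    seq_len    -- length of the protein sequence
    """
    # NATRO as default keeps all unlisted residues fixed; PIKAA forces the mutation.
    lines = ["NATRO", "start", "%d %s PIKAA %s" % (mutation, chain, target_aa)]
    pivots = set()
    for res in sorted(set(neighbours)):
        if res != mutation:
            # NATAA lets a neighbouring residue repack without changing identity.
            lines.append("%d %s NATAA" % (res, chain))
        pivots.add(res)
        # Add the sequence neighbours, staying inside the chain boundaries.
        if res != 1:
            pivots.add(res - 1)
        if res != seq_len:
            pivots.add(res + 1)
    return "\n".join(lines) + "\n", sorted(pivots)


# Hypothetical input: mutate residue 10 to tryptophan in a 42-residue chain A.
text, pivots = build_resfile_and_pivots(10, "W", "A", [9, 10, 11, 42], seq_len=42)
print(text)
print(pivots)
```

Returning the resfile text and the pivot list as plain values keeps this step testable without Rosetta installed; the caller can then write the text to disk and append the pivots to the command line.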

B.3 protein.py

Code Snippet B.3: Complete source code: protein.py.

########################################
####             FIGARO             ####
#### Author: Pieter Noyens          ####
#### Academic year: 2015-2016       ####
#### Student nr: r0307453           ####
#### 2MSc Bioinformatics, KU Leuven ####
########################################

# Class defining the protein object.

import re

class Protein():

    def __init__(self, id, seq, chain, bindingSites):
        self.id = id
        self.seq = seq
        self.chain = chain
        self.bindingSites = bindingSites
        self.score = None
        self.parents = []
        self.crossoverPoints = []
        self.bestBindingSite = None
        self.pointMutations = {}

    def set_score(self, score):
        self.score = score

    def set_parents(self, protA, protB=None, crossoverPoint1=None, crossoverPoint2=None):
        self.parents.append((protA.id, protA.seq))
        if protB and re.search('recombined', protB.id):
            self.parents.append((protB.parents[0][0], protB.parents[0][1]))
        if protB and not re.search('recombined', protB.id):
            self.parents.append((protB.id, protB.seq))
        if crossoverPoint1:
            self.crossoverPoints = [crossoverPoint1]
        if crossoverPoint2:
            self.crossoverPoints.append(crossoverPoint2)

    def set_best_binding_site(self, bindingSite):
        self.bestBindingSite = bindingSite

    def set_id(self, id):
        self.id = id

    def update_seq(self, seq):
        self.seq = seq

    def set_point_mutations(self, pointMutations):
        self.pointMutations = pointMutations
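
The parent bookkeeping in set_parents has one non-obvious rule: when the second parent is itself a recombination intermediate, its own first parent is recorded instead of the intermediate. The sketch below illustrates only that rule in Python 3. The class Node, the function resolve_parents, and the example ids and sequences are hypothetical stand-ins, and it assumes that intermediates are recognized by the substring "recombined" in their id.

```python
import re


class Node:
    """Hypothetical stand-in for Protein holding only id, seq and parents."""

    def __init__(self, id, seq, parents=None):
        self.id = id
        self.seq = seq
        self.parents = list(parents or [])


def resolve_parents(prot_a, prot_b=None):
    """Return the (id, seq) tuples that would be stored as parents."""
    parents = [(prot_a.id, prot_a.seq)]
    if prot_b is not None:
        if re.search("recombined", prot_b.id):
            # Intermediate model: fall back to the protein it was derived from.
            first = prot_b.parents[0]
            parents.append((first[0], first[1]))
        else:
            parents.append((prot_b.id, prot_b.seq))
    return parents


a = Node("0001", "ACDE")
b = Node("0002", "AGHE")
intermediate = Node("0002recombined", "ACHE", parents=[("0002", "AGHE")])

print(resolve_parents(a, b))
print(resolve_parents(a, intermediate))
```

Under these assumptions both calls yield the same pair of parents, so templates passed on to homology modeling always refer to proteins with an existing structure file rather than to an intermediate.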


Appendix C

End summary


In this work we propose a fully automated and less biased evolutionary strategy to design and optimize a binding site for arbitrary substrate
molecules that have no known natural binding pocket. Based on an efficient genetic algorithm dubbed FIGARO (a Fast and Interpopulational Genetic
Algorithm for Receptor Optimization), we show that this strategy is
a promising new approach to finding protein structures that can
serve as starting points for artificial enzyme design, with the aim of speeding up
reactions that would otherwise be too slow to have practical relevance. The most foreseeable use case, and the one that particularly captures
our imagination, is its application in cancer immunotherapy. FIGARO offers a new bridge between very specific, even artificially designed, chemical structures and proteins that can
vary widely in function. Unique compounds on the surface of cancer cells
could act as input structures for FIGARO to target. Given the advances already achieved on the 50-year-old problem of protein structure prediction,
the availability of enormous distributed computing resources and the proven usability of sophisticated evolutionary
algorithms, FIGARO has considerable potential for future development.
