Protein Network Paper 2008

Current Protein and Peptide Science, 2008, 9, 28-38
Proteins As Networks: Usefulness of Graph Theory in Protein Science

Arun Krishnan1, Joseph P. Zbilut2, Masaru Tomita1 and Alessandro Giuliani3,*
1Institute of Advanced Biosciences, Keio University, Tsuruoka City, Japan; 2Molecular Biophysics and Physiology Dept., Rush University Medical Center, Chicago, USA; 3Environment and Health Dept., Istituto Superiore di Sanita, Viale Regina Elena 299, 00161, Roma, Italy
Abstract: The network paradigm is based on the derivation of emerging properties of studied systems by their representation as oriented graphs: any system is traced back to a set of nodes (its constituent elements) linked by edges (arcs) corresponding to the relations existing between the nodes. This allows for a straightforward quantitative formalization of systems by means of the computation of mathematical descriptors of such graphs (graph theory). The network paradigm is particularly useful when it is clear which elements of the modelled system must play the role of nodes and arcs respectively, and when topological constraints have a major role with respect to kinetic ones. In this review we demonstrate how nodes and arcs of protein topology are characterized at different levels of definition:
1. Recurrence matrix of hydrophobicity patterns along the sequence
2. Contact matrix of alpha carbons of 3D structures
3. Correlation matrix of motions of different portions of the molecule in molecular dynamics.
These three conditions represent different but potentially correlated reticular systems that can be profitably analysed by means of network analysis tools.
Keywords: Systems biology, protein folding, recurrence quantification analysis, molecular dynamics, computational biology.
INTRODUCTION
In an important paper [1], Hans Frauenfelder and Peter Wolynes focused on the peculiarity of protein science and specifically on the sequence-structure relation puzzle, where thorough and accurate knowledge of first principles and potentials (hydrophobic interactions, hydrogen bonding, size constraints etc.) acting at the microscopic level informs only empirical (and very inaccurate) predictions of the actual structure assumed by proteins in solution. Yet the next-level, mesoscopic principles needed to predict the 3D structure of proteins remained essentially unknown.
Protein folding as a topic has intrigued scientists for many years. Different models have been proposed for protein folding, arising from a host of theoretical [2-5], simulated [6-9] or experimental [10-12] techniques. Among the many different models that have been proposed are the classical nucleation-propagation model [13, 14], the nucleation-condensation model [15], the sequential and hierarchical model [15] and the modular model [16]. More recently, a unified model of protein folding that is based on the effective energy surface of a polypeptide chain has been introduced by Wolynes et al. [17]. According to this unified model, protein folding consists of a progressive organization of ensembles of partially folded structures that arise through multiple routes. A detailed review of the different models can be found in [12].
*Address correspondence to this author at the Environment and Health
Dept., Istituto Superiore di Sanita, Viale Regina Elena 299, 00161, Roma,
Italy; E-mail: alessandro.giuliani@iss.it
1389-2037/08 $55.00+.00
With the discovery of natively unfolded proteins and the increasing recognition of the role played by protein flexibility, the original form of the sequence-structure puzzle moved from a relatively clear (at least in principle) one-to-one mapping from a residue position along a linear chain to a vector in a three-dimensional coordinate system, to a much fuzzier formulation [18-20]. The perception of proteins as dynamical systems in which the relative positions of their residues vary in time changed the nature of the sequence/structure problem from a uniquely defined prediction task to a more fluid description (and prediction) of high-level features such as flexibility, stability, biological activity and folding rate [20].
Proteins occupy a unique position in the hierarchy of
natural systems, since they lie in a grey region between
chemistry and biology. From the biological side, although
any single protein would not be considered as alive, it does
not take many of them (plus a bit of nucleic acid) before lifelike behaviour begins to emerge (as in the case of viruses for
example). Still more puzzling is the behaviour of protein systems such as prion-like particles, which are able to rearrange the structural configuration of other systems that come into contact with them [21].
From a chemical viewpoint, proteins are linear heteropolymers that, unlike most synthetic polymers, are formed of
basically non-periodic sequences of 20 different monomers
[18]. While artificial, periodic, polymers are generally very
large extended molecules forming a matrix, the great majority of proteins fold as self-contained structures determined
by the sequence of monomers.
© 2008 Bentham Science Publishers Ltd.
Current Protein and Peptide Science, 2008, Vol. 9, No. 1
Thus, we can consider the particular linear arrangement of amino acids as a sort of "recipe" for making a water-soluble polymer with a well-defined three-dimensional architecture, keeping in mind that "well-defined" does not mean "fixed", given the above-mentioned role of dynamics [18].
The task of interpreting this recipe is naturally solved
by the specific physico-chemical environment (in terms of
pH, hydrophobicity, ionic strength, molecular crowding,
density, etc.) that transforms linear, static information into
3D dynamical information. Our task as modellers is to try to
understand some of these transformation rules.
MULTI SCALE MODELLING
A protein, like any complex system, can be described at
different scales of definition in both space and time. We can
simplify this characterization by the identification of three
main organizational levels: sequence, structure and dynamics.
The first step in protein modelling is finding the most
convenient formalization for connecting these three different
layers. Such a formalization should have the same basic elements at all the three levels (so that a mapping can be easily
performed) and, possibly, the same dimensionality of the
solution space (so as to avoid inconsistencies).
The amino acid residues are the most natural basic elements for protein modelling, given that they are the monomers constituting the polymers under study and are common to all three layers. This results in a physico-chemical description of specific residues at the sequence layer, a position in space for structure and a relative mobility for dynamics. The different characterization of the same elements (amino acid residues) relative to the three organization levels poses a problem for the dimensionality aspect of modelling: sequence is one-dimensional, and can be equated to a time series with time substituted by the order of the residues along the sequence, whereas structure is obviously a three-dimensional object. A molecular dynamics simulation lives in an M x N space with N being the number of residues and M (= 4) being the number of coordinates (3 spatial coordinates + time coordinate). These dimensionality considerations tacitly imply an absolute space existing independently from the particular system; what we call an extrinsic geometry [22, 23]. In other words, physico-chemical descriptors or space coordinates have an existence of their own, independent of the particular protein we are describing: they are an absolute system of coordinates.
On the contrary we can assume an intrinsic (relativistic) geometry in which the protein is no longer defined in terms of an absolute system of coordinates external to it, but by means of the relations between its basic elements in the different spaces [22, 23]. Thus we are no longer concerned about the actual values of the space coordinates of residue 13 of protein A but simply about which residues are at a given distance from residue 13. Similar statements can obviously be made for sequence physico-chemical descriptors (e.g. which residues have a hydrophobicity similar to residue 13) and dynamics (which residues move together with residue 13).
This shift of perspective allows us to project all three representations, namely sequence, structure and dynamics, onto N x N matrices whose elements are the relationships between the N residues in the different spaces. Thus in the case of sequence space, the protein will be represented by an N x N matrix whose columns and rows are the residues and whose elements are the similarities of the specific residue pairs with respect to a particular physico-chemical property [18]. In the case of structure, the same N x N matrix will have as elements the Euclidean distances between the pairs of residues [24], while in the case of dynamics the N x N matrix will report the correlation coefficients of residue trajectories in time [25]. In graph theory these matrices are called adjacency matrices to stress the fact that they report the topological relations between the elements: these topological relations can mirror ordinary 3D Euclidean space (contact matrix) or less intuitive (but nevertheless fully physically definable) spaces like the between-residues physico-chemical similarities (sequence) and the between-trajectories correlations (dynamics).
Fig. 1 reports a pictorial example of these different representations: quite obviously, the intrinsic geometry approach, taking into consideration the relationships between the elements of the system instead of looking for an absolute representation, allows us to use the same mathematical object (an N x N symmetrical matrix) for describing all the different aspects of protein architecture and behaviour, thereby offering a consistent basis for modelling. This provides an easy-to-use method for going from one layer to the other.
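As a minimal sketch of the N x N formalization discussed in this section, the fragment below builds one matrix per layer from toy data. The function names, the use of an absolute property difference as the sequence-layer relation, and all numerical values are assumptions made for illustration only; they are not taken from the cited works.

```python
import math

def property_difference_matrix(values):
    """Sequence layer: |p_i - p_j| for a per-residue property (0 means identical)."""
    n = len(values)
    return [[abs(values[i] - values[j]) for j in range(n)] for i in range(n)]

def distance_matrix(coords):
    """Structure layer: Euclidean distance between every pair of residues."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]

def pearson(x, y):
    """Pearson correlation of two equally long series (undefined for constant series)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(series):
    """Dynamics layer: correlation between the time series of every residue pair."""
    n = len(series)
    return [[pearson(series[i], series[j]) for j in range(n)] for i in range(n)]

# Hypothetical toy data for N = 3 residues (not real protein data).
hydrophobicity = [1.0, 1.5, -0.5]
coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]
displacements = [[0.0, 1.0, 2.0, 3.0], [0.0, 2.0, 4.0, 6.0], [3.0, 2.0, 1.0, 0.0]]

S = property_difference_matrix(hydrophobicity)
D = distance_matrix(coords)
C = correlation_matrix(displacements)
```

All three results are symmetric N x N matrices indexed by residue, which is the point of the intrinsic description: the same object supports the three layers.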
In panel a) of Fig. 1 the dynamics of the 1-40 β-amyloid peptide is reported [26]: the elements of the matrix are the covariances of the single residue trajectories in time, coloured by a scale proportional to their magnitude. The texture of the figure clearly shows the portions of the peptide that move together. Panel b) corresponds to the contact matrix of TNF-α [24]. Here a binary formalization of spatial distances between residues is adopted and a dot is inserted in the matrix for every residue pair whose elements are at a distance less than 6 Å and whose separation along the sequence is greater than three residues (so as to avoid trivial contacts between sequence neighbours). On both axes of the figure are reported the patches of the molecule that are known to be arranged into sheets, so that the correspondence between the elements of secondary structure and patterns in the matrix is made evident.
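The binary contact matrix of panel b) can be sketched as follows. This is an illustration under stated assumptions: the function name and the toy coordinates are hypothetical, the cutoff is taken to be in the same units as the coordinates, and the sequence filter is written so as to exclude pairs that are close along the chain, which is how the remark about avoiding trivial contacts is read here.

```python
import math

def contact_matrix(coords, cutoff=6.0, min_separation=3):
    """Binary N x N matrix: 1 if two residues are closer than `cutoff` in space
    and more than `min_separation` positions apart along the chain."""
    n = len(coords)
    contacts = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            far_in_sequence = (j - i) > min_separation
            close_in_space = math.dist(coords[i], coords[j]) < cutoff
            if far_in_sequence and close_in_space:
                contacts[i][j] = contacts[j][i] = 1  # keep the matrix symmetric
    return contacts

# Hypothetical hairpin-like toy chain of six points (not real coordinates).
toy_coords = [(0, 0, 0), (4, 0, 0), (8, 0, 0), (8, 4, 0), (4, 4, 0), (0, 4, 0)]
toy_contacts = contact_matrix(toy_coords)
```

In this toy chain only the pairs (0, 4), (0, 5) and (1, 5) are scored as contacts; spatially close pairs such as (1, 4) are discarded because they are only three positions apart.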
Panel c) corresponds to a Recurrence Plot (RP) of the
human P53 protein sequence coded in terms of the Miyazawa-Jernigan hydrophobicity scale of the constituent amino
acids [18]. As in the a) and the b) panels, here too the elements of the matrix correspond to a relation between the
corresponding i-th row and j-th column residues. In this case
the relationship is the similarity between the hydrophobicity
distribution of a patch of 4 residues having as the first element the i-th and j-th residue respectively. When the distance
between the two distributions is lower than a predetermined
threshold, a dot is blackened in the matrix thereby giving rise
to the RP corresponding to a binary N x N matrix. It is immediately noticeable that the natively unfolded transactivation region of P53 is represented as a strongly recurrent (highly repetitive) box between residues 61 and 98; equally noticeable is the resemblance between this portion and different patches along the sequence, which shows up as horizontal (vertical) lines in the plot. The N x N symmetric matrix is thus the common mathematical support for sequence, structure and dynamics of a specific protein.
Fig. (1). The figure reports the network formalization in terms of N x N adjacency matrix for the three layers of protein formalization: a) dynamics, covariance matrix of the motions of the Aβ 1-40 peptide, b) 3D structure, TNF-α contact matrix with the indication of secondary structure elements, c) sequence: hydrophobicity recurrence plot of P53 protein.
In order to compare different proteins, or even different
matrices of the same protein in a quantitative way we need to
derive some invariants out of these symmetric matrices:
these descriptors will allow us to compare the different N x
N matrices on a quantitative basis [27].
DERIVATION OF QUANTITATIVE DESCRIPTORS FROM ADJACENCY MATRICES
The network paradigm is the prevailing metaphor in biology; thus we read about gene networks [28-30], metabolic
networks [31, 32], ecological networks [33] as well as signalling networks [34]. In its most basic definition, a network
is a wiring diagram in which some elements (nodes) are connected by some relations (arcs, edges). The network can be
analysed by both purely topological approaches in which all
the nodes and edges are considered as equivalent (as in the
case of binary matrices of recurrences and contacts), as well
as dynamical approaches in which the relations take the form
of differential equations (or correlations, conditional probabilities and so forth).
Here we will concentrate on static approaches in which
the network (even when representing a dynamical system
like molecular dynamics simulations) can be considered as
fully represented by a wiring diagram like the one reported in Fig. 2 in which the nodes are the residues and the arcs the
relations between them (similarity in hydrophobicity distribution for sequence, covariance of motions for molecular
dynamics, contacts for 3D structures).
There are a number of synthetic numerical descriptors of
networks capturing the basic topological features of such
systems. For the sake of simplicity, let us start with the simplest wiring diagram of all, in which every arc has the same value, i.e. each contact is a contact and each recurrence is a recurrence, without going deeply into the intensity of the relation between the two nodes. In the case of proteins this is a very satisfactory simplification that encapsulates the essentials of the studied systems.
The basic mathematical object to approach the topological study of networks is the so-called graph. The graph is
defined as a tuple (V,E), with V as a set of vertices (or
nodes) and E as a set of edges (or arrows, arcs). The degree
of a node is the number of arcs connected to it [28].
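The definition of a graph as a tuple (V, E) and of the node degree can be made concrete with a short sketch. The edges are treated here as undirected, consistent with the symmetric matrices used in this review; the helper names and the four-node toy graph are hypothetical.

```python
def adjacency_from_edges(num_nodes, edges):
    """Build a symmetric 0/1 adjacency matrix from a list of (i, j) edges."""
    adj = [[0] * num_nodes for _ in range(num_nodes)]
    for i, j in edges:
        adj[i][j] = adj[j][i] = 1
    return adj

def degree(adj, i):
    """Number of arcs attached to node i (self connections are ignored)."""
    return sum(1 for j, linked in enumerate(adj[i]) if linked and j != i)

# Toy graph G = (V, E) with four nodes; purely illustrative.
V = range(4)
E = [(0, 1), (0, 2), (0, 3), (1, 2)]
adj = adjacency_from_edges(len(V), E)
node_degrees = [degree(adj, i) for i in V]
```

Node 0 is attached to all the other nodes, whereas node 3 has a single arc, so the degrees differ across the graph even in this small example.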
Fig. (2). Isomorphism between network graph formalization and adjacency matrix. The thickness of the edges of the graph and of the squares of the matrix is proportional to the intensity of the relation between the corresponding elements.
The most basic network feature is the possibility of reaching a given node i starting from another node j by a path along the graph: this possibility defines an equivalence relation, "to be connected to", that partitions a given graph G into equivalence classes called components, each made up of all the nodes that are connected to one another. The set of nodes mutually reachable inside a given graph is called a connected component. A graph G is called connected if it is made up of only one component. The number of distinct paths connecting two nodes can be considered as an index of connectivity of the graph nodes: the greater the number of paths connecting the two nodes, the more strongly the two nodes are correlated, and the higher their connectivity index.
This allows for a straightforward application of clustering algorithms able to delineate supernodes, i.e., groups of nodes highly connected among each other and forming functional modules. This metric property of topological graph representation has been exploited in many different fields, extending from organic chemistry (where the graphs are the molecules, with atoms as nodes and chemical bonds as edges [35]) to social networks [36]. In the case of protein domains, a very clear application with amino acid residues = nodes and contacts = edges can be found in [37]. In that paper, by means of pure network topology, it is possible to identify conformations in the folding transition state (TS) ensemble, and to provide a basis for understanding the heterogeneity of the TS and denatured state ensembles as well as the existence of multiple folding pathways. Network topology is described by means of intuitive computations. Each node of the graph can be labelled by the number of nodes connected to it (degree): this gives rise to the degree distribution P(k) describing the general wiring pattern of the network, having as the abscissa the number k of connections and as the ordinate the number of nodes having k connections. In analogy with statistical mechanics, these distributions are defined as scaling laws [38, 39].
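Two of the descriptors introduced in this section, connected components and the degree distribution P(k), can be computed directly from a binary adjacency matrix. The sketch below uses a breadth-first search for the components, which is a standard choice and not one prescribed by the cited works; the function names and the six-node toy matrix are hypothetical.

```python
from collections import Counter, deque

def connected_components(adj):
    """Return the components of an undirected graph as sorted lists of nodes."""
    n = len(adj)
    seen = [False] * n
    components = []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        queue, members = deque([start]), []
        while queue:  # breadth-first exploration of everything reachable from start
            u = queue.popleft()
            members.append(u)
            for v in range(n):
                if adj[u][v] and not seen[v]:
                    seen[v] = True
                    queue.append(v)
        components.append(sorted(members))
    return components

def degree_distribution(adj):
    """Map each degree k to the number of nodes having exactly k connections."""
    degrees = [sum(1 for j, a in enumerate(row) if a and j != i)
               for i, row in enumerate(adj)]
    return dict(sorted(Counter(degrees).items()))

# Toy network: edges 0-1, 0-2, 0-3, 1-2 and a separate pair 4-5.
toy_adj = [
    [0, 1, 1, 1, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1, 0],
]
toy_components = connected_components(toy_adj)
toy_pk = degree_distribution(toy_adj)
```

The toy graph is not connected: it splits into two components, and its degree distribution counts three nodes with one link, two with two links and one with three.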
Fig. 3 shows two typical kinds of scaling laws (node degree distributions): the panel on the left shows a Poisson distribution in which there is a privileged scale of number of connections and a decreasing number of nodes having less than average or more than average links. The panel on the right depicts a so-called scale-free network [28, 38], in which there is a large majority of nodes with a low number of connections and a very small number of nodes having a large number of links. These highly connected nodes are called hubs. The tendency of having subsets of nodes strongly connected among them can be measured by the so-called aggregation coefficient. Consider a generic node i of the network having k(i) edges connecting it to other k(i) nodes. In order that these nodes possess the maximal connectivity (each node connected to each other) we should have a total number of edges equal to:
$$\frac{k(i)\,(k(i) - 1)}{2} \qquad (1)$$
Expression (1) corresponds to the maximal number of connections among k(i) nodes when self connections are avoided. Thus, it is perfectly natural to define the aggregation coefficient in terms of the ratio between the number of actually observed connections (Ei) and the maximal number of connections expressed by (1). Thus, the aggregation coefficient relative to node i, Ci, is expressed as:
$$C_i = \frac{2\,E_i}{k(i)\,(k(i) - 1)} \qquad (2)$$
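Expressions (1) and (2) translate directly into code. In the sketch below the function names and the toy matrix are hypothetical, and assigning a coefficient of zero to nodes with fewer than two neighbours (where expression (2) is undefined) is a convention assumed here, not one stated in the text.

```python
def aggregation_coefficient(adj, i):
    """C_i = 2 * E_i / (k(i) * (k(i) - 1)), with E_i the links among i's neighbours."""
    neighbours = [j for j, linked in enumerate(adj[i]) if linked and j != i]
    k = len(neighbours)
    if k < 2:
        return 0.0  # assumed convention: the ratio is undefined for k < 2
    e_i = sum(
        1
        for a in range(k)
        for b in range(a + 1, k)
        if adj[neighbours[a]][neighbours[b]]
    )
    return 2.0 * e_i / (k * (k - 1))

def network_aggregation(adj):
    """Average of C_i over all the nodes of the network."""
    return sum(aggregation_coefficient(adj, i) for i in range(len(adj))) / len(adj)

# Toy network with edges 0-1, 0-2, 0-3 and 1-2.
toy_net = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
c0 = aggregation_coefficient(toy_net, 0)
c_mean = network_aggregation(toy_net)
```

Node 0 has three neighbours, so at most three links could exist among them by expression (1); only the link 1-2 is present, giving C_0 = 1/3.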
The aggregation coefficient for the entire network corresponds to the average of Ci over all the nodes. The operative
counterpart of clustering tendency is the concept of modularity, that is, the possibility to isolate portions of a more general network that can be considered as partially independent
sub-networks (also known as modules) that can be studied
as such, without necessarily referring to the whole network.
This is the same as the concept of stable classification in
classical multidimensional statistics, in which well behaved clusters are defined as collections of statistical units
very near each other (in the network language having many
mutual connections) and distant from the elements of the
other cluster (in network language, having only a few arcs
connecting elements of different modules) [28].
The above defined measures, when applied to N x N adjacency matrices (protein networks) in sequence, structure
and dynamic spaces allow us to quantitatively describe protein systems at different levels of specification by means of
directly comparable measures. In the following we sketch a
path from protein sequence through protein structure and
protein dynamics network-based representations (by the
agency of the relative N x N adjacency matrices) showing
how the use of a common mathematical support can help to
derive some interesting conjectures about the protein folding
process.
PROTEINS DISPLAY SIX RESIDUE LONG HYDROPHOBICITY PATTERNS ALONG THEIR SEQUENCE
Protein sequences are, with rare exceptions (e.g. fibrous polymerising proteins such as collagen or silk), quasi-random strings of symbols with scant evidence of order or periodicity: a reliable estimate of the entropy reduction due to the autocorrelation of residues in an average protein sequence is only about 1% [27].
Fig. (3). Two possible scalings of the degree of nodes (k) and their relative frequency (p(k)); left: Poisson scaling, right: scale-free distribution.
Nevertheless, such quasi-random strings are the basic
recipes producing refined three-dimensional structures that
sustain specific physiological roles. Thus, the observed
quasi-randomness may be a specious image obscuring the
underlying meaning [27]. It is interesting to note that a similar situation occurs in the case of human languages where it
is almost impossible to generate meaningful texts using just
periodic repetitions of symbols. Nevertheless, even if very
weak, the presence of regularities in both human texts and
protein sequences is of utmost importance for deriving important hints about the underlying message. This concept
was set forth in a very clear manner by Rackovsky [40] who
was able to demonstrate the presence of specific sequence
signals in terms of autocorrelation structures of different
physical properties when used to code diverse sets of protein
sequences. In [27], a set of 1141 eukaryotic protein sequences (ftp://ftp.ebi.ac.uk/pub/contrib/swissprot/testsets) was analysed in search of such syntactic sequence regularities and of their possible biochemical role. The 1141 sequences were encoded by means of different physico-chemical properties of residues, thereby obtaining different sets of numerical series, each series corresponding to a specific protein sequence coded by a specific property of its constituent amino acids. Each sequence was submitted to Recurrence Quantification Analysis (RQA), a statistical method widely applied in many diverse fields [18]. As introduced before (see Fig. 1), the Recurrence Plot (RP) of a protein sequence depicts the N x N square matrix of similarities (for the specific physico-chemical property taken into consideration) between patches of amino acids along the sequence, putting a dot for each similarity above a given threshold. The RP can be safely viewed as an adjacency matrix (and consequently a network) whose nodes are the residues and whose links correspond to the scoring of a strong similarity in hydrophobicity (or any other property, but hydrophobicity is by far the most explored one) for the corresponding residue pairs. Such pairs are called recurrent [18] and, in network language, correspond to a pair of nodes linked by an edge.
Fig. 4 reports two such RPs together with the corresponding sequence encoded by means of Miyazawa-Jernigan hydrophobicity, which was demonstrated to be the property code showing the richest autocorrelation structure. RQA generates a numerical set of descriptors for RPs resembling the network invariants described above. The two most basic descriptors are:
1. Percent of Recurrence (REC): Percent of recurrent pairs (corresponding to the aggregation coefficient).
2. Percent of Determinism (DET): Percent of recurrent pairs forming diagonal lines in the RP relative to the total of recurrent pairs (each diagonal line is a path in the network, and DET reflects the number of paths, so DET corresponds to connectivity).
In order to measure the similarities of patches of residues
along the chain we must create a moving window scanning
the sequence corresponding to the length of the patches we
want to compare. In other words, we must set the number of
neighbours m, of a residue i so as to compare the distribution
of the property with the one relative to the m neighbours of
residue j. This window of length m is moved one residue at a time so as to consider all the possible pairs of residues along the sequence. The window length is called the embedding dimension, using the typical terminology of nonlinear dynamics (from which RQA derives). Another choice to be
made is the radius r corresponding to the maximum allowed
distance for two windows to be considered as recurrent (the
details of the method are fully explained in [18]).
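The construction of a recurrence plot and of the REC and DET descriptors can be sketched as follows. This is an illustration only: the Euclidean distance between windows, the counting of pairs on the upper triangle, the minimum line length of three, the function names and the toy numerical series are all assumptions of this sketch, and the full method is the one described in [18].

```python
import math

def recurrence_plot(values, m=4, r=1.0):
    """Binary matrix: entry (i, j) is 1 when the windows of length m starting
    at positions i and j are closer than the radius r."""
    windows = [values[i:i + m] for i in range(len(values) - m + 1)]
    n = len(windows)
    return [[1 if math.dist(windows[i], windows[j]) <= r else 0
             for j in range(n)] for i in range(n)]

def rec_and_det(rp, min_line=3):
    """REC: percent of recurrent pairs among all pairs with i < j.
    DET: percent of recurrent pairs lying on diagonal lines of at least
    min_line consecutive points."""
    n = len(rp)
    total_pairs = n * (n - 1) // 2
    recurrent = in_lines = 0
    for offset in range(1, n):           # each diagonal above the main one
        run = 0
        for i in range(n - offset):
            if rp[i][i + offset]:
                recurrent += 1
                run += 1
            else:
                if run >= min_line:
                    in_lines += run
                run = 0
        if run >= min_line:              # close a line that reaches the border
            in_lines += run
    rec = 100.0 * recurrent / total_pairs if total_pairs else 0.0
    det = 100.0 * in_lines / recurrent if recurrent else 0.0
    return rec, det

# Hypothetical, perfectly periodic series standing in for a coded sequence.
toy_series = [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3]
toy_rp = recurrence_plot(toy_series, m=4, r=0.1)
toy_rec, toy_det = rec_and_det(toy_rp)
```

For this artificial periodic series every recurrent pair lies on a long diagonal line, so DET is 100%; real protein sequences, being quasi-random, give far lower values.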
It can be observed from Fig. 5 that determinism (DET) reaches a maximum for the entire set of proteins at an embedding dimension of four residues; this corresponds to a typical "word length" of six if we consider the need (for scoring a deterministic line) to have at least three consecutive recurrent pairs to obtain a diagonal deterministic line: three consecutive windows of four residues, each shifted by one position, span 4 + 3 - 1 = 6 residues. This tells us of a possible characteristic length of six for protein quasi-repeats (patches of amino acids with similar hydrophobicity distribution). It is worth noting that the proteins
having the highest DET were those having both the highest
percentage of natively unfolded (or very flexible) portions
and the highest number of interactions with other proteins
[27].
The above evidence was reinforced by the analysis [41] of a different data set of 1977 single-chain protein structures solved by X-ray diffraction and obtained from CATH v2.6.0 (April 2005) (http://cathwww.biochem.ucl.ac.uk/latest/lists/index.html) in Cath List Format (CLF). In this set too (completely independent of the other one) a characteristic length of six for the hydrophobic word was evident. Moreover, the size of the hydrophobic word was completely independent of the total length of the protein in which the word is embedded [41], exactly as we would expect for real words, whose length is clearly independent of the total length of the texts that they are part of.
Fig. (4). Two Recurrence Plots together with the hydrophobicity coded sequence that generated them: a) An enzyme (transferase, swiss-prot code Q07357) with a relatively low recurrence rate, b) A protein engaged in a lot of protein-protein interaction (GATA binding factor, swiss-prot code: P52618).
Fig. (5). Determinism scaling at varying embedding dimension for two different choices of recurrence thresholds (radius).
The relevance of a characteristic length of six residues is supported by other avenues of research. Schwartz and King [42] demonstrated a strong bias against blocks of hydrophobic strings deviating from expected frequencies at about six residues of block length. In another study evaluating the potential for protein knotting, Lua and Grossberg [43] point out that knots are relatively rare, and that chains beyond six residues quickly increase their chances of interpenetration, thus promoting aggregation. Finally, an analysis of nucleation cores carried out by Compiani et al. [44] on the basis of a structural entropy criterion shows an average length of 6.12 for these cores [27, 44]. This is a crucial point: Compiani et al. looked for sequence elements that maintained their local folding irrespective of the sequence they are embedded into [44]. The characteristic length of 6.12 they found tells us that six-residue patches maintain their individual features, thus possibly giving rise to a mutual recognition of identical words both inside the same protein (folding cores) and between different proteins (aggregation cores). The discovery of a characteristic length for the repetition of hydrophobically homogeneous residue patches along the sequence was obtained by purely topological considerations on the basis of an N x N matrix reporting the between-residues similarities in hydrophobicity. This topological characteristic at the sequence level was demonstrated to have both structural correlates (proteins with the highest number of repetitive patches tend to be natively unfolded) and functional correlates (proteins with many recurrent patches tend to have more protein-protein interactions). It is important to try to correlate this sequence-level feature with analogous features at the level of the network-based representation of 3D structure constituted by the so-called contact matrix (Fig. 1b). In the next section we explore whether the discovery of a characteristic length of six for hydrophobicity patches, highlighted by graph theoretical approaches [27, 41] and confirmed by several independent lines of evidence [42, 43, 44], has a direct counterpart in the 3D structure network topology.
PROTEIN STRUCTURAL MODULES (6+6=12)

According to the conjecture of a six residues characteristic length acting as the basic folding unit, by the mutual recognition of two similar words along the sequence, we must
find some evidence of a basic structural unit of characteristic
length 12 when shifting from sequence (RP) to 3D structure
(contact matrix) representation of the proteins.
In network language this corresponds to the demonstration of a characteristic size of 12 for network modules, i.e. for portions of the network whose nodes have a much larger number of contacts among them than with other portions of the network [45]. To accomplish this task we used a subset of structures that share < 20% identity with each other and have been determined with a resolution < 2 Å, resulting in a total of 1420 proteins that were culled from the PDB and obtained from the protein-culling server PISCES [46]. All entries contained a single chain. The identification of network structural modules starting from the inter-residue contact matrix of the different proteins was carried out by the algorithm developed by Guimera and Amaral [45]. This algorithm allows for an unbiased definition of what a module is, in terms analogous to the definition of a well-formed cluster in multidimensional statistics: a module (cluster) is a set of nodes (in our case amino acid residues) maximizing the ratio: Within Cluster links (contacts) / Between Cluster links.
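The quantity being maximized can be illustrated by counting within-module and between-module links for a given assignment of residues to modules. The sketch only evaluates a candidate partition; it does not perform the optimization of [45]. The function name, the module labels and the toy network (two triangles joined by one link) are hypothetical.

```python
def within_between_links(adj, modules):
    """Count links joining nodes of the same module and of different modules.
    modules[i] is the module label assigned to node i."""
    within = between = 0
    n = len(adj)
    for i in range(n):
        for j in range(i + 1, n):        # each undirected link counted once
            if adj[i][j]:
                if modules[i] == modules[j]:
                    within += 1
                else:
                    between += 1
    return within, between

# Two triangles (0-1-2 and 3-4-5) joined by the single link 2-3.
two_triangles = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
good_split = within_between_links(two_triangles, [0, 0, 0, 1, 1, 1])
poor_split = within_between_links(two_triangles, [0, 0, 1, 1, 1, 1])
```

The partition that follows the two triangles gives six within-module links against one between-module link, whereas moving node 2 to the other module lowers the ratio to five against two.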
The algorithm converges towards the maximization of the modularity of the analysed network and allows for a representation of the single nodes (residues) in terms of their intra-module degree, z, and their participation coefficient, P, corresponding to the relative position of a residue well inside the module (high z) or at an inter-module frontier (high P). The algorithm was separately applied to the different proteins of the data set by means of an optimization method based on a Genetic Algorithm. Having subdivided all the 1420 proteins into modules, we performed some statistics on module length.
Figure 6, reporting the module length distribution for the entire data set, gives a clear-cut confirmation of our conjecture. The figure shows a distribution of the modules by size across the 1420 different proteins. It can be observed that the peak of this distribution lies around a module size of 12 amino acid residues. Even for structural modules we observed a basic invariance of size with respect to the length of the proteins they are embedded into.
Fig. (6). Distribution of structural modules frequency vs. length according to [45]. The maximal frequency of modules is achieved for a size of 12.
This network-based representation allows us to characterize the single residues in terms of their topological role in the network by means of the P and z coefficients, i.e., in terms of the relative role of different residues in connecting different modules (P) or their central position inside the module (z).
The representation of the single proteins in the P vs. z space produces graphs with an invariant general shape, as is evident from Fig. 7 and Fig. 8, reporting the P vs. z space for a single protein and for the superposition of the 1420 graphs relative to the entire set respectively.
The invariance of the P vs. z plot for very different protein structures forces us to look for a characterization of the different portions of these graphs in terms of a possible link between the physiological role of residues and their representation in terms of network invariants.
Fig. (7). Single protein (ubiquitin) P vs. z graph (see text for explanation).
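The intra-module degree z and the participation coefficient P are not written out in the text. The sketch below assumes the definitions commonly attributed to Guimera and Amaral [45]: z is the within-module degree of a node standardized over its own module, and P equals one minus the sum, over modules, of the squared fraction of the node's links that fall in each module. The function name, the use of the population standard deviation and the toy network are assumptions of this sketch.

```python
import math

def z_and_p(adj, modules):
    """Return (z, P) lists for every node, given module labels modules[i]."""
    n = len(adj)
    labels = sorted(set(modules))
    # links_to[i][s]: number of links from node i to nodes of module s
    links_to = [{s: 0 for s in labels} for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and adj[i][j]:
                links_to[i][modules[j]] += 1
    within = [links_to[i][modules[i]] for i in range(n)]

    z = [0.0] * n
    for s in labels:
        members = [i for i in range(n) if modules[i] == s]
        mean = sum(within[i] for i in members) / len(members)
        sd = math.sqrt(sum((within[i] - mean) ** 2 for i in members) / len(members))
        for i in members:
            z[i] = (within[i] - mean) / sd if sd > 0 else 0.0

    p = []
    for i in range(n):
        k = sum(links_to[i].values())
        if k == 0:
            p.append(0.0)                # isolated node: no participation
        else:
            p.append(1.0 - sum((c / k) ** 2 for c in links_to[i].values()))
    return z, p

# Toy network: node 0 is linked to 1, 2, 3 (module 0); nodes 4, 5 form module 1;
# the link 3-4 connects the two modules.
toy_links = [(0, 1), (0, 2), (0, 3), (3, 4), (4, 5)]
toy_matrix = [[0] * 6 for _ in range(6)]
for a, b in toy_links:
    toy_matrix[a][b] = toy_matrix[b][a] = 1
toy_z, toy_p = z_and_p(toy_matrix, [0, 0, 0, 0, 1, 1])
```

In the toy network node 0 has the highest z and P = 0 (a central node with no links outside its module), while node 3 has a negative z and P = 0.5 because its links are split evenly between the two modules.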
Looking at the above figure, it is evident that P and z tend to be negatively correlated. This is not surprising if we consider the meaning of these two descriptors: a high P points to a role of inter-module connector (and thus in many cases to a peripheral position with respect to the residue's own module), while a high z points to a central position of the residue inside its module. For this reason, of specific interest are those residues that are characterized by an extremely high absolute value of the P/z ratio. These residues attain a much higher inter-module connecting role than would be expected from the obvious P vs. z negative correlation. They are the best candidates to play the role of the so-called non-hub connectors indicated by Guimera and Amaral [45] as the most critical nodes in a network. This was actually the case for three analysed model systems, ubiquitin (PDB: 1UBQ), hen lysozyme (PDB: 1E8L) and RNAse A (PDB: 7RSA), where residues with high absolute P/z values were demonstrated to correspond to residues that are protected during the transition state (unpublished results).
Fig. (8). P vs. z graph superposition for the whole set of proteins (see text for explanation).
The relevance of a network-based description of protein residues for the prediction of different folding features was recognized by different groups [37, 47, 48, 49]. Of particular interest is the work by Karplus and coworkers [50], which makes explicit use of the aggregation coefficient averaged over all the residues of the protein (clustering coefficient):
$$C = \frac{1}{N} \sum_{k} \frac{2 n_k}{n_k \, [n_k - 1]} \qquad (3)$$
where N is the number of residues and n_k is the number of links of the k-th residue, together with the graph path length corresponding to the average path connecting nodes in the graph.
These two network invariants were applied to 978 representative proteins from the PDB, revealing a small world architecture of proteins (the same depicted in panel B of Fig. 3) with a few vertices working as hubs and in many cases corresponding to the key residues for folding. This is a crucial result, complementing the previously described finding by Rao and Caflisch [37] on a much larger scale.
We can summarize the take-home message of this partial description of network based analysis of 3D structure of proteins by the following points:
The topology of residue contacts in the 3D structure allows for the detection of modules. These structural modules are reminiscent of the hydrophobic modules detected at the sequence level.
The residues at the frontier of two distinct modules play privileged roles in protein folding.
DYNAMICAL NETWORKS
As remarked in the Introduction, the peculiarity of proteins is their position in between simple and complex systems [1]: this makes proteins one of the most fruitful and intriguing territories for many diverse explorations carried on by scientists with very different backgrounds. Physically inclined scientists are particularly attracted by the field of molecular dynamics simulation, i.e., by the possibility of investigating the motions of proteins by means of computationally intensive approaches. The basic recipe is more or less as follows: start with a known 3D structure defined in terms of mutual positions of the residues, add some solvent molecules (this is not mandatory but adds realism), include all the known potentials that act on the above elements, define some general boundary conditions (i.e. temperature, ionic strength, pH, etc. in the correct physical formulation) and start the simulation by means of an algorithm that progressively adjusts the three dimensional coordinates of the protein residues, applying to them, at each discrete step of the simulation, all the known potentials [25, 26, 51, 52].
The basic recipe can obviously be changed in myriads of possible directions to account for many different situations: insertion of a mutation, change of pH, change of solvent, etc.
In any case, the output of the process consists of a huge amount of numbers made of all the different positions of all the atoms of the system at all time steps. This situation calls for some efficient formalization so as to derive an understandable message from this deluge of information. The most common way to accomplish this task is to collapse all the information to the covariance of the motions of different residues in time, ending in correlation maps similar to the one reproduced in panel a) of Fig. 1. Again we have an N x N symmetric matrix with rows and columns corresponding to the N residues of the studied protein and the elements corresponding to the covariance of the time series of the positions of the residues [25, 26]. The N x N covariance matrix can be considered as a network (as can any symmetric matrix reporting a measure of the relationship between the row and column elements), but in this case the relationships (covariances) are generally considered in terms of their actual quantitative values instead of being dichotomised (presence or absence of an edge) as we observed for sequence and structure networks. In network terms the covariance matrix is a labelled graph in which each arc has a value (label) corresponding to the strength of the corresponding relation [28].
The most time-honoured method to extract the relevant information from such covariance matrices is Principal Component Analysis (PCA). This is perhaps the most versatile statistical method ever developed, allowing for an appreciation of the studied phenomenon halfway between the hard-sciences style based on differential equations and the pure post-hoc statistical data analysis typical of biomedical studies [53].
The same algorithm (with minor modifications) takes the names of SVD (Singular Value Decomposition), SSA (Singular Spectrum Analysis), or Karhunen-Loève decomposition depending upon the scientific discipline which employs it (e.g., physicists, climate scientists, engineers) [54]. Basically the method consists of the extraction of the eigenvalues and corresponding eigenvectors of the N x N covariance matrix in order of explained variance. Thus the first components collect the most important motions of the protein system [25].
Given that we are talking of a covariance matrix, most important means explaining the major portion of coherent dynamics. This is to say that the major components describe
the coordinated displacements of protein domains. This is
the analogue of looking for modules of the network system
where a module is a portion of the molecule that has a coherent displacement of the constituent elements. Again we are
working around the concept of structure in its very basic
meaning of optimal dissection of a whole into its parts and
the connections between the parts.
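The procedure just described can be sketched in a few lines. This is only an illustration under stated assumptions, not the protocol of the cited works: the trajectory is assumed to be already superposed on a reference structure, each residue is represented by a single position (e.g. its alpha carbon), and the N x N covariance is taken as the time-averaged dot product of residue displacement vectors, which is one common choice. The function names are hypothetical.

```python
import numpy as np

def residue_covariance(traj):
    """N x N covariance of residue motions.

    traj : array of shape (T, N, 3), T frames of N residue positions,
           assumed already aligned to a common reference frame.
    """
    traj = np.asarray(traj, dtype=float)
    disp = traj - traj.mean(axis=0)  # displacement from the mean position
    # C[i, j] = time average of the dot product disp_i(t) . disp_j(t)
    return np.einsum("tik,tjk->ij", disp, disp) / traj.shape[0]

def principal_components(cov):
    """Eigen-decomposition of a symmetric covariance matrix.

    Returns eigenvalues (descending), eigenvectors (as columns, same order)
    and the fraction of total variance explained by each component.
    """
    evals, evecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
    order = np.argsort(evals)[::-1]
    evals, evecs = evals[order], evecs[:, order]
    explained = evals / evals.sum()
    return evals, evecs, explained
```

Residues with large loadings of the same sign on a leading eigenvector move coherently, which is the sense in which a principal component plays the role of a dynamical module.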
There is a significant literature on the application of eigenvalues/eigenvector methods to molecular dynamics simulations to which the reader is directed for further reading [25,
55-57].
Here it is worth noting that the extraction of eigenvalues /
eigenvector spectrum is a classical way to analyse different
network systems in order to derive crucial information such
as network modularity or stability [28,58]. Particularly interesting from our point of view is the analysis by means of
SVD of the N x N distance matrix between side chains of
protein residues in crystal structures so as to derive protein
domains in a mathematically objective way [59].
How can we connect the dynamical perspective of molecules with the sequence and 3D structure views? In [60] the authors analyse the differences in the dynamical behaviour of Aβ40, a peptide of 40 residues involved in the pathogenesis of Alzheimer's disease by means of the mutual aggregation of different monomers to form supra-molecular complexes endowed with neurotoxic activity.
The aggregation process was demonstrated [60] to be mediated by the presence of highly flexible regions along the molecule acting as aggregation hot spots. The amount of flexibility of these regions (measured as RMSD of the residues) is deeply influenced by pH; thus the molecular dynamics simulation was performed at three pH values: low (range 2-4), medium (range 5-6) and neutral (denominated L, M and N respectively), of which M corresponded to the highest flexibility of the molecule. The highest flexibility condition (M) had, as its experimental counterpart, a much higher aggregation rate of the peptide [26].
Figure 9 allows for an immediate appreciation of the link existing between more recurrent and deterministic patches along the molecule (as appreciated by the application of RQA on the Miyazawa-Jernigan coded peptide sequence) and the most mobile patches expressed by the histogram of RMSD of the single residues for the three pH conditions (Fig. 9, left panel). This correspondence is made more cogent by the smoothing of both REC and RMSD values along the sequence as reported in the right panel of Figure 9.
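The smoothing step can be illustrated with a minimal sketch. The window width used for Fig. 9 is not stated here, so the default of 5 residues below is purely an assumption, as are the function names; the profiles would be per-residue values such as REC and RMSD.

```python
import numpy as np

def moving_average(profile, window=5):
    """Unweighted moving average along the sequence.

    Only positions where the full window fits are returned, so the output
    has len(profile) - window + 1 values.
    """
    profile = np.asarray(profile, dtype=float)
    if window < 1 or window > profile.size:
        raise ValueError("window must be between 1 and the profile length")
    kernel = np.ones(window) / window
    return np.convolve(profile, kernel, mode="valid")

def smoothed_correlation(profile_a, profile_b, window=5):
    """Pearson correlation between two profiles after identical smoothing."""
    a = moving_average(profile_a, window)
    b = moving_average(profile_b, window)
    return float(np.corrcoef(a, b)[0, 1])
```

Smoothing both profiles with the same window keeps them aligned residue by residue, so that their correspondence can be compared directly or summarized by a correlation coefficient.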
Fig. (9). On the left are reported the RP of the Aβ 1-40 peptide together with the RMSD of each residue in the three experimental conditions; on the right the same situation is depicted by means of a moving average (smoothing) procedure on the same data.
This case story closes the circle initiated by the discovery, in the data set of 1141 eukaryotic proteins, of a statistical correlation between the presence of repetitive patches along the sequence and both the features of being natively unfolded (another name for high flexibility) and having many protein-protein interactions (the general case of protein aggregation).
On the other hand the folding process is not qualitatively
different from protein aggregation, the only difference being
the intra-molecular (folding) as opposed to the intermolecular (aggregation) character of interactions [61, 62].
The folding side of the coin is represented by the resemblance between 3D structural modules (putative folding
units) and hydrophobic words along the sequence we discussed above.
The consideration of dynamical networks in the form of links (covariances) between trajectories of different residues (nodes) of a protein system ends the sequence-structure-dynamics path we proposed as the main path of exploration of the use of graph theory based approaches in protein science. The study of the molecular dynamics simulation of Aβ40 highlights the dynamical counterpart of protein repetitive patches (modules of the sequence based network representation) as the most mobile and aggregation prone portions of the sequence, thus indicating a strong link between the different sequence-structure-dynamics layers of protein description that was discovered by means of considering proteins as network systems.
Moreover, a very recent finding by the Nussinov group [69], which appeared in the literature while this review was in preparation, demonstrated by means of the decomposition of protein contact matrices into modules, in a way equivalent to the one described in this paper, that the inter-modular boundaries contain not only the most conserved residues of proteins but also the ones most crucial for allosteric communications. This signalling role of inter-module residues is an important result of general network theory that was already demonstrated in other biological networks, such as gene expression networks [70]. The confirmation of this organizational principle of network architecture in the case of protein topology could be one of the very few general theoretical principles holding for biological matter.
CONCLUSIONS
The above results indicate the feasibility and usefulness of a network formalization of proteins at different levels of definition. This formalization allows subtle sequence/structure/dynamics relationships to be clearly highlighted by the use of a common mathematical formalism derived from the consideration of a protein molecule as a network.
The network formalization allows for the projection of sequence, structure and dynamical features of protein molecules on three spaces of identical dimensionality, having as basic common elements the single residues that are the natural components of proteins. This allows us to approach the studied systems in an unbiased way, avoiding unjustified or arbitrary assumptions.
The correspondence between network structures and adjacency (or covariance) matrices allows for an immediate translation of classical mathematical tools used for dealing with covariance and correlation matrices (cluster techniques, spectral decomposition methods) into graph-based formalisms (topological invariants, connectivity descriptors). This allows both for a cross-fertilization of different fields of investigation, ranging from systems biology to metabolic network analysis to structural biology, and for a direct translation of potentially useful results from statistical mechanics (based on network structures) into biological and chemical sciences. Vishveshwara and coworkers have utilized a graph theoretical representation of protein structures along with spectral analysis techniques to study such diverse aspects as identification of backbone and sidechain clusters [63, 64], determination of quaternary association [65, 66], identification of domains [67] and analysis of the stability properties of proteins [68].
ACKNOWLEDGEMENTS
This work was supported by a joint DMS/DGMS initiative to support mathematical biology, from the NSF and NIH (NSF DMS #0240230), to JPZ.
REFERENCES
[1] Frauenfelder, H. and Wolynes, P. (1994) Phys. Today, 47, 58.
[2] Karplus, M. and Weaver, D.L. (1994) Protein Sci., 3, 650.
[3] Karplus, M. and Sali, A. (1995) Curr. Op. Struct. Biol., 5, 58.
[4] Dill, K.A., Bromberg, S., Yue, K., Fiebig, K.M. and Yee, D.P. (1995) Protein Sci., 4, 561.
[5] Bryngelson, J.D. and Wolynes, P.G. (1987) Proc. Natl. Acad. Sci. USA, 84, 7524.
[6] Levitt, M. and Warshel, A. (1975) Nature, 253, 693.
[7] Skolnick, J. and Kolinski, A. (1992) Science, 250, 1121.
[8] Godzik, A., Skolnick, J. and Kolinski, A. (1992) Proc. Natl. Acad. Sci. USA, 89, 2629.
[9] Kolinski, A. and Skolnick, J. (1994) Proteins Struct. Funct. Genet., 18, 338.
[10] Baldwin, R.L. (1975) Ann. Rev. Biochem., 44, 453.
[11] Kim, P.S. and Baldwin, R.L. (1982) Ann. Rev. Biochem., 51, 459.
[12] Yon, J.M. (2002) J. Cell. Mol. Med., 6, 307.
[13] Zimm, B.H. and Bragg, J.K. (1959) J. Chem. Phys., 31, 526.
[14] Lifson, S. and Roig, A. (1961) J. Chem. Phys., 34, 1963.
[15] Jaenicke, R. (1987) Prog. Biophys. Mol. Biol., 49, 117.
[16] Wetlaufer, D.B. (1973) Proc. Natl. Acad. Sci. USA, 70, 697.
[17] Wolynes, P.G., Onuchic, J.N. and Thirumalai, D. (1995) Science, 267, 1618.
[18] Giuliani, A., Benigni, R., Zbilut, J.P., Webber, C.L., Jr., Sirabella, P. and Colosimo, A. (2002) Chem. Rev., 102, 1471.
[19] Romero, P., Obradovic, Z., Li, X., Graner, E.C., Brown, C.J. and Dunker, A.K. (2001) Proteins Struct. Funct. Genet., 42, 38.
[20] Uversky, V.N. (2002) Protein Sci., 11, 739.
[21] Abkevich, V.I., Gutin, A.M. and Shakhnovic, E.I. (1998) Proteins Struct. Funct. Genet., 31, 335.
[22] Yates, F.E. and Kugler, P.N. (1986) J. Pharm. Sci., 75, 1019.
[23] Tenenbaum, J.B., De Silva, V.D. and Langford, J.C. (2000) Science, 290, 2319.
[24] Webber, C.L., Giuliani, A., Zbilut, J.P. and Colosimo, A. (2001) Proteins Struct. Funct. Genet., 44, 292.
[25] Amadei, A., Linssen, A.B.M. and Berendsen, H.J.C. (1993) Proteins Struct. Funct. Genet., 4, 412.
[26] Valerio, M.C., Colosimo, A., Conti, F., Giuliani, A., Grottesi, A., Manetti, C. and Zbilut, J.P. (2005) Proteins Struct. Funct. Bioinf., 58, 110.
[27] Colafranceschi, M., Colosimo, A., Zbilut, J.P., Uversky, V.N. and Giuliani, A. (2005) J. Chem. Inf. Model., 45, 183.
[28] Palumbo, M.C., Farina, L., Colosimo, A., Tun, K., Dhar, P.K. and Giuliani, A. (2006) Curr. Bioinf., 2, 219.
[29] Smolen, P., Baxter, D.A. and Byrne, G. (2000) Bull. Math. Biol., 62, 247.
[30] Gardner, T.S. and Faith, J.J. (2005) Phys. Life Rev., 2, 65.
[31] Fiehn, O. and Weckwerth, W. (2003) Eur. J. Biochem., 270, 579.
[32] Stelling, J., Klamt, S., Bettenbrock, K., Schuster, S. and Gilles, E.D. (2002) Nature, 420, 190.
[33] Lassig, M., Bastolla, U., Manrubia, S.C. and Valleriani, A. (2001) Phys. Rev. Lett., 86, 4418.
[34] Sauro, M. and Kholodenko, B.N. (2004) Prog. Biophys. Mol. Biol., 86, 5.
[35] Lukovits, I. (2000) J. Chem. Inf. Comput. Sci., 40, 1147.
[36] McMahon, S.M., Miller, K.H. and Drake, J. (2001) Science, 293, 1604.
[37] Rao, F. and Caflisch, A. (2004) J. Mol. Biol., 342, 299.
[38] Amaral, L.A.N., Scala, A., Barthelemy, M. and Stanley, H.E. (2000) Proc. Natl. Acad. Sci. USA, 97, 1149.
[39] Barabasi, A.L. and Albert, R. (1999) Science, 286, 509.
[40] Rackovski, S. (1998) Proc. Natl. Acad. Sci. USA, 95, 8580.
[41] Zbilut, J.P., Chua, G.H., Krishnan, A., Bossa, C., Colafranceschi, M. and Giuliani, A. (2006) FEBS Lett., 580, 4861.
[42] Schwartz, R. and King, J. (2006) Protein Sci., 15, 102.
[43] Lua, R.C. and Grossberg, A.Y. (2006) PloS Comput. Biol., 2, 350.
[44] Compiani, M., Fariselli, P., Martelli, P.L. and Casadio, R. (1998) Proc. Natl. Acad. Sci. USA, 95, 9290.
[45] Guimera, R. and Amaral, L.A.N. (2005) Nature, 433, 895.
[46] Wang, G. and Dunbrack, R.L.J. (2003) Bioinf., 19, 1589.
[47] Bagler, G. and Sinha, S. (2005) Physica A, 346, 27.
[48] Kundu, S. (2005) Physica A, 346, 104.
[49] Higman, V.A. and Greene, L.H. (2006) Physica A, 368, 595.
[50] Vendruscolo, M., Dokholyan, N.V., Paci, E. and Karplus, M. (2002) Phys. Rev. E, 65, 061910.
[51] Car, R. and Parrinello, M. (1985) Phys. Rev. Lett., 55, 2471.
[52] Berendsen, H.J.C., Postma, J.P.M., Van Gustereen, W.F. and Di Nola, A. (1984) J. Chem. Phys., 81, 3684.
[53] Benigni, R. and Giuliani, A. (1994) Am. J. Physiol., 266, R1697.
[54] Preisendorfer, R.W. and Mobley, C.D. (1988) Principal Component Analysis in Meteorology and Oceanography, Elsevier, Amsterdam.
[55] Arcangeli, C., Bizzarri, A.R. and Cannistraro, S. (2001) Biophys. Chem., 90, 45.
[56] Chillemi, G., Falconi, M., Amadei, A., Zimatore, G., Desideri, A. and Di Nola, A. (1997) Biophys. J., 73, 1007.
[57] Peters, G.H., van Aalten, D.M., Edholm, O., Toxvaerd, S. and Bywater, R. (1996) Biophys. J., 71, 2245.
[58] Goh, K.I., Kahng, B. and Kim, D. (2001) Phys. Rev. E, 64, 051903.
[59] Kannan, N. and Vishveshwara, S. (1999) J. Mol. Biol., 292, 441.
[60] Zbilut, J.P., Colosimo, A., Conti, F., Colafranceschi, M., Manetti, C., Valerio, M.C., Webber, C.L. and Giuliani, A. (2003) Biophys. J., 85, 3544.
[61] Zbilut, J.P., Giuliani, A., Colosimo, A., Mitchell, J.C., Colafranceschi, M., Marwan, N., Webber, C.L. and Uversky, V. (2004) J. Proteome Res., 3, 1243.
[62] Chiti, F., Taddei, N., Baroni, F., Capanni, C., Stefani, M., Ramponi, G. and Dobson, C.M. (2002) Nature Struct. Biol., 9, 137.
[63] Kannan, N. and Vishveshwara, S. (1999) J. Mol. Biol., 292, 441-464.
[64] Patra, S.M. and Vishveshwara, S. (2000) J. Theor. Comp. Chem., 84, 13-25.
[65] Brinda, K.V., Mitra, N., Surolia, A. and Vishveshwara, S. (2004) Protein Sci., 13, 1735-49.
[66] Brinda, K.V., Surolia, A. and Vishveshwara, S. (2005) Biochem. J., 391, 1-15.
[67] Sistla, R.K., Brinda, K.V. and Vishveshwara, S. (2005) Proteins Struct. Funct. Bioinf., 59, 616-626.
[68] Brinda, K.V. and Vishveshwara, S. (2005) Biophys. J., 89, 4159-4170.
[69] Del Sol, A., Arauzo-Bravo, M.J., Amoros Moya, D. and Nussinov, R. (2007) Genome Biol., 8, R92.
[70] Yu, H., Kim, P.M., Sprecher, E., Trifonov, V. and Gerstein, M. (2007) PloS Comput. Biol., 3, e59.
Received: December 04, 2006
Revised: September 19, 2007
Accepted: September 19, 2007