Institute of Advanced Biosciences, Keio University, Tsuruoka City, Japan; 2Molecular Biophysics and Physiology
Dept., Rush University Medical Center, Chicago, USA; 3Environment and Health Dept., Istituto Superiore di Sanita,
Viale Regina Elena 299, 00161, Roma, Italy
Abstract: The network paradigm is based on the derivation of emerging properties of studied systems by their representation as oriented graphs: any system is traced back to a set of nodes (its constituent elements) linked by edges (arcs) corresponding to the relations existing between the nodes. This allows for a straightforward quantitative formalization of systems by means of the computation of mathematical descriptors of such graphs (graph theory). The network paradigm is particularly useful when it is clear which elements of the modelled system must play the role of nodes and arcs respectively, and when topological constraints have a major role with respect to kinetic ones. In this review we demonstrate how nodes and arcs of protein topology are characterized at different levels of definition:
These three conditions represent different but potentially correlated reticular systems that can be profitably analysed by
means of network analysis tools.
Keywords: Systems biology, protein folding, recurrence quantification analysis, molecular dynamics, computational biology.
INTRODUCTION
1389-2037/08 $55.00+.00
Proteins As Networks
Fig. 1 reports a pictorial example of these different representations: quite obviously, the intrinsic geometry approach,
taking into consideration the relationships between the elements of the system instead of looking for an absolute representation, allows us to use the same mathematical object (an
N x N symmetrical matrix) for describing all the different
aspects of protein architecture and behaviour, thereby offering a consistent basis for modelling. This provides an easy to
use method for going from one layer to the other.
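As an illustration of how one such N x N symmetrical matrix can be obtained for the structure layer, the following minimal sketch builds a binary contact matrix from residue coordinates. The function name `contact_matrix`, the 8.0 distance cutoff and the toy coordinates are assumptions made for the example and are not taken from the studies discussed here.

```python
import math


def contact_matrix(coords, cutoff=8.0):
    """Return an N x N symmetric 0/1 matrix.

    Entry (i, j) is 1 when residues i and j lie within `cutoff` of each
    other. The cutoff value is a hypothetical choice for illustration.
    """
    n = len(coords)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                # Fill both halves so the matrix stays symmetric.
                matrix[i][j] = 1
                matrix[j][i] = 1
    return matrix


# Hypothetical coordinates (one point per residue), for illustration only.
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
toy_contacts = contact_matrix(toy_coords)
```

The same matrix shape would hold a covariance matrix of residue motions or a recurrence matrix of a sequence property, which is what makes the descriptors comparable across layers.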
The amino acid residues are the most natural basic elements for protein modelling, given that they are the monomers constituting the polymers under study and are common to all three layers. This results in a physico-chemical description of specific residues at the sequence layer, a position in space for structure and a relative mobility for dynamics. The different characterization of the same elements (amino acid residues) at the three organization levels poses a problem for the dimensionality of the model: sequence is one-dimensional, and can be equated to a time series with time replaced by the order of the residues along the sequence, whereas structure is a three-dimensional object. A molecular dynamics simulation lives in an M x N space, with N being the number of residues and M (= 4) the number of coordinates (3 spatial coordinates + 1 time coordinate). These dimensionality considerations tacitly imply an absolute space existing independently of the particular system: what we call an extrinsic geometry [22, 23]. In other words, physico-chemical descriptors or space coordinates have an existence of their own, independent of the particular protein we are describing: they are an absolute system of coordinates.
Krishnan et al.
Fig. (1). The figure reports the network formalization in terms of N x N adjacency matrix for the three layers of protein formalization: a) dynamics, covariance matrix of the motions of the Aβ 1-40 peptide; b) 3D structure, TNF-A contact matrix with the indication of secondary structure elements; c) sequence, hydrophobicity recurrence plot of the P53 protein.
Fig. (2). Isomorphism between network graph formalization and adjacency matrix. The thickness of the edges of the graph and of the squares
of the matrix is proportional to the intensity of the relation between the corresponding elements.
only one component. The number of distinct paths connecting two nodes can be considered as an index of connectivity of the graph nodes: the greater the number of paths connecting the two nodes, the more strongly the two nodes are correlated, and the higher their connectivity index.
This allows for a straightforward application of clustering algorithms able to delineate supernodes, i.e., groups of nodes highly connected to each other and forming functional modules. This metric property of topological graph representation has been exploited in many different fields, extending from organic chemistry (where the graphs are the molecules, with atoms as nodes and chemical bonds as edges) [35] to social networks [36]. In the case of protein domains, a very clear application with amino acid residues = nodes and contacts = edges can be found in [37]. In that paper, by means of pure network topology, it is possible to identify conformations in the folding transition state (TS) ensemble, and to provide a basis for understanding the heterogeneity of the TS and denatured state ensemble as well as the existence of multiple folding pathways. Network topology is described by means of intuitive computations. Each node of the graph can be labelled by the number of nodes connected to it (degree): this gives rise to the degree distribution P(k) describing the general wiring pattern of the network, having as the abscissa the number k of connections and as the ordinate the number of nodes having k connections. In analogy with statistical mechanics, these distributions are defined as scaling laws [38, 39].
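The degree and the degree distribution P(k) can be computed directly from an adjacency matrix. The sketch below is a minimal illustration; the function names and the four-node toy network are assumptions made for the example.

```python
from collections import Counter


def degrees(adjacency):
    """Degree of each node: the number of non-zero entries in its row."""
    return [sum(1 for entry in row if entry) for row in adjacency]


def degree_distribution(adjacency):
    """Map each degree k to the number of nodes having k connections."""
    counts = Counter(degrees(adjacency))
    return dict(sorted(counts.items()))


# Hypothetical four-node network: node 0 is linked to every other node.
toy_adjacency = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
toy_degrees = degrees(toy_adjacency)
toy_distribution = degree_distribution(toy_adjacency)
```

Plotting the keys of `toy_distribution` on the abscissa and its values on the ordinate gives the kind of curve discussed in the text.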
Fig. 3 shows two typical kinds of scaling laws (node degree distributions): the panel on the left shows a Poisson distribution in which there is a privileged scale for the number of connections and a decreasing number of nodes having fewer or more links than average. The panel on the right depicts a so-called scale-free network [28, 38], in which there is a large majority of nodes with a low number of connections and a very small number of nodes having a large number of links. These highly connected nodes are called hubs. The tendency to have subsets of nodes strongly connected among themselves can be measured by the so-called aggregation coefficient. Consider a generic node i of the network having k(i) edges connecting it to k(i) other nodes. For these nodes to possess the maximal connectivity (each node connected to every other), we should have a total number of edges equal to:
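$$\frac{k(i)\,(k(i) - 1)}{2} \qquad (1)$$
If E_i is the number of edges actually existing among these k(i) nodes, the aggregation coefficient of node i is:
$$C_i = \frac{2\,E_i}{k(i)\,(k(i) - 1)} \qquad (2)$$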
The aggregation coefficient for the entire network corresponds to the average of Ci over all the nodes. The operative counterpart of clustering tendency is the concept of modularity, that is, the possibility of isolating portions of a more general network that can be considered as partially independent sub-networks (also known as modules) and that can be studied as such, without necessarily referring to the whole network. This is the same as the concept of stable classification in classical multidimensional statistics, in which well-behaved clusters are defined as collections of statistical units very near each other (in network language, having many mutual connections) and distant from the elements of other clusters (in network language, having only a few arcs connecting elements of different modules) [28].
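The aggregation coefficient of a node, and its average over the network, can be sketched as follows. Here E_i is counted as the number of edges actually present among the neighbours of node i. The function names, the toy network and the convention of assigning 0 to nodes with fewer than two neighbours are assumptions made for the example.

```python
def aggregation_coefficient(adjacency, i):
    """C_i = 2 * E_i / (k * (k - 1)) for node i.

    k is the number of neighbours of i and E_i the number of edges
    found among those neighbours. Nodes with k < 2 get 0.0 here,
    which is a convention assumed for this sketch.
    """
    neighbours = [j for j, entry in enumerate(adjacency[i]) if entry and j != i]
    k = len(neighbours)
    if k < 2:
        return 0.0
    edges = sum(
        1
        for a in range(k)
        for b in range(a + 1, k)
        if adjacency[neighbours[a]][neighbours[b]]
    )
    return 2.0 * edges / (k * (k - 1))


def mean_aggregation_coefficient(adjacency):
    """Average of C_i over all the nodes of the network."""
    n = len(adjacency)
    return sum(aggregation_coefficient(adjacency, i) for i in range(n)) / n


# Hypothetical four-node network used only to exercise the functions.
toy_network = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
toy_c0 = aggregation_coefficient(toy_network, 0)
toy_mean_c = mean_aggregation_coefficient(toy_network)
```

In the toy network, node 0 has three neighbours with a single edge among them, so only one of its three possible neighbour pairs is connected.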
The measures defined above, when applied to N x N adjacency matrices (protein networks) in sequence, structure and dynamics spaces, allow us to describe protein systems quantitatively at different levels of specification by means of directly comparable measures. In the following we sketch a path from protein sequence through protein structure to protein dynamics network-based representations (by means of the corresponding N x N adjacency matrices), showing how the use of a common mathematical support can help to derive some interesting conjectures about the protein folding process.
PROTEINS DISPLAY SIX RESIDUE LONG HYDROPHOBICITY PATTERNS ALONG THEIR SEQUENCE
Protein sequences are, with rare exceptions (e.g. fibrous polymerising proteins such as collagen or silk), quasi-random strings of symbols with scant evidence of order or
Fig. (3). Two possible scalings of the degree of nodes (k) against their relative frequency (p(k)): left, Poisson scaling; right, scale-free distribution.
Fig. 4 reports two such RPs together with the corresponding sequence encoded by means of Miyazawa-Jernigan hydrophobicity, which was demonstrated to be the property code showing the richest autocorrelation structure. RQA generates a numerical set of descriptors for RPs resembling the network invariants described above. The two most basic descriptors are:
1. Percent of Recurrence (REC): Percent of recurrent pairs (corresponding to the aggregation coefficient)
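A recurrence plot and its percent of recurrence can be sketched as below, under stated assumptions: the numeric series is cut into overlapping windows, two windows are called recurrent when their Euclidean distance is within a radius, and REC is the percentage of recurrent pairs among all distinct pairs. The window length, the radius and the synthetic series are hypothetical choices for illustration; they are not real hydrophobicity values nor the parameters used in the studies discussed here.

```python
import math


def recurrence_matrix(series, window=6, radius=1.0):
    """Binary matrix marking pairs of windows closer than `radius`."""
    windows = [series[i:i + window] for i in range(len(series) - window + 1)]
    n = len(windows)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            distance = math.sqrt(
                sum((a - b) ** 2 for a, b in zip(windows[i], windows[j]))
            )
            if distance <= radius:
                matrix[i][j] = 1
    return matrix


def percent_recurrence(matrix):
    """Percentage of recurrent pairs among all distinct pairs (i < j)."""
    n = len(matrix)
    if n < 2:
        return 0.0
    total_pairs = n * (n - 1) / 2
    recurrent = sum(matrix[i][j] for i in range(n) for j in range(i + 1, n))
    return 100.0 * recurrent / total_pairs


# Synthetic periodic series, used only to exercise the functions.
toy_series = [1.0, 2.0, 3.0] * 4
toy_rec = percent_recurrence(recurrence_matrix(toy_series, window=3, radius=0.1))
```

Counting only pairs with i < j excludes the trivial diagonal of the plot; other normalisation conventions exist and would change the numerical value of REC.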
between the physiological role of residues and their representation in terms of network invariants.
Fig. (7). Single protein (ubiquitin) P vs. z graph (see text for explanation).
Fig. (8). P vs. z graph superposition for the whole set of proteins (see text for explanation).
$$C = \frac{1}{N} \sum_{k} \frac{2\,n_k}{n_k\,[n_k - 1]} \qquad (3)$$
These two network invariants were applied to 978 representative proteins from the PDB, revealing a small-world architecture of proteins (the same depicted in panel B of Fig. 3) with a few vertices working as hubs and in many cases corresponding to the key residues for folding. This is a crucial result, complementing on a much larger scale the previously described finding by Rao and Caflisch [37].
We can summarize the take-home message of this partial description of network-based analysis of the 3D structure of proteins in the following points:
The topology of residue contacts in the 3D structure allows for the detection of modules. These structural modules are reminiscent of the hydrophobic modules detected at the sequence level.
The residues at the frontier of two distinct modules play privileged roles in protein folding.
DYNAMICAL NETWORKS
As remarked in the Introduction, the peculiarity of proteins is their position in between simple and complex systems [1]: this makes proteins one of the most fruitful and intriguing territories for many diverse explorations carried out by scientists with very different backgrounds. Physically inclined scientists are particularly attracted by the field of molecular dynamics simulation, i.e., by the possibility of investigating the motions of proteins by means of computationally intensive approaches. The basic recipe is more or less as follows: start with a known 3D structure defined in terms of the mutual positions of the residues, add some solvent molecules (this is not mandatory but adds realism), include all the known potentials that act on these elements, define some general boundary conditions (i.e. temperature, ionic strength, pH, etc. in the correct physical formulation) and start the simulation by means of an algorithm that progressively adjusts the three-dimensional coordinates of the protein residues, applying to them, at each discrete step of the simulation, all the known potentials [25, 26, 51, 52].
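The core loop of a molecular dynamics simulation (compute the forces from the potentials, then update the coordinates at each discrete step) can be illustrated with a deliberately tiny sketch. It is not a protein force field: it assumes a one-dimensional chain of beads joined by harmonic springs, integrated with the velocity Verlet scheme, which is one common choice of integrator and is not named in the text. All names and parameter values are hypothetical.

```python
def bond_forces(positions, rest_length=1.0, stiffness=10.0):
    """Forces from harmonic springs between consecutive beads (1D)."""
    forces = [0.0] * len(positions)
    for i in range(len(positions) - 1):
        stretch = (positions[i + 1] - positions[i]) - rest_length
        pull = stiffness * stretch
        forces[i] += pull       # a stretched bond pulls bead i forward
        forces[i + 1] -= pull   # and bead i + 1 backward
    return forces


def velocity_verlet(positions, velocities, steps, dt=0.01, mass=1.0):
    """Return the list of bead positions at every step of the simulation."""
    positions = list(positions)
    velocities = list(velocities)
    forces = bond_forces(positions)
    trajectory = [list(positions)]
    for _ in range(steps):
        # Half-step velocity update, full-step position update.
        velocities = [v + 0.5 * dt * f / mass for v, f in zip(velocities, forces)]
        positions = [x + dt * v for x, v in zip(positions, velocities)]
        # Recompute the forces at the new positions, then finish the velocity update.
        forces = bond_forces(positions)
        velocities = [v + 0.5 * dt * f / mass for v, f in zip(velocities, forces)]
        trajectory.append(list(positions))
    return trajectory


# Hypothetical three-bead chain starting with one stretched bond.
toy_trajectory = velocity_verlet([0.0, 1.5, 2.5], [0.0, 0.0, 0.0], steps=100)
```

Because the spring forces cancel in pairs, the centre of the chain stays fixed while the beads oscillate; a real simulation replaces `bond_forces` with the full set of potentials and adds solvent and boundary conditions.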
of Aβ40, a peptide of 40 residues involved in the pathogenesis of Alzheimer's disease by means of the mutual aggregation of different monomers to form supra-molecular complexes endowed with neurotoxic activity.
The aggregation process was demonstrated [60] to be mediated by the presence of highly flexible regions along the molecule acting as aggregation hot spots. The flexibility of these regions (measured as RMSD of the residues) is deeply influenced by pH; thus the molecular dynamics simulation was performed at three pH values: low (range 2-4), medium (range 5-6) and neutral (denominated L, M and N respectively), of which M corresponded to the highest flexibility of the molecule. The highest flexibility condition (M) had, as its experimental counterpart, a much higher aggregation rate of the peptide [26].
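A per-residue flexibility measure of this kind can be sketched as follows. The sketch assumes that the RMSD of each residue is taken around its own mean position over the trajectory (a quantity often called RMSF); the reference actually used in the cited simulations is not stated here, so this choice, the function name and the toy frames are all assumptions.

```python
import math


def per_residue_rmsd(frames):
    """RMSD of each residue around its mean position.

    `frames` is a list of snapshots; each snapshot is a list of
    (x, y, z) tuples, one per residue.
    """
    n_frames = len(frames)
    n_residues = len(frames[0])
    result = []
    for r in range(n_residues):
        mean = [sum(frame[r][d] for frame in frames) / n_frames for d in range(3)]
        mean_square = sum(
            sum((frame[r][d] - mean[d]) ** 2 for d in range(3))
            for frame in frames
        ) / n_frames
        result.append(math.sqrt(mean_square))
    return result


# Hypothetical two-frame trajectory: residue 0 is still, residue 1 moves.
toy_frames = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)],
]
toy_flexibility = per_residue_rmsd(toy_frames)
```

Running the same function on trajectories obtained under different conditions gives one flexibility profile per condition, which can then be compared residue by residue.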
Figure 9 allows for an immediate appreciation of the link between the more recurrent and deterministic patches along the molecule (as revealed by the application of RQA to the Miyazawa-Jernigan coded peptide sequence) and the most mobile patches, expressed by the histogram of RMSD of the single residues for the three pH conditions (Fig. 9, left panel). This correspondence is made more cogent by the smoothing of both REC and RMSD values along the sequence, as reported in the right panel of Figure 9.
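A moving-average smoothing of a per-residue profile can be sketched as below. The window width and the treatment of the sequence ends (shrinking the window instead of padding) are assumptions made for the example, and the profile values are synthetic.

```python
def moving_average(values, window=3):
    """Centred moving average; the window shrinks at the two ends."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - half)
        stop = min(len(values), i + half + 1)
        chunk = values[start:stop]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Synthetic alternating profile, for illustration only.
toy_profile = [1.0, 4.0, 1.0, 4.0, 1.0]
toy_smoothed = moving_average(toy_profile, window=3)
```

Applying the same window to two profiles defined along the same sequence keeps them aligned residue by residue, so their smoothed shapes can be compared directly.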
Fig. (9). On the left, the RP of the Aβ 1-40 peptide is reported together with the RMSD of each residue in the three experimental conditions; on the right, the same situation is depicted by means of a moving average (smoothing) procedure on the same data.
CONCLUSIONS
The correspondence between network structures and adjacency (or covariance) matrices allows for an immediate translation of classical mathematical tools used for dealing with covariance and correlation matrices (cluster techniques, spectral decomposition methods) into graph-based formalisms (topological invariants, connectivity descriptors). This allows both for a cross-fertilization of different fields of investigation, ranging from systems biology to metabolic network analysis to structural biology, and for a direct translation of potentially useful results from statistical mechanics (based on network structures) into biological and chemical sciences. Vishveshwara and coworkers have utilized a graph theoretical representation of protein structures along with spectral analysis techniques to study such diverse aspects as the identification of backbone and sidechain clusters [63, 64], the determination of quaternary association [65, 66], the identification of domains [67] and the analysis of the stability properties of proteins [68].
Moreover, a very recent finding by the Nussinov group [69], which appeared in the literature while this review was in process, demonstrated, by means of a decomposition of protein contact matrices into modules equivalent to the one described in this paper, that the inter-modular boundaries contain not only the most conserved residues of proteins but also the ones most crucial for allosteric communication. This signalling role of inter-module residues is an important result of general network theory that was already demonstrated in other biological networks, such as gene expression networks [70]. The confirmation of this organizational principle of network architecture in the case of protein topology could be one of the very few general theoretical principles holding for biological matter.
ACKNOWLEDGEMENTS
This work was supported by a joint DMS/DGMS initiative to support mathematical biology, from the NSF and NIH (NSF DMS #0240230), to JPZ.
Received: December 04, 2006