Proceedings of The Third International Conference on Data Mining, Internet Computing, and Big Data, Konya, Turkey 2016

Degree Distribution of Real Biological Networks


Vilda Purutcuoglu and Omolola Odunsi
Middle East Technical University
Department of Statistics, 06800, Ankara, Turkey
vpurutcu@metu.edu.tr; omolola.odunsi@gmail.com
ABSTRACT
Networks differ with respect to their topological features. The degree distribution, which describes the total number of links attached to a gene, is one of these features. In biological networks, the numbers of incoming and outgoing links have distinct names and are described by different distributions. The former is the in-degree and is typically represented by the exponential distribution. The latter is the out-degree and is represented by different distributions such as the power-law, generalized Pareto and geometric distributions. In this study, by using real systems of various dimensions, we investigate the validity of these alternatives and test whether the out-degree distribution can be represented by one of the families in the Pearson system.

KEYWORDS
Degree distribution, Pearson system, real data, systems biology.

1 INTRODUCTION

Networks are structures composed of nodes and edges, and they indicate the flow of processes between nodes via functional or physical interactions. In biological networks, the nodes refer to the genes, proteins or species in the system, and the edges show the interrelationships between the nodes. In these networks, the distribution of the links can be used to distinguish the type of network, as each type has different features. For instance, random networks have the Poisson distribution as the distribution of the links, whereas scale-free networks have the power-law density for the same distribution [3]. Therefore, the distribution of the links, also referred to as the degree distribution, is one of the topological features of biological networks. However, the study of Khanin and Wit (2006) [13] has shown that the degree distributions of 10 selected real networks indicate neither the power-law distribution nor the scale-free feature. Hence, they have suggested the truncated power-law, stretched exponential and geometric distributions.

In this study, we investigate whether a single distribution type in the Pearson family can cover all these alternative distributions besides the power-law while having a closed and explicit density formula. Such knowledge is also helpful for assigning the prior density of the degree distribution in Bayesian analyses of the construction of networks that are composed of high-dimensional data. For these purposes, we use 10 different biological networks of various dimensions, i.e., total numbers of genes, and check which types of the Pearson family of curves fit better. We also investigate whether they follow any of the underlying alternative distributions as the degree distribution. The study is organized as follows. In Section 2, we present the types of networks and their degree distributions, and we describe the Pearson family of distributions. In Section 3, we present our selected biological networks and check which type of the Pearson family fits these real networks best. We also check the validity of the other proposed degree distributions. Finally, in Section 4, we summarize the results and discuss future work.

ISBN: 978-1-941968-35-2 2016 SDIWC

2 METHODS
In the literature, there are different types of networks, which can be classified based on their components, their links and the distribution of the links [12]. If they are separated with respect to their components, the major components give the system its name, such as metabolic networks, which are composed of metabolites, or protein-protein interaction networks, which consist of a large number of proteins [9, 16]. If they are classified regarding their links, they are named directed and undirected networks. Finally, if they are grouped based on the distribution of the links, they are called homogeneous and non-homogeneous networks. The homogeneous networks, also known as the Erdos-Renyi networks, are networks in which each node has approximately the same number of connections. Thus, the distribution of these links or connections is represented by the Poisson distribution [3, 21, 15, 12].

On the other hand, the non-homogeneous networks are further separated into three branches, namely, the scale-free networks, the hierarchical networks and the modular networks. Among these alternatives, biological networks are represented by scale-free networks, which describe systems having highly connected nodes, also called hubs. In these networks, if a randomly selected node is excluded, the activation of the system is unlikely to change, whereas if the hubs are attacked, the system can degrade and its regulation can be destroyed. Furthermore, in these networks, the distribution of the links is defined with respect to the direction of the links. If we count the number of incoming links to each node, it can be described by the exponential distribution. In contrast, the distribution of the number of outgoing links per node can be represented by the power-law distribution. However, the geometric, stretched exponential, truncated power-law and generalized Pareto distributions, and combinations of these densities, are also suggested as alternatives to the power-law distribution since they also maintain the characteristics of biological networks such as centrality and lethality [13, 3, 12, 21].

2.1 Alternative Degree Distributions


In order to identify the degree distribution of
the biological networks, the following four
types of distributions are considered in the literature.
1. Generalized Pareto distribution
2. Geometric distribution
3. Stretched exponential distribution
4. Truncated power-law distribution.
The generalized Pareto distribution has the density

$$f(x) = \frac{a\, x_m^{a}}{x^{(a+1)}}$$

for $x \ge x_m > 0$ and $a > 0$, where $x_m$ denotes the scale (minimum) parameter. The $r$th moment of this density is derived from

$$U_r = \frac{a\, x_m^{r}}{a - r}$$

for $r < a$.
The second density, i.e., the geometric distribution, can be identified by the following expression.

$$f(x) = (1 - p)^{x}\, p, \quad x = 0, 1, 2, \ldots \tag{1}$$

whose $r$th moment can be generated from

$$U_r = \sum_{x=0}^{\infty} x^{r} f(x), \quad \text{with } U_1 = \frac{1 - p}{p}, \tag{2}$$

where the probability of success is shown by $p$.


The stretched exponential distribution has the probability density function $f(x) \propto \exp(-x^{\beta})$, obtained by inserting a power law into an exponential function, with the assumption that $x$ has the range $[0, +\infty)$. In this expression, $\beta$ is the stretching exponent, which lies between 0 and 1, and when $\beta = 1$ the expression reduces to the ordinary exponential function.
The stretched exponential is also known as the complementary cumulative Weibull distribution [18], and its higher moments are derived from

$$E(T^{n}) = \int_{0}^{\infty} t^{\,n-1} e^{-(t/T_k)^{\beta}}\, dt = \frac{T_k^{\,n}}{\beta}\, \Gamma\!\left(\frac{n}{\beta}\right) \tag{3}$$

in which $\Gamma$ is the gamma function and $T_k$ is the scale parameter. The density function of the Weibull distribution [1, 8] is denoted by

$$f(x; \lambda, \beta) = \frac{\beta}{\lambda} \left(\frac{x}{\lambda}\right)^{\beta - 1} \exp\!\left(-\left(\frac{x}{\lambda}\right)^{\beta}\right)$$

for $x \ge 0$, and $f(x; \lambda, \beta) = 0$ when $x < 0$. In this equation, $\lambda > 0$ represents the scale parameter and $\beta > 0$ indicates the shape parameter of the distribution.
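
The closed form on the right-hand side of Equation (3) follows from a single change of variables. The short derivation below is a sketch that writes the stretching exponent as $\beta$ and the scale as $T_k$; both symbol names are assumptions made here for readability.

```latex
\begin{align*}
\int_{0}^{\infty} t^{\,n-1} e^{-(t/T_k)^{\beta}}\,dt
  &= \int_{0}^{\infty} \left(T_k\, u^{1/\beta}\right)^{n-1} e^{-u}\,
     \frac{T_k}{\beta}\, u^{1/\beta - 1}\,du
     && \text{with } u = (t/T_k)^{\beta} \\
  &= \frac{T_k^{\,n}}{\beta} \int_{0}^{\infty} u^{\,n/\beta - 1} e^{-u}\,du \\
  &= \frac{T_k^{\,n}}{\beta}\,\Gamma\!\left(\frac{n}{\beta}\right).
\end{align*}
```

For $\beta = 1$ and $n = 1$ the expression reduces to $T_k$, the integral of an ordinary exponential decay.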
Finally, the truncated power-law distribution [7] is defined as

$$f(x) \propto x^{-\gamma}$$

for $x \ge a$ and $a > 0$, with a normalization factor that depends on the discrete nature of $x$. Here, when $1 < \gamma < 2$, the mean and the other moments are infinite. If $2 < \gamma < 3$, the mean exists but the other moments are infinite. Therefore, the density is truncated around $b$ [5, 6, 10, 11] such that

$$f(x) \propto x^{-\gamma} \quad \text{for } 0 < a \le x \le b,$$

where $b$ is the truncation point.

Figure 1. Diagram of the Pearson curve showing the distributions of the types I, III, VI, V and IV on a $\beta_1$ and $\beta_2$ plane [14].
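
The four candidate densities of Section 2.1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the argument names (`x_m`, `lam`, `beta`, `gamma`) are hypothetical, the geometric distribution is taken on the support 0, 1, 2, ..., and the truncated power law is treated as discrete and normalized by direct summation over the integers from a to b.

```python
import math


def pareto_pdf(x, a, x_m):
    """Pareto density a * x_m**a / x**(a + 1) for x >= x_m, else 0."""
    if x < x_m:
        return 0.0
    return a * x_m ** a / x ** (a + 1)


def geometric_pmf(x, p):
    """Geometric probability (1 - p)**x * p for x = 0, 1, 2, ..."""
    if x < 0 or x != int(x):
        return 0.0
    return (1.0 - p) ** x * p


def weibull_pdf(x, lam, beta):
    """Weibull density with scale lam and shape beta.

    Returns 0 for x <= 0; the point x = 0 is skipped to avoid a
    division by zero when beta < 1.
    """
    if x <= 0:
        return 0.0
    z = x / lam
    return (beta / lam) * z ** (beta - 1) * math.exp(-(z ** beta))


def truncated_power_law_pmf(k, gamma, a, b):
    """Discrete power law k**(-gamma) on the integers a..b, normalized to sum to 1."""
    if k < a or k > b or k != int(k):
        return 0.0
    norm = sum(j ** (-gamma) for j in range(a, b + 1))
    return k ** (-gamma) / norm
```

With `beta = 1` the Weibull density reduces to the exponential density with mean `lam`, which is the link between the stretched exponential and the exponential in-degree model mentioned in the text.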
2.2 Pearson Family of Distributions

The Pearson system [17] is a parametric family of distributions that uses four-parameter probability density functions to describe the density of random variables. The four parameters, which are denoted by $a$, $b_0$, $b_1$ and $b_2$ in the Pearson family, have a direct relation with the central moments ($\mu_1$, $\mu_2$, $\mu_3$, $\mu_4$) as below [20]. The explicit forms of these parameters are presented in the following expressions.

$$b_0 = \frac{\mu_2\,(4\beta_2 - 3\beta_1^{2})}{A'}, \tag{4}$$

$$b_1 = a = \frac{\mu_2^{1/2}\, \beta_1\, (\beta_2 + 3)}{A'}, \tag{5}$$

$$b_2 = \frac{2\beta_2 - 3\beta_1^{2} - 6}{A'}, \tag{6}$$

$$\beta_1 = \frac{\mu_3}{\mu_2^{3/2}}, \quad \text{i.e.,} \quad \beta_1^{2} = \frac{\mu_3^{2}}{\mu_2^{3}}, \tag{7}$$

$$\beta_2 = \frac{\mu_4}{\mu_2^{2}}. \tag{8}$$

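In Equations (7) and (8), $\beta_1$ and $\beta_2$ present the skewness and the kurtosis, respectively. In these expressions, the scaling parameters $A$ and $A'$ are found as below.

$$A = 10\mu_4\mu_2 - 18\mu_2^{3} - 12\mu_3^{2}, \tag{10}$$

$$A' = 10\beta_2 - 18 - 12\beta_1^{2}. \tag{11}$$

The two scaling parameters are related by $A = \mu_2^{3} A'$.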

In this way, the Pearson system defines 12 classes of distributions, of which the five most well-known are presented in Figure 1 [2, 14].
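
The mapping from central moments to the Pearson parameters can be sketched as follows. The first function implements the expressions for b0, b1 = a, b2 and A' given above, with beta1 taken as the skewness and beta2 as the kurtosis. The second function is an assumption that does not come from this paper: it classifies the type with the textbook criterion kappa = b1^2 / (4 b0 b2) instead of reading the type from the (beta1, beta2) plane of Figure 1. All function names are hypothetical.

```python
import math


def pearson_coefficients(mu2, mu3, mu4):
    """Return (beta1, beta2, b0, b1, b2) from the central moments mu2, mu3, mu4.

    The coefficients are undefined when A' = 0 (a ZeroDivisionError is raised).
    """
    beta1 = mu3 / mu2 ** 1.5                      # skewness
    beta2 = mu4 / mu2 ** 2                        # kurtosis
    a_prime = 10 * beta2 - 18 - 12 * beta1 ** 2   # scaling parameter A'
    b0 = mu2 * (4 * beta2 - 3 * beta1 ** 2) / a_prime
    b1 = math.sqrt(mu2) * beta1 * (beta2 + 3) / a_prime   # equals a
    b2 = (2 * beta2 - 3 * beta1 ** 2 - 6) / a_prime
    return beta1, beta2, b0, b1, b2


def pearson_type(beta1, beta2, tol=1e-9):
    """Classify the Pearson type from skewness and kurtosis (assumed kappa rule)."""
    s = beta1 ** 2
    c0 = 4 * beta2 - 3 * s          # numerator of b0 / mu2, positive for valid moments
    c2 = 2 * beta2 - 3 * s - 6      # numerator of b2
    if abs(c2) < tol:               # b2 = 0: normal or gamma-type curve
        return "normal" if s < tol else "III"
    if s < tol:                     # symmetric case, kappa = 0
        return "II" if beta2 < 3 else "VII"
    kappa = s * (beta2 + 3) ** 2 / (4 * c0 * c2)
    if kappa < 0:
        return "I"
    if abs(kappa - 1) < tol:
        return "V"
    return "IV" if kappa < 1 else "VI"
```

As a quick sanity check, the exponential distribution with unit rate has central moments (1, 2, 9), which give b2 = 0 and therefore Type III, the gamma family.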
3 APPLICATIONS
In order to test the alternative degree distributions of the departing connectivity for real datasets, we take 10 different networks of human disease interactions, which are gathered from the KEGG database. In the database, all these networks are presented as directed networks with various numbers of genes. Datasets 1 to 6 consist of 724, 223, 1008, 987, 643 and 188 gene interactions, respectively. Datasets 7 to 10 are HIV disease interactions with different numbers of genes: dataset 7 is composed of 1469 genes, and datasets 8, 9 and 10 have 1152, 722 and 306 genes, respectively.
In the analyses for the Pearson family of distributions, we initially generate a frequency table for the out-degree of each gene. Then, we compute the first four moments and fit the distribution in such a way that the a and b values indicate the shape parameters for the left and right sides, respectively. The remaining two parameters are the location and the scale parameters. Here, the former is the minimum and is shown by l, and the latter is the difference between the maximum and the minimum of the range and is shown by s in the tabulated terms for each dataset.
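
The first steps described above (frequency table of out-degrees, first four moments, location and scale) can be sketched as below. This is a minimal illustration, not the authors' code: the function names are hypothetical, the network is assumed to be given as a list of directed (source, target) pairs, nodes without outgoing links are counted with out-degree 0, and the moments are population moments (division by n).

```python
from collections import Counter


def out_degree_table(edges, nodes=None):
    """Frequency table {out-degree: number of nodes} for a directed edge list."""
    out_deg = Counter(src for src, _dst in edges)
    if nodes is None:
        # Assumption: every node appears in at least one edge.
        nodes = {n for edge in edges for n in edge}
    return Counter(out_deg.get(n, 0) for n in nodes)


def four_moments(freq):
    """Mean, variance, skewness and kurtosis from a {value: count} table."""
    n = sum(freq.values())
    mean = sum(k * c for k, c in freq.items()) / n
    mu2, mu3, mu4 = (
        sum(c * (k - mean) ** r for k, c in freq.items()) / n for r in (2, 3, 4)
    )
    return mean, mu2, mu3 / mu2 ** 1.5, mu4 / mu2 ** 2


def location_scale(freq):
    """Location l (minimum) and scale s (maximum minus minimum) of the out-degrees."""
    l = min(freq)
    return l, max(freq) - l


# Toy directed network with four genes (hypothetical data).
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("C", "D")]
freq = out_degree_table(edges)
mean, variance, skewness, kurtosis = four_moments(freq)
l, s = location_scale(freq)
```

In the toy network, gene A has three outgoing links, B and C have one each and D has none, so the table holds the out-degrees 0, 1 and 3, with l = 0 and s = 3.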
Finally, we perform goodness-of-fit tests to check whether any suggested density for the out-degree can fit these datasets. Here, we construct the following hypotheses.
H0 : The data come from the specified (stretched exponential, Pareto or geometric) distribution.
H1 : The null hypothesis is not true.
In Tables 1-2, we list the estimated parameters,
i.e., means, variances, skewness and kurtosis
values, for the selected real datasets and then
specify a suitable density in the Pearson family.
The results show that the majority (8 out of 10) of the datasets fall into the Pearson Type I distribution, which represents the beta family. The other two datasets follow the Pearson Type VI distribution. The beta family is a family of continuous probability distributions parameterized by two positive shape parameters (a and b), a location parameter (l) and a scale parameter (s). In the beta family, the location parameter controls the position and the scale parameter indicates the spread of the distribution on the x-axis. The network of the inflammatory bowel disease and the network of the glycogen storage disease fall into the Pearson Type VI family of distributions. This type covers the region between the gamma and the Pearson Type V families. Special cases of this family are the beta distribution of the second kind and the Fisher F distribution [14]. From our results based on the four-moment estimates, we find that these two networks indicate the beta distribution of the second kind.
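
To make the role of the four parameters concrete, the sketch below evaluates a beta density with shapes a and b, location l and scale s, so that the support is the interval from l to l + s. The function names are hypothetical, and the parameterization is the standard location-scale form of the beta distribution, which is assumed here to match the a, b, l and s values reported in the tables.

```python
import math


def beta4_pdf(x, a, b, l, s):
    """Density of a beta distribution with shapes a, b on the interval (l, l + s)."""
    if x <= l or x >= l + s:
        return 0.0
    z = (x - l) / s                                   # rescale to (0, 1)
    log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    log_pdf = (a - 1) * math.log(z) + (b - 1) * math.log(1 - z) - log_beta
    return math.exp(log_pdf) / s


def beta4_mean(a, b, l, s):
    """Mean of the four-parameter beta distribution."""
    return l + s * a / (a + b)
```

For example, with the Lafora values from Table 2 (a = 0.57, b = 2.71, l = 7.27, s = 235.41), `beta4_mean` gives about 48.2, which is close to the first tabulated value of 48.47 if that value is read as the mean.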
In the literature, it has been suggested that the degree of the departing connectivity in biological networks follows the generalized Pareto, the geometric or the Weibull distribution (as an alternative to the stretched exponential). We test the original 10 datasets under each of these distributions via the chi-square goodness-of-fit and the Kolmogorov-Smirnov tests. From the findings, we detect that most of the original datasets follow at least one of the tested alternative distributions, except for the inflammatory bowel disease, Lafora and HIV (1469 genes) networks.
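
The two test statistics named above can be sketched without any statistical library. The code is an illustration under assumptions: the function names are hypothetical, the Weibull parameters are taken as fully specified rather than estimated, and the decision rule uses the large-sample 5% critical value 1.36 / sqrt(n) of the Kolmogorov-Smirnov test. Since out-degrees are discrete and contain ties, this rule is only approximate for such data.

```python
import math


def weibull_cdf(x, lam, beta):
    """Weibull distribution function with scale lam and shape beta."""
    return 1.0 - math.exp(-((x / lam) ** beta)) if x > 0 else 0.0


def ks_statistic(data, cdf):
    """Largest gap between the empirical CDF of data and a fully specified cdf."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # Compare the theoretical value with the empirical step before and after x.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def ks_reject(data, cdf, c_alpha=1.36):
    """Return (D, decision); H0 is rejected when D > c_alpha / sqrt(n)."""
    d = ks_statistic(data, cdf)
    return d, d > c_alpha / math.sqrt(len(data))


def chi_square_statistic(observed, expected):
    """Pearson chi-square statistic for observed and expected class counts."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```

A call such as `ks_reject(out_degrees, lambda x: weibull_cdf(x, lam, beta))` returns the statistic together with the decision for H0, where `out_degrees`, `lam` and `beta` stand for the observed out-degrees and the chosen Weibull parameters.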
Table 1. Estimated parameters and Pearson Type for all networks with their dimensions d, i.e., total number of genes. For each network, the first line gives the mean, variance, skewness and kurtosis, and the second line gives the fitted parameters a, b, l and s.

Network and dimension             Mean       Variance    Skewness    Kurtosis     Type
Paget disease, d = 724            16.58      300.57      2.50        4.97         I
                                  a: 1.07    b: 39.07    l: -0.83    s: 630.03
Mendes disease, d = 223           11.62      134.18      5.60        7.58         I
                                  a: 0.60    b: 4.21     l: 1.03     s: 84.26
Inflammatory bowel, d = 1008      2.05       11.30       1.03        4.78         VI
                                  a: 5.72    b: 44.24    l: 0.51     s: 19.37
Glycogen storage, d = 987         21.02      464.33      12.26       23.79        VI
                                  a: 1.14    b: 12.80    l: 0.02     s: 217.87
Muscle, d = 643                   104.82     996.51      1.57        4.84         I
                                  a: 0.81    b: 7.08     l: 4.23     s: 981.16

Table 2. Estimated parameters and Pearson Type for all networks with their dimensions d, i.e., total number of genes. For each network, the first line gives the mean, variance, skewness and kurtosis, and the second line gives the fitted parameters a, b, l and s.

Network and dimension             Mean       Variance    Skewness    Kurtosis     Type
Lafora, d = 188                   48.47      1867.86     1.34        4.31         I
                                  a: 0.57    b: 2.71     l: 7.27     s: 235.41
HIV, d = 1469                     83.28      590.19      0.90        2.91         I
                                  a: 0.66    b: 1.87     l: -2.67    s: 328.56
HIV, d = 1152                     11.51      140.80      3.84        7.37         I
                                  a: 0.78    b: 11.86    l: 0.28     s: 182.26
HIV, d = 722                      7.56       67.83       7.60        13.04        I
                                  a: 0.88    b: 13.68    l: 0.70     s: 136.46
HIV, d = 306                      3.53       709.38      2.95        5.90         I
                                  a: 0.00    b: 1.09     l: 1.96     s: 947.72

We also draw the Q-Q plots of the empirical and theoretical distributions. The graphs for the Weibull distribution are presented in Figures 2-11 for illustration. The Q-Q plots show that none of the datasets is likely to come from the distribution tested in the null hypotheses. Furthermore, the density plot of each dataset (Figure 12) indicates that all data are right skewed, which is a visual indication of the beta family.
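
The points of a Weibull Q-Q plot such as those in Figures 2-11 can be computed as sketched below. The function names are hypothetical, and the plotting positions (i + 0.5) / n are an assumption, since the paper does not state which positions were used.

```python
import math


def weibull_quantile(p, lam, beta):
    """Inverse Weibull CDF: the value x with F(x) = p, for 0 < p < 1."""
    return lam * (-math.log(1.0 - p)) ** (1.0 / beta)


def qq_points(data, lam, beta):
    """Pairs (theoretical quantile, ordered observation) for a Weibull Q-Q plot."""
    xs = sorted(data)
    n = len(xs)
    return [
        (weibull_quantile((i + 0.5) / n, lam, beta), x) for i, x in enumerate(xs)
    ]
```

If the data follow the chosen Weibull distribution, the pairs lie close to the straight line y = x; systematic departures from that line, as reported for the ten datasets, speak against the tested distribution.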
4 CONCLUSION

In this paper, we have investigated the degree distribution of biological networks by checking the associated density in real systems of different dimensions. We have investigated whether any of the suggested out-degree distributions can fit the values of these systems by checking their Q-Q plots and statistical tests. Then, we have proposed a Pearson type of distribution to describe the underlying degree distributions. The findings have shown that the degree distributions of the real systems mostly support the Type I family and only a few of them fall under the Pearson Type VI. The Pearson Type I family refers to the beta family, and the Type VI family indicates the region between the gamma and the Type V families. Thereby, we have concluded that the major cases in the Pearson system are observed as the beta distribution of the second kind and the Fisher F-distribution.

As a result, we believe that the detection of the degree distribution can be helpful for better understanding the structure of high-dimensional biological networks and for predicting the activation of the systems in detail. Indeed, in order to address this problem, different inference methods have already been applied, and the estimation has been conducted by Bayesian methods [19]. Hence, this study has also shown that the Pearson Type I family can be a plausible prior distribution for the degree distribution, rather than the more commonly assumed choices such as the power-law and Weibull densities, in these calculations.

As future work, we plan to check this density in simulated data and long-run Monte Carlo analyses. Furthermore, we plan to extend the study to metabolic networks in order to detect whether we can identify a common type of degree distribution for both protein-protein interaction networks and metabolic networks.

Figure 2. Q-Q plots of the real dataset 1 (red dots) against the theoretical Weibull distribution (straight line).

REFERENCES
[1] M.A. Al-Fawzan, Methods for estimating the parameters of the Weibull distribution, King Abdulaziz City for Science and Technology, Riyadh, Saudi Arabia, 2000.
[2] A. Andreev, A. Kanto, and P. Malo, Simple approach for distribution selection in the Pearson system, Helsinki School of Economics, Working Paper W-388, Helsinki, Finland, 2005.
[3] A.L. Barabasi and Z.N. Oltvai, Network biology: understanding the cell's functional organization, Nature Reviews Genetics, vol. 5, pp. 101-113, 2004.
[4] S. Bergmann, J. Ihmels, and N. Barkai, Similarities and differences in genome-wide expression data of six organisms, PLoS Biology, vol. 2, 1, doi: 10.1371/journal.pbio.0020009, 2004.
[5] S.M. Burroughs and S.F. Tebbens, Upper-truncated power law distributions, Fractals, vol. 9, pp. 209-222, 2001.
[6] S.M. Burroughs and S.F. Tebbens, Upper-truncated power laws in natural systems, Journal of Pure and Applied Geophysics, vol. 158, pp. 741-757, 2001.
[7] Y. Dodge, The Oxford dictionary of statistical terms, Oxford Press, 2003.
[8] M. Evans, N. Hastings, and B. Peacock, Statistical distributions, John Wiley and Sons, 2000.
[9] N. Guelzim, S. Bottani, P. Bourgine, and F. Kepes, Topological and causal structure of the yeast transcriptional regulatory network, Nature Genetics, vol. 31, pp. 60-63, 2002.
[10] N.L. Johnson, S. Kotz, and N. Balakrishnan, Continuous univariate distributions, Wiley-Interscience, New York, 1994.
[11] N.L. Johnson, A.W. Kemp, and S. Kotz, Univariate discrete distributions, Wiley-Interscience, New York, 2005.
[12] B.H. Junker and N. Schreiber, Analysis of biological networks, John Wiley and Sons, 2008.
[13] R. Khanin and E. Wit, How scale-free are biological networks, Journal of Computational Biology, vol. 13, pp. 810-818, 2006.
[14] B. Lahcene, On Pearson families of distributions and its applications, African Journal of Mathematics and Computer Science Research, vol. 6, 5, pp. 108-117, 2013.
[15] S. Mossa, M. Barthelme, H.E. Stanley, and L.A. Amaral, Truncation of power-law behavior in scale-free, Physical Review Letters, doi: 10.1103/PhysRevLett.88.138701, 2002.
[16] V. Van Noort, B. Snel, and M.A. Huynen, The yeast coexpression network has a small-world, scale-free architecture and can be explained by a simple model, EMBO, vol. 5, 3, pp. 280-284, 2004.
[17] K. Pearson, Contributions to the mathematical theory of evolution, II: skew variation in homogeneous material, Philosophical Transactions of the Royal Society, vol. 186, pp. 343-414, 1895.
[18] M.N.B. Santos, E.N. Bodunov, and B. Valeur, Mathematical functions for the analysis of luminescence decays with underlying distributions 1: Kohlrausch decay function (stretched exponential), Chemical Physics, vol. 315, pp. 171-182, 2005.
[19] P. Sheridan, T. Kamimura, and H. Shimodaira, A scale-free structure prior for graphical models with applications in functional genomics, Department of Mathematical and Computing Sciences, Tokyo Institute of Technology, Tokyo, Japan, 2010.
[20] A. Stuart and J.K. Ord, Kendall's advanced theory of statistics, volume I: distribution theory, Edward Arnold Press, 1994.
[21] E. Wit, V. Vinciotti, and V. Purutcuoglu, Statistics for biological networks: short course notes, 25th International Biometric Conference (IBC), Florianopolis, Brazil, 2010.

Figure 3. Q-Q plots of the real dataset 2 (red dots) against the theoretical Weibull distribution (straight line).

Figure 4. Q-Q plots of the real dataset 3 (red dots) against the theoretical Weibull distribution (straight line).

Figure 5. Q-Q plots of the real dataset 4 (red dots) against the theoretical Weibull distribution (straight line).

Figure 6. Q-Q plots of the real dataset 5 (red dots) against the theoretical Weibull distribution (straight line).

Figure 7. Q-Q plots of the real dataset 6 (red dots) against the theoretical Weibull distribution (straight line).

Figure 8. Q-Q plots of the real dataset 7 (red dots) against the theoretical Weibull distribution (straight line).

Figure 9. Q-Q plots of the real dataset 8 (red dots) against the theoretical Weibull distribution (straight line).

Figure 10. Q-Q plots of the real dataset 9 (red dots) against the theoretical Weibull distribution (straight line).

Figure 11. Q-Q plots of the real dataset 10 (red dots) against the theoretical Weibull distribution (straight line).

Figure 12. Density plots of the real data (Data 1 to 10, in order) from (a) to (k), respectively.
