
A Statistical Method for Structure Learning of Bayesian Networks from Data

Soojung Ha*, Seyun Kim*, Minkook Suh*, Hyunwoo Seong*, Kwang Mo Jeong†, Sung-Ho Kim‡
*Korea Science Academy, †Pusan National University,
‡Korea Advanced Institute of Science and Technology
sjhakorea@hanmail.net, whataud@naver.com, minkook789@hanmail.net,
hwseong@hotmail.com, kmjung@pusan.ac.kr, sung-ho.kim@kaist.edu

Abstract

The Bayesian network, a powerful tool for predicting and diagnosing uncertain phenomena, is used in various fields including artificial intelligence, business administration, and medical science. We use a statistical approach and present a simple algorithm for learning Bayesian network structure from data. First we obtain from data the original correlation graph and the correlation graphs when one or two variables are fixed. Then we construct a Bayesian network that would produce the most similar correlation graphs. Simulation results are given to demonstrate that the algorithm determines the network structure with high accuracy.

1. Introduction

Bayesian networks (BN) are graphs that represent the probabilistic relationships among a large number of variables [1]. They are used in many areas that involve probabilistic inference, such as artificial intelligence, business administration, medical science, biology, and social science. Basically, Bayesian networks are directed acyclic graphs with joint probability distributions. The nodes denote the random variables, and the directed edges represent the causal relationships between the random variables.

When the number of variables is small and the causal relationships between the variables are known, the BN structure can be constructed directly without the help of computers. However, when the number of variables is large or when the causal relationships are not fully known, the BN structure has to be induced from data. This procedure is often very difficult because the number of possible structures (2^n, where n is the number of nodes) is large even for a small number of variables, and if we were to investigate all the possible cases one by one, the computing time would be enormously long. Moreover, the data we can obtain are only part of the population. When the sample data do not represent the population well, the computation of the BN structure becomes more prone to error.

Recently, much research has been conducted to develop efficient methods for learning Bayesian networks from data. There are two main approaches to this problem. One is to define a scoring function over networks and search for the network with the maximum score; various scoring functions [2] [3] and search techniques [4] [5] have been developed. The other is to determine the presence or absence of each edge by examining conditional independencies obtained from data. Some other studies combine both approaches to form hybrid algorithms.

In this paper we present a simple and time-efficient algorithm based on the second approach. We then give simulation results to show that the algorithm recovers the BN structure with high accuracy.

2. The correlation graph

The correlation graph is a key concept in our algorithm. A correlation graph of a BN represents the correlation (or dependency) between the random variables: there is an edge between two variables if and only if the two variables are dependent.

Fixing variable A to a means that, when considering the joint probability function of two variables X and Y, instead of P(x, y) we consider P(x, y | A=a). We can also think of correlation graphs with some variables fixed; these correlation graphs will be different from the correlation graph with no variables fixed.
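To make the notion of fixing concrete, here is a minimal C++ sketch of how the correlation of two variables might be estimated with a third variable fixed: the sample is filtered to the matching records before a correlation value is computed. The data layout, the function names, and the choice of the Pearson coefficient as the correlation function are illustrative assumptions, not details taken from the paper.

```cpp
#include <cmath>
#include <vector>

// Pearson correlation of columns x and y over the given subset of rows.
static double correlation(const std::vector<std::vector<double>>& data,
                          int x, int y, const std::vector<int>& rows) {
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    const double n = static_cast<double>(rows.size());
    for (int r : rows) {
        const double xv = data[r][x], yv = data[r][y];
        sx += xv; sy += yv;
        sxx += xv * xv; syy += yv * yv; sxy += xv * yv;
    }
    const double cov = sxy / n - (sx / n) * (sy / n);
    const double vx = sxx / n - (sx / n) * (sx / n);
    const double vy = syy / n - (sy / n) * (sy / n);
    if (vx <= 0.0 || vy <= 0.0) return 0.0;  // a constant column is uncorrelated
    return cov / std::sqrt(vx * vy);
}

// Correlation of x and y when variable a is fixed to a_val, i.e. an
// estimate based on P(x, y | A = a_val): keep only the matching rows.
double fixedCorrelation(const std::vector<std::vector<double>>& data,
                        int x, int y, int a, double a_val) {
    std::vector<int> rows;
    for (int r = 0; r < static_cast<int>(data.size()); ++r)
        if (data[r][a] == a_val) rows.push_back(r);
    return rows.empty() ? 0.0 : correlation(data, x, y, rows);
}
```

Averaging fixedCorrelation over the observed values of A, weighted by the size of each subsample, mirrors the weighted averaging described later in this section.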
When there is a BN, we can deduce a correlation graph from it. Also, when there are data, we can derive a correlation graph from them. Deducing a correlation graph from a BN is simple: two variables A and B are connected in the correlation graph if 1) either of them is an ancestor of the other, or 2) they share a common ancestor. If they satisfy neither condition, no edge exists between A and B. Figure 1 shows an example of deducing a correlation graph from a BN.

Figure 1. Deducing a correlation graph from a known BN

When some variables are fixed, the correlation graph can change in accordance with the conditional distribution of the remaining variables of the BN. Two originally dependent (or independent) variables can become independent (or dependent). This happens in two cases: 1) if a common descendent of two originally independent variables is fixed, the two variables and all their descendents become dependent, and 2) if at least one variable is fixed on every path connecting two originally dependent variables, the two variables become independent. This is illustrated in Figure 2.

Figure 2. a) BN b) fixing A c) correlation graph d) correlation graph when A is fixed

To obtain a correlation graph from data, we must consider errors in the data. Even if two variables A and B are independent, P(A|B) = P(A) cannot hold precisely unless the sample size is infinite. Instead, we set a correlation function for measuring dependency (or independency), and when the function value exceeds a certain threshold, we determine that the two variables are dependent. From the data, we compute the correlation function of each pair of variables and, if they are dependent, add an edge between them in the correlation graph.

To calculate the correlation graph when some variables are fixed, we sort the data before computing the correlation function. For example, suppose A is a discrete variable that can take the values 0 and 1. From the database, we first select only the records in which A=0 and compute the correlation function. Next we select only the records in which A=1 and compute the correlation function. In this respect, fixed variables may be regarded as conditioning variables.

We then average the two correlation values, giving more weight to the case with the larger sample size. Determining the dependency of all pairs of variables, we obtain a new correlation graph in which the conditioning variables are isolated from the other variables, since the correlation coefficient of a random variable with a constant is zero. We may regard a conditioning variable as a constant, since its value is fixed when computing the correlation graph.

If the sample size is large enough, the correlation graphs obtained from a BN and from its data will most likely be identical. Our research is based on this observation. In our algorithm, we first obtain the correlation graphs from data, and then try to construct a Bayesian network that would produce the same, or the most similar, correlation graphs.

3. The proposed algorithm

Our algorithm consists of two main parts. First we find the skeleton of the BN, i.e., the undirected version of the BN, and then we determine the orientation of the edges.

3.1. Construction of skeleton

[Initialization]
1. The initial skeleton has all the nodes and no edges.
[Adding edges]
2. Obtain a correlation graph from data. Add all the edges of the correlation graph to the skeleton.
[Deleting edges]
3. If, when A is fixed, the correlation coefficient between B and C is close to zero, delete the edge between B and C.
4. If, when A and B are fixed, the correlation coefficient between C and D is close to zero, delete the edge between C and D.

Figure 3. Part 1 of the proposed algorithm
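Before turning to the rationale for these steps, the following is a compact sketch of how steps 1-4 might be coded, reusing the ideas above. The isDependent predicate, which is assumed to threshold the (possibly conditional) correlation function of Section 2, is a hypothetical helper rather than part of the authors' implementation.

```cpp
#include <functional>
#include <set>
#include <utility>
#include <vector>

using Edge = std::pair<int, int>;  // undirected, stored with first < second

// Skeleton construction (Figure 3). isDependent(x, y, fixed) is assumed to
// threshold the correlation function of x and y computed with the listed
// variables fixed.
std::set<Edge> buildSkeleton(
        int n,
        const std::function<bool(int, int, const std::vector<int>&)>& isDependent) {
    std::set<Edge> skeleton;  // step 1: all nodes, no edges
    // Step 2: add every edge of the (unconditional) correlation graph.
    for (int x = 0; x < n; ++x)
        for (int y = x + 1; y < n; ++y)
            if (isDependent(x, y, {})) skeleton.insert({x, y});
    // Steps 3 and 4: delete an edge when fixing one or two other
    // variables drives the correlation of its endpoints close to zero.
    for (auto it = skeleton.begin(); it != skeleton.end(); ) {
        const int x = it->first, y = it->second;
        bool keep = true;
        for (int a = 0; a < n && keep; ++a) {
            if (a == x || a == y) continue;
            if (!isDependent(x, y, {a})) keep = false;          // step 3
            for (int b = a + 1; b < n && keep; ++b) {
                if (b == x || b == y) continue;
                if (!isDependent(x, y, {a, b})) keep = false;   // step 4
            }
        }
        if (keep) ++it;
        else it = skeleton.erase(it);
    }
    return skeleton;
}
```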
If two variables are thought to have a direct causal relationship, we connect them in the skeleton. In step 1, we start with an initially empty skeleton. In step 2, we add all the possible 'candidates' for edges that represent a direct causal relationship. If two variables have a direct causal relationship, they will be correlated with each other; therefore, an edge between them will be added to the skeleton after this step. However, two variables can also be correlated without a direct causal relationship, so some edges must be deleted after this step.

In steps 3 and 4, we consider the cases explained above: correlation without a direct causal relationship can arise when one variable is an ancestor of the other or when the two variables share a common ancestor. In either case, if we block every path connecting the two variables by fixing certain variables on the paths, their correlation will disappear and we delete the edge between them.

As we only fix up to two variables at a time in this paper, if the structure is so complex that we cannot block all the paths between two variables, we may not be able to delete all the unnecessary edges. However, three problems arise from fixing many variables at a time: lengthy execution time, decrease of sample size, and a higher risk of deleting edges that do represent a causal relationship. Suppose that the number of nodes is n, and that it takes approximately time t to sort the data, obtain a correlation graph, and delete edges accordingly. If up to n-2 variables could be fixed, (2^n - n - 2)t time would be required, since there are 2^n - n - 2 nonempty subsets of the n variables containing at most n-2 of them. This can be a problem when n is large. The decrease of sample size is an equally serious problem: when computing correlation graphs with some variables fixed, the sample size shrinks dramatically in the data-sorting process. Also, by adding more deletion steps, we increase the possibility of deleting edges that should not be deleted. In some cases the correlation between a parent and a child is not very strong, and the edge can be deleted accidentally. Since omitting edges that represent direct relationships is a more serious problem than adding edges that do not, it is preferable to include only a few steps in edge deletion.

3.2. Edge Orientation

(If orienting an edge according to the algorithm creates a cycle, leave the edge undirected.)
[Convergence of edges]
5. If edges A-C and B-C exist in the skeleton but no edge exists between A and B in the correlation graph, and the correlation between A and B is stronger when C is fixed, then A→C, B→C.
6. If edges A-C and B-C exist in the skeleton but no edge exists between A and B in the correlation graph when another variable D is fixed, and the correlation between A and B becomes stronger when C is fixed, then A→C, B→C.
7. If edges A-C and B-C exist in the skeleton but no edge exists between A and B in the correlation graph when D and E are fixed, and the correlation between A and B becomes stronger when C is fixed, then A→C, B→C.
[No convergence of edge direction]
8. If an edge A-B exists in the correlation graph (but not in the skeleton), and there is only one path that connects A and B in the skeleton, direct the edges so that no convergence of edge direction (i.e., X→Y and Z→Y when X, Y, Z are adjacent in the path) occurs in the path.
[Preventing cycles]
9. If A is an ancestor of B and there is an edge between A and B in the skeleton, A→B.

Figure 4. Part 2 of the proposed algorithm

Steps 5 to 7 find three variables A, B, C such that A and B are independent but both are connected to C. In this case, A and B are both parents of C [6]. This is because, of the four possible cases A→C→B, A←C←B, A←C→B, and A→C←B, there is no correlation between A and B only in the last case. For verification, we check the correlation between A and B when C is fixed: if the correlation gets weaker, we do not set the arrow directions.

Sometimes A and B, while not adjacent to each other, might both be parents of C yet not be independent. This happens when they share a common ancestor or when one is an ancestor of the other. Steps 6 and 7 are designed to solve this problem: if we fix some variables along the paths connecting A and B (other than A-C-B), we might detect that the dependency of A and B is not attributable to C. Then we can set the arrow directions as in step 5.

Step 8 works the other way. For two variables to be dependent while not having a direct causal relationship, they must share a common ancestor or one must be an ancestor of the other. If there is only one path connecting the two variables, one variable must be an ancestor of the other, or a variable along the path must be a common ancestor of both. This is equivalent to 'no convergence' of edge direction in the path. Step 9 directs all the edges from ancestor to descendent; if directed otherwise, the edge direction creates a cycle.
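As an illustration of the convergence rule, the sketch below codes step 5 using the hypothetical helpers introduced earlier; corr(x, y, fixed) stands for the assumed weighted-average correlation function with the listed variables fixed, and the strengthen-versus-weaken test reflects our reading of the verification check above. Steps 6 and 7 would follow the same pattern, first fixing D, or D and E, when testing the adjacency of A and B.

```cpp
#include <algorithm>
#include <cmath>
#include <functional>
#include <set>
#include <utility>
#include <vector>

using Edge = std::pair<int, int>;

// Step 5 (convergence of edges): for skeleton edges A-C and B-C with A and
// B non-adjacent in the correlation graph, orient A->C and B->C provided
// that fixing C strengthens the A-B correlation.
void orientVStructures(
        int n,
        const std::set<Edge>& skeleton,
        const std::set<Edge>& corrGraph,
        const std::function<double(int, int, const std::vector<int>&)>& corr,
        std::set<Edge>& directed) {  // holds (parent, child) pairs
    auto hasEdge = [](const std::set<Edge>& g, int x, int y) {
        return g.count({std::min(x, y), std::max(x, y)}) > 0;
    };
    for (int c = 0; c < n; ++c)
        for (int a = 0; a < n; ++a)
            for (int b = a + 1; b < n; ++b) {
                if (a == c || b == c) continue;
                if (!hasEdge(skeleton, a, c) || !hasEdge(skeleton, b, c)) continue;
                if (hasEdge(corrGraph, a, b)) continue;  // A, B must be independent
                // Verification: fixing the collider C should strengthen,
                // not weaken, the A-B correlation.
                if (std::fabs(corr(a, b, {c})) > std::fabs(corr(a, b, {}))) {
                    directed.insert({a, c});
                    directed.insert({b, c});
                }
            }
}
```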
At the end of the algorithm, some edges are left with no direction assigned. This is because, for a given set of correlation graphs, there may be many Bayesian networks that would produce the same set. A simple example is the pair of BNs A→B and A←B, which are indistinguishable on the basis of data. Another example is when the network is a tree: regardless of which leaf node we choose as the root, we can obtain a BN that is a faithful representation of the data.

Among the steps in the algorithm, step 7 takes the most computation time. With two fixed nodes, we must consider three other nodes, yielding a worst-case time complexity of O(n^5).

4. Experimental Results

We implemented the algorithm in C++. The program accepts data as input and produces a BN as output. To test the accuracy of the proposed algorithm, we created four Bayesian networks and generated four sets of test data from them. The sample size was 10000. We then determined the network structure for each set of data with our program.

Table 1 summarizes the results and shows that our algorithm has a high accuracy. Only one edge was missed in the experiments, giving a recall rate of 98.85% (86/87 * 100%). Only a few edges (9) were added incorrectly, giving a precision rate of 90.53% (86/(86+9) * 100%). It should also be noted that no edge was given a wrong direction. Figure 5 illustrates one of the simulation results, BN3, where only two edges are added incorrectly and no edge in the original graph is missed.

Figure 5. Original and deduced graphs (BN3)

Table 1. Simulation results

         Random     Total    Correct edges     Wrong edges
         variables  edges    (no direction)    Missed   Added   Wrong direction
BN1      10         16       15 (6)            1        0       0
BN2      13         21       21 (18)           0        3       0
BN3      16         15       15 (9)            0        2       0
BN4      20         35       35 (17)           0        4       0
Total    59         87       86 (50)           1        9       0

5. Conclusion

In this paper, we presented a simple and efficient BN learning algorithm based on a statistical approach. First we obtain from data the original correlation graph and the correlation graphs with some variables fixed. Then we construct a BN that would produce the most similar correlation graphs. The skeleton is constructed by first adding edges from the original correlation graph, and then deleting edges in consideration of the correlation graphs when one or two variables are fixed. The orientations of the edges are determined in three successive steps, related respectively to convergence of edge direction, no convergence of edge direction, and the prevention of cycles. The experimental results demonstrate the algorithm's ability to learn BN structure with high accuracy.

6. References

[1] R. E. Neapolitan, Learning Bayesian Networks, Prentice Hall, Chicago, Illinois, 2004.

[2] N. Friedman and Z. Yakhini, "On the sample complexity of learning Bayesian networks", Proceedings of the 12th Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann, 1996.

[3] C. S. Wallace and K. Korb, "Learning linear causal models by MML sampling", in A. Gammerman (Ed.), Causal Models and Intelligent Data Mining, Springer-Verlag, New York, 1999.

[4] P. Larranaga, M. Poza, Y. Yurramendi, R. H. Murga, and C. M. H. Kuijpers, "Structure learning of Bayesian networks by genetic algorithms: a performance analysis of control parameters", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 18, No. 9, 1996, pp. 912-926.

[5] L. M. de Campos, J. M. Fernandez-Luna, J. A. Gamez, and J. M. Puerta, "Ant colony optimization for learning Bayesian networks", International Journal of Approximate Reasoning, Vol. 31, No. 3, 2002, pp. 291-311.

[6] G. Rebane and J. Pearl, "The recovery of causal poly-trees from statistical data", in Proceedings of the Third Conference on Uncertainty in Artificial Intelligence, Seattle, Washington, July 1987, pp. 222-228.
