
BOA: The Bayesian Optimization Algorithm

Martin Pelikan, David E. Goldberg, and Erick Cantú-Paz


Illinois Genetic Algorithms Laboratory
Department of General Engineering
University of Illinois at Urbana-Champaign
{pelikan,deg,cantupaz}@illigal.ge.uiuc.edu

Abstract

In this paper, an algorithm based on the concepts of genetic algorithms that uses an estimation of a probability distribution of promising solutions in order to generate new candidate solutions is proposed. To estimate the distribution, techniques for modeling multivariate data by Bayesian networks are used. The proposed algorithm identifies, reproduces and mixes building blocks up to a specified order. It is independent of the ordering of the variables in the strings representing the solutions. Moreover, prior information about the problem can be incorporated into the algorithm. However, prior information is not essential. Preliminary experiments show that the BOA outperforms the simple genetic algorithm even on decomposable functions with tight building blocks as the problem size grows.

1 INTRODUCTION

Recently, there has been a growing interest in optimization methods that explicitly model the good solutions found so far and use the constructed model to guide the further search (Baluja, 1994; Harik et al., 1997; Mühlenbein & Paaß, 1996; Mühlenbein et al., 1998; Pelikan & Mühlenbein, 1999). This line of research in stochastic optimization was strongly motivated by results achieved in the field of evolutionary computation. However, the connection between these two areas has sometimes been obscured. Moreover, the capabilities of model building have often been insufficiently powerful to solve hard optimization problems.

The purpose of this paper is to introduce an algorithm that uses techniques for estimating the joint distribution of multinomial data by Bayesian networks in order to generate new solutions. The proposed algorithm extends existing methods in order to solve more difficult classes of problems more efficiently and reliably. By covering interactions of higher order, the disruption of identified partial solutions is prevented. Prior information from various sources can be used. The combination of information from the set of good solutions and the prior information about a problem is used to estimate the distribution. Preliminary experiments with uniformly-scaled additively decomposable problems with non-overlapping building blocks indicate that the proposed algorithm is able to solve all tested problems in close to linear time with respect to the number of fitness evaluations until convergence.

In Section 2, the background needed to understand the motivation and basic principles of the discussed methods is provided. In Section 3, the Bayesian optimization algorithm (BOA) is introduced. In subsequent sections, the structure of Bayesian networks and the techniques used in the BOA to construct the network for a given data set and to use the constructed network to generate new instances are described. The results of the experiments are presented in Section 6. The conclusions are provided in Section 7.

2 BACKGROUND

Genetic algorithms (GAs) are optimization methods loosely based on the mechanics of artificial selection and genetic recombination operators. Most of the theory of genetic algorithms deals with the so-called building blocks (BBs) (Goldberg, 1989). By building blocks, partial solutions of a problem are meant. The genetic algorithm implicitly manipulates a large number of building blocks by mechanisms of selection and recombination. It reproduces and mixes building blocks. However, a fixed mapping from the space of solutions into the internal representation of solutions in the al-
gorithm and simple two-parent recombination operators soon showed to be insufficiently powerful even for problems that are composed of simpler partial subproblems. General, fixed, problem-independent recombination operators often break partial solutions, which can sometimes lead to losing these and converging to a local optimum. Two crucial factors of the GA success—a proper growth and mixing of good building blocks—are often not achieved (Thierens, 1995). Various attempts to prevent the disruption of important building blocks have been made recently and are briefly discussed in the remainder of this section.

There are two major approaches to resolving the problem of building-block disruption. The first approach is based on manipulating the representation of solutions in the algorithm in order to make the interacting components of partial solutions less likely to be broken by recombination operators. Various reordering and mapping operators were used. However, reordering operators are often too slow and lose the race against selection, resulting in premature convergence to low-quality solutions. Reordering is not sufficiently powerful to ensure a proper mixing of partial solutions before these are lost. This line of research has resulted in algorithms which evolve the representation of a problem among individual solutions, e.g. the messy genetic algorithm (mGA) (Goldberg et al., 1989), the gene expression messy genetic algorithm (GEMGA) (Bandyopadhyay et al., 1998), the linkage learning genetic algorithm (LLGA) (Harik & Goldberg, 1996), or the linkage identification by non-linearity checking genetic algorithm (LINC-GA) (Munetomo & Goldberg, 1998).

A different way to cope with the disruption of partial solutions is to change the basic principle of recombination. In the second approach, instead of implicit reproduction of important building blocks and their mixing by selection and two-parent recombination operators, new solutions are generated by using the information extracted from the entire set of promising solutions.

Global information about the set of promising solutions can be used to estimate their distribution, and new candidate solutions can be generated according to this estimate. A general scheme of the algorithms based on this principle is called the estimation of distribution algorithm (EDA) (Mühlenbein & Paaß, 1996). In EDAs, better solutions are selected from an initially randomly generated population of solutions, as in the simple GA. The distribution of the selected set of solutions is estimated. New solutions are generated according to this estimate. The new solutions are then added into the original population, replacing some of the old ones. The process is repeated until the termination criteria are met. However, estimating the distribution is not an easy task. There is a trade-off between the accuracy of the estimation and its computational cost.

The simplest way to estimate the distribution of good solutions is to consider each variable in a problem independently and generate new solutions by only preserving the proportions of the values of all variables independently of the remaining solutions. This is the basic principle of the population based incremental learning (PBIL) algorithm (Baluja, 1994), the compact genetic algorithm (cGA) (Harik et al., 1997), and the univariate marginal distribution algorithm (UMDA) (Mühlenbein & Paaß, 1996). There is theoretical evidence that the UMDA approximates the behavior of the simple GA with uniform crossover (Mühlenbein, 1997). It reproduces and mixes the building blocks of order one very efficiently. The theory of the UMDA, based on the techniques of quantitative genetics, can be found in Mühlenbein (1997). Some analyses of PBIL can be found in Kvasnicka et al. (1996).

The PBIL, cGA, and UMDA algorithms work very well for problems with no significant interactions among variables (Mühlenbein, 1997; Harik et al., 1997; Pelikan & Mühlenbein, 1999). However, partial solutions of order more than one are disrupted, and therefore these algorithms experience great difficulty in solving problems with interactions among the variables. First attempts to solve this problem were based on covering some pairwise interactions, e.g. the incremental algorithm using so-called dependency trees as a distribution estimate (Baluja & Davies, 1997), the population-based MIMIC algorithm using simple chain distributions (De Bonet et al., 1997), or the bivariate marginal distribution algorithm (BMDA) (Pelikan & Mühlenbein, 1999). In the algorithms based on covering pairwise interactions, the reproduction of building blocks of order one is guaranteed. Moreover, the disruption of some important building blocks of order two is prevented. Important building blocks of order two are identified using various statistical methods. Mixing of building blocks of order one and two is guaranteed assuming the independence of the remaining groups of variables.

However, covering only pairwise interactions has been shown to be insufficient to solve problems with interactions of higher order efficiently (Pelikan & Mühlenbein, 1999). Covering pairwise interactions still does not preserve higher order partial solutions. Moreover, interactions of higher order do not necessarily imply pairwise interactions that can be detected at the level of partial solutions of order two.
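To make the univariate scheme concrete, the estimation-and-sampling step shared by PBIL, the cGA, and the UMDA can be sketched in a few lines. This is our illustration only; the function name `umda_step` and its interface are ours, not taken from any of the cited papers:

```python
import random

def umda_step(selected, m):
    """One step of a univariate EDA: estimate the marginal
    probability of a 1 at each string position from the selected
    solutions, then sample m new binary strings with every
    position treated independently of the others."""
    n = len(selected[0])
    # Marginal frequency of a 1 at each position.
    p = [sum(s[i] for s in selected) / len(selected) for i in range(n)]
    # Sample each position independently according to p.
    return [[1 if random.random() < p[i] else 0 for i in range(n)]
            for _ in range(m)]
```

Because each position is sampled independently, any building block of order two or more that is present in the selected set can be broken apart in the offspring, which is exactly the limitation discussed above.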
In the factorized distribution algorithm (FDA) (Mühlenbein et al., 1998), a factorization of the distribution is used for generating new solutions. The distribution factorization is a conditional distribution constructed by analyzing the problem decomposition. The FDA is capable of covering the interactions of higher order and combining important partial solutions effectively. It works very well on additively decomposable problems. The theory of the UMDA can be used in order to estimate the time to convergence in the FDA.

However, the FDA requires prior information about the problem in the form of a problem decomposition and its factorization. As an input, this algorithm gets complete or approximate information about the structure of a problem. Unfortunately, the exact distribution factorization is often not available without a computationally expensive problem analysis. Moreover, the use of an approximate distribution according to the current state of information represented by the set of promising solutions can be very effective even if it is not a valid distribution factorization. However, by providing sufficient conditions for the distribution estimate that ensure a fast and reliable convergence on decomposable problems, the FDA is of great theoretical value. Moreover, for problems for which the factorization of the distribution is known, the FDA is a very powerful optimization tool.

The algorithm proposed in this paper is also capable of covering higher order interactions. It uses techniques from the field of modeling data by Bayesian networks in order to estimate the joint distribution of promising solutions. The class of distributions that are considered is identical to the class of conditional distributions used in the FDA. Therefore, the theory of the FDA can be used in order to demonstrate the power of the proposed algorithm to solve decomposable problems. However, unlike the FDA, our algorithm does not require any prior information about the problem. It discovers the structure of a problem on the fly. It identifies, reproduces and mixes building blocks up to a specified order very efficiently.

In this paper, the solutions will be represented by binary strings of fixed length. However, the described techniques can be easily extended to strings over any finite alphabet. String positions will be numbered sequentially from left to right, starting with position 0.

3 BAYESIAN OPTIMIZATION ALGORITHM

This section introduces an algorithm that uses techniques for modeling data by Bayesian networks to estimate the joint distribution of promising solutions (strings). This estimate is used to generate new candidate solutions. The proposed algorithm is called the Bayesian optimization algorithm (BOA). The BOA covers both the UMDA as well as the BMDA and extends them to cover interactions of higher order. The order of interactions that will be taken into account can be given as input to the algorithm. The combination of prior information and the set of promising solutions is used to estimate the distribution. Prior information about the structure of a problem as well as the information represented by the set of high-quality solutions can be incorporated into the algorithm. The ratio between the prior information and the information acquired during the run used to generate new solutions can be controlled. The BOA fills the gap between the fully informed FDA and totally uninformed black-box optimization methods. Prior information is not essential.

In the BOA, the first population of strings is generated at random. From the current population, the better strings are selected. Any selection method can be used. A Bayesian network that fits the selected set of strings is constructed. Any metric as a measure of network quality and any search algorithm can be used to search over the networks in order to maximize the value of the used metric. New strings are generated using the joint distribution encoded by the constructed network. The new strings are added into the old population, replacing some of the old ones. The pseudo-code of the BOA follows:

The Bayesian Optimization Algorithm (BOA)

(1) set t ← 0
    randomly generate initial population P(0)
(2) select a set of promising strings S(t) from P(t)
(3) construct the network B using a chosen metric and constraints
(4) generate a set of new strings O(t) according to the joint distribution encoded by B
(5) create a new population P(t+1) by replacing some strings from P(t) with O(t);
    set t ← t + 1
(6) if the termination criteria are not met, go to (2)

In the following section, Bayesian networks and the techniques for their construction and use will be described.

4 BAYESIAN NETWORKS

Bayesian networks (Howard & Matheson, 1981; Pearl, 1988) are often used for modeling multinomial data with both discrete and continuous variables. A Bayesian network encodes the relationships between the variables contained in the modeled data. It represents the structure of a problem. Bayesian networks can be used to describe the data as well as to generate new instances of the variables with properties similar to those of the given data. Each node in the network corresponds to one variable. By Xi, both the variable and the node corresponding to this variable will be denoted in this text. Each variable corresponds to one position in the strings representing the solutions (Xi corresponds to the ith position in a string). The relationship between two variables is represented by an edge between the two corresponding nodes. The edges in Bayesian networks can be either directed or undirected. In this paper, only Bayesian networks represented by directed acyclic graphs will be considered. The modeled data sets will be defined within discrete domains.

Mathematically, an acyclic Bayesian network with directed edges encodes a joint probability distribution. This can be written as

  p(X) = ∏_{i=0}^{n-1} p(Xi | ΠXi),   (1)

where X = (X0, ..., Xn-1) is a vector of variables, ΠXi is the set of parents of Xi in the network (the set of nodes from which there exists an edge to Xi), and p(Xi | ΠXi) is the conditional probability of Xi conditioned on the variables ΠXi. This distribution can be used to generate new instances using the marginal and conditional probabilities in a modeled data set.

The following sections discuss how to learn the network structure if this is not given by the user, and how to use the network to generate new candidate solutions.

4.1 CONSTRUCTING THE NETWORK

There are two basic components of the algorithms for learning the network structure (Heckerman et al., 1994). The first one is a scoring metric and the second one is a search procedure. A scoring metric is a measure of how well the network models the data. Prior knowledge about the problem can be incorporated into the metric as well. A search procedure is used to explore the space of all possible networks in order to find the one (or a set of networks) with the value of a scoring metric as high as possible. The space of networks can be reduced by constraint operators. Commonly used constraints restrict the networks to have at most k incoming edges into each node. This number directly influences the complexity of both the network construction as well as its use for the generation of new instances, and the order of interactions that can be covered by the class of networks restricted in this way.

4.1.1 Bayesian Dirichlet metric

As a measure of the quality of networks, the so-called Bayesian Dirichlet (BD) metric (Heckerman et al., 1994) can be used. This metric combines the prior knowledge about the problem and the statistical data from a given data set. The BD metric for a network B given a data set D of size N, and the background information ξ, denoted by p(D, B|ξ), is defined as

  p(D, B|ξ) = p(B|ξ) ∏_{i=0}^{n-1} ∏_{πXi} [ Γ(m'(πXi)) / Γ(m'(πXi) + m(πXi)) ] ∏_{xi} [ Γ(m'(xi, πXi) + m(xi, πXi)) / Γ(m'(xi, πXi)) ],   (2)

where p(B|ξ) is the prior probability of the network B, the product over πXi runs over all instances of the parents of Xi, and the product over xi runs over all instances of Xi. By m(πXi), the number of instances in D with the variables ΠXi (the parents of Xi) instantiated to πXi is denoted. When the set ΠXi is empty, there is one instance of ΠXi and the number of instances with ΠXi instantiated to this instance is set to N. By m(xi, πXi), we denote the number of instances in D that have both Xi set to xi as well as ΠXi set to πXi.

By the numbers m'(xi, πXi) and p(B|ξ), prior information about the problem is incorporated into the metric. The m'(xi, πXi) stands for prior information about the number of instances that have Xi set to xi and the set of variables ΠXi instantiated to πXi. The prior probability p(B|ξ) of the network reflects how much the measured network resembles the prior network. By using a prior network, the prior information about the structure of a problem is incorporated into the metric. The prior network can be set to an empty network when there is no such information. In our implementation, we set p(B|ξ) = 1 for all networks, i.e. all networks are treated equally.

The numbers m'(xi, πXi) can be set in various ways. They can be set according to the prior information the user has about the problem. When there is no prior in-
formation, uninformative assignments can be used. In the so-called K2 metric, for instance, the m'(xi, πXi) coefficients are all simply set to 1 (Heckerman et al., 1994). This assignment corresponds to having no prior information about the problem. In the empirical part of this paper we will use the K2 metric.

Since the factorials in Equation 2 can grow to huge numbers, usually a logarithm of the scoring metric is used. The contribution of one node to the logarithm of the metric can be computed in O(2^k N) steps, where k is the maximal number of incoming edges into each node in the network and N is the size of the data set (the number of instances). The increase of the logarithm of the value of the BD metric for an edge addition, edge reversal, or edge removal, respectively, can also be computed in time O(2^k N), since the total contribution corresponding to the nodes whose set of parents has not changed remains unchanged as well. Assuming that k is constant, we get linear time complexity, O(N) with respect to the size of the data set, for the computation of both the contribution of one node and the increase in the metric for an edge addition.

4.1.2 Searching for a Good Network

In this section, the basic principles of algorithms that can be used for searching over the networks in order to maximize the value of a scoring metric are described. Only classes of networks with the number of incoming edges restricted to at most k will be considered.

a) k = 0
This case is trivial. An empty network is the best one (and the only one possible).

b) k = 1
For k = 1, there exists a polynomial algorithm for the network construction (Heckerman et al., 1994). The problem can be easily reduced to a special case of the so-called maximal branching problem, for which there exists a polynomial algorithm (Edmonds, 1967).

c) k > 1
For k > 1 the problem gets much more complicated. Although for k = 1 there exists a polynomial algorithm for finding the best network, for k > 1 the problem of determining the best network with respect to a given metric is NP-complete for most Bayesian and non-Bayesian metrics (Heckerman et al., 1994).

Various algorithms can be used in order to find a good network, from total enumeration to blind random search. Usually, due to their effectiveness in this context, simple local search based methods are used (Heckerman et al., 1994). A simple greedy algorithm, local hill-climbing, or simulated annealing can be used. Simple operations that can be performed on a network include edge additions, edge reversals, and edge removals. In each iteration, the operation that improves the network score the most is applied. Only operations that keep the network acyclic and with at most k incoming edges into each of the nodes are allowed (i.e., operations that do not violate the constraints). The algorithms can start with an empty network, the best network with at most one incoming edge into each node, or a randomly generated network.

In our implementation, we have used a simple greedy algorithm with only edge additions allowed. The algorithm starts with an empty network. The time complexity of this algorithm can be computed using the time complexity of a single edge addition and the maximal number of edges that have to be processed. With the BD metric, the overall time to construct the network using the described greedy algorithm is O(k 2^k n^2 N + k n^3). Assuming that k is constant, we get the overall time complexity O(n^2 N + n^3).

4.2 GENERATING NEW INSTANCES

In this section, the generation of new instances using a network B and the marginal frequencies for a few sets of variables in the modeled data set will be described. New instances are generated using the joint distribution encoded by the network (see Equation 1). First, the conditional probabilities of each possible instance of each variable given all possible instances of its parents in the given data set are computed. The conditional probabilities are then used to generate each new instance. In each iteration, the nodes whose parents are already fixed are generated using the corresponding conditional probabilities. This is repeated until the values of all variables are generated. Since the network is acyclic, it is easy to see that the algorithm is well defined.

The time complexity of generating an instance of all variables is bounded by O(kn), where n is the number of variables. Assuming that k is constant, the overall time complexity is O(n).

5 DECOMPOSABLE FUNCTIONS

A function is additively decomposable of a certain order if it can be written as the sum of simpler functions defined over subsets of variables, each of cardinality less than or equal to the order of the decomposition (Mühlenbein et al., 1998; Pelikan & Mühlenbein,
1999). The problems defined by this class of functions can be decomposed into smaller subproblems. However, simple GAs experience great difficulty in solving these decomposable problems with deceptive building blocks when these are not mapped tightly onto the strings representing the solutions (Thierens, 1995).

In general, the BOA with k ≥ 0 can cover interactions of order k + 1. This does not mean that all interactions in a problem that is order-(k + 1) decomposable can actually be covered (e.g., 2D spin-glass systems (Mühlenbein et al., 1998)). There is no straightforward way to relate general decomposable problems and the interactions that need to be taken into account (or, equivalently, the order of the building blocks). By introducing overlapping among the sets from the decomposition, along with scaling of the contributions of each of these sets according to some function of the problem size, the problem becomes very complex. Nevertheless, the class of distributions the BOA uses is very powerful for decomposable problems with either overlapping or non-overlapping building blocks of a bounded order. This has been confirmed by a number of experiments with various test functions.

6 EXPERIMENTS

The experiments were designed to show the behavior of the proposed algorithm on non-overlapping decomposable problems with uniformly scaled deceptive building blocks. For all problems, the scalability of the proposed algorithm is investigated. In the following sections, the functions of unitation used in the experiments are described and the results of the experiments are presented.

6.1 FUNCTIONS OF UNITATION

A function of unitation is a function whose value depends only on the number of ones in a binary input string. The function values for strings with the same number of ones are equal. Several functions of unitation can be additively composed in order to form a more complex function. Let us have a function of unitation fk defined for strings of length k. Then, the function additively composed of l functions fk is defined as

  f(X) = Σ_{i=0}^{l-1} fk(Si),

where X is the set of n variables and Si for i ∈ {0, ..., l-1} are subsets of k variables from X. The sets Si can be either overlapping or non-overlapping, and they can be mapped onto a string (the inner representation of a solution) so that the variables from one set are either mapped close to each other or spread throughout the whole string. Each variable will be required to contribute to the function through some of the subfunctions. A function composed in this fashion is clearly additively decomposable of the order of the subfunctions it was composed of.

A deceptive function of order 3, denoted by 3-deceptive, is defined as

  f3deceptive(u) = 0.9 if u = 0,
                   0.8 if u = 1,
                   0   if u = 2,
                   1   otherwise,   (3)

where u is the number of ones in the input string.

A trap function of order 5, denoted by trap-5, is defined as

  ftrap5(u) = 4 - u if u < 5,
              5     otherwise.   (4)

A bipolar deceptive function of order 6, denoted by 6-bipolar, is defined with the use of the 3-deceptive function as follows:

  f6bipolar(u) = f3deceptive(|3 - u|)   (5)

6.2 RESULTS OF THE EXPERIMENTS

For all problems, the average number of fitness evaluations until convergence in 30 independent runs is shown. For the 3-deceptive and trap-5 functions, the population is said to have converged when the proportion of some value on each position reaches 95%. This criterion of convergence is applicable only to problems with at most one global optimum and selection schemes that do not force the algorithm to preserve diversity in the population (e.g. by niching methods). For the 6-bipolar function, the population is said to have converged when it contains over a half of optimal solutions. For all algorithms, the population sizes for all problem instances have been determined empirically as the minimal size such that the algorithms converge to the optimum in all of 30 independent runs. In all runs, truncation selection with τ = 50% has been used (the better half of the solutions is selected). Offspring replace the worse half of the old population. The crossover rate for the simple GA has been determined empirically for each problem with one problem instance. In the simple GA, the best results have been achieved with a probability of crossover of 100%. The probability of flipping a single bit by mutation has been set to 1%. In the BOA, no prior information but the maximal order of interactions to be considered has been incorporated into the algorithm.
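The test functions defined in Section 6.1 translate directly into code. The following sketch is ours (names such as `composed` are not from the paper); it assumes solutions are represented as lists of 0/1 values and a tight, non-overlapping mapping of building blocks onto the string:

```python
def f3deceptive(u):
    """3-deceptive function of unitation (Equation 3)."""
    return {0: 0.9, 1: 0.8, 2: 0.0}.get(u, 1.0)

def ftrap5(u):
    """Trap function of order 5 (Equation 4)."""
    return 4 - u if u < 5 else 5

def f6bipolar(u):
    """6-bipolar function (Equation 5), built from 3-deceptive."""
    return f3deceptive(abs(3 - u))

def composed(fk, k, string):
    """Additively compose f_k over consecutive, non-overlapping
    k-bit subsets of a binary string (tight mapping)."""
    assert len(string) % k == 0
    return sum(fk(sum(string[i:i + k])) for i in range(0, len(string), k))

# A string of all ones is the global optimum of the 3-deceptive problem:
# composed(f3deceptive, 3, [1] * 30) == 10.0
```

Note how each subfunction is deceptive: within a block, lowering the number of ones from 2 to 0 increases the value, which misleads operators and models that treat positions independently.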
[Figure 1: Results for 3-deceptive Function — number of fitness evaluations vs. size of the problem; BOA (k=2, K2 metric) and GA (one-point).]

[Figure 2: Results for trap-5 Function — number of fitness evaluations vs. size of the problem; BOA (k=4, K2 metric) and GA (one-point).]

[Figure 3: Results for 6-bipolar Function — number of fitness evaluations vs. size of the problem; BOA (k=5, K2 metric) and GA (one-point).]

In Figure 1, the results for the 3-deceptive function are presented. In this function, the deceptive building blocks are of order 3. The building blocks are non-overlapping and mapped tightly onto the strings. Therefore, one-point crossover is not likely to disrupt them. The looser the building blocks were, the worse the simple GA would perform. Since the building blocks are deceptive, the computational requirements of the simple GA with uniform crossover and the BOA with k = 0 (i.e., the UMDA) grow exponentially, and therefore we do not present the results for these algorithms. Some results for the BMDA can be found in Pelikan and Mühlenbein (1999). The BOA with k = 2 and the K2 metric performs the best of the compared algorithms in terms of the number of function evaluations until successful convergence. The simple GA with one-point crossover performs worse than the BOA with k = 2 as the problem size grows. For loose building blocks, the simple GA with one-point crossover would require a number of fitness evaluations growing exponentially with the size of the problem (Thierens, 1995). On the other hand, the BOA would perform the same, since it is independent of the variable ordering in a string. The population sizes for the GA ranged from N = 400 for n = 30 to N = 7700 for n = 180. The population sizes for the BOA ranged from N = 1000 for n = 30 to N = 7700 for n = 180.

In Figure 2, the results for the trap-5 function are presented. The building blocks are non-overlapping and they are again mapped tightly onto a string. The results for this function are similar to those for the 3-deceptive function. The population sizes for the GA ranged from N = 600 for n = 30 to N = 8100 for n = 180. The population sizes for the BOA ranged from N = 1300 for n = 30 to N = 11800 for n = 180.

In Figure 3, the results for the 6-bipolar function are presented. The results for this function are similar to those for the 3-deceptive function. In addition to the faster convergence, the BOA discovers a number of solutions out of the total of 2^(n/6) global optima of the 6-bipolar function instead of converging to a single solution. This effect could be further magnified by using niching methods. The population sizes for the GA ranged from N = 360 for n = 30 to N = 4800 for n = 180. The population sizes for the BOA ranged from N = 900 for n = 30 to N = 5000 for n = 180.

7 CONCLUSIONS

The experiments have shown that the proposed algorithm outperforms the simple GA even on decomposable problems with tight building blocks as the problem size grows. The gap between the proposed algorithm and the simple GA would significantly enlarge for problems with loose building blocks. For
loose mapping, the time requirements of the simple GA grow exponentially with the problem size. On the other hand, the BOA is independent of the ordering of the variables in a string, and therefore changing this would not affect the performance of the algorithm. The proposed algorithm also works very well for other problems with highly overlapping building blocks, e.g. spin-glasses, that are not discussed in this paper.

Acknowledgments

The authors would like to thank Heinz Mühlenbein, David Heckerman, and Ole J. Mengshoel for valuable discussions and useful comments. Martin Pelikan was supported by grants number 1/4209/97 and 1/5229/98 of the Scientific Grant Agency of the Slovak Republic. The work was sponsored by the Air Force Office of Scientific Research, Air Force Materiel Command, USAF, under grant number F49620-97-1-0050. Research funding for this project was also provided by a grant from the U.S. Army Research Laboratory under the Federated Laboratory Program, Cooperative Agreement DAAL01-96-2-0003. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official

References

Goldberg, D. E. (1989). Genetic algorithms in search, optimization, and machine learning. Reading, MA: Addison-Wesley.

Goldberg, D. E., Korb, B., & Deb, K. (1989). Messy genetic algorithms: Motivation, analysis, and first results. Complex Systems, 3(5), 493–530.

Harik, G. R., & Goldberg, D. E. (1996). Learning linkage. Foundations of Genetic Algorithms, 4, 247–262.

Harik, G. R., Lobo, F. G., & Goldberg, D. E. (1997). The compact genetic algorithm (IlliGAL Report No. 97006). Urbana, IL: University of Illinois at Urbana-Champaign, Illinois Genetic Algorithms Laboratory.

Heckerman, D., Geiger, D., & Chickering, D. M. (1994). Learning Bayesian networks: The combination of knowledge and statistical data (Technical Report MSR-TR-94-09). Redmond, WA: Microsoft Research.

Howard, R. A., & Matheson, J. E. (1981). Influence diagrams. In Howard, R. A., & Matheson, J. E. (Eds.), Readings on the principles and applications of decision analysis, Volume II (pp. 721–762). Menlo Park, CA: Strategic Decisions Group.

Kvasnicka, V., Pelikan, M., & Pospichal, J. (1996). Hill climbing with learning (An abstraction of genetic algorithm). Neural Network World, 6, 773–796.

Mühlenbein, H. (1997). The equation for response to selection and its use for prediction. Evolutionary Computation, 5(3), 303–346.

Mühlenbein, H., Mahnig, T., & Rodriguez, A. O. (1998). Schemata, distributions and graphical models in evolutionary optimization. Submitted for publication.
policies and endorsements, either expressed or implied, Mühlenbein, H., & Paaß, G. (1996). From recombination
of genes to the estimation of distributions I. Binary
of the Air Force of Scientific Research or the U.S. Gov- parameters. Parallel Problem Solving from Nature,
ernment. 178–187.
Munetomo, M., & Goldberg, D. E. (1998). Design-
References ing a genetic algorithm using the linkage identi-
fication by nonlinearity check (Technical Report
Baluja, S. (1994). Population-based incremental learn- 98014). Urbana, IL: University of Illinois at Urbana-
ing: A method for integrating genetic search Champaign, Illinois Genetic Algorithms Laboratory.
based function optimization and competitive learning
(Tech. Rep. No. CMU-CS-94-163). Pittsburgh, PA: Pearl, J. (1988). Probabilistic reasoning in intelligent
Carnegie Mellon University. systems: Networks of plausible inference. San Ma-
teo, CA: Morgan Kaufmann.
Baluja, S., & Davies, S. (1997). Using optimal
dependency-trees for combinatorial optimization: Pelikan, M., & Mühlenbein, H. (1999). The bivariate
Learning the structure of the search space. Pro- marginal distribution algorithm. Advances in Soft
ceedings of the International Conference on Machine Computing - Engineering Design and Manufactur-
Learning, 30–38. ing, 521–535.

Bandyopadhyay, S., Kargupta, H., & Wang, G. (1998). Thierens, D. (1995). Analysis and design of genetic al-
Revisiting the GEMGA: Scalable evolutionary opti- gorithms. Doctoral dissertation, Katholieke Univer-
mization through linkage learning. Proceedings of the siteit Leuven, Leuven, Belgium.
International Conference on Evolutionary Computa-
tion (ICEC-98), 603–608.
De Bonet, J. S., Isbell, C. L., & Viola, P. (1997). MIMIC:
Finding optima by estimating probability densities.
Advances in neural information processing systems
(NIPS-97), 9 , 424–431.
Edmonds, J. (1967). Optimum branching. J. Res. Nat.
Bur. Standards, 71B, 233–240.
Goldberg, D. E. (1989). Genetic algorithms in search,