
An Architecture for Selecting Exact Algorithms

to Solve the Distribution Design Problem

Laura Cruz R., Joaquin Pérez O., Irma Y. Hernández B., Nelson Rangel V.,
Norma E. Garcı́a A., and Victor M. Alvarez H.

Instituto Tecnológico de Ciudad Madero


lcruz@itcm.edu.mx,jperez@sd-cenidet.edu.mx,
{irmabaez,ia32,ganeisc,mantorvicuel}@hotmail.com

Abstract. This paper focuses on the algorithm selection problem, which
can be formulated as follows: given a set of available algorithms and a
new instance of an NP-hard problem, predict which algorithm will solve
it best. Two main selection approaches exist for this problem. The first
consists of developing functions that relate performance to problem size.
The second incorporates more characteristics, but these are defined
neither formally nor systematically. In contrast, we propose a formal
methodology to model algorithm performance predictors that incorporate
more of the critical characteristics. In previous work on approximate
algorithms, the relationship between performance and characteristics was
learned from historical data. In this work, a selection mechanism based
on heuristic sampling is proposed for exact algorithms. To validate our
approach, a set of instances of the Distribution Design problem was
solved with backtracking algorithms. The experimental results show that
it is feasible to use heuristic sampling to select the most promising
exact algorithm.

1 Introduction
Advances in computer and communication technology have facilitated the im-
plementation of Distributed Database Systems. However, commercial Distributed
Database Management Systems must be administered by expert professionals,
who often develop databases for the Web without the support of robust method-
ologies and design-assistance tools. A distributed system with a poor data
distribution design can suffer severe performance degradation.
Few mathematical models have been developed for data-object distribution
on the Web; in [1] we proposed one of them, named the DFAR model, an acronym
for distribution, fragmentation, allocation, and reallocation of data-objects. The
problem modeled by DFAR is NP-hard and has been solved using the exact
method Branch and Bound and approximate algorithms such as Threshold Accept-
ing, Tabu Search, and Genetic Algorithms. We have carried out a large number
of experiments with them, and no algorithm showed absolute superiority. These
results are consistent with the conjecture of Wolpert's No-Free-Lunch Theorem
(NFL) that different algorithms are appropriate for different problems [2].
Hence, we have also been working on developing an automatic method for
algorithm selection [3]. In this paper, the selection of exact algorithms is ap-
proached with an architecture based on heuristic sampling. The proposed ar-
chitecture chooses, from a set of exact algorithms, the most promising one
to solve a particular instance of DFAR. The architecture consists mainly of
two modules. The Converter module transforms DFAR instances into instances of
the classical Constraint Satisfaction Problem (CSP); with a converter, we can take
advantage of the already existing algorithms for CSP. The Selector module performs
a heuristic sampling.
To validate our approach, we carried out experiments using exact algorithms
based on backtracking search. The Selector module performs a heuristic sampling
derived from Knuth's method; in other words, it estimates efficiency by taking
random samples of the search tree associated with a particular algorithm.
This paper is organized as follows: a description of the DFAR and CSP problems
is given in section 2. Section 3 describes three backtracking solution methods.
The heuristic sampling method and related work are presented in section 4. The
architecture to select exact algorithms is presented in section 5. The experimental
results are shown in section 6.

2 Optimization Problems
In this section, the mathematical formulations of the data distribution problem
modeled by DFAR and of the Constraint Satisfaction Problem are described.

2.1 DFAR Mathematical Model for the Distribution Design Problem
DFAR is an integer (binary) programming model. The decision to store an
attribute m at site j is represented by a binary variable xmj. The objective
function below models costs using four terms: transmission, access to several
fragments, storage, and migration. The model's solutions are subject to five con-
straints. A detailed description of this model can be found in [1].

\min z = \sum_{k}\sum_{i}\sum_{m}\sum_{j} f_{ki}\, q_{km}\, l_{km}\, c_{ij}\, x_{mj}
       + c_1 \sum_{i}\sum_{k}\sum_{j} f_{ki}\, y_{kj}
       + \sum_{j} c_2\, w_j
       + \sum_{m}\sum_{i}\sum_{j} a_{mi}\, c_{ij}\, d_m\, x_{mj}    (1)

where
fki = emission frequency of query k from site i;
qkm = usage parameter, qkm = 1 if query k uses attribute m,
otherwise qkm = 0;
lkm = number of packets for transporting attribute m for query k;
cij = communication cost between sites i and j;
c1 = cost for accessing several fragments to satisfy a query;
ykj = indicates if query k accesses one or more attributes located at site j;
c2 = cost for allocating a fragment to a site;
wj = indicates if there exist attributes at site j;
ami = indicates if attribute m was previously located at site i;
dm = number of packets for moving attribute m to another site if necessary;
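For concreteness, the objective function (1) can be evaluated for a candidate binary assignment x as in the following sketch. The parameter names mirror the list above, but the nested-list encoding and the function itself are our illustration, not code from [1]:

```python
def dfar_cost(x, f, q, l, c, c1, y, c2, w, a, d):
    """Evaluate the four-term DFAR objective (Eq. 1) for a candidate
    assignment x[m][j] (1 if attribute m is stored at site j).
    Indices follow the paper: k = query, i = origin site,
    m = attribute, j = target site. All inputs are nested lists."""
    K, I = len(f), len(f[0])          # queries, sites
    M, J = len(x), len(x[0])          # attributes, sites
    # Term 1: transmission cost of sending attribute packets to queries.
    t1 = sum(f[k][i] * q[k][m] * l[k][m] * c[i][j] * x[m][j]
             for k in range(K) for i in range(I)
             for m in range(M) for j in range(J))
    # Term 2: penalty when a query must access several fragments.
    t2 = sum(c1 * f[k][i] * y[k][j]
             for i in range(I) for k in range(K) for j in range(J))
    # Term 3: storage cost for every site that holds attributes.
    t3 = sum(c2 * w[j] for j in range(J))
    # Term 4: migration cost for attributes moved from their previous site.
    t4 = sum(a[m][i] * c[i][j] * d[m] * x[m][j]
             for m in range(M) for i in range(I) for j in range(J))
    return t1 + t2 + t3 + t4
```

A one-attribute, two-site instance suffices to exercise every term and check the arithmetic by hand.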

2.2 The Constraint Satisfaction Problem

The objective of CSP is to assign values to its variables such that the
assignment satisfies the established constraints. A CSP is defined by three
sets: X is a set of n variables x1 , . . . , xn that defines the problem; D is the
set of domains of the variables, where D(x1 ), . . . , D(xn ) denote the valid
values for each variable; C is a set of m constraints c1 , . . . , cm on the
variables in X. A solution for this type of problem is an assignment of a
value ai ∈ D(xi ) to each xi , where 1 ≤ i ≤ n, that satisfies all constraints.
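The definition above can be made concrete with a tiny brute-force solver; it is only an illustration of the CSP terminology (the names are our own), not one of the algorithms evaluated later:

```python
from itertools import product

def solve_csp(domains, constraints):
    """Return the first assignment (a_1, ..., a_n) with a_i in D(x_i)
    that satisfies every constraint, or None if no such assignment
    exists. Constraints are predicates over the complete assignment."""
    for assignment in product(*domains):   # enumerate D(x1) x ... x D(xn)
        if all(check(assignment) for check in constraints):
            return assignment
    return None
```

For instance, two variables with domains {1, 2} and the constraints x1 ≠ x2 and x1 < x2 yield the solution (1, 2).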

3 Exact Solution Methods

The most commonly used algorithms for CSP are based on chronological back-
tracking; variants can be obtained by adding arc consistency and Branch and
Bound techniques.

3.1 Chronological Backtracking Algorithm

Chronological backtracking (CB) is a well-known recursive algorithm.
Each recursive call corresponds to a node in the search tree. At each node, the
algorithm selects an uninstantiated variable. Each value in the domain of the
selected variable is tried in turn, and a function that verifies the constraints
is invoked. If the current assignment does not violate any constraint, the search
continues at the next depth level.
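The recursion just described can be sketched as follows (fixed variable order, one recursive call per node; the helper names are ours):

```python
def chronological_backtracking(domains, consistent, assignment=()):
    """Chronological backtracking: every call is one node of the search
    tree. The next uninstantiated variable is the one at position
    len(assignment); each of its values is tried, and the constraint
    check `consistent` is invoked before descending one level."""
    if len(assignment) == len(domains):
        return assignment                      # all variables instantiated
    for value in domains[len(assignment)]:     # try each domain value
        candidate = assignment + (value,)
        if consistent(candidate):              # no constraint violated
            solution = chronological_backtracking(domains, consistent, candidate)
            if solution is not None:
                return solution
    return None                                # dead end: backtrack
```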

3.2 Backtracking Algorithm with Arc Consistency

The main difference between backtracking algorithms is their level of con-
sistency. The most commonly used technique for domain reduction is Arc Con-
sistency (AC). This technique verifies consistency on the variables that have
not yet been instantiated and removes values from their domains, reducing the
branching factor. We implemented two functions with different consistency
levels: General Arc Consistency (GAC3) and Arc Consistency 4 (AC4). The key
data structure used by GAC3 [4] is a stack of variables; the AC4 algorithm [5]
uses a stack of variable-value records.
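The domain-reduction idea behind these variants can be illustrated with a single AC-3 style revision step; this is a generic sketch of arc consistency, not the GAC3/AC4 implementations used in the paper:

```python
def revise(domains, constraint, xi, xj):
    """Remove from D(xi) every value with no support in D(xj) under the
    binary constraint(vi, vj). Returns True if D(xi) was reduced, which
    is what shrinks the branching factor during search."""
    removed = False
    for v in list(domains[xi]):                       # copy: we mutate below
        if not any(constraint(v, w) for w in domains[xj]):
            domains[xi].remove(v)                     # v has no support
            removed = True
    return removed
```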
3.3 Backtracking Algorithm with Branch and Bound

The backtracking algorithms described above solve decision problems like CSP,
but not optimization problems like DFAR. The code in Figure 1 is an adaptation
of backtracking with B&B that solves optimization problems.
The algorithm finds an assignment with a cost lower than a given upper bound
(c*). It searches the entire tree and, when it finds a feasible solution (line 1),
updates the upper bound. The algorithm verifies consistency on the variables that
have not yet been instantiated and removes values from their domains (lines 5 to
9), reducing the branching factor with the function FORCE CONSISTENCY. If
the partial assignment does not exceed the upper bound (line 11), the algorithm
continues at the next depth level (line 12); otherwise it prunes the branch, restores
the level, and backtracks.

BACKTRACKING B&B(level, c∗)

1: if leaf then
2: c∗ ← b
3: return true
4: end if
5: var ← SELECT VARIABLE()
6: for all d ∈ D(var) do
7: if (var, d) is a valid label then
8: var ← d
9: if FORCE CONSISTENCY() then
10: b ← CALCULATE PARTIAL Z()
11: if b < c∗ then
12: if BACKTRACKING B&B(level + 1) then
13: return true
14: end if
15: RESTORE(level)
16: end if
17: end if
18: end for

Fig. 1. BACKTRACKING B&B Algorithm

4 Heuristic Sampling Method

In this section, we review related work on sampling-based prediction of the
performance of tree-search algorithms and present the proposed heuristic sampling
method.
4.1 Related Works

A simple method for predicting the performance of a backtracking algorithm
was proposed by Knuth [6]. A sample is obtained by exploring a single path
traced from the root down to a leaf of the tree; the path is selected at random.
The statistics of interest for Knuth are the cost of processing a node and the
number of children of that node. If the number of children of a node at level i
is di , then 1 + d1 + (d1 × d2 ) + . . . + (d1 × . . . × dn ) is an estimate of the
number of nodes in the tree. If the cost of processing each node at level i is
ci , then c1 + c2 (d1 ) + c3 (d1 × d2 ) + . . . + cn+1 (d1 × . . . × dn ) is an estimate
of the cost of processing all nodes of the tree. Knuth's sampling method has
been improved in different ways.
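A minimal sketch of Knuth's estimator: each probe walks one random root-to-leaf path and accumulates the running products of the branching factors, exactly the series 1 + d1 + d1 d2 + . . .; the `children` callback and the node representation are our own assumptions:

```python
import random

def knuth_estimate(root, children, trials=1000, seed=0):
    """Estimate the number of nodes in a search tree by averaging
    `trials` independent random probes. On each probe, a path is
    followed from the root to a leaf; at level i the running product
    of branching factors d_1 * ... * d_i is added to the estimate."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        estimate, weight, node = 1, 1, root   # the root itself counts as 1
        kids = children(node)
        while kids:
            weight *= len(kids)               # d_1 * ... * d_i so far
            estimate += weight                # estimated nodes at this level
            node = rng.choice(kids)           # descend along a random child
            kids = children(node)
        total += estimate
    return total / trials
```

On a uniform tree every probe returns the exact count; the estimator's variance appears only on irregular trees, which motivates the refinements discussed next.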
Purdom [7] found that Knuth's method has difficulty estimating the deeper
levels of trees in which many nodes at each level have no successors. To
overcome this difficulty, Purdom proposed an algorithm called Partial Back-
tracking (PB). While Knuth's algorithm explores exactly one path from the root
down to a leaf, the PB algorithm allows the exploration of any specified number
(w) of children of each visited node. The efficiency and accuracy of Purdom's
method depend on the selection of w: as w increases, the PB method tends
toward a complete backtracking, which increases accuracy but decreases the
efficiency of the estimator, consuming more time.
Another improvement to Knuth's method was made by Sillito [8]. He proposed
that, up to a certain depth level of the tree (depth), a complete search be
carried out; the number of nodes at the deeper levels is estimated by applying
sampling to sub-trees. Thus, for all i < depth, the number of nodes at level i
is known exactly. Lobjois and Lemaitre [9] applied Knuth's sampling method to
a Branch and Bound algorithm to solve a weighted CSP.

4.2 Proposed Method for Heuristic Sampling

In order to estimate the performance of the backtracking algorithms (CB, GAC3
and AC4) implemented in this work, we developed a PREDICTOR algorithm.
In the PREDICTOR algorithm shown in Figure 2, a complete exploration is
carried out for the tree levels smaller than the maximum-depth parameter
(lines 7 to 21). When level equals depth, the function NODES ESTIMATOR is
called (line 2), which estimates the number of nodes that will be processed by
each of the subtrees generated from the depth level onward.
The code in Figure 3 shows the NODES ESTIMATOR function. This function
takes several samples (threads) of the sub-tree (line 2), calculates
EstimatedLength using the SAMPLING function (line 5), and computes
AverageEstimatedLength (lines 6 and 8).
The structure of the SAMPLING function (Figure 4) is obtained from the
structure of the BACKTRACKING B&B function (Figure 1). The main changes
are: the elimination of the loop that analyzes all the values of the variable,
choosing instead only one value at random (line 5); the estimation of the number
of nodes below the current level via the weight w (line 6); and the accumulation
of the number of nodes per level (line 7).

PREDICTOR(level, depth, threads, c∗)

1: if level is equal to depth then
2: tnodes ← tnodes + NODES ESTIMATOR(level, threads, c∗)
3: end if
4: if leaf then
5: return false
6: end if
7: var ← CHOOSE VARIABLE()
8: for d ∈ D(var) do
9: if (var, d) is a valid label then
10: var ← d
11: if FORCE CONSISTENCY() then
12: b ← CALCULATE PARTIAL Z()
13: if b < c∗ then
14: if PREDICTOR(level + 1) then
15: return true
16: end if
17: end if
18: RESTORE(level)
19: end if
20: end if
21: end for

Fig. 2. PREDICTOR Algorithm

NODES ESTIMATOR(level, threads, c∗)

1: AverageEstimatedLength ← 0
2: for j = 1 to threads do
3: w ← 1
4: EstimatedLength ← 0
5: SAMPLING(level, w)
6: AverageEstimatedLength ← AverageEstimatedLength + EstimatedLength
7: end for
8: AverageEstimatedLength ← AverageEstimatedLength/threads
9: return AverageEstimatedLength

Fig. 3. NODES ESTIMATOR Algorithm


SAMPLING(level, w)

1: if leaf then
2: return true
3: end if
4: var ← CHOOSE VARIABLE()
5: d ← choose randomly a value from D(var)
6: w ← w · |D(var)|
7: EstimatedLength ← EstimatedLength + w
8: if (var, d) is a valid label then
9: if FORCE CONSISTENCY() then
10: b ← CALCULATE PARTIAL Z()
11: if b < c∗ then
12: SAMPLING(level + 1)
13: end if
14: RESTORE(level)
15: end if
16: end if
17: return false

Fig. 4. SAMPLING Algorithm

5 General Architecture for Algorithm Selection

It is known that in real-life situations no algorithm outperforms all the others.
We are working on the development of a general architecture for algorithm
selection. It is well accepted that approximate algorithms are a good alternative
for very large instances; in [3] we proposed a selection methodology for them.
In contrast, exact algorithms are considered adequate for small instances; this
paper addresses the selection of this kind of algorithm.

5.1 Selection of Approximated Algorithms

A few researchers have identified algorithm dominance regions considering more
than one problem characteristic, but they neither identify formally and system-
atically the characteristics that critically affect performance nor incorporate
them explicitly in a performance model. In contrast, in [3] we proposed a
methodology to model algorithm performance predictors that incorporate critical
characteristics. The relationship between performance and characteristics is
learned from historical data using machine learning techniques.
The selection methodology consists of three phases: initial training, predic-
tion, and retraining (Figure 5). The first phase constitutes the kernel of the
selection process. In this phase, starting from a set of historical data solved with
several algorithms, machine learning techniques are applied, in particular clus-
tering and classification, to learn the relationship between performance and
problem characteristics. In the prediction phase, the learned relationship is ap-
plied to select an algorithm for a given instance. The purpose of the retraining
phase is to improve the accuracy of the prediction with new experiences.

Fig. 5. Selection architecture for approximated algorithms

5.2 Proposed Architecture for Selection of Exact Algorithms

In this section, the proposed selection architecture for backtracking algorithms
is described. Given a DFAR instance to be solved and a list of candidate algo-
rithms (variants of backtracking obtained with Arc Consistency and B&B), the
best algorithm is selected from the candidate set, that is, the algorithm that is
expected to solve the instance in the smallest possible time.
Figure 6 shows the selection architecture. An instance modeled by DFAR is
received as input and converted to a CSP instance by the CONVERTER module.
This CSP instance, together with a group of candidate algorithms, is the input
to the SELECTOR module, whose function is to determine the estimated execu-
tion time of each candidate algorithm and choose the best one. To carry out this
function, the SELECTOR applies two estimators: an estimator of the number of
nodes of the search tree expanded by each candidate algorithm, and an estimator
of the average time consumed per node of the tree.
In the SELECTOR algorithm in Figure 7, I is the instance to be tested; L is
the group of candidate algorithms; tn is the time allowed to estimate the time
per node; tm is the time allowed to estimate the number of nodes of the tree;
c∗ is the upper bound, which can be calculated from the first feasible solution
found during the search process; depth is the maximum level down to which all
the nodes of the tree will be visited; and threads is the number of samples to
take from the tree.
The first function executes algorithm Ai for a time tn on instance I and
returns an estimate n of the average execution time per node for Ai . The
PREDICTOR function samples the search tree traced by algorithm Ai for a
time tm and returns the predicted average number of nodes m. The product of
n and m gives the estimated execution time.

Fig. 6. Selection architecture for exact algorithms

SELECTOR(I, L, tn , tm , c∗, depth, threads)


1: for each algorithm Ai in L
2: n ← ESTIMATED TIME PER NODE(I, Ai , tn , c∗)
3: m ← PREDICTOR(level, depth, threads, tm, c∗)
4: timei ← n × m
5: end for

Fig. 7. Selector Algorithm
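The selection rule reduces to taking the candidate with the smallest product n × m. A minimal sketch, in which the two estimator callables stand in for ESTIMATED TIME PER NODE and PREDICTOR:

```python
def select_algorithm(candidates, time_per_node, predicted_nodes):
    """Return the candidate with the smallest estimated execution time
    time_i = n_i * m_i, mirroring the SELECTOR loop of Figure 7."""
    best, best_time = None, float("inf")
    for algo in candidates:
        estimated = time_per_node(algo) * predicted_nodes(algo)
        if estimated < best_time:
            best, best_time = algo, estimated
    return best, best_time
```

With hypothetical per-node times and node counts, the candidate minimizing the product is picked even if it is neither the fastest per node nor the smallest tree.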


6 Experimentation

Experiments were carried out with the proposed selection architecture for
exact algorithms. Three backtracking algorithms with B&B were tested: CB,
AC4 and GAC3.
Table 1 presents a small instance set selected from a sample of 100 DFAR
instances; columns 2, 3 and 4 contain the numbers of attributes, sites, and
queries, and column 5 indicates the level of capacity constraint. Because the
DFAR instances were converted to CSP instances, columns 6 and 7 show the
number of variables and constraints, respectively. Column 8 shows the optimal
solution of the DFAR instances.

Table 1. Example of DFAR instances converted to CSP instances

                     DFAR                               CSP
Instance   Attrib.  Sites  Queries  Constrained   Num. of    Num. of       Optimum
                                    Capacity      Variables  Constraints
I 1        30       20     20       Maximum       600        50            40532
I 2        33       22     22       Maximum       726        55            44585
I 3        33       22     22       Without       726        33             3324
I 4        39       26     26       Medium        1014       64            52691
I 5        39       26     26       Without       1014       39             3928
I 6        42       28     28       Without       1176       42             4230
I 7        51       34     34       Without       1734       51             5137

The SELECTOR algorithm was executed 30 times for each instance using
tn = 1 sec, tm = 2 sec, threads = 5 and depth = 0.2. The results are summarized
in Table 2. To verify the prediction quality of the SELECTOR, the algorithms
CB, GAC3 and AC4 were executed to completion in order to determine the real
best algorithm for each instance (column 2) and the average time tb to execute
it (column 3). Columns 4, 5 and 6 indicate the number of times each algorithm
was chosen as the most promising; the average time ts to execute the selected
algorithm is in column 7. Column 8 shows the accuracy (percentage of successful
selections).
For example, the algorithm CB was the best for solving instance I 1, and its
execution time was 9.61 seconds. The selector chose the algorithms AC4, CB
and GAC3 13, 17 and 0 times, respectively, and the selected algorithm was
executed in 8.90 seconds on average. For this instance, an accuracy of 57% was
obtained; in general, the selector predicted the right algorithm for 86% of the
100 instances.
In addition, the results of executing the selected algorithms were contrasted
with a random selection. The accumulated time of our proposal (4765.51 sec)
was smaller than that of the random selection (7273.91 sec); the difference was
2508.4 seconds (41.8 minutes).
Table 2. Results of the algorithm selection experiments

             Best Algorithm                     Selector
Instance     Real   Execution   Selected Algorithm   Selection   Accuracy
                    time tb     AC4   CB   GAC3      Time ts     %
I 1          CB       9.61      13    17   0           8.90      57
I 2          CB      13.35       6    24   0           6.10      80
I 3          AC4    130.94      21     8   1           3.52      70
I 4          CB      24.25       5    25   0          17.00      83
I 5          CB      23.29       6    24   0          21.65      80
I 6          CB      46.24       7    23   0          24.72      77
I 7          AC4    728.64      20    10   0           4.23      67

7 Conclusions and Future Work

We have proposed a software architecture to select the best backtracking algo-
rithm for a particular instance. The architecture consists mainly of two modules.
The Converter module transforms instances of the Distribution Design problem
into instances of the classical Constraint Satisfaction Problem. The Selector
module estimates efficiency by taking random samples of the search tree asso-
ciated with a particular algorithm; the heuristic sampling was derived from
Knuth's method. The results show that the proposed architecture is an efficient
method to select an exact algorithm: an accuracy of 86% was obtained, and the
accumulated time was 41 minutes less than that of a random selection.
We consider that the principles followed in this research can be applied to
other NP-hard problems that can be converted to CSP. Although the theoretical
conclusions on the estimator have not been encouraging, in practice the results
proved satisfactory. The experimental results clearly show that there is a par-
ticular algorithm that is better for each specific instance.
We plan to integrate this work with the general selection architecture devel-
oped by us, in order to automatically select the best among different exact and
approximate algorithms.

References

[1] Pérez, J., Pazos, R.A., Frausto, J., Reyes, G., Santaolaya, R., Fraire, H., Cruz,
L.: An Approach for Solving Very Large Scale Instances of the Distribution Design
Problem for Distributed Database Systems. Lecture Notes in Computer Science,
Springer-Verlag, Berlin Heidelberg New York, 2005.
[2] Wolpert, D.H., Macready, W.G.: No Free Lunch Theorems for Optimization.
IEEE Transactions on Evolutionary Computation, Vol. 1, pp. 67-82, 1997.
[3] Pérez, J., Pazos, R.A., Frausto, J., Rodríguez, G., Romero, D., Cruz, L.: A Sta-
tistical Approach for Algorithm Selection. Lecture Notes in Computer Science, Vol.
3059, Springer-Verlag, Berlin Heidelberg New York, pp. 417-431, 2004.
[4] Kumar, V.: Algorithms for Constraint Satisfaction Problems: A Survey. AI Maga-
zine, 1992.
[5] Liu, Z.: Algorithms for Constraint Satisfaction Problems. Master's Thesis, Univer-
sity of Waterloo, Ontario, Canada, 1998.
[6] Knuth, D.: Estimating the Efficiency of Backtrack Programs. Mathematics of
Computation, 1975.
[7] Purdom, P.: Tree Size by Partial Backtracking. SIAM Journal on Computing, 1978.
[8] Sillito, J.: Improvements to and Estimating the Cost of Backtracking Algorithms
for Constraint Satisfaction Problems. Master's Thesis, University of Alberta, Edmon-
ton, Alberta, 2000.
[9] Lobjois, L., Lemaitre, M.: Branch and Bound Algorithm Selection by Performance
Prediction. Proceedings of the Fifteenth National Conference on Artificial Intelli-
gence, Madison, Wisconsin, 1998.
