
Efficient Rule Matching in Large Scale Rule Based Systems

Jack Tan, Dept. of Computer Science, University of Houston, Houston, TX 77204
Jaideep Srivastava, Computer Science Department, University of Minnesota, Minneapolis, MN 55455

ABSTRACT
This paper presents an efficient matching algorithm for large scale production systems. The matching algorithm is state saving and performs incremental evaluation. Implementation details are presented, including the optimization of join tests and efficient buffering of data blocks. A cost analysis is presented, both in terms of matching evaluation cost and of the storage cost for the saved state. Results from our performance study show that substantial savings in matching cost are obtained with little space overhead for saving state. Matching becomes computationally intensive in a secondary memory environment, and efficient algorithms are a must for successful integration of production systems and databases.
1. Introduction

The implementation of AI production systems, e.g., OPS5 [FORG81], has focused on the main memory environment, where the working memory and the matching state information (if any) is relatively small. The advent of large scale production systems [SELL88, SRIV90], stemming from the need to model applications with large volumes of knowledge, requires shifting to a secondary memory environment. Thus, integrating production systems with database technology is a natural means for achieving the objectives of large scale production systems. Research in enhancing production system performance has focused primarily on improving the computationally expensive matching phase. This is attributed to the observation that matching consumes a significant portion of the production cycle [GUPT88, FORG82]. Existing efficient matching algorithms, e.g., Rete [FORG82] and TREAT [MIRA87], work reasonably for small main memory-based systems. However, with a large working memory database resident on slow disk-based storage, matching is considerably more expensive and the performance of these algorithms may

not scale up. In this paper our emphasis is on the matching phase of the production system cycle, and we present an efficient algorithm for it that is suitable in a disk-based environment. We illustrate this by showing how the OPS5 and relational models are compatible, and thus implementation techniques in one domain are applicable to the other. Rete is a popular main memory-based matching algorithm that does incremental matching by storing a history of the matching tuples in a pre-computed discrimination network. The efficiency of Rete is well debated in the literature [NAYA88]. Rete matching, designed for a rule base and working memory (WM) small enough to fit completely in main memory, becomes unmanageable for a large rule base. Another well-known main memory-based matching algorithm is TREAT [MIRA87], which extends quite naturally to a relational database, since matching is equivalent to partial query evaluation (results of selections are materialized while joins are recomputed every time they are needed). Re-evaluation algorithms, however, are likely to be expensive, since they do not take full advantage of a priori rule knowledge. Moreover, the debate on the superiority of TREAT over Rete, even in main memory, is still unresolved [MIRA87, NAYA88]. Hence, a pure TREAT implementation in a database environment is by no means the clear choice. Matchbox [PERL89], another main memory-based algorithm, divides storage into numerous buckets based on the binding space, i.e., the cross product of the ranges of the join variables. A secondary memory implementation of Matchbox is infeasible since the size of the entire binding space would be prohibitively large. Recently, some secondary memory-based matching algorithms have been proposed [SELL88, DAYA88, STON90, RISC89, MIRA90]. The COND relations approach [SELL88] uses a priori information about condition elements to create relations that functionally determine
the equivalent of a rule trigger. It uses pre-computed information for incremental matching. However, it stores information in a highly redundant manner, and its implementation and performance have not been reported so far. The authors justify the high redundancy as a means of achieving faster matching; however, this also implies that high maintenance costs are incurred. The match algorithm of [DAYA88] utilizes a graph-based model called signal graphs that performs incremental matching by storing delta relations. Signal graphs are similar to global access plans in multiple query optimization. Implementation and performance details have not been reported so far. The match mechanism of POSTGRES [STON90] evaluates conditions based on an extension of the relational query language QUEL. Matching is based on the re-evaluation of a rule's condition. Such a non-state-saving approach may be appropriate for applications where the rule set is dynamic. However, in production systems the rule set is known and usually static, and this a priori knowledge should be exploited for efficient incremental evaluation. Condition monitoring in IRIS, an object-oriented database [RISC89], is implemented as a separate module. As a result, it does not understand the internal database structures, which would likely lead to inefficient matching. Furthermore, it becomes a difficult task to determine the periodicity of triggering the match. The performance of such an implementation has not been reported so far. Matching was integrated with selection in [MIRA90], i.e., the matching only creates the binding to be picked next by the conflict resolution mechanism. An example of a conflict resolution scheme that this technique implements is recency. It is not clear whether other types of conflict resolution schemes can be implemented using this technique. Furthermore, for set-oriented actions where all the bindings are required, matching has to be performed repeatedly, i.e., once for each binding, making it much more expensive than regular matching. Our matching algorithm, the Tuple-Oriented Rule Incremental Matcher (TRIM), implements matching using a data structure called RBIND-TBL. An RBIND-TBL is associated with each rule, making this approach the dual of Sellis's COND relations, where a COND relation is associated with each relation. The main features of TRIM are the provision of matching state storage, incremental matching, sharing of conditions, and optimization of the match structure. Instead of repetitive re-evaluation, it performs incremental evaluation by decomposing the storage of history, optimizes the match structure, and permits the sharing of common conditions. Some ideas in the implementation of TRIM are derived from the view materialization technique [BLAK90], which materializes the results of successful join tests, facilitating the efficient processing of subsequent tests. The organization of the paper is as follows: the similarities between OPS5 and the relational data model are discussed in Section 2. Section 3 describes the TRIM algorithm. Implementation details, optimization via sharing across multiple rules, and storage and time complexity analyses are presented in Section 4. Section 5 contains a simulation-based performance evaluation of TRIM, while conclusions are presented in Section 6.
2. Similarities Between OPS5 and the Relational Model

A major motivation for utilizing relational implementation techniques in TRIM is the natural correspondence between relational algebra and OPS5 LHS conditions. This correspondence is shown here through examples, and a formal proof of this equivalence can be found in [DELC88]. Table 1 lists the correspondence between various constructs.

Data Definition:

OPS5 Description          | Relational Model
Classes                   | Relations
Fields                    | Attributes
Working Memory Elements   | Tuples

Operations:

OPS5 Syntax               | Relational Model
(A ↑A1 <x>) (B ↑B2 <x>)   | A.A1 = B.B2
-(A ↑A2 a)                | {t ∈ A | t.A2 ≠ a} = A − {t ∈ A | t.A2 = a}

Table 1. Natural Correspondence Between OPS5 and the Relational Model.

The relational model equivalence of an LHS with negative conditions is shown in the next example. Consider an LHS whose positive part joins classes A and B (on A.A1 = B.B2) and whose negative part references classes C and D (on B.B1 = C.C1 and A.A2 = D.D1, respectively).


The SQL equivalent of the above is

SELECT * FROM
  ((SELECT * FROM A, B WHERE A.A1 = B.B2)
   MINUS
   (SELECT * FROM A, B, C, D WHERE B.B1 = C.C1 OR A.A2 = D.D1))

It has been shown that a relationally complete query language is strictly more expressive than OPS5 [DELC88]. Let RM= and RM<,>,≤,≥,=,≠ represent the relational model with only the equijoin operator and with all possible join operators, respectively; all other relational operators are included in either. Then

Expressive Power(OPS5) = Expressive Power(RM=) ⊂ Expressive Power(RM<,>,≤,≥,=,≠).

Consider the LHS of an OPS5 production as consisting of two disjoint parts, LHS+ and LHS−, such that LHS+ is the conjunction of all positive condition elements and LHS− is the conjunction of all negative condition elements. Further, let E+ be the set of tuples satisfying LHS+, while E− is the set of tuples satisfying the conjunction of LHS+ and LHS−. The SQL equivalent of the LHS is then the set of tuples E+ MINUS E−. These observations are pleasing, since we now have a gamut of relational database implementation techniques at our disposal for use in building large scale production systems.
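To make the LHS+/LHS− decomposition concrete, here is a small illustrative sketch (ours, not from the paper; the class contents and attribute values are made up) that evaluates an LHS with the same shape as the SQL query above as E+ MINUS E− over in-memory sets:

# A minimal, set-oriented sketch of LHS evaluation as E+ MINUS E-.
# Working memory classes are plain lists of dicts; the rule used here
# (positive: A, B joined on A.A1 = B.B2; negative: C on B.B1 = C.C1 or
# D on A.A2 = D.D1) mirrors the SQL example above, but the data is made up.

A = [{"A1": 1, "A2": 9}, {"A1": 2, "A2": 5}]
B = [{"B1": 7, "B2": 1}, {"B1": 3, "B2": 2}]
C = [{"C1": 7}]
D = [{"D1": 4}]

# E+: all (a, b) pairs satisfying the positive join condition.
e_plus = [(a, b) for a in A for b in B if a["A1"] == b["B2"]]

# E-: members of E+ for which some negative condition element also matches.
e_minus = [(a, b) for (a, b) in e_plus
           if any(b["B1"] == c["C1"] for c in C)
           or any(a["A2"] == d["D1"] for d in D)]

# Bindings satisfying the whole LHS: E+ MINUS E-.
bindings = [t for t in e_plus if t not in e_minus]
print(bindings)   # only the pair built from A1=2 and B2=2 survives here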

3. The Tuple-Oriented Rule Incremental Matcher (TRIM)

Based on the observations of the previous section, we propose to implement the working memory as a relational database. Evaluating an LHS is now equivalent to finding the set of tuples satisfying the resultant nonprocedural query. For simplicity, the discussion of TRIM will focus on evaluating the LHS of a single rule; however, sharing of common subexpressions among rules (as in Rete) is easily incorporated, as shown later. The crux of the matching problem is to determine all sets of tuples that satisfy the LHS of a rule. Specifically, all join bindings that match the LHS of a rule yield its set of bindings. The following two subproblems have to be addressed: (1) matching of simple (selection) conditions, each on a single class, and (2) obtaining the bindings for join conditions across classes. Rules with no variables do not require join operations to determine the instantiations. Problem 1 is easy and is solved by providing a filtering mechanism for such rules. Problem 2 is NP-hard [PERL90]. It is well known in both the AI [FORG82] and database communities that an efficient matching algorithm should test simple conditions before join conditions. Thus, TRIM also performs the filtering first. For the purpose of consistency in future examples, the following LHS will be used for the rest of the paper:

(R1 (A ↑A1 <x> ↑A2 a ↑A3 <z>)
    (B ↑B1 <x> ↑B2 <y> ↑B3 b)
    (C ↑C1 c ↑C2 <y> ↑C3 <z>))

3.1. Data Structures for TRIM
The RDESC-TBL is the filter for the simple conditions of rules, as shown in Fig. 2. The RID field of the RDESC-TBL holds a rule's unique ID. Simple tests that do not involve joins, e.g., A.A1=a, are stored in the Simple Tests field. A rule's type, i.e., simple or complex, is reflected in the R-Type field of the RDESC-TBL, while a link to the corresponding RBIND-TBL, for any join tests, is stored in the Link field.
The RDESC-TBL has one row per rule, with fields RID, Simple Tests, R-Type, and Link. For the running example, rule R1 has Simple Tests A2=a; B3=b; C1=c, R-Type Complex, and Link RBIND-TBL[R1]; a rule whose only condition is the simple test A1=a has R-Type Simple and Link none; rules R2, ..., Rn are filled in analogously.

Fig. 2. Structure of RDESC-TBL.
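As an illustration only (the dictionary layout and the encoding of simple tests as attribute/constant pairs are our assumptions, not the paper's storage format), the RDESC-TBL filter stage might be pictured as follows:

# Hypothetical in-memory picture of RDESC-TBL: one entry per rule.
# Simple tests are stored as (attribute, constant) pairs and applied
# before any join processing, exactly as the filtering step described above.

RDESC_TBL = {
    "R1": {"simple_tests": {"A": [("A2", "a")], "B": [("B3", "b")], "C": [("C1", "c")]},
           "r_type": "complex", "link": "RBIND-TBL[R1]"},
    "R2": {"simple_tests": {"A": [("A1", "a")]},
           "r_type": "simple", "link": None},
}

def passes_simple_tests(rule_id, wm_class, tuple_):
    """Return True if tuple_ of wm_class satisfies the rule's simple tests."""
    tests = RDESC_TBL[rule_id]["simple_tests"].get(wm_class, [])
    return all(tuple_.get(attr) == const for attr, const in tests)

print(passes_simple_tests("R1", "A", {"A1": 4, "A2": "a", "A3": 8}))  # True
print(passes_simple_tests("R1", "B", {"B1": 4, "B2": 7, "B3": "x"}))  # False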


If a rule involves join tests, the next stage of processing is done using the RBIND-TBL, shown in Fig. 3. The Activate column reflects the existing bindings that satisfy the join conditions, with one bit for each LHS join variable. A one in the i-th bit position of the Activate column means a successful match of the corresponding join variable. The LHS is fully satisfied when the Activate column achieves full dimensionality, i.e., all bit positions are ones. The set of bindings for the rule is obtained from the respective variables in the activated row. Figs. 3a and 3b show the structure of the RBIND-TBL for rule R1, with example table entries for two inserted tuples. The incremental match algorithm based on the RBIND-TBL is described in detail in the next section. As an example, consider the RBIND-TBL[R1] shown in Fig. 3a. The columns of the RBIND-TBL are determined by observing that A and B join on the variable <x>, B and C on <y>, and A and C on <z>. The Activate column puts the rule in the conflict set if all bits are set to one. Fig. 3b shows the resulting RBIND-TBL after the insertion of tuples B(4,7,b) and C(c,7,3) into an empty WM.
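The Activate bookkeeping can be sketched as a small bit-vector check (our illustration; the bit assignment and helper functions are assumptions, and only the two join steps of the running example are shown):

# Sketch (ours) of Activate bookkeeping for rule R1, whose three join
# variables are <x> (A1 = B1), <y> (B2 = C2) and <z> (A3 = C3).
# Each successful join test sets the corresponding bit; when all bits are
# set, the composite row is a complete instantiation.

X_BIT, Y_BIT, Z_BIT = 0b001, 0b010, 0b100
FULL = X_BIT | Y_BIT | Z_BIT

def join_ab(a, b):
    """A and B join on <x>; on success the composite carries bit X set."""
    if a["A1"] == b["B1"]:
        return {"x": a["A1"], "y": b["B2"], "z": a["A3"], "activate": X_BIT}
    return None

def join_abc(ab, c):
    """AB and C join on <y> and <z>; success sets the remaining bits."""
    if ab["y"] == c["C2"] and ab["z"] == c["C3"]:
        return dict(ab, activate=ab["activate"] | Y_BIT | Z_BIT)
    return None

ab = join_ab({"A1": 4, "A2": "a", "A3": 8}, {"B1": 4, "B2": 7, "B3": "b"})
abc = join_abc(ab, {"C1": "c", "C2": 7, "C3": 8})
print(abc["activate"] == FULL)   # True: R1's instantiation enters the conflict set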

Fig. 3a. Structure of RBIND-TBL[R1]: a column (A1,B1) for variables satisfying <x>, a column (B2,C2) for variables satisfying <y>, a column (A3,C3) for variables satisfying <z>, and the Activate column with one bit per join column.

Fig. 3b. RBIND-TBL entries for R1, shown for the insertion of tuples B(4,7,b) and C(c,7,3).

Support for multiple rules requires TRIM to have knowledge of which rules are affected by an update to the WM. An update to a class may affect more than one rule. The dependence between WM classes and affected rules is maintained in the CONDRULE-TBL. This table is accessed when a WM element (tuple) is updated. Associated with each WM class is a list of all rules that reference it in their LHSs. Fig. 4 illustrates the structure of the CONDRULE-TBL.

Fig. 4. CONDRULE-TBL for Rules in the Knowledge Base: each condition element (CE) class is associated with the rules it affects, e.g., class A with R1, R3, R4, R5, R16 and class B with R3, R5, R17; an Ins-Tuple entry holds the tuples produced by the most recently completed execution.

The order of consultation of the three structures described above is as follows. All rules affected by an arriving update tuple are first determined by consulting the CONDRULE-TBL. The rows of the RDESC-TBL corresponding to each of the affected rules are checked to filter the tuple through the respective simple conditions. For complex rules, the tuple's bindings are then directed to the respective RBIND-TBLs to continue the matching process. Every rule for which a new entry with all ones in the Activate column of its RBIND-TBL is created has been made true with the appropriate bindings. This instantiation is then put in the conflict set.

3.2. Update Driven Incremental Maintenance (Matching)

The RHS execution of a selected production updates the WM, causing insertion, deletion or modification of WM elements. Changes to the WM are propagated through the various matching data structures. First, the CONDRULE-TBL is used to determine the rules affected by an update. Next, the tuples are checked against the simple conditions of the relevant rules in the RDESC-TBL. The matching stops if the tests are unsuccessful. For a rule with no join conditions, matching is successful if the conditions in the RDESC-TBL are satisfied. If a tuple affects a rule with join conditions, and it passes the simple tests in the RDESC-TBL, it is propagated to the appropriate RBIND-TBL for further matching. For a newly inserted (deleted) WM tuple, the RBIND-TBL matching has to determine the new instantiations that may be created.¹ Analogously, the creation and deletion of instantiations has to be determined for deleted and modified WM tuples. The newly created (deleted) instantiations are added to (removed from) the conflict set.

¹ The existence of negative condition elements can also lead to existing instantiations being deleted.

Various orderings are possible for testing the join conditions of a rule, each having a different cost. The optimal order is determined during rule compilation, and the information is kept locally in the RBIND-TBL. This information is crucial for the match process, since suboptimal orderings can lead to much higher costs. For notational simplicity, let the optimal join order of rule Ri be denoted ≺i.
3.3. Tuple Updates

For all tuples added to (deleted from) the WM by an RHS execution, corresponding tuples with add and delete tags are propagated through the matching data structures. These are matched with the existing tuples in the specified join order. The outcome is reflected in the Activate column bit corresponding to the join variable. Fig. 5 gives a high-level description of the tuple update algorithm, which is the crux of TRIM.

Algorithm Tuple-Update(upd-type, T);
/* upd-type : insert or delete a tuple */
/* T : tuple to be updated */
begin
  check CONDRULE-TBL to determine all rules k that tuple T affects;
  for each rule k do
    if the tuple matches all of its simple conditions then
      case (upd-type) of
        insert:
          begin
            repeat
              if the tuple matches any pre-existing tuple in ≺k then
                generate composite binding(s) and insert them in RBIND-TBL;
                keep matching composite bindings with subsequent pre-existing tuples in ≺k
              endif
            until (rule fires) or (no match is found);
          end
        delete: same as the insert code, except that the tuple is removed from RBIND-TBL instead of being inserted;
      end /* case */
  endfor
end

Fig. 5. Algorithm for Tuple Updates.
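The following sketch (ours) mirrors the insert branch of the algorithm for the single rule R1 under the join order <<A,B>,C>: the incoming tuple is stored, matched against the partner class dictated by the join order, and every resulting composite is matched further until either a full-dimensional binding is produced or no match is found. The data layout is an illustrative simplification, not the paper's implementation.

# Incremental insertion for the running example (rule R1, join order <<A,B>,C>).
# State kept between updates: the base tuples seen so far plus composite AB
# tuples, i.e., a saved matching state in the spirit of RBIND-TBL.

state = {"A": [], "B": [], "C": [], "AB": []}
conflict_set = []                      # full-dimensional (x, y, z) bindings

def insert(klass, t):
    """Propagate the insertion of tuple t of class klass through the state."""
    state[klass].append(t)
    if klass == "A":                   # A joins with every stored B on <x>
        new_ab = [(t["A1"], b["B2"], t["A3"]) for b in state["B"] if t["A1"] == b["B1"]]
    elif klass == "B":                 # B joins with every stored A on <x>
        new_ab = [(a["A1"], t["B2"], a["A3"]) for a in state["A"] if a["A1"] == t["B1"]]
    else:                              # C joins with every stored AB on <y>, <z>
        for (x, y, z) in state["AB"]:
            if y == t["C2"] and z == t["C3"]:
                conflict_set.append((x, y, z))
        return
    for ab in new_ab:                  # each new composite is matched further, against C
        state["AB"].append(ab)
        for c in state["C"]:
            if ab[1] == c["C2"] and ab[2] == c["C3"]:
                conflict_set.append(ab)

# Arrival order from the Fig. 6 walkthrough below.
insert("B", {"B1": 4, "B2": 5, "B3": "b"})
insert("C", {"C1": "c", "C2": 7, "C3": 8})
insert("A", {"A1": 4, "A2": "a", "A3": 8})
insert("B", {"B1": 4, "B2": 7, "B3": "b"})
print(conflict_set)                    # [(4, 7, 8)] after the fourth insertion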

An incoming tuple is first inserted into the RBIND-TBL, since it will be required by future matchings. Next, it is matched with some of the existing tuples² to determine satisfying bindings. If no match is found, the process stops. If a match is encountered, however, a composite tuple made from the two matching tuples is created and inserted into the table to reflect this match. Each new composite tuple is then handled in the same way as an incoming tuple. For example, if an incoming B tuple matches a pre-existing A tuple, a new tuple AB is created. AB is then inserted into the RBIND-TBL. Since ≺ = <<A,B>, C>, AB is matched with all pre-existing C tuples in the RBIND-TBL. If there is a match, another composite (ABC) tuple is created and inserted into the RBIND-TBL. For the tuples defined in ≺, a composite tuple of full dimensionality is the ABC tuple, since it is a composite of all the classes appearing in ≺. A full dimensionality composite tuple yields the required set of bindings, and the corresponding rule instantiation is then put into the conflict set. Fig. 6 illustrates an example of a succession of tuple insertions. An extra column called Match Tuple is included for informational purposes only. Assume that the order of tuple arrival is as follows: B(4,5,b), C(c,7,8), A(4,a,8), B(4,7,b), C(c,5,8), B(4,6,b) and C(c,6,8). The first two tuples inserted into the RBIND-TBL do not match any tuple, and there is no further action. The next tuple, A(4,a,8), matches the existing B tuple, creating a composite tuple AB(4,5,8), which is inserted into the table to reflect this new match and its corresponding binding. As a result of this matching, the first bit of the Activate column is toggled to one. Since AB(4,5,8) does not match the only pre-existing C tuple, C(c,7,8), no further action is needed. When the fourth tuple, B(4,7,b), arrives, it is inserted in the table, and the composite tuple AB(4,7,8) is created due to a match; the first bit of the Activate column is toggled to one, since only one variable matches. When all bits of the Activate column are set, a full match is found in the corresponding composite tuple. This situation arises when the AB(4,7,8) tuple is next matched with C(c,7,8), yielding a full dimensioned composite tuple ABC(4,7,8), which results in the rule's instantiation entering the conflict set. Similarly, the arrivals of tuples C(c,5,8), B(4,6,b) and C(c,6,8) activate the rule with bindings (4,5,8) and (4,6,8).

² An incoming tuple is matched only with the tuples in the RBIND-TBL specified by the join order ≺i. In the running example, an A tuple matches only a B tuple (and vice versa), while a C tuple matches only an AB tuple (and vice versa).


Fig. 6. Example of the Tuple Insertion Algorithm: the RBIND-TBL of rule R1 after the successive insertion of B(4,5,b), C(c,7,8), A(4,a,8), B(4,7,b), C(c,5,8), B(4,6,b) and C(c,6,8), with a Match Tuple column recording, for each insertion, which pre-existing tuple (if any) it matched.
Fig. 7. Example of a Tuple Deletion: the RBIND-TBL of rule R1 after the deletion of B(4,7,b); the deleted tuple and the composite tuples it participates in are removed, and the instantiation ABC(4,7,8) is removed from the conflict set (CS).


TRIM handles the deletion of a tuple from the WM by removing the tuple, and all composite tuples of which it is a part, from the RBIND-TBL. The details are similar to tuple insertion, except that the occurrence of a match corresponds to the removal of a tuple and the clearing of the corresponding Activate bit. If the tuple was responsible for adding rule instantiations to the conflict set, these instantiations are removed. Fig. 7 illustrates an example of the tuple deletion algorithm. The deleted tuple, B(4,7,b), and its matching composite tuples are italicized. The instantiation ABC(4,7,8) is removed from the conflict set.
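Deletion can be pictured as a cascading removal, as in the following sketch (ours; tagging every stored entry with the identifiers of the base tuples it derives from is an assumed bookkeeping device, chosen only to make the cascade explicit):

# Deletion as cascading removal: each entry carries the set of base tuple
# ids it was derived from; deleting id 'b2' removes B(4,7,b), the composite
# AB(4,7,8) and the full composite ABC(4,7,8), whose instantiation is also
# retracted from the conflict set.

rbind_tbl = [
    {"ids": {"b1"}, "binding": ("B", 4, 5)},
    {"ids": {"b2"}, "binding": ("B", 4, 7)},
    {"ids": {"a1", "b2"}, "binding": ("AB", 4, 7, 8)},
    {"ids": {"a1", "b2", "c1"}, "binding": ("ABC", 4, 7, 8)},
]
conflict_set = [("R1", (4, 7, 8))]

def delete(base_id):
    """Remove base_id and every composite built from it; retract instantiations."""
    global rbind_tbl, conflict_set
    removed = [e for e in rbind_tbl if base_id in e["ids"]]
    rbind_tbl = [e for e in rbind_tbl if base_id not in e["ids"]]
    retracted = {e["binding"][1:] for e in removed if e["binding"][0] == "ABC"}
    conflict_set = [inst for inst in conflict_set if inst[1] not in retracted]

delete("b2")
print(len(rbind_tbl), conflict_set)    # 1 entry left, conflict set is empty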
4. Implementation Details and Cost Analysis

The TRIM algorithm described so far is only conceptual, since the direct implementation of the RBIND-TBL as a single table would be inefficient. The RBIND-TBL is scanned every time a new tuple is inserted, and again on the insertion of corresponding composite tuples, if any. This repetitive scanning of the RBIND-TBL is time-consuming. Join indexes, view materialization and hybrid hash join [BLAK90] are examples of techniques that can be used to implement TRIM. Optimizations such as (i) partitioning the tables using hashing, etc., with partitions kept in one or separate files, (ii) indexes on tables/partitions, (iii) pattern subsumption, and (iv) removing RBIND-TBL redundancy (already removed in our implementation) are examples of additional ways to improve on TRIM. We next describe the implementation of TRIM that employs optimizations (i) and (iv) with view materialization. The RBIND-TBL stores a history of tuples that have been inserted or deleted in the course of the execution of the system. Storing all tuples in a single structure would require more than one scan each time a tuple is inserted or deleted, since tuples with partial matches create new composite tuples that must be further matched with corresponding tuples in the join order ≺. In the worst case, the insertion (deletion) of a WM tuple can result in a scan of the RBIND-TBL for each join condition in the rule. This, however, is wasteful, since the aim in each scan is to perform matching with only a subset of the RBIND-TBL tuples, i.e., those specified by the join order ≺. One way to eliminate this excessive cost is to provide an indexing scheme for all the tuples in the RBIND-TBL based on A, B, C, AB, ABC, etc., as keys. This provides a savings in time at the expense of storing and maintaining the index structure. A better solution,

and the one adopted in TRIM, is to partition the RBIND-TBL into subtables according to these keys, as shown in Fig. 8. This reduces the work required for propagating the effects of a WM update to exactly one scan of each partition of the RBIND-TBL. Another optimization is the caching, in main memory, of the composite tuples created on successful matches. As shown in Fig. 8, if an A tuple is inserted into the RBIND-TBL, its comparison with the B partition creates composite tuples for all successful matches. The set of new composite tuples, ΔAB, is cached and at the same time inserted into the AB table. The cached ΔAB tuples are then matched with the C table to create new composite tuples. These in turn are cached (the dashed box ΔABC), to be subsequently matched with other tables down the line, if any. The new bindings in the cache are also inserted into the ABC table.

Fig. 8. Partitioned Configuration of the RBIND-TBL.
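The partition-plus-delta organization can be sketched as follows (our illustration; the in-memory lists stand in for the partition files, and the names of the partitions and fields are assumptions):

# Partitioned RBIND-TBL for join order <<A,B>,C>: one partition per base
# class and one per composite level, scanned once each per update.

partitions = {"A": [], "B": [{"B1": 4, "B2": 7}], "C": [{"C2": 7, "C3": 8}],
              "AB": [], "ABC": []}

def insert_a(a):
    """Insert an A tuple and propagate it through the partitions via deltas."""
    partitions["A"].append(a)
    # One scan of the B partition produces the delta of new AB composites.
    delta_ab = [{"x": a["A1"], "y": b["B2"], "z": a["A3"]}
                for b in partitions["B"] if a["A1"] == b["B1"]]
    partitions["AB"].extend(delta_ab)          # materialize the new composites
    # The cached delta (not the whole AB partition) is matched against C.
    delta_abc = [ab for ab in delta_ab
                 for c in partitions["C"]
                 if ab["y"] == c["C2"] and ab["z"] == c["C3"]]
    partitions["ABC"].extend(delta_abc)
    return delta_abc                           # new full instantiations, if any

print(insert_a({"A1": 4, "A3": 8}))            # [{'x': 4, 'y': 7, 'z': 8}]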
4.1. Storage Analysis

The physical storage required for the partitioned RBIND-TBLs is important, since the scheme would be infeasible if this overhead were excessive. A quantitative analysis of the storage requirements is given next. Let a rule R be defined in terms of classes R1, R2, ..., Rk. Without loss of generality, let ≺ = <<<R1, R2>, R3>, ..., Rk> be the optimal join order. The RBIND-TBL is stored as a (2k−1)-way partitioned table. The i-th partition, 1 ≤ i ≤ k, corresponds to the relation σi(Ri), while the (k+i−1)-th partition, 2 ≤ i ≤ k, corresponds to the join of relations R1, R2, ..., Ri. We

define the following:

θ(12...i−1)i, 2 ≤ i ≤ k : join selectivity of the join test between Ri and (R1, ..., Ri−1);
σi, 1 ≤ i ≤ k : selectivity of the simple tests on Ri;
Ni, 1 ≤ i ≤ k : size of Ri in tuples;
B : disk block size;
Si, 1 ≤ i ≤ k : sizes of the selection partitions;
Si, k+1 ≤ i ≤ 2k−1 : sizes of the join partitions.

The sizes of the selection partitions are

Si = σi Ni, for 1 ≤ i ≤ k,

and the sizes of the join partitions are

S(k+i−1) = [θ12 · θ(12)3 · ... · θ(12...i−1)i] · [σ1N1 · σ2N2 · ... · σiNi], for 2 ≤ i ≤ k,

i.e., each join partition is obtained from the previous one by applying the next join selectivity and the next selection partition size.

Example 1: Assume that relations R1, R2 and R3 have the following characteristics: σ1 = σ2 = σ3 = 1/100, θ12 = 1/1000, θ(12)3 = 1/100, N1 = N2 = N3 = 20,000 records, and B = 100 records per block. Then

σ1N1 = σ2N2 = σ3N3 = 200 records,
θ12 · σ1N1 · σ2N2 = (1/1000) · 200 · 200 = 40 records,
θ(12)3 · θ12 · σ1N1 · σ2N2 · σ3N3 = (1/100) · 40 · 200 = 80 records,

Storage = 600 + 40 + 80 = 720 records ≈ 8 blocks.

Fig. 9 shows the optimal join order of the three relations, ≺ = <<R1, R2>, R3>.

Fig. 9. Join order of R1, R2 and R3.

In contrast, if the σ's were 1/3 and the θ's were 1/10, the storage would increase to 30 × 10^6 blocks. From Example 1 it is evident that the selectivities will have to be much higher, i.e., the σ's and θ's much lower, for this scheme to be practical. We now analyze what the test selectivities, i.e., the σ's and θ's, are expected to be for OPS5 rules. Consider the following LHS:

(A ↑A1 'a1' ↑A2 'a2' ↑A3 <x>)
(B ↑B1 <x> ↑B2 'b2' ↑B3 'b3')

Let ki be the cardinality of the domain of attribute Ai, 1 ≤ i ≤ 3. Assuming each value of an attribute to be equiprobable, and different attributes of a tuple to be uncorrelated, the probability that an A tuple satisfies both simple conditions, i.e., A1='a1' and A2='a2', is 1/(k1·k2); hence σ1 = 1/(k1·k2), and analogously σ2 is the inverse of the product of the domain sizes of B2 and B3. Attributes A3 and B1 are drawn from the same domain, of size k3 say. Hence θ12 = 1/k3. If an application involves primarily equijoins and the key domain, i.e., the join attribute domain, is large, the join selectivity will be low, which makes our scheme attractive. For example, the storage for the join of R1 and R2 is θ12 · σ1N1 · σ2N2 = σ1 σ2 N1 N2 / k3. If k3 ≥ O(Ni), then our scheme will perform efficiently. Since domain sizes are usually larger than relation sizes, the scheme is realistic.
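The following sketch (ours) simply evaluates the partition-size formulas for the parameters of Example 1; the cascading computation of the join partition sizes is an assumption consistent with the figures quoted above:

# Partition sizes for Example 1: sigma = 1/100, theta_12 = 1/1000,
# theta_(12)3 = 1/100, N = 20,000 tuples each, B = 100 tuples per block.
import math

def partition_sizes(sigmas, thetas, sizes):
    """Selection partitions sigma_i*N_i, then cascading join partitions."""
    sel = [s * n for s, n in zip(sigmas, sizes)]
    joins = []
    acc = sel[0]
    for theta, sn in zip(thetas, sel[1:]):
        acc = theta * acc * sn          # next join partition = theta * previous level * sigma_i N_i
        joins.append(acc)
    return sel, joins

sel, joins = partition_sizes([1/100] * 3, [1/1000, 1/100], [20000] * 3)
total_tuples = sum(sel) + sum(joins)
print(sel, joins)                                     # [200.0, 200.0, 200.0] [40.0, 80.0]
print(total_tuples, math.ceil(total_tuples / 100))    # 720.0 tuples, 8 blocks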
4.2. Evaluation Cost Analysis

Since I/O is the dominant cost in a disk-based system, we consider disk accesses as the cost unit. The number of disk accesses required to read a file depends on its size in blocks. Suppose a rule firing updates class Ri. The only partitions that need to be examined are {1, 2, ..., 2k−1} for i = 1, and {i, i+1, ..., k} together with {k+i−2, ..., 2k−1} for 2 ≤ i ≤ k. All tuples in each such partition have to be examined, and all blocks of the affected partitions have to be read in and written out once each. Thus, the cost Ci (in number of disk I/Os) of updating the RBIND-TBL as a result of a rule firing is twice the total number of blocks in the affected partitions, i.e., Ci = 2 · Σ ⌈Sj/B⌉, where the sum ranges over the affected partitions j. The probability of an insertion into Ri, assuming uniform insertion, is 1/k. The expected insertion cost of a tuple is therefore

E(C) = (1/k) Σ(i=1..k) Ci.

Example 2: With reference to Example 1, the sizes of R1, R2 and R3 are 2 blocks each, and those of R1R2 and R1R2R3 are 1 block each. Hence the Ci and the expected number of disk I/Os, E(C), are

C1 = 2 × 8 = 16, C2 = 2 × 8 = 16, C3 = 2 × 4 = 8,
E(C) = (16 + 16 + 8)/3 = 40/3 ≈ 13.33 I/Os.
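The cost figures of Example 2 can be reproduced with the following sketch (ours); the sets of partitions assumed to be affected by an update to each class are chosen to match the Ci values above and are an assumption, not a statement of the paper's exact formula:

# Expected RBIND-TBL update cost for Example 2. Partition sizes in blocks:
# R1, R2, R3 = 2 each; R1R2 = R1R2R3 = 1. Every affected partition is read
# and written once, hence the factor of 2.

blocks = {"R1": 2, "R2": 2, "R3": 2, "R1R2": 1, "R1R2R3": 1}
affected = {
    "R1": ["R1", "R2", "R1R2", "R3", "R1R2R3"],   # all 2k-1 partitions
    "R2": ["R1", "R2", "R1R2", "R3", "R1R2R3"],   # all 2k-1 partitions
    "R3": ["R3", "R1R2", "R1R2R3"],               # only the tail of the join order
}

costs = {r: 2 * sum(blocks[p] for p in parts) for r, parts in affected.items()}
expected = sum(costs.values()) / len(costs)        # uniform updates over k classes
print(costs)                                       # {'R1': 16, 'R2': 16, 'R3': 8}
print(round(expected, 2))                          # 13.33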


4.3. Sharing Overheads Across Multiple Rules

Multiple rules often share common condition elements [FORG81]. Hence, techniques from multiple query optimization [SELL90] can be employed to optimize the join orders for multiple rules. Constructing a global matching structure may then be optimal for the entire rule set. Consider a rule R1 involving relations A, B and C, and a rule R2 involving B, C and D. Fig. 10 illustrates a global structure that is optimal for both together, although it may not be individually optimal for either.

Fig. 10. Sharing of Common Joins.

Constructing a globally optimal plan for positive condition elements only, with a sequence of join tests, has been studied in the AI and DB literature. The presence of negative condition elements corresponds to optimizing expressions having both join and MINUS operators. This has not been addressed in the DB literature and is presently being investigated.
5. Experimental Performance Evaluation

The TRIM algorithm was implemented on a SUN/2 running Unix. The partitions of the RBIND-TBL were stored in separate files. The system was run in single-user mode and the elapsed time was noted. Rules with one, two and three join tests were used for the evaluation. The WM class (database relation) sizes chosen were 5K, 10K and 20K tuples. The classes were generated with join attributes having a uniform distribution. Insertion and deletion of tuples was assumed to occur with equal probability on any of the classes. Consider the TRIM data structure shown in Fig. 8.

The insert and delete tuple operations occur on any of the classes A, B or C with equal probability. The selectivity of the simple tests was chosen to be 0.15, and that of the join tests as 0.00012. These values were chosen in accordance with the observations in Section 4.1. A number of runs of the tuple insertion/deletion algorithm were made and the average elapsed time for updating the RBIND-TBL was determined. Space usage was calculated by obtaining the size in bytes of the WM classes and of the RBIND-TBL, i.e., the sum of all its partition sizes.
Tables 2, 3 and 4 show the results. Table 2 shows the average time taken to update an RBIND-TBL after a rule firing; it indicates that the time varies linearly with the size of the relations. Table 3 shows the disk space required to store the RBIND-TBL; the extra space required grows linearly with the size of the base relations. Table 4 shows the storage required for the RBIND-TBL as a percentage of the space for the WM classes; the percentage remains nearly constant with respect to the size of the relations, and increases almost linearly with the number of joins.

In order to interpret the results, let us look at the time required for the maintenance of a rule with a join defined over two WM classes, each of size 20K tuples. From Table 3, the RBIND-TBL would have a size of 36459 bytes. In the following analysis the cost of in-memory computation is ignored. In the worst case, TRIM reads and writes the whole RBIND-TBL, and since the disk block size is 512 bytes, a total of 80 blocks are read and written, i.e., 160 disk I/Os are performed. Since the elapsed time is 9.6 seconds, the average measured time to access a block was 60 msec. Such a high value for each block access is observed because of operating system overheads, including system calls and the overheads of the C functions used to read and write files, which are avoidable if TRIM is a part of the data manager. Thus, the actual cost will be much less than 9.6 seconds.

Even though this pessimistic estimate of TRIM's behavior is not encouraging, let us compare it with the time taken for doing a join on the same WM classes using some algorithm based on re-evaluation. Let R1 and R2 be the WM classes, with 48-byte tuples. The size of each relation is then (20 × 2^10 × 48)/512 = 1920 blocks. Assuming a nested-loop join with 1921 blocks of buffer space, so that the inner relation fits in main memory, the join cost is 1920 + 1920 = 3840 block reads. With the measured value of 60 msec per I/O operation, the time taken for the matching would be (60/1000) × 3840 = 230.4 secs. These estimates are extremely pessimistic, since the implementation was done on top of a file system; the noticeable thing, however, is the order-of-magnitude difference between the performance of the two algorithms. Also, in order for the join to be performed with each block being read only once, 1921 blocks, i.e., about 1 MB, of buffer space is required. Assuming no buffers, the cost of joining R1 and R2 is 1920 × 1921 × 0.06 seconds ≈ 221,299 seconds.

Thus, it is clear that even in ideal circumstances, i.e., with complete buffering, re-evaluation is about 24 times more expensive than TRIM. For the example given, TRIM uses only 1.86% extra disk space. Thus, by using a little additional space, match evaluation time has been reduced by a substantial amount. In fact, assuming a disk access to cost 20 msec³, the match time for TRIM can be reduced to about 3 seconds.

³ This is typically the case if TRIM is implemented inside the data manager.
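The arithmetic behind this comparison can be retraced directly; the following sketch (ours) just recomputes the quoted figures from the measured 60 msec per block access:

# Retracing the comparison: TRIM worst-case maintenance vs. join re-evaluation.
IO_MS = 60                                  # measured time per block access (msec)

trim_secs = 160 * IO_MS / 1000              # 80 blocks read + written -> 9.6 s

blocks_per_class = 20 * 2**10 * 48 // 512   # 20K tuples of 48 bytes -> 1920 blocks
buffered_secs = 2 * blocks_per_class * IO_MS / 1000     # nested loop, inner cached
unbuffered_secs = blocks_per_class * (blocks_per_class + 1) * 0.06

print(trim_secs, buffered_secs, round(buffered_secs / trim_secs))  # 9.6 230.4 24
print(round(unbuffered_secs))                                      # 221299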
Table 2. Time Taken for Various Joins: average elapsed time to update the RBIND-TBL after a rule firing, for rules with 1, 2 and 3 joins and WM class sizes of 5K, 10K and 20K tuples.

Table 3. Disk Space Needed for the RBIND-TBL (in bytes).

Each WM Class Size (in tuples) | 1 Join | 2 Joins | 3 Joins
5K                             |  9089  |  21504  |  37935
10K                            | 18297  |  43244  |  77035
20K                            | 36459  |  88210  | 156976

Table 4. Disk Space Usage of the RBIND-TBL as a Percentage of WM Class Space: for one join, 1.849 (5K), 1.862 (10K) and 1.854 (20K).
6. Conclusions

In this paper we presented a new matching algorithm for evaluating the variable bindings satisfying the LHSs of production rules in a secondary memory database environment. The algorithm carries out matching in an incremental manner and is state saving. Both the working memory database and the matching state information are large enough so as not to fit in main memory. The cost of matching can then escalate so much that careful implementation becomes crucial. We presented the ideas of decomposed storage of the matching state and careful buffering of disk blocks to reduce the matching time. A cost analysis of the algorithm was presented, both in terms of the matching evaluation time and of the amount of storage required for saving state. The algorithm was implemented in a UNIX file system environment and
performance studies were carried out. Results show that substantial savings in matching cost are obtained with little space overhead for saving state. As expected, matching becomes computationally intensive in a secondary memory environment, and efficient algorithms are a must for successful integration of production systems and databases.

7. References

[BLAK90] J. Blakeley and N. Martin, "Join Index, Materialized View, and Hybrid-Hash Join: A Performance Analysis", Proc. 6th IEEE Int'l Conf. on Data Engineering, Los Angeles, CA, February 1990, pp. 256-262.

[DAYA88] U. Dayal, "Active Database Management Systems", Proc. 3rd Int'l Conf. on Data and Knowledge Management, Jerusalem, Israel, June 1988.

[DELC88] L. M. L. Delcambre and J. N. Etheredge, "The Relational Production Language: A Production Language for Relational Databases", Proc. 2nd Int'l Conf. on Expert Database Systems, 1988, pp. 153-162.

[FORG81] C. L. Forgy, "OPS5 User's Manual", Tech. Report CMU-CS-81-135, Carnegie-Mellon University, 1981.

[FORG82] C. L. Forgy, "Rete: A Fast Algorithm for the Many Pattern/Many Object Pattern Match Problem", Artificial Intelligence 19, pp. 17-37, 1982.

[GUPT88] A. Gupta, et al., "Parallel OPS5 on the Encore Multimax", Proc. Int'l Conf. on Parallel Processing, pp. 271-280, 1988.

[MIRA84] D. P. Miranker, "Performance Estimates for the DADO Machine: A Comparison of TREAT and RETE", Proc. Int'l Conf. on Fifth Generation Computer Systems, ICOT, pp. 449-457, 1984.

[MIRA87] D. P. Miranker, "TREAT: A Better Match Algorithm for AI Production Systems", Proc. AAAI-87, Sixth Nat'l Conf. on AI, Seattle, WA, July 1987.

[MIRA90] D. P. Miranker, "An Algorithmic Basis for Integrating Production Systems and Large Databases", Proc. 6th IEEE Int'l Conf. on Data Engineering, February 1990.

[NAYA88] P. Nayak, A. Gupta and P. Rosenbloom, "Comparison of the Rete and Treat Production Matchers for Soar (A Summary)", Proc. AAAI-88, Seventh Nat'l Conf. on AI, St. Paul, MN, August 1988.

[PERL89] M. Perlin and J.-M. Debaud, "Match Box: Fine-Grained Parallelism at the Match Level", Proc. First IEEE Int'l Conf. on Tools for Artificial Intelligence, Fairfax, VA, October 1989, pp. 428-434.

[RISC89] T. Risch, "Monitoring Database Objects", Proc. Fifteenth Int'l Conf. on Very Large Data Bases (VLDB), Amsterdam, The Netherlands, August 1989.

[SELL88] T. Sellis, et al., "Implementing Large Production Systems in a DBMS Environment", Proc. ACM-SIGMOD Conf. on Management of Data, 1988.

[SELL90] T. Sellis and S. Ghosh, "On the Multiple Query Optimization Problem", IEEE Trans. on Knowledge and Data Engineering, Vol. 2, No. 2, June 1990.

[SRIV90] J. Srivastava, K. W. Hwang and J. S. E. Tan, "Parallelism in Database Production Systems", Proc. 6th IEEE Int'l Conf. on Data Engineering, Los Angeles, CA, February 1990.

[STON90] M. Stonebraker, et al., "The Implementation of POSTGRES", IEEE Trans. on Knowledge and Data Engineering, Vol. 2, No. 1, March 1990.

