
Current Research in Databases and Information Systems WS11/12 Seminar

Lisette E. Espin Noboa Mat. #2540287

Accelerating Queries with Group-By and Join by Groupjoin

Motivation

This topic concerns query optimization, specifically for aggregation queries that contain both group-by and join operators. The objective of the optimization is to improve the performance of database management systems (DBMS) by minimizing execution time and the use of computing resources.

Keywords: query optimization, group-by, join, left outer join, DBMS.

Problem (insights)

The problem with this kind of query, which contains both a group-by and a join operator, is that evaluating it is time-consuming.

Related Work

There is a considerable body of related work on this topic, since the groupjoin is about 22 years old. The father of this operator is von Bültzingsloewen [2], who invented the groupjoin under the name outer aggregation. He addressed the optimization problem of translating an SQL query into an efficient parallel execution plan, in order to reduce response times, by generating algebraic expressions. One of these expressions is the outer aggregation, which attaches aggregate function values to each tuple and is equivalent to an outer join followed by a standard aggregation.

Other researchers have proposed this operator as well. Nakano [3] presented a method for translating, with optimization, from relational calculus to a relational algebra having aggregate functions; he focused on the query translation and query optimization processes. Akinde, Chatziantoniou, Johnson and Kim [4] proposed the MD-join operator for complex OLAP queries, which separates the group definition from the aggregate computation. Its implementation is simple and optimizable, which makes it preferable to its equivalent relational algebra expression.

Solution

Query optimization selects a good query plan for a given query. Before making this selection, the optimizer considers multiple candidate plans until it finds the best strategy. To find the best plan for a query that contains the operator sequence (join, group-by), Moerkotte and Neumann [1] propose merging these two operators into a single operator, the groupjoin, in order to reduce query execution time. This merging rests on two equivalences, which the authors have evaluated experimentally:

1. LEFT OUTER JOIN + GROUP BY ≡ GROUPJOIN
2. JOIN + GROUP BY ≡ GROUPJOIN

To prove these equivalences, we first rewrite them as algebraic expressions. In the formulas, $\mathrm{LOJ}$ denotes the left outer join, $\mathrm{GJ}$ the groupjoin, $\Gamma$ the group-by and $\Pi$ the projection.

1. $\Gamma_{G;F}\bigl(e_1 \;\mathrm{LOJ}_{J_1 = J_2}\; e_2\bigr) \equiv \Pi_C\bigl(e_1 \;\mathrm{GJ}_{J_1 = J_2;\,F}\; e_2\bigr)$

   Here $e_1$ and $e_2$ denote algebraic expressions, and $J_1 = J_2$ is a join predicate in which the join attributes $J_i$ are subsets of the attributes of $e_i$. $G$ is a set of grouping attributes, and $F$ is a splittable and decomposable aggregation vector. Finally, $C$ is the set of attributes occurring in the result. The equivalence holds under the following conditions:

   a. According to Yan and Larson [5], a grouping can be pushed into a regular join. In this context it means that no two tuples from $e_1$ belong to the same group.
   b. None of the functional dependencies is violated.
   c. There is a one-to-one matching between rows in the two groups, with matching rows having the same value for the columns of the left-hand relation.
   d. The aggregation vector yields the same result on the empty set as on a set containing only a null tuple. This holds for min, max, sum and count(a), but not for count(*), because count(*) is 0 if the input is the empty set and 1 if it is applied to some null tuple.

2. The group-by over the regular join, $\Gamma_{G;F}\bigl(e_1 \bowtie_{J_1 = J_2} e_2\bigr)$, can likewise be rewritten into a groupjoin, with the same notation and under conditions a, b and c of the previous equivalence.

The most important aspect of these two equivalences is their implementation, which consists of three steps:

1. Build a hash table on R.a. The hash table holds the attributes from R and the initialized aggregates from F (e.g. sum(e) = 0).
2. Probe the hash table with S.b, updating the aggregates for every tuple from S that matches a tuple from R.
3. Scan the hash table, finalize the aggregates, and push the result to the next operator.

In contrast, traditional plans (a join or left outer join followed by a group-by) can be represented as a cascade of two hash tables: the first is built and maintained to compute the join result (expensive), and the second is then filled to compute the result of the aggregation.
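The three implementation steps, and the two-hash-table cascade they replace, can be sketched in a few lines of Python. This is only an illustrative sketch and not the authors' implementation: the function names `groupjoin` and `outer_join_then_group`, the dictionary-based tuples, and the sample data are assumptions made for this example. It also assumes that R.a is a key of R (so no two tuples of R fall into the same group) and that an empty sum is reported as 0, following the sum(e) = 0 initialization mentioned above, whereas SQL would return NULL.

```python
from collections import defaultdict


def groupjoin(R, S):
    """Groupjoin of R and S on R.a = S.b, computing sum(e) and count(e).

    Assumes R.a is a key of R (hypothetical schema for illustration).
    """
    # Step 1: build a hash table on R.a holding initialized aggregates.
    table = {r["a"]: {"sum_e": 0, "count_e": 0} for r in R}
    # Step 2: probe with S.b and update the aggregates of matching entries.
    for s in S:
        entry = table.get(s["b"])
        if entry is not None:
            entry["sum_e"] += s["e"]
            entry["count_e"] += 1
    # Step 3: scan the hash table and emit one result tuple per R tuple.
    return sorted((a, agg["sum_e"], agg["count_e"]) for a, agg in table.items())


def outer_join_then_group(R, S):
    """Baseline: left outer join followed by group-by, using two hash tables."""
    # First hash table: used to compute the (materialized) join result.
    by_b = defaultdict(list)
    for s in S:
        by_b[s["b"]].append(s)
    joined = []
    for r in R:
        matches = by_b.get(r["a"])
        if matches:
            joined.extend((r["a"], s["e"]) for s in matches)
        else:
            joined.append((r["a"], None))  # null-padded tuple for unmatched r
    # Second hash table: used to compute the aggregation over the join result.
    groups = {}
    for a, e in joined:
        agg = groups.setdefault(a, [0, 0])
        if e is not None:  # sum(e) and count(e) ignore nulls
            agg[0] += e
            agg[1] += 1
    return sorted((a, total, count) for a, (total, count) in groups.items())


# Hypothetical sample data: R.a = 2 has no join partner, S.b = 4 has none either.
R = [{"a": 1}, {"a": 2}, {"a": 3}]
S = [{"b": 1, "e": 10}, {"b": 1, "e": 5}, {"b": 3, "e": 7}, {"b": 4, "e": 99}]

print(groupjoin(R, S))              # [(1, 15, 2), (2, 0, 0), (3, 7, 1)]
print(outer_join_then_group(R, S))  # same result
```

Both functions return the same result on this data, but the groupjoin never materializes the join result and maintains a single hash table. The sketch also hints at why count(*) is excluded by condition d: counting the null-padded tuple of the unmatched R.a = 2 in the baseline would yield 1, while the groupjoin's initialized counter stays at 0.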

Limitations of Solution

The limitations of this solution stem from violations of the conditions of the equivalences, for example when the group-by attributes and the join attributes differ.

Summary of Results/Experiments

Moerkotte and Neumann [1] tested their proposed groupjoin operator by applying the equivalences to three different queries. They built the query plan for each query and for each side of the equivalences. Figure 1 shows the results obtained in the experiments.


Figure 1: Performance Results

As Figure 1 shows, the execution time can be improved by more than a factor of 3. The plan without a groupjoin is more expensive than the plan with a groupjoin because the former joins lineitem with lineitem, and lineitem is by far the largest relation in the TPC-H benchmark. The first three queries show an impressive improvement in execution time when the groupjoin operator is used. The remaining queries do not benefit to the same degree, but note that the transformation is always beneficial.

REFERENCES

[1] G. Moerkotte and T. Neumann. Accelerating Queries with Group-By and Join by Groupjoin. PVLDB, 4(11):843-851, 2011.
[2] G. von Bültzingsloewen. Optimizing SQL queries for parallel execution.
[3] R. Nakano. Translation with optimization from relational calculus to relational algebra having aggregate functions. TODS, 15(4):518-557, 1990.
[4] D. Chatziantoniou, M. Akinde, T. Johnson, and S. Kim. The MD-Join: An Operator for Complex OLAP. In ICDE, pages 524-533, 2001.
[5] W. Yan and P.-A. Larson. Performing group-by before join. In ICDE, pages 89-100, 1994.
