Query Optimization

Suppose you were given the chance to visit 15 pre-selected cities in Europe, and the
only constraint was time.

• Would you visit the cities in an arbitrary order, or would you follow a plan?

Plan:

• Place the 15 cities in groups based on their proximity to each other.

• Start with one group and move on to the next group.

The important point here is that you would visit the cities in a more organized manner,
and the time constraint mentioned earlier would be dealt with efficiently.

Query Optimization works in a similar way:

There can be many different ways to compute the answer to a given query. The result is
the same in every case.

A DBMS strives to process the query in the most efficient way (in terms of time) to
produce the answer.

Cost = Time needed to get all answers

A DBMS often has a choice about the access path for retrieving data. For example, the
DBMS can use an index (fast lookup for specific entries) or scan the entire table to
retrieve the appropriate rows. In addition, in statements in which two tables are joined,
the DBMS can choose which table to examine first (join order) and how to join the tables
(join strategy). Optimization means that the DBMS tries to make the best (optimal) choice
of access paths, join order, and join strategy. True query optimization means that the
DBMS will usually make a good choice regardless of how the query is written. The
optimizer does not necessarily make the best choice, just a good one.
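
The difference between the two access paths can be sketched in a few lines of Python. This
is only an illustration under stated assumptions: the table contents, the dictionary used
as an "index", and the function names are hypothetical, and a real DBMS index is a disk-based
structure rather than an in-memory dictionary.

```python
# Hypothetical table of 300 rows: (row_id, state). Ten rows have state "FL".
rows = [(i, "FL" if i % 30 == 0 else "TX") for i in range(300)]

def full_scan(table, state):
    """Access path 1: examine every row and keep the matching ones."""
    examined = 0
    matches = []
    for row in table:
        examined += 1
        if row[1] == state:
            matches.append(row)
    return matches, examined

# Access path 2: an index mapping each state to the rows that contain it.
index = {}
for row in rows:
    index.setdefault(row[1], []).append(row)

def index_lookup(idx, state):
    """Go straight to the matching rows; only those rows are examined."""
    matches = idx.get(state, [])
    return matches, len(matches)

scan_matches, scan_examined = full_scan(rows, "FL")
index_matches, index_examined = index_lookup(index, "FL")
print(scan_examined, index_examined)
```

Both access paths return the same rows; they differ only in how many rows have to be
examined to find them, which is what the optimizer tries to estimate.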

2.5.1 Overview of SQL Processing

SQL processing uses the following four main components, or phases, to execute a SQL
query:

• Parsing: The parser transforms the high-level query into a relational algebra (RA)
query and checks that the query is syntactically and semantically correct.
• Optimization: The optimizer uses costing methods (cost-based optimizer, CBO) or
internal rules (rule-based optimizer, RBO) to determine the most efficient way
of producing the result of the query.
• Code generation: The row source generator receives the optimal plan from the
optimizer and outputs the execution plan for the SQL statement.
• Execution: The SQL execution engine operates on the execution plan associated
with a SQL statement and then produces the results of the query.

Figure 2.7 illustrates SQL processing.

Figure 2.7: SQL Processing Overview

2.5.2 Overview of the Optimizer


Optimizer Choice
Query optimization is the central activity during the parsing phase in query processing. In
this phase, the DBMS must choose what indexes to use, how to perform join operations,
what table to use first, and so on. Each DBMS has its own algorithms for determining the
most efficient way to access the data. The query optimizer can operate in one of two
modes:

1. A rule-based optimizer uses preset rules and points to determine the best
approach to execute a query. The rules assign a “fixed cost” to each SQL
operation; the costs are then added to yield the cost of the execution plan. For
example, a full table scan has a set cost of 10, while a table access by row ID has a
set cost of 3.

2. A cost-based optimizer uses sophisticated algorithms based on statistics
about the objects being accessed to determine the best approach to execute a
query. In this case, the optimizer adds up the processing cost, the I/O
costs, and the resource costs (RAM and temporary space) to come up with the
total cost of a given execution plan.
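
As a rough illustration of the rule-based mode, the following Python sketch adds up fixed
per-operation costs. The two cost values (10 and 3) come from the text above; the operation
names, the candidate plans, and the function name are hypothetical and do not correspond to
any particular DBMS.

```python
# Fixed cost per operation. The values 10 and 3 are the ones quoted in the text.
RULE_COSTS = {
    "FULL_TABLE_SCAN": 10,
    "TABLE_ACCESS_BY_ROWID": 3,
}

def rule_based_cost(plan):
    """Add up the fixed cost of every operation in the plan."""
    return sum(RULE_COSTS[operation] for operation in plan)

# Two hypothetical candidate plans for the same query.
candidates = {
    "scan_plan": ["FULL_TABLE_SCAN", "FULL_TABLE_SCAN"],
    "rowid_plan": ["FULL_TABLE_SCAN", "TABLE_ACCESS_BY_ROWID"],
}

costs = {name: rule_based_cost(ops) for name, ops in candidates.items()}
best_plan = min(costs, key=costs.get)
print(costs, best_plan)
```

Note that the costs do not depend on table sizes at all. That is the main limitation of a
rule-based optimizer and the reason a cost-based optimizer consults statistics instead.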

The optimizer's objective is to find alternative ways to execute a query, to evaluate the
“cost” of each alternative, and then to choose the one with the lowest cost. To understand
the function of the query optimizer, let’s use a simple example. Assume that you want to
list all products provided by a vendor based in Florida. To acquire that information, you
could write the following query:

SELECT P_CODE, P_DESCRIPT, P_PRICE, V_NAME, V_STATE
FROM PRODUCT, VENDOR
WHERE PRODUCT.V_CODE = VENDOR.V_CODE
AND VENDOR.V_STATE = 'FL';

Furthermore, let’s assume that the database statistics indicate that:

• The PRODUCT table has 7,000 rows.
• The VENDOR table has 300 rows.
• Ten vendors are located in Florida.
• One thousand products come from vendors in Florida.

It’s important to point out that only the first two items are available to the optimizer. The
last two items are assumed here to illustrate the choices that the optimizer must make.
Armed with the information in the first two items, the optimizer would try to find the most
efficient way to access the data. The primary factor in determining the most efficient
access plan is the I/O cost. (Remember, the DBMS always tries to minimize the I/O
operations.) Table 2.3 shows two sample access plans for the previous query and their
respective I/O costs.

Table 2.3: Comparing Access Plans and I/O Costs

Plan  Step  Operation                                     I/O Operations  I/O Cost   Resulting Set Rows  Total I/O Cost
A     A1    Cartesian product (PRODUCT, VENDOR)           7,000 + 300     7,300      2,100,000           7,300
A     A2    Select rows in A1 with matching vendor codes  2,100,000       2,100,000  7,000               2,107,300
A     A3    Select rows in A2 with V_STATE = 'FL'         7,000           7,000      1,000               2,114,300
B     B1    Select rows in VENDOR with V_STATE = 'FL'     300             300        10                  300
B     B2    Cartesian product (PRODUCT, B1)               7,000 + 10      7,010      70,000              7,310
B     B3    Select rows in B2 with matching vendor codes  70,000          70,000     1,000               77,310

To make the example easier to understand, the I/O Operations and I/O Cost columns in
Table 2.3 estimate only the number of I/O disk reads the DBMS must perform. For
simplicity’s sake, it is assumed that there are no indexes and that each row read has an
I/O cost of 1. For example, in step A1, the DBMS must perform a Cartesian product of
PRODUCT and VENDOR. To do that, the DBMS must read all rows from PRODUCT
(7,000) and all rows from VENDOR (300), yielding a total of 7,300 I/O operations. The
same computation is done in all steps. In Table 2.3, you can see that plan A has a total
I/O cost that is about 27 times higher than that of plan B (2,114,300 versus 77,310). In
this case, the optimizer will choose plan B to execute the SQL.
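
The totals in Table 2.3 can be reproduced with a short Python sketch. It follows the same
simplifying assumptions as the table (no indexes, one I/O per row read); the variable and
function names are illustrative only.

```python
# Statistics from the example. FLORIDA_VENDORS would not be known to a real
# optimizer; it is assumed here, as in the text, to size the intermediate result.
PRODUCT_ROWS = 7_000
VENDOR_ROWS = 300
FLORIDA_VENDORS = 10

def total_io_cost(steps):
    """Sum the rows read by each step (one I/O per row read)."""
    return sum(rows_read for _, rows_read in steps)

plan_a = [
    ("A1: Cartesian product (PRODUCT, VENDOR)", PRODUCT_ROWS + VENDOR_ROWS),
    ("A2: select rows with matching vendor codes", PRODUCT_ROWS * VENDOR_ROWS),
    ("A3: select rows with V_STATE = 'FL'", PRODUCT_ROWS),
]
plan_b = [
    ("B1: select VENDOR rows with V_STATE = 'FL'", VENDOR_ROWS),
    ("B2: Cartesian product (PRODUCT, B1)", PRODUCT_ROWS + FLORIDA_VENDORS),
    ("B3: select rows with matching vendor codes", PRODUCT_ROWS * FLORIDA_VENDORS),
]

cost_a = total_io_cost(plan_a)
cost_b = total_io_cost(plan_b)
ratio = cost_a / cost_b
print(cost_a, cost_b, round(ratio, 1))
```

The only difference between the two plans is the point at which the V_STATE restriction is
applied. Applying it first shrinks the Cartesian product from 2,100,000 rows to 70,000.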

Given the right conditions, some queries can be resolved entirely by using only an index.
For example, assume that the PRODUCT table has an index, P_QOH_NDX, on the P_QOH
attribute. Then a query such as SELECT MIN(P_QOH) FROM PRODUCT could be
resolved by reading only the first entry in the P_QOH_NDX index, without the need to
access any of the data blocks for the PRODUCT table. (Remember that the index defaults
to ascending order.)

You learned that columns with low sparsity are not good candidates for index creation.
However, there are cases where an index on a low-sparsity column would be helpful. For
example, assume that the EMPLOYEE table has 122,483 rows. If you want to find out
how many female employees are in the company, you would write a query such as:

SELECT COUNT(EMP_SEX) FROM EMPLOYEE WHERE EMP_SEX = 'F';

If you do not have an index for the EMP_SEX column, the query would have to perform
a full table scan to read all EMPLOYEE rows, and each full row includes attributes you
do not need. However, if you have an index on EMP_SEX, the query could be answered
by reading only the index data, without the need to access the employee data at all.
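
You can try this out with Python's built-in sqlite3 module. Note the assumptions: SQLite
is used here only because it ships with Python, not because it is the DBMS discussed in
this section; the EMP_NUM and EMP_NAME columns, the index name EMP_SEX_NDX, and the
1,000 sample rows are made up for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE EMPLOYEE (EMP_NUM INTEGER PRIMARY KEY, EMP_NAME TEXT, EMP_SEX TEXT)"
)
# Hypothetical data: every third employee is female.
sample_rows = [(i, f"name{i}", "F" if i % 3 == 0 else "M") for i in range(1, 1001)]
conn.executemany("INSERT INTO EMPLOYEE VALUES (?, ?, ?)", sample_rows)

query = "SELECT COUNT(EMP_SEX) FROM EMPLOYEE WHERE EMP_SEX = 'F'"

# Without an index, the only available access path is a scan of the table.
count_without_index = conn.execute(query).fetchone()[0]
plan_without_index = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()

# With an index on EMP_SEX, the optimizer may answer from the index alone.
conn.execute("CREATE INDEX EMP_SEX_NDX ON EMPLOYEE (EMP_SEX)")
count_with_index = conn.execute(query).fetchone()[0]
plan_with_index = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()

print(count_without_index, count_with_index)
# The last field of each plan row is a human-readable description of the step.
for row in plan_without_index + plan_with_index:
    print(row[-1])
```

The two counts are identical; only the access path changes. The exact wording of the plan
descriptions depends on the SQLite version, so treat the printed plans as informational.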

USING HINTS TO AFFECT OPTIMIZER CHOICES

Although the optimizer generally performs very well under most circumstances, in some
instances the optimizer might not choose the best execution plan. Remember, the
optimizer makes decisions based on the existing statistics. If the statistics are old, the
optimizer might not do a good job in selecting the best execution plan. Even with current
statistics, the optimizer choice might not be the most efficient one. There are some
occasions when the end user would like to change the optimizer mode for the current
SQL statement. In order to do that, you need to use hints. Optimizer hints are special
instructions for the optimizer that are embedded inside the SQL command text. Table 2.4
summarizes a few of the most common optimizer hints, shown here in Oracle's comment-style
syntax. Hints are vendor-specific and are not part of standard SQL.

Table 2.4: Optimizer Hints

Hint          Usage
ALL_ROWS      Instructs the optimizer to minimize the overall execution time, that is,
              to minimize the time it takes to return all rows in the query result set.
              This hint is generally used for batch mode processes. For example:

              SELECT /*+ALL_ROWS*/ *
              FROM PRODUCT
              WHERE P_QOH < 10;

FIRST_ROWS    Instructs the optimizer to minimize the time it takes to process the first
              set of rows, that is, to minimize the time it takes to return only the
              first set of rows in the query result set. This hint is generally used for
              interactive mode processes. For example:

              SELECT /*+FIRST_ROWS*/ *
              FROM PRODUCT
              WHERE P_QOH < 10;

INDEX(name)   Forces the optimizer to use the P_QOH_NDX index to process this query.
              For example:

              SELECT /*+INDEX(P_QOH_NDX)*/ *
              FROM PRODUCT
              WHERE P_QOH < 10;

You are now familiar with the way the DBMS processes SQL queries.

The output from the optimizer is a plan that describes an optimum method of execution.
As noted above, the Oracle server provides cost-based (CBO) and rule-based
(RBO) optimization. In general, use the cost-based approach. Oracle Corporation is
continually improving the CBO, and new features require the CBO.

Understanding the Cost-Based Optimizer

The CBO determines which execution plan is most efficient by considering available
access paths and by factoring in information based on statistics for the schema objects
(tables or indexes) accessed by the SQL statement. The CBO also considers hints, which
are optimization suggestions placed in a comment in the statement.

The CBO performs the following steps:

1. The optimizer generates a set of potential plans for the SQL statement based on
available access paths and hints.

2. The optimizer estimates the cost of each plan based on statistics in the data
dictionary for the data distribution and storage characteristics of the tables,
indexes, and partitions accessed by the statement.

The cost is an estimated value proportional to the expected resource use needed to
execute the statement with a particular plan. The optimizer calculates the cost of
access paths and join orders based on the estimated computer resources, which
include I/O, CPU, and memory.

Summary of all cost factors:

Total cost = CPU cost + I/O cost + communication cost

CPU cost = unit instruction cost * number of instructions

I/O cost = unit disk I/O cost * number of disk I/Os

Communication cost = message initiation cost + transmission cost

Serial plans with higher costs take more time to execute than those with smaller
costs. When using a parallel plan, however, resource use is not directly related to
elapsed time.

3. The optimizer compares the costs of the plans and chooses the one with the lowest
cost.
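
The three CBO steps and the cost formulas above can be put together in a small Python
sketch. Everything numeric here is an assumption made for illustration: the unit costs, the
two candidate plans, and their instruction and I/O counts are invented, and a real optimizer
derives such estimates from data dictionary statistics.

```python
from dataclasses import dataclass

# Hypothetical unit costs: one disk I/O is assumed to cost as much as
# 1,000 CPU instructions.
UNIT_INSTRUCTION_COST = 1
UNIT_DISK_IO_COST = 1_000

@dataclass
class PlanEstimate:
    name: str
    instructions: int
    disk_ios: int
    message_initiation: float = 0.0  # zero for a non-distributed query
    transmission: float = 0.0

def total_cost(plan):
    """Total cost = CPU cost + I/O cost + communication cost."""
    cpu_cost = UNIT_INSTRUCTION_COST * plan.instructions
    io_cost = UNIT_DISK_IO_COST * plan.disk_ios
    communication_cost = plan.message_initiation + plan.transmission
    return cpu_cost + io_cost + communication_cost

# Step 1: candidate plans (hypothetical estimates).
plans = [
    PlanEstimate("full_scan", instructions=50_000, disk_ios=1_000),
    PlanEstimate("index_lookup", instructions=80_000, disk_ios=120),
]
# Step 2: estimate the cost of each plan.
estimated = {plan.name: total_cost(plan) for plan in plans}
# Step 3: choose the plan with the lowest cost.
chosen = min(plans, key=total_cost)
print(estimated, chosen.name)
```

In this made-up case the index plan needs more CPU work but far fewer disk I/Os, so it
wins once both terms are weighted by their unit costs.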

Example 1: Find all Managers that work at a London branch:

SELECT * FROM staff s, branch b
WHERE s.bno = b.bno AND
(s.position = 'Manager' AND b.city = 'London');

Three equivalent RA queries are:

(1) $\sigma_{(position='Manager') \land (city='London') \land (staff.bno=branch.bno)}(Staff \times Branch)$

(2) $\sigma_{(position='Manager') \land (city='London')}(Staff \bowtie Branch)$

(3) $(\sigma_{position='Manager'}(Staff)) \bowtie (\sigma_{city='London'}(Branch))$

Assume:

• 1000 tuples in Staff; 50 tuples in Branch;
• 50 Managers; 5 London branches;
• No indexes or sort keys;
• Results of any intermediate operations stored on disk;
• Cost of the final write is ignored;
• Tuples are accessed one at a time.

Cost Comparison

Costs (in disk accesses) are:

(1) (1000 + 50) + 2*(1000 * 50) = 101050

(2) (1000 + 50) + 2*1000 = 3050

(3) 1000 + 50 + 50 + 5 + (50 + 5) = 1160

• Cartesian product and join operations are much more expensive than selection.
• Query (3) significantly reduces the size of the relations being joined together.
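
The three cost figures can be recomputed term by term. The comments in the sketch below
give one reading of where each term comes from under the stated assumptions (intermediate
results are written to disk and read back, and each Staff tuple joins with exactly one
Branch tuple); the variable names are illustrative.

```python
STAFF = 1000           # tuples in Staff
BRANCH = 50            # tuples in Branch
MANAGERS = 50          # Staff tuples with position = 'Manager'
LONDON_BRANCHES = 5    # Branch tuples with city = 'London'

# (1) Read both relations, write the Cartesian product to disk,
#     then read it back to apply the selection.
cost_1 = (STAFF + BRANCH) + 2 * (STAFF * BRANCH)

# (2) Read both relations, write the join result (one tuple per Staff tuple),
#     then read it back to apply the selection.
cost_2 = (STAFF + BRANCH) + 2 * STAFF

# (3) Read Staff and write the 50 Managers, read Branch and write the
#     5 London branches, then read both reduced relations to join them.
cost_3 = STAFF + MANAGERS + BRANCH + LONDON_BRANCHES + (MANAGERS + LONDON_BRANCHES)

print(cost_1, cost_2, cost_3)
print(round(cost_1 / cost_3))  # how many times cheaper (3) is than (1)
```

Under these assumptions, strategy (3) is roughly 87 times cheaper than strategy (1),
which is why an optimizer tries to apply selections as early as possible.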

Example 2:

Cost-Based Query Optimization: Algebraic Expressions

Consider the following query:

SELECT p.pname, d.dname
FROM Patients p, Doctors d
WHERE p.doctor = d.dname AND d.dgender = 'M';

Initial plan (operators are listed from the root down; inputs are indented below the
operator that consumes them):

projection
  filter
    join
      Scan(Patients)
      Scan(Doctors)

Transformation:

Plan 1                        Plan 2
projection                    projection
  filter                        join
    join                          Scan(Patients)
      Scan(Patients)              filter
      Scan(Doctors)                 Scan(Doctors)

Implementation:

Plan 1                        Plan 2
projection                    projection
  filter                        Hash join
    Natural join                  Scan(Patients)
      Scan(Patients)              filter
      Scan(Doctors)                 Scan(Doctors)

Plan selection based on costs:

Plan 1                        Plan 2
projection                    projection
  filter                        Hash join
    Natural join                  Scan(Patients)
      Scan(Patients)              filter
      Scan(Doctors)                 Scan(Doctors)

Estimated cost = 100 ms       Estimated cost = 50 ms
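
The two plans of Example 2 can be imitated in Python to check that they return the same
answer. This is a sketch under stated assumptions: the sample rows are invented, only the
column names come from the query, the first plan's join is written as a simple nested loop,
and the hash join assumes that dname is unique in Doctors. The estimated costs of 100 ms
and 50 ms are the figures given in the example, not measurements of this code.

```python
# Hypothetical sample data; only the column names come from the query.
patients = [
    {"pname": "Ann", "doctor": "Dr. Lee"},
    {"pname": "Bob", "doctor": "Dr. Kim"},
    {"pname": "Cy", "doctor": "Dr. Lee"},
    {"pname": "Dee", "doctor": "Dr. Roy"},
]
doctors = [
    {"dname": "Dr. Lee", "dgender": "M"},
    {"dname": "Dr. Kim", "dgender": "F"},
    {"dname": "Dr. Roy", "dgender": "M"},
]

def plan_1(patients, doctors):
    """Join first (nested loop), then filter, then project."""
    joined = [(p, d) for p in patients for d in doctors if p["doctor"] == d["dname"]]
    filtered = [(p, d) for p, d in joined if d["dgender"] == "M"]
    return [(p["pname"], d["dname"]) for p, d in filtered], len(joined)

def plan_2(patients, doctors):
    """Filter Doctors first, then hash join, then project."""
    male_doctors = [d for d in doctors if d["dgender"] == "M"]
    by_name = {d["dname"]: d for d in male_doctors}  # build side of the hash join
    joined = [(p, by_name[p["doctor"]]) for p in patients if p["doctor"] in by_name]
    return [(p["pname"], d["dname"]) for p, d in joined], len(joined)

result_1, joined_rows_1 = plan_1(patients, doctors)
result_2, joined_rows_2 = plan_2(patients, doctors)

# Plan selection: pick the plan with the lowest estimated cost (in ms).
estimated_costs = {"plan_1": 100, "plan_2": 50}
selected = min(estimated_costs, key=estimated_costs.get)
print(result_1 == result_2, joined_rows_1, joined_rows_2, selected)
```

Both plans produce the same rows, but the second one joins fewer tuples because the filter
has already removed the doctors that cannot contribute to the result.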