
NOW WHAT'S UP WITH DBMS_STATS?

Terry Sutton, Database Specialists, Inc.

Copyright 2013 Database Specialists, Inc. http://www.dbspecialists.com

Having accurate optimizer statistics in your Oracle database is more important than ever. Handling larger volumes of data, using more complex queries, and benefitting from the ever-increasing number of choices of execution plans generated by the Oracle Optimizer require that the optimizer knows as much as possible about your data. The way to gather your statistics is with the DBMS_STATS package. But there are many procedures in the package, and many options and parameters in the procedures. Your choice of options can dramatically affect your results, both in accuracy of statistics and performance of the statistics operation itself. A while back we examined DBMS_STATS in Oracle 9 and 10.1. This paper updates the information for Oracle 11.2. We discuss the effects of the various choices. Our focus is on actual experience, measured performance, and detailed examples, not just on the documentation.

BACKGROUND
For years the DBMS_STATS package has been the preferred method for getting information about the data in Oracle databases and providing that information to the optimizer. While the ANALYZE command is still around, its use for gathering optimizer statistics has been deprecated, and there's really no reason to use it for that purpose. The DBMS_STATS package has quite a few procedures, functions, and options, and it provides a lot of choices. But which are the right ones? Some time back (http://www.dbspecialists.com/files/presentations/dbms_stats.html), we studied the behavior of the DBMS_STATS procedures and various options. At that time we focused on Oracle 9 and 10.1. Much has changed since then. So we're going to take a look primarily at version 11.2, with some comparison to 10.2. We'll examine several commonly used options, and how best to use them. And we'll see the results of testing the various options, in terms of both time to perform and accuracy of statistics. Our goal is to help you understand DBMS_STATS better, and to know which choices to make (and why).

TECHNICAL DISCUSSION AND EXAMPLES


DBMS_STATS Procedures
The DBMS_STATS package in Oracle 11.2 has more than 90 procedures. Many of the procedures are used for maintenance and administrative tasks, like exporting, importing, deleting, copying, and locking statistics. Others are used for setting preferences for statistics gathering. We will focus on the GATHER_TABLE_STATS and GATHER_INDEX_STATS procedures, as they are the starting point for gathering the metadata on your data so the optimizer can make informed decisions. These procedures (unsurprisingly) gather statistics on the data in a specific table or index. And their options are similar to those of GATHER_SCHEMA_STATS and GATHER_DATABASE_STATS. The parameters of DBMS_STATS.GATHER_TABLE_STATS are:
    DBMS_STATS.GATHER_TABLE_STATS (
       ownname          VARCHAR2,
       tabname          VARCHAR2,
       partname         VARCHAR2 DEFAULT NULL,
       estimate_percent NUMBER   DEFAULT to_estimate_percent_type(get_param('ESTIMATE_PERCENT')),
       block_sample     BOOLEAN  DEFAULT FALSE,
       method_opt       VARCHAR2 DEFAULT get_param('METHOD_OPT'),
       degree           NUMBER   DEFAULT to_degree_type(get_param('DEGREE')),
       granularity      VARCHAR2 DEFAULT GET_PARAM('GRANULARITY'),
       cascade          BOOLEAN  DEFAULT to_cascade_type(get_param('CASCADE')),
       stattab          VARCHAR2 DEFAULT NULL,
       statid           VARCHAR2 DEFAULT NULL,
       statown          VARCHAR2 DEFAULT NULL,
       no_invalidate    BOOLEAN  DEFAULT to_no_invalidate_type(get_param('NO_INVALIDATE')),
       force            BOOLEAN  DEFAULT FALSE);

The parameters of DBMS_STATS.GATHER_INDEX_STATS are:


    DBMS_STATS.GATHER_INDEX_STATS (
       ownname          VARCHAR2,
       indname          VARCHAR2,
       partname         VARCHAR2 DEFAULT NULL,
       estimate_percent NUMBER   DEFAULT to_estimate_percent_type(GET_PARAM('ESTIMATE_PERCENT')),
       stattab          VARCHAR2 DEFAULT NULL,
       statid           VARCHAR2 DEFAULT NULL,
       statown          VARCHAR2 DEFAULT NULL,
       degree           NUMBER   DEFAULT to_degree_type(get_param('DEGREE')),
       granularity      VARCHAR2 DEFAULT GET_PARAM('GRANULARITY'),
       no_invalidate    BOOLEAN  DEFAULT to_no_invalidate_type(GET_PARAM('NO_INVALIDATE')),
       force            BOOLEAN  DEFAULT FALSE);

For the purposes of our testing, the estimate_percent, method_opt, and cascade parameters of GATHER_TABLE_STATS are of the most interest in terms of their effects on performance and accuracy. (The degree parameter affects performance, but in a fairly obvious way: in general, the higher the degree of parallelism, the faster the stats gathering completes.)

One thing to note about the parameters in 11.2 is the default values. Several of the defaults are not hard-coded but are looked up from stored preferences (e.g., to_estimate_percent_type(get_param('ESTIMATE_PERCENT')) for estimate_percent). This adds quite a bit of flexibility to the settings used for various options. Different preferences can be set for specific schemas or tables so that you don't have to change your scheduled statistics gathering jobs.

estimate_percent
The value for estimate_percent is the percentage of rows to sample, with NULL meaning compute (i.e., 100%). You can use the constant DBMS_STATS.AUTO_SAMPLE_SIZE to have Oracle determine the best sample size for good statistics (this is the default); this will be important later.

method_opt
This parameter determines whether to collect histograms to help in dealing with skewed data. FOR ALL COLUMNS or FOR ALL INDEXED COLUMNS with a SIZE value determines which columns and how many histogram buckets to use. Instead of an integer value for SIZE, you can specify SKEWONLY to have Oracle determine the columns on which to collect histograms based on their data distribution. Or you can specify AUTO to have Oracle determine the columns on which to collect histograms based on data distribution and workload, or REPEAT to collect histograms on columns that already have histograms. The default is FOR ALL COLUMNS SIZE AUTO.

cascade
The value for cascade determines whether to gather statistics on indexes as well. Using TRUE is equivalent to running GATHER_INDEX_STATS on each of the table's indexes (though we'll see that it is actually more efficient than that). You can use the constant DBMS_STATS.AUTO_CASCADE to have Oracle determine whether index statistics are to be gathered or not (this is the default).
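Because these defaults come from stored preferences, it can be useful to check what a call with no explicit arguments would actually use. The query below is a minimal sketch using DBMS_STATS.GET_PREFS; the owner and table name are the ones from our test schema, so substitute your own.

```sql
-- Show the preference values that GATHER_TABLE_STATS would pick up
-- by default for one table (table-level setting if one exists,
-- otherwise the global setting).
SELECT dbms_stats.get_prefs('ESTIMATE_PERCENT', 'TSUTTON', 'FILE_HISTORY') AS estimate_percent,
       dbms_stats.get_prefs('METHOD_OPT',       'TSUTTON', 'FILE_HISTORY') AS method_opt,
       dbms_stats.get_prefs('CASCADE',          'TSUTTON', 'FILE_HISTORY') AS cascade
  FROM dual;
```

Calling GET_PREFS with only the preference name returns the global value, which makes it easy to spot tables whose settings differ from the rest of the database.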

Our Testing Data


For our tests, we will use two tables, FILE_HISTORY and PROP_CAT:
    FILE_HISTORY [1,951,673 rows, 28507 blocks, 223MB]

    Column Name  Null?    Type          Distinct Values
    ------------ -------- ------------- ---------------
    FILE_ID      NOT NULL NUMBER                1951673
    FNAME        NOT NULL VARCHAR2(240)         1951673
    STATE_NO              NUMBER                      6
    FILE_TYPE    NOT NULL NUMBER                      7
    PREF                  VARCHAR2(100)           65345
    CREATE_DATE  NOT NULL DATE                   960724
    TRACK_ID     NOT NULL NUMBER                      9
    SECTOR_ID    NOT NULL NUMBER                      6
    TEAMS                 NUMBER                   1206
    BYTE_SIZE             NUMBER                      0
    START_DATE            DATE                        0
    END_DATE              DATE                        0
    LAST_UPDATE           DATE                   837279
    CONTAINERS            NUMBER                   1206

    PROP_CAT [11,486,321 rows, 117705 blocks, 920MB]

    Column Name  Null?    Type          Distinct Values
    ------------ -------- ------------- ---------------
    LINENUM      NOT NULL NUMBER(38)           11486321
    LOOKUPID              VARCHAR2(64)            40903
    EXTID                 VARCHAR2(20)         11486321
    SOLD         NOT NULL NUMBER(38)                  1
    CATEGORY              VARCHAR2(6)               843
    NOTES                 VARCHAR2(255)               0
    DETAILS               VARCHAR2(255)             873
    PROPSTYLE             VARCHAR2(20)            48936

The indexes on the tables are as follows:


    Table        Unique?    Index Name                 Column Name
    ------------ ---------- -------------------------- ------------
    FILE_HISTORY NONUNIQUE  TSUTTON.FILEH_FNAME        FNAME
                 NONUNIQUE  TSUTTON.FILEH_FTYPE_STATE  FILE_TYPE
                                                       STATE_NO
                 NONUNIQUE  TSUTTON.FILEH_PREFIX_STATE PREF
                                                       STATE_NO
                 UNIQUE     TSUTTON.PK_FILE_HISTORY    FILE_ID
    PROP_CAT     NONUNIQUE  TSUTTON.PK_PROP_CAT        EXTID
                                                       SOLD
                 NONUNIQUE  TSUTTON.PROPC_LOOKUPID     LOOKUPID
                 NONUNIQUE  TSUTTON.PROPC_PROPSTYLE    PROPSTYLE

The number of distinct values for the various columns was calculated using
select count(distinct col_name) from table_name;

rather than from gathering statistics. When we performed our tests, we used two queries to find out the values for the statistics gathered:

1) tc.sql

    select table_name, column_name, data_type, num_distinct, sample_size,
           to_char(last_analyzed, ' HH24:MI:SS') last_analyzed,
           num_buckets buckets
    from dba_tab_columns
    where table_name in ('FILE_HISTORY','PROP_CAT')
    order by table_name, column_id;

2) ic.sql

    select ind.table_name, ind.uniqueness uniq, col.index_name indname,
           col.column_name, ind.distinct_keys dist, ind.sample_size
    from dba_ind_columns col, dba_indexes ind
    where ind.table_owner = 'TSUTTON'
    and ind.table_name in ('FILE_HISTORY','PROP_CAT')
    and col.index_owner = ind.owner
    and col.index_name = ind.index_name
    and col.table_owner = ind.table_owner
    and col.table_name = ind.table_name
    order by col.table_name, col.index_name, col.column_position;

The Tests
We tested different estimate_percent values for gathering statistics on the tables. We did some tests with cascade=>true, and some others with cascade=>false followed by gathering stats on the indexes. We have heard many opinions over the years recommending various estimate_percent values, as well as recommendations for using a small estimate percent for the tables and a higher one for the indexes. Let's see what the tests say. For a baseline, let's do the tests in version 10.2 first:

Test: estimate_percent=>1, cascade=>true
    SQL> EXECUTE dbms_stats.gather_table_stats (ownname=>'TSUTTON', tabname=>'FILE_HISTORY', estimate_percent=>1, cascade=>true)
    PL/SQL procedure successfully completed.
    Elapsed: 00:00:04.74

    SQL> EXECUTE dbms_stats.gather_table_stats (ownname=>'TSUTTON', tabname=>'PROP_CAT', estimate_percent=>1, cascade=>true)
    PL/SQL procedure successfully completed.
    Elapsed: 00:00:08.24

    SQL> @tc

    TABLE_NAME   COLUMN_NAME  DATA_TYPE NUM_DISTINCT Sample Size LAST_ANALYZED
    ------------ ------------ --------- ------------ ----------- -------------
    FILE_HISTORY FILE_ID      NUMBER         1954900       19549 00:34:56
                 FNAME        VARCHAR2       1954900       19549 00:34:56
                 STATE_NO     NUMBER               3       19549 00:34:56
                 FILE_TYPE    NUMBER               7       19549 00:34:56
                 PREF         VARCHAR2          6833       14562 00:34:56
                 CREATE_DATE  DATE            399418       19549 00:34:56
                 TRACK_ID     NUMBER               9       19549 00:34:56
                 SECTOR_ID    NUMBER               6       19549 00:34:56
                 TEAMS        NUMBER             510       16063 00:34:56
                 BYTE_SIZE    NUMBER               0             00:34:56
                 START_DATE   DATE                 0             00:34:56
                 END_DATE     DATE                 0             00:34:56
                 LAST_UPDATE  DATE            528973       19549 00:34:56
                 CONTAINERS   NUMBER             510       16067 00:34:56
    PROP_CAT     LINENUM      NUMBER        11494400      114944 00:35:16
                 LOOKUPID     VARCHAR2         16929      114944 00:35:16
                 EXTID        VARCHAR2      11494400      114944 00:35:16
                 SOLD         NUMBER               1      114944 00:35:16
                 CATEGORY     VARCHAR2           783      114944 00:35:16
                 NOTES        VARCHAR2             0             00:35:16
                 DETAILS      VARCHAR2           785      114944 00:35:16
                 PROPSTYLE    VARCHAR2         16952      114944 00:35:16

    SQL> @ic

    TABLE_NAME   Uniqueness Index Name         COLUMN_NAME  Distinct Keys Sample Size
    ------------ ---------- ------------------ ------------ ------------- -----------
    FILE_HISTORY NONUNIQUE  FILEH_FNAME        FNAME            1,954,900      129311
    FILE_HISTORY NONUNIQUE  FILEH_FNAME_STATE  FNAME            2,043,558      126022
    FILE_HISTORY                               STATE_NO         2,043,558      126022
    FILE_HISTORY NONUNIQUE  FILEH_FTYPE_STATE  FILE_TYPE               13      468842
    FILE_HISTORY                               STATE_NO                13      468842
    FILE_HISTORY NONUNIQUE  FILEH_PREFIX_STATE PREF                62,011      411979
    FILE_HISTORY                               STATE_NO            62,011      411979
    FILE_HISTORY UNIQUE     PK_FILE_HISTORY    FILE_ID          1,912,880      495696
    PROP_CAT     NONUNIQUE  PK_PROP_CAT        EXTID           10,698,273      361290
    PROP_CAT                                   SOLD            10,698,273      361290
    PROP_CAT     NONUNIQUE  PROPC_LOOKUPID     LOOKUPID            16,929      460820
    PROP_CAT     NONUNIQUE  PROPC_PROPSTYLE    PROPSTYLE           16,952      509268

Gathering statistics on the two tables and their indexes took 13 seconds (times will be rounded to the nearest second). The statistics are very accurate for the column FNAME, but the stats for PREF are pretty far off (6,833, when the correct count is 65,345). And the PROPSTYLE column shows 16,952 distinct column values and 16,952 distinct index values, when the correct count is 48,936.

Test: estimate_percent=>1, cascade=>false, indexes estimate_percent=>20
    SQL> EXECUTE dbms_stats.gather_table_stats (ownname=>'TSUTTON', tabname=>'FILE_HISTORY', estimate_percent=>1)
    PL/SQL procedure successfully completed.
    Elapsed: 00:00:04.95

    SQL> EXECUTE dbms_stats.gather_index_stats (ownname=>'TSUTTON', indname=>'PK_FILE_HISTORY', estimate_percent=>20)
    PL/SQL procedure successfully completed.
    Elapsed: 00:00:00.34

    SQL> EXECUTE dbms_stats.gather_index_stats (ownname=>'TSUTTON', indname=>'FILEH_FTYPE_STATE', estimate_percent=>20)
    PL/SQL procedure successfully completed.
    Elapsed: 00:00:00.34

    SQL> EXECUTE dbms_stats.gather_index_stats (ownname=>'TSUTTON', indname=>'FILEH_PREFIX_STATE', estimate_percent=>20)
    PL/SQL procedure successfully completed.
    Elapsed: 00:00:01.08

    SQL> EXECUTE dbms_stats.gather_index_stats (ownname=>'TSUTTON', indname=>'FILEH_FNAME', estimate_percent=>20)
    PL/SQL procedure successfully completed.
    Elapsed: 00:00:01.22

    SQL> EXECUTE dbms_stats.gather_table_stats (ownname=>'TSUTTON', tabname=>'PROP_CAT', estimate_percent=>1)
    PL/SQL procedure successfully completed.
    Elapsed: 00:00:01.26

    SQL> EXECUTE dbms_stats.gather_index_stats (ownname=>'TSUTTON', indname=>'PK_PROP_CAT', estimate_percent=>20)
    PL/SQL procedure successfully completed.
    Elapsed: 00:00:05.87

    SQL> EXECUTE dbms_stats.gather_index_stats (ownname=>'TSUTTON', indname=>'PROPC_LOOKUPID', estimate_percent=>20)
    PL/SQL procedure successfully completed.
    Elapsed: 00:00:08.97

    SQL> EXECUTE dbms_stats.gather_index_stats (ownname=>'TSUTTON', indname=>'PROPC_PROPSTYLE', estimate_percent=>20)
    PL/SQL procedure successfully completed.
    Elapsed: 00:00:03.74

    SQL> @tc

    TABLE_NAME   COLUMN_NAME  DATA_TYPE NUM_DISTINCT Sample Size LAST_ANALYZED
    ------------ ------------ --------- ------------ ----------- -------------
    FILE_HISTORY FILE_ID      NUMBER         1956700       19567 01:08:19
                 FNAME        VARCHAR2       1956700       19567 01:08:19
                 STATE_NO     NUMBER               4       19567 01:08:19
                 FILE_TYPE    NUMBER               7       19567 01:08:19
                 PREF         VARCHAR2          6758       14539 01:08:19
                 CREATE_DATE  DATE            431402       19567 01:08:19
                 TRACK_ID     NUMBER               9       19567 01:08:19
                 SECTOR_ID    NUMBER               6       19567 01:08:19
                 TEAMS        NUMBER             501       16090 01:08:19
                 BYTE_SIZE    NUMBER               0             01:08:19
                 START_DATE   DATE                 0             01:08:19
                 END_DATE     DATE                 0             01:08:19
                 LAST_UPDATE  DATE            507022       19567 01:08:19
                 CONTAINERS   NUMBER             501       16093 01:08:19
    PROP_CAT     LINENUM      NUMBER        11500900      115009 01:08:30
                 LOOKUPID     VARCHAR2         16949      115009 01:08:30
                 EXTID        VARCHAR2      11500900      115009 01:08:30
                 SOLD         NUMBER               1      115009 01:08:30
                 CATEGORY     VARCHAR2           783      115009 01:08:30
                 NOTES        VARCHAR2             0             01:08:30
                 DETAILS      VARCHAR2           783      115009 01:08:30
                 PROPSTYLE    VARCHAR2         16975      115009 01:08:30

    SQL> @ic

    TABLE_NAME   Uniqueness Index Name         COLUMN_NAME  Distinct Keys Sample Size
    ------------ ---------- ------------------ ------------ ------------- -----------
    FILE_HISTORY NONUNIQUE  FILEH_FNAME        FNAME            1,967,115      393423
    FILE_HISTORY NONUNIQUE  FILEH_FNAME_STATE  FNAME            2,011,500      124045
    FILE_HISTORY                               STATE_NO         2,011,500      124045
    FILE_HISTORY NONUNIQUE  FILEH_FTYPE_STATE  FILE_TYPE               15      447324
    FILE_HISTORY                               STATE_NO                15      447324
    FILE_HISTORY NONUNIQUE  FILEH_PREFIX_STATE PREF                68,646      429836
    FILE_HISTORY                               STATE_NO            68,646      429836
    FILE_HISTORY UNIQUE     PK_FILE_HISTORY    FILE_ID          1,959,662      507819
    PROP_CAT     NONUNIQUE  PK_PROP_CAT        EXTID           11,498,695     2299739
    PROP_CAT                                   SOLD            11,498,695     2299739
    PROP_CAT     NONUNIQUE  PROPC_LOOKUPID     LOOKUPID             9,654     2309463
    PROP_CAT     NONUNIQUE  PROPC_PROPSTYLE    PROPSTYLE           10,744     2349661

Gathering the statistics on the tables and their indexes took 23 seconds this time. Again, we get a fairly accurate count for FNAME, but the stats for PREF and PROPSTYLE are still way off.

Test: estimate_percent=>1, cascade=>false, indexes estimate_percent=>100 [we'll stop listing the commands here in the interest of sanity]
    TABLE_NAME   COLUMN_NAME  DATA_TYPE NUM_DISTINCT Sample Size LAST_ANALYZED
    ------------ ------------ --------- ------------ ----------- -------------
    FILE_HISTORY FILE_ID      NUMBER         1958100       19581 01:12:53
                 FNAME        VARCHAR2       1958100       19581 01:12:53
                 STATE_NO     NUMBER               4       19581 01:12:53
                 FILE_TYPE    NUMBER               7       19581 01:12:53
                 PREF         VARCHAR2          6730       14429 01:12:53
                 CREATE_DATE  DATE            387345       19581 01:12:53
                 TRACK_ID     NUMBER               9       19581 01:12:53
                 SECTOR_ID    NUMBER               6       19581 01:12:53
                 TEAMS        NUMBER             507       16099 01:12:53
                 BYTE_SIZE    NUMBER               0             01:12:53
                 START_DATE   DATE                 0             01:12:53
                 END_DATE     DATE                 0             01:12:53
                 LAST_UPDATE  DATE            478055       19581 01:12:53
                 CONTAINERS   NUMBER             507       16103 01:12:53
    PROP_CAT     LINENUM      NUMBER        11481200      114812 01:13:18
                 LOOKUPID     VARCHAR2         16928      114812 01:13:18
                 EXTID        VARCHAR2      11481200      114812 01:13:18
                 SOLD         NUMBER               1      114812 01:13:18
                 CATEGORY     VARCHAR2           779      114812 01:13:18
                 NOTES        VARCHAR2             0             01:13:18
                 DETAILS      VARCHAR2           784      114812 01:13:18
                 PROPSTYLE    VARCHAR2         16952      114812 01:13:18

    TABLE_NAME   Uniqueness Index Name         COLUMN_NAME  Distinct Keys Sample Size
    ------------ ---------- ------------------ ------------ ------------- -----------
    FILE_HISTORY NONUNIQUE  FILEH_FNAME        FNAME            1,951,673     1951673
    FILE_HISTORY NONUNIQUE  FILEH_FNAME_STATE  FNAME            1,906,826      117590
    FILE_HISTORY                               STATE_NO         1,906,826      117590
    FILE_HISTORY NONUNIQUE  FILEH_FTYPE_STATE  FILE_TYPE               23     1951673
    FILE_HISTORY                               STATE_NO                23     1951673
    FILE_HISTORY NONUNIQUE  FILEH_PREFIX_STATE PREF                65,390     1951673
    FILE_HISTORY                               STATE_NO            65,390     1951673
    FILE_HISTORY UNIQUE     PK_FILE_HISTORY    FILE_ID          1,951,673     1951673
    PROP_CAT     NONUNIQUE  PK_PROP_CAT        EXTID           11,486,321    11486321
    PROP_CAT                                   SOLD            11,486,321    11486321
    PROP_CAT     NONUNIQUE  PROPC_LOOKUPID     LOOKUPID            40,903    11486321
    PROP_CAT     NONUNIQUE  PROPC_PROPSTYLE    PROPSTYLE           48,936    11486321

This time the statistics gathering took 1 minute 50 seconds, and the accuracy of the column statistics remained about the same. But the index statistics show a completely accurate count of distinct index values for PROPSTYLE. We continued with testing additional values, and the results of the tests are summed up in this table:

Oracle 10.2:

                                          Elapsed  # Distinct Values  # Distinct Values  # Distinct Values
    estimate_percent                      Time     FNAME (1951673)    PREF (65345)       PROPSTYLE (48936)
    ------------------------------------- -------  -----------------  -----------------  -----------------
    1%, cascade                             :13    1954900             6833              16952/16952
    1% table, 20% indexes                   :23    1956700             6758              16975/10744
    1% table, compute indexes              1:50    1958100             6730              16952/48936
    5%, cascade                             :20    1954700            24202              20127/20127
    5% table, 20% indexes                   :37    1944160            23883              20156/10405
    10%, cascade                            :30    1946050            35982              23381/23381
    20%, cascade                            :51    1951845            48122              28887/28887
    50%, cascade                           2:02    1953826            60746              39785/39785
    Null (compute), cascade                4:04    1951673            65345              48936/48936
    dbms_stats.auto_sample_size, cascade   1:08    1952261             1871              16565/16565

[The dual numbers under the PROPSTYLE column are the distinct column values / distinct index values]

So we see that increasing the estimate_percent gives us more accurate statistics for the PREF and PROPSTYLE columns. And in 10.2, using AUTO_SAMPLE_SIZE for estimate_percent gives us the least accuracy of all!
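The split table/index tests above issue one GATHER_INDEX_STATS call per index. As a sketch only, the same sequence can be scripted with a loop over DBA_INDEXES; the 1% and 20% sample sizes simply mirror the second test and are not a recommendation.

```sql
BEGIN
  -- Gather the table statistics first, without cascading to the indexes.
  dbms_stats.gather_table_stats(
    ownname          => 'TSUTTON',
    tabname          => 'FILE_HISTORY',
    estimate_percent => 1,
    cascade          => FALSE);

  -- Then gather each of the table's indexes separately with a larger sample.
  FOR ix IN (SELECT owner, index_name
               FROM dba_indexes
              WHERE table_owner = 'TSUTTON'
                AND table_name  = 'FILE_HISTORY')
  LOOP
    dbms_stats.gather_index_stats(
      ownname          => ix.owner,
      indname          => ix.index_name,
      estimate_percent => 20);
  END LOOP;
END;
/
```

Passing cascade => FALSE explicitly keeps the table call from gathering index statistics on its own, so the timings of the two steps stay separate.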

The Tests (11.2)

Now let's try the same tests in Oracle 11.2. Here are the results, summarized:

Oracle 11.2:

                                          Elapsed  # Distinct Values  # Distinct Values  # Distinct Values
    estimate_percent                      Time     FNAME (1951673)    PREF (65345)       PROPSTYLE (48936)
    ------------------------------------- -------  -----------------  -----------------  -----------------
    1%, cascade                             :22    1960700             6755              16961/16961
    1% table, 20% indexes                   :30    1950300             6693              16880/10241
    1% table, compute indexes              1:56    1940200             6661              16910/48936
    5%, cascade                             :22    1944340            24323              20159/20159
    5% table, 20% indexes                   :37    1949020            24431              20181/11191
    10%, cascade                            :33    1949310            36036              23619/23619
    20%, cascade                            :57    1950530            48293              29135/29135
    50%, cascade                           2:09    1952036            60703              39779/39779
    null (compute), cascade                4:14    1951673            65345              48936/48936
    dbms_stats.auto_sample_size, cascade    :40    1944064            65192              48576/48576
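The accuracy figures in these tables come from comparing NUM_DISTINCT in the data dictionary with an exact count. A minimal sketch of that comparison for a single column (PREF in FILE_HISTORY) might look like the following; the pct_of_actual column is our own addition for readability and assumes the column has at least one non-null value.

```sql
-- Compare the optimizer's distinct-value estimate with the true count.
SELECT s.column_name,
       s.num_distinct                                AS stats_ndv,
       a.actual_ndv,
       ROUND(100 * s.num_distinct / a.actual_ndv, 1) AS pct_of_actual
  FROM dba_tab_columns s,
       (SELECT COUNT(DISTINCT pref) AS actual_ndv
          FROM tsutton.file_history) a
 WHERE s.owner       = 'TSUTTON'
   AND s.table_name  = 'FILE_HISTORY'
   AND s.column_name = 'PREF';
```

Note that the exact count itself requires a full scan of the table, so this is a check to run occasionally on important columns rather than after every statistics job.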

AUTO_SAMPLE_SIZE
These results don't look much different than in Oracle 10.2, except for the AUTO_SAMPLE_SIZE test (a very important exception). The commands for this test were:

    SQL> EXECUTE dbms_stats.gather_table_stats (ownname=>'TSUTTON', tabname=>'FILE_HISTORY', estimate_percent=>dbms_stats.auto_sample_size, cascade=>true)
    SQL> EXECUTE dbms_stats.gather_table_stats (ownname=>'TSUTTON', tabname=>'PROP_CAT', estimate_percent=>dbms_stats.auto_sample_size, cascade=>true)

Or, alternatively:

    SQL> EXECUTE dbms_stats.gather_table_stats (ownname=>'TSUTTON', tabname=>'FILE_HISTORY', cascade=>true)
    SQL> EXECUTE dbms_stats.gather_table_stats (ownname=>'TSUTTON', tabname=>'PROP_CAT', cascade=>true)

When this option is used, the time taken to gather the statistics is just slightly more than the time for a 10% estimate_percent, but the results are astounding. The distinct value counts are nearly as accurate as for a 100% estimate_percent! This is a result of a new sampling algorithm used in Oracle 11.2. The mathematics are a bit complex to detail here (and you may well understand them better than I do), but if you're interested, a paper on the topic by Amit Poddar is available (at jonathanlewis.files.wordpress.com/2011/12/one-pass-distinct-sampling.pdf).

To get the benefits of this new sampling algorithm, two things need to be done:

1. The parameter APPROXIMATE_NDV must be set to TRUE (this is its default value). To set it if it has been set otherwise, we use the DBMS_STATS.SET_GLOBAL_PREFS procedure:

    SQL> EXECUTE dbms_stats.set_global_prefs('APPROXIMATE_NDV','TRUE')

2. DBMS_STATS.AUTO_SAMPLE_SIZE must be used as the estimate_percent (this is its default value). To set it if it has been set otherwise:

    SQL> EXECUTE dbms_stats.set_global_prefs('ESTIMATE_PERCENT','DBMS_STATS.AUTO_SAMPLE_SIZE')

Given the speed and accuracy of using AUTO_SAMPLE_SIZE, this appears to be the option to use in Oracle 11.2!

Another benefit of using AUTO_SAMPLE_SIZE is that it enables you to do Incremental Statistics Gathering for partitioned tables. Incremental Statistics Gathering allows you to update the table's global statistics when a new partition is added by scanning only the new partition, not the entire table. To do this there are three requirements:

1. The INCREMENTAL value for the partitioned table is TRUE (the default setting is FALSE).

    SQL> EXECUTE dbms_stats.set_table_prefs (<owner>, <table_name>, 'INCREMENTAL', 'TRUE')

2. The PUBLISH value for the partitioned table is TRUE (which is the default).

    SQL> EXECUTE dbms_stats.set_table_prefs (<owner>, <table_name>, 'PUBLISH', 'TRUE')

3. AUTO_SAMPLE_SIZE is used for estimate_percent and AUTO (which is the default) is used for granularity when gathering statistics on the table.

    SQL> EXECUTE dbms_stats.gather_table_stats (<owner>, <table_name>, estimate_percent=>dbms_stats.auto_sample_size, granularity=>'AUTO')

Incremental Statistics Gathering can save you a lot of time when the new partition is introduced because the old partitions do not need to be scanned to update global statistics on the table.
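Put together, the three requirements might look like the sketch below. The table name SALES_HISTORY is hypothetical (neither of our test tables is partitioned), so treat the names as placeholders.

```sql
BEGIN
  -- 1. Turn on incremental maintenance of global statistics for the table.
  dbms_stats.set_table_prefs('TSUTTON', 'SALES_HISTORY', 'INCREMENTAL', 'TRUE');

  -- 2. PUBLISH is TRUE by default; set it explicitly in case it was changed.
  dbms_stats.set_table_prefs('TSUTTON', 'SALES_HISTORY', 'PUBLISH', 'TRUE');

  -- 3. Gather with AUTO_SAMPLE_SIZE and AUTO granularity.
  dbms_stats.gather_table_stats(
    ownname          => 'TSUTTON',
    tabname          => 'SALES_HISTORY',
    estimate_percent => dbms_stats.auto_sample_size,
    granularity      => 'AUTO');
END;
/

-- Confirm the table-level setting took effect.
SELECT dbms_stats.get_prefs('INCREMENTAL', 'TSUTTON', 'SALES_HISTORY') AS incremental
  FROM dual;
```

Once the preferences are stored, the regular gathering job needs no changes; only the final GATHER_TABLE_STATS call has to be repeated after each new partition is loaded.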

Histograms
Histograms on columns which have a lot of skew can greatly benefit your query performance. Consider a table which lists the subscribers of a women's magazine. Let's call the table SUBSCRIBERS, and say that it has a column, GENDER, which holds the gender of the subscriber. And let's say that 99% of the rows have a GENDER value of F, and the rest of the rows M. If you had a query like

    SQL> select address from SUBSCRIBERS where GENDER = 'F'

you would clearly want to do a full table scan, since you need 99% of the rows. But if the query were

    SQL> select address from SUBSCRIBERS where GENDER = 'M'

you would want to do an indexed lookup using an index on GENDER. But basic statistics will only show that the GENDER column has 2 distinct values, and the optimizer assumes that each value is in half the rows. A histogram on GENDER will show the optimizer that the F values are extremely common and M values rare. With the histogram, the execution plan for the first query will use a full table scan and the plan for the second will use an indexed lookup.

We will test a few possibilities for method_opt, and check the results. First, a common traditional way of gathering histograms is to use FOR ALL INDEXED COLUMNS. Let's forget for the moment that you won't necessarily want histograms on every indexed column, and you might want them on a non-indexed column.

    begin
      dbms_stats.gather_table_stats(
        ownname=>'TSUTTON', tabname=>'FILE_HISTORY',
        method_opt=>'for all indexed columns size 254', cascade=>true);
    end;
    /
    Elapsed: 00:00:16.19

    begin
      dbms_stats.gather_table_stats(
        ownname=>'TSUTTON', tabname=>'PROP_CAT',
        method_opt=>'for all indexed columns size 254', cascade=>true);
    end;
    /
    Elapsed: 00:00:36.41

    SQL> @tc

    TABLE_NAME   COLUMN_NAME  DATA_TYPE NUM_DISTINCT Sample Size LAST_ANALYZED BUCKETS
    ------------ ------------ --------- ------------ ----------- ------------- -------
    FILE_HISTORY FILE_ID      NUMBER         1951673        7260 13:28:51          254
                 FNAME        VARCHAR2       1944064        7260 13:28:51          254
                 STATE_NO     NUMBER               6        7260 13:28:51            3
                 FILE_TYPE    NUMBER               7        7260 13:28:51            7
                 PREF         VARCHAR2         65192        5405 13:28:51          254
                 CREATE_DATE  DATE
                 TRACK_ID     NUMBER
                 SECTOR_ID    NUMBER
                 TEAMS        NUMBER
                 BYTE_SIZE    NUMBER
                 START_DATE   DATE
                 END_DATE     DATE
                 LAST_UPDATE  DATE
                 CONTAINERS   NUMBER
    PROP_CAT     LINENUM      NUMBER
                 LOOKUPID     VARCHAR2         41004        5491 13:29:08          254
                 EXTID        VARCHAR2      11486321        5491 13:29:08          254
                 SOLD         NUMBER               1        5491 13:29:08            1
                 CATEGORY     VARCHAR2
                 NOTES        VARCHAR2
                 DETAILS      VARCHAR2
                 PROPSTYLE    VARCHAR2         48576        5491 13:29:08          254

This statistics gathering took 53 seconds. But we have no statistics on non-indexed columns. And do we really need a histogram on FILE_ID, which is the primary key of FILE_HISTORY? Let's try the SKEWONLY option, which gathers histograms on columns based on their data distribution.

    begin
      dbms_stats.gather_table_stats(
        ownname=>'TSUTTON', tabname=>'FILE_HISTORY',
        method_opt=>'for all columns size skewonly', cascade=>true);
    end;
    /
    Elapsed: 00:00:14.55

    begin
      dbms_stats.gather_table_stats(
        ownname=>'TSUTTON', tabname=>'PROP_CAT',
        method_opt=>'for all columns size skewonly', cascade=>true);
    end;
    /
    Elapsed: 00:00:32.97

    SQL> @tc

    TABLE_NAME   COLUMN_NAME  DATA_TYPE NUM_DISTINCT Sample Size LAST_ANALYZED BUCKETS
    ------------ ------------ --------- ------------ ----------- ------------- -------
    FILE_HISTORY FILE_ID      NUMBER         1951673     1951673 13:37:15            1
                 FNAME        VARCHAR2       1944064        7390 13:37:15          254
                 STATE_NO     NUMBER               6        7390 13:37:15            2
                 FILE_TYPE    NUMBER               7        7390 13:37:15            7
                 PREF         VARCHAR2         65192        5555 13:37:15          254
                 CREATE_DATE  DATE            946688     1951673 13:37:15            1
                 TRACK_ID     NUMBER               9        7390 13:37:15            9
                 SECTOR_ID    NUMBER               6        7390 13:37:15            6
                 TEAMS        NUMBER            1206        6083 13:37:15          254
                 BYTE_SIZE    NUMBER               0             13:37:15            0
                 START_DATE   DATE                 0             13:37:15            0
                 END_DATE     DATE                 0             13:37:15            0
                 LAST_UPDATE  DATE            828672     1951673 13:37:15            1
                 CONTAINERS   NUMBER            1206        6086 13:37:15          254
    PROP_CAT     LINENUM      NUMBER        11460608    11486321 13:37:29            1
                 LOOKUPID     VARCHAR2         41004        5410 13:37:29          254
                 EXTID        VARCHAR2      11486321        5410 13:37:29          254
                 SOLD         NUMBER               1        5410 13:37:29            1
                 CATEGORY     VARCHAR2           843        5410 13:37:29          254
                 NOTES        VARCHAR2             0             13:37:29            0
                 DETAILS      VARCHAR2           873        5410 13:37:29          254
                 PROPSTYLE    VARCHAR2         48576        5410 13:37:29          254

This test took 48 seconds, and histograms were built on more columns. Now let's try it with FOR ALL COLUMNS SIZE AUTO, which is the default.

    begin
      dbms_stats.gather_table_stats(
        ownname=>'TSUTTON', tabname=>'FILE_HISTORY',
        method_opt=>'for all columns size auto', cascade=>true);
    end;
    /
    Elapsed: 00:00:13.46

    begin
      dbms_stats.gather_table_stats(
        ownname=>'TSUTTON', tabname=>'PROP_CAT',
        method_opt=>'for all columns size auto', cascade=>true);
    end;
    /
    Elapsed: 00:00:31.51

    SQL> @tc

    TABLE_NAME   COLUMN_NAME  DATA_TYPE NUM_DISTINCT Sample Size LAST_ANALYZED BUCKETS
    ------------ ------------ --------- ------------ ----------- ------------- -------
    FILE_HISTORY FILE_ID      NUMBER         1951673     1951673 13:56:56            1
                 FNAME        VARCHAR2       1944064     1951673 13:56:56            1
                 STATE_NO     NUMBER               6     1951673 13:56:56            1
                 FILE_TYPE    NUMBER               7     1951673 13:56:56            1
                 PREF         VARCHAR2         65192     1448208 13:56:56            1
                 CREATE_DATE  DATE            946688     1951673 13:56:56            1
                 TRACK_ID     NUMBER               9     1951673 13:56:56            1
                 SECTOR_ID    NUMBER               6     1951673 13:56:56            1
                 TEAMS        NUMBER            1206     1605130 13:56:56            1
                 BYTE_SIZE    NUMBER               0             13:56:56            0
                 START_DATE   DATE                 0             13:56:56            0
                 END_DATE     DATE                 0             13:56:56            0
                 LAST_UPDATE  DATE            828672     1951673 13:56:56            1
                 CONTAINERS   NUMBER            1206     1605706 13:56:56            1
    PROP_CAT     LINENUM      NUMBER        11460608    11486321 13:58:26            1
                 LOOKUPID     VARCHAR2         41004    11486321 13:58:26            1
                 EXTID        VARCHAR2      11486321    11486321 13:58:26            1
                 SOLD         NUMBER               1    11486321 13:58:26            1
                 CATEGORY     VARCHAR2           843    11486321 13:58:26            1
                 NOTES        VARCHAR2             0             13:58:26            0
                 DETAILS      VARCHAR2           873    11486321 13:58:26            1
                 PROPSTYLE    VARCHAR2         48576    11486321 13:58:26            1

This test took 45 seconds, so it's a bit faster, but it didn't gather any histograms. FOR ALL COLUMNS SIZE AUTO collects histograms based on data distribution and workload. When there's been no workload, there won't be any histograms. So let's do some queries.

    select count(*) from file_history where file_type = 5;
    select count(*) from file_history where file_type = 4;
    select count(*) from file_history where fname = 'SOMETHING';
    select count(*) from file_history where state_no = 999;

Let's test again:

    begin
      dbms_stats.gather_table_stats(
        ownname=>'TSUTTON', tabname=>'FILE_HISTORY',
        method_opt=>'for all columns size auto', cascade=>true);
    end;
    /
    Elapsed: 00:00:13.63

    begin
      dbms_stats.gather_table_stats(
        ownname=>'TSUTTON', tabname=>'PROP_CAT',
        method_opt=>'for all columns size auto', cascade=>true);
    end;
    /
    Elapsed: 00:00:31.34

    SQL> @tc

    TABLE_NAME   COLUMN_NAME  DATA_TYPE NUM_DISTINCT Sample Size LAST_ANALYZED BUCKETS
    ------------ ------------ --------- ------------ ----------- ------------- -------
    FILE_HISTORY FILE_ID      NUMBER         1951673     1951673 14:08:26            1
                 FNAME        VARCHAR2       1944064        5515 14:08:26          254
                 STATE_NO     NUMBER               6        5515 14:08:26            1
                 FILE_TYPE    NUMBER               7        5515 14:08:26            7
                 PREF         VARCHAR2         65192     1448208 14:08:26            1
                 CREATE_DATE  DATE            946688     1951673 14:08:26            1
                 TRACK_ID     NUMBER               9     1951673 14:08:26            1
                 SECTOR_ID    NUMBER               6     1951673 14:08:26            1
                 TEAMS        NUMBER            1206     1605130 14:08:26            1
                 BYTE_SIZE    NUMBER               0             14:08:26            0
                 START_DATE   DATE                 0             14:08:26            0
                 END_DATE     DATE                 0             14:08:26            0
                 LAST_UPDATE  DATE            828672     1951673 14:08:26            1
                 CONTAINERS   NUMBER            1206     1605706 14:08:26            1
    PROP_CAT     LINENUM      NUMBER        11460608    11486321 14:08:51            1
                 LOOKUPID     VARCHAR2         41004    11486321 14:08:51            1
                 EXTID        VARCHAR2      11486321    11486321 14:08:51            1
                 SOLD         NUMBER               1    11486321 14:08:51            1
                 CATEGORY     VARCHAR2           843    11486321 14:08:51            1
                 NOTES        VARCHAR2             0             14:08:51            0
                 DETAILS      VARCHAR2           873    11486321 14:08:51            1
                 PROPSTYLE    VARCHAR2         48576    11486321 14:08:51            1

We now get histograms on FNAME and FILE_TYPE. But we don't get any on STATE_NO. Let's look at the column's data:

    select STATE_NO, COUNT(*) from FILE_HISTORY group by STATE_NO;

      STATE_NO   COUNT(*)
    ---------- ----------
             0         95
            20        569
            30    1950957
            40         39
           999          4
          9999          9

That certainly looks like a candidate for a histogram. Let's try it again now that we've queried that column again.

    begin
      dbms_stats.gather_table_stats(
        ownname=>'TSUTTON', tabname=>'FILE_HISTORY', cascade=>true);
    end;
    /

    TABLE_NAME   COLUMN_NAME  DATA_TYPE NUM_DISTINCT Sample Size LAST_ANALYZED BUCKETS
    ------------ ------------ --------- ------------ ----------- ------------- -------
    FILE_HISTORY FILE_ID      NUMBER         1951673     1951673 14:17:07            1
                 FNAME        VARCHAR2       1944064        5416 14:17:07          254
                 STATE_NO     NUMBER               6        5416 14:17:07            3
                 FILE_TYPE    NUMBER               7        5416 14:17:07            7
                 PREF         VARCHAR2         65192     1448208 14:17:07            1
                 CREATE_DATE  DATE            946688     1951673 14:17:07            1
                 TRACK_ID     NUMBER               9     1951673 14:17:07            1
                 SECTOR_ID    NUMBER               6     1951673 14:17:07            1
                 TEAMS        NUMBER            1206     1605130 14:17:07            1
                 BYTE_SIZE    NUMBER               0             14:17:07            0
                 START_DATE   DATE                 0             14:17:07            0
                 END_DATE     DATE                 0             14:17:07            0
                 LAST_UPDATE  DATE            828672     1951673 14:17:07            1
                 CONTAINERS   NUMBER            1206     1605706 14:17:07            1

Now the STATE_NO column has a histogram. This result from using FOR ALL COLUMNS SIZE AUTO is interesting. While it gathers stats for each column, it doesn't collect histograms unless the column has been part of a predicate in a query. The FOR ALL COLUMNS SIZE SKEWONLY option seems to have done a better job, at least until the columns with skew are queried.

This shows one of the risks of relying on statistics in a new database (or new table) when the initial statistics were gathered before any application activity had run against the tables. It might be better to gather statistics using SKEWONLY until your application has run for a while, and then switch to AUTO. Another option would be to use FOR ALL COLUMNS SIZE AUTO, followed by a job which collects histograms on columns that are known to be skewed and are critical to query performance. You can see when columns have been used in query predicates with the SYS.COL_USAGE$ table.
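As a rough sketch of that last approach, the first query below lists the recorded predicate usage for the columns of FILE_HISTORY, and the block after it gathers a histogram on one column we know to be skewed. It assumes you have access to the SYS-owned tables COL_USAGE$ and COL$ (for example through SELECT ANY DICTIONARY); the choice of STATE_NO and a SIZE of 254 is ours, for illustration only.

```sql
-- Column usage is buffered in memory; flushing the monitoring info
-- is commonly used to make COL_USAGE$ current before querying it.
EXECUTE dbms_stats.flush_database_monitoring_info

-- Which columns of FILE_HISTORY have appeared in query predicates?
SELECT o.object_name AS table_name,
       c.name        AS column_name,
       u.equality_preds,
       u.range_preds,
       u.like_preds,
       u.timestamp   AS last_used
  FROM sys.col_usage$ u,
       dba_objects    o,
       sys.col$       c
 WHERE o.object_id   = u.obj#
   AND c.obj#        = u.obj#
   AND c.intcol#     = u.intcol#
   AND o.owner       = 'TSUTTON'
   AND o.object_name = 'FILE_HISTORY'
   AND o.object_type = 'TABLE'
 ORDER BY c.name;

-- Follow-up job: gather a histogram only on a column known to be skewed.
BEGIN
  dbms_stats.gather_table_stats(
    ownname    => 'TSUTTON',
    tabname    => 'FILE_HISTORY',
    method_opt => 'FOR COLUMNS STATE_NO SIZE 254');
END;
/
```

Because the follow-up call names a single column, it regathers column statistics for that column only; the statistics gathered earlier on the other columns are left in place.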

Miscellaneous
Four procedures in DBMS_STATS are important for setting options and also add a lot of flexibility:
SET_TABLE_PREFS
SET_SCHEMA_PREFS
SET_DATABASE_PREFS
SET_GLOBAL_PREFS

As we've seen in a couple of examples above, these procedures allow us to set parameter preferences on a more universal basis than just for a single statistics gathering job. They allow us to set the following preferences:
AUTOSTATS_TARGET (GLOBAL ONLY)
CONCURRENT (GLOBAL ONLY)
CASCADE
DEGREE
ESTIMATE_PERCENT
METHOD_OPT
NO_INVALIDATE
GRANULARITY
PUBLISH
INCREMENTAL
STALE_PERCENT
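Before changing any of these preferences it can help to see what they are currently set to. The following is a minimal sketch that uses the DBMS_STATS.GET_PREFS function, which is not discussed elsewhere in this paper. Called with only a preference name, it should return the global value. The list of names is simply the list above.

```sql
set serveroutput on

declare
  -- Preference names taken from the list above.
  type name_list is table of varchar2(30);
  prefs name_list := name_list(
    'AUTOSTATS_TARGET', 'CONCURRENT', 'CASCADE', 'DEGREE',
    'ESTIMATE_PERCENT', 'METHOD_OPT', 'NO_INVALIDATE', 'GRANULARITY',
    'PUBLISH', 'INCREMENTAL', 'STALE_PERCENT');
begin
  for i in 1 .. prefs.count loop
    -- With no owner or table name supplied, GET_PREFS reports the global setting.
    dbms_output.put_line(rpad(prefs(i), 20) || dbms_stats.get_prefs(prefs(i)));
  end loop;
end;
/
```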


SET_GLOBAL_PREFS and SET_DATABASE_PREFS allow us to set the values of preferences for all database objects, while SET_SCHEMA_PREFS and SET_TABLE_PREFS can override those settings for specific schemas or objects.

CONCURRENT and AUTOSTATS_TARGET can only be set with SET_GLOBAL_PREFS. CONCURRENT allows you to run multiple statistics gathering jobs simultaneously to speed up the work. AUTOSTATS_TARGET allows you to restrict the statistics gathering job that runs in the nightly maintenance window to the data dictionary tables, in case you don't want statistics gathered on your application schemas at that time.

SET_DATABASE_PREFS calls SET_TABLE_PREFS for each table in the database. It doesn't affect tables created after it is run; their settings will remain at the global defaults.

SET_SCHEMA_PREFS and SET_TABLE_PREFS add flexibility to the statistics gathering process. If you want to use a setting other than the default for just one table, setting it with SET_TABLE_PREFS allows you to do this without having to schedule multiple non-standard statistics gathering jobs. For instance, if you want to gather histogram data for one table on the same columns that have had histograms gathered in the past, you can use exec dbms_stats.set_table_prefs (<owner>, <table_name>, 'METHOD_OPT', 'FOR ALL COLUMNS SIZE REPEAT'), and you don't have to change the default statistics gathering job.

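As a rough illustration of how the levels fit together, the sketch below sets one preference at each of three levels. The TSUTTON schema and FILE_HISTORY table come from the earlier examples; the STALE_PERCENT value of 5 is an arbitrary assumption chosen only to show the call, and GET_PREFS and DELETE_TABLE_PREFS are DBMS_STATS routines not covered in the text above.

```sql
-- Global: have the nightly maintenance job gather stats only on
-- Oracle-owned (data dictionary) objects.
exec dbms_stats.set_global_prefs('AUTOSTATS_TARGET', 'ORACLE');

-- Schema: applies to the tables that exist in the schema at the time of the call.
-- (The value 5 is illustrative only.)
exec dbms_stats.set_schema_prefs('TSUTTON', 'STALE_PERCENT', '5');

-- Table: keep gathering histograms on the columns that already have them.
exec dbms_stats.set_table_prefs('TSUTTON', 'FILE_HISTORY', 'METHOD_OPT', 'FOR ALL COLUMNS SIZE REPEAT');

-- Check which value is in effect for this particular table.
select dbms_stats.get_prefs('METHOD_OPT', 'TSUTTON', 'FILE_HISTORY') as method_opt
from   dual;

-- Remove the table-level override so the table falls back to the global default.
exec dbms_stats.delete_table_prefs('TSUTTON', 'FILE_HISTORY', 'METHOD_OPT');
```

With the table-level preference in place, a plain gather_table_stats call that doesn't specify method_opt picks up the REPEAT setting for FILE_HISTORY, while every other table continues to use the default.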
One parameter which is not related to the performance of gathering statistics, but may be critical to database performance, is no_invalidate. This parameter determines whether cursors which are currently in the shared pool are invalidated when new statistics are gathered.

If no_invalidate is TRUE, cursors dependent on the tables on which statistics are gathered are not invalidated upon stats gathering. This means that queries already in the shared pool will not use the new statistics until they need to be hard parsed for some other reason, such as being executed after being aged out of the shared pool.

If no_invalidate is FALSE, then all the dependent cursors are invalidated immediately. This can have the unpleasant side effect that lots of cursors need hard parses all at once. If this happens during a busy part of the day there could be lots of latching waits and system performance could suffer greatly.

The other possible setting for no_invalidate is AUTO_INVALIDATE (which is the default). When this is set, dependent cursors are not all invalidated at once. Over a period of several hours the cursors are gradually invalidated, so they're not all doing a hard parse at once. This is a good solution if the statistics gathering is routine, and you don't have queries that need the new statistics immediately to resolve performance problems that arose from having inaccurate or out-of-date stats.

Another preference that can be used when gathering stats is PUBLISH. When this preference has the value TRUE, the stats will be published in the data dictionary immediately after the stats gathering job is complete. When it is FALSE, the stats are not published immediately but are held as pending statistics. The parameter optimizer_use_pending_statistics can then be set to TRUE in individual sessions, which can test queries using these statistics. Once the testing is done you can publish the stats using the PUBLISH_PENDING_STATS procedure (e.g., exec dbms_stats.publish_pending_stats (<owner>,<table_name>)) or delete them using the DELETE_PENDING_STATS procedure (e.g., exec dbms_stats.delete_pending_stats (<owner>,<table_name>)).
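Put together, a pending-statistics test cycle might look like the sketch below. It again assumes the TSUTTON.FILE_HISTORY table from the earlier examples. The test query is a hypothetical stand-in for whatever statements you want to check, and publishing with no_invalidate => false is only one possible choice, shown here for the case where the new stats are needed right away.

```sql
-- 1. Hold newly gathered stats for this table as pending rather than publishing them.
exec dbms_stats.set_table_prefs('TSUTTON', 'FILE_HISTORY', 'PUBLISH', 'FALSE');

-- 2. Gather stats; the optimizer keeps using the currently published stats.
exec dbms_stats.gather_table_stats('TSUTTON', 'FILE_HISTORY', cascade => true);

-- 3. Confirm that pending stats exist.
select table_name, num_rows, last_analyzed
from   dba_tab_pending_stats
where  owner = 'TSUTTON'
and    table_name = 'FILE_HISTORY';

-- 4. Try the pending stats out in this session only.
alter session set optimizer_use_pending_statistics = true;

-- Hypothetical test query; substitute the statements you care about
-- and compare their execution plans.
select count(*) from tsutton.file_history where state_no = 40;

alter session set optimizer_use_pending_statistics = false;

-- 5a. Happy with the results: publish, invalidating dependent cursors
--     immediately so the new stats take effect at once.
exec dbms_stats.publish_pending_stats('TSUTTON', 'FILE_HISTORY', no_invalidate => false);

-- 5b. Not happy: discard the pending stats instead.
-- exec dbms_stats.delete_pending_stats('TSUTTON', 'FILE_HISTORY');

-- 6. Return the table to normal behavior.
exec dbms_stats.set_table_prefs('TSUTTON', 'FILE_HISTORY', 'PUBLISH', 'TRUE');
```

If the statistics gathering is routine, leaving no_invalidate at its default in step 5a spreads the resulting hard parses out over time instead of triggering them all at once.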

Summary
We've looked into the DBMS_STATS procedures and some of their options, focusing on performance and accuracy. From the tests we've determined:
1. In Oracle 11.2, using estimate_percent=>DBMS_STATS.AUTO_SAMPLE_SIZE with cascade=>TRUE is extremely efficient, both in speed of gathering statistics and in accuracy of those statistics, because of a new algorithm using APPROXIMATE_NDV.
2. AUTO_SAMPLE_SIZE also has benefits for Incremental Statistics Gathering on partitioned tables.
3. When gathering statistics on tables that haven't been under application load, it may be best to gather stats using method_opt=>'FOR ALL COLUMNS SIZE SKEWONLY', then switch to 'FOR ALL COLUMNS SIZE AUTO' after normal database activity has taken place on the tables.
4. We discussed the use of some additional features of DBMS_STATS, such as AUTO_INVALIDATE and PUBLISH.


About the author


Terry Sutton, OCP, has been an Oracle DBA for 19 years, and has worked in the information technology area for more than 25 years. Since 2000, Terry has been a consultant at Database Specialists, where he is the Director of Managed Services. He performs duties ranging from production database administration to emergency troubleshooting with a particular focus on Oracle database performance tuning and remote database administration. He has been a speaker at the RMOUG, NoCOUG, and IOUG-Live conferences, as well as the Hotsos Symposium and various local user groups. You may contact Terry by email at tsutton@dbspecialists.com

About Database Specialists, Inc.


Database Specialists, Inc. provides remote database administrator (DBA) services and onsite database support for your mission critical Oracle systems. Since 1995, we have been providing Oracle database consulting in Solaris, HP-UX, Linux, AIX, and Windows environments. We are DBAs, speakers, educators, and authors. Our team is continually recognized by Oracle, at national conferences and by leading trade publications. Learn more about our remote DBA support, Oracle database administration, and dba outsourcing services by visiting our website, or call us at 415-344-0500 or 888-648-0500.

REFERENCES
What's Up With dbms_stats? (http://www.dbspecialists.com/files/presentations/dbms_stats.html)
One Pass Distinct Sampling (jonathanlewis.files.wordpress.com/2011/12/one-pass-distinctsampling.pdf)
Preserving Statistics During Export/Import (http://www.dbspecialists.com/blog/databasetools/preserving-statistics-during-export-import)
Understanding Optimizer Statistics (http://www.oracle.com/ocom/groups/public/@otn/documents/webcontent/1354477.pdf)
Best Practices for Gathering Optimizer Statistics (http://www.oracle.com/technetwork/database/bi-datawarehousing/twp-bp-optimizer-stats04042012-1577139.pdf)
