
Teradata Performance Query Efficiency

Query Efficiency Guide


TABLE OF CONTENTS
1 INTRODUCTION
1.1 OVERVIEW
1.2 AUDIENCE
2 TIPS TO IMPROVE QUERY EFFICIENCY
2.1 DATA RETRIEVAL (SELECT)
2.1.1 Explain the SQL
2.1.2 Join columns of same data type
2.1.3 Avoid Manipulated Columns in Where clauses
2.1.4 Don't embed more than 50 values in a query
2.1.5 Use Date Functions where possible
2.1.6 Test on smaller sample tables
2.1.7 Don't Select it if you don't need it
2.1.8 Use 'Union All' instead of just 'Union'
2.2 DATA MAINTENANCE (INSERT/UPDATE/DELETE)
2.2.1 Delete all rows from a Table rather than Drop a Table
2.2.2 Collect Statistics
2.2.3 Insert Select Rather than Update
2.2.4 Remove Secondary Indexes when the data is being loaded, updated or deleted
2.2.5 Create Tables which are appropriate to requirements
2.3 GENERAL
2.3.1 Terminate Queries which are not needed
2.3.2 Run jobs outside peak hours
3 UNDERSTANDING CPU
3.1 IDENTIFYING CPU USAGE


1 Introduction
1.1 Overview
This document identifies examples of good practice, adherence to which would help to reduce unnecessary use of resources. These tips result from observations of actual use made of the Teradata system within Barclays. Some of the problems created have enormous impact on the response times of other queries on the machine, whilst others may only affect the response of the query causing the problem.

1.2 Audience
The good practices identified in this document should be understood and adhered to by anyone managing, controlling, implementing or using a DWS database.


2 Tips to improve Query Efficiency


It is the responsibility of every person who uses Teradata to use it responsibly. If necessary, those found to be using Teradata irresponsibly will be reported to their Line Management for actions to be taken to curtail their use. The DWS have monitoring and controls in place, which alert us to possible problems. In particular, any session which accumulates 100,000 CPU secs of usage will be automatically terminated.

2.1 Data Retrieval (Select)


2.1.1 Explain the SQL
The Explain statement helps to identify potential performance issues: it analyses the SQL and breaks it down into its low-level processing steps. Unfortunately the output can be very difficult for the untrained person to interpret, but there are two points to look for: Confidence Level and Product Joins.

Confidence Level
Teradata attempts to predict the number of rows which will result at each stage in the processing, and qualifies the prediction with a confidence level as shown below:
No Confidence: normally means no stats.
Low Confidence: normally means stats are difficult to use precisely.
High Confidence: normally means the Optimiser is sure of the results based on the stats available.

Product Joins
A Product Join occurs when Teradata compares every row of the first table to every row of the second table. This process can use huge amounts of CPU and spool. Likely causes of Product Joins are:
When a join condition is based on an inequality or is missing altogether.
When an Alias has been used to identify a table, but the Alias has not been used consistently throughout the SQL. As a consequence Teradata believes that a reference is being made to another copy of the table, but no join condition is placed on that other copy, resulting in a Product Join.
Sometimes a Product Join is an appropriate option: when Teradata needs to compare only a small number of rows from one table to another, a Product Join can be the right choice. HOWEVER if the Stats on a Table are incorrect then Teradata may choose a Product Join when in fact it is the worst choice.
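The inconsistent-Alias case described above can be sketched as follows. The database, table and column names (MyDb.Accounts, MyDb.Transactions, Account_Id, Status_Code, Amount) are hypothetical and used only for illustration. In the Explain output, look for the phrases "product join" and "no confidence" against the steps that touch the large tables.

```sql
-- Problem: Accounts is given the Alias "a" in the FROM clause, but the last
-- condition refers to it by its full name. Teradata treats MyDb.Accounts as
-- a second copy of the table with no join condition on it.
EXPLAIN
SELECT a.Account_Id, t.Amount
FROM MyDb.Accounts a,
     MyDb.Transactions t
WHERE a.Account_Id = t.Account_Id
  AND MyDb.Accounts.Status_Code = 'ZZ';

-- Fix: use the Alias everywhere once it has been declared.
EXPLAIN
SELECT a.Account_Id, t.Amount
FROM MyDb.Accounts a,
     MyDb.Transactions t
WHERE a.Account_Id = t.Account_Id
  AND a.Status_Code = 'ZZ';
```

Prefixing a statement with EXPLAIN only shows the plan; the query itself is not run, so this check costs very little CPU.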

2.1.2 Join columns of same data type
Ideally the data types of the columns should match when joining data. If they do not:
the join is inefficient due to the conversion required;
Teradata is unable to compare the demographics of columns which are of different types, even if Statistics have been collected. As a consequence the way in which the join is performed may not be the best choice.
Depending on the sizes of the tables involved, it may be more efficient to load the data from one of the tables into a new table whose data types match.
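A minimal sketch of reloading one side of the join into a table with a matching data type is shown here. All names are hypothetical: it assumes MyDb.Accounts.Account_Id is an INTEGER, MyDb.Legacy_Payments.Account_Ref holds the same values as CHAR(10), and MyWorkDb is a database in which the user may create tables.

```sql
-- New table whose join column has the same type as MyDb.Accounts.Account_Id.
CREATE MULTISET TABLE MyWorkDb.Payments_Typed
( Account_Id INTEGER NOT NULL,
  Amount     DECIMAL(15,2)
)
PRIMARY INDEX (Account_Id);

-- Convert once, at load time, instead of on every join.
-- The CAST assumes every Account_Ref value is numeric; it fails otherwise.
INSERT INTO MyWorkDb.Payments_Typed (Account_Id, Amount)
SELECT CAST(Account_Ref AS INTEGER), Amount
FROM MyDb.Legacy_Payments;

-- Give the optimiser demographics for the new join column.
COLLECT STATISTICS ON MyWorkDb.Payments_Typed COLUMN Account_Id;

-- The join now compares INTEGER with INTEGER.
SELECT a.Account_Id, p.Amount
FROM MyDb.Accounts a
JOIN MyWorkDb.Payments_Typed p
  ON a.Account_Id = p.Account_Id;
```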

The same type of problem exists when a join is attempted on part of a column (e.g. when using Substring). Even if Statistics have been collected for the column, Teradata cannot know the distribution of values in the substring. The advice is to check the Explain output and, if you are concerned about the possible performance, load the data into a temporary table with a column for the substring, and remember to collect Statistics on the new column of the new table!

2.1.3 Avoid Manipulated Columns in Where clauses
As mentioned in the Collect Statistics section, statistics should exist on columns used in Where clauses to restrict the rows being returned (or in join conditions). When coding restrictions or joins, avoid manipulating the columns whenever possible, because the optimiser is unable to utilise the statistics on manipulated columns. For example, rather than code ColumnA - 28 < Date, code ColumnA < Date + 28.

2.1.4 Don't embed more than 50 values in a query
It is tempting to Cut and Paste values into SQL as shown in the example below:
Select * from BigTable Where column in (val1,val2,val3,..valn)
However, this is inefficient because Teradata cannot easily share this work across all its processors. Therefore it is better to insert the data into a table and code:
Where column in (select col from Tablename);

2.1.5 Use Date Functions where possible
There are a number of Date Functions available to help with the manipulation of columns that are defined as Dates. Use them rather than attempting to redefine the date as a character column and split it into its component parts. Teradata does this much more efficiently.

2.1.6 Test on smaller sample tables
The amount of CPU time used to process a query is dependent on a number of factors, but in many cases the CPU used will be proportional to the size of the table. So if (iterative) testing is conducted on a small extract of the Live table, the amount of CPU used will be reduced.
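Tips 2.1.3 to 2.1.6 can be sketched together as below. This is an illustration under assumed names only: MyDb.Accounts, MyWorkDb, Accounts_Sample, Wanted_Codes and the columns Account_Id, Status_Code and Open_Date are all hypothetical.

```sql
-- 2.1.6: build a small extract of the live table for iterative testing.
-- SAMPLE 10000 returns up to 10,000 randomly chosen rows.
CREATE MULTISET TABLE MyWorkDb.Accounts_Sample
( Account_Id  INTEGER NOT NULL,
  Status_Code CHAR(2),
  Open_Date   DATE
)
PRIMARY INDEX (Account_Id);

INSERT INTO MyWorkDb.Accounts_Sample (Account_Id, Status_Code, Open_Date)
SELECT Account_Id, Status_Code, Open_Date
FROM MyDb.Accounts
SAMPLE 10000;

-- 2.1.4: hold the wanted values in a table instead of a long IN list.
CREATE VOLATILE TABLE Wanted_Codes
( Status_Code CHAR(2) NOT NULL )
PRIMARY INDEX (Status_Code)
ON COMMIT PRESERVE ROWS;

INSERT INTO Wanted_Codes VALUES ('AA');
INSERT INTO Wanted_Codes VALUES ('ZZ');

-- 2.1.3 and 2.1.5: Open_Date is left unmanipulated in the Where clause;
-- the arithmetic and the Date Functions are applied to the other side of
-- each comparison, and EXTRACT replaces any character-based date splitting.
SELECT Account_Id,
       Open_Date,
       EXTRACT(YEAR FROM Open_Date) AS Open_Year
FROM MyWorkDb.Accounts_Sample
WHERE Status_Code IN (SELECT Status_Code FROM Wanted_Codes)
  AND Open_Date <  CURRENT_DATE - 28
  AND Open_Date >= ADD_MONTHS(CURRENT_DATE, -12);
```

Once the SQL behaves as expected on the sample, the same statement can be pointed at the Live table.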


2.1.7 Don't Select it if you don't need it
Processing will be more efficient if the SQL excludes rows through a specific clause in the Where condition, rather than relying on the Join condition to exclude them from the final result. For example, if some rows have all zeros in a column because they are a special case, don't rely on them failing the join condition to exclude them. In this particular case not only is it inefficient to compare the contents of all the rows containing zeros, but they can also skew the data: instead of being spread across all of the Teradata processors, the data is concentrated on a single processor and the query takes much longer to run.

2.1.8 Use 'Union All' instead of just 'Union'
When creating a Union of 2 sets of rows, the default form of the statement checks for and removes duplicate rows, which is unnecessary work if duplicates are acceptable or cannot occur. In the majority of situations it is known that duplicates cannot possibly exist, or if they do exist then it is correct to select them. Therefore in the majority of cases it is better to code Union All, which returns all rows without checking for duplicates.
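A short sketch of tips 2.1.7 and 2.1.8 follows. The tables and columns are hypothetical, and Customer_Id = 0 is assumed to be the special-case value that never matches a real customer.

```sql
-- 2.1.7: exclude the special-case rows explicitly in the Where clause
-- instead of letting them fail the join. Without the last condition, every
-- row with Customer_Id = 0 is sent to the same processor for the join.
SELECT t.Txn_Id, c.Customer_Name
FROM MyDb.Transactions t
JOIN MyDb.Customers c
  ON t.Customer_Id = c.Customer_Id
WHERE t.Customer_Id <> 0;

-- 2.1.8: the two tables are assumed to hold different years, so the same
-- row cannot appear in both and no duplicate check is needed.
SELECT Txn_Id, Amount FROM MyDb.Transactions_2003
UNION ALL
SELECT Txn_Id, Amount FROM MyDb.Transactions_2004;
```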

2.2 Data Maintenance (Insert/Update/Delete)


Data maintenance code often includes a Select clause. See the Data Retrieval (Select) section for relevant tips.

2.2.1 Delete all rows from a Table rather than Drop a Table
When a Table needs to be emptied, and then reloaded, it is more efficient to Delete the rows in the table, rather than Drop the table and re-create it, as locks on the Dictionary are avoided.

2.2.2 Collect Statistics
When processing SQL that joins 2 or more tables, Teradata's choice of join plan is totally dependent on its knowledge of the values of the data in the columns referenced in the SQL. Efficiency of query plans will be greatly improved if Statistics on join columns are available and current. Statistics should be collected on:
primary index columns of every table, and also all known columns used in joins or restrictions in queries;
combinations of columns known to be frequently joined;
any column which features in WHERE conditions.
Statistics should normally be collected after the data has been loaded, or reloaded, or significantly updated. If the table changes so frequently that recollecting statistics every time would have a resource impact, then a threshold after which statistics will be collected must be identified.
If Statistics are not collected or are not current, and the wrong plan is used by Teradata, then many thousands of CPU secs can be used instead of a few hundred. The elapsed time of queries is frequently reduced from hours to minutes through judicious collection of statistics.
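The kinds of statistics listed above might be collected as in this sketch. The table MyDb.Transactions and its columns are hypothetical, and Txn_Id is assumed to be its primary index.

```sql
-- Primary index of the table.
COLLECT STATISTICS ON MyDb.Transactions INDEX (Txn_Id);

-- A column used in joins, and a column used in WHERE conditions.
COLLECT STATISTICS ON MyDb.Transactions COLUMN Customer_Id;
COLLECT STATISTICS ON MyDb.Transactions COLUMN Txn_Type;

-- A combination of columns that is frequently joined together.
COLLECT STATISTICS ON MyDb.Transactions COLUMN (Customer_Id, Txn_Date);

-- After a reload: refresh every statistic already defined on the table.
COLLECT STATISTICS ON MyDb.Transactions;

-- Check which statistics exist and when they were last collected.
HELP STATISTICS MyDb.Transactions;
```

The HELP STATISTICS output is a quick way to judge whether statistics are current before asking for them to be recollected.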

It is the responsibility of any user who suspects that (re)collection of statistics would improve the efficiency of their query to bring this to the attention of the DWS Support Desk (Serviced Managed Databases) or the object owner (User Managed Databases).
When developing applications that require the creation of new tables, consideration should be given to which columns are likely to benefit from having statistics collected on them. In particular, columns that could clearly be used to limit the number of rows returned should be identified and statistics collected. These will typically be 'types', 'codes' and 'flags', whereas columns such as 'names' are unlikely to benefit.

2.2.3 Insert Select Rather than Update
Teradata is particularly fast at inserting new rows into an empty table, and because of this it can be more efficient to use this technique rather than performing an Update that would affect a lot of rows in the table. e.g.:
Update OldTable set Columnc = Columnc+100;
can be replaced by:
Insert into NewTable Select Columna, Columnb, Columnc+100 from OldTable;
As a guide, the Insert Select is better if updating more than 20% of a table with more than 1 million rows.

2.2.4 Remove Secondary Indexes when the data is being loaded, updated or deleted
If a table requires a Secondary index, create the index after the data has been loaded into the table to ensure that the load process is completed as fast as possible. If an Update or Delete is being performed, remove all Secondary indexes before applying the change. Once the change is complete, re-create the Secondary indexes.

2.2.5 Create Tables which are appropriate to requirements
Since the Set table is the default form of Table, it is used by most people without thinking. However, a Volatile table is much better for temporarily storing data that is not required after the session has finished.
The reason is that the Volatile Table is the only table type whose creation does not take restrictive locks on the Dictionary. Listed below are the 4 different table types, with some of their characteristics:
Set Table: Duplicate rows not allowed. Teradata default.
Multiset Table: Duplicate rows allowed.
Volatile Table: Defined for duration of session only. Rows only exist for duration of transaction, unless Table definition includes On Commit preserve rows. Cannot collect statistics.
Global Temporary Table: Same as Volatile except definition is permanent, and data is deleted at end of session. Can collect statistics.
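As an illustration of the Volatile Table characteristics listed above, a session-only work table might be created as follows. The names Work_Totals and MyDb.Transactions are hypothetical.

```sql
-- Exists only for this session; creating it takes no restrictive
-- Dictionary locks. MULTISET is stated so that duplicate rows are allowed.
CREATE MULTISET VOLATILE TABLE Work_Totals
( Customer_Id  INTEGER NOT NULL,
  Total_Amount DECIMAL(18,2)
)
PRIMARY INDEX (Customer_Id)
ON COMMIT PRESERVE ROWS;   -- without this, rows vanish at end of transaction

INSERT INTO Work_Totals (Customer_Id, Total_Amount)
SELECT Customer_Id, SUM(Amount)
FROM MyDb.Transactions
GROUP BY Customer_Id;

-- Use the work table for the rest of the session.
SELECT Customer_Id, Total_Amount
FROM Work_Totals
WHERE Total_Amount > 10000;
```

If statistics are needed on the intermediate data, a Global Temporary Table is the type to consider instead, since the list above notes that statistics cannot be collected on a Volatile Table.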

The use of 'Create Table As' can be a useful development tool to create new tables based on existing table(s). You should be aware that Teradata will create the new table using defaults which might not be the way that the developer actually wishes the table to be structured, and the defaults may vary from release to release. DWS recommends that all users explicitly state all column attributes (NOT NULL etc.) and indexes (UNIQUE). 'Create Table As' should NOT be used when code is to be run repetitively (e.g. every month or every day). To continue using 'Create Table As' in this situation would require a Drop table, which would violate Commandment 1.
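For code that runs repetitively, the advice in sections 2.2.1 to 2.2.4 and the recommendation above can be combined roughly as follows. This is a sketch, not a prescribed DWS procedure, and every object name in it (MyWorkDb.Monthly_Balances, MyDb.Balances_Staging and their columns) is hypothetical.

```sql
-- One-off set-up: create the table once, stating every attribute explicitly
-- rather than relying on 'Create Table As' defaults.
CREATE MULTISET TABLE MyWorkDb.Monthly_Balances
( Account_Id INTEGER       NOT NULL,
  Month_End  DATE          NOT NULL,
  Balance    DECIMAL(18,2) NOT NULL
)
UNIQUE PRIMARY INDEX (Account_Id, Month_End);

CREATE INDEX (Month_End) ON MyWorkDb.Monthly_Balances;

-- Every run (e.g. monthly):
-- 2.2.4: remove the Secondary index before changing the data.
DROP INDEX (Month_End) ON MyWorkDb.Monthly_Balances;

-- 2.2.1: empty the table with a Delete instead of Drop and re-create.
DELETE FROM MyWorkDb.Monthly_Balances ALL;

-- 2.2.3: Insert Select into the now empty table.
INSERT INTO MyWorkDb.Monthly_Balances (Account_Id, Month_End, Balance)
SELECT Account_Id, Month_End, Balance
FROM MyDb.Balances_Staging;

-- 2.2.4: re-create the Secondary index once the load is complete.
CREATE INDEX (Month_End) ON MyWorkDb.Monthly_Balances;

-- 2.2.2: refresh statistics after the reload.
COLLECT STATISTICS ON MyWorkDb.Monthly_Balances COLUMN Month_End;
```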

2.3 General
2.3.1 Terminate Queries which are not needed
If you think a query is not performing as you expected you should consider aborting it. However, killing a job which is performing large Updates or Inserts on a non-empty table can cause rollbacks involving a lot of CPU. If in doubt, ask.

2.3.2 Run jobs outside peak hours
The Teradata system is less heavily used at night and at weekends. Using that spare capacity increases the availability of resources during the day.

3 Understanding CPU
To understand the scale of some CPU-related problems, it is important to understand the CPU resources that are available on the 5350 Teradata system. The Operational machine (NCR5350) has 24 Nodes, and each Node has 2 CPUs. Therefore in any second there are 48 CPU secs of processing available to be shared by Users and System processes. As the number of queries in the machine increases, the CPU share of each gets smaller, and the time taken for each query to finish increases. Inefficient SQL takes more resources and longer to complete, and increases the congestion, impacting everyone!
To give an example of how much CPU can be used by a simple and efficient query on a large table:
select count(*) from t_bcard_restdb.bc_history where account_status_code = 'zz';


This resulted in a Table scan of Barclaycard's History table, which has 667 million rows and occupies 166GB. The query used 820 CPU secs, which is small compared with the amount of CPU that some SQL will use!
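To put these figures in proportion, the arithmetic can be run as a simple Select. It uses only the numbers quoted in this guide (24 Nodes, 2 CPUs per Node, 820 CPU secs, and the 100,000 CPU sec termination limit from section 2). The "if alone" results are an idealised assumption that the work is spread perfectly across every CPU with nothing else running, so real elapsed times will be longer.

```sql
SELECT 24 * 2                                AS Cpu_Secs_Per_Elapsed_Sec,    -- 48
       CAST(820 AS FLOAT) / (24 * 2)         AS Scan_Elapsed_Secs_If_Alone,  -- about 17 seconds
       CAST(100000 AS FLOAT) / (24 * 2) / 60 AS Limit_Minutes_If_Alone;      -- about 35 minutes
```

In other words, a session that reaches the termination limit has consumed the equivalent of the entire machine for more than half an hour.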

3.1 Identifying CPU usage


The amount of CPU used by each User can be found from A_Usage_base.UsageHistory, which is accessible by all users.
