Teradata
Concepts
Center of Excellence
Data Warehousing
Topics
Primary and Secondary Indexes
Join Processing
Join Indexes
Hash Indexes
Partitioned Primary Indexes
Collect Statistics
Priority Scheduler
Teradata Dual Active Server
Indexes
Teradata provides numerous indexing
options that can improve query
performance for different types of queries
and workloads. The following kinds of indexes
are available:
Primary Index
Secondary Indexes
Join Indexes
Hash Indexes
Partitioned Primary Indexes.
Primary Indexes
In Teradata, the Primary Index is the mechanism
that assigns and stores each data row on an AMP.
Because the primary index determines where rows
are stored, retrieving data by primary index is
very efficient.
A Primary Index can be Unique or Non-Unique.
Choosing the Primary Index is critical because it
determines how data is distributed across the
processing units (AMPs) and hence affects
performance.
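The effect of the primary index choice on distribution can be sketched in a few lines of Python. This is an illustration only: the MD5-based `amp_for` function and the AMP count of 4 are assumptions standing in for Teradata's own hashing algorithm and hash map, which are not described here.

```python
import hashlib
from collections import Counter

NUM_AMPS = 4  # assumed AMP count for the illustration


def amp_for(pi_value, num_amps=NUM_AMPS):
    """Map a primary index value to an AMP number with a stable stand-in hash."""
    digest = hashlib.md5(str(pi_value).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_amps


def rows_per_amp(pi_values, num_amps=NUM_AMPS):
    """Count how many rows each AMP would receive for the given PI values."""
    counts = Counter(amp_for(value, num_amps) for value in pi_values)
    return [counts.get(amp, 0) for amp in range(num_amps)]


# Unique PI values spread across the AMPs.
unique_pi = rows_per_amp(range(1000, 2000))
# A single repeated PI value sends every row to the same AMP (skew).
skewed_pi = rows_per_amp([300] * 1000)
```

Comparing `unique_pi` with `skewed_pi` shows why a primary index with few distinct values concentrates rows, and therefore work, on a small number of AMPs.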
NUPI
Good distribution for near unique values.
Duplicate PI rows go to same block. No extra
Secondary Indexes
Secondary Index values are stored in subtables.
A Secondary Index may be unique (USI) or non-unique (NUSI).
Teradata implements USI and NUSI differently.
[Figure: a secondary index value passes through the hash algorithm to the index subtable. Each subtable row holds an SI value and a base table (BT) Row ID that points into the base table.]
Secondary Indexes
USI subtable rows are hash distributed across all AMPs.
A subtable row may reside on an AMP other than
the one holding its base table row.
USI access therefore involves a two-AMP operation.
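The two-AMP access path can be sketched as follows. This is a simplified model, not Teradata's implementation: `amp_of` uses CRC32 as a stand-in hash, the row ID is reduced to an (AMP, PI value) pair, and the `Badge` column is a hypothetical USI column.

```python
import zlib

NUM_AMPS = 4  # assumed AMP count for the illustration


def amp_of(value):
    """Stand-in hash: pick the AMP that owns a given index value."""
    return zlib.crc32(str(value).encode("utf-8")) % NUM_AMPS


def build(rows, pi_col, usi_col):
    """Place base rows by PI hash and USI subtable rows by USI hash."""
    base = [dict() for _ in range(NUM_AMPS)]
    subtable = [dict() for _ in range(NUM_AMPS)]
    for row in rows:
        base_amp = amp_of(row[pi_col])
        row_id = (base_amp, row[pi_col])  # simplified row ID
        base[base_amp][row_id] = row
        subtable[amp_of(row[usi_col])][row[usi_col]] = row_id
    return base, subtable


def usi_lookup(base, subtable, usi_value):
    """Return the matching row and the list of AMPs visited."""
    sub_amp = amp_of(usi_value)             # step 1: AMP holding the subtable row
    row_id = subtable[sub_amp].get(usi_value)
    if row_id is None:
        return None, [sub_amp]
    base_amp = row_id[0]                    # step 2: AMP holding the base row
    return base[base_amp][row_id], [sub_amp, base_amp]


rows = [{"EmpNo": 1000 + i, "Badge": "B%d" % i} for i in range(20)]
base, subtable = build(rows, "EmpNo", "Badge")
row, amps_visited = usi_lookup(base, subtable, "B7")
```

The lookup visits at most two AMPs regardless of table size, which is what distinguishes USI access from an all-AMP operation.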
Multiple-column secondary indexes are less usable. Define multiple
secondary indexes to allow bit mapping.
[Figure: access paths compared for PI, USI, NUSI and FTS. Labels: PI Value, USI Value, NUSI Value, Hashing Algorithm, Sub Table, Base Table, Row Id.]
Join Processing
Each AMP performs join processing in
parallel.
The Optimizer chooses the best join strategy based
on:
Available indexes, and
Data demographics (Collect
Statistics/Dynamic Sampling).
Rows must be on the same AMP to be matched.
If they are not, Teradata temporarily moves the rows
to the same AMP for the join.
Join Processing
General Join Scenarios:
Join column is the PI of both tables.
Join column is the PI of one of the tables.
Join column is not the PI of either table.
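The three scenarios can be summarized with a small Python helper that reports which tables need their rows moved before matching. It is a sketch that models only redistribution by join column hash; the Optimizer may choose other preparations (for example, duplicating the smaller table), and the function name is hypothetical.

```python
def tables_to_move(join_col_a, pi_a, join_col_b, pi_b):
    """Return the tables whose rows must be moved so matching rows share an AMP.

    A table is already in place when its join column is its primary index,
    because its rows are then hashed on the join column already.
    """
    in_place = (("A", join_col_a == pi_a), ("B", join_col_b == pi_b))
    return [name for name, ok in in_place if not ok]


# Case 1: join column is the PI of both tables, so nothing moves.
case_1 = tables_to_move("DeptNo", "DeptNo", "DeptNo", "DeptNo")
# Case 2: join column is the PI of table B only, so table A moves.
case_2 = tables_to_move("DeptNo", "EmpNo", "DeptNo", "DeptNo")
# Case 3: join column is the PI of neither table, so both move.
case_3 = tables_to_move("DeptNo", "EmpNo", "DeptNo", "Location")
```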
Case 1 - Example
CREATE SET TABLE EMPLOYEE
(
EmpNo SMALLINT,
Name VARCHAR(12),
DeptNo SMALLINT,
JobTitle VARCHAR(12),
Salary DECIMAL(8,2),
DOB DATE
)
UNIQUE PRIMARY INDEX ( EmpNo );
Case 2 - Example
CREATE SET TABLE EMPLOYEE
(
EmpNo SMALLINT,
Name VARCHAR(12),
DeptNo SMALLINT,
JobTitle VARCHAR(12),
Salary DECIMAL(8,2),
DOB DATE
)
UNIQUE PRIMARY INDEX ( EmpNo );
Case 3 - Example
Join Strategies
Nested Join
Merge Join
Product Join
Nested Join
The Optimizer chooses this join strategy when the query has this form:
SELECT ...
FROM Table_1, Table_2
WHERE Table_1.Col1 = Table_2.<Any Index>
AND Table_1.<Unique Index> = <value>;
Example:
SELECT E.Name, D.DeptName
FROM Employee E, Department D
WHERE E.DeptNo = D.DeptNo
AND E.Name = 'Sandy M';
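A nested join can be pictured as two index lookups in sequence. The Python sketch below assumes a unique index on the employee name and an index on the department number; the dictionaries and the department name "Sales" are made up for illustration.

```python
def nested_join(left_unique_index, left_value, join_col, right_index):
    """Fetch one left row by unique index, then probe the right table's index."""
    left_row = left_unique_index.get(left_value)
    if left_row is None:
        return []
    matches = right_index.get(left_row[join_col], [])
    return [(left_row, right_row) for right_row in matches]


# Hypothetical index contents.
employee_by_name = {"Sandy M": {"Name": "Sandy M", "DeptNo": 300}}
department_by_deptno = {300: [{"DeptNo": 300, "DeptName": "Sales"}]}

pairs = nested_join(employee_by_name, "Sandy M", "DeptNo", department_by_deptno)
result = [(e["Name"], d["DeptName"]) for e, d in pairs]
```

Only the rows reached through the two indexes are touched, which is why this strategy suits queries that pin one table down to a single row.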
Merge Join
Commonly done when the join conditions are
based on equality.
Steps:
Identify the smaller table.
Put the qualifying rows from one or both tables into
spool.
Move the spool rows to the AMPs based on join column
hash (if required).
Sort the spool rows by join column hash value (if
necessary).
Compare those rows with matching join column hash
values.
Example: Case 1, Case 2 and Case 3 as described.
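The sort-and-compare part of these steps can be sketched in Python for the rows that end up on a single AMP. CRC32 stands in for the row hash, and the final equality check guards against two different values sharing a hash; both are assumptions of this sketch rather than a description of Teradata internals.

```python
import zlib


def row_hash(value):
    """Stand-in for the join column hash."""
    return zlib.crc32(str(value).encode("utf-8"))


def merge_join(left_rows, right_rows, left_col, right_col):
    """Join two row lists on equality by sorting on hash and scanning both once."""
    left = sorted(left_rows, key=lambda r: row_hash(r[left_col]))
    right = sorted(right_rows, key=lambda r: row_hash(r[right_col]))
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        lh = row_hash(left[i][left_col])
        rh = row_hash(right[j][right_col])
        if lh < rh:
            i += 1
        elif lh > rh:
            j += 1
        else:
            # Collect the run of rows with this hash on each side.
            i_end = i
            while i_end < len(left) and row_hash(left[i_end][left_col]) == lh:
                i_end += 1
            j_end = j
            while j_end < len(right) and row_hash(right[j_end][right_col]) == lh:
                j_end += 1
            for l_row in left[i:i_end]:
                for r_row in right[j:j_end]:
                    if l_row[left_col] == r_row[right_col]:
                        result.append((l_row, r_row))
            i, j = i_end, j_end
    return result
```

Because both inputs are in hash order, each side is read once, in contrast to a product join, which compares every pair of rows.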
Product Join
Most general form of join: A x B.
The Optimizer usually chooses a product join under
the following conditions:
The WHERE clause is missing.
The join condition is not based on
equality.
Steps:
Identify the smaller table.
Duplicate it in spool on all AMPs.
Join each spool row of the smaller table to
every row of the larger table.
[Figure: product join example. Employee rows (pairs such as 1005 / 300) on AMP 1, AMP 3 and AMP 4, the Department values (100 to 600) copied to the AMPs, and the join Result.]
Join Indexes
A Join Index is an index structure that stores and
maintains the results of joining two or more tables.
The Optimizer can resolve a query from the join index,
rather than performing the joins every time the query
is executed.
Teradata supports a variety of Join Indexes, such
as:
Multi-table Join Indexes
Single-table Join Indexes
Aggregate Join Indexes
Does the index cover the query?
Note: A join index defined with an outer join covers both inner join and outer join queries.
Sparse Indexes
A Sparse Index can be used to index a portion of a table.
CREATE JOIN INDEX OrderByCust AS
SELECT (C_Custkey, C_Name ,C_Address, O_Orderdate),
(O_TotalPrice)
FROM Customer INNER JOIN Ordertbl
ON C_CustKey = O_Custkey
WHERE O_Orderdate > DATE '2004-01-01'
PRIMARY INDEX(C_Custkey);
SELECT C_Name ,C_Address,
O_Orderdate ,O_TotalPrice
FROM Customer INNER JOIN Ordertbl
ON C_CustKey = O_Custkey
WHERE O_Orderdate BETWEEN DATE '2004-06-01' AND DATE '2004-12-31';
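Whether the sparse join index above can answer this query comes down to two checks: the query's columns must all be in the index, and the query's date range must fall inside the portion of the table the index holds. The Python sketch below is a deliberately simplified version of that reasoning; the Optimizer's actual coverage test is more involved, and the function name is hypothetical.

```python
from datetime import date

# Columns and predicate taken from the OrderByCust definition.
INDEX_COLUMNS = {"C_Custkey", "C_Name", "C_Address", "O_Orderdate", "O_TotalPrice"}
INDEX_LOWER_BOUND = date(2004, 1, 1)  # index holds rows with O_Orderdate > this


def sparse_index_covers(query_columns, query_low, query_high):
    """True when the columns and the date range both fit inside the index."""
    columns_ok = set(query_columns) <= INDEX_COLUMNS
    range_ok = query_low <= query_high and query_low > INDEX_LOWER_BOUND
    return columns_ok and range_ok


query_cols = ["C_Name", "C_Address", "O_Orderdate", "O_TotalPrice"]
covered = sparse_index_covers(query_cols, date(2004, 6, 1), date(2004, 12, 31))
too_early = sparse_index_covers(query_cols, date(2003, 6, 1), date(2004, 12, 31))
```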
As the join index covers the Employee part of the query, the Optimizer joins
the Department table with the join index instead of the Employee table itself.
Note: The Optimizer chose a full table scan of the Employee table instead of using the join index
because the existing join index EmpDept does not fully cover the Employee part of the query.
Does the index cover the query?
SELECT L_ShipDate, SUM(L_Quantity) AS SumQty
FROM Lineitem GROUP BY 1;
3) We do an all-AMPs SUM step to aggregate from TPCH.JIAggLineItem by
way of an all-rows scan with no residual conditions, and the
grouping identifier in field 1. Aggregate Intermediate Results
are computed globally, then placed in Spool 3. The input table
will not be cached in memory, but it is eligible for synchronized
scanning. The size of Spool 3 is estimated with no confidence to
be 491 rows. The estimated time for this step is 0.84 seconds.
4) We do an all-AMPs RETRIEVE step from Spool 3 (Last Use) by way of
an all-rows scan into Spool 1 (group_amps), which is built locally
on the AMPs. The size of Spool 1 is estimated with no confidence
to be 491 rows. The estimated time for this step is 0.04 seconds.
Hash Indexes
Hash indexes are index file structures that share properties with single-table join indexes and
secondary indexes.
Hash indexes are like single-table join indexes, but they automatically carry the base
table primary index value.
SELECT O_CustKey,
O_Orderdate,
O_Totalprice
FROM OrderTbl
WHERE O_CustKey > 12;
SELECT O_CustKey,
O_Orderdate,
O_Totalprice,
O_Orderstatus
FROM OrderTbl
WHERE O_CustKey > 12;
Hash Indexes
CREATE HASH INDEX HIOrder
(O_CustKey ,
O_OrderDate,
O_TotalPrice)
ON OrderTbl
BY (O_CustKey)
ORDER BY (O_CustKey);
SELECT C_Name, C_Address,
O_Orderdate, O_TotalPrice, O_Orderstatus
FROM Customer, Ordertbl
WHERE C_Custkey = O_Custkey
AND O_Custkey < 10;
Explain
Hash Indexes
5) We do an all-AMPs JOIN step from TPCH.Customer by way of an
all-rows scan with a condition of ("TPCH.Customer.C_CUSTKEY < 10"),
which is joined to TPCH.HIOrder with a range constraint of (
"TPCH.HIOrder.O_CUSTKEY <= 9") with an additional condition of (
"TPCH.HIOrder.O_CUSTKEY <= 9"). TPCH.Customer and TPCH.HIOrder
are joined using a product join, with a join condition of (
"TPCH.Customer.C_CUSTKEY = TPCH.HIOrder.O_CUSTKEY"). The input
table TPCH.HIOrder will not be cached in memory, but it is
eligible for synchronized scanning. The result goes into Spool 2
(all_amps), which is redistributed by hash code to all AMPs. Then
we do a SORT to order Spool 2 by row hash. The size of Spool 2 is
estimated with no confidence to be 3 rows. The estimated time for
this step is 0.10 seconds.
6) We do an all-AMPs JOIN step from Spool 2 (Last Use) by way of a
RowHash match scan, which is joined to TPCH.Ordertbl. Spool 2 and
TPCH.Ordertbl are joined using a merge join, with a join condition
of ("(Field_3 = (SUBSTRING((TPCH.Ordertbl.RowID) FROM 7 FOR 4 )))
AND (Field_2 =)"). The input table TPCH.Ordertbl will not be
cached in memory. The result goes into Spool 1 (group_amps),
which is built locally on the AMPs. The size of Spool 1 is
estimated with no confidence to be 3 rows. The estimated time for
this step is 0.05 seconds.
Hash Indexes
A Hash Index (or a Join Index) can also be used to avoid row
redistribution during join preparation.
Hash Indexes
Without Hash Index defined:
4) We do an all-AMPs RETRIEVE step from TPCH.Ordertbl by way of an
all-rows scan with no residual conditions into Spool 2 (all_amps),
which is redistributed by hash code to all AMPs. Then we do a
SORT to order Spool 2 by row hash. The result spool file will not
be cached in memory. The size of Spool 2 is estimated with high
confidence to be 60,000 rows. The estimated time for this step is
1.20 seconds.
5) We do an all-AMPs JOIN step from TPCH.Customer by way of a RowHash
match scan with no residual conditions, which is joined to Spool 2
(Last Use). TPCH.Customer and Spool 2 are joined using a merge
join, with a join condition of ("TPCH.Customer.C_CUSTKEY =
O_CUSTKEY"). The result goes into Spool 1 (group_amps), which is
built locally on the AMPs. The size of Spool 1 is estimated with
low confidence to be 60,000 rows. The estimated time for this
step is 0.48 seconds.
Hash Indexes
With Hash Index defined:
CREATE HASH INDEX HIOrder(O_Custkey,
O_TotalPrice,
O_OrderDate)
ON OrderTbl BY (O_CustKey) ORDER BY HASH (O_CustKey);
4) We do an all-AMPs JOIN step from TPCH.Customer by way of a RowHash
match scan with no residual conditions, which is joined to
TPCH.HIOrder. TPCH.Customer and TPCH.HIOrder are joined using a
merge join, with a join condition of ("TPCH.Customer.C_CUSTKEY =
TPCH.HIOrder.O_CUSTKEY"). The input table TPCH.HIOrder will not
be cached in memory. The result goes into Spool 1 (group_amps),
which is built locally on the AMPs. The size of Spool 1 is
estimated with low confidence to be 60,000 rows. The estimated
time for this step is 0.52 seconds.
The total estimated time is 0.52 seconds.
No redistribution, no sorting. Total join time significantly reduced.
NON-PPI Table
Records are sorted in row hash (not shown) sequence within the AMP.
[Figure: sample rows on each AMP of a non-partitioned table, with dates from 01/02 through 01/20 intermixed.]
PPI Table
Records are sorted in row hash (not shown) sequence in each partition within the AMP.
[Figure: the same sample rows on each AMP, grouped into partitions by date (01/02, 01/10, 01/12, 01/18, 01/20).]
PPI Example
CREATE TABLE Lineitem (
L_ORDERKEY INTEGER,
L_PARTKEY INTEGER,
L_SUPPKEY INTEGER,
L_LINENUMBER INTEGER ,
L_QUANTITY DECIMAL(15,2),
L_EXTENDEDPRICE DECIMAL(15,2),
L_DISCOUNT DECIMAL(15,2),
L_TAX DECIMAL(15,2),
L_RETURNFLAG CHAR(1),
L_LINESTATUS CHAR(1),
L_SHIPDATE DATE,
L_COMMITDATE DATE,
L_RECEIPTDATE DATE,
L_SHIPINSTRUCT CHAR(25),
L_SHIPMODE CHAR(10),
L_COMMENT VARCHAR(44)
)
PRIMARY INDEX (L_ORDERKEY);
CREATE TABLE LineitemPPI (
L_ORDERKEY INTEGER,
L_PARTKEY INTEGER,
L_SUPPKEY INTEGER,
L_LINENUMBER INTEGER ,
L_QUANTITY DECIMAL(15,2),
L_EXTENDEDPRICE DECIMAL(15,2),
L_DISCOUNT DECIMAL(15,2),
L_TAX DECIMAL(15,2),
L_RETURNFLAG CHAR(1),
L_LINESTATUS CHAR(1),
L_SHIPDATE DATE,
L_COMMITDATE DATE,
L_RECEIPTDATE DATE,
L_SHIPINSTRUCT CHAR(25),
L_SHIPMODE CHAR(10),
L_COMMENT VARCHAR(44)
)
PRIMARY INDEX (L_ORDERKEY)
PARTITION BY RANGE_N(L_ShipDate BETWEEN
DATE '1992-01-03' AND DATE '1998-11-30'
EACH INTERVAL '1' MONTH );
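To see what this partitioning expression buys, the Python sketch below computes the monthly partition a ship date falls into and lists the partitions a date-range query would have to read. It assumes partitions are numbered from 1 and start on the 3rd of each month, following the range start date; the function names are hypothetical.

```python
from datetime import date, timedelta

RANGE_START = date(1992, 1, 3)
RANGE_END = date(1998, 11, 30)


def partition_number(ship_date):
    """Monthly partition (1-based) for a date, or None when outside the range."""
    if ship_date < RANGE_START or ship_date > RANGE_END:
        return None
    months = (ship_date.year - RANGE_START.year) * 12
    months += ship_date.month - RANGE_START.month
    if ship_date.day < RANGE_START.day:
        months -= 1  # still inside the partition that began last month
    return months + 1


def partitions_for_dates_after(cutoff):
    """Partitions that can contain rows with L_ShipDate > cutoff."""
    first_date = max(cutoff + timedelta(days=1), RANGE_START)
    if first_date > RANGE_END:
        return []
    first = partition_number(first_date)
    last = partition_number(RANGE_END)
    return list(range(first, last + 1))


to_scan = partitions_for_dates_after(date(1997, 12, 31))
```

Under these assumptions a query for ship dates after 1997-12-31 needs 12 of the 83 partitions, whereas the non-partitioned table has to be scanned in full.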
PPI Example
SELECT MIN(L_Shipdate), MAX(L_Shipdate) FROM Lineitem;
Minimum(L_SHIPDATE)  Maximum(L_SHIPDATE)
-------------------  -------------------
         1992-01-03           1998-11-30
NON-PPI Table:
EXPLAIN SELECT * FROM Lineitem WHERE l_Shipdate > DATE '1997-12-31';
PPI Example
PPI Table:
EXPLAIN SELECT * FROM LineitemPPI WHERE l_Shipdate > DATE '1997-12-31';
PPI Joins
CREATE TABLE LineitemPPI (
L_ORDERKEY INTEGER,
L_PARTKEY INTEGER,
L_SUPPKEY INTEGER,
L_LINENUMBER INTEGER ,
L_QUANTITY DECIMAL(15,2),
L_EXTENDEDPRICE DECIMAL(15,2),
L_DISCOUNT DECIMAL(15,2),
L_TAX DECIMAL(15,2),
L_RETURNFLAG CHAR(1),
L_LINESTATUS CHAR(1),
L_SHIPDATE DATE,
L_COMMENT VARCHAR(44)
)
PRIMARY INDEX (L_ORDERKEY)
PARTITION BY RANGE_N(L_ShipDate
BETWEEN DATE '1992-01-03'
AND
DATE '1998-11-30'
EACH INTERVAL '1' MONTH );
Collect Statistics
The Optimizer must be provided with correct
demographic information about your data to
choose an optimal plan for your query.
Statistics tell the Optimizer:
How many rows there are per value.
How many distinct values there are in the
column.
Collect Statistics
Collected statistics are not automatically
updated by the Teradata DBS.
Users must refresh statistics when 5% to
10% of the table's rows have changed.
Collect Statistics on:
All non-unique indexes of a table or a join
index.
Any column used in a WHERE clause for set
selection or as a join constraint.
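The two demographics named above and the refresh guideline can be expressed as a short Python sketch. The function names are hypothetical, the change counts would have to come from your own load tracking, and the 10% default is simply the upper end of the 5% to 10% guideline.

```python
from collections import Counter


def demographics(column_values):
    """Distinct values and average rows per value for one column."""
    counts = Counter(column_values)
    distinct = len(counts)
    rows_per_value = len(column_values) / distinct if distinct else 0.0
    return {"distinct_values": distinct, "rows_per_value": rows_per_value}


def needs_refresh(rows_at_collection, rows_changed, threshold=0.10):
    """True when the share of changed rows reaches the refresh threshold."""
    if rows_at_collection == 0:
        return rows_changed > 0
    return rows_changed / rows_at_collection >= threshold


dept_stats = demographics([100, 100, 200, 300])
stale = needs_refresh(rows_at_collection=60000, rows_changed=6000)
fresh = needs_refresh(rows_at_collection=60000, rows_changed=2000)
```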
Collect Statistics
COLLECT STATISTICS ON Lineitem COLUMN L_Orderkey;
COLLECT STATISTICS ON Lineitem COLUMN L_Shipdate;
COLLECT STATISTICS ON Lineitem COLUMN (L_Orderkey, L_Shipdate);
HELP STATISTICS Lineitem;
Date      Time      Unique Values  Column Names
--------  --------  -------------  ---------------------
04/10/05  11:04:48         60,000  L_ORDERKEY
04/10/05  09:57:52          2,524  L_SHIPDATE
04/10/05  11:49:47        236,352  L_ORDERKEY,L_SHIPDATE
Data Compression
Makes row sizes smaller.
Allows more rows per block.
Reduces the number of I/Os.
Implemented at the column level.
Compression benefits I/O-intensive workloads.
The improvement gained from more rows per block
is most significant in full table scan operations.
Compression is transparent to applications.
Data Compression
Single-Value Compression
V2R4 and prior
CREATE TABLE Employee
(EmployeeNo INTEGER,
Jobtitle CHARACTER(30)
COMPRESS ('cashier')
);
Nulls and cashiers will be compressed.
Multi-Value Compression
V2R5 and later
CREATE TABLE Employee
(EmployeeNo INTEGER,
Jobtitle CHARACTER(30)
COMPRESS ('cashier',
'manager',
'programmer')
...
);
Cashiers, managers, and programmers will be compressed, as will nulls.
Up to 255 distinct values can be compressed for an individual
column.
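The idea behind multi-value compression can be sketched in Python: the compressed values are kept once in a value list, and a row stores only a small code when its value is on that list. This is a conceptual model only; it ignores how Teradata actually lays out the codes in the row, and the function names are hypothetical.

```python
MAX_COMPRESSED_VALUES = 255  # per-column limit stated above


def build_value_list(values):
    """Return the de-duplicated list of values to compress for one column."""
    unique = list(dict.fromkeys(values))
    if len(unique) > MAX_COMPRESSED_VALUES:
        raise ValueError("at most 255 distinct values can be compressed")
    return unique


def encode(value, value_list):
    """Describe how one column value would be stored in a row."""
    if value is None:
        return ("null", None)            # nulls are compressed as well
    if value in value_list:
        return ("code", value_list.index(value) + 1)
    return ("inline", value)             # stored in full in the row


def bytes_saved(column, value_list, width):
    """Bytes no longer stored inline for a fixed-width column."""
    return sum(width for v in column if encode(v, value_list)[0] != "inline")


job_titles = build_value_list(["cashier", "manager", "programmer"])
saved = bytes_saved(["cashier", "clerk", None, "manager"], job_titles, width=30)
```

In this example three of the four rows avoid storing the 30-byte Jobtitle value, which is where the more-rows-per-block benefit comes from.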
[Figure: table header holding the compression codes for field City CHAR(20): 01 Chicago, 10 Los Angeles, 11 New York. Sample rows store these codes in place of the city name, next to field StateCode CHAR(2).]
Query Management
Priority Scheduler
A DBA may want to:
Configure the system so that queries submitted by Sales Managers
execute at a higher priority.
Priority Scheduler
Can be used to control the resources allocated
to users.
An administrator can specify a performance
group when creating the user.
It manages resource distribution to
improve the performance of one application at
the expense of another.
[Figure: a resource partition (RP#) whose performance groups are linked through performance periods (8am-5pm, 5pm-9pm, 9pm-8am; 8am-8pm, 8pm-8am) to allocation groups AG1 to AG4, with weights AG1 = 5, AG2 = 10, AG3 = 20, AG4 = 40.]
Performance Group
Provides relative priority within
the Resource Partition.
Can be specified in the
Account String in the CREATE USER
statement.
Can be specified in the user Logon
String ($M$, $DEV$ etc).
Performance Period
Controls the scheduling
policy at that point in time.
Links a PG to an Allocation
Group's weight and policy.
Allocation Group
Defines a method for
disbursing resources
among sessions active
within that allocation
group.
Carries the weight.
Defines a scheduling policy.
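Because the allocation group carries the weight, a group's share of resources can be estimated from its weight relative to the other active groups. The Python sketch below assumes a single resource partition and a simple weight-proportional split; the weights 5, 10, 20 and 40 for AG1 to AG4 are the ones shown on the allocation group slide, and the function name is hypothetical.

```python
def relative_shares(weights, active=None):
    """Percentage share per allocation group among the active groups."""
    groups = set(weights) if active is None else set(active)
    total = sum(weights[group] for group in groups)
    if total == 0:
        return {}
    return {group: 100.0 * weights[group] / total for group in sorted(groups)}


ag_weights = {"AG1": 5, "AG2": 10, "AG3": 20, "AG4": 40}

all_active = relative_shares(ag_weights)
two_active = relative_shares(ag_weights, active=["AG1", "AG3"])
```

Under this assumption a group's share grows when other groups are idle: AG3 gets a larger percentage when only AG1 and AG3 have active sessions than when all four groups do.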
Example 1 - Percentage of
Resource Allocation
User WHDev, with performance group $L$, logged on to the system at
9:30 PM.
What percentage share of system resources will the user WHDev
get?
[Figure: priority versus time. Performance Period 1: usage 3600 seconds, allocation group AG11 (weight 40). Performance Period 2: usage 0 seconds, allocation group AGDEF (weight 5).]
Prevent all queries that are estimated to return more than 100
rows from running between the hours of 8:00 a.m. and 1:00 p.m.
on Fridays.
TDQM Architecture
[Figure: Query Management. All client systems accessing Teradata; TDQM Administrator; TDQM Partition; TDQM Metadata.]
Scheduled Requests
[Figure: Scheduled Requests Client, User Data, Scheduled Request Server, Teradata RDBMS.]
Synchronization
Operation Control
Teradata
Query Director
Users/
Applications
Users/
Applications
Backup System
Questions ?