
Parsing Engine Components

Parser
Optimizer
Generator
Dispatcher
The Parser receives the request from the client system and performs the following actions
through its different components:
SYNTAXER: Checks the syntax and produces a parse tree.
RESOLVER: Adds additional information from the Data Dictionary.
SECURITY MODULE: Checks the user's access rights on the objects referenced in the request.
Optimizer
Determines the most efficient way to execute a query.
Scans the request to find out the locks required on the objects.
Passes the optimized parse tree to the Generator.

Generator
Produces the plastic steps (execution steps without data values) to execute a query.
Caches the plastic steps in the request cache.
Passes these steps to GncApply, which binds in the parameter values and
produces the concrete steps.
Dispatcher
Controls the sequence in which steps are executed.
Performs the following four major tasks:
Receives the concrete steps from GncApply.
Sends the first step over the BYNET to the specific AMP(s) for processing.
Receives a completion response from the AMP(s).
Places the next step on the BYNET.

PRIMARY INDEX

HASHING ALGORITHM
When the primary index value of a row is input to the hashing algorithm, the output is called the row hash. The row
hash is the logical storage address of the row and identifies the AMP that owns the row. The table ID plus the row hash
identifies the cylinder and data block, and is used for row distribution, placement, and retrieval of the row. Data
distribution therefore depends on the uniqueness of the row hash.
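As an illustration, Teradata provides the hashing functions HASHROW, HASHBUCKET, and HASHAMP, which expose each
stage of this process. A minimal sketch, assuming a hypothetical Employee_Table with PI column emp_no:

SELECT emp_no,
       HASHROW(emp_no) AS row_hash,                        -- 32-bit row hash of the PI value
       HASHBUCKET(HASHROW(emp_no)) AS hash_bucket,         -- entry in the hash map
       HASHAMP(HASHBUCKET(HASHROW(emp_no))) AS target_amp  -- AMP that owns the row
FROM Employee_Table;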
The table ID is a sequential number assigned whenever a table is created; this number changes whenever the table is re-created.
Hash code redistribution is used in join operations. It applies when the foreign key (join column) of one table (table A) is
joined to the primary index of another table (table B). For each table A row, the row hash of the foreign key is calculated,
and the row is sent to the AMP dictated by that row hash, which is the same AMP that holds the table B rows for that
row hash.
The join column hash code sequence is the result of a sort: the row hashes of the foreign key (join column) of
table A are sorted into this sequence and then matched, in sequence, against the rows of table B on the same AMP.

PART2
Teradata uses a hashing algorithm to distribute rows among the various AMPs. This row-distribution process is unique to
Teradata and is the core reason behind Teradata's parallel architecture. To understand the process of row
distribution, refer to the diagram below.

[Diagram: rows distribution in Teradata]

Teradata uses indexes to determine the distribution of rows. The hashing algorithm processes the index value and
produces a hash value. Based on the hash value, the hash map determines the hash bucket and hence the target AMP,
and that AMP stores the record. Similarly, the other AMPs receive their share of records, so each record is stored
on a specific AMP depending on its hash value. This is why columns with many unique values that are also used in
joins are the preferred index columns. Whenever a set of records is received, the index columns are hashed and the
records are stored on the respective AMPs. Because the work is distributed across the AMPs, Teradata is very fast;
put another way, Teradata is only as fast as its slowest AMP.
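A hedged sketch of how to check this in practice, assuming a hypothetical Orders table with PI column order_id:
the per-AMP row counts, and hence any skew from a poorly chosen index, can be inspected with the same hash functions:

SELECT HASHAMP(HASHBUCKET(HASHROW(order_id))) AS amp_no,  -- AMP that stores each row
       COUNT(*) AS row_cnt
FROM Orders
GROUP BY 1
ORDER BY 2 DESC;  -- a large spread between AMPs indicates skew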

Session 2 (8 July 2014)


Types of Tables
Permanent Table
Set table
Multi Set Table
Volatile Table
Global Temporary Table
Derived Table
Set Table:

Does not allow duplicate rows


A SET table forces Teradata to check for duplicate rows every time a row is inserted or updated.
SET tables insert rows quickly at first, but inserts become much slower as the table's record count
reaches into the millions.
CREATE SET TABLE <TableName> ...

Multiset Tables

Allows duplicate rows in the table.

Mostly used for staging tables.

Saves time and improves performance when the source already contains distinct rows.
If SET or MULTISET is not specified in the table DDL, the default is a SET table.
Which table to use?
A SET table causes additional overhead for the duplicate-row check.
If you are using any GROUP BY or QUALIFY statement on the source table, it is highly recommended to define the table as
MULTISET, since those operations already filter out duplicate records.
If the table needs a UPI (Unique Primary Index), there is also no need for a SET table, as the UPI will not allow duplicate rows.
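A minimal sketch of the two DDL forms, using hypothetical table and column names:

CREATE SET TABLE Emp_Target
( emp_no   INTEGER,
  emp_name VARCHAR(50)
) UNIQUE PRIMARY INDEX (emp_no);  -- the UPI already rejects duplicates, so no duplicate-row check is needed

CREATE MULTISET TABLE Stg_Emp
( emp_no   INTEGER,
  emp_name VARCHAR(50)
) PRIMARY INDEX (emp_no);         -- duplicates allowed; typical for staging loads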

Volatile Table:

A volatile table is a temporary table for use within a single session only.

The definition of a volatile table is held in memory cache only for the duration of the current
session and does not survive across a system restart.
The table is automatically dropped at session end.
Volatile tables do not survive a Teradata Database reset: restart processing destroys both the
contents and the definition of a volatile table.

The following are not available for volatile tables:


RI constraints (see Database Design and SQL Data Definition Language for details)
Check constraints
Permanent journaling
Compressed column values
DEFAULT clause
TITLE clause
Named indexes
Privilege checking (because volatile tables are private to the session in which they are created)
Identity Column
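A minimal sketch of creating and using a volatile table, with hypothetical names. ON COMMIT PRESERVE ROWS keeps
the rows across transactions within the session; the default, ON COMMIT DELETE ROWS, empties the table at the end
of each transaction:

CREATE VOLATILE TABLE VT_Sales
( sale_id INTEGER,
  amount  DECIMAL(18,2)
) PRIMARY INDEX (sale_id)
ON COMMIT PRESERVE ROWS;

INSERT INTO VT_Sales VALUES (1, 100.00);
SELECT * FROM VT_Sales;  -- both contents and definition vanish when the session ends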

Global Temporary Table

Global temporary tables are tables that exist only for the duration of the SQL session in which they are used.

The contents of these tables are private to the session, and the system automatically drops the table at the end of
that session.
The system saves the table definition permanently in the Data Dictionary.
The biggest difference between global temporary tables and volatile tables is that the definition of a global temporary table is
stored in the Data Dictionary and therefore can be shared and used by many different user sessions. Each user session can
materialize its own local instance of the table.
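A minimal sketch with hypothetical names. The definition below is stored permanently in the Data Dictionary, but
each session that inserts into the table materializes its own private instance:

CREATE GLOBAL TEMPORARY TABLE GT_Sales
( sale_id INTEGER,
  amount  DECIMAL(18,2)
) PRIMARY INDEX (sale_id)
ON COMMIT PRESERVE ROWS;

INSERT INTO GT_Sales VALUES (1, 100.00);  -- materializes this session's local instance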
Derived Table

Special type of temporary table.

A derived table is obtained from one or more other tables as the result of a subquery.

Using derived tables avoids having to use the CREATE and DROP TABLE statements for storing retrieved
information.

Use of a derived table may be appropriate when it significantly reduces the complexity, or increases the readability,
of a query.

Writing an UPDATE with a FROM clause that uses a derived table can significantly reduce query complexity and
improve performance and readability. This is particularly useful when many tables are involved in the query. For
example:

UPDATE Table1
FROM (
    SELECT TB2.COL1,
           TB3.COL2,
           TB4.COL3
    FROM Table2 TB2
    INNER JOIN Table3 TB3
        ON TB2.COL1 = TB3.COL1
    INNER JOIN Table4 TB4
        ON TB2.COL1 = TB4.COL1
    WHERE TB2.COL4 = 123
) XXX
SET Table1.COL3 = XXX.COL3
WHERE Table1.COL1 = XXX.COL1
  AND Table1.COL2 = XXX.COL2;

The general approach should be to use temporary tables instead of derived tables if the expected dataset or the
involved table(s) have more than 250K records (1,000 rows/AMP * 240 AMPs).

You cannot specify any of the following SQL syntactical elements in a derived table:

ORDER BY
WITH

WITH ... BY
Recursion

View:

A VIEW is a virtual table; its definition is stored in Teradata's Data Dictionary.


Views provide security at the row or column level.
Views are used to provide customized access to data tables.
Views provide the capability of pre-joining multiple tables together in order to simplify user access to information.
Views require no permanent space.

CREATE VIEW Agg_View AS
SELECT Dept_no,
       AVG(salary) AS AVGSAL
FROM Employee_Table
GROUP BY Dept_no;

Macros:-

Macros are SQL statements stored as an object in the Data Dictionary (DD).
A macro can store one or multiple SQL statements.
INSERT, UPDATE, and DELETE commands are valid within a macro
How to create and use a Macro?

A macro is a Teradata extension to ANSI SQL that contains prewritten SQL statements. Macros are used to run a
repeatable set of tasks. The details of a macro can be found in the Data Dictionary (DD). Macros are database
objects and thus belong to a specified user or database. A macro can be executed from Queryman (SQL Assistant),
BTEQ, or by another macro.

How to create a Macro


Create a macro to generate a DOB list for department 321:
CREATE MACRO DOB_Details AS
(SELECT first_name ,last_name ,DOB
FROM TERADATA.employees
WHERE dept_numbr =321
ORDER BY DOB asc;);

EXECUTE a Macro
To execute a macro, call it with the EXEC command.
EXEC DOB_Details;
last_name   first_name   DOB
Ram         Kumar        75/02/22
Laxman      Sinha        79/04/06
DROP a Macro
To drop a macro, use the following command:
DROP MACRO DOB_Details;
REPLACE a Macro
If we need to modify an existing macro, instead of dropping and re-creating it
we can use the REPLACE MACRO command as follows:
REPLACE MACRO DOB_Details AS
(SELECT first_name,last_name ,DOB
FROM TERADATA.employees
WHERE dept_numbr = 321
ORDER BY DOB, first_name;);
Parameterized Macros
Parameterized macros allow the usage of variables. Their advantage is that values can be passed to these variables
at run-time.
Example
CREATE MACRO dept_list (dept INTEGER) AS
(
SELECT last_name
FROM TERADATA.employees
WHERE dept_numbr = :dept; );
To Execute the macro
EXEC dept_list (321);
Macros may have more than one parameter. Each name and its associated type are
separated by a comma from the next name and its associated type. The order is important.
The first value in the EXEC of the macro will be associated with the first value in the
parameter list. The second value in the EXEC is associated with the second value in the
parameter list, and so on.
Example
CREATE MACRO emp_verify (dept INTEGER, sal DEC(18,0))
AS (
SELECT emp_numbr
FROM TERADATA.employees
WHERE dept_numbr = :dept
AND salary < :sal; );
To execute this macro
EXEC emp_verify (301, 50000);
Key points to note about Macros:

Macros are a Teradata extension to SQL.

Macros can only be executed with the EXEC privilege.

Macros can provide column level security.

NOTE: A user needs only the EXEC privilege to execute a macro; the user does not
need privileges on the underlying tables or views that the macro uses.
Part 3: Data Distribution and Data Access Methods
Teradata Database Indexes
An index is a physical mechanism used to store and access the rows of a table. Indexes on
tables in a relational database function much like indexes in books: they speed up information
retrieval.
In general, Teradata Database uses indexes to:
Distribute data rows.
Locate data rows.
Improve performance. Indexed access is usually more efficient than searching all rows of a table.
Ensure uniqueness of the index values. Only one row of a table can have a particular value in the
column or columns defined as a unique index.
Teradata Database supports the following types of indexes:
Primary
Partitioned Primary
Secondary
Join

Hash
Special indexes for referential integrity
These indexes are discussed in the following sections.
Primary index: we can create three types of primary index:
UPI

the PI is a column, or set of columns, that has no duplicate values.

NUPI

the PI is a column, or set of columns, that may have duplicate values.

NoPI

there is no PI column, and rows are not hashed based on any column values.
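A minimal sketch of the three corresponding DDL forms, using hypothetical table and column names:

CREATE TABLE Emp_UPI
( emp_no INTEGER, emp_name VARCHAR(50) )
UNIQUE PRIMARY INDEX (emp_no);  -- UPI: no duplicate index values allowed

CREATE TABLE Emp_NUPI
( dept_no INTEGER, emp_name VARCHAR(50) )
PRIMARY INDEX (dept_no);        -- NUPI: duplicate index values allowed

CREATE TABLE Emp_NoPI
( emp_no INTEGER, emp_name VARCHAR(50) )
NO PRIMARY INDEX;               -- NoPI: rows are not hashed on any column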

Secondary Indexes
Secondary Indexes (SIs) allow access to information in a table by alternate, less frequently used paths
and improve performance by avoiding full table scans.
Although SIs add to table overhead, in terms of disk space and maintenance, you can drop and recreate
SIs as needed.
SIs:
Do not affect the distribution of rows across AMPs.
Can be unique or nonunique.
Are used by the Optimizer when the indexes can improve query performance.
Can be useful for NoPI tables.
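A minimal sketch of creating and dropping SIs on a hypothetical Employee_Table:

CREATE UNIQUE INDEX (email) ON Employee_Table;      -- USI: alternate unique access path
CREATE INDEX dept_idx (dept_no) ON Employee_Table;  -- NUSI: named, nonunique
DROP INDEX (dept_no) ON Employee_Table;             -- SIs can be dropped when no longer needed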

Non-Primary Index:-

Partitioned Primary Index:-

Teradata join strategy and recovery and protection of data

Teradata joins
When we join two or more tables on a column or set of columns, the join combines data from
matching records in both tables. This universal concept remains the same for all databases.

In Teradata, the Optimizer (a very smart interpreter) determines the type of join strategy
to be used based on the user's input, taking the performance factor into account.
In Teradata, some of common join types are used like
- Inner join (can also be "self join" in some cases)
- Outer Join (Left, Right, Full)
- Cross join (Cartesian product join)
When User provides join query, optimizer will come up with join plans to perform joins. These
Join strategies include
- Merge Join
- Nested Join
- Hash Join
- Product join
- Exclusion Join

Merge Join
-------------------
Merge join is a strategy in which the rows to be joined must be present on the same AMP. If the rows
to be joined are not on the same AMP, Teradata will either redistribute the data or duplicate
the data in spool to make that happen, based on the row hash of the columns involved in the join's
WHERE clause.
If the two tables to be joined have the same primary index, then the matching records are already on the
same AMP and redistribution of records is not required.
There are four scenarios for a merge join:
Case 1: If the joining columns are UPI = UPI, the records to be joined are already on the same
AMP and redistribution is not required. This is the most efficient and fastest join strategy.
Case 2: If the joining columns are UPI = non-index column, the records of the second table have to
be redistributed across the AMPs based on the row hash of the join column, to line up with the first table.
Case 3: If the joining columns are non-index column = non-index column, both tables have to be
redistributed so that matching data lands on the same AMP, and the join then happens on the
redistributed data. This strategy is time-consuming, since the complete redistribution of both
tables takes place across all the AMPs.
Case 4: For a join on the primary index, if the referenced table (the second table in the
join) is very small, that table is duplicated (copied) to every AMP.
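As a sketch with hypothetical tables, prefixing a query with EXPLAIN shows which of these scenarios the Optimizer
actually chose:

EXPLAIN
SELECT A.order_id, B.cust_name
FROM Orders A
INNER JOIN Customers B
  ON A.cust_id = B.cust_id;
-- If cust_id is the PI of both tables (Case 1), the plan reports a merge join with no
-- redistribution; otherwise it will mention redistributing or duplicating rows in spool.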
Nested Join
-------------------
Nested join is one of the most precise join plans suggested by the Optimizer. It works when a
UPI or USI is used in the join statement to retrieve a single row from the first table. It
then looks for matching rows in the second table, using an index (primary or secondary) on the
join column, and returns the matching results.
Example:
Select EMP.Ename , DEP.Deptno, EMP.salary
from

EMPLOYEE EMP ,
DEPARTMENT DEP
Where EMP.Enum = DEP.Enum
and EMp.Enum= 2345; -- this results in nested join
Hash Join
-------------------
Hash join is one of the plans suggested by the Optimizer based on the joining conditions. It can
be seen as a close relative of the merge join in terms of functionality. As with a merge join,
the joining happens on the same AMP; in a hash join, however, the AMP holds the smaller table
completely inside its memory, and the join happens on the row hash.
Advantages of hash joins:
1. They are faster than merge joins, since the large table does not need to be sorted.
2. Since the join happens between a table in AMP memory and a table in unsorted spool, it proceeds
very quickly.
Exclusion Join
-------------------
These types of joins are suggested by the Optimizer when the following are used in queries:
- NOT IN
- EXCEPT
- MINUS
- SET subtraction operations

SELECT EMP.Ename, EMP.salary
FROM EMPLOYEE EMP
WHERE EMP.Enum NOT IN
  ( SELECT Enum
    FROM DEPARTMENT DEP
    WHERE Enum IS NOT NULL );
Please make sure to add an additional WHERE filter with <column> IS NOT NULL, since the
presence of a NULL in a NOT IN <column> list will return no results.
An exclusion join for a NOT IN query has three scenarios:
Case 1: Matched data in the NOT IN subquery will disqualify that row.
Case 2: Non-matched data in the NOT IN subquery will qualify that row.
Case 3: Any unknown result in the NOT IN will disqualify that row (NULL is the typical example
of this scenario).
Product Join
-------------------
The product join compares every qualifying row from one relation to every qualifying row from
the other relation and saves the rows that match the WHERE predicate filter. Because all rows
of the left relation in the join must be compared with all rows of the right relation, the
system always duplicates the smaller relation on all AMPs, and if the entire spool does not fit
into available memory, the system is required to read the same data blocks more than once.
Reading the same data block multiple times is a very costly operation.
This operation is called a product join because the number of comparisons needed is the
algebraic product of the number of qualifying rows in the two relations.
Any of the following conditions can cause the Optimizer to apply a product join over other join
methods:
- No WHERE clause is specified in the query.
- The join is on an inequality condition. (If you specify a connecting, or bind, term between the
relations, then the system does not choose a product join; bind terms are those that are bound
together by the equality operator.)
- There are ORed join conditions.
- A referenced relation is not specified in any join condition.
- The product join is the least costly join method available in the situation.
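As an illustration with hypothetical tables, an inequality join condition with no equality bind term typically
forces a product join:

SELECT E.emp_no, G.grade
FROM Employee_Table E, Salary_Grades G
WHERE E.salary BETWEEN G.min_sal AND G.max_sal;  -- no equality bind term, so every qualifying row pair is compared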
Single window merge join ..
