An Overview
Traditionally, data processing scenarios have been divided into two categories...
Scenario 1: A customer places an order for an iPhone through an online store. Steps involved:
- A new order O1 is created in the Order table
- Customer number C1 and product number P1 are pulled from the master tables and enriched in the Order table row
- An order confirmation number is sent to the customer
- End of process
- The access patterns of these two approaches are very different, so they make very different demands on the underlying database engine
- The basic database architecture has to be different to be optimized for one type of processing
- Teradata is a leader in the DSS and data warehouse space
What is Teradata
Teradata is a Relational Database Management System (RDBMS) composed of hardware and software.
- Designed for the world's largest commercial databases
- Used by customers who want answers to their business questions from more than 1 terabyte of data

- 6 of the top 10 retailers
- 6 of the top 9 communications companies
- Over 40% of the leading manufacturers in the world
- 3 of the top 4 Blue Cross/Blue Shield insurance companies
- Many of the world's leading banks

- 1979: Teradata Corp founded in Los Angeles, California. Development begins on a massively parallel database computer
- 1984: Teradata sells the first DBC/1012
- 1986: Product of the Year
- 1990: First terabyte system installed and in production
- 1992: Teradata is merged into NCR
- 1995: Teradata Version 2 for UNIX operating systems released
Why Teradata
Capacity:
- Scaling from gigabytes to terabytes of detailed data stored in billions of rows
- Scaling to thousands of millions of instructions per second (MIPS) to process data
Performance:
- Shared Nothing Architecture: able to achieve parallelism in each and every stage of query execution
- Makes Teradata Database faster than other relational systems

- Can be accessed by network-attached and channel-attached systems
- Supports the requirements of many diverse clients

- High fault tolerance, no single point of failure
- Automatically detects and recovers from hardware failures
- Ensures that transactions either complete or roll back to a stable state if a fault occurs

- Linearly expandable: as your database grows, additional nodes may be added
- Allows expansion without sacrificing performance
[Figure: node diagram showing PEs and Vdisks]
This is called SMP.
- Symmetric Multiprocessor (SMP): a multiprocessing node that contains a number of central processing units sharing a single memory pool
- "Shared Nothing Architecture": each AMP has its own disk (data), shares it with no other AMP, and is solely responsible for any changes or access to that data
BYNET
- Scalable bandwidth as nodes are added
- The BYNET is responsible for:
  - Broadcast, multicast, and point-to-point communications between nodes and virtual processors
  - Merging answer sets back to the PE
  - Making Teradata parallelism possible
- MPP (Massively Parallel Processing) consists of a number of nodes (SMPs) that work on a problem at the same time
- Each node (SMP) has one or more CPUs and its own memory, I/O, network connections, and disk arrays, and does not share its resources with other nodes
Important components
SMP - Symmetric Multiprocessing: a single node that contains multiple CPUs sharing a memory pool.
MPP - SMP nodes combined with a communication network (BYNET) form an MPP. An MPP comprises two or more loosely coupled SMP nodes connected by the BYNET, with shared SCSI access to multiple disk arrays.
BYNET - Hardware inter-processor network that links the nodes of an MPP system. It implements point-to-point, multicast, or broadcast communication depending on the situation. The BYNET is also used for merging and sorting data from different nodes; the accumulated data is then sent back to the user.
Disk Array - Teradata employs RAID storage technology. Drives are configured logically into one or more logical units (LUNs), which are further sliced into Pdisks that are assigned to the AMPs. The group of Pdisks assigned to an AMP is called a Vdisk.
More Definitions
PDE - Parallel Database Extension is an interface layer on top of the operating system. It adds parallel processing and priority scheduling, and it executes the Vprocs. It takes advantage of the BYNET and shared disk hardware to improve performance.
File System - Teradata File System service calls allow the Teradata RDBMS to store and retrieve data efficiently without being concerned about the underlying operating system interfaces. It divides the disk into logical blocks: MI, CI, CID, DB, DBD.
TPA - Teradata Parallel Application is responsible for distribution, coordination, and balancing of processes/threads across nodes.
TDP - Teradata Director Program is responsible for session balancing across multiple PEs, failure notification, logging, verification, recovery, restart, and security.
Logical Processors
VPROCS - Virtual Processors. Vprocs are a set of software processors that run on a node under Teradata PDE within the multitasking environment of the operating system. A single node (SMP) can have as many as 128 Vprocs.
PE - Parsing Engine performs session control and dispatches tasks to fetch, return, and merge data. It communicates with the client system on one side and with the AMPs on the other side (via the BYNET).
AMP - Access Module Processor retrieves and updates data on the virtual disks. It is responsible for locking, joining, sorting, aggregation, data conversion, disk space management, accounting, and journaling.
- A single PE handles one request at a time. The request is parsed and optimized, steps are built, and the steps are then dispatched to the corresponding AMP(s)
- An AMP has 80 worker tasks, which perform the different kinds of work required by the steps. If the request is a SELECT, the worker tasks send their data to the BYNET when they finish, where it is merged and sorted
- The PE dispatches the resulting data to the user
Query Lifecycle
[Figure: client/server request flow. Client side: application, CLI, TDP (Teradata Director Program). Server side: PE (1), PE (2), Hashmap, BYNET merge, and AMP (1) to AMP (4), each with its own Vdisk.]

Single-AMP request: SELECT * FROM t1 WHERE id = 4;
1. Application sends the request to the PE - PE sends back the acknowledge to application
2. The SQL is parsed by the PE
3. PE uses the Hashmap to locate the AMP
4. PE sends the request to the particular AMP - AMP sends back the acknowledge to PE
5. AMP retrieves the data from its own Vdisk
6. AMP sends the data to PE
7. Result is sent to application from PE - Application sends back acknowledge to PE

Multi-AMP request: SELECT * FROM t1 WHERE id IN (2,8);
1. Application sends the request to the PE - PE sends back the acknowledge to application
2. The SQL is parsed by the PE
3. PE uses the Hashmap to locate the AMPs
4. PE sends the request to the individual AMPs - AMP sends back the acknowledge to PE
5. AMP retrieves the data from its own Vdisk
6. AMP sends data to BYNET
7. BYNET merges the data
8. BYNET sends merged data to PE
9. Result is sent to application from PE - Application sends back acknowledge to PE

Vdisk contents in the figure:
Vdisk  ID (PI)  Desc
1      3        C
1      5        E
2      1        A
2      4        D
3      2        B
3      6        F
4      7        G
4      8        H
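The two request paths can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the row placement is copied from the four Vdisks listed above, the `HASH_MAP` dictionary stands in for the real hash map (which works on row hash values, not raw IDs), and `run_select` is a hypothetical helper name.

```python
# Row placement copied from the figure: AMP number -> {ID (PI): Desc}.
VDISKS = {
    1: {3: "C", 5: "E"},
    2: {1: "A", 4: "D"},
    3: {2: "B", 6: "F"},
    4: {7: "G", 8: "H"},
}

# Stand-in for the hash map: primary index value -> owning AMP.
# (Assumption: a real system maps hash buckets to AMPs instead.)
HASH_MAP = {pi: amp for amp, rows in VDISKS.items() for pi in rows}


def run_select(ids):
    """Return (rows, amps_used) for SELECT * FROM t1 WHERE id IN (ids)."""
    amps = sorted({HASH_MAP[i] for i in ids if i in HASH_MAP})
    # Each AMP reads only its own Vdisk.
    partial = [
        [(i, VDISKS[amp][i]) for i in ids if i in VDISKS[amp]] for amp in amps
    ]
    # One AMP: rows go straight back to the PE.
    # Several AMPs: the partial answer sets are merged first (the BYNET's job).
    merged = sorted(row for rows in partial for row in rows)
    return merged, amps


single = run_select([4])     # touches one AMP
multi = run_select([2, 8])   # touches two AMPs, results merged
```

The single-value request resolves to exactly one AMP, while the IN-list request fans out to every AMP that owns one of the values.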
Row Hash
HashMap
Indexing:
- A data value (or values, if the index is compound) from a row acts as an index key to that row
- The index associates the index key with a relative row address that reports the location of the row on disk
- Index rows are stored in order of their index key values and are said to be value-ordered
Hashing:
- The index key data value is transformed by a mathematical function to produce an abstract value that is not related to the original data value in any obvious way
- Hashed data is assigned to hash buckets, and each hash bucket corresponds to exactly one AMP location
- There is no obvious correspondence between a hash code and the location of the row it refers to

Teradata does not use indexing in the traditional sense. What we refer to as indexes are either row hash values or data tables (join index).
Tradeoffs Between Hashing and Indexing:
Where hashing is better:
- Hashing is far better suited to the parallel database architecture
- Hashing provides consistently good performance when the primary index values distribute rows evenly across the AMPs
- Primary indexes are not stored in an index subtable; they are stored directly as part of the row data
- Rows of tables whose primary index columns are frequently used join columns can be co-located on the same AMP
Where hashing is at a disadvantage:
- Range queries
- Retrievals whose selection criteria involve only part of a multicolumn hash key
Hashing
- Teradata Database hashing algorithms are proprietary mathematical functions that transform an input data value of any length into a 32-bit value
- A 32-bit row hash value provides about 4.29 billion possible values

Row ID layout:
Row ID   = Row Hash (32 bits) + Uniqueness Value (32 bits)
Row Hash = Destination Selection Word (16 bits) + Remainder (16 bits)

- First 16 bits: the Destination Selection Word, used to define the hash bucket for the hashed row
- The remaining 16 bits are a remainder from the operation of the hash function on the original input value
- Uniqueness Value: an additional 32-bit system-generated value that ensures the uniqueness of any RowID. It is generated at the AMP level
- There are 65,536 hash buckets, distributed as evenly as possible among the AMPs
- The BYNET interface board on each AMP maintains a hash map: an index of which hash buckets are assigned to which AMPs
- Row assignment is performed in a manner that ensures as equal a distribution of table rows as possible among all the AMPs
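The bit layout described above can be sketched as follows. This is a minimal illustration, not Teradata's implementation: CRC-32 is only a stand-in for the proprietary hash function, the round-robin bucket assignment is an assumed policy, and all function names are hypothetical.

```python
import zlib

HASH_BUCKETS = 65_536  # 2**16 buckets, as stated above


def row_hash(value: str) -> int:
    """Stand-in 32-bit hash (CRC-32); NOT Teradata's proprietary function."""
    return zlib.crc32(value.encode("utf-8")) & 0xFFFFFFFF


def split_row_hash(h: int):
    """Split a 32-bit row hash into (destination selection word, remainder)."""
    return h >> 16, h & 0xFFFF


def build_hash_map(n_amps: int):
    """Assign every bucket to an AMP round-robin (an assumed, simple policy)."""
    return [bucket % n_amps for bucket in range(HASH_BUCKETS)]


def row_id(h: int, uniqueness: int) -> int:
    """64-bit RowID: 32-bit row hash followed by a 32-bit uniqueness value."""
    return (h << 32) | (uniqueness & 0xFFFFFFFF)


hash_map = build_hash_map(n_amps=4)
h = row_hash("empno=1001")
bucket, remainder = split_row_hash(h)
amp = hash_map[bucket]  # the AMP that would own this row in the sketch
```

Because only the top 16 bits select the bucket, two rows with different remainders but the same destination selection word always land on the same AMP.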
Hash-Related Functions
Rows per AMP for a candidate primary index column:

SELECT HASHAMP (HASHBUCKET (HASHROW (empno))) AS amp_no, COUNT(*)
FROM employee
GROUP BY 1
ORDER BY 2 DESC;

amp_no  count(*)
25      3510
29      3468
17      3181

Rows per hash value:

SELECT HASHROW (empno) AS hash_value, COUNT(*)
FROM employee
GROUP BY 1
ORDER BY 2 DESC;

hash_value  count(*)
63524       14
8069        14
4191        1

Average number of rows per hash value:

SELECT (COUNT(*) (FLOAT)) / (COUNT(DISTINCT HASHROW(empno)))
FROM employee;
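The same distribution checks can be mimicked outside the database to see why distinct key values matter. The sketch below is an assumption-laden model: `hashrow`, `hashbucket`, and `hashamp` are Python stand-ins named after the SQL functions, CRC-32 replaces the proprietary hash, and buckets are mapped to AMPs round-robin.

```python
import zlib
from collections import Counter


def hashrow(value) -> int:
    """Stand-in 32-bit hash (CRC-32), not Teradata's HASHROW."""
    return zlib.crc32(str(value).encode("utf-8")) & 0xFFFFFFFF


def hashbucket(h: int) -> int:
    """Top 16 bits of the row hash select the bucket."""
    return h >> 16


def hashamp(bucket: int, n_amps: int) -> int:
    """Assumed round-robin bucket-to-AMP map."""
    return bucket % n_amps


def rows_per_amp(values, n_amps):
    """Python analogue of the first query: row count per AMP."""
    return Counter(hashamp(hashbucket(hashrow(v)), n_amps) for v in values)


def avg_rows_per_hash(values):
    """Python analogue of the third query: rows per distinct hash value."""
    values = list(values)
    return len(values) / len({hashrow(v) for v in values})


unique_keys = range(10_000)                    # distinct values, like a UPI
skewed_keys = [v % 3 for v in range(10_000)]   # only three distinct values

unique_counts = rows_per_amp(unique_keys, n_amps=8)
skewed_counts = rows_per_amp(skewed_keys, n_amps=8)
```

With only three distinct values, at most three AMPs receive rows no matter how many AMPs exist, which is the skew the SQL queries are meant to expose.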
Data Partitioning
For join columns, a row hash value is recalculated based on the columns involved in the join. If tables are being joined on three columns (a,b,c), the row hash value is computed as if (a,b,c) were the PI. If the rows to be joined are not already on the same AMP according to that row hash, the rows are redistributed across the AMPs, which is an overhead.
Teradata Indexes
Indexes are a method of storing and retrieving data from Teradata optimally.
By default, every table has one index, called the Primary Index (PI). In addition, if queries use columns other than the PI, the user can declare a Secondary Index (SI) on those columns for faster access to the data.
Types of indexes:
- Primary Index: Unique and Non-Unique, no subtable, affects data distribution
- Secondary Index: Unique and Non-Unique, avoids a full-table scan (FTS), uses a subtable, does not affect data distribution, adds the overhead of updating the subtable whenever an insert/delete/update is done on the table
- Join Index: Single Table, Multi Table, and Aggregate Join Index
  - A Single Table JI allows rows to be hashed on some other column. This column might be used in a condition of the SQL, qualifying the JI for data access
  - A Multi-Table JI on columns from more than one table avoids recalculating join values in a frequently used query
  - An Aggregate JI on columns helps queries that frequently aggregate on the same column(s)
- Hash Index: a file structure that shares properties with the single-table join index (STJI) and the secondary index (SI)
Restrictions:
- Only one PI per table
- Not more than 64 columns
- Cannot include columns having BLOB or CLOB data types

- No separate physical storage: the PI is stored in-line with the row in the base table
- Rows are hash-ordered within the same AMP

Types of Primary Index: a PI can be defined over two orthogonal dimensions
- Unique (UPI) or non-unique (NUPI)
- Partitioned (PPI) or non-partitioned (NPPI)
Types of PI
- Unique Primary Index
- Non-unique Primary Index
- Non-Partitioned Primary Index
  - The standard Teradata Database primary index
  - Rows are hashed to the appropriate AMPs and stored there in row hash order
PPI
Insert Data

Row Hash  order_nr  order_cre_dt
A11111    10        2007-01-11
A22222    20        2007-02-22
A33333    30        2007-01-12
A44444    40        2007-02-23
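A short sketch can show how the four sample rows would be ordered on one AMP with and without partitioning. It rests on two assumptions that the slide does not state: the rows all live on the same AMP, and the PPI is partitioned by calendar month of order_cre_dt. With a PPI, rows are ordered by partition first and by row hash within each partition.

```python
from datetime import date

# (row hash, order_nr, order_cre_dt), taken from the sample table above.
rows = [
    ("A11111", 10, date(2007, 1, 11)),
    ("A22222", 20, date(2007, 2, 22)),
    ("A33333", 30, date(2007, 1, 12)),
    ("A44444", 40, date(2007, 2, 23)),
]


def partition_no(order_cre_dt: date) -> int:
    """Assumed partitioning expression: one partition per calendar month."""
    return order_cre_dt.year * 12 + order_cre_dt.month


# NPPI: rows are kept in row hash order only.
nppi_order = sorted(rows, key=lambda r: r[0])

# PPI: rows are kept in partition order, then row hash order.
ppi_order = sorted(rows, key=lambda r: (partition_no(r[2]), r[0]))


def scan_month(stored, year, month):
    """Rows a one-month range query needs when only that partition is read."""
    target = year * 12 + month
    return [r for r in stored if partition_no(r[2]) == target]


january = scan_month(ppi_order, 2007, 1)
```

In the PPI ordering the two January rows sit next to each other, so a date-range query can read one contiguous partition instead of scanning every row.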
- The more distinct the primary index values, the better
- Rows having the same primary index value are distributed to the same AMP
- Parallel processing is more efficient when table rows are distributed evenly across the AMPs
- The primary index should be chosen on the most frequently used access path
- Primary index operations must provide the full primary index value
- Primary index retrievals on a single value are always one-AMP operations
Volatility:
How often the value of the index column changes. The less often it changes, the better it is as an index choice.
The Trade-Off:
- Data Distribution vs. Access Path
- Normal Access vs. Range Access
- NPPI vs. PPI
Exercise
Table definition:
- Table1 - Order table: geo_cd + order_nr defines the uniqueness of a row
- Table2 - Order item table: geo_cd + order_nr + item_nr defines the uniqueness of a row
- upd_ts on both tables captures the last modified timestamp of the data
Secondary Index
- Enhances set selection by specifying access paths other than the primary index path
- SI storage: the system maintains a subtable for each SI. A subtable row holds the SI row hash, the SI column values, and the RowID of the base table row that contains the actual value
- There is an overhead in maintaining the SI subtable if the table is subject to INSERT/UPDATE/DELETE operations

Restrictions on Secondary Indexes:
- A table can have up to 32 secondary, hash, and join indexes
- No more than 64 columns can be included in a secondary index definition
- Cannot include columns having BLOB or CLOB data types

SI Types:
- Unique Secondary Index (USI)
- Non-Unique Secondary Index (NUSI)
  - Value-Ordered Secondary Index
  - NUSI and Query Covering
  - NUSI Bit Mapping
The process for locating a row using a USI is as follows:
1. After checking the syntax and lexicon of the query, the Parser looks up the Table ID for the USI subtable that contains the specified USI value
2. The hashing algorithm hashes the USI value
3. The Generator creates an AMP step message containing the USI Table ID, USI row hash value, and USI data value
4. The Dispatcher uses the USI row hash to send the message across the BYNET to AMP 3, which contains the appropriate USI subtable row
5. The file system on AMP 3 locates the appropriate USI subtable using the USI Table ID
6. The file system on AMP 3 uses the USI row ID to locate the appropriate index row in the subtable
7. This operation might require a search through a number of rows with the same row hash value before the row with the desired value is located
8. AMP 3 reads the base table row ID from the USI row and distributes a message containing the base table ID and the row ID for the requested row across the BYNET to AMP 10, which contains the requested base table row
9. The file system uses the row ID to locate the base table row

The process used by this example for locating a row using the NUSI value CA is as follows:
1. After checking the syntax and lexicon of the query, the Parser looks up the Table ID for the NUSI subtable that contains the NUSI value CA
2. The hashing algorithm hashes the NUSI value
3. The Generator creates an AMP steps message containing the NUSI Table ID (734596), NUSI row hash value (53), and NUSI data value (CA), and then the Dispatcher distributes it across the BYNET to all AMPs
4. The file system on a receiving AMP locates the appropriate NUSI subtable using the NUSI Table ID
5. The file system on a receiving AMP uses the NUSI row hash value to locate the appropriate index row in the subtable
6. If there is a NUSI row, its table row ID list is scanned for base table row IDs
7. The file system uses the row IDs to locate the base table rows containing the NUSI value CA
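The difference between the two access paths (a USI touches at most two AMPs, a NUSI touches all of them) can be sketched as follows. Everything here is illustrative: the table data, the column roles, the four-AMP configuration, and the CRC-32 stand-in for the hashing algorithm are assumptions, and the helper names are hypothetical.

```python
import zlib

N_AMPS = 4


def amp_for(value) -> int:
    """Stand-in for hashing a value to its owning AMP (CRC-32, assumed)."""
    return zlib.crc32(str(value).encode("utf-8")) % N_AMPS


# Hypothetical base table rows: (emp_no [UPI], ssn [USI], state [NUSI]).
table = [(1, "111", "CA"), (2, "222", "NY"), (3, "333", "CA"), (4, "444", "TX")]

base = {a: {} for a in range(N_AMPS)}   # per-AMP base rows keyed by PI value
usi = {a: {} for a in range(N_AMPS)}    # USI subtable rows, hashed on USI value
nusi = {a: {} for a in range(N_AMPS)}   # NUSI subtable rows, local to base rows

for emp_no, ssn, state in table:
    home = amp_for(emp_no)
    base[home][emp_no] = (emp_no, ssn, state)
    usi[amp_for(ssn)][ssn] = (home, emp_no)           # points at the base row
    nusi[home].setdefault(state, []).append(emp_no)   # row ID list per value


def usi_lookup(ssn):
    """At most two AMPs: the USI subtable AMP, then the base row AMP."""
    index_amp = amp_for(ssn)
    entry = usi[index_amp].get(ssn)
    if entry is None:
        return None, {index_amp}
    home, emp_no = entry
    return base[home][emp_no], {index_amp, home}


def nusi_lookup(state):
    """All AMPs: each one probes its own local NUSI subtable."""
    found = []
    for a in range(N_AMPS):
        for emp_no in nusi[a].get(state, []):
            found.append(base[a][emp_no])
    return sorted(found), set(range(N_AMPS))
```

The USI subtable row is hashed on the index value, so it usually sits on a different AMP from the base row; the NUSI subtable rows stay on the same AMP as the base rows they describe, which is why every AMP must be asked.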
EXPLAIN SELECT * FROM t1 WHERE j = 100;

1) First, we do a two-AMP RETRIEVE step from t1 by way of unique index # 4 "t1.j = 100" with no residual conditions. The estimated time for this step is 0.02 seconds.

CREATE MULTISET TABLE t2, NO FALLBACK, NO BEFORE JOURNAL, NO AFTER JOURNAL
  (i INTEGER NOT NULL,
   j INTEGER NOT NULL,
   a CHAR(10))
UNIQUE PRIMARY INDEX upi_t2 (i),
INDEX nusi_t2_01 (j);

i    j    a
100  100  a
200  100  a
300  300  a
400  400  a

EXPLAIN SELECT * FROM t2 WHERE j = 100;

1) We do an all-AMPs RETRIEVE step from t2 by way of an all-rows scan with a condition of ("t2.j = 100") into Spool 1 (group_amps), which is built locally on the AMPs. The size of Spool 1 is estimated with low confidence to be 2 rows. The estimated time for this step is 0.03 seconds.
Value-Ordered NUSI
- Value-ordered NUSIs are very efficient for range conditions
- Because the NUSI rows are sorted by data value, it is possible to search only a portion of the index subtable for a given range of key values
Examples:

CREATE INDEX Idx_Date (o_orderdate) ORDER BY VALUES (o_orderdate) ON Orders;

SELECT *
FROM Orders
WHERE o_orderdate BETWEEN DATE '1997-10-01' AND DATE '1997-10-07';
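Why value ordering helps a range condition can be seen with a binary search over sorted index rows. This is a sketch under assumptions: the index rows and row IDs are invented, and one Python list stands in for the value-ordered NUSI subtable on a single AMP.

```python
from bisect import bisect_left, bisect_right
from datetime import date

# Hypothetical value-ordered index rows on one AMP: (o_orderdate, base row id).
index_rows = sorted([
    (date(1997, 9, 28), 101),
    (date(1997, 10, 1), 102),
    (date(1997, 10, 3), 103),
    (date(1997, 10, 7), 104),
    (date(1997, 10, 9), 105),
    (date(1997, 10, 3), 106),
])
keys = [d for d, _ in index_rows]


def range_lookup(lo: date, hi: date):
    """Return (row ids, index rows read) for lo <= o_orderdate <= hi."""
    start = bisect_left(keys, lo)    # first index row with key >= lo
    stop = bisect_right(keys, hi)    # one past the last index row with key <= hi
    return [rid for _, rid in index_rows[start:stop]], stop - start


row_ids, rows_read = range_lookup(date(1997, 10, 1), date(1997, 10, 7))
```

Only the slice between the two bisect positions is read; a hash-ordered subtable would have no such contiguous slice for a date range.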
NUSI Bit-Mapping
Bit mapping is a technique used by the Optimizer to link several weakly selective indexes so that the combined result drastically reduces the number of base rows that must be accessed to retrieve the desired data.
- Teradata performs NUSI bit mapping only when weakly selective indexed conditions are ANDed and their composite selectivity is strong
- The Optimizer instructs each AMP to construct bit maps to determine which rowIDs its local NUSI rows have in common, and then to access just those rows, applying the conditions to them exclusively
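The intersection step can be sketched with Python integers used as bit maps. The three row ID lists are invented sample data for one AMP, and the helper names are hypothetical; the sketch only shows the ANDing idea, not how Teradata stores its bit maps.

```python
def to_bitmap(row_ids):
    """Set bit i for every row id i in the list."""
    bits = 0
    for rid in row_ids:
        bits |= 1 << rid
    return bits


def from_bitmap(bits):
    """Return the row ids whose bits are set, in ascending order."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]


# Hypothetical local NUSI row-ID lists on one AMP, one per ANDed condition.
nusi_color_red = [1, 4, 7, 9, 12, 15]
nusi_size_large = [2, 4, 8, 9, 15, 16]
nusi_region_west = [0, 4, 5, 9, 11]

# AND the bit maps: only rows present in every index survive.
common = (
    to_bitmap(nusi_color_red)
    & to_bitmap(nusi_size_large)
    & to_bitmap(nusi_region_west)
)
rows_to_read = from_bitmap(common)
```

Each condition alone qualifies five or six rows, but only two base rows need to be read once the bit maps are combined.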
Covering Index
An index is said to be covering if all of the columns requested in a query are also available from an existing index subtable, making it unnecessary to access the base table rows to complete the query. Example:

Simple Query Considered for Index Covering:

CREATE INDEX IdxOrd (o_orderkey, o_date, o_totalprice) ON ORDERS;

SELECT o_date, AVG(o_totalprice)
FROM ORDERS
WHERE o_orderkey > 1000
GROUP BY o_date;

Aggregate Query Considered for Index Covering:

CREATE INDEX IdxEmployee (DeptNo) ON Employee;

SELECT DeptNo, COUNT(*)
FROM Employee
GROUP BY DeptNo;
Join Index
- Join indexes provide the benefits of denormalization without affecting the normalization of the physical and logical database models: the base tables stay normalized
- They can store aggregated data, as a Fact table does in Dimensional Modeling
- Unlike traditional indexes, join indexes do not store pointers to their associated base table rows. Instead, they are generally used as a fast path final access point that eliminates the need to access and join the base tables they represent. They substitute for, rather than point to, base table rows
- The only exception is the case where an index partially covers a query. If the index is defined using either the ROWID keyword or the UPI of its base table as one of its columns, it can be joined with the base table to cover the query
- Statistics should be collected on a Join Index so that the Optimizer has up-to-date information
- A Join Index adds overhead when the table(s) in its definition are updated, because the JI is maintained at the same time
- A user cannot directly select from a Join Index
Hash Index
- Hash indexes are file structures that share properties with both single-table join indexes and secondary indexes
- Hash indexes can optionally be specified to be distributed in such a way that their rows are AMP-local with their associated base table rows
- They can also provide a transparent direct access path to those base table rows to complete a query only partially covered by the index
Example:

CREATE TABLE Orders
  (o_orderkey INTEGER NOT NULL,
   o_custkey INTEGER,
   o_orderstatus CHARACTER(1) CASESPECIFIC,
   o_totalprice DECIMAL(13,2) NOT NULL,
   o_orderdate DATE FORMAT 'yyyy-mm-dd' NOT NULL,
   o_orderpriority CHARACTER(21),
   o_clerk CHARACTER(16),
   o_shippriority INTEGER,
   o_comment VARCHAR(79))
UNIQUE PRIMARY INDEX (o_orderkey);

CREATE HASH INDEX OrdHIdx_1 (o_orderdate) ON orders
BY (o_orderdate)
ORDER BY (o_orderdate);
Teradata Joins
Joins available to the user:
- Left Outer Join
- Right Outer Join
- Full Outer Join
- Inner Join
- Cross Join
- Self Join

Teradata Internal Joins:
- Product Join
- Merge Join
- Nested Join
- Hash Join
- Self Join
- Correlated Join

Merge Join: rows are compared based on the hash values of the joining columns. Sorting is performed before the comparison. The comparison involves fewer rows than a Product Join.
Different methods to bring matching hash values together:
- Redistribution of rows based on hash values
- Duplication of rows based on hash values
- Matching Indexes
[Figure: merge join of the Employee table (Dept is a foreign key) with the Department table (Dept is the UPI and primary key)]

SELECT Name, DeptName, Loc
FROM Employee, Department
WHERE Employee.DeptNo = Department.DeptNo;

- DeptNo in the Employee table is not a UPI but a foreign key, so the Employee table is hash-redistributed on DeptNo
- Hash redistribution takes place local to the AMP
- Rows are sorted before the join condition is applied

Spool file after locally copying and sorting on Employee.Dept row hash:

EmpNo  Name    Dept
8      BAKER   310
6      FOSTER  400
3      JONES   310
4      CLAY    400
1      BROWN   200
7      GRAY    310
2      SMITH   310
5      PETER   150

Department rows:

Dept  DeptName
150   PAYROLL
310   MFG
200   FINANCE
400   DELIVERY
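The sort-then-merge idea can be sketched with the Employee and Department rows from the figure. Two simplifications are assumed: the sketch runs on one list rather than per AMP, and it sorts on DeptNo itself rather than on the DeptNo row hash. The Loc column is omitted because the figure shows no values for it.

```python
# (EmpNo, Name, DeptNo) and (DeptNo, DeptName), as shown in the figure.
employee = [
    (8, "BAKER", 310), (6, "FOSTER", 400), (3, "JONES", 310), (4, "CLAY", 400),
    (1, "BROWN", 200), (7, "GRAY", 310), (2, "SMITH", 310), (5, "PETER", 150),
]
department = [(150, "PAYROLL"), (310, "MFG"), (200, "FINANCE"), (400, "DELIVERY")]


def merge_join(emps, depts):
    """Sort-merge join on DeptNo; DeptNo is unique on the department side."""
    emps = sorted(emps, key=lambda e: e[2])     # sort step (stand-in for row hash)
    depts = sorted(depts, key=lambda d: d[0])
    out, j = [], 0
    for _emp_no, name, dept_no in emps:
        # Advance the department cursor; it never moves backwards.
        while j < len(depts) and depts[j][0] < dept_no:
            j += 1
        if j < len(depts) and depts[j][0] == dept_no:
            out.append((name, depts[j][1]))
    return out


joined = merge_join(employee, department)
```

Because both inputs are sorted on the join key, each side is read once; a product join would instead compare every employee row with every department row.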
Nested Join
A nested join is a join for which the WHERE conditions specify a constant value for a unique index in one table, and those conditions also match some column of that single row to the primary or secondary index of the second table.
Example:

SELECT DeptName, Name, YrsExp
FROM Employee, Department
WHERE Employee.EmpNo = Department.MgrNo
AND Department.DeptNo = 100;
Correlated Queries
A correlated query is a subquery whose outer query results are processed a row at a time against the subquery result.

SELECT last_name, department_number AS DEPTNO, salary_amount
FROM employee ee
WHERE salary_amount =
  (SELECT MAX(salary_amount)
   FROM employee em
   WHERE em.department_number = ee.department_number);

Steps of execution:
1. Read an employee row
2. Get the max salary for his/her department from the subquery
3. Compare his/her salary to the max salary
4. If equal, output this row
5. Go to 1
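The five execution steps translate almost line for line into a loop. The employee rows below are invented sample data, and the function names are hypothetical; the sketch follows the logical steps listed above, not the plan an optimizer would actually choose.

```python
# Hypothetical rows: (last_name, department_number, salary_amount).
employee = [
    ("Adams", 100, 50000),
    ("Baker", 100, 62000),
    ("Clark", 200, 48000),
    ("Davis", 200, 48000),
    ("Evans", 300, 71000),
]


def max_salary_in(dept):
    """The subquery: MAX(salary_amount) for one department."""
    return max(sal for _, d, sal in employee if d == dept)


def top_earners():
    result = []
    for last_name, dept, salary in employee:   # 1. read an employee row
        dept_max = max_salary_in(dept)         # 2. max salary for the department
        if salary == dept_max:                 # 3. compare salary to the max
            result.append((last_name, dept, salary))  # 4. if equal, output row
    return result                              # 5. loop continues with next row


top = top_earners()
```

Ties are kept: both employees in department 200 earn the department maximum, so both rows are returned.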
Volatile Tables
- Hold intermediate results of queries
- Valid for one session only
- Not available after the session is restarted, for example during a database restart
- No access logging can be done
- Referential integrity cannot be implemented, and index support is restricted (for example, named indexes are not allowed)
- Not stored in the database schema

CREATE VOLATILE TABLE vt_deptsal, LOG
  (deptno SMALLINT,
   avgsal DEC(9,2),
   maxsal DEC(9,2),
   minsal DEC(9,2),
   sumsal DEC(9,2),
   empcnt SMALLINT)
ON COMMIT PRESERVE ROWS;

INSERT INTO vt_deptsal
SELECT dept, AVG(sal), MAX(sal), MIN(sal), SUM(sal), COUNT(emp)
FROM emp
GROUP BY 1;
Derived Tables
Derived tables are temporary tables that are created in spool and dropped when the query completes.
Example: employees whose salary is greater than the company average

SELECT last_name, salary_amount, avgsal
FROM (SELECT AVG(salary_amount) FROM employee) my_temp(avgsal), employee
WHERE salary_amount > avgsal
ORDER BY 2 DESC;
Teradata Macro
- A macro consists of one or more statements that are executed in a single transaction
- A macro is similar to a multi-statement request: either all statements in the request complete successfully, or the entire request is aborted
- All statements can be executed in parallel, making use of the parallel processing architecture of Teradata and thus reducing processing time
- Macros simplify an operation that is complex or must be performed frequently
- Can return a multi-row answer set
- Typically called from a trigger
Creating a Macro:

CREATE MACRO NewEmpAdd (id INTEGER, name VARCHAR(50)) AS
  ( INSERT INTO EMPLOYEE VALUES (:id, :name); );

EXEC NewEmpAdd (25, 'ABC');
Locking in Teradata
Default locking mechanism in Teradata:
- READers can simultaneously READ the same database object
- A READer must wait while a WRITE operation is in effect on the same database object
- A WRITEr must wait while a READ operation is in effect on the same database object
- Everybody must wait while there is an EXCLUSIVE lock on the database object

This affects transaction concurrency. The solution is the ACCESS lock:
- Downgrade the severity of the lock by explicit specification: LOCKING t1 FOR ACCESS
- The cost is the chance of uncommitted dependencies (dirty reads)
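The waiting rules above amount to a small compatibility table. The sketch below encodes them; the WRITE-versus-WRITE entry (a second writer also waits) is not spelled out in the list and is added here as an assumption consistent with standard lock behavior. `COMPATIBLE` and `can_grant` are hypothetical names.

```python
# For each requested lock, the set of already-held locks it can coexist with.
COMPATIBLE = {
    "ACCESS": {"ACCESS", "READ", "WRITE"},  # blocked only by EXCLUSIVE
    "READ": {"ACCESS", "READ"},             # readers share, writers block them
    "WRITE": {"ACCESS"},                    # waits for readers and other writers
    "EXCLUSIVE": set(),                     # waits for everybody
}


def can_grant(requested: str, held: list) -> bool:
    """True if the requested lock is compatible with every lock already held."""
    return all(h in COMPATIBLE[requested] for h in held)


# A reporting query that downgrades to ACCESS is not blocked by a writer,
# at the price of possibly seeing uncommitted data (a dirty read).
reader_blocked = not can_grant("READ", ["WRITE"])
access_blocked = not can_grant("ACCESS", ["WRITE"])
```

This shows why LOCKING ... FOR ACCESS improves concurrency: it is the only lock that can be granted while a WRITE lock is held.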
Locking Severity
The available lock severities, from most restrictive to least restrictive, are as follows:
- EXCLUSIVE
- WRITE
- READ
- ACCESS
Locking Level
The locking level is the database object on which the lock is placed (database, table, or row hash).
Exercise
Requirements:
- SL-to-PL and PL-to-Aggregate processing should run in parallel
- DML in the PL layer caused by SL-to-PL processing should not hold up Aggregate processing, and vice versa
- Data integrity should be maintained
- Bad user queries that take more restrictive locks hold up other processes
What are the design options? Take realistic application-level design considerations:
- Maintain a time lag (t) between the SL-to-PL and PL-to-Aggregate layers
  - Assumption: no data can remain uncommitted for more than t in the PL layer
  - Downgrade the lock to ACCESS while accessing the PL layer, and read data only up to (max timestamp - t)
- Design and optimize queries to have the least restrictive locking level
Statistics
Statistics on a column or index of a table give the Optimizer details about:
- Total number of rows
- Total values for the column
- Unique values for the column
- Null values of the column
- Maximum number of rows per value
- Minimum number of rows per value
- Minimum value for an interval
- Maximum value for an interval
- Number of intervals

Using the statistics values, the Optimizer chooses the best plan for executing the query.
Statistics should be updated regularly so that the Optimizer has current information about the table.
Random AMP Samples (RAS): if statistics are not available, the Teradata Optimizer uses Random AMP Samples, which is information collected from a single AMP about the table columns and the data stored in them.
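Several of the listed figures are simple aggregates over a column, which the following sketch computes for a small sample. It is illustrative only: `column_stats` is a hypothetical helper, the sample values are invented, and interval histograms are left out.

```python
from collections import Counter


def column_stats(values):
    """Compute a few of the column-level figures listed above."""
    non_null = [v for v in values if v is not None]
    freq = Counter(non_null)
    return {
        "total_rows": len(values),
        "null_values": len(values) - len(non_null),
        "unique_values": len(freq),
        "max_rows_per_value": max(freq.values(), default=0),
        "min_rows_per_value": min(freq.values(), default=0),
        "min_value": min(non_null, default=None),
        "max_value": max(non_null, default=None),
    }


# Hypothetical column contents, including one NULL.
stats = column_stats([100, 100, 300, 400, None])
```

With figures like these the Optimizer can estimate, for example, that an equality condition on the value 100 returns about two rows, instead of guessing from a sample.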
Collect Statistics
Statistics can be collected on:
- A single column
- The Primary Index
- Secondary Indexes
- The Primary Index of a Join Index
- The Primary Index of a Hash Index
- Columns that are part of a join condition in a query

EXPLAIN SEL * FROM t1 WHERE i > 200;

3) We do an all-AMPs RETRIEVE step from t1 by way of an all-rows scan with a condition of ("t1.i > 200") into Spool 1 (group_amps), which is built locally on the AMPs. The size of Spool 1 is estimated with no confidence to be 1 row. The estimated time for this step is 0.03 seconds.
Explain
EXPLAIN <query>

EXPLAIN describes the execution plan that the Optimizer has prepared for a query. It reports:
- The number of steps involved in the execution of the query
- The tables/views to be used in the query
- Parallel steps
- The internal joins to be used
- The row estimate for each step
- The time estimate for each step

EXPLAIN output can be viewed through BTEQ, SQL Assistant, and Visual Explain. Visual Explain provides a graphical version of the explain steps, which is more readable. It can also compare the explains for two queries.
Client Software
- SQL Client / Queryman: based on ODBC
- Aqua Studio: based on JDBC/ODBC
- BTEQ: interactive and batch query processor/report generator