Mod 1 - Welcome to the Teradata Database

Objectives

After completing this module, you should be able to:
- Describe the Teradata Database.
- Describe the advantages of the Teradata Database.
- Define the terms associated with relational databases.
- Describe the advantages of a relational database.

HOT TIP: This module contains links to important supplemental course information. Please be sure to click on each hotword link to capture all of the training content.

What is the Teradata Database?

The Teradata Database is a relational database management system (RDBMS) that drives a company's data warehouse. It provides the foundation to give a company the power to grow, to compete in today's dynamic marketplace, and to evolve the business by getting answers to a new generation of questions. The Teradata Database's scalability allows the system to grow as the business grows, from gigabytes to terabytes and beyond. Its unique technology has been proven at customer sites across industries and around the world.

The Teradata Database is an open system, compliant with ANSI industry standards. It is currently available on the following industry-standard operating systems: UNIX MP-RAS (discontinued with Teradata 13.10), Microsoft Windows 2000, Microsoft Windows 2003, and Novell SUSE Linux. For this reason, Teradata is considered an open architecture.

The Teradata Database is a large database server that accommodates multiple client applications making inquiries against it concurrently. Various client platforms access the database through a TCP/IP connection or across an IBM mainframe channel connection. The Teradata Database is accessed using SQL (Structured Query Language), the industry-standard language for communicating with an RDBMS.

The ability to manage large amounts of data is accomplished using the concept of parallelism, wherein many individual processors perform smaller tasks concurrently to accomplish an operation against a huge repository of data. To date, only parallel architectures can handle databases of this size.

How Is The Teradata Database Used?


Each Teradata Database implementation can model a company's business. The ability to keep up with rapid changes in today's business environment makes the Teradata Database an ideal foundation for many applications, including:
- Enterprise data warehousing
- Active data warehousing
- Customer relationship management
- Internet and E-Business
- Data marts

Just for Fun . . .


Based on what you know so far, what do you think are some Teradata Database features that make it so successful in today's business environment? (Details on the following are coming up next.)
A. Scalability.
B. Single data store.
C. High degree of parallelism.
D. Ability to model the business.
E. All of the above.

Feedback:

That's correct! Teradata has all these features.

What Makes the Teradata Database Unique?

In this Web-Based Training, you will learn about many features that make the Teradata Database, an RDBMS, right for business-critical applications. To start with, this section covers these key features:
- Single data store
- Scalability
- Unconditional parallelism (parallel architecture)
- Ability to model the business
- Mature, parallel-aware Optimizer

Single Data Store


The Teradata Database acts as a single data store, with multiple client applications making inquiries against it concurrently. Instead of replicating a database for different purposes, with the Teradata Database you store the data once and use it for many applications. The Teradata Database provides the same connectivity for an entry-level system as it does for a massive enterprise data warehouse.

Scalability

"Linear scalability" means that as you add components to the system, the performance increase is linear. Adding components allows the system to accommodate increased workload without decreased throughput. Linear scalability enables the system to grow to support more users/data/queries/complexity of queries without experiencing performance degradation. As the configuration grows, performance increase is linear, slope of 1. The Teradata Database was the first commercial database system to scale to and support a trillion bytes of data. The chart below lists the meaning of the prefixes: Prefix Exponent Meaning kilomegagigatera103 106 109 1012 1,000 (thousand) 1,000,000 (million) 1,000,000,000 (billion) 1,000,000,000,000 (trillion)

file://C:\Documents and Settings\PJ186002\Desktop\teradata intoduction wbt.htm

10/20/2011

Page 4 of 137

petaexa-

1015 1018

1,000,000,000,000,000 (quadrillion) 1,000,000,000,000,000,000 (quintillion)

The Teradata Database can scale from 100 gigabytes to over 100 terabytes of data on a single system without losing any performance capability. This scalability provides investment protection for customers' growth and application development. The Teradata Database is the only database that is predictably scalable in multiple dimensions, and this extends to data loading with the use of parallel loading utilities. The Teradata Database provides automatic data distribution, and no reorganizations of data are needed. The Teradata Database is scalable in multiple ways, including hardware, query complexity, and number of concurrent users.

Hardware

Growth is a fundamental goal of business. An MPP Teradata Database system easily accommodates that growth whenever it happens. The Teradata Database runs on highly optimized Teradata servers in the following configurations:
- SMP - Symmetric multiprocessing platforms manage gigabytes of data to support an entry-level data warehousing system.
- MPP - Massively parallel processing systems can manage hundreds of terabytes of data. You can start with a couple of nodes, and later expand the system as your business grows.

With the Teradata Database, you can increase the size of your system without replacing:
- Databases - When you expand your system, the data is automatically redistributed through the reconfiguration process, without manual interventions such as sorting, unloading and reloading, or partitioning.
- Platforms - The Teradata Database's modular structure allows you to add components to your existing system.
- Data model - The physical and logical data models remain the same regardless of data volume.
- Applications - Applications you develop for Teradata Database configurations will continue to work as the system grows, protecting your investment in application development.

Complexity

The Teradata Database is adept at complex data models that satisfy the information needs throughout an enterprise. It efficiently processes increasingly sophisticated business questions as users realize the value of the answers they are getting. It can perform large aggregations during query run time and up to 64 joins in a single query.

Concurrent Users

As is proven in every Teradata Database benchmark, the Teradata Database can handle the most concurrent users, who are often running multiple, complex queries. The Teradata Database has the proven ability to handle from hundreds to thousands of users on the system simultaneously. Adding many concurrent users typically reduces system performance. However, adding more components can enable the system to accommodate the new users with equal or even better performance.


Unconditional Parallelism

The Teradata Database provides exceptional performance, using parallelism to achieve a single answer faster than a non-parallel system. Parallelism uses multiple processors working together to accomplish a task quickly.

An example of parallelism can be seen at an amusement park, as guests stand in line for an attraction such as a roller coaster. As the line approaches the boarding platform, it typically splits into multiple, parallel lines so that groups of people can step into their seats simultaneously. The line moves faster than if the guests stepped onto the attraction one at a time. At the biggest amusement parks, the parallel loading of the rides becomes essential to their successful operation.

Parallelism is evident throughout a Teradata Database, from the architecture to data loading to complex request processing. The Teradata Database processes requests in parallel without mandatory query tuning. Its parallelism does not depend on limited data quantity, column range constraints, or specialized data models: the Teradata Database has "unconditional parallelism."

Teradata supports ad-hoc queries using ANSI-standard SQL and includes SQL-ready database management information (log files). This allows Teradata to interface with third-party Business Intelligence (BI) tools and to accept queries submitted from other database systems.

Ability to Model the Business

A data warehouse built on a business model contains information from across the enterprise. Individual departments can use their own assumptions and views of the data for analysis, yet these varying perspectives have a common basis for a "single view of the business."


With the Teradata Database's centrally located, logical architecture, companies can get a cohesive view of their operations across functional areas to:
- Find out which divisions share customers.
- Track products throughout the supply chain, from initial manufacture, to inventory, to sale, to delivery, to maintenance, to customer satisfaction.
- Analyze relationships between results of different departments.
- Determine if a customer on the phone has used the company's website.
- Vary levels of service based on a customer's profitability.

You get consistent answers from the different viewpoints above using a single business model, not functional models for different departments. In a functional model, data is organized according to what is done with it. But what happens if users later want to do some analysis that has never been done before? When a system is optimized for one department's function, the other departments' needs (and future needs) may not be met.

A Teradata Database allows the data to represent a business model, with data organized according to what it represents, not how it is accessed, so it is easy to understand. The data model should be designed without regard to usage and be the same regardless of data volume. With a Teradata Database as the enterprise data warehouse, users can ask new questions of the data that were never anticipated, throughout the business cycle and even through changes in the business environment.

A key Teradata Database strength is its ability to model the customer's business. The Teradata Database supports business models that are truly normalized, avoiding the costly star schema and snowflake implementations that many other database vendors use. The Teradata Database can support star schema and other types of relational modeling, but Third Normal Form is the method for relational modeling that we recommend to customers. Our competitors typically implement star schema or snowflake models either because they are implementing a set of known queries in a transaction processing environment, or because their architecture limits them to that type of model.

Normalization is the process of reducing a complex data structure into a simple, stable one. Generally this process involves removing redundant attributes, keys, and relationships from the conceptual data model. The Teradata Database supports normalized logical models because it is able to perform 64-table joins and large aggregations during queries.

Mature, Parallel-Aware Optimizer


The Teradata Database Optimizer is the most robust in the industry, able to handle:
- Multiple complex queries
- Multiple joins per query
- Unlimited ad-hoc processing

The Optimizer is parallel-aware, meaning that it has knowledge of system components (how many nodes, vprocs, etc.). It determines the least expensive plan (time-wise) to process queries fast and in parallel. The Optimizer is further explained in the next module.


What is a Relational Database?

A database is a collection of permanently stored data that is:
- Logically related (the data was created for a specific purpose).
- Shared (many users may access the data).
- Protected (access to the data is controlled).
- Managed (the data integrity and value are maintained).

The Teradata Database is a relational database. Relational databases are based on the relational model, which is founded on mathematical Set Theory. The relational model uses and extends many principles of Set Theory to provide a disciplined approach to data management. Users and applications access data in an RDBMS using industry-standard SQL statements. SQL is a set-oriented language for relational database management.

A relational database is designed to:
- Represent a business and its business practices.
- Be extremely flexible in the way that data can be selected and used.
- Be easy to understand.
- Model the business, not the applications.
- Allow businesses to quickly respond to changing conditions.

In addition, a single copy of the data can serve multiple purposes.

Relational databases present data as a set of tables. A table is a two-dimensional representation of data that consists of rows and columns. According to the relational model, a valid table does not have to be populated with data rows; it just needs to be defined with at least one column.
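To make that last point concrete, here is a minimal sketch in ANSI SQL (the table and column names are hypothetical, not from the course): a table is valid as soon as it is defined with at least one column, even before any rows are inserted.

    -- A valid (if minimal) relational table: one column, zero rows.
    CREATE TABLE Part
    ( Part_Name VARCHAR(30) );

    -- Rows can be added later; an empty table is still a valid table.
    SELECT COUNT(*) FROM Part;   -- returns 0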

Rows
Each row contains all the columns in the table. A row is one instance of all columns, and each table can contain only one row format. A row is a single entity in the table; the order of rows is arbitrary and does not imply priority, hierarchy, or significance. Each row represents an occurrence of an entity defined by the table. An entity is a person, place, thing, or event about which the table contains information. In this example, the entity is the employee, and each row represents a single employee.


Columns
Each column contains "like data," such as only part names, only supplier names, or only employee numbers. In the example below, the Last_Name column contains last names only, and nothing else. The data in the columns is atomic: a telephone number, for example, might be divided into three columns (the area code, the prefix, and the suffix) so that the customer data can be analyzed by area code. Missing data values are represented by "nulls" (the absence of a value). Within a table, the column position is arbitrary.
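A short sketch of what atomic columns look like in a table definition (a hypothetical Customer table; the names and data types are illustrative, not from the course):

    CREATE TABLE Customer
    ( Customer_ID     INTEGER NOT NULL,
      Last_Name       VARCHAR(30),
      -- The phone number is stored as three atomic columns rather than one,
      -- so queries can analyze customers by area code alone.
      Phone_Area_Code CHAR(3),
      Phone_Prefix    CHAR(3),
      Phone_Suffix    CHAR(4) );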

Answering Questions with a Relational Database


A relational database is a set of logically related tables. Tables are logically related to each other by a common field, so information such as customer telephone numbers and addresses can exist in one table, yet be accessible for multiple purposes. Relational databases do not use access paths to locate data; data connections are made by data values. Data connections are made by matching values in one column with the values in a corresponding column in another table. In relational terminology, this connection is referred to as a join.


The diagrams below show how the values in one table may be matched to values in another table. The tables below show customer, order, and billing statement data, related by a common field, Customer ID. The common field of Customer ID lets you look up information such as a customer name for a particular statement number, even though the data exists in two different tables. This is done by performing a join between the tables using the common field, Customer ID. Here are a few other examples of questions that can be answered:
- "How many mats did customer Wood purchase?"
- "What is the statement number for O'Day's purchase of $45.30?"
- "For statement #344627, what state did the customer live in?"

To sum up, a relational database is a collection of tables. The data contained in the tables can be associated using columns with matching data values.
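As an illustration, a join like the one described above might look like this in SQL (a minimal sketch; the table and column names are assumptions, not the course's actual tables):

    -- Relate customers to statements through the common field, Customer_ID.
    SELECT c.Customer_Name,
           s.Statement_Number
    FROM   Customer  c
    JOIN   Statement s
      ON   c.Customer_ID = s.Customer_ID;

The join is made purely by matching data values in the common column; no predefined access path between the tables is required.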

Logical/Relational Modeling
The logical model should be independent of usage. A variety of front-end tools can be accommodated so that the database can be created quickly. The design of the data model is the same regardless of data volume. An enterprise model is one that provides the ability to look across functional processes.

Normalization is the process of reducing a complex data structure into a simple, stable one. Generally this process involves removing redundant attributes, keys, and relationships from the conceptual data model. Normalization theory is constructed around the concept of normal forms that define a system of constraints. If a relation meets the constraints of a particular normal form, we say that the relation is "in normal form."

The intent of normalizing a relational database is to put one fact in one place. By decomposing your relations into normalized forms, you can eliminate the majority of update anomalies that can occur when data is stored in de-normalized tables. A slightly more detailed statement of this principle would be the definition of a relation (or table) in a normalized relational database: a relation consists of a primary key, which uniquely identifies any tuple, and zero or more additional attributes, each of which represents a single-valued (atomic) property of the entity type identified by the primary key.

A tuple is an ordered set of values, with the values often separated by commas. Common uses for the tuple as a data type are:
1. Passing a string of parameters from one program to another.
2. Representing a set of value attributes in a relational database.
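Here is a small sketch of what "one fact in one place" means in practice (hypothetical tables, for illustration only):

    -- De-normalized: the customer's name is repeated on every order row,
    -- so renaming a customer means updating many rows (an update anomaly).
    CREATE TABLE Order_Denormalized
    ( Order_Number  INTEGER NOT NULL,
      Customer_ID   INTEGER,
      Customer_Name VARCHAR(30),    -- a fact about the customer, not the order
      Order_Total   DECIMAL(10,2) );

    -- Normalized: each fact is stored once, in the table it belongs to.
    CREATE TABLE Customer
    ( Customer_ID   INTEGER NOT NULL,
      Customer_Name VARCHAR(30) );

    CREATE TABLE Orders
    ( Order_Number  INTEGER NOT NULL,
      Customer_ID   INTEGER,        -- relates each order to its customer
      Order_Total   DECIMAL(10,2) );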

3NF vs. Star Schema Model


As a model is refined, it passes through different states, which can be referred to as normal forms. A normalized model includes:
- Entities
- Attributes
- Relationships

First normal form rules state that each and every attribute within an entity instance has one and only one value. No repeating groups are allowed within entities.

Second normal form requires that the entity conform to the first normal form rules, and that every non-key attribute within an entity be fully dependent upon the entire key (key attributes) of the entity, not a subset of the key.

Third normal form requires that the entity conform to the first and second normal form rules. In addition, no non-key attribute within an entity may be functionally dependent upon another non-key attribute within the same entity.

While the Teradata Database can support any data model that can be processed via SQL, an advantage of a normalized data model is the ability to support previously unknown (ad-hoc) questions.

Star Schema

The star schema (sometimes referenced as star join schema) is the simplest style of data warehouse schema. The star schema consists of a few fact tables (possibly only one, justifying the name) referencing any number of dimension tables. The star schema is considered an important special case of the snowflake schema. Some characteristics of a Star Schema model include:
- They tend to have fewer entities.
- They advocate a greater level of denormalization.


Primary Key

In the relational model, a Primary Key (PK) is used to designate a unique identifier for each row when you design a table. A Primary Key can be composed of one or more columns. In the example below, the Primary Key is the employee number.

Primary Key Rules


Rules governing how Primary Keys must be defined and how they function are:
- Rule 1: A Primary Key is required.
- Rule 2: A Primary Key value must be unique.
- Rule 3: The Primary Key value cannot be NULL.
- Rule 4: The Primary Key value should not be changed.
- Rule 5: The Primary Key column should not be changed.
- Rule 6: A Primary Key may be any number of columns.

Rule 1: A Primary Key is Required


In the logical model, each table requires a Primary Key because that is how each row is able to be uniquely identified. Each table must have one, and only one, Primary Key. In any given row, the value of the Primary Key uniquely identifies the row. The Primary Key may span more than one column, but even then, there is only one Primary Key.


Rule 2: Unique PK
Within the column(s) designated as the Primary Key, the values in each row must be unique. No duplicate values are allowed. The Primary Key's purpose is to uniquely identify a row. In a multi-column Primary Key, the combined value of the columns must be unique, even if an individual column in the Primary Key has duplicate values.

Rule 3: PK Cannot Be NULL


Within the Primary Key column, each row must have a Primary Key value and cannot be NULL (without a value). Because NULL is indeterminate, it cannot "identify" anything.


Rule 4: PK Value Should Not Change


Primary Key values should not be changed. If you changed a Primary Key, you would lose all historical tracking of that row.

Rule 5: PK Column Should Not Change


Additionally, the column(s) designated as the Primary Key should not be changed. If you changed the Primary Key column(s), you would lose all the information relating that table to other tables.


Rule 6: No Column Limit


In the relational model, there is no limit to the number of columns that can be designated as the Primary Key, so it may consist of one or more columns. In the example below, the Primary Key consists of three columns: EMPLOYEE NUMBER, LAST NAME, and FIRST NAME.
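The same three-column Primary Key, expressed as a table definition (a sketch; the data types are assumptions, and note that each PK column is also NOT NULL, per Rule 3):

    CREATE TABLE Employee
    ( Employee_Number INTEGER     NOT NULL,
      Last_Name       VARCHAR(30) NOT NULL,
      First_Name      VARCHAR(30) NOT NULL,
      -- One Primary Key spanning three columns: the combined value of the
      -- three columns must be unique in every row (Rule 2).
      PRIMARY KEY (Employee_Number, Last_Name, First_Name) );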

Foreign Key

A Foreign Key (FK) is an identifier that links related tables. A Foreign Key defines how two tables are related to each other. Each Foreign Key references a matching Primary Key in another table in the database. For example, in the table below, the Department Number column that is a Foreign Key actually exists in another table as a Primary Key.


Having tables related to each other gives users the flexibility to look at the data in different ways, without the database administrator having to manage and maintain many tables of duplicate data for different applications.

Foreign Key Rules


Rules governing how Foreign Keys must be defined and how they operate are:
- Rule 1: Foreign Keys are optional.
- Rule 2: A Foreign Key value may be non-unique.
- Rule 3: The Foreign Key value may be NULL.
- Rule 4: The Foreign Key value may be changed.
- Rule 5: A Foreign Key may be any number of columns.
- Rule 6: Each Foreign Key must exist as a Primary Key in a related table.

Rule 1: Optional FKs


Foreign Keys are optional; not all tables have them. Tables that do have them can have multiple Foreign Keys, because a table can relate to many other tables. In fact, a table can have an unlimited number of Foreign Keys. In the example table below:
- The Department Number Foreign Key relates to the Department Number Primary Key in the Department table.
- The Job Code FK relates to the Job Code PK in the Job Code table.

Having tables related to each other makes a relational database flexible so that different users can look up information they need, while simplifying the database administration so the data doesn't have to be duplicated for each purpose or application.

Rule 2: Unique or Non-Unique FKs


Duplicate Foreign Key values are allowed. More than one employee could be assigned to the same department.


Rule 3: FKs Can Be NULL


NULL (missing) Foreign Key values are allowed. For example, under special circumstances, an employee might not be assigned to a department.

Rule 4: FK Value Can Change


Foreign Key values may be changed. For example, if Arnando Villegas moves from Department 403 to Department 587, the Foreign Key value in his row would change.
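That change, as a sketch in SQL (the employee number is hypothetical):

    -- Move an employee from Department 403 to Department 587 by
    -- changing the Foreign Key value in his row.
    UPDATE Employee
    SET    Department_Number = 587
    WHERE  Employee_Number   = 1023;   -- hypothetical employee number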

Rule 5: FK Has No Column Limit


The Foreign Key may consist of one or more columns. A multi-column foreign key is used to relate to a multi-column Primary Key in a related table. In the relational model, there is no limit to the number of columns that can be designated as a Foreign Key.


Rule 6: FK Must Be PK in Related Table


Each Foreign Key must exist as a Primary Key in a related table. A department number that does not exist in the Department Table would be invalid as a Foreign Key value in the Employee Table. This rule can apply even if the Foreign Key is NULL, or missing. Remember, a missing value is defined as a non-value; there is no value present. So the rule could be better stated: if a value exists in the Foreign Key column, it must match a Primary Key value in the related table.
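Rules 2, 3, and 6 together look like this in a table definition (a minimal sketch; the column types are assumptions):

    CREATE TABLE Department
    ( Department_Number INTEGER NOT NULL PRIMARY KEY,
      Department_Name   VARCHAR(30) );

    CREATE TABLE Employee
    ( Employee_Number   INTEGER NOT NULL PRIMARY KEY,
      -- The FK value may repeat (Rule 2) and may be NULL (Rule 3), but any
      -- value present must match a Department row (Rule 6).
      Department_Number INTEGER,
      FOREIGN KEY (Department_Number)
        REFERENCES Department (Department_Number) );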

Just for Fun . . .


To check your understanding of Primary Keys and Foreign Keys, complete this sentence. According to the relational model, a single table can have either: (Choose two.)
A. Multiple primary keys.
B. Multiple foreign keys.
C. No primary keys.
D. No foreign keys.

Feedback:
That's correct! A table can have multiple Foreign Keys (B) or no Foreign Keys at all (D), but it must have one, and only one, Primary Key.

Exercise 1.1

Choose the best answer from the pull-down menu:
A ________ contains "like data."
A ________ can contain only one row format.
A ________ is one instance of all columns in a table.

Feedback:
Answers: a column contains "like data"; a table can contain only one row format; a row is one instance of all columns in a table.

To review these topics, click Rows or Columns.

Exercise 1.2

Which statement is true?
A. A database is a two-dimensional array of rows and columns.
B. A Primary Key must contain one, and only one, column.
C. Foreign Keys have no relationship to existing Primary Key selections.
D. Teradata is an ideal foundation for customer relationship management, e-commerce, and active data warehousing applications.

Feedback:
To review these topics, click How is the Teradata Database Used?, What is a Relational Database?, Primary Key, or Foreign Key.

Exercise 1.3

Create a relationship between the two tables by clicking on:


- The Foreign Key column in the Product table
- The Primary Key column in the Vendor table

Feedback:

To review these topics, click Foreign Key or Primary Key.

Exercise 1.4

Click on the name of the customer who placed order 7324.

Feedback:

To review this topic, click Primary Key or Foreign Key.

Exercise 1.5

How many calendars were shipped on 4/15? (These same tables were used in the previous exercise.)
A. 10
B. 2
C. 40
D. 30

Feedback:

To review this topic, click Primary Key or Foreign Key.

Exercise 1.6

Which one is NOT a unique feature of the Teradata Database?
A. Ability to model the business, with data organized according to what it represents.
B. Provides a mature, parallel-aware Optimizer that chooses the least expensive plan for the SQL request.
C. Provides linear scalability, so there is no performance degradation as you grow the system.
D. Gives each department in the enterprise a self-contained, functional data store for their own assumptions and analysis.
E. Provides automatic and even data distribution for faster query processing via its unconditional parallel architecture.

Feedback:
To review these topics, click Single Data Store, Scalability, Unconditional Parallelism, Ability to Model the Business, and Mature, Parallel-Aware Optimizer.

Exercise 1.7

True or False: The logical model should be independent of usage.
A. True
B. False

Feedback:
To review this topic, click Logical/Relational Modeling.

Mod 2 - Teradata Database and Data Warehouse Architecture

Objectives

After completing this module, you should be able to:
- Identify the different types of enterprise data processing.
- Define a data warehouse, active data warehouse, and a data mart.
- List and define the different types of data marts.
- Explain the advantages of detail data over summary data.
- Describe the overall Teradata Database parallel architecture.
- List and describe major Teradata Database hardware and software components and their functions.
- Explain how the architecture helps to maintain high availability and reliability for Teradata Database users.

HOT TIP: This module contains links to important supplemental course information. Please be sure to click on each hotword link to capture all of the training content.

Evolution to Active Data Warehousing

Data Warehouse Usage Evolution

There is an information evolution happening in the data warehouse environment today. Changing business requirements have placed demands on data warehousing technology to do more things faster. Data warehouses have moved from back room strategic decision support systems to operational, business-critical components of the enterprise. As your company evolves in its use of the data warehouse, what you need from the data warehouse evolves too.


Stage 1 Reporting: The initial stage typically focuses on reporting from a single view of the business to drive decision-making across functional and/or product boundaries. Questions are usually known in advance, such as a weekly sales report.

Stage 2 Analyzing: Focuses on why something happened, such as why sales went down, or discovering patterns in customer buying habits. Users perform ad-hoc analysis, slicing and dicing the data at a detail level, and questions are not known in advance.

Stage 3 Predicting: Analysts utilize the system to leverage information to predict what will happen next in the business, to proactively manage the organization's strategy. This stage requires data mining tools and building predictive models using historical detail. As an example, users can model customer demographics for target marketing.

Stage 4 Operationalizing: Providing access to information for immediate decision-making in the field enters the realm of active data warehousing. Stages 1 to 3 focus on strategic decision-making within an organization; Stage 4 focuses on tactical decision support. Tactical decision support is not focused on developing corporate strategy, but rather on supporting the people in the field who execute it. Examples:
- Inventory management with just-in-time replenishment.
- Scheduling and routing for package delivery.
- Altering a campaign based on current results.

Stage 5 Active Data Warehousing: The larger the role an ADW plays in the operational aspects of decision support, the more incentive the business has to automate the decision processes. You can automate decision-making when a customer interacts with a web site. Interactive customer relationship management (CRM) on a web site or at an ATM is about making decisions to optimize the customer relationship through individualized product offers, pricing, content delivery, and so on. As technology evolves, more and more decisions become executed with event-driven triggers that initiate fully automated decision processes. Example: determine the best offer for a specific customer based on a real-time event, such as a significant ATM deposit.


Active Enterprise Intelligence


Active Enterprise Intelligence is the seamless integration of the ADW into the customer's existing business and technical architectures. Active Enterprise Intelligence (AEI) is a business strategy for providing strategic and operational intelligence to back office and front line users from a single enterprise data warehouse. The Active Enterprise Intelligence environment is:
- Active - Responsive, agile, and capable of driving better, faster decisions that drive intelligent, and often immediate, actions.
- Enterprise - Provides a single view of the business, across appropriate business functions, and enables new operational users, processes, and applications.
- Intelligence - Supports traditional strategic users and new operational users of the Enterprise Data Warehouse. Most importantly, it enables the linkage and alignment of operational systems, business processes, and people with corporate goals, so companies may execute on their strategies.

The technology that enables that business value is the Teradata Active Data Warehouse (ADW). The Teradata ADW is a combination of products, features, services, and business partnerships that support the Active Enterprise Intelligence business strategy. ADW is an extension of our existing Enterprise Data Warehouse (EDW).


Active Data Warehouse


Data warehouses are beginning to take on mission-critical roles supporting CRM, one-to-one marketing, and minute-to-minute decision-making. Data warehousing requirements have evolved to demand a decision capability that is not just oriented toward corporate staff and upper management, but actionable on a day-to-day basis. Decisions such as when to replenish Barbie dolls at a particular retail outlet may not be strategic at the level of customer segmentation or long-term pricing strategies, but when executed properly, they make a big difference to the bottom line. We refer to this capability as "tactical" decision support. Tactical decisions are the drivers for day-to-day management of the business. Businesses today want more than just strategic insight from their data warehouse implementations - they want better execution in running the business through more effective use of information for the decisions that get made thousands of times per day.

The origin of the active data warehouse is the timely, integrated store of detail data available for analytic business decision-making. It is only from that source that the additional traits needed by the active data warehouse can evolve. These new "active" traits are supplemental to data warehouse functionality. For example, the work mix in the database still includes complex decision support queries, but expands to take on short, tactical queries, background data feeds, and possibly event-driven updates, all at the same time. Data volumes and user concurrency levels may explode upward beyond expectation. Restraints may need to be placed on the longer, analytical queries in order to guarantee tactical work throughput. While accessing the detail data directly remains an important opportunity for analytical work, tactical work may thrive on shortcuts and summaries, such as operational data store (ODS) level information. And for both strategic and tactical decisions to be useful to the business, today's data, this hour's data, even this minute's data has to be available.

The Teradata Database is positioned exceptionally well for stepping up to the challenges related to high availability, large multi-user workloads, and handling complex queries that are required for an active data warehouse implementation. The Teradata Database technology supports evolving business requirements by providing high performance and scalability for:
- Mixed workloads (both tactical and strategic queries) for mission-critical applications
- Large amounts of detail data
- Concurrent users

The Teradata Database provides 7x24 availability and reliability, as well as continuous updating of information so data is always fresh and accurate.

Evolution of Data Processing


Traditionally, data processing has been divided into two categories: on-line transaction processing (OLTP) and decision support systems (DSS). For either, requests are handled as transactions. A transaction is a logical unit of work, such as a request to update an account.


An RDBMS is used in the following main processing environments:
- DSS
- OLTP
- OLAP
- Data Mining

Decision Support Systems (DSS)

In a decision support environment, users submit requests to analyze historical detail data stored in the tables. The results are used to establish strategies, reveal trends, and make projections. A database used as a decision support system (DSS) usually receives fewer, very complex, ad-hoc queries that may involve numerous tables. Decision support systems include batch reports, which roll up numbers to give the business the big picture, and they have evolved over time. Instead of routine, pre-written scripts, users now require the ability to perform ad hoc queries (i.e., perform queries as the need arises), analysis, and predictive what-if type queries that are often complex and unpredictable in their processing. These types of questions are essential for long range, strategic planning. DSS systems often process huge volumes of detail data.

On-line Transaction Processing (OLTP)

Unlike the DSS environment, an on-line transaction processing (OLTP) environment typically has users accessing current data to update, insert, and delete rows in the data tables. OLTP is typified by a small number of rows (or records) of a few of many possible tables being accessed in a matter of seconds or less. Very little I/O processing is required to complete the transaction. This type of transaction takes place when we take out money at an ATM. Once our card is validated, a debit transaction takes place against our current balance to reflect the amount of cash withdrawn. This type of transaction also takes place when we deposit money into a checking account and the balance gets updated. We expect these transactions to be performed quickly; they must occur in real time.
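The ATM debit described above, sketched as a single SQL request (the table, column, and account number are hypothetical):

    -- An OLTP-style request: touch one row, finish in well under a second.
    UPDATE Account
    SET    Balance        = Balance - 200.00
    WHERE  Account_Number = 10034567;   -- hypothetical account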

On-line Analytical Processing (OLAP)

OLAP is a modern form of analytic processing within a DSS environment. OLAP tools from companies like MicroStrategy and Cognos provide an easy-to-use graphical user interface to allow slice-and-dice analysis along multiple dimensions (for example, products, locations, sales teams, inventories, etc.). With OLAP, the user may be looking for historical trends, sales rankings, or seasonal inventory fluctuations for the entire corporation. Usually, this involves a lot of detail data to be retrieved, processed, and analyzed. Therefore, response time can be in seconds or minutes.

Data Mining

Data Mining (predictive modeling) involves analyzing moderate to large amounts of detailed historical data to detect behavioral patterns (for example, buying, attrition, or fraud patterns) that are then used to predict future behavior. There are two phases to data mining:
- Phase 1: An analytic model is built from historical data incorporating the detected behavior patterns (takes minutes to hours).
- Phase 2: The model is then applied against current detail data of customers (that is, customers are scored) to predict likely outcomes (takes seconds or less). Scores can indicate a customer's likelihood of purchasing a product, switching to a competitor, or being fraudulent.

Advantages of Using Detail Data


Until recently, most business decisions were based on summary data. The problem is that summarized data is not as useful as detail data and cannot answer some questions with accuracy. With summarized data, peaks and valleys are leveled out, as when a peak falls at the end of a reporting period and is cut in half across the two periods.

Here's another example. Think of your monthly bank statement that records checking account activity. If it only told you the total amount of deposits and withdrawals, would you be able to tell if a certain check had cleared? To answer that question you need a list of every check received by your bank. You need detail data.

Decision support -- answering business questions -- is the real purpose of databases. To answer business questions, decision-makers must have four things:
- The right data
- Enough detail data
- Proper data structure
- Enough computer power to access and produce reports on the data

Consider your own business and how it uses data. Is that data detailed or summarized?


If it's summarized, are there questions it cannot answer?

Check Your Understanding


Which type of data processing supports answering this type of question: "How many women's dresses did our store sell in December of last year?"
A. OLTP
B. Data Mining
C. OLAP
D. DSS

Feedback:

Row vs. Set Processing

Both cursor and set processing define a set (or sets) of rows of the data to process; but while a cursor processes the rows sequentially, set processing operates on the entire set at once. Both can be invoked with a single command.

Row-by-Row Processing
In row-by-row processing, where there are many rows to process, one row is fetched at a time, all calculations are done on it, and then it is updated or inserted. Then the next row is fetched and processed, and so on. This makes for a slow program. A benefit of row processing is that there is less lock contention.

Set Processing
A lot of data processing is set processing, which is what relational databases do best. Instead of processing row-by-row sequentially, you can process relational data set-by-

file://C:\Documents and Settings\PJ186002\Desktop\teradata intoduction wbt.htm

10/20/2011

Page 29 of 137

set, without a cursor. For example, to sum all payment rows with 100 or less balances, a single SQL statement completely processes all rows that meet the condition as a set. With sufficient rows to process, this can be 10 to 30 or more times faster than row-at-atime processing. Some good uses of SET processing include: An update with all AMPs involved Single session processing which takes advantage of parallel processing Efficient updates of large amounts of data
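The payment example above, as set-oriented SQL (a sketch; the table and column names are hypothetical):

    -- One statement processes every qualifying row as a set -- no cursor,
    -- no fetch loop, and the work can be spread across all AMPs in parallel.
    SELECT SUM(Balance)
    FROM   Payment
    WHERE  Balance <= 100;

A cursor-based program would fetch and add these rows one at a time; the set-based statement expresses the same result in a single request.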

Response Time vs. Throughput

When determining how fast something is, there are two kinds of measures. You can measure how long it takes to do something, or you can measure how much gets done per unit time. The former is referred to as response time, access time, transmission time, or execution time, depending on the context. The latter is referred to as throughput.

Response Time
This speed measure is specified by an elapsed time from the initiation of some activity until its completion. The phrase response time is often used in operating systems contexts.

Throughput
A throughput measure is an amount of something per unit time. For operating systems throughput is often measured as tasks or transactions per unit time. For storage systems or networks throughput is measured as bytes or bits per unit time. For processors, the number of instructions executed per unit time is an important component of performance.

What Does this Mean to Teradata?


Throughput:
- measures the quantity of queries completed during a time interval
- is a measure of the amount of work processed -- how many queries were processed, such as the number of queries executed in an hour

Response time:
- measures the average duration of queries
- is a measure of process completion -- how long the processing takes, that is, the elapsed time per query

In order to improve both response time and throughput on a Teradata system, you can:
- Increase CPU power (i.e., add nodes)
- Implement workload management to control resources
- Decrease the number of concurrent users

The Data Warehouse

A data warehouse is a central, enterprise-wide database that contains information extracted from the operational systems. A data warehouse has a centrally located logical architecture, which minimizes data synchronization and provides a single view of the business. Data warehouses have become more common in corporations where enterprise-wide detail data may be used in on-line analytical processing to make strategic and tactical business decisions. Warehouses often carry many years' worth of detail data so that historical trends may be analyzed using the full power of the data. Many data warehouses get their data directly from operational systems so that the data is timely and accurate.

While data warehouses may begin somewhat small in scope and purpose, they often grow quite large as their utility becomes more fully exploited by the enterprise. Data warehousing is a process, not a product. It is a technique to properly assemble and manage data from various sources to answer business questions not previously possible or known.


Data Marts

A data mart is a special-purpose subset of enterprise data used by a particular department, function, or application. Data marts may have both summary and detail data for a particular use, rather than for general use. Usually the data has been pre-aggregated or transformed in some way to better handle the particular type of requests of a specific user community.

Independent Data Marts

Independent data marts are created directly from operational systems, just as a data warehouse is. In the data mart, the data is usually transformed as part of the load process. Data might be aggregated, dimensionalized, or summarized historically, as the requirements of the data mart dictate.

Logical Data Marts

Logical data marts are not separate physical structures or a data load from a data warehouse, but rather an existing part of the data warehouse. Because in theory the data warehouse contains the detail data of the entire enterprise, a logical view of the warehouse might provide the specific information for a given user community, much as a physical data mart would. Without the proper technology, a logical data mart can be a slow and frustrating experience for end users. With the proper technology, it removes the need for massive data loading and transforming, making a single data store available for all user needs.

Dependent Data Marts

Dependent data marts are created from the detail data in the data warehouse. While having many of the advantages of the logical data mart, this approach still requires the movement and transformation of data, but it may provide a better vehicle for performance-critical user queries.

Data Models - Enterprise vs. Application

To build an EDW, an enterprise data model should be leveraged. An enterprise data model serves as a neutral data model that is normalized to address all business areas, not specific to any function or group, whereas an application model is built for a specific business area. The application data model only looks at one aspect of the business, whereas an enterprise logical data model integrates all aspects of the business. In addition, an enterprise data model is more extensible than an application data model.


It is intended to encompass the entire enterprise.

Data Mart Pros and Cons


Independent Data Marts

Independent data marts are usually the easiest and fastest to implement, and their payback value can be almost immediate. Some corporations start with several data marts before deciding to build a true data warehouse. This approach has several inherent problems:
- While data marts have obvious value, they are not a true enterprise-wide solution and can become very costly over time as more and more are added.
- A major problem with proliferating data marts is that, depending on where you look for answers, there is often inconsistency.
- They may not provide the historical depth of a true data warehouse.
- Because data marts are designed to handle specific types of queries from a specific type of user, they are often not good at ad hoc, or "what if," queries like a data warehouse is.

Logical Data Marts

Logical data marts overcome most of the limitations of independent data marts. They provide a single view of the business. There is no historical limit to the data, and "what if" querying is entirely feasible. The major drawback to logical data marts is the lack of physical control over the data. Because data in the warehouse is not pre-aggregated or dimensionalized, performance against the logical mart may not be as good as against an independent mart. However, use of parallelism in the logical mart can overcome some of the limitations of the non-transformed data.

Dependent Data Marts

Dependent data marts provide all the advantages of a logical mart and also allow for physical control of the data as it is extracted from the data warehouse. Because dependent marts use the warehouse as their foundation, they are generally considered a better solution than independent marts, but they take longer and are more expensive to implement.


A Teradata Database System

A Teradata Database system contains one or more nodes. A node is a term for a processing unit under the control of a single operating system. The node is where the processing occurs for the Teradata Database. There are two types of Teradata Database systems:
- Symmetric multiprocessing (SMP) - An SMP Teradata Database has a single node that contains multiple CPUs sharing a memory pool.
- Massively parallel processing (MPP) - Multiple SMP nodes working together comprise a larger, MPP implementation of a Teradata Database. The nodes are connected using the BYNET, which allows multiple virtual processors on multiple nodes to communicate with each other.

To manage a Teradata Database system, you use:
- SMP system: System Console (keyboard and monitor) attached directly to the SMP node
- MPP system: Administration Workstation (AWS)

To access a Teradata Database system, a user typically logs on through one of multiple client platforms (channel-attached mainframes or network-attached workstations). Client access is discussed in the next module.

Node Components
A node is the basic building block of a Teradata Database system, and contains a large number of hardware and software components. A conceptual diagram of a node and its major components is shown below. Hardware components are shown on the left side of the node and software components are shown on the right side. For a description, click on each component.

Shared Nothing Architecture


The Teradata Database virtual processors, or vprocs (the PEs and AMPs), share the components of the nodes (memory and CPU). The heart of the "shared-nothing" architecture is that each AMP manages its own dedicated portion of the system's disk space (called the vdisk), and this space is not shared with other AMPs. Each AMP uses system resources independently of the other AMPs, so they can all work in parallel for high system performance overall.


Check Your Understanding


Which of the following statements is true?
A. PDE is an application that runs on the Teradata Database software.
B. AMPs manage system disks on the node.
C. The host channel adapter card connects to "bus and tag" cables through a Teradata Gateway.
D. An Ethernet card is a hardware component used in the connection between a network-attached client and the node.

Feedback:

Teradata Virtual Storage

What is Teradata Virtual Storage?

Teradata Virtual Storage, introduced with Teradata 13.00, is a change to the way in which Teradata accesses storage. Its purpose is to manage a multi-temperature warehouse. Teradata Virtual Storage pools all of the cylinders within a clique's disk space and allocates cylinders from this storage pool to individual AMPs. You can add storage to the clique storage pool rather than to every AMP, which allows sharing of storage devices among AMPs. It allows you to store data that is accessed more frequently ("hot data") on faster devices and data that is accessed less frequently ("cold data") on slower devices, and it can automatically migrate the data based on access frequency.

Teradata Virtual Storage is designed to allow the Teradata Database to make use of new storage technologies, such as adding fast Solid State Disks (SSDs) to an existing system with a different disk technology/speed/capacity. Teradata Virtual Storage enables the mixing of drive sizes, speeds, and technologies, so you can "mix" storage devices. Since storage is pooled and shared by the AMPs, adding drives does not require adding AMPs.

Teradata Virtual Storage is responsible for:
- Pooling clique storage and allocating cylinders from the storage pool to individual AMPs
- Tracking where data is stored on the physical media
- Maintaining statistics on the frequency of data access and on the performance of physical storage media
- Migrating frequently used data (hot data) to fast disks and data used less frequently (cold data) to slower disks

Benefits and Key Concepts


Teradata Virtual Storage provides the following benefits:

Storage Optimization, Data Migration, and Data Evacuation

Teradata Virtual Storage maintains statistics on frequency of data access (data temperature) and on the performance (grade) of physical media. This allows the Teradata Virtual Storage product to intelligently place more frequently accessed data on faster physical storage. As data access patterns change, Teradata Virtual Storage can move (migrate) storage cylinders to faster or slower physical media within each clique. This can improve system performance over time.

Teradata Virtual Storage can migrate data away from a physical storage device in order to prepare for removal or replacement of the device. This process is called evacuation. Complete data evacuation requires a system restart, but Teradata Virtual Storage supports a soft evacuation feature that allows much of the data to be moved while the system remains online. This can minimize system down time when evacuations are necessary.

Lower Barriers to System Growth

Device management features of Teradata Virtual Storage provide the ability to pool storage within each clique. Each storage device (pdisk) can be shared, if necessary, by all AMPs in the clique. If the number of storage devices is not a multiple of the number of AMPs in the clique, the extra storage will be shared. Consequently, storage can be added to the system in smaller increments, as needs and opportunities arise.

This diagram illustrates the conceptual differences with and without Teradata Virtual Storage:
- Pre-Teradata Virtual Storage: Cylinders were addressed by drive # and cylinder #.
- After Teradata Virtual Storage: AMPs don't know the physical location of a cylinder, and it can change. All of the cylinders in a clique are effectively in a pool that is managed by the TVS vproc. Cylinders are assigned a unique cylinder ID (virtual ID) across all of the pdisks.


With Teradata Virtual Storage you can easily add storage to an existing system.

Before Teradata Virtual Storage:
- Existing systems have an integral number of drives per AMP.
- Adding storage requires an additional drive per AMP, which means a 50% or 100% increase in capacity.

With Teradata Virtual Storage:
- You can add any number of drives.
- Added drives are shared by all AMPs.
- These new drives may have different capacities and/or performance than the drives that already reside in the system.

Using the BYNET

The BYNET (pronounced "bye-net") is a high-speed interconnect (network) that enables multiple nodes in the system to communicate. The BYNET handles the internal communication of the Teradata Database. All communication between PEs and AMPs is done via the BYNET. When the PE dispatches the steps for the AMPs to perform, they are dispatched onto the BYNET. The messages are routed to the appropriate AMP(s), where result sets and status information are generated. This response information is also routed back to the requesting PE via the BYNET. Depending on the nature of the dispatch request, the communication between nodes may be to all nodes (broadcast message) or to one specific node (point-to-point message) in the system.

BYNET Unique Features


The BYNET has several unique features:

Scalable: As you add more nodes to the system, the overall network bandwidth scales linearly. This linear scalability means you can increase system size without performance penalty -- and sometimes even increase performance.

High performance: An MPP system typically has two BYNET networks (BYNET 0 and BYNET 1). Because both networks in a system are active, the system benefits from having full use of the aggregate bandwidth of both networks.

Fault tolerant: Each network has multiple connection paths. If the BYNET detects an unusable path in either network, it will automatically reconfigure that network so all messages avoid the unusable path. Additionally, in the rare case that BYNET 0 cannot be reconfigured, hardware on BYNET 0 is disabled and messages are rerouted to BYNET 1.

Load balanced: Traffic is automatically and dynamically distributed between both BYNETs.

BYNET Hardware and Software


The BYNET hardware and software handle the communication between the vprocs and the nodes.

Hardware: The nodes of an MPP system are connected with the BYNET hardware, consisting of BYNET boards and cables.

Software: The BYNET driver (software) is installed on every node. This BYNET driver is an interface between the PDE software and the BYNET hardware.

SMP systems do not contain BYNET hardware. The PDE and BYNET software emulate BYNET activity in a single-node environment.

For more information on communication between the vprocs and nodes, click here. (Note: You do not need to know this information for the certification exam.)

Just for Fun . . .


1. When a message is delivered to a node using BYNET hardware and software, PDE software on the node has the ability to route the message to which three? (Choose three.)
A. A single vproc on a node
B. A group of vprocs on a node
C. All vprocs on a node
D. All vprocs on all nodes


Cliques

A clique (pronounced "kleek") is a group of nodes that share access to the same disk arrays. Each multi-node system has at least one clique. The cabling determines which nodes are in which cliques -- the nodes of a clique are connected to the disk array controllers of the same disk arrays.

Cliques Provide Resiliency


In the event of a node failure, cliques provide for data access through vproc migration. When a node resets, the following happens:
1. When the node fails, the Teradata Database restarts across all remaining nodes in the system.
2. The vprocs (AMPs) from the failed node migrate to the operational nodes in its clique.
3. The PE vprocs migrate as follows: LAN-attached PEs migrate to other nodes in the clique. Channel-attached PEs do not migrate; while that node remains down, that channel connection is not available.
4. Disks managed by the AMP remain available, and processing continues while the failed node is being repaired.

Cliques in a System
Vprocs are distributed across all nodes in the system. Multiple cliques in the system should have the same number of nodes. The diagram below shows three cliques. The nodes in each clique are cabled to the same disk arrays. The overall system is connected by the BYNET. If one node goes down in a clique, the vprocs will migrate to the other nodes in the clique, so data remains available. However, system performance decreases due to the loss of a node. System performance degradation is proportional to clique size.

Hot Standby Node

A Hot Standby Node (HSN) is a node in a clique that is not initially configured to execute any Teradata vprocs. If a node in the clique fails, the AMPs from the failed node move to the hot standby node, and the performance degradation is 0%. When the failed node is recovered/repaired and restarted, it becomes the new hot standby node. A second restart of Teradata is not needed.

Characteristics of a hot standby node:
It is a member of a clique.
It does not normally participate in the trusted parallel application (TPA).
It can be brought into the configuration when a node fails in the clique.
It helps with unplanned outages.
It eliminates the need for a restart to bring a failed node back into service.

Hot Standby Nodes are positioned as a performance continuity feature.


1. Performance degradation is 0% as AMPs are moved to the Hot Standby Node. 2. When node 1 is recovered, it becomes the new Hot Standby Node.

Software Components

A Teradata Database node requires three distinct pieces of software:

For each node in the system, you need both of the following:
Operating system license (UNIX, Microsoft Windows, or Linux)
Teradata Database software license


Operating System

The Teradata Database can run on the following operating systems:
UNIX MP-RAS (not supported beyond Teradata 13)
Microsoft Windows 2000
SUSE Linux

Parallel Database Extensions (PDE)


The Parallel Database Extensions (PDE) software layer was added to the operating system to support the parallel software environment. The PDE controls the virtual processor (vproc) resources.

Trusted Parallel Application (TPA)


A Trusted Parallel Application (TPA) uses PDE to implement virtual processors (vprocs). The Teradata Database is classified as a TPA. The four components of the Teradata Database TPA are:


AMP (top right)
PE (bottom right)
Channel Driver (top left)
Teradata Gateway (bottom left)

Teradata Database Software: PE


A Parsing Engine (PE) is a virtual processor (vproc) that manages the dialogue between a client application and the Teradata Database, once a valid session has been established. Each PE can support a maximum of 120 sessions. The PE handles an incoming request in the following manner:

1. The Session Control component verifies the request for session authorization (user names and passwords), and either allows or disallows the request.

2. The Parser interprets the SQL statement received from the application, verifies SQL requests for proper syntax and evaluates them semantically, and consults the Data Dictionary to ensure that all objects exist and that the user has authority to access them.

3. The Optimizer is cost-based and develops the least expensive plan (in terms of time) to return the requested response set. Processing alternatives are evaluated and the fastest alternative is chosen. This alternative is converted into executable steps, to be performed by the AMPs, which are then passed to the Dispatcher. The Optimizer is "parallel aware," meaning that it has knowledge of the system components (how many nodes, vprocs, etc.), which enables it to determine the fastest way to process the query. In order to maximize throughput and minimize resource contention, the Optimizer must know about system configuration, available units of parallelism (AMPs and PEs), and data demographics. The Teradata Database Optimizer is robust and intelligent, and enables the Teradata Database to handle multiple complex, ad-hoc queries efficiently.

4. The Dispatcher controls the sequence in which the steps are executed and passes the steps received from the Optimizer onto the BYNET for execution by the AMPs.

5. After the AMPs process the steps, the PE receives their responses over the BYNET.

6. The Dispatcher builds a response message and sends the message back to the user.

Teradata Database Software: AMP


The AMP is a vproc in the Teradata Database's shared-nothing architecture that is responsible for managing a portion of the database. Each AMP manages some portion of each table on the system. AMPs do the physical work associated with generating an answer set (output), including sorting, aggregating, formatting, and converting. The AMPs retrieve and perform all database management functions on the required rows from a table. An AMP accesses data from its single associated vdisk, which is made up of multiple ranks of disks. An AMP responds to Parser/Optimizer steps transmitted across the BYNET by selecting data from or storing data to its disks. For some requests, the AMPs may redistribute a copy of the data to other AMPs.

The Database Manager subsystem resides on each AMP. This subsystem will:
Lock databases and tables.
Create, modify, or delete definitions of tables.
Insert, delete, or modify rows within the tables.
Retrieve information from definitions and tables.
Return responses to the Dispatcher.

Earlier in this course, we discussed the logical organization of data into tables. The Database Manager subsystem provides a bridge between that logical organization and the physical organization of the data on disks. The Database Manager performs a space-management function that controls the use and allocation of space.



Teradata Database Software: Channel Driver


Channel Driver software is the means of communication between an application and the PEs assigned to channel-attached clients. There is one Channel Driver per node. In the diagram below, the blue dots show the communication from the channel-attached client, to the host channel adapter in the node, to the Channel Driver software, to the PE, and back to the client.


Teradata Database Software: Teradata Gateway


Teradata Gateway software is the means of communication between an application and the PEs assigned to network-attached clients. There is one Teradata Gateway per node. In the diagram below, the blue dots show the communication from the network-attached client, to the Ethernet card in the node, to the Teradata Gateway software, to the PE, and back to the client.

Teradata Purpose-Built Family Platform

Each platform is purpose-built to meet different analytical requirements. They all leverage the Teradata Database. Customers may easily migrate applications from one platform to another without having to change data models, ETL, or underlying structures.

Teradata Extreme Data Appliance 1550


The Teradata Extreme Data Appliance 1550 provides for deep strategic intelligence from extremely large amounts of detailed data. It supports very high-volume, non-enterprise data/analysis requirements for a small number of power users in specific workgroups or projects that are outside of the enterprise data warehouse (EDW). This appliance is based on the field-proven Teradata Active Enterprise Data Warehouse 5550 processing nodes and provides the same scalability and data warehouse capabilities as any other Teradata platform.

Teradata Active Enterprise Data Warehouse - 5550 H and 5555 C/H


These models are targeted to the full-scale large data warehouse. They offer expansion capabilities up to 1024 TPA and non-TPA nodes. The power of the Teradata Database, combined with the throughput, power, and performance of both the Intel Xeon quad-core processors and BYNET V3 technologies, offers unsurpassed performance and capacity within the scalable data warehouse.

Teradata Data Mart Appliance 2500/2550/2555


The Teradata Data Mart Appliance 2500 is a server that is optimized specifically for high DSS performance. The Teradata Data Mart Appliance 2550 and 2555 have similar characteristics to the 2500, but are approximately 40% - 45% faster on a per-node basis. These systems are optimized for fast scans and heavy deep-dive analytics.

Characteristics of the Teradata Data Mart Appliance 2500/2550/2555 include:
Delivered ready to run
Integrated system fully staged and tested
Includes a robust set of tools and utilities
Rapid time to value with system live within hours
Competitive price point
Capacity on demand available if needed
Easy migration to an EDW/ADW

Exercise 2.1

Select the answers from the options given in the drop-down boxes that correctly complete the sentences.

__________ causes vprocs to migrate to other nodes.
__________ carries the communication between nodes in a system.
__________ is a group of nodes with access to the same disk arrays.
A copy of __________ is installed on each node in the system.



To review these topics, click Cliques Provide Resiliency, Using the BYNET, or Cliques.

Exercise 2.2

Which three statements about the Teradata Database are true? (Choose three.)
A. Runs on a foundation called a TPA.
B. PDE is a software layer that allows TPAs to run in a parallel software environment.
C. There are two types of virtual processors: AMPs and PEs.
D. Runs on UNIX MP-RAS (discontinued after Teradata 13), Windows 2000, and Linux.


To review these topics, click Software Components, Parallel Database Extensions (PDE), A Teradata Database System, or Operating System.

Exercise 2.3

Four of these components are contained in the TPA software. Click each of your choices and check the Feedback box below each time to see if you are correct.


To review this topic, click Trusted Parallel Application (TPA).

Exercise 2.4

Select AMP, BYNET, or PE in the pull-down menu as the component responsible for the following tasks:
Carries messages between nodes.
Sorts, aggregates, and formats data in the processing of requests.
Accesses data on its assigned vdisk.
Chooses the least expensive plan for creating a response set.
Transports responses from the AMPs back to the PEs, facilitating AMP/PE communication.
Distributes incoming data or retrieves rows being requested to generate an answer set.
Can manage up to 120 sessions.


To review these topics, click Node Components, Communication Between Nodes, Communication Between Vprocs, Teradata Database Software: PE, and Teradata Database Software: AMP.

Exercise 2.5

From the drop-down box below, select the answer that correctly completes the sentence.

In processing a request, the __________ determines the most efficient plan for processing the requested response.

To review this topic, click Teradata Database Software: PE.

Exercise 2.6

Select OLAP, OLTP, DSS, or Data Mining (DM) in the pull-down menu as the appropriate type of data processing for the following requests:
Withdraw cash from ATM.
Show the top ten selling items for 1997 across all stores.
How many blue jeans were sold across all of our Eastern stores in the month of March in child sizes?


To review these topics, click Evolution of Data Processing.

Exercise 2.7

From the drop-down box below, select the answer that correctly completes the sentence.

A(n) __________ may contain detail or summary data and is a special-purpose subset of enterprise data for a particular function or application, rather than for general use.



To review this topic, click Data Marts.

Exercise 2.8

From the drop-down box below, select the answer that correctly completes the sentence.

A(n) __________ supports the coexistence of tactical and strategic queries.

To review this topic, click Active Data Warehouse.

Exercise 2.9

From the drop-down box below, select the answer that correctly completes the sentence.

__________ enable(s) the mixing of drive sizes, speeds, and technologies so you can "mix" storage devices.

To review this topic, click Teradata Virtual Storage.

Exercise 2.10

Select Teradata Extreme Data Appliance (e.g., 1550), Teradata Active Enterprise Data Warehouse (e.g., 5550), or Teradata Data Mart Appliance (e.g., 2550) in the pull-down menu as the appropriate platform for each description:
A server that is optimized specifically for high DSS performance, such as fast scans and heavy deep-dive analytics.
Scalable data warehouse targeted to the full-scale large DW with expansion up to 1024 TPA and non-TPA nodes.
Provides for deep strategic intelligence from extremely large amounts of detailed data and supports very high-volume, non-enterprise data/analysis requirements for a small number of power users in specific workgroups or projects that are outside of the enterprise data warehouse (EDW).


To review these topics, click Teradata Purpose-Built Family Platform.

Exercise 2.11


Match the performance term to its definition:
Measures how long it takes to do something.
Measures how much gets done per unit time.


To review these topics, click Response Time vs. Throughput.

Exercise 2.12

True or False: Both cursor row processing and set processing define set(s) of rows of data to process and can be processed with a single command; but while a cursor processes the rows sequentially, set processing takes its set all at once.
A. True
B. False

To review this topic, click Row vs. Set Processing.

Mod 3 - Client Access

Objectives

After completing this module, you should be able to:
Describe how the clients access the Teradata Database.
Illustrate how the Teradata Database processes a request.
Describe the Teradata client utilities and their use.

HOT TIP: This module contains links to important supplemental course information. Please be sure to click on each hotword link to capture all of the training content.

Client Connections

Users can access data in the Teradata Database through an application on both channel-attached and network-attached clients. Additionally, the node itself can act as a client. Teradata client software is installed on each client (channel-attached, network-attached, or node) and communicates with RDBMS software on the node. You may hear either type of client referred to by the term "host," though this term is not typically used in documentation or product literature.

The client may be a mainframe system, such as IBM or Amdahl, which is channel-attached to the Teradata Database, or it may be a PC, UNIX, or Linux-based system that is LAN-attached. The client application submits an SQL request to the database, receives the response, and submits the response to the user. This application could be a business intelligence (BI) tool or a data integration (DI/ETL/ELT) tool, submitting queries to Teradata or loading/updating tables in the database.

Channel-Attached Client

Channel-attached clients are IBM-compatible mainframe systems supported by the Teradata Database. The following software components, installed on the mainframe, are responsible for communications between client applications and the Channel Driver on a Teradata Database node:
Teradata Director Program (TDP): software to manage session traffic, installed on the channel-attached client.
Call-Level Interface (CLI): a library of routines that are the lowest-level interface to the Teradata Database.

Communication with the Teradata Database System
Communication from client applications on the mainframe goes through the mainframe channel, to the Host Channel Adapter on the node, to the Channel Driver software.


Network Attached Client

The Teradata Database supports network-attached clients connected to the node over a LAN. The following software components, installed on the network-attached client, are responsible for communication between client applications and the Teradata Gateway on a Teradata Database node:

Open Database Connectivity (ODBC): an application programming standard that defines common database access mechanisms to simplify the exchange of data between a client and server. ODBC-compliant applications connect with a database through the use of a driver that translates the application's ODBC commands into database syntax.

Call-Level Interface, Version 2 (CLIv2): a library of routines that enable an application program to access data stored in the Teradata Database. When used with network-attached clients, CLIv2 contains the following components: CLI (Call-Level Interface), MTDP (Micro Teradata Director Program), and MOSI (Micro Operating System Interface).

Java Database Connectivity (JDBC): an Application Programming Interface (API) that allows platform-independent Java applications to access a DBMS using Structured Query Language (SQL). JDBC enables the development of web-based Teradata end-user tools that can access Teradata through a web server. JDBC also provides support for access to other commercial databases.

WinCLI: an additional, legacy API to Teradata from a network host.

Communication with the Teradata Database System
Communication from applications on the network-attached client goes over the LAN, to the Ethernet card on the node, to the Teradata Gateway software.


On the database side, the Teradata Gateway software and the PE provide the connection to the Teradata Database. The Teradata Database is configured with two LAN connections for redundancy. This ensures high availability.

Node

The node is considered a network-attached client. If you install application software on a node, it will be treated like an application on a network-attached client. In other words, communications from applications on the node go through the Teradata Gateway. An application on a node can be executed through:
The system console that manages an SMP system.
Remote login, such as over a network-attached client connection.

Just for Fun . . .


As a review, answer this question: Which two can you use to run an application that is installed on a node? (Choose two.)
A. Mainframe terminal
B. Bus terminal
C. System console
D. Network-attached workstation


Request Processing


The steps for processing a request like the one above are somewhat different, depending on whether the user is accessing the Teradata Database through a channel-attached or network-attached client:
1. The SQL request is sent from the client to the appropriate component on the node. Channel-attached client: the request is sent to the Channel Driver (through the TDP). Network-attached client: the request is sent to the Teradata Gateway (through CLIv2 or ODBC).
2. The request is passed to the PE(s).
3. The PEs parse the request into AMP steps.
4. The PE Dispatcher sends the steps to the AMPs over the BYNET.
5. The AMPs perform operations on the data on the vdisks.
6. The response is sent back to the PEs over the BYNET.
7. The PE Dispatcher receives the response.
8. The response is returned to the client (channel-attached or network-attached).

Mainframe Request Flow

Workstation Request Flow

Teradata Client Utilities

Teradata has a robust suite of client utilities that enable users and system administrators to enjoy optimal response time and system manageability. Various client utilities are available for tasks from loading data to managing the system. Teradata utilities leverage the Teradata Database's high-performance capabilities and are fully parallel and scalable. The same utilities run on smaller entry-level systems and the largest MPP implementations.


Teradata Database client utilities include the following, described in this section:

Query Submitting Utilities: BTEQ, Teradata SQL Assistant

Load and Unload Utilities: FastLoad, MultiLoad, TPump, FastExport, Teradata Parallel Transporter (TPT)

Administrative Utilities: Teradata Manager, Teradata Dynamic Workload Manager (TDWM), Priority Scheduler, Database Query Log (DBQL), Teradata Workload Analyzer, Performance Monitor (PMON), Teradata Active Systems Management (TASM), Teradata Analyst Pack

Archive Utilities: Archive Recovery Facility (ARC), NetVault (third party), NetBackup (third party)

Query Submitting Utilities


The Teradata Database provides tools that are front-end interfaces for submitting SQL queries. Two mentioned in this section are BTEQ and Teradata SQL Assistant.

BTEQ
BTEQ (Basic Teradata Query) -- pronounced BEE-teek -- is a Teradata Database tool used for submitting SQL queries on all platforms. BTEQ provides the following functionality:
Standard report writing and formatting.
Basic import and export of small amounts of data to and from the Teradata Database across all platforms. For tables of more than a few thousand rows, the Teradata Database load utilities are recommended for greater efficiency.
Ability to submit SQL requests interactively or in batch.
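To make this concrete, here is a minimal BTEQ script sketch. The TDPID, logon credentials, database, table, and file names are hypothetical placeholders, not part of the course material:

    .LOGON mytdp/analyst,analystpass
    .EXPORT REPORT FILE = dept_salaries.txt
    SELECT department_number        -- BTEQ formats the result as a report
         , SUM(salary_amount)
    FROM hr.employee
    GROUP BY 1
    ORDER BY 1;
    .EXPORT RESET
    .LOGOFF
    .QUIT

Run interactively, the same statements can be typed at the BTEQ prompt; in batch, the script file is simply redirected into the bteq command.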


Teradata SQL Assistant


Teradata SQL Assistant (formerly known as Queryman) is an information discovery/query tool that runs on Microsoft Windows. Teradata SQL Assistant enables you to access the Teradata Database as well as other ODBC-compliant databases. Some of its features include:
Ability to save data in PC-based formats, such as Microsoft Excel, Microsoft Access, and text files.
History of submitted SQL syntax, to help you build scripts for data mining and knowledge discovery.
Help with SQL syntax.
Import and export of small amounts of data to and from ODBC-compliant databases. For tables of more than a few thousand rows, the Teradata Database load utilities are recommended for greater efficiency.

Data Load and Unload Utilities


In a data warehouse environment, the database tables are populated from a variety of sources, such as mainframe applications, operational data marts, or other distributed systems throughout a company. These systems are the source of data such as daily transaction files, orders, usage records, ERP (enterprise resource planning) information, and Internet statistics.

Teradata provides a suite of data load and unload utilities optimized for use with the Teradata Database. They run on any of the supported client platforms: channel-attached client, network-attached client, or node.

Using Teradata Load and Unload Utilities
Teradata load and unload utilities are fully parallel. Because the utilities are scalable, they accommodate the size of the system. Performance is not limited by the capacity of the load and unload tools. The utilities have full restart capability. This feature means that if a load or unload job is interrupted for some reason, it can be restarted from the last checkpoint, without having to start the job from the beginning.

The load and unload utilities are: FastLoad, MultiLoad, TPump, FastExport, and Teradata Parallel Transporter (TPT).

The concurrency limit for utilities is now 60:
Up to 30 concurrent FastLoad and MultiLoad jobs.
Up to 60 concurrent FastExport jobs (assuming no FastLoad or MultiLoad jobs).

FastLoad
Use the FastLoad utility to load data into empty tables. FastLoad loads to a single empty table at a time. FastLoad loads data into an empty table in parallel, using multiple sessions to transfer blocks of data. FastLoad achieves high performance by fully exploiting the resources of the system. After the data load is complete, the table can be made available to users. A typical use is for mini-batch or frequent batch loading, where you load the data into an empty "staging" table and then use an SQL INSERT/SELECT statement to move it to an existing table.
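As an illustration of the staging-table pattern described above, here is a minimal FastLoad script sketch; the system name, credentials, table, and file names are hypothetical:

    SESSIONS 4;
    LOGON mytdp/loader,loaderpass;
    /* Error tables must not exist before the load */
    DROP TABLE stage.sales_err1;
    DROP TABLE stage.sales_err2;
    BEGIN LOADING stage.sales_stage
        ERRORFILES stage.sales_err1, stage.sales_err2;
    SET RECORD VARTEXT "|";
    DEFINE store_id (VARCHAR(10)),
           sale_amt (VARCHAR(12))
    FILE = sales.txt;
    INSERT INTO stage.sales_stage VALUES (:store_id, :sale_amt);
    END LOADING;
    LOGOFF;

Once the staging table is loaded, an ordinary INSERT INTO prod.sales SELECT * FROM stage.sales_stage; moves the rows into the populated production table, as described above.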


MultiLoad
Use the MultiLoad utility to maintain tables by:
Inserting rows into a populated or empty table
Updating rows in a table
Deleting multiple rows from a table

MultiLoad can load multiple input files concurrently and work on up to five tables at a time, using multiple sessions. MultiLoad is optimized to apply multiple rows in block-level operations. MultiLoad is usually run during a batch window, and it places a lock on the destination table(s) to prevent user queries from getting inconsistent results before the data load or update is complete. Access locks may be used to query tables being maintained with MultiLoad.
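A minimal MultiLoad upsert sketch follows; the log table, credentials, layout, and file names are hypothetical, and a real job would add error-limit and checkpoint options:

    .LOGTABLE util.sales_mld_log;
    .LOGON mytdp/loader,loaderpass;
    .BEGIN MLOAD TABLES prod.sales;
    .LAYOUT sales_layout;
    .FIELD store_id * VARCHAR(10);
    .FIELD sale_amt * VARCHAR(12);
    /* Upsert: update the row if it exists, insert it if it does not */
    .DML LABEL upsert_sales DO INSERT FOR MISSING UPDATE ROWS;
    UPDATE prod.sales SET sale_amt = :sale_amt WHERE store_id = :store_id;
    INSERT INTO prod.sales (store_id, sale_amt) VALUES (:store_id, :sale_amt);
    .IMPORT INFILE sales.txt FORMAT VARTEXT '|'
        LAYOUT sales_layout APPLY upsert_sales;
    .END MLOAD;
    .LOGOFF;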

TPump
Use TPump to:
Continuously load, update, or delete data in tables
Update lower volumes of data using fewer system resources than other load utilities
Vary the resource consumption and speed of the data loading activity over time

TPump performs the same operations as MultiLoad. TPump updates a row at a time and uses row hash locks, which eliminates the need for the table locks and "batch windows" typical of MultiLoad. Users can continue to run queries during TPump data loads. In addition, TPump maintains up to 60 tables at a time.


TPump has a dynamic throttle that operators can set to specify the percentage of system resources to be used for an operation. This enables operators to let TPump run at full capacity during periods of low system usage, or to constrain it when TPump might affect other business users of the Teradata Database.
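For comparison with the MultiLoad script shown earlier, here is a hedged TPump sketch; the PACK and RATE values, names, and credentials are illustrative assumptions only:

    .LOGTABLE util.sales_tpump_log;
    .LOGON mytdp/loader,loaderpass;
    /* PACK = statements per request; RATE = statements per minute (the throttle) */
    .BEGIN LOAD SESSIONS 4 PACK 20 RATE 600
        ERRORTABLE util.sales_tpump_err;
    .LAYOUT sales_layout;
    .FIELD store_id * VARCHAR(10);
    .FIELD sale_amt * VARCHAR(12);
    .DML LABEL ins_sales;
    INSERT INTO prod.sales (store_id, sale_amt) VALUES (:store_id, :sale_amt);
    .IMPORT INFILE sales.txt FORMAT VARTEXT '|'
        LAYOUT sales_layout APPLY ins_sales;
    .END LOAD;
    .LOGOFF;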

FastExport
Use the FastExport utility to export data from one or more tables or views on the Teradata Database to a client-based application. You can export data from any table or view on which you have SELECT access rights. The destination for the exported data can be:
A host file: a file on your channel-attached or network-attached client system.
A user-written application: an Output Modification (OUTMOD) routine you write to select, validate, and preprocess the exported data.

FastExport is a data extract utility. It transfers large amounts of data using block transfers over multiple sessions and writes the data to a host file on the network-attached or channel-attached client. Typically, FastExport is run during a batch window, and the tables being exported are locked.
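A minimal FastExport sketch follows; as with the other examples, the names and credentials are hypothetical:

    .LOGTABLE util.sales_fexp_log;
    .LOGON mytdp/analyst,analystpass;
    .BEGIN EXPORT SESSIONS 4;
    .EXPORT OUTFILE sales_extract.dat;
    /* Any valid SELECT can drive the extract */
    SELECT store_id, sale_amt
    FROM prod.sales
    WHERE sale_date >= DATE '2011-01-01';
    .END EXPORT;
    .LOGOFF;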

Teradata Parallel Transporter


Teradata Parallel Transporter is a load/update/export tool that enables the data extraction, transformation, and loading processes common to all data warehouses. Using built-in operators, Teradata Parallel Transporter combines the functionality of the Teradata utilities (FastLoad, MultiLoad, FastExport, and TPump) in a single parallel environment. Its extensible environment supports FastLoad INMODs, FastExport OUTMODs, and Access Modules to provide access to all the data sources you use today. There is a set of open APIs (Application Programming Interfaces) to add third-party or custom data transformation to Teradata Parallel Transporter scripts.

Using multiple, parallel tasks, a single Teradata Parallel Transporter script can load data from disparate sources into the Teradata Database in the same job. Teradata Parallel Transporter provides a single, SQL-like scripting language, as well as a GUI to make scripting faster and easier. You can do the extract, some transformation, and the loads all in one SQL-like scripting language.

A single Teradata Parallel Transporter job can load data from multiple disparate sources into the Teradata Database, as indicated by the green arrow.

Teradata Parallel Transporter Operators

The operators are components that "plug" into the Teradata Parallel Transporter infrastructure and actually perform the functions. The FastLoad INMOD and FastExport OUTMOD operators support the current FastLoad and FastExport INMOD/OUTMOD features. The Data Connector operator is an adapter for the Access Module or non-Teradata files. The SQL Select and Insert operators submit the Teradata SELECT and INSERT commands. The Load, Update, Export, and Stream operators are similar to the current FastLoad, MultiLoad, FastExport, and TPump utilities, but built for the Teradata PT parallel environment.

The INMOD and OUTMOD adapters, Data Connector operator, and the SQL Select/Insert operators are included when you purchase the infrastructure. The Load, Update, Export, and Stream operators are purchased separately.

To simplify these new concepts, let's compare the Teradata Parallel Transporter operators with the classic utilities that we just covered:

TPT Operator: LOAD / Teradata Utility: FastLoad
A consumer-type operator that uses the Teradata FastLoad protocol. Supports error limits and checkpoint/restart. Both support Multi-Value Compression and PPI.

TPT Operator: UPDATE / Teradata Utility: MultiLoad
Utilizes the Teradata MultiLoad protocol to enable job-based table updates. This allows highly scalable and parallel inserts and updates to an existing table.

TPT Operator: EXPORT / Teradata Utility: FastExport
A producer operator that emulates the FastExport utility.

TPT Operator: STREAM / Teradata Utility: TPump
Uses multiple sessions to perform DML transactions in near real-time.

TPT Operator: DataConnector / Teradata Utility: N/A
This operator emulates the Data Connector API. Reads external data files, writes data to external data files, and reads an unspecified number of data files.

TPT Operator: ODBC / Teradata Utility: N/A
Reads data from an ODBC Provider.
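To show how these operators fit together, here is a hedged sketch of a TPT job script that reads a delimited file with the DataConnector operator and loads it with the Load operator; every object name, attribute value, and credential is a hypothetical placeholder:

    DEFINE JOB load_sales
    DESCRIPTION 'File-to-table load using DataConnector and Load operators'
    (
        DEFINE SCHEMA sales_schema
        (
            store_id VARCHAR(10),
            sale_amt VARCHAR(12)
        );

        /* Producer: reads the flat file */
        DEFINE OPERATOR file_reader
        TYPE DATACONNECTOR PRODUCER
        SCHEMA sales_schema
        ATTRIBUTES
        (
            VARCHAR FileName      = 'sales.txt',
            VARCHAR Format        = 'Delimited',
            VARCHAR TextDelimiter = '|'
        );

        /* Consumer: applies the FastLoad protocol to an empty table */
        DEFINE OPERATOR load_op
        TYPE LOAD
        SCHEMA *
        ATTRIBUTES
        (
            VARCHAR TdpId        = 'mytdp',
            VARCHAR UserName     = 'loader',
            VARCHAR UserPassword = 'loaderpass',
            VARCHAR TargetTable  = 'stage.sales_stage',
            VARCHAR LogTable     = 'util.sales_tpt_log'
        );

        APPLY ('INSERT INTO stage.sales_stage (store_id, sale_amt)
                VALUES (:store_id, :sale_amt);')
        TO OPERATOR (load_op)
        SELECT * FROM OPERATOR (file_reader);
    );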

Administrative Utilities

Administrative utilities use a graphical user interface (GUI) to monitor and manage various aspects of a Teradata Database system. The administrative utilities are:

Workload Management:
Teradata Manager
Teradata Dynamic Workload Manager (TDWM)
Priority Scheduler
Database Query Log (DBQL)
Teradata Workload Analyzer
Performance Monitor
Teradata Active Systems Management (TASM)
Teradata Analyst Pack

Workload Management
Workload Management in Teradata is used to control system resource allocation to the various workloads on the system. Some of the components that make up Teradata's Workload Management capability are:

Teradata Manager
Teradata Manager is a production and performance monitoring system that helps a DBA or system manager monitor, control, and administer one or more Teradata Database systems through a GUI. Running on LAN-attached clients, Teradata Manager has a variety of tools and applications to gather, manipulate, and analyze information about each Teradata Database being administered. For examples of Teradata Manager functions, click here: Teradata Manager Examples

Teradata Dynamic Workload Manager (TDWM)
Teradata Dynamic Workload Manager (also known as Teradata DWM or TDWM) is a query workload management tool that can restrict (run, suspend, schedule later, or reject) queries based on the current workload and set thresholds. TDWM provides a graphical user interface (GUI) for creating rules that manage database access, increase database efficiency, and enhance workload capacity. Via the rules created through TDWM, queries can be rejected, throttled, or executed when they are submitted to the Teradata Database. For example, with TDWM a request can be scheduled to run periodically or during a specified time period. Results can be retrieved any time after the request has been submitted by TDWM and executed.

TDWM can restrict queries based on factors such as:
Analysis control thresholds - TDWM can restrict requests that will exceed a certain processing time, or whose expected result set size exceeds a specified number of rows.
Object control thresholds - TDWM can limit access to and use of static criteria such as database objects and other items. Object controls can control workload requests based on user IDs, tables, views, date, time, macros, databases, and group IDs.
Environmental factors - TDWM can manage requests based on dynamic environment factors, including database system CPU and disk utilization, network activity, and number of users.


Teradata Dynamic Workload Manager is a key supporting product component for Teradata Active System Management (TASM), a concept new as of Teradata V2R6.1, described in another sub-topic below. The major functions performed by the DBA are to:
Define filters and throttles.
Define workloads (new) and their operating periods, goals, and Priority Scheduler facility (PSF) mapping/weights.
Define general TASM controls - TASM automates the allocation of resources to workloads to assist the DBA or application developer with system performance management.

TDWM allows the Database Administrator to provide operational control of, and to effectively manage and regulate access to, the Teradata Database. The database administrator can use the following capabilities of TDWM to manage work submitted to the database in order to maximize system resource utilization:
Query Management
Scheduled Requests

With Query Management, database query requests are intercepted within the Teradata Database, their components are compared against criteria that are defined by the administrator, and requests that fail to meet the criteria are restricted: either run, suspended, scheduled later, or rejected. With Scheduled Requests, clients can submit SQL requests to be executed at scheduled off-peak times.

Priority Scheduler
Priority Scheduler is a resource management tool that is used to assign resources and controls how computer resources (e.g., CPU) are allocated to different users in a Teradata system. This resource management function is based on scheduler parameters that satisfy site-specific requirements and system parameters that depict the current activity level of the Teradata Database system. You can provide Priority Scheduler parameters to directly define a strategy for controlling resources.


Database Query Log (DBQL)
The Database Query Log (DBQL) logs query processing activity for later analysis. Query counts and response times can be charted, and SQL text and processing steps can be compared to fine-tune applications for optimum performance.

Teradata Workload Analyzer
Teradata Workload Analyzer recommends candidate workloads for analysis. In addition, it provides the following capabilities:
Identifies classes of queries and candidate workloads for analysis and recommends workload definitions and operating rules.
Recommends workload allocation group mappings and Priority Scheduler facility (PSF) weights.
Provides the ability to migrate existing Priority Scheduler Definitions (PD Sets) into new workloads.
Provides recommendations for appropriate workload Service Level Goals (SLGs).
Establishes workload definitions from query history or directly.
Can be used iteratively to analyze and understand how well existing workload definitions are working and modify them if necessary.

Workload Analyzer creates a workload rule set (i.e., workload definitions and recommended Service Level Goals) by using either:
1. Statistics from DBQL data
2. Migrated current Priority Scheduler settings

Teradata Workload Analyzer can also apply best-practice standards to workload definitions, such as assistance in Service Level Goal (SLG) definition and priority scheduler setting recommendations. In addition, Teradata Workload Analyzer supports the conversion of existing Priority Scheduler Definitions (PD Sets) into new workloads.

Performance Monitor
The Performance Monitor (formerly called PMON) collects near real-time system configuration, resource usage, and session information from the Teradata Database, either directly or through Teradata Manager, and formats and displays this information as requested. Performance Monitor allows you to analyze current performance and both current and historical session information, and to abort sessions that are causing system problems.

These tools cover different phases of a query's life cycle:

Teradata Dynamic Workload Manager (Pre-Execution): Application flow control -- resource control prior to execution. Controls what and how much is allowed to begin execution.

Priority Scheduler (Query Executes): Resource management -- resource control during execution. Manages the level of resources allocated to different priorities of executing work.

Performance Monitor (During Query Execution): Allows the DBA or user to examine the active workload.

Database Query Log (Post-Execution): Analyzes query performance and behavior after completion.
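As a small illustration of the Database Query Log described above, logging is switched on and off with SQL. This sketch assumes a hypothetical user named etl_user; exact DBQL column names vary slightly by release:

    -- Start logging query text for one user (requires appropriate privileges)
    BEGIN QUERY LOGGING WITH SQL ON etl_user;

    -- Later: review logged queries and their response times
    SELECT QueryText, StartTime, FirstRespTime
    FROM DBC.DBQLogTbl
    WHERE UserName = 'ETL_USER';

    -- Stop logging for that user
    END QUERY LOGGING ON etl_user;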

Teradata Active Systems Management (TASM)
Teradata Active System Management is made up of several products/tools that assist the DBA or application developer in defining and refining the rules that control the allocation of resources to workloads running on a system. These rules include filters, throttles, and workload definitions. Workload definitions are rules to control the allocation of resources to workloads and are new with Teradata V2R6.1. Tools are also provided to monitor workloads in real time and to produce historical reports of resource utilization by workloads. By analyzing this information, the workload definitions can be adjusted to improve the allocation of system resources.

TASM is primarily comprised of three products that are used to create and manage workload definitions:
Teradata Dynamic Workload Manager (TDWM) - enhanced with TASM
Teradata Manager - which reviews historical workloads - enhanced with TASM
Teradata Workload Analyzer (TWA) - which recommends candidate workloads for analysis - new with TASM

Teradata Active System Management (TASM) allows you to perform the following:
Limit user concurrency
Monitor Service Level Goals (SLGs) on a system
Optimize mixed workloads
Reject queries based on table access
Prioritize workloads
Provide more consistent response times and influence response times
React to hardware failures
Block access on a table to a user
Determine the workload on a system

Teradata Analyst Pack


Teradata Analyst Pack is a suite of the following products.

Teradata Visual Explain
Teradata Visual Explain makes query plan analysis easier by providing the ability to capture and graphically represent the steps of the plan and perform comparisons of two or more plans. It is intended for application developers, database administrators, and database support personnel to better understand why the Teradata Database Optimizer chooses a particular plan for a given SQL query. All information required for query plan analysis, such as database object definitions, data demographics, and cost and cardinality estimates, is available through the Teradata Visual Explain interface. It is helpful in identifying the performance implications of data skew and bad or missing statistics. Visual Explain uses a Query Capture Database to store query plans, which can then be visualized or manipulated with other Teradata Analyst Pack tools.

Teradata System Emulation Tool (Teradata SET)
Teradata SET simplifies the task of emulating a target system by providing the ability to export and import all information necessary to "fake out" the optimizer in a test environment. This information can be used along with the Target Level Emulation feature to generate query plans on the test system as if they were run on the target system. This feature is useful for verifying queries and reproducing optimizer-related issues in a test environment. Teradata SET allows the user to capture the following by database, query, or workload:
System cost parameters
Object definitions
Random AMP samples
Statistics
Query execution plans
Demographics
This tool does not export user data.

Teradata Index Wizard
Teradata Index Wizard automates the process of manual index design by recommending secondary indexes for a particular workload. Teradata Index Wizard provides a graphical user interface (GUI) that guides the user through analyzing a database workload and provides recommendations for improving performance through the use of indexes. Teradata Index Wizard provides support for Partitioned Primary Index (PPI) recommendations. PPI is discussed in the Indexes module of this course.

Teradata Statistics Wizard
Teradata Statistics Wizard is a graphical tool that has been designed to automate the collection and re-collection of statistics, resulting in better query plans and helping the DBA to efficiently manage statistics. The Statistics Wizard enables the DBA to:
Specify a workload to be analyzed for recommendations to improve the performance of the queries in that workload.
Select databases, tables, indexes, or columns for analysis, collection, or re-collection of statistics.
Schedule the COLLECT STATISTICS activity.
As changes are made within a database, the Statistics Wizard identifies those changes and recommends which tables should have statistics collected, based on age of data and table growth, and which columns/indexes would benefit from having statistics defined and collected for a specific workload. The DBA is then given the opportunity to accept or reject the recommendations.
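The statements that the Statistics Wizard schedules on the DBA's behalf are ordinary SQL; a minimal sketch, with hypothetical table and column names:

    -- Collect statistics on a column and on an index
    COLLECT STATISTICS ON prod.sales COLUMN (store_id);
    COLLECT STATISTICS ON prod.sales INDEX (sale_date);

    -- Re-collection reuses the stored definitions
    COLLECT STATISTICS ON prod.sales;

    -- Review what has been collected
    HELP STATISTICS prod.sales;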


Archival Utilities

Teradata provides the Archive Recovery utility (ARC) to perform backup and restore operations on tables, databases, and other objects. In addition, ARC interfaces to third-party products to support backup and restore capabilities in a network-attached environment. There are several scenarios where restoring objects from external media may be necessary:
Restoring non-Fallback tables after a disk failure.
Restoring tables that have been corrupted by batch processes that may have left the data in an uncertain state.
Restoring tables, views, or macros that have been accidentally dropped by the user.
Miscellaneous user errors resulting in damaged or lost database objects.
ARC can also archive a single partition. With the ARC utility you can copy a table and restore it to another Teradata Database. It is scalable and parallel, and can run on a channel-attached client, network-attached client, or a node.
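An ARC job is driven by a small command script. A hedged sketch follows; the credentials, table name, and archive file handle are hypothetical:

    LOGON mytdp/backup_user,backuppass;
    /* Write the table to the archive file and release the utility lock */
    ARCHIVE DATA TABLES (prod.sales), RELEASE LOCK, FILE=ARCHIVE1;
    LOGOFF;

A matching restore job swaps the ARCHIVE statement for RESTORE DATA TABLES (prod.sales), RELEASE LOCK, FILE=ARCHIVE1;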

Archiving on Channel-Attached Clients


In a channel-attached (mainframe) client environment, ARC is used to back up and restore data. It supports commands written in Job Control Language (JCL). ARC archives and restores database objects, allowing recovery of data that may have been damaged or lost. ARC may be running on the node or on the channel-attached client, and will back up data directly across the channel into the mainframe-attached tape subsystem.


Archiving on Network-Attached Clients


In a network-attached client environment, ARC is used to back up data, along with one of the following tape management products:
NetVault (from BakBone Software Inc.)
Veritas NetBackup (from Symantec Software)
These products provide modules for Teradata Database systems that run on network-attached clients or a node (Microsoft Windows or UNIX MP-RAS). Data is backed up through these interfaces into a tape storage subsystem using the ARC utility.

Exercise 3.1

Processing a Request: Drag an icon from the group on the right to its correct position in the empty boxes on the left. Correctly placed icons will stay where you put them.


To review this topic, click Request Processing.

Exercise 3.2

Select the appropriate Teradata load or unload utility from the pull-down menus.
Enables constant loading (streaming) of data into a table to keep data fresh.
Data extract utility that exports data from a Teradata table and writes it to a host file.
Updates, inserts, or deletes rows in empty or populated tables (block-level operation).
Uses parallel processing to load an empty table.
Performs the same function as the UPDATE Teradata Parallel Transporter operator.
Performs the same function as the STREAM Teradata Parallel Transporter operator.


To review these topics, click FastLoad, MultiLoad, TPump, and FastExport.

Exercise 3.3

Move the software components required for a channel connection into the appropriate blue squares. Correctly placed components will stay where you put them.


To review this topic, click Channel Attached Client.

Exercise 3.4

Which three statements are true? (Choose three.)
A. Teradata SQL Assistant and TDWM are the two utilities used for Teradata system management.
B. TDWM can reject a query based on current workload and set thresholds.
C. BTEQ runs on all client platforms to access the Teradata Database.
D. Archive Recovery (ARC) is used to copy and restore a table to another Teradata Database.
E. NetVault and Veritas NetBackup are utilities used for network management.


To review these topics, click BTEQ, Teradata SQL Assistant, Teradata Manager, TDWM, Archiving on Channel-Attached Clients, and Archiving on Network-Attached Clients.

Exercise 3.5

Select the correct type of connection (network-attached client or channel-attached client) from the drop-down boxes below that corresponds to the listed software and hardware components.
Teradata Gateway
Teradata Director Program
Channel Driver
Ethernet Card
"mainframe host"


To review this topic, click Channel Attached Client or Network Attached Client.

Exercise 3.6

Select the correct Teradata Analyst Pack tool from the drop-down menus below.
Verifies queries and reproduces optimizer-related (query plan) issues in a test environment.
Recommends one or more secondary indexes for a table.
Uses a Query Capture Database to store query plans.
Recommends and automates the statistics collection process.


To review this topic, click Teradata Analyst Pack.

Exercise 3.7

__________ is made up of several products/tools that assist the DBA or application developer in defining and refining the rules (i.e., filters, throttles, and workload definitions) that control the allocation of resources to workloads running on a system.
A. Teradata Workload Analyzer
B. Database Query Log
C. Teradata Active Systems Manager
D. Performance Monitor

To review this topic, click Administrative Utilities.

Exercise 3.8

True or False: Workload definitions are rules to control the allocation of resources to workloads.
A. True
B. False

To review this topic, click Administrative Utilities.


Exercise 3.9

Select the correct term from the drop-down menus below.

__________ is an application programming standard that defines common database access mechanisms to simplify the exchange of data between a client and server. ________-compliant applications connect with a database through the use of a driver that translates the application's ________ commands into database syntax.

__________ is a library of routines that enable an application program to access data stored in the Teradata Database.

__________ is an Application Programming Interface (API) that allows platform-independent Java applications to access a DBMS using Structured Query Language (SQL). It enables the development of web-based Teradata end-user tools that can access Teradata through a web server and also provides support for access to other commercial databases.

__________ is an additional legacy API that allows access to Teradata from a network host.


To review this topic, click Network Attached Client.

Mod 4 - Data Structure

Objectives

After completing this module, you should be able to:
Distinguish between a Teradata Database and a Teradata User.
List and define the Teradata Database objects.
Define Perm Space, Temp Space, and Spool Space, and explain how each is used.
Describe the function of the Data Dictionary.
List methods for authentication and security on Teradata.

HOT TIP: This module contains links to important supplemental course information. Please be sure to click on each hotword link to capture all of the training content.

Creating Databases and Users

In the Teradata Database, Databases (including a special category of Databases called Users) have attributes assigned to them:

Access Rights: Privileges that allow a User to perform operations (such as CREATE, DROP, and SELECT) against database objects. A User must have the correct access rights to a database object in order to access it.

Perm Space: The maximum amount of Permanent Space assigned and available to a User or Database to store tables. Unlike some other relational databases, the Teradata Database does not physically pre-allocate Perm Space for Databases and Users when they are defined. Only the Permanent Space limit is defined; the space is then consumed dynamically as needed. All Databases have a defined upper limit of Permanent Space.

Spool Space: The amount of space assigned and available to a User or Database to gather answer sets. For example, when executing a conditional query, qualifying rows are temporarily stored using Spool Space. Depending on how the system is set up, a single query could temporarily use all available system space to store its result in spool. Permanent Space not being used for tables is available for Spool Space.

Temp Space: The amount of space used for global temporary tables; these results remain available to the User until the session is terminated. Tables created in Temp Space will survive a restart. Permanent Space not being used for tables is available for Temp Space as well as Spool Space.

A Logical Database Hierarchy


In a logical, hierarchical organization, Databases (including Users) are created subordinate to existing Databases or Users. The owning Database or User is called the parent. The subordinate Database or User is called the child. Permanent Space for the new Database or User comes from its immediate parent.

When the Teradata Database software is first installed, all Permanent Space is assigned to Database DBC (also a User in Teradata Database terminology, because you can log on to it with a userid and password). During installation, the following Databases are created:
Database Crashdumps (initially empty)
User SystemFE (with its views and macros)
User SysAdm (with its views and macros)
Because Database DBC is the immediate parent of these child Databases, Permanent Space limits for the children are subtracted from Database DBC.


Creating a New Database


After the initial installation, you will create your database hierarchy. One way to set up this hierarchy is to create a Database Administrator User directly subordinate to Database DBC, with most of the system's Permanent Space assigned to it. This setup gives you the freedom to have multiple administrators logging on to the Database Administrator User, and limits the number of people logging on directly to Database DBC (which has more access rights than any other User). Next, all other Users and Databases are created from the Database Administrator User, and their Permanent Space limits are subtracted from the Database Administrator User's space limit. Your hierarchy would look like this:
Database DBC at the highest level, the parent of all other Databases (including Users).
User SysDBA (we called it SysDBA; you can assign it any name) with the majority of the system's Perm Space assigned to it.
All Databases and Users in the system created from User SysDBA.
Each table, view, macro, stored procedure, and trigger is owned by a Database (or User). A sketch of the DDL for such a setup follows this list.
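The statement below is a minimal sketch of creating such an administrator User, assuming Teradata's standard CREATE USER syntax; the name SysDBA comes from this course, while the space and password values are illustrative assumptions only:

CREATE USER SysDBA FROM DBC
AS PERM = 500e9          -- Permanent Space limit, subtracted from DBC
  ,SPOOL = 100e9         -- upper limit on work space for answer sets
  ,PASSWORD = NewPass123 -- hypothetical initial password
;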

Data Layers
There are several layers built into the EDW environment. These layers include:

Staging: The primary purpose of the staging layer is to perform data transformation, either in the ETL or ELT process.

Semantic: This layer is the access layer. Access is often provided via views and business intelligence (BI) tools, whether a Teradata application or a 3rd-party tool.

Physical: The physical layer is where denormalizations that will make access more efficient occur: pre-aggregations, summary tables, join indexes, etc. The purpose of this layer is to provide efficient, friendly access to end users.

Maximum Perm Space Allocations: An Example


Below is an example of how Permanent Space limits for Users and Databases come from the immediate parent User or Database. In this case, the User SysDBA has 500 GB of maximum Permanent Space assigned to it.

The User HR is created from SysDBA with 200 GB of maximum Permanent Space. The 200 GB for HR is subtracted from SysDBA, who now has 300 GB (500 GB minus 200 GB).

The User Payroll is created as a child of HR with 100 GB of Permanent Space. The 100 GB for Payroll is subtracted from HR, which now has 100 GB (200 GB minus 100 GB).

At a different level under SysDBA, Database Marketing is created as a child of SysDBA, with 100 GB of maximum Permanent Space. The 100 GB for Marketing comes from its parent, SysDBA, which now has 200 GB (300 GB minus 100 GB). A sketch of the corresponding DDL follows.

A Teradata Database

In Teradata Database systems, the words "database" and "user" have specific definitions.

Database: The Teradata Definition
In Teradata, a "database" is a logical grouping of information contained in tables. A Teradata Database also plays a key role in space allocation and access control. A Teradata Database is a defined, logical repository that can contain objects, including:

Database: A defined object that may contain other database objects.
User: A database that has a user ID and password for logging on to the Teradata Database, and may contain other database objects.
Table: A two-dimensional structure of columns and rows of data. (Requires Perm Space)
View: A virtual "window" into subsets of one or more tables or other views. It is pre-defined using a single SELECT statement. (Uses no Perm Space)
Macro: A definition composed of one or more Teradata SQL and report formatting commands. (Uses no Perm Space)
Trigger: One or more Teradata SQL statements attached to a table and executed when specified conditions are met. (Uses no Perm Space)
Stored Procedure: A combination of procedural and non-procedural statements run using a single CALL statement. (Requires Perm Space)
User Defined Function: Allows authorized users to write external functions. Teradata allows users to create scalar functions to return single-value results, aggregate functions to return summary results, and table functions to return tables. UDFs may be used to protect sensitive data such as personally identifiable data.

Note: A Database with no Perm Space can contain views, macros, and triggers, but no tables or stored procedures. These Teradata Database objects are created, maintained, and deleted using SQL.

User: A Special Kind of Database
A user may be a collection of tables, views, macros, triggers, and stored procedures. A user is a specific type of database, and has attributes in addition to the ones listed above: a User ID and a Password. So, a user is the same as a database except that a user can actually log on to the database. To log on to a Teradata Database, you must specify a User (which is simply a database with a password). You cannot log on to a Database because it has no password.

Note: In this course, we will use uppercase "U" for User and uppercase "D" for Database when referring to these specific Teradata Database objects.
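As an illustration of objects that consume no Perm Space, the sketch below creates a view and a macro; the table, column, and object names are hypothetical:

CREATE VIEW HR.CurrentEmp_v AS
SELECT employee_number, last_name
FROM HR.Employee
WHERE emp_status = 'ACTIVE';     -- a pre-defined SELECT; stores no data itself

CREATE MACRO HR.NewHires (since_dt DATE) AS (
SELECT employee_number, hire_date
FROM HR.Employee
WHERE hire_date >= :since_dt; ); -- executed later with: EXEC HR.NewHires ('2011-01-01');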

Spool Space

Maximum Spool Space
As mentioned previously in "Creating Databases and Users," Spool Space is work space used to hold intermediate answer sets. Any Perm Space currently unassigned is available as Spool Space. Defining a Spool Space limit is not required when Users and Databases are created; if it is not defined, the Spool Space limit for the User or Database is inherited from its parent. Thus, if no Spool Space limit were defined for any Users or Databases, an erroneous SQL request could create a "runaway transaction" that consumes all of the system's resources. For this reason, defining Spool Space limits for a User or Database is highly recommended.

The Spool Space limit for a Database or User is not subtracted from its immediate parent, but the Database or User's maximum spool allocation can only be as large as its immediate parent's. For example:
Database A has a Spool Space limit of 500 GB.
Database B is created as a child of Database A. The maximum Spool Space that can be allocated to Database B is 500 GB.


Database C is created as another child of Database A. The maximum Spool Space that can be allocated to Database C is also 500 GB.

Because Spool Space is work space, temporarily used and released by the system as needed, the total maximum Spool Space allocated for all the Databases and Users on the system can actually exceed the total system disk space. But this is not the amount of Spool Space actually consumed.
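A hedged sketch of defining a spool limit with standard Teradata syntax (the 50 GB figure is an arbitrary example):

MODIFY USER HR AS SPOOL = 50e9;  -- must not exceed the immediate parent's spool limit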

Consuming Spool Space


The maximum Spool Space for a Database (or User) is merely an upper limit on the Spool Space that the Database can use while processing a transaction. There are two limits to Spool Space utilization:

The maximum Spool Space assigned to a User or Database. If a transaction is going to exceed its assigned limit, it is aborted and an error message is given stating that the maximum Spool Space was exceeded.

The physical limitation of disk space. For a specific transaction, the system can only use the amount of Spool Space actually available on the system at that particular time, whether a maximum spool limit has been defined or not. If a job is going to exceed the Spool Space available on the system, an error message is given stating that there is not enough space to process the job.

As the amount of Permanent Space used to store data varies over time, so does the amount of space available for spool (work space).


Temporary Space
Temporary Space is Permanent Space currently not being used. Temp Space is used for global temporary tables; the contents of these tables remain available to the user until the session is terminated. Tables created in Temp Space will survive a restart.
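A minimal sketch of a global temporary table, using standard Teradata syntax (table and column names are hypothetical):

CREATE GLOBAL TEMPORARY TABLE Sales_Work
(store_id   INTEGER
,sales_date DATE
,amount     DECIMAL(10,2))
ON COMMIT PRESERVE ROWS;  -- materialized rows use Temp Space and last until the session ends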

Check Your Understanding


Which statement is true? (Check the best answer.)
The Spool Space used by a request is limited to the amount of Spool Space assigned to the originating user and the physical space available on the system at that point in time.
A request can use as much Spool Space as necessary as long as it does not exceed the system's total installed physical space limit.
A request can use as much Spool Space as necessary as long as it does not exceed the Spool Space limit of the originating user, regardless of the space available on the system.
The Spool Space used by a request is limited only by the maximum Perm Space of the originating user.


Feedback:

Data Dictionary

The Data Dictionary is a set of relational tables that contains information about the RDBMS and the database objects within it. It is like the metadata or "data about the data" for a Teradata Database (except that it does not contain business rules, like true metadata does). The Data Dictionary resides in Database DBC. Some of the items it tracks are:
Disk space
Access rights
Ownership
Data definitions

Disk Space
The Data Dictionary stores information about how much space is allocated for perm and spool for each Database and User. The table below shows an example of Data Dictionary information for space allocations. In this example, the Users Payroll and Benefits have no Permanent Space allocated or consumed because they do not contain tables.
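Space allocations like these can be queried through Data Dictionary views; the sketch below assumes the DBC.DiskSpace view, which reports per-AMP figures that are summed here:

SELECT DatabaseName
      ,SUM(MaxPerm)     AS MaxPerm
      ,SUM(CurrentPerm) AS CurrentPerm
FROM DBC.DiskSpace
GROUP BY 1
ORDER BY 1;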

Access
The Data Dictionary also stores information about which Users can access which database objects. System Administrators are often responsible for archiving the system. In the example below, it is likely that the SysAdm User would have access to the tables in the Employee and Crashdumps databases, as well as other objects. When you grant and revoke access to any User for any database object, privileges are stored in the AccessRights table in the Data Dictionary.
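Privileges are granted and revoked with standard SQL; a minimal sketch, reusing the SysAdm and Employee names from this example:

GRANT SELECT ON Employee TO SysAdm;    -- recorded in the Data Dictionary's AccessRights table
REVOKE SELECT ON Employee FROM SysAdm; -- removes the recorded privilege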


Owners
The Data Dictionary also stores information about which Databases and Users own each database object.

Definitions
The Data Dictionary stores definitions of all database objects, their names, and their place in the hierarchy.


For macros, the Data Dictionary also stores the actual SQL statements of the macro. While stored procedures also contain statements (SQL and SPL statements), the statements for each stored procedure are kept in a separate table and distributed among the AMPs (like regular user data), rather than in the Data Dictionary.

Database Security

There are several mechanisms for implementing security on a Teradata Database. These mechanisms include authenticating access to the Teradata Database with the following:
LDAP
Single Sign-On
Passwords

Authentication
After users have logged on to the Teradata Database and have been authenticated, they are authorized access to only those objects allowed by their database privileges.

Additional Security Mechanisms
In addition to authentication, several database objects or constructs allow for a more secure database environment. These include:
Privileges, or Access Rights
Views
Macros
Stored Procedures
User Defined Functions (UDFs)
Roles (a collection of Access Rights)

A privilege (access right) is the right to access or manipulate an object within Teradata. Privileges control user activities such as creating, executing, inserting, viewing, modifying, deleting, or tracking database objects and data. Privileges may also include the ability to grant privileges to other users in the database. In addition to access rights, the database hierarchy can be set up so that users access tables or applications via the semantic layer, which could include Views, Macros, Stored Procedures, and even UDFs. Roles, which are collections of access rights, can be granted to groups of users to further protect the security of data and objects within Teradata.
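A hedged sketch of roles in practice, using standard Teradata role syntax (the role name is hypothetical; HR and Payroll reuse names from this module):

CREATE ROLE HR_Read;                            -- a named collection of access rights
GRANT SELECT ON HR TO HR_Read;                  -- add a privilege to the role
GRANT HR_Read TO Payroll;                       -- give the role to a user
MODIFY USER Payroll AS DEFAULT ROLE = HR_Read;  -- make it active at logon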

Exercise 4.1
When you log on to the Teradata Database, you must specify:
A. The path to the data.
B. A SELECT command.
C. A User and password.
D. An IP address.

Feedback:
To review this topic, click A Teradata Database.

Exercise 4.2
Database_Employee was created with 500 GB of Perm Space. If Database_Addresses (100 GB of Perm Space) and Database_Compensation (100 GB of Perm Space) are both created from Database_Employee, how much available Perm Space does Database_Employee have now?
A. 300 GB
B. 500 GB
C. 600 GB
D. 700 GB

Feedback:

See Calculation. To review this topic, click Space Allocations: An Example.

Exercise 4.3
Select the answers from the options given in the drop-down boxes that correctly complete the sentences.
A view is a "virtual table" that does not exist as an actual table.


Permanent Space is pre-defined and allocated for a Database or User.
Users must have privileges to access any database object.
Perm Space limits apply to Databases, Users, tables, views, macros, triggers, and stored procedures.
Temp Space is used for global temporary tables.
Perm Space is assigned to a User or Database to gather answer sets.

Feedback:

To review these topics, click Creating Databases and Users, Creating a New Database.

Exercise 4.4
Select the choice from the drop-down box that corresponds to each statement:
Privileges granted to Users and Databases.
Work area consumed by the system as it processes requests.
Maximum space allocated to Databases and Users for data.

Feedback:

To review these topics, click Creating Databases and Users.

Exercise 4.5
A Teradata User is a special type of database:
A. Always
B. Sometimes
C. Never

Feedback:
To review this topic, click A Teradata Database.

Exercise 4.6
True or False: A User-Defined Function (UDF) allows authorized users to write external functions. Teradata allows users to create scalar functions to return single-value results, aggregate functions to return summary results, and table functions to return tables. UDFs may be used to protect sensitive data such as personally identifiable data.


A. True B. False

Feedback:
To review this topic, click A Teradata Database.

Exercise 4.7
Which three of the following are Teradata Database security mechanisms for authenticating access to the Teradata Database? (Choose three.)
A. LDAP
B. Single Sign-On
C. User Defined Functions
D. Passwords

Feedback:

To review these topics, click Database Security.

Exercise 4.8
Match the data layers built into the Teradata EDW environment to their definitions:
The primary purpose of this layer is to perform data transformation, either in the ETL or ELT process.
This layer is where denormalizations that will make access more efficient occur: pre-aggregations, summary tables, join indexes, etc. The purpose of this layer is to provide efficient, friendly access to end users.
This is the access layer. Access is often provided via views and business intelligence (BI) tools, whether a Teradata application or a 3rd-party tool.

Feedback:

To review these topics, click Data Layers.

Mod 5 - Data Protection

Objectives
After completing this module, you should be able to:
Describe the types of data protection and fault tolerance used by the Teradata Database.
Discuss the types of RAID protection used on Teradata Database systems.
Explain basic data storage concepts.
Explain the concept of Fallback tables.
List the types and levels of locking provided by the Teradata Database.
Describe the function of recovery journals, transient journals, and permanent journals.

HOT TIP: This module contains links to important supplemental course information. Please be sure to click on each hotword link to capture all of the training content.

Protecting Data

Several types of data protection are available with the Teradata Database. All the data protection methods shown on this page are covered in further detail later in this module.

RAID
Redundant Array of Inexpensive Disks (RAID) is a storage technology that provides data protection at the disk drive level. It uses groups of disk drives called "arrays" to ensure that data is available in the event of a failed disk drive or other component. The word "redundant" implies that data, functions, and/or components have been duplicated in the array's architecture. The industry has agreed on six RAID configuration levels (RAID 0 through RAID 5). The classifications do not imply superiority of one mode over another, but differentiate how data is stored on the disk drives. With the Teradata Database, the two supported RAID technologies are RAID 1 and RAID 5. On systems using EMC disk drives, RAID 5 is called RAID S.


Disk arrays contain the following major components:
SCSI bus
Physical disks
Disk array controllers

For maximum availability and performance, the Teradata Database uses dual redundant disk array controllers. Having two disk array controllers provides a level of protection in case one controller fails, and provides parallelism for disk access.

Fallback
Fallback is a Teradata Database feature that protects data against AMP failure. As shown later in this module, Fallback uses clusters of AMPs that provide for data availability and consistency if an AMP is unavailable.

Locks
Locks can be placed on database objects to prevent multiple users from simultaneously changing them. The four types of locks are:
Exclusive
Write
Read
Access


Journals
The Teradata Database has journals that are used for specific types of data or process recovery:
Recovery Journals
Permanent Journals

RAID 1

RAID 1 is a data protection scheme that uses mirrored pairs of disks to protect data from a single drive failure.

RAID 1: Effects on Your System


RAID 1 requires double the number of disks because every drive has an identical mirrored copy. Recovery with RAID 1 is faster than with RAID 5. The highest level of data protection is RAID 1 with Fallback.

RAID 1: How It Works


RAID 1 protects against a single disk failure using the following principles: Mirroring Reading Mirroring: RAID 1 maintains a mirrored disk for each disk in the system.

Note: If you configure more than one pair of disks per AMP, the RDAC stripes the data across both the regular and mirror disks.

Reading: Using both copies of the data, the system reads data blocks from the first available disk. This does not so much protect data as provide a performance benefit.

RAID 1: How It Handles Failures


If a disk fails, the Teradata Database is unaffected. Reads, writes, and replacements are each handled differently:

Reads: When a drive is down, the system reads the data from the other drive. There may be a minor performance penalty because the read will occur from one drive instead of both.


Writes: When a drive is down, the system writes to the functional drive. No mirror image exists at this time.

Replacements: After you replace the failed disk, the disk array controller automatically reconstructs the data on the new disk from the mirror image. Normal system performance is affected during the reconstruction of the failed disk.

RAID 5

RAID 5 is a data protection scheme that uses parity striping in a disk array to protect data from the failure of a single drive. Note: RAID S is the name for RAID 5 implemented on EMC disk drives.

RAID 5: Effects on Your System


The number of disks per rank varies from vendor to vendor. The number of disks in a rank impacts space utilization: 4 drives per rank requires a 33% increase in data space. 5 drives per rank requires a 25% increase in data space.

RAID 5 also uses some overhead during a write operation, because it has to read the data, then calculate and write the parity.


RAID 5: How It Works


RAID 5 uses a data parity scheme to provide data protection. Rank: For the Teradata Database, RAID 5 uses the concept of a rank, which is a set of disks working together. Note that the disks in a rank are not directly cabled to each other.

Parity: In RAID 5, data is handled as follows:
Data is striped across a rank of disks (spread across the disk drives) one segment at a time, using a binary "exclusive-or" (XOR) algorithm.
Parity is also striped across all disk drives, interleaved with the data. A "parity byte" is an extra byte written to a drive in a rank.

The process of writing data and parity to the disk drives includes a read-modify-write operation for each new segment:
1. Read the existing data on the disk drives in the rank.
2. Read the existing parity in that rank for the corresponding segment.
3. Calculate the new parity: existing data XOR new data XOR existing parity = new parity.
4. Write the new data.
5. Write the new parity.

If one of the disk drives in the rank becomes unavailable, the system uses the parity byte to calculate the missing data from the down drive so the system can remain operational. With a rank of 4 disks, if a disk fails, any missing data block may be reconstructed using the other 3 disks. In the example below, data bytes are written to disk drives 1, 2, and 3. The system calculates the parity byte using the binary XOR algorithm and writes it to disk drive 4.
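A worked illustration of the XOR parity rule (the byte values below are made up for this example):

Data byte on drive 1:  1100 1010
Data byte on drive 2:  0110 0011
Data byte on drive 3:  1010 0101
Parity byte (drive 4): 0000 1100   (the XOR of the three data bytes)

If drive 2 later fails, its byte is recovered by XORing what remains: 1100 1010 XOR 1010 0101 XOR 0000 1100 = 0110 0011, the original contents of drive 2.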

RAID 5: How It Handles Failures


If a disk fails, the Teradata Database is unaffected. Reads, writes, and replacements are each handled differently:

Reads: Data is reconstructed on the fly as users request it, using the binary XOR algorithm.

Writes: When a drive is down, the system writes to the functional drives, but not to the failed drive.

Replacements: After you replace the failed disk, the disk array controller automatically reconstructs the data on the new disk, using known data values to calculate the missing data. Normal system performance is affected during reconstruction of the failed disk.

Give It a Try
In the example below, Disk 2 has experienced a failure. To allow users to still access the data while Disk 2 is down, the system must calculate the data on the missing disk drive using the parity byte. What would be the missing byte for this segment?

A. 1111 0011
B. 0111 1011
C. 0010 0110
D. 0000 1100

Feedback:


Fallback

Fallback is a Teradata Database feature that protects data in the case of an AMP vproc failure. Fallback guarantees the maximum availability of data. You can specify Fallback protection at the table or database level. It is especially useful in applications that require high availability.

Fallback protects your data by storing a second copy of each row of a table on a different AMP in the same cluster. If an AMP fails, the system accesses the Fallback rows to meet requests. Fallback provides AMP fault tolerance at the table level. With Fallback tables, if one AMP fails, all data is still available, and users may continue to use Fallback tables without any loss of access to data.

During or after table creation, you may specify whether the system should keep a Fallback copy of the table. If Fallback is specified, it is automatic and transparent. Fallback guarantees that the two copies of a row will always be on different AMPs, so if either AMP fails, the alternate row is still available on the other AMP.
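A minimal sketch of specifying Fallback with standard Teradata DDL (table and column names are hypothetical):

CREATE TABLE HR.Employee, FALLBACK       -- keep a second copy of every row
(employee_number INTEGER
,last_name       VARCHAR(30))
UNIQUE PRIMARY INDEX (employee_number);

ALTER TABLE HR.Payroll_Hours, FALLBACK;  -- Fallback can also be added to an existing table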

Fallback: Effects on Your System


Fallback has the following effects on your system:

Space
In addition to the original database size, you need space for:
Fallback-protected tables (100% additional storage space for each Fallback-protected table)
RAID protection of Fallback-protected tables


Performance
There is a benefit to protecting your data, but there are costs associated with that benefit. With Fallback, you need twice the disk space for storage and twice the I/O for INSERTs, UPDATEs, and DELETEs of rows in Fallback-protected tables. The Fallback option does not require any extra I/O for SELECTs; the system reads from one copy or the other, and the Fallback I/O is performed in parallel with the primary I/O, so there is no performance hit.

Fallback benefits include:
A level of protection beyond RAID disk array protection.
Can be specified on a table-by-table basis to protect data requiring the highest availability.
Permits access to data while an AMP is off-line.
Automatically restores data that was changed during the AMP off-line period.

The highest level of data protection is Fallback with RAID 1.

Fallback: Software Tools


The following Teradata utilities are used to recover a failed AMP:
Vproc Manager: Enables you to display and modify vproc states, and to initiate Teradata Database restarts.
Table Rebuild: Reconstructs tables on an AMP from data on other AMPs in the cluster.
Recovery Manager: Lets you monitor recovery processing.

Fallback: How It Works


Fallback is accomplished by grouping AMPs into clusters. When a table is defined as Fallback-protected, the system stores a second copy of each row in the table on a "Fallback AMP" in the AMP cluster. Below is a cluster of four AMPs. Each AMP has a combination of Primary and Fallback data rows:

Primary Data Row: A record in a database table that is used in normal system operation.
Fallback Data Row: The online backup copy of a Primary data row that is used in the case of an AMP failure.

Write: Each Primary data row has a duplicate Fallback row on another AMP. The Primary and Fallback data rows are written in parallel.

P=Primary F=Fallback

Read: When an AMP is down with a table that is defined as Fallback, Teradata will access the Fallback copies of the rows. More Clusters: The diagram below shows how Fallback data is distributed among multiple clusters.

P=Primary F=Fallback


Fallback: How It Handles Failures


If two physical disks fail in the same RAID 5 rank or RAID 1 mirrored pair, the associated AMP vproc fails. Fallback protects against the failure of a single AMP in a cluster. If two AMPs in a cluster fail, the system halts and must be restarted manually, after the AMP is recovered by replacing the failed disk(s). Reads: When an AMP fails, the system reads all rows it needs from the remaining AMPs in the cluster. If the system needs to find a Primary row from the failed AMP, it reads the Fallback copy of that row, which is on another AMP.

Writes: A failed AMP is not available, so the system cannot access any of that AMP's disk space. Copies of its unavailable primary rows are available as Fallback rows on the other AMPs in the cluster, and are updated there.

Replacement: Repairing the failed AMP requires replacing the failed physical disks and bringing the AMP online. Once the AMP is online, the system uses the Fallback data on the other AMPs to automatically reconstruct data on the newly replaced disks.

Disk Allocation

The operating system, PDE, and the Teradata Database do not recognize the physical disk hardware. Each software component recognizes and interacts with different components of the data storage environment:

Operating system: Recognizes a logical unit (LUN). The operating system recognizes the LUN as its "disk," and is not aware that it is actually writing to spaces on multiple disk drives. This technique enables the use of RAID technology to provide data availability without affecting the operating system.
PDE: Translates LUNs into vdisks using slices (in UNIX) or partitions (in Microsoft Windows and Linux) in conjunction with the Teradata Parallel Upgrade Tool.
Teradata Database: Recognizes a virtual disk (vdisk). Using vdisks instead of direct connections to physical disk drives enables the use of RAID technology with the Teradata Database.


Creating LUNs
Space on the physical disk drives is organized into LUNs. The RAID level determines how the space is organized. For example, if you are using RAID 5, a LUN includes a region of space from each of the physical disk drives in a rank.

Pdisks: User Data Space


After a LUN is created, it is divided into partitions. In UNIX systems, a LUN consists of one partition, which is further divided into slices:
A boot slice (a very small slice, taking up only 35 sectors)
User slices for storing data. These user slices are called "pdisks" in the Teradata Database.


In Microsoft Windows systems, a LUN consists of multiple partitions, not slices. Thus, LUNs in Microsoft Windows do not have a boot slice. Instead, they contain a "Master Boot Record" that includes information such as the partition layout. The partitions store data and are called "pdisks" in the Teradata Database. Linux systems are similar to Microsoft Windows; both use a Master Boot Record and an MS-DOS-style partition table. In summary, pdisks are the user slices (UNIX) or partitions (Microsoft Windows and Linux) used for storing the tables in a database. A LUN may have one or more pdisks.

Assigning Pdisks to AMPs


The pdisks (user slices or partitions, depending on the operating system) are assigned to an AMP through the software. No cabling is involved. The combined space on the pdisks is considered the AMP's vdisk. An AMP manages only its own vdisk (disk space assigned to it), not the vdisk of any other AMP. All AMPs then work in parallel, processing their portion of the data.

Vdisks and Ranks


Each AMP in the system is assigned one vdisk. Although numerous configurations are possible, generally all pdisks from a rank (RAID 5) or mirrored pair (RAID 1) are assigned to the same AMP for optimal performance. However, an AMP recognizes only the vdisk. The AMP has no control over the physical disks or ranks that compose the vdisk.


Reviewing the Terminology


To help review the terminology you just learned, choose the correct term from the pull-down boxes next to each definition.
A logical unit that is composed of a region of space from each of the physical disk drives in a rank. The operating system sees this logical unit as its "disk," and is not aware that it is actually writing to spaces on multiple disk drives.
For a UNIX system, a portion of physical disk drive space that is used for storing data. One of these from each disk drive in a rank composes a LUN.
For a Microsoft Windows system, a portion of physical disk drive space that is used for storing data. One of these from each disk drive in a rank composes a LUN.
This is Teradata Database terminology for a user slice (UNIX) or partition (Microsoft Windows) that stores data. It is just another name for user slice or partition, but from the Teradata Database point of view. These are assigned to AMPs, which manage the data stored.
This is the collective name for all the logical disk space that an AMP manages. Thus, it is composed of all the pdisks assigned to that AMP (as many as 64 pdisks).

Feedback:

Journals for Data Availability

The following journals are kept on the system to provide data availability in the event of a component or process failure in the system:


Recovery Journals
Permanent Journals

Recovery Journals
The Teradata Database uses Recovery Journals to automatically maintain data integrity in the case of:
An interrupted transaction (Transient Journal)
An AMP failure (Down-AMP Recovery Journal)

Recovery Journals are created, maintained, and purged by the system automatically, so no DBA intervention is required. Recovery Journals are tables stored on disk arrays like user data, so they take up disk space on the system.

Transient Journal
A Transient Journal maintains data integrity when in-flight transactions are interrupted (due to aborted transactions, system restarts, and so on). Data is returned to its original state after a transaction failure. A Transient Journal is used during normal system operation to keep "before images" of changed rows so the data can be restored to its previous state if the transaction is not completed. This happens on each AMP as changes occur. When a transaction is started, the system automatically stores a copy of all the rows affected by the transaction in the Transient Journal until the transaction is committed (completed). Once the transaction is complete, the "before images" are purged. In the event of a transaction failure, the "before images" are reapplied to the affected tables and deleted from the journal, and the "rollback" operation is completed.

Down-AMP Recovery Journal
The Down-AMP Recovery Journal allows continued system operation while an AMP is down (for example, when two disk drives fail in a rank or mirrored pair). A Down-AMP Recovery Journal is used with Fallback-protected tables to maintain a record of write transactions (updates, creates, inserts, deletes, etc.) on the failed AMP while it is unavailable. The Down-AMP Recovery Journal starts automatically after the loss of an AMP in a cluster. Any changes to the data on the failed AMP are logged into the Down-AMP Recovery Journal by the other AMPs in the cluster. When the failed AMP is brought back online, the restart process includes applying the changes in the Down-AMP Recovery Journal to the recovered AMP. The journal is discarded once the process is complete, and the AMP is brought online, fully recovered.

Permanent Journals


Permanent Journals are an optional feature used to provide an additional level of data protection. You specify the use of Permanent Journals at the table level. A Permanent Journal provides full-table recovery to a specific point in time. It can also reduce the need for costly and time-consuming full-table backups. Permanent Journals are tables stored on disk arrays like user data, so they take up additional disk space on the system. The Database Administrator maintains the Permanent Journal entries (deleting, archiving, and so on).

How Permanent Journals Work
A Database (object) can have one Permanent Journal. When you create a table with Permanent Journaling, you must specify whether the Permanent Journal will capture:
Before images -- for rollback, to "undo" a set of changes to a previous state.
After images -- for rollforward, to "redo" to a specific state.
You can also specify that the system keep both before images and after images. In addition, you can choose that the system captures:
Single images (the default) -- the Permanent Journal table is not Fallback protected.
Dual images -- the Permanent Journal table is Fallback protected.

The Permanent Journal captures images concurrently with standard table maintenance and query activity. The additional disk space required may be calculated in advance to ensure adequate resources. Periodically, the Database Administrator must dump the Permanent Journal to external media, reducing the need for full-table backups since only changes are backed up rather than the entire database.
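A hedged sketch of Permanent Journal DDL, assuming standard Teradata journaling options (all names and sizes are illustrative):

CREATE DATABASE Finance FROM SysDBA
AS PERM = 50e9
  ,DEFAULT JOURNAL TABLE = Finance.FinJrnl;  -- the database's single Permanent Journal

CREATE TABLE Finance.Ledger,
  BEFORE JOURNAL,      -- single before images, for rollback
  DUAL AFTER JOURNAL   -- duplicated after images, for rollforward
(acct_id INTEGER
,balance DECIMAL(12,2))
PRIMARY INDEX (acct_id);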

Locks

Locking prevents multiple users who are trying to access or change the same data simultaneously from violating data integrity. This concurrency control is implemented by locking the target data. Locks are automatically acquired during the processing of a request and released when the request is terminated.

Levels of Locking
Locks may be applied at three levels:
Database Locks: Apply to all tables and views in the database.


Table Locks: Apply to all rows in the table or view.
Row Hash Locks: Apply to a group of one or more rows in a table.

Types of Locks
The four types of locks are described below.

Exclusive: Exclusive locks are applied to databases or tables, never to rows. They are the most restrictive type of lock. With an exclusive lock, no other user can access the database or table. Exclusive locks are used when a Data Definition Language (DDL) command is executed (e.g., CREATE TABLE). An exclusive lock on a database or table prevents other users from obtaining any lock on the locked object.

Write: Write locks enable users to modify data while maintaining data consistency. While the data has a write lock on it, other users can only obtain an access lock. During this time, all other locks are held in a queue until the write lock is released.

Read: Read locks are used to ensure consistency during read operations. Several users may hold concurrent read locks on the same data, during which time no data modification is permitted. Read locks prevent other users from obtaining exclusive or write locks on the locked data.

Access: Access locks can be specified by users unconcerned about data consistency. The use of an access lock allows for reading data while modifications are in process. Access locks are designed for decision support on tables that are updated only by small, single-row changes. Access locks are sometimes called "stale read" locks, because you may get "stale data" that has not been updated. Access locks prevent other users from obtaining only an exclusive lock on the locked data.

What Type of Lock?


Match the type of lock to the descriptions:
Allows other users to see a stable version of the data, but not make any modifications.
Allows other users to obtain an access lock only, not any other type of lock.
This kind of lock cannot be applied to rows.

Feedback:

Exercise 5.1
True or False: If a single disk drive fails, the Teradata Database halts, then restarts.
A. True
B. False

Feedback:
To review this topic, click RAID 1: How It Handles Failures or RAID 5: How It Handles Failures.

Exercise 5.2
RAID 5 protects data from disk failures using:
A. DARDAC
B. Mirroring
C. Parity Striping
D. Partitioning


Feedback:
To review this topic, click RAID 5: How It Works.

Exercise 5.3
Match the type of journal to the appropriate phrase:
Stores before-images and after-images.
Protects data from a transaction that does not complete.
Starts logging changes for a Fallback table when an AMP goes down.

Feedback:

To review this topic, click Down-AMP Recovery Journal, Transient Journal, or Permanent Journals.

Exercise 5.4
Which three statements are true? (Choose three.)
A. Fallback protects data from the failure of one AMP per cluster.
B. A clique provides protection in the case of a node failure.
C. ARC protects disk arrays from electrostatic discharge.
D. Locks prevent multiple users from simultaneously changing the same data.

Feedback:

To review these topics, click Fallback, Cliques Provide Resiliency, Archival Utilities, or Locks.

Exercise 5.5
1. True or False: Restoration of Fallback-protected data starts automatically when a failed AMP is brought online.
A. True
B. False

Feedback:
2. True or False: Fallback protection is specified at the row hash level.
A. True
B. False

Feedback:


To review these topics, click Fallback, Fallback: How It Handles Failures.

Exercise 5.6
From the drop-down boxes below, match the storage concepts to the descriptions:
The collection of pdisks used to store data. This space is assigned to an AMP.
A collection of areas across the disk drives in a rank. The operating system sees this as its logical "disk."
A collection of AMPs that keeps Fallback copies of rows for each other in case one AMP fails.
An area of a LUN (also known as a user slice in UNIX or partition in Microsoft Windows) that stores user data.
A collection of disk drives used to provide data availability.

Feedback:

To review these topics, click Assigning Pdisks to AMPs, Creating LUNs, Fallback: How It Works, Pdisks: User Data Space, or RAID 5: How It Works.

Mod 6 - Indexes

Objectives
After completing this module, you should be able to:
List tasks Teradata Database Administrators never have to perform.
Define primary and secondary indexes and their purposes.
Distinguish between a primary index and a primary key.
Distinguish between a UPI and a NUPI.
Define a Partitioned Primary Index (PPI) and its purpose.
Distinguish between a USI and a NUSI.
Explain the makeup of the Row-ID and its role in row storage.
Describe the sequence of events for locating a row.
Explain the roles of the hashing algorithm and hash map in locating a row.
Describe the operation of a full-table scan.

HOT TIP: This module contains links to important supplemental course information. Please be sure to click on each hotword link to capture all of the training content.

Indexes in the Teradata Database

Indexes are used to access rows from a table without having to search the whole table. In the Teradata Database, an index is made up of one or more columns in a table. Once Teradata Database indexes are selected, they are maintained by the system. While other vendors may require data partitioning or index maintenance, these tasks are unnecessary with the Teradata Database.

In the Teradata Database, there are two types of indexes:
Primary Indexes define the way the data is distributed.
Primary Indexes and Secondary Indexes are used to locate data rows more efficiently than scanning the whole table.

You specify which column or columns are used as the Primary Index when you create a table. Secondary Index columns can be specified when you create a table or at any time during the life of the table.

Data Distribution
When the Primary Index for a table is well chosen, the rows are evenly distributed across the AMPs for the best performance. The way to guarantee even distribution of data is by choosing a Primary Index whose columns contain unique values. The values do not have to be evenly spaced, or even "truly random," they just have to be unique to be evenly distributed. Each AMP is responsible for a subset of the rows in a table. If the data is evenly distributed, the work is evenly divided among the AMPs so they can work in parallel and complete their processing about the same time. Even data distribution is critical to performance because it optimizes the parallel access to the data.

Unevenly distributed data, also called "skewed data," causes slower response time as the system waits for the AMP(s) with the most data to finish their processing. The slowest AMP becomes a bottleneck. If distribution is skewed, an all-AMP operation will take longer than if all AMPs were evenly utilized.


When data is loaded into the Teradata Database:
The system automatically distributes the data across the AMPs based on row content (the Primary Index values).
The distribution is the same regardless of the data volume being loaded. In other words, large tables are distributed the same way as small tables.
Data is not distributed in any particular order. The benefit of unordered data is that it requires no maintenance to preserve order, and the order is independent of any query being submitted.

The automatic, unordered distribution of data eliminates tasks for a Teradata Database Administrator that are necessary with some other relational database systems. The DBA does not waste time on labor-intensive data maintenance tasks.

Teradata Database Manageability


A key benefit of the Teradata Database is its manageability. The list of tasks that Teradata Database Administrators do not have to do is long, and illustrates why the Teradata Database system is so easy to manage and maintain compared to other databases.

Things Teradata Database Administrators Never Have to Do
Teradata Database Administrators never have to do the following tasks:
Reorganize data or index space.
Pre-allocate table/index space.
Physically partition disk space. While it is possible to have partitioned indexes in the Teradata Database, they are not required, and they are created logically.
Pre-prepare data for loading (convert, sort, split, etc.).
Unload/reload data spaces due to expansion. With the Teradata Database, the data can be redistributed on the larger configuration with no offloading and reloading required.
Write or run programs to split input source files into partitions for loading.


With the Teradata Database, the workload for creating a table of 100 rows is the same as for creating a table with 1,000,000,000 rows. Teradata Database Administrators know that if data doubles, the system can expand easily to accommodate it. The Teradata Database provides huge cost advantages, especially when it comes to staffing Database Administrators. Customers tell us that their DBA staff requirements for administering non-Teradata databases are three to four times higher.

How Other Databases Store Rows and Manage Data
Even data distribution is not easy for most databases to achieve. Many databases use range distribution, which creates intensive maintenance tasks for the DBA. Others may use indexes as a way to select a small amount of data to return the answer to a query, using them to avoid accessing the underlying tables if possible. The assumption is that the index will be smaller than the tables, so it will take less time to read. Because they scan indexes and use only part of the data in the index to search for answers to a query, they can carry extra data in the indexes, duplicating data in the tables. This way they do not have to read the table at all in some cases. This is not as efficient as the Teradata Database's method of data storage and access.

Other DBAs have to ask themselves questions like:
How should I partition the data?
How large should I make the partitions?
Where do I have data contention?
How are the users accessing the data?

Many other databases require the DBAs to manually partition the data. They might place an entire table in a single partition; the disadvantage of this approach is that it creates a bottleneck for all queries against that data. It is not the most efficient way to either store or access data rows. With other databases, adding, updating, and deleting data affects manual data distribution schemes, thereby reducing query performance and requiring reorganization.

A Teradata Database provides high performance because it distributes the data evenly across the AMPs for parallel processing. No partitioning or data reorganizations are needed. With the Teradata Database, your DBA can spend more time with users developing strategic applications to beat your competition!

What Do You Think?


Which two statements are true about data distribution and Teradata Database indexes? (Choose two.)
A. If a table has 103 rows and there are 4 AMPs in the system, each AMP will not have exactly the same number of rows from that table. However, if the Primary Index is chosen well, each AMP will still contain some rows from that table.
B. The rows of a table are stored on a single disk for best access performance.
C. Skewed data leads to poor performance in processing data access requests.
D. Teradata Database performance can be increased by maintaining the indexes and conducting periodic data partitioning and sorting.

Feedback:

Primary Index

A Primary Index (PI) is the physical mechanism for assigning a data row to an AMP and a location on the AMP's disks. It is also used to access rows without having to search the entire table. A Primary Index operation is always a one-AMP operation. You specify the column(s) that comprise the Primary Index for a table when the table is created. For a given row, the Primary Index value is the combination of the data values in the Primary Index columns. Choosing a Primary Index for a table is perhaps the most critical decision a database designer makes, because this choice affects both data distribution and access.

Primary Index Rules


The following rules govern how Primary Indexes in a Teradata Database must be defined, as well as how they function:
Rule 1: One Primary Index per table.
Rule 2: A Primary Index value can be unique or non-unique.
Rule 3: The Primary Index value can be NULL.
Rule 4: The Primary Index value can be modified.
Rule 5: The Primary Index of a populated table cannot be modified.
Rule 6: A Primary Index has a limit of 64 columns.

Rule 1: One PI Per Table


Each table must have a Primary Index. The Primary Index is the way the system determines where a row will be physically stored. While a Primary Index may be composed of multiple columns, the table can have only one (single- or multiple-column) Primary Index.


Rule 2: Unique or Non-Unique PI


There are two types of Primary Index:

Unique Primary Index (UPI): For a given row, the combination of the data values in the columns of a Unique Primary Index is not duplicated in other rows within the table, so the columns are unique. This uniqueness guarantees even data distribution and direct access. For example, in the case where old employee numbers are sometimes recycled, the combination of the Social Security Number and Employee Number columns would be a UPI. With a UPI, there is no duplicate row checking done during a load, which makes it a faster operation.

Non-Unique Primary Index (NUPI): For a given row, the combination of the data values in the columns of a Non-Unique Primary Index can be duplicated in other rows within the table, so there can be more than one row with the same PI value. A NUPI can cause skewed data, but in specific instances can still be a good Primary Index choice. For example, either the Department Number column or the Hire Date column might be a good choice for a NUPI if you will be accessing the table most often via these columns.


Rule 3: PI Can Be NULL


If the Primary Index is unique, you could have one row with a null value. If you have multiple rows with a null value, the Primary Index must be Non-Unique.

Rule 4: PI Value Can Be Modified


The Primary Index value can be modified. In the table below, if Loretta Ryan changes departments, the Primary Index value for her row changes. When you update the index value in a row, the Teradata Database re-hashes it and redistributes the row to its new location based on its new index value.

Rule 5: PI Cannot Be Modified


The Primary Index of a populated table cannot be modified. In the event that you need to change the Primary Index, you must drop the table, recreate it with the new Primary Index, and reload the table. (The ALTER TABLE statement allows you to change the PI of a table only if the table is empty.)

Rule 6: PI Has 64-Column Limit


You can designate a Primary Index that is composed of 1 to 64 columns.

SQL Syntax for Creating a Primary Index


When a table is created, it must have a Primary Index specified. The Primary Index is designated in the CREATE TABLE statement in SQL. If you do not specify a Primary Index in the CREATE TABLE statement, the system will use the Primary Key as the Primary Index. If a Primary Key has not been specified, the system will choose the first unique column. If there are no unique columns, the system will use the first column in the table and designate it as a Non-Unique Primary Index. Creating a Unique Primary Index The SQL syntax to create a Unique Primary Index is:
CREATE TABLE sample_1
(col_a INT
,col_b INT
,col_c INT)
UNIQUE PRIMARY INDEX (col_b);

Creating a Non-Unique Primary Index The SQL syntax to create a Non-Unique Primary Index is:
CREATE TABLE sample_2
(col_x INT
,col_y INT
,col_z INT)
PRIMARY INDEX (col_x);

Modifying the Primary Index of a Table
As mentioned in the Primary Index rules, you cannot modify the Primary Index of a populated table. In the event that you want to change the Primary Index, you must drop the table, recreate it with the new Primary Index, and reload the table.

Data Mechanics of Primary Indexes

This section describes how Primary Indexes are used in:
Data distribution
Data access


Distributing Rows to AMPs


The Teradata Database uses hashing to randomly distribute data across all AMPs for balanced performance. For example, in a two-clique system, data is hashed across all AMPs in the system for even data distribution, which results in evenly distributed workloads. Each AMP holds a portion of the rows of each table. An AMP is responsible for the storage, maintenance, and retrieval of the data under its control. The Teradata Database's automatic hash distribution eliminates costly data maintenance tasks. An advantage of the Teradata Database is that the Teradata File System manages data and disk space automatically, which eliminates the need to rebuild indexes when tables are updated or structures change.

Rows are distributed to AMPs during the following operations:
Loading data into a table (one or more rows, using a data loading utility)
Inserting or updating rows (one or more rows, using SQL)
Changing the system configuration (redistribution of data, caused by reconfigurations to add or delete AMPs)

When loading data or inserting rows, the data being affected by the load or insert is not available to other users until the transaction is complete. During a reconfiguration, no data is accessible to users until the system is operational in its new configuration.

Row Distribution Process
The process the system uses for inserting a row on an AMP is described below:

1. The system uses the Primary Index value in each row as input to the hashing algorithm.
2. The output of the hashing algorithm is the row hash value (in this example, 646).
3. The system looks at the hash map, which identifies the specific AMP where the row will be stored (in this example, AMP 3).
4. The row is stored on the target AMP.

Duplicate Row Hash Values
It is possible for the hashing algorithm to end up with the same row hash value for two different rows. There are two ways this could happen:
Duplicate NUPI values: If a Non-Unique Primary Index is used, duplicate NUPI values will produce the same row hash value.
Hash synonym: Also called a hash collision, this occurs when the hashing algorithm calculates an identical row hash value for two different Primary Index values. Hash synonyms are rare.

When using a Unique Primary Index, you will still get uniform data distribution. To differentiate each row in a table, every row is assigned a unique Row ID. The Row ID is the combination of the row hash value and a uniqueness value.

Row ID = Row Hash Value + Uniqueness Value


The uniqueness value is used to differentiate between rows whose Primary Index values generate identical row hash values. In most cases, only the row hash value portion of the Row ID is needed to locate the row.

When each row is inserted, the AMP adds the row ID, stored as a prefix of the row. The first row inserted with a particular row hash value is assigned a uniqueness value of 1. The uniqueness value is incremented by 1 for any additional rows inserted with the same row hash value.
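If you want to see this hashing at work, Teradata provides the HASHROW, HASHBUCKET, and HASHAMP functions. The following is a minimal sketch, assuming a hypothetical employee table whose Primary Index is employee_id:

SELECT employee_id
      ,HASHROW(employee_id) AS row_hash
      ,HASHBUCKET(HASHROW(employee_id)) AS hash_bucket
      ,HASHAMP(HASHBUCKET(HASHROW(employee_id))) AS target_amp
FROM employee;
-- target_amp shows which AMP the hash map assigns each row hash to,
-- mirroring steps 1 through 3 of the row distribution process above.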

Duplicate Rows
A duplicate row is a row in a table whose column values are identical to another row in the same table. In other words, the entire row is the same, not just the index. Although duplicate rows are not allowed in the relational model (because every Primary Key must be unique), the ANSI standard does allow duplicate rows, and the Teradata Database supports them. Since duplicate rows are allowed in the Teradata Database, how is this reconciled with a UPI, which, by definition, is unique? When you create a table, the following options determine whether or not it can contain duplicate rows:
MULTISET tables: May contain duplicate rows. The Teradata Database will not check for duplicate rows.
SET tables: The default. The Teradata Database checks for and does not permit duplicate rows. If a SET table is created with a Unique Primary Index, the check for duplicate rows is replaced by a check for duplicate index values.
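As a sketch (the table and column names are hypothetical), the only difference in the DDL is the SET or MULTISET keyword:

CREATE MULTISET TABLE call_detail
  (caller_id INTEGER
  ,call_time TIMESTAMP(0)
  ,duration INTEGER)
PRIMARY INDEX (caller_id);
-- MULTISET: fully identical rows may be stored; no duplicate-row check is performed.
-- With CREATE SET TABLE instead, an insert of a fully identical row would be rejected.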

file://C:\Documents and Settings\PJ186002\Desktop\teradata intoduction wbt.htm

10/20/2011

Page 116 of 137

Accessing a Row With a Primary Index


When a user submits an SQL request against a table using a Primary Index, the request becomes a one-AMP operation, which is the most direct and efficient way for the system to find a row. The process is explained below.

Hashing Process
1. The primary index value goes into the hashing algorithm.
2. The output of the hashing algorithm is the row hash value.
3. The hash map points to the specific AMP where the row resides.
4. The PE sends the request directly to the identified AMP.
5. The AMP locates the row(s) on its vdisk.
6. The data is sent over the BYNET to the PE, and the PE sends the answer set on to the client application.

Choosing a Unique or Non-Unique Primary Index

Criteria for choosing a Primary Index include:

Uniqueness: A UPI is often a good choice because it:
Guarantees even data distribution.
Eliminates duplicate row checking during a load, which makes it a faster operation.
A NUPI with few duplicate values could provide good (if not perfectly uniform) distribution, and might meet the other criteria better.

Known Access Paths - Use in value access: Retrievals, updates, and deletes that specify the Primary Index are much faster than those that do not. Because a Primary Index is a known access path to the data, it is best to choose column(s) that will be frequently used for access. For example, the following SQL statement would directly access a row based on the equality WHERE clause:


SELECT * FROM employee WHERE employee_ID = 'ABC456789';

A NUPI may be a better choice if the access is based on another, mostly unique column. For example, the table may be used by the Mail Room to track package delivery. In that case, a column containing room numbers or mail stops may not be unique if employees share offices, but it may be a better choice for access.

Join Performance - Use in join access: SQL requests that use a join perform best when the join is done on a Primary Index. Consider Primary Key and Foreign Key columns as potential candidates for Primary Indexes. For example, if the Employee table and the Payroll table are related by the Employee ID column, then the Employee ID column could be a good Primary Index choice for one or both of the tables. For join performance, a NUPI can be a better choice than a UPI. (See the sketch below.)

Non-volatile values: Look for columns where the values do not change frequently. For example, in an Invoicing table, the outstanding balance column for all customers probably has few duplicates, but probably changes too frequently to make a good Primary Index. A customer ID, statement number, or other more stable columns may be better choices.

When choosing a Primary Index, try to find the column(s) that best fit these criteria and the business need.
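A minimal sketch of the join criterion, assuming hypothetical employee and payroll tables that both use employee_id as their Primary Index; because matching Primary Index values hash to the same AMP, the join can be resolved without moving rows between AMPs:

SELECT e.employee_id
      ,e.last_name
      ,p.net_pay
FROM employee e
INNER JOIN payroll p
  ON e.employee_id = p.employee_id;   -- PI-to-PI join: AMP-local, no row redistribution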

What Do You Think?


Which three are key considerations in choosing a Primary Index? (Choose three.)
A. Column(s) containing unique (or nearly unique) values for uniform distribution.
B. Column(s) with values in sequential order for best load and access performance.
C. Column(s) frequently used in queries to access data or to join tables.
D. Column(s) with values that are stable (do not change frequently), to minimize redistribution of table rows.
E. Column(s) with many duplicate values for redundancy.

Feedback:
A, C, and D are the key considerations: uniform distribution, frequently used access paths, and stable values.

Partitioned Primary Index

The Teradata Database provides an indexing mechanism called Partitioned Primary Index (PPI). PPI is used to:


Improve performance for large tables when you submit queries that specify a range constraint.
Reduce the number of rows to be processed by using a technique called partition elimination.
Increase performance for incremental data loads, deletes, and data access when working with large tables with range constraints.
Instantly drop old data and rapidly add new data.
Avoid full-table scans without the overhead of a Secondary Index.

How Does PPI Work?


Data distribution with PPI is still based on the Primary Index: the Primary Index value is hashed, and the resulting hash value determines which AMP gets the row. What PPI affects is the ORDER in which the rows are stored on the AMP. Using the traditional method, No Partitioned Primary Index (NPPI), the rows are stored in row hash order.
4 AMPs with Orders Table Defined with NPPI

Using PPI, the rows are stored first by partition and then by row hash. In our example, there are four partitions. Within the partitions, the rows are stored in row hash order.
4 AMPs with Orders Table Defined with PPI on O_Date


With PPI, the Optimizer uses partition elimination to skip partitions that are not included in the query. This reduces the number of partitions to be accessed and rows to be processed. For example, in the table above, a query specifying the date 02/09 allows the Optimizer to eliminate the other partitions, so each AMP can access just the 02/09 partition to retrieve the rows.

The multilevel PPI feature improves response to business questions. Specifically, it improves the performance of queries that can take advantage of partition elimination. For example, an insurance claims table could be partitioned by claim date and then sub-partitioned by state. The analysis performed for a specific state (such as Connecticut) within a date range that is a small percentage of the many years of claims history in the data warehouse (such as March 2006) would take advantage of partition elimination for faster performance. Similarly, a retailer may commonly run an analysis of retail sales for a particular district (such as eastern Canada) for a specific timeframe (such as the first quarter of 2004) on a table partitioned by date of sale and sub-partitioned by sales district.

Data Storage Using PPI


To store rows using PPI, specify partitioning in the CREATE TABLE statement. A request then runs through the hashing algorithm as normal and comes out with the Base Table ID, the partition number(s), the row hash, and the Primary Index values.
Data Storage Using PPI
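A minimal sketch of a PPI definition, assuming a hypothetical orders table partitioned by month of order date:

CREATE TABLE orders
  (order_id INTEGER NOT NULL
  ,order_date DATE NOT NULL
  ,total_amount DECIMAL(10,2))
PRIMARY INDEX (order_id)
PARTITION BY RANGE_N (order_date BETWEEN DATE '2011-01-01'
                                     AND DATE '2011-12-31'
                      EACH INTERVAL '1' MONTH);
-- Rows still hash to AMPs by order_id; within each AMP they are stored
-- first by partition (month of order_date), then by row hash.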


Access Without a PPI


Let's say you have a table with Store information by Location and did not use a PPI. If you query on Location 3 on this NPPI table, the entire table will be scanned to find records for Location (Full-Table Scan).
Access Without a PPI

QUERY: SELECT * FROM Store_NPPI WHERE Location_Number = 3;
PLAN: ALL-AMPs - Full-Table Scan


Access with a PPI


In the same example for a PPI table, you would partition the table with as many Locations as you have (or will soon have). Then if you query on Location 3, each AMP uses partition elimination and only has to scan partition 3 for the query. This query will run much faster than the Full-Table Scan in the previous example.
Access With a PPI

QUERY: SELECT * FROM Store WHERE Location_Number = 3;
PLAN: ALL-AMPs - Single Partition Scan
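You can confirm partition elimination yourself with EXPLAIN. The sketch below reuses the hypothetical Store tables above; the exact plan wording varies by release:

EXPLAIN SELECT * FROM Store WHERE Location_Number = 3;
-- On the NPPI table, the plan reports an all-rows scan of the table.
-- On the PPI table, it reports a scan of a single partition on each AMP.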

Multi-Level Partitioned Primary Index

Multi-level partitioning allows each partition of a PPI table to be sub-partitioned. With MLPPI you can use multiple partitioning expressions, instead of only one, for a table or a non-compressed join index. Each partitioning level is defined independently using a RANGE_N or CASE_N expression. With a multi-level PPI (MLPPI), you create multiple access paths to the rows in the base table that the Optimizer can choose from. This improves response to business questions by improving the performance of queries that take advantage of partition elimination. For example, an insurance claims table could be partitioned by claim date and then sub-partitioned by state. The analysis performed for a specific state (such as Connecticut) within a date range that is a small percentage of the many years of claims history in the data warehouse (such as March 2006) would take advantage of partition elimination for faster performance.

Note: an MLPPI table must have at least two partition levels defined.

Syntax:
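The original syntax diagram is not reproduced here; the following is a hedged sketch of a two-level MLPPI for the hypothetical claims example, partitioning by month with RANGE_N and then by state with CASE_N:

CREATE TABLE claims
  (claim_id INTEGER NOT NULL
  ,claim_date DATE NOT NULL
  ,state_code CHARACTER(2) NOT NULL
  ,claim_amount DECIMAL(10,2))
PRIMARY INDEX (claim_id)
PARTITION BY (RANGE_N (claim_date BETWEEN DATE '2000-01-01'
                                      AND DATE '2011-12-31'
                       EACH INTERVAL '1' MONTH)
             ,CASE_N (state_code = 'CT'
                     ,state_code = 'NY'
                     ,NO CASE, UNKNOWN));
-- Level 1 partitions by month of claim_date; level 2 sub-partitions by state.
-- A query on one state within one month touches only that sub-partition.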


Advantages and Disadvantages


Advantages of partitioned tables:
They provide efficient searches by using partition elimination at the various levels or combination of levels.
They reduce the I/O for range constraint queries.
They take advantage of dynamic partition elimination.
They provide multiple access paths to the data; an MLPPI provides even more partition elimination and more partitioning expression choices (i.e., you can use last name or some other value that is more readily available to query on).
The Primary Index may be either a UPI or a NUPI; a NUPI allows local joins to other similar entities.
Row hash locks are used for SELECT with equality conditions on the PI columns.
Partitioned tables allow for fast deletes of data in a partition.
They allow for range queries without having to use a secondary index.
Specific partitions may be archived or deleted.
They may be created on volatile tables, global temporary tables, base tables, and non-compressed join indexes.
They may replace a Value-Ordered NUSI for access.

Disadvantages of partitioned tables:
Rows in a partitioned table are 2 bytes longer.
Access via the Primary Index may take longer.
Full-table joins to an NPPI table with the same PI may take longer.

What is a NoPI Table?

A NoPI Table is simply a table without a primary index. It is a Teradata 13.00 feature. As rows are inserted into a NoPI table, they are always appended at the end of the table and never inserted in the middle of a hash sequence. Organizing and sorting rows based on row hash is therefore avoided.

Prior to Teradata Database 13.00, Teradata tables required a primary index. The primary index was primarily used to hash and distribute rows to the AMPs according to hash ownership. The objective was to divide data as evenly as possible among the AMPs to make use of Teradata's parallel processing. Each row stored in a table has a RowID that includes the row hash generated by hashing the primary index value. For example, the Optimizer can choose an efficient single-AMP execution plan for SQL requests that specify values for the columns of the primary index.

Starting with Teradata Database 13.00, a table can be defined without a primary index. This feature is referred to as the NoPI Table feature. NoPI stands for No Primary Index. Without a PI, the hash value, as well as AMP ownership of a row, is arbitrary. Within the AMP, there are no row-ordering constraints, so rows can be appended to the end of the table as if it were a spool table. Each row in a NoPI table has a hash bucket value that is internally generated. A NoPI table is internally treated as a hashed table; it is just that typically all the rows on one AMP will have the same hash bucket value.

Benefits:
A NoPI table will reduce skew in intermediate ETL tables which have no natural PI.
Loads (FastLoad and TPump array insert) into a NoPI staging table are faster.
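A minimal sketch of a NoPI staging table (the names are hypothetical), using the NO PRIMARY INDEX clause introduced in Teradata 13.00:

CREATE MULTISET TABLE stage_sales
  (sale_id INTEGER
  ,sale_date DATE
  ,amount DECIMAL(10,2))
NO PRIMARY INDEX;
-- Incoming rows are appended to the end of the table on each AMP with no
-- hash-sequence ordering, so bulk loads avoid row-sorting overhead.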

Secondary Index

A Secondary Index (SI) is an alternate data access path. It allows you to access the data without having to do a full-table scan. Secondary Indexes do not affect how rows are distributed among the AMPs. You can drop and recreate Secondary Indexes dynamically, as they are needed. Unlike Primary Indexes, Secondary Indexes are stored in separate subtables that require extra overhead in terms of disk space and maintenance, which is handled automatically by the system. So, Secondary Indexes do require some system resources.

What Do You Think?


In what instances would it be a good idea to define a Secondary Index for a table? (This information will be covered in this module, but here is a preview.)
A. The Primary Index exists for even data distribution and access, but a Secondary Index is defined to efficiently generate reports based on a different set of columns.
B. The Product table is accessed by the retailer (who accesses data based on the retailer's product code column), and by a vendor (who accesses the same data based on the vendor's product code column).
C. The table already has a Unique Primary Index, but a second column must also have unique values. The column is specified as a Unique Secondary Index (USI) to enforce uniqueness on the second column.
D. All of the above.

Feedback:
D is correct: each of these scenarios is a common reason to define a Secondary Index.

Secondary Index Rules


Several rules that govern how Secondary Indexes must be defined and how they function are:


Rule 1: Secondary Indexes are optional.
Rule 2: Secondary Index values can be unique or non-unique.
Rule 3: Secondary Index values can be NULL.
Rule 4: Secondary Index values can be modified.
Rule 5: Secondary Indexes can be changed.
Rule 6: A Secondary Index has a limit of 64 columns.

Rule 1: Optional SI
While a Primary Index is required, a Secondary Index is optional. If one path to the data is sufficient, no Secondary Index need be defined. You can define 0 to 32 Secondary Indexes on a table for multiple data access paths. Different groups of users may want to access the data in various ways. You can define a Secondary Index for each heavily used access path.

Rule 2: Unique or Non-Unique SI


Like Primary Indexes, Secondary Indexes can be unique or non-unique.

A Unique Secondary Index (USI) serves two possible purposes:
Enforces uniqueness on a column or group of columns. The database will check USIs to see if the values are unique. For example, if you have chosen different columns for the Primary Key and Primary Index, you can make the Primary Key a USI to enforce uniqueness on the Primary Key.
Speeds up access to a row (data retrieval speed). Accessing a row with a USI requires one or two AMPs, which is less direct than a UPI (one AMP) access, but more efficient than a full-table scan.

A Non-Unique Secondary Index (NUSI) is usually specified to prevent full-table scans, in which every row of a table is read. The Optimizer determines whether a full-table scan or NUSI access will be more efficient, then picks the best method. Accessing a row with a NUSI requires all AMPs.

Rule 3: SI Can Be NULL


As with the Primary Index, the Secondary Index column may contain NULL values.

Rule 4: SI Value Can Be Modified


The values in the Secondary Index column may be modified as needed.

Rule 5: SI Can Be Changed


Secondary Indexes can be changed. Secondary Indexes can be created and dropped dynamically as needed. When the index is dropped, the system physically drops the subtable that contained it.
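As a sketch of this dynamic behavior (the table and column names are hypothetical), Secondary Indexes are created and dropped with ordinary DDL:

CREATE INDEX (last_name) ON employee;   -- add a NUSI while reports need it
DROP INDEX (last_name) ON employee;     -- drop it (and its subtable) afterward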

Rule 6: SI Has 64-Column Limit


You can designate a Secondary Index that is composed of 1 to 64 columns. To use the Secondary Index below, the user would specify both Budget and Manager Employee Number.


Other Secondary Indexes


Join Index

Join indexes have several uses:
Define a pre-join table on frequently joined columns (with optional data aggregation) without denormalizing the database.
Create a full or partial replication of a base table with a primary index on a foreign key column to facilitate joins of very large tables by hashing their rows to the same AMP as the large table.
Define a summary table without denormalizing the database.
You can define a join index on one or several tables. Single-table join index functionality is an extension of the original intent of join indexes, hence the confusing adjective "join" used to describe a single-table join index.

Sparse Index

Any join index, whether simple or aggregate, multi-table or single-table, can be sparse. A sparse join index uses a constant expression in the WHERE clause of its definition to narrowly filter its row population. This is known as a Sparse Join Index.

Hash Index

Hash indexes are used for the same purposes as single-table join indexes. Hash indexes create a full or partial replication of a base table with a primary index on a foreign key column to facilitate joins of very large tables by hashing them to the same AMP. You can only define a hash index on a single table. Hash indexes are not indexes in the usual sense of the word. They are base tables that cannot be accessed directly by a query.

Value-Ordered NUSI

Value-ordered NUSIs are very efficient for range constraints and conditions with an inequality on the secondary index column set. Because the NUSI rows are sorted by data value, it is possible to search only a portion of the index subtable for a given range of key values. Thus, the major advantage of a value-ordered NUSI is in the performance of range queries.

Value-ordered NUSIs have the following limitations:
The sort key is limited to a single numeric column.
The sort key column cannot exceed four bytes.
They count as two indexes against the total of 32 non-primary indexes you can define on a base or join index table.

Join Indexes


A Join Index is an optional index which may be created by a user. Join indexes provide additional processing efficiencies:
Eliminate base table access
Eliminate aggregate processing
Reduce joins
Eliminate redistributions

The three basic types of join indexes commonly used with Teradata are described first:

Single-Table Join Index
Distributes the rows of a single table on the hash value of a foreign key value.
Facilitates the ability to join the foreign key table with the primary key table without redistributing the data.
Useful for resolving joins on large tables without having to redistribute the joined rows across the AMPs.

Multi-Table Join Index
Pre-joins multiple tables; stores and maintains the result from joining two or more tables.
Facilitates join operations by possibly eliminating join processing or by reducing/eliminating join data redistribution.

Aggregate Join Index
Aggregates one or more columns of a single table or multiple tables into a summary table.
Facilitates aggregation queries by eliminating aggregation processing. The pre-aggregated values are contained in the AJI instead of relying on base table calculations.

A join index is a system-maintained index table that stores and maintains the joined rows of two or more tables (multiple-table join index) and, optionally, aggregates selected columns, referred to as an aggregate join index. Join indexes are defined in a way that allows join queries to be resolved without accessing or joining their underlying base tables. A join index is useful for queries where the index structure contains all the columns referenced by one or more joins, thereby allowing the index to cover all or part of the query. For obvious reasons, such an index is often referred to as a covering index. Join indexes are also useful for queries that aggregate columns from tables with large cardinalities. These indexes play the role of pre-join and summary tables without denormalizing the logical design of the database and without incurring the update anomalies presented by denormalized tables.
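A hedged sketch of a single-table join index, reusing the hypothetical orders table from earlier and assuming it also carries a customer_id foreign key column; the index redistributes the rows on customer_id so joins to a customer table become AMP-local:

CREATE JOIN INDEX orders_by_cust AS
  SELECT customer_id
        ,order_id
        ,total_amount
  FROM orders
PRIMARY INDEX (customer_id);
-- The Optimizer may use orders_by_cust instead of the base table when a
-- query joins on customer_id, avoiding redistribution of the orders rows.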

Using Secondary Indexes


In the table below, users will be accessing data based on the Department Name column. The values in that column are unique, so it has been made a USI for efficient access. In addition, the company wants reports on how many departments each manager is responsible for, so the Manager Employee Number can also be made a secondary index. It has duplicate values, so it is a NUSI.
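In DDL terms, this scenario might look like the following sketch (the department table and column names are hypothetical):

CREATE UNIQUE INDEX (department_name) ON department;    -- USI: values are unique
CREATE INDEX (manager_employee_number) ON department;   -- NUSI: duplicate values allowed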

How Secondary Indexes Are Stored


Secondary indexes are stored in index subtables. The subtables for USIs and NUSIs are distributed differently:
USI: Unique Secondary Indexes are hash distributed separately from the data rows, based on their USI value. (As you remember, the base table rows are distributed based on the Primary Index value.) The subtable row may be stored on the same AMP or a different AMP than the base table row, depending on the hash value.
NUSI: Non-Unique Secondary Indexes are stored in subtables on the same AMPs as their data rows. This reduces activity on the BYNET and essentially makes NUSI queries an AMP-local operation - the processing for the subtable and base table are done on the same AMP. However, in all NUSI access requests, all AMPs are activated, because the non-unique value may be found on multiple AMPs.

Data Access Without a Primary Index

You can submit a request without specifying a Primary Index and still access the data. The following access methods do not use a Primary Index:
Unique Secondary Index (USI)
Non-Unique Secondary Index (NUSI)
Full-Table Scan


Accessing Data with a USI


When a user submits an SQL request using the table name and a Unique Secondary Index, the request becomes a one- or two-AMP operation, as explained below.

USI Access
1. The SQL is submitted, specifying a USI (in this case, a customer number of 56).
2. The hashing algorithm calculates a row hash value (in this case, 602).
3. The hash map points to the AMP containing the subtable row corresponding to the row hash value (in this case, AMP 2).
4. The subtable indicates where the base row resides (in this case, row 778 on AMP 4).
5. The message goes back over the BYNET to the AMP with the row, and the AMP accesses the data row (in this case, AMP 4).
6. The row is sent over the BYNET to the PE, and the PE sends the answer set on to the client application.

As shown in the example above, accessing data with a USI is typically a two-AMP operation. However, it is possible that the subtable row and base table row could end up being stored on the same AMP, because both are hashed separately. If both were on the same AMP, the USI request would be a one-AMP operation.

Accessing Data with a NUSI


When a user submits an SQL request using the table name and a Non-Unique Secondary Index, the request becomes an all-AMP operation, as explained below.


NUSI Access
1. The SQL is submitted, specifying a NUSI (in this case, a last name of "Adams").
2. The hashing algorithm calculates a row hash value for the NUSI (in this case, 567).
3. All AMPs are activated to find the hash value of the NUSI in their index subtables. The AMPs whose subtables contain that value become the participating AMPs in this request (in this case, AMP 1 and AMP 2). The other AMPs discard the message.
4. Each participating AMP locates the row IDs (row hash value plus uniqueness value) of the base rows corresponding to the hash value (in this case, the base rows corresponding to hash value 567 are 640, 222, and 115).
5. The participating AMPs access the base table rows, which are located on the same AMP as the NUSI subtable (in this case, one row from AMP 1 and two rows from AMP 2).
6. The qualifying rows are sent over the BYNET to the PE, and the PE sends the answer set on to the client application (in this case, three qualifying rows are returned).

Full-Table Scan - Accessing Data Without Indexes


In the Teradata Database, you can access data on any column, whether that column is an index or not. You can ask any question, of any data, at any time. If the request does not use a defined index, the Teradata Database does a full-table scan. A full-table scan is another way to access data without using Primary or Secondary Indexes.

In evaluating an SQL request, the Optimizer examines all possible access methods and chooses the one it believes to be the most efficient. While Secondary Indexes generally provide a more direct access path, in some cases the Optimizer will choose a full-table scan because it is more efficient.

A request could turn into a full-table scan when (see the sketch at the end of this section):
An SQL request searches on a NUSI column with many duplicates. For example, if a request using last names in a Customer database searched on the very prevalent "Smith" in the United States, then the Optimizer may choose a full-table scan to efficiently find all the matching rows in the result set.
An SQL request uses a non-equality WHERE clause on an index column. For example, if a request searched an Employee database for all employees whose annual salary is greater than $100,000, then a full-table scan would be used, even if the Salary column is an index. In this example, a full-table scan can be avoided by using an equality WHERE clause on a defined index column.
An SQL request uses a range WHERE clause on an index column. For example, if a request searched an Employee database for all employees hired between January 2001 and June 2001, then a full-table scan would be used, even if the Hire_Date column is an index.

For all requests, you must specify a value for each column in the index or the Teradata Database will do a full-table scan.

A full-table scan is an all-AMP operation. Every data block must be read, and each data row is accessed only once. As long as the choice of Primary Index has caused the table rows to distribute evenly across all of the AMPs, the parallel processing of the AMPs working simultaneously can accomplish the full-table scan quickly. However, if a Primary Index causes skewed data distribution, all AMP operations will take longer. While full-table scans are impractical and even disallowed on some commercial database systems, the Teradata Database routinely permits ad-hoc queries with full-table scans.

When choosing between a NUSI and a full-table scan, if the Optimizer determines that there is no selective SI, hash, or join index and that most of the rows in the table would qualify for the answer set if a NUSI were used, it would most likely choose the full-table scan as the most efficient access method. If statistics are stale or have not been collected on the NUSI column(s), the Optimizer may choose to do a full-table scan, as it does not have updated data demographics.
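A short sketch of the predicate shapes listed above, assuming a hypothetical employee table where salary and hire_date are index columns and employee_id is the Primary Index:

SELECT * FROM employee WHERE salary > 100000;       -- non-equality: full-table scan
SELECT * FROM employee
WHERE hire_date BETWEEN DATE '2001-01-01'
                    AND DATE '2001-06-30';          -- range: full-table scan
SELECT * FROM employee WHERE employee_id = 1025;    -- equality on the index: direct access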

Summary of Keys and Indexes

Some fundamental differences between Keys and Indexes are shown below:

Keys:
A relational modeling convention used in a logical data model.
Uniquely identify a row (Primary Key).
Establish relationships between tables (Foreign Key).

Indexes:
A Teradata Database mechanism used in a physical database design.
Used for row distribution (Primary Index).
Used for row access (Primary Index and Secondary Index).


While most commercial database systems use the Primary Key as a way to retrieve data, a Teradata Database system does not. In the Teradata Database, you use the Primary Key only when designing a database, as a mechanism for maintaining referential integrity according to relational theory. The Teradata Database itself does not require keys in order to manage the data, and can function fully with no awareness of Primary Keys. The Teradata Database's parallel architecture uses Primary Indexes to distribute and access the data rows. A Primary Index is always required when creating a Teradata Database table. A Primary Index may include the same columns as the Primary Key, but does not have to. In some cases, you may want the Primary Key and Primary Index to be different. For example, a credit card account number may be a good Primary Key, but customers may prefer to use a different kind of identification to access their accounts.

Rules for Keys and Indexes


A summary of the rules for keys (in the relational model) and indexes (in the Teradata Database) is shown below.
Rule | Primary Key              | Foreign Key                              | Primary Index                                     | Secondary Index
1    | One PK                   | Multiple FKs                             | One PI                                            | 0 to 32 SIs
2    | Unique values            | Unique or non-unique                     | Unique or non-unique                              | Unique or non-unique
3    | No NULLs                 | NULLs allowed                            | NULLs allowed                                     | NULLs allowed
4    | Values should not change | Values may be changed                    | Values may be changed (redistributes row)         | Values may be changed
5    | Column should not change | Column should not change                 | Column cannot be changed (drop and recreate table)| Index may be changed (drop and recreate index)
6    | No column limit          | No column limit                          | 64-column limit                                   | 64-column limit
7    | n/a                      | FK must exist as PK in the related table | n/a                                               | n/a

Defining Primary and Foreign Keys in the Teradata Database


Although Primary Indexes are required and Primary Keys are not, you do have the option to define a Primary Key or Foreign Key for any table. When you define a Primary Key in a Teradata Database table, the RDBMS will implement the specified column(s) as an index. Because a Primary Key requires unique values, a defined Primary Key is implemented as one of the following:
Unique Primary Index (if the DBA did not specify the Primary Index in the CREATE TABLE statement)
Unique Secondary Index (if the PK was not chosen to be the PI)

When a Primary Key is defined in Teradata SQL and implemented as an index, the rules that govern that type of index now apply to the Primary Key. For example, in relational theory, there is no limit to the number of columns in a Primary Key. However, if you specify a Primary Key in Teradata SQL, the 64-column limit for indexes now applies to that Primary Key.
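A sketch of the first case (the table is hypothetical): a PRIMARY KEY defined with no explicit PRIMARY INDEX clause, which the system implements as the Unique Primary Index:

CREATE TABLE department
  (dept_no INTEGER NOT NULL
  ,dept_name CHARACTER(30)
  ,CONSTRAINT dept_pk PRIMARY KEY (dept_no));
-- No PRIMARY INDEX clause is given, so dept_no (the PK) becomes the UPI.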

What Do You Think?


Which statement is true? (Choose the best answer.)
A. A Primary Index is used to distribute data, while a Primary Key is used to uniquely identify a row.
B. A Primary Key is used to access data, while a Primary Index is used to uniquely identify a row.
C. In a Teradata Database system, "Primary Key" means the same thing as "Primary Index."
D. A Primary Index is used to distribute data, while a Primary Key is converted to a hash map.

Feedback:
A is correct: a Primary Index distributes data, while a Primary Key uniquely identifies a row.

Exercise 6.1
Which one provides uniform data distribution through the hashing algorithm?
A. UPI
B. NUPI
C. Both UPI and NUPI
D. Neither UPI nor NUPI

Feedback:
To review this topic, click Rule 2: Unique or Non-Unique PI or Distributing Rows to AMPs.

Exercise 6.2
The output from the hashing algorithm is the:
A. hash map
B. uniqueness value
C. row ID
D. row hash

Feedback:
To review this topic, click Distributing Rows to AMPs.

Exercise 6.3
Complete each sentence:
Accessing a row with a Unique Secondary Index (USI) typically requires ___ AMP(s).
Accessing a row with a Non-Unique Secondary Index (NUSI) requires ___ AMP(s).
A full-table scan accesses ___ row(s).
Accessing a row with a Unique Primary Index (UPI) accesses ___ row(s) on one AMP.
Accessing a row with a Non-Unique Primary Index (NUPI) accesses multiple rows on ___ AMP(s).

Feedback:

To review these topics, click Accessing a Row With a Primary Index, Accessing Data with a USI, Accessing Data with a NUSI, or Full-Table Scan - Accessing Data Without Indexes.

Exercise 6.4
Which column should be selected as the Primary Index in the CUSTOMER table below? The table contains information on 50,000 customers of this regional telecommunication services company. Whenever a customer calls, the call center operator should be able to easily access and confirm customer information. In addition, the company wants to track all service activities on a per-household basis. Select the best Primary Index for the business use.


A. Column 4, because each address is clearly a household, which is what is being tracked.
B. Column 5, because it is nearly unique, easy to remember and input, and can be used for householding.
C. Column 2, because most of the customers with the same last name belong to a single household.
D. Columns 2 and 3 together, because the combination is nearly unique, and it is easy for the customer to remember.
E. Column 1, because it is the Primary Key and its unique values will cause table rows to be distributed evenly for best performance. Customers must give their Customer ID when calling for service.

Feedback:
To review this topic, click Choosing a UPI or NUPI.

Exercise 6.5
The row ID helps the system to locate a row in case of a(n):
A. even distribution of rows
B. Unique Primary Index
C. multi-AMP request
D. hash synonym

Feedback:
To review this topic, click Distributing Rows to AMPs or Accessing a Row With a Primary Index.

Exercise 6.6
Which task does a Teradata Database Administrator have to perform? (Choose one.)
A. Select Primary Indexes
B. Re-organize data
C. Pre-prepare data for loading
D. Pre-allocate table space

Feedback:
To review this topic, click Teradata Database Manageability.

Exercise 6.7
With a ______ you create multiple access paths to the rows in the base table that the Optimizer can choose from, which improves response to business questions by improving the performance of queries that take advantage of partition elimination.
A. Multi-Level Partitioned Primary Index (MLPPI)
B. NUPI
C. Partitioned Primary Index (PPI)
D. NoPI

Feedback:
To review this topic, click Multi-Level PPI, What is a NoPI Table?, Choosing a Unique or Non-Unique Primary Index, or Partitioned Primary Index.

Exercise 6.8
True or False: A NoPI Table is simply a table without a primary index. As rows are inserted into a NoPI table, rows are always appended at the end of the table and never inserted in the middle of a hash sequence. Organizing/sorting rows based on row hash is therefore avoided.
A. True
B. False

Feedback:
To review this topic, click What is a NoPI Table?.

Exercise 6.9
True or False: If statistics are stale or have not been collected on the NUSI column(s), the Optimizer may choose to do a full-table scan, as it does not have updated data demographics.
A. True
B. False

Feedback:
To review this topic, click Choosing a Unique or Non-Unique Primary Index or Data Access Without a Primary Index.

Teradata Certification


Now that you have learned about the Teradata Database basics, consider the first level of Teradata Certification, Teradata Certified Professional. Information on the Teradata Certified Professional Program (TCPP) including exam objectives, practice questions, test center locations, and registration information is located on the Teradata Certified Professional Program (TCPP) website. Candidates for the Teradata Certified Professional Certification must pass the Teradata 12 Basics Certification exam administered at Prometric testing centers listed on the TCPP website. We recommend you review the WBT content and the practice questions located on the TCPP website before signing up for the official Teradata 12 Basics Certification exam.
