Objectives
The Teradata Database is a large database server that accommodates multiple client
applications making inquiries against it concurrently. Various client platforms access the
database through a TCP-IP connection or across an IBM mainframe channel
connection. The Teradata Database is accessed using SQL (Structured Query
Language), the industry standard access language for communicating with an RDBMS.
The ability to manage large amounts of data is accomplished using the concept of
parallelism, wherein many individual processors perform smaller tasks concurrently to
accomplish an operation against a huge repository of data. To date, only parallel
architectures can handle databases of this size.
http://www.teradatau.courses.teradata.com/learning/BLADE_MS/legacy/18109_IntrotoTer... 11/3/2010
Page 2 of 137
Based on what you know so far, what do you think are some Teradata Database
features that make it so successful in today's business environment? (Details on the
following are coming up next.)
A. Scalability.
B. Single data store.
C. High degree of parallelism.
D. Ability to model the business.
E. All of the above.
Feedback:
In this Web-Based Training, you will learn about many features that make the Teradata
Database, an RDBMS, right for business-critical applications. To start with, this section
covers these key features:
The Teradata Database acts as a single data store, with multiple client applications
making inquiries against it concurrently.
Instead of replicating a database for different purposes, with the Teradata Database you
store the data once and use it for many applications. The Teradata Database provides
the same connectivity for an entry-level system as it does for a massive enterprise data
warehouse.
Scalability
"Linear scalability" means that as you add components to the system, the performance
increase is linear. Adding components allows the system to accommodate increased
workload without decreased throughput. Linear scalability enables the system to grow to
support more users, more data, more queries, and greater query complexity without
experiencing performance degradation. As the configuration grows, the performance
increase remains linear, with a slope of 1. The Teradata Database was the first
commercial database system to scale
to and support a trillion bytes of data.
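The "slope of 1" idea can be sketched numerically: in an ideally scalable system, doubling the components doubles the work the system can do per unit time. This is a minimal illustration only; the node counts and per-node rates below are invented, not Teradata figures.

```python
# Hypothetical illustration of linear scalability (slope of 1):
# total throughput grows in direct proportion to the number of nodes,
# so throughput per node stays constant as the system grows.

def linear_throughput(nodes, per_node_rate):
    """Ideal linearly scalable system: total work completed per second."""
    return nodes * per_node_rate

base = linear_throughput(4, 250)    # 4-node system
grown = linear_throughput(8, 250)   # system doubled in size

# Slope of 1: doubling the nodes doubles the throughput.
assert grown / base == 2.0
print(base, grown)
```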
The Teradata Database can scale from 100 gigabytes to over 100 terabytes of data on a
single system without losing any performance capability. The Teradata Database's
scalability provides investment protection for customers' growth and application
development. The Teradata Database is the only database that is predictably scalable
in multiple dimensions, and this extends to data loading with the use of parallel loading
utilities. The Teradata Database provides automatic data distribution and no
reorganizations of data are needed. The Teradata Database is scalable in multiple ways,
including hardware, query complexity, and number of concurrent users.
Hardware
With the Teradata Database, you can increase the size of your system without replacing:
Databases - When you expand your system, the data is automatically redistributed
through the reconfiguration process, without manual interventions such as sorting,
unloading and reloading, or partitioning.
Platforms - The Teradata Database's modular structure allows you to add
components to your existing system.
Data model - The physical and logical data models remain the same regardless of
data volume.
Applications - Applications you develop for Teradata Database configurations will
continue to work as the system grows, protecting your investment in application
development.
Complexity
The Teradata Database is adept at complex data models that satisfy the information
needs throughout an enterprise. The Teradata Database efficiently processes
increasingly sophisticated business questions as users realize the value of the answers
they are getting. It has the ability to perform large aggregations during query run time
and can perform up to 64 joins in a single query.
Concurrent Users
Unconditional Parallelism
Teradata supports ad-hoc queries using ANSI-standard SQL, and includes SQL-ready
database management information (log files). This allows Teradata to interface with 3rd
party Business Intelligence (BI) tools and submit queries from other database systems.
A data warehouse built on a business model contains information from across the
enterprise. Individual departments can use their own assumptions and views of the data
for analysis, yet these varying perspectives have a common basis for a "single view of
the business."
With the Teradata Database's centrally located, logical architecture, companies can get
a cohesive view of their operations across functional areas to:
You get consistent answers from the different viewpoints above using a single business
model, not functional models for different departments. In a functional model, data is
organized according to what is done with it. But what happens if users later want to do
some analysis that has never been done before? When a system is optimized for one
department's function, the other departments' needs (and future needs) may not be met.
A Teradata Database allows the data to represent a business model, with data
organized according to what it represents, not how it is accessed, so it is easy to
understand. The data model should be designed without regard to usage and be the
same regardless of data volume. With a Teradata Database as the enterprise data
warehouse, users can ask new questions of the data that were never anticipated,
throughout the business cycle and even through changes in the business environment.
A key Teradata Database strength is its ability to model the customer's business. The
Teradata Database supports business models that are truly normalized, avoiding the
costly star schema and snowflake implementations that many other database vendors
use. The Teradata Database can support star schema and other types of relational
modeling, but Third Normal Form is the method for relational modeling that we
recommend to customers. Our competitors typically implement star schema or snowflake
models either because they are implementing a set of known queries in a transaction
processing environment, or because their architecture limits them to that type of model.
Normalization is the process of reducing a complex data structure into a simple, stable
one. Generally this process involves removing redundant attributes, keys, and
relationships from the conceptual data model. The Teradata Database supports
normalized logical models because it is able to perform 64 table joins and large
aggregations during queries.
The Teradata Database Optimizer is the most robust in the industry, able to handle:
The Teradata Database is a relational database. Relational databases are based on the
relational model, which is founded on mathematical Set Theory. The relational model
uses and extends many principles of Set Theory to provide a disciplined approach to
data management. Users and applications access data in an RDBMS using industry-
standard SQL statements. SQL is a set-oriented language for relational database
management.
Rows
Each row contains all the columns in the table. A row is one instance of all columns and
is a single entity in the table; each table can contain only one row format. The order of
rows is arbitrary and does not imply priority, hierarchy, or significance.
Columns
Each column contains "like data," such as only part names, or only supplier names, or
only employee numbers. In the example below, the Last_Name column contains last
names only, and nothing else. The data in the columns is atomic data, so a telephone
number might be divided into three columns: the area code, the prefix, and the suffix, so
the customer data can be analyzed according to area code, etc. Missing data values
would be represented by "nulls" (the absence of a value). Within a table, the column
position is arbitrary.
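The telephone-number example above can be sketched in code: storing the number as three atomic columns makes analysis by area code trivial. The column names and sample rows below are invented for illustration.

```python
# Hypothetical customer rows with a phone number stored as three atomic
# columns (area code, prefix, suffix) instead of one combined string,
# so the data can be analyzed by area code.
customers = [
    {"area_code": "858", "prefix": "485", "suffix": "4400"},
    {"area_code": "858", "prefix": "555", "suffix": "1212"},
    {"area_code": "937", "prefix": "445", "suffix": "1234"},
]

# Count customers per area code -- easy because the column is atomic.
counts = {}
for row in customers:
    counts[row["area_code"]] = counts.get(row["area_code"], 0) + 1

print(counts)  # {'858': 2, '937': 1}
```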
A relational database is a set of logically related tables. Tables are logically related to
each other by a common field, so information such as customer telephone numbers and
addresses can exist in one table, yet be accessible for multiple purposes.
Relational databases do not use access paths to locate data; data connections are
made by data values. Data connections are made by matching values in one column
with the values in a corresponding column in another table. In relational terminology, this
connection is referred to as a join.
The diagrams below show how the values in one table may be matched to values in
another table. The tables below show customer, order, and billing statement data,
related by a common field, Customer ID. The common field of Customer ID lets you look
up information such as a customer name for a particular statement number, even though
the data exists in two different tables. This is done by performing a join between the
tables using the common field, Customer ID. Here are a few other examples of questions
that can be answered:
To sum up, a relational database is a collection of tables. The data contained in the
tables can be associated using columns with matching data values.
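The join described above can be sketched with any SQL engine; here SQLite stands in for an RDBMS, and the table names, columns, and rows are invented for illustration, not taken from the course's example tables.

```python
import sqlite3

# Two tables logically related by the common field customer_id;
# the schema and rows are hypothetical stand-ins for the
# customer/statement example in the text.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customer (customer_id INT, name TEXT)")
con.execute("CREATE TABLE statement (statement_id INT, customer_id INT)")
con.execute("INSERT INTO customer VALUES (1, 'Ray Foster'), (2, 'Ann Price')")
con.execute("INSERT INTO statement VALUES (1001, 2), (1002, 1)")

# Join on the common field to look up a customer name for a statement,
# even though the data lives in two different tables.
row = con.execute("""
    SELECT c.name
    FROM statement s
    JOIN customer c ON c.customer_id = s.customer_id
    WHERE s.statement_id = 1001
""").fetchone()
print(row[0])  # Ann Price
```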
Logical/Relational Modeling
The logical model should be independent of usage. A variety of front-end tools can be
accommodated so that the database can be created quickly.
The design of the data model is the same regardless of data volume.
An enterprise model is one that provides the ability to look across functional processes.
Normalization is the process of reducing a complex data structure into a simple, stable
one. Generally this process involves removing redundant attributes, keys, and
relationships from the conceptual data model. Normalization theory is constructed
around the concept of normal forms that define a system of constraints. If a relation
meets the constraints of a particular normal form, we say that relation is “in normal form."
The intent of normalizing a relational database is to put one fact in one place. By
decomposing your relations into normalized forms, you can eliminate the majority of
update anomalies that can occur when data is stored in de-normalized tables.
A slightly more detailed statement of this principle would be the definition of a relation (or
table) in a normalized relational database: A relation consists of a primary key, which
uniquely identifies any tuple, and zero or more additional attributes, each of which
represents a single-valued (atomic) property of the entity type identified by the primary
key. A tuple is an ordered set of values, with the values often separated by commas.
Entities
Attributes
Relationships
First normal form rules state that each and every attribute within an entity instance has
one and only one value. No repeating groups are allowed within entities.
Second normal form requires that the entity must conform to the first normal form rules.
Every non-key attribute within an entity is fully dependent upon the entire key (key
attributes) of the entity, not a subset of the key.
Third normal form requires that the entity must conform to the first and second normal
form rules. In addition, no non-key attribute within an entity is functionally dependent
upon another non-key attribute within the same entity.
While the Teradata Database can support any data model that can be processed via
SQL, an advantage of a normalized data model is the ability to support previously
unknown (ad-hoc) questions.
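The move to Third Normal Form can be sketched with plain data structures. Below, a department's name depends on the non-key attribute department number, so it is moved into its own table, putting that one fact in one place. The table and column names and rows are invented for illustration.

```python
# A sketch of normalization to Third Normal Form. In the de-normalized
# rows, the department name is repeated for every employee in the
# department -- one fact stored in many places, inviting update anomalies.
denormalized = [
    # (employee_number, last_name, department_number, department_name)
    (1006, "Stein", 301, "Research"),
    (1008, "Kanieski", 301, "Research"),
    (1025, "Short", 403, "Education"),
]

# Normalized: the department name appears exactly once, in its own table,
# and the employee table carries only the department number (a key).
employee = [(emp, last, dept) for emp, last, dept, _ in denormalized]
department = sorted({(dept, name) for _, _, dept, name in denormalized})

print(employee)
print(department)  # [(301, 'Research'), (403, 'Education')]
```

Renaming a department now means updating one row in one table, instead of every employee row that mentions it.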
Star Schema
The star schema (sometimes referenced as star join schema) is the simplest style of
data warehouse schema. The star schema consists of a few fact tables (possibly only
one, justifying the name) referencing any number of dimension tables. The star schema
is considered an important special case of the snowflake schema.
Primary Key
In the relational model, a Primary Key (PK) is used to designate a unique identifier for
each row when you design a table. A Primary Key can be composed of one or more
columns. In the example below, the Primary Key is the employee number.
Rules governing how Primary Keys must be defined and how they function are:
In the logical model, each table requires a Primary Key because that is how each row is
able to be uniquely identified. Each table must have one, and only one, Primary Key. In
any given row, the value of the Primary Key uniquely identifies the row. The Primary
Key may span more than one column, but even then, there is only one Primary Key.
Rule 2: Unique PK
Within the column(s) designated as the Primary Key, the values in each row must be
unique. No duplicate values are allowed. The Primary Key's purpose is to uniquely
identify a row. In a multi-column Primary Key, the combined value of the columns must
be unique, even if an individual column in the Primary Key has duplicate values.
Within the Primary Key column, each row must have a Primary Key value and cannot be
NULL (without a value). Because NULL is indeterminate, it cannot "identify" anything.
Primary Key values should not be changed. If you changed a Primary Key, you would
lose all historical tracking of that row.
Additionally, the column(s) designated as the Primary Key should not be changed. If you
changed a Primary Key, you would lose all the information relating that table to other
tables.
In the relational model, there is no limit to the number of columns that can be designated
as the Primary Key, so it may consist of one or more columns. In the example below, the
Primary Key consists of three columns: EMPLOYEE NUMBER, LAST NAME, and FIRST
NAME.
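The uniqueness and no-NULL rules above can be sketched with SQLite standing in for an RDBMS (the table name and rows are invented; note that Teradata's own DDL and constraint behavior differ in detail). The column is declared `INT` rather than SQLite's special `INTEGER PRIMARY KEY` so both rules are actually enforced.

```python
import sqlite3

# Sketch of the Primary Key rules: values must be unique and cannot
# be NULL. Schema and data are hypothetical.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE employee (
        employee_number INT NOT NULL PRIMARY KEY,
        last_name TEXT
    )
""")
con.execute("INSERT INTO employee VALUES (1006, 'Stein')")

# Rule: no duplicate Primary Key values.
try:
    con.execute("INSERT INTO employee VALUES (1006, 'Jones')")
except sqlite3.IntegrityError:
    print("duplicate PK rejected")

# Rule: a Primary Key value cannot be NULL.
try:
    con.execute("INSERT INTO employee VALUES (NULL, 'Trader')")
except sqlite3.IntegrityError:
    print("NULL PK rejected")
```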
Foreign Key
A Foreign Key (FK) is an identifier that links related tables. A Foreign Key defines how
two tables are related to each other. Each Foreign Key references a matching Primary
Key in another table in the database. For example, in the table below, the Department
Number column that is a Foreign Key actually exists in another table as a Primary Key.
Having tables related to each other gives users the flexibility to look at the data in
different ways, without the database administrator having to manage and maintain many
tables of duplicate data for different applications.
Rules governing how Foreign Keys must be defined and how they operate are:
Foreign Keys are optional; not all tables have them. Tables that do have them can have
multiple Foreign Keys because a table can relate to many other tables. In fact, a table
can have an unlimited number of foreign keys. In the example table below:
The Department Number Foreign Key relates to the Department Number Primary Key in the Department Table.
Having tables related to each other makes a relational database flexible so that different
users can look up information they need, while simplifying the database administration
so the data doesn't have to be duplicated for each purpose or application.
Duplicate Foreign Key values are allowed. More than one employee could be assigned
to the same department.
NULL (missing) Foreign Key values are allowed. For example, under special
circumstances, an employee might not be assigned to a department.
Foreign Key values may be changed. For example, if Arnando Villegas moves from
Department 403 to Department 587, the Foreign Key value in his row would change.
The Foreign Key may consist of one or more columns. A multi-column foreign key is
used to relate to a multi-column Primary Key in a related table. In the relational model,
there is no limit to the number of columns that can be designated as a Foreign Key.
Each Foreign Key must exist as a Primary Key in a related table. A department number
that does not exist in the Department Table would be invalid as a Foreign Key value in
the Employee Table.
This rule can apply even if the Foreign Key is NULL, or missing. Remember, a missing
value is defined as a non-value; there is no value present. So the rule could be better
stated: if a value exists in the Foreign Key column, it must match a Primary Key value in
the related table.
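The Foreign Key rules above can be sketched the same way, with SQLite as a stand-in RDBMS (schema and rows invented; SQLite enforces Foreign Keys only when the pragma below is enabled, which is a SQLite detail, not a relational-model one).

```python
import sqlite3

# Sketch of the Foreign Key rules: duplicates allowed, NULLs allowed,
# but a non-NULL value must exist as a Primary Key in the related table.
con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.execute("CREATE TABLE department (department_number INT NOT NULL PRIMARY KEY)")
con.execute("""
    CREATE TABLE employee (
        employee_number INT NOT NULL PRIMARY KEY,
        department_number INT REFERENCES department(department_number)
    )
""")
con.execute("INSERT INTO department VALUES (403), (587)")

# Duplicate FK values are allowed: two employees in department 403.
con.execute("INSERT INTO employee VALUES (1, 403), (2, 403)")

# NULL FK values are allowed: an employee not assigned to a department.
con.execute("INSERT INTO employee VALUES (3, NULL)")

# A non-NULL FK value must match a PK in the related table.
try:
    con.execute("INSERT INTO employee VALUES (4, 999)")
except sqlite3.IntegrityError:
    print("invalid department rejected")
```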
To check your understanding of Primary Keys and Foreign Keys, complete this
sentence. According to the relational model, a single table can have either: (Choose
two.)
A. Multiple primary keys.
B. Multiple foreign keys.
C. No primary keys.
D. No foreign keys.
Feedback:
Exercise 1.1
Feedback:
Exercise 1.2
A. A database is a two-dimensional array of rows and columns.
B. A Primary Key must contain one, and only one, column.
C. Foreign Keys have no relationship to existing Primary Key selections.
D. Teradata is an ideal foundation for customer relationship management, e-commerce, and
active data warehousing applications.
Feedback:
To review these topics, click How is the Teradata Database Used?, What is a Relational
Database?, Primary Key, or Foreign Key.
Exercise 1.3
Feedback:
Exercise 1.4
Feedback:
Exercise 1.5
How many calendars were shipped on 4/15? (These same tables were used in the previous
exercise.)
A. 10
B. 2
C. 40
D. 30
Feedback:
Exercise 1.6
A. Ability to model the business, with data organized according to what it represents.
B. Provides a mature, parallel-aware Optimizer that chooses the least expensive plan for the
SQL request.
C. Provides linear scalability, so there is no performance degradation as you grow the system.
D. Gives each department in the enterprise a self-contained, functional data store for their own
assumptions and analysis.
E. Provides automatic and even data distribution for faster query processing via its
unconditional parallel architecture.
Feedback:
To review these topics, click Single Data Store, Scalability, Unconditional Parallelism, Ability to
Model the Business, and Mature, Parallel-Aware Optimizer.
Exercise 1.7
usage.
A. True
B. False
Feedback:
Objectives
Stage 1 Reporting: The initial stage typically focuses on reporting from a single view of
the business to drive decision-making across functional and/or product boundaries.
Questions are usually known in advance, such as a weekly sales report.
Stage 2 Analyzing: Focuses on why something happened, such as why sales went down
or discovering patterns in customer buying habits. Users perform ad-hoc analysis, slicing
and dicing the data at a detail level, and questions are not known in advance.
Stage 3 Predicting: Analysts utilize the system to leverage information to predict what
will happen next in the business to proactively manage the organization's strategy. This
stage requires data mining tools and building predictive models using historical detail. As
an example, users can model customer demographics for target marketing.
Examples:
Stage 5 Active Data Warehousing: The larger the role an ADW plays in the operational
aspects of decision support, the more incentive the business has to automate the decision
processes. You can automate decision-making when a customer interacts with a web site.
Interactive customer relationship management (CRM) on a web site or at an ATM is about
making decisions to optimize the customer relationship through individualized product
offers, pricing, content delivery and so on. As technology evolves, more and more
decisions become executed with event-driven triggers to initiate fully automated decision
processes.
Example: determine the best offer for a specific customer based on a real-time event,
such as a significant ATM deposit.
Active Enterprise Intelligence is the seamless integration of the ADW into the customer’s
existing business and technical architectures.
Active Enterprise Intelligence (AEI) is a business strategy for providing strategic and
operational intelligence to back office and front line users from a single enterprise data
warehouse.
Active - Is responsive, agile, and capable of driving better, faster decisions that
drive intelligent, and often immediate, actions.
Enterprise - Provides a single view of the business, across appropriate business
functions, and enables new operational users, processes, and applications.
Intelligence - Supports traditional strategic users and new operational users of the
Enterprise Data Warehouse. Most importantly, it enables the linkage and alignment
of operational systems, business processes and people with corporate goals so
companies may execute on their strategies.
The technology that enables that business value is the Teradata Active Data Warehouse
(ADW). The Teradata ADW is a combination of products, features, services, and
business partnerships that support the Active Enterprise Intelligence business strategy.
ADW is an extension of our existing Enterprise Data Warehouse (EDW).
Data warehouses are beginning to take on mission-critical roles supporting CRM, one-
to-one marketing, and minute-to-minute decision-making. Data warehousing
requirements have evolved to demand a decision capability that is not just oriented
toward corporate staff and upper management, but actionable on a day-to-day basis.
Decisions such as when to replenish Barbie dolls at a particular retail outlet may not be
strategic at the level of customer segmentation or long-term pricing strategies, but when
executed properly, they make a big difference to the bottom line. We refer to this
capability as "tactical" decision support.
Tactical decisions are the drivers for day-to-day management of the business.
Businesses today want more than just strategic insight from their data warehouse
implementations - they want better execution in running the business through more
effective use of information for the decisions that get made thousands of times per day.
The origin of the active data warehouse is the timely, integrated store of detail data
available for analytic business decision-making. It is only from that source that the
additional traits needed by the active data warehouse can evolve. These new "active"
traits are supplemental to data warehouse functionality. For example, the work mix in the
database still includes complex decision support queries, but expands to take on short,
tactical queries, background data feeds, and possibly event-driven updates all at the
same time. Data volumes and user concurrency levels may explode upward beyond
expectation. Restraints may need to be placed on the longer, analytical queries in order
to guarantee tactical work throughput. While accessing the detail data directly remains
an important opportunity for analytical work, tactical work may thrive on shortcuts and
summaries, such as operational data store (ODS) level information. And for both
strategic and tactical decisions to be useful to the business, today's data, this hour's
data, even this minute's data has to be available.
The Teradata Database is positioned exceptionally well for stepping up to the challenges
related to high availability, large multi-user workloads, and handling complex queries
that are required for an active data warehouse implementation. The Teradata Database
technology supports evolving business requirements by providing high performance
and scalability for:
Mixed workloads (both tactical and strategic queries) for mission critical
applications
Large amounts of detail data
Concurrent users
The Teradata Database provides 7x24 availability and reliability, as well as continuous
updating of information so data is always fresh and accurate.
Traditionally, data processing has been divided into two categories: on-line transaction
processing (OLTP) and decision support systems (DSS). For either, requests are
handled as transactions. A transaction is a logical unit of work, such as a request to
update an account.
DSS
OLTP
OLAP
Data Mining
Data Mining
Data Mining (predictive modeling) involves analyzing moderate to large amounts of
detailed historical data to detect behavioral patterns (for example, buying, attrition, or
fraud patterns), that are then used to predict future behavior. There are two phases to
data mining. Phase 1: An “analytic model” is built from historical data incorporating the
detected behavior patterns (takes minutes to hours). Phase 2: The model is then applied
against current detail data of customers (that is, customers are scored), to predict likely
outcomes (takes seconds or less). Scores can indicate a customer's likelihood of
purchasing a product, switching to a competitor, or being fraudulent.
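The two phases can be sketched with a deliberately toy "model." A real analytic model would be built with data mining tools; here the pattern detected from invented historical data is just a threshold, used only to show the build-then-score flow.

```python
# Toy stand-in for the two phases of data mining.
# Phase 1: build an "analytic model" from historical detail data.
history = [  # (support_calls_per_month, churned) -- invented data
    (0, False), (1, False), (2, False),
    (5, True), (6, True), (7, True),
]
churn_calls = [calls for calls, churned in history if churned]
threshold = min(churn_calls)  # the simplest possible detected "pattern"

# Phase 2: apply the model to current customers, i.e. score them.
def score(calls):
    """1 = likely to churn, 0 = not, per the toy model."""
    return 1 if calls >= threshold else 0

scores = {name: score(calls) for name, calls in [("A", 1), ("B", 6)]}
print(scores)  # {'A': 0, 'B': 1}
```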
Until recently, most business decisions were based on summary data. The problem is
that summarized data is not as useful as detail data and cannot answer some questions
with accuracy. With summarized data, peaks and valleys are leveled: a peak that falls at
the end of a reporting period is split across two periods and cut in half.
Here's another example. Think of your monthly bank statement that records checking
account activity. If it only told you the total amount of deposits and withdrawals, would
you be able to tell if a certain check had cleared? To answer that question you need a
list of every check received by your bank. You need detail data.
Consider your own business and how it uses data. Is that data detailed or summarized?
Which type of data processing supports answering this type of question, "How many
women's dresses did our store sell in December of last year?"
A. OLTP
B. Data Mining
C. OLAP
D. DSS
Feedback:
Both cursor and set processing define the set(s) of rows of data to process; but while a
cursor processes the rows sequentially, set processing operates on an entire set at
once. Both can be invoked with a single command.
Row-by-Row Processing
In row-by-row processing, when there are many rows to process, one row is fetched at
a time, all calculations are done on it, and then it is updated or inserted. Then the next
row is fetched and processed. Processing one row at a time makes for a slow program.
Set Processing
A lot of data processing is set processing, which is what relational databases do best.
Instead of processing row-by-row sequentially, you can process relational data set-by-
set, without a cursor. For example, to sum all payment rows with 100 or less balances, a
single SQL statement completely processes all rows that meet the condition as a set.
With sufficient rows to process, this can be 10 to 30 or more times faster than row-at-a-
time processing.
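The payment example above can be sketched with SQLite standing in for an RDBMS (table and balances invented). Both approaches give the same answer; the set version expresses the whole operation as one statement over the qualifying set.

```python
import sqlite3

# Sum all payment rows with balances of 100 or less, two ways.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE payment (id INT NOT NULL PRIMARY KEY, balance INT)")
con.execute("INSERT INTO payment VALUES (1, 50), (2, 100), (3, 900)")

# Row-by-row: fetch each row, test it, accumulate -- slow at scale.
total = 0
for _, balance in con.execute("SELECT id, balance FROM payment"):
    if balance <= 100:
        total += balance

# Set processing: a single SQL statement processes the whole set at once.
(set_total,) = con.execute(
    "SELECT SUM(balance) FROM payment WHERE balance <= 100"
).fetchone()

print(total, set_total)  # 150 150
```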
When determining how fast something is, there are two kinds of measures. You can
measure how long it takes to do something or you can measure how much gets done per
unit time. The former is referred to as response time, access time, transmission time, or
execution time depending on the context. The latter is referred to as throughput.
Response Time
This speed measure is specified by an elapsed time from the initiation of some activity
until its completion. The phrase response time is often used in operating systems
contexts.
Throughput
A throughput measure is an amount of something per unit time. For operating systems
throughput is often measured as tasks or transactions per unit time. For storage systems
or networks throughput is measured as bytes or bits per unit time. For processors, the
number of instructions executed per unit time is an important component of performance.
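The two measures can be made concrete with invented numbers. Note the averaging below is a rough sketch: dividing elapsed time by transaction count gives a true per-transaction time only if the work runs serially; real response times vary per request.

```python
# Sketch of the two speed measures, with hypothetical figures.
transactions = 1200      # work completed
elapsed_seconds = 60     # over one minute

# Throughput: amount of work per unit time.
throughput = transactions / elapsed_seconds        # 20.0 transactions/sec

# Response time (rough serial average): elapsed time per unit of work.
avg_response_time = elapsed_seconds / transactions  # 0.05 sec/transaction

print(throughput, avg_response_time)
```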
In order to improve both response time and throughput on a Teradata system, you can:
Many data warehouses get their data directly from operational systems so that the data
is timely and accurate. While data warehouses may begin somewhat small in scope and
purpose, they often grow quite large as their utility becomes more fully exploited by the
enterprise.
Data Marts
In addition, an enterprise data model is more extensible than an application data model.
While data marts have obvious value, they are not a true enterprise-wide solution
and can become very costly over time as more and more are added.
A major problem with proliferating data marts is that, depending on where you look
for answers, there is often inconsistency.
They may not provide the historical depth of a true data warehouse.
Because data marts are designed to handle specific types of queries from a specific
type of user, they are often not good at ad hoc, or "what if" queries like a data
warehouse is.
A Teradata Database system contains one or more nodes. A node is a term for a
processing unit under the control of a single operating system. The node is where the
processing occurs for the Teradata Database. There are two types of Teradata
Database systems:
SMP system: System Console (keyboard and monitor) attached directly to the SMP node
MPP system: Administration Workstation (AWS)
To access a Teradata Database system, a user typically logs on through one of multiple
client platforms (channel-attached mainframes or network-attached workstations). Client
access is discussed in the next module.
Node Components
A node is the basic building block of a Teradata Database system, and contains a large
number of hardware and software components. A conceptual diagram of a node and its
major components is shown below. Hardware components are shown on the left side of
the node and software components are shown on the right side.
The Teradata Database virtual processors, or vprocs (the PEs and AMPs), share the
components of the node (memory and CPU). The key aspect of the "shared-nothing"
architecture is that each AMP manages its own dedicated portion of the system's disk
space (called the vdisk), and this space is not shared with other AMPs. Each AMP uses
system resources independently of the other AMPs, so they can all work in parallel for
high overall system performance.
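The shared-nothing idea can be illustrated with a toy row-placement function. Teradata actually uses its own proprietary hashing algorithm and a hash map of buckets to assign rows to AMPs; the CRC32 hash and the eight-AMP configuration below are invented stand-ins for illustration only:

```python
from zlib import crc32

NUM_AMPS = 8  # hypothetical configuration; real systems have many more AMPs

def amp_for_row(primary_index_value: str, num_amps: int = NUM_AMPS) -> int:
    """Map a primary index value to an AMP.

    Illustrative only: CRC32 stands in for Teradata's hashing algorithm.
    """
    return crc32(primary_index_value.encode()) % num_amps

# Rows with the same primary index value always land on the same AMP,
# so each AMP can manage its own vdisk independently (shared nothing).
rows = ["cust_1001", "cust_1002", "cust_1001"]
placement = [amp_for_row(r) for r in rows]
```

Because placement is deterministic, every AMP knows which rows it owns and never needs to touch another AMP's vdisk.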
PDE is an application that runs on the Teradata Database software.
AMPs manage system disks on the node.
The host channel adapter card connects to "bus and tag" cables through a Teradata Gateway.
An Ethernet card is a hardware component used in the connection between a network-attached client and the node.
Feedback:
Teradata Virtual Storage, introduced with Teradata 13.00, changes the way in which
Teradata accesses storage. Its purpose is to manage a multi-temperature warehouse.
Teradata Virtual Storage pools all of the cylinders within a clique's disk space and
allocates cylinders from this storage pool to individual AMPs. You can add storage to
the clique storage pool rather than to every AMP, which allows storage devices to be
shared among AMPs. It allows you to store data that is accessed more frequently
("hot data") on faster devices and data that is accessed less frequently ("cold data") on
slower devices, and it can automatically migrate the data based on access frequency.
Teradata Virtual Storage is designed to allow the Teradata Database to make use of
new storage technologies, such as adding fast Solid State Disks (SSDs) to an existing
system with a different disk technology, speed, or capacity. Because it enables the
mixing of drive sizes, speeds, and technologies, and because storage is pooled and
shared by the AMPs, adding drives does not require adding AMPs.
Pooling clique storage and allocating cylinders from the storage pool to individual
AMPs
Tracking where data is stored on the physical media
Maintaining statistics on the frequency of data access and on the performance of
physical storage media
Migrating frequently used data (“hot data”) to fast disks and data used less
frequently (“cold data”) to slower disks.
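A simplified sketch of the hot/cold migration decision follows. The access-count threshold and the cylinder statistics are invented for illustration; Teradata Virtual Storage bases its decisions on the access-frequency statistics it maintains internally:

```python
def choose_tier(access_count, hot_threshold=100):
    """Pick a storage tier from access frequency.

    Hypothetical policy: the threshold here is invented for illustration.
    """
    return "fast (e.g. SSD)" if access_count >= hot_threshold else "slow (e.g. HDD)"

def migration_plan(cylinder_stats):
    """Return cylinders whose current tier no longer matches their temperature."""
    moves = []
    for cyl, (tier, accesses) in cylinder_stats.items():
        target = choose_tier(accesses)
        if target != tier:
            moves.append((cyl, tier, target))
    return moves

stats = {"cyl_1": ("slow (e.g. HDD)", 500),   # hot data sitting on a slow device
         "cyl_2": ("fast (e.g. SSD)", 3)}     # cold data sitting on a fast device
plan = migration_plan(stats)
```

In this sketch, both cylinders are flagged for migration: the hot one toward fast storage, the cold one toward slow storage.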
Teradata Virtual Storage can migrate data away from a physical storage
device in order to prepare for removal or replacement of the device. This
process is called “evacuation.” Complete data evacuation requires a system
restart, but Teradata Virtual Storage supports a “soft evacuation” feature that
allows much of the data to be moved while the system remains online. This
can minimize system down time when evacuations are necessary.
This diagram illustrates the conceptual differences with and without Teradata
Virtual Storage.
Without Teradata Virtual Storage, cylinders were addressed by drive # and cylinder #.
With Teradata Virtual Storage, AMPs don't know the physical location of a cylinder, and
it can change.
With Teradata Virtual Storage you can easily add storage to an existing
system.
When the PE dispatches the steps for the AMPs to perform, they are dispatched onto
the BYNET. The messages are routed to the appropriate AMP(s), where result sets and
status information are generated. This response information is also routed back to the
requesting PE via the BYNET. Depending on the nature of the dispatch request, the
communication between nodes may be to all nodes (broadcast message) or to one
specific node (point-to-point message) in the system.
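The broadcast versus point-to-point distinction can be sketched as a routing function. This is purely illustrative; the real BYNET implements message routing in hardware and driver software, and the node names below are invented:

```python
def dispatch(step, nodes, target=None):
    """Route a dispatched step over a BYNET-like network.

    target is None   -> broadcast message to all nodes
    target is a node -> point-to-point message to that node only
    """
    if target is None:
        return {node: step for node in nodes}   # broadcast
    if target not in nodes:
        raise ValueError(f"unknown node: {target}")
    return {target: step}                       # point-to-point

nodes = ["node_1", "node_2", "node_3"]
broadcast = dispatch("scan table T", nodes)            # e.g. a full-table scan step
p2p = dispatch("fetch row", nodes, target="node_2")    # e.g. a single-AMP fetch
```

A full-table scan touches every AMP and so is broadcast; a primary-index fetch touches one AMP and can go point-to-point.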
Scalable: As you add more nodes to the system, the overall network bandwidth
scales linearly. This linear scalability means you can increase system size without
performance penalty -- and sometimes even increase performance.
High performance: An MPP system typically has two BYNET networks (BYNET 0
and BYNET 1). Because both networks in a system are active, the system benefits
from having full use of the aggregate bandwidth of both the networks.
Fault tolerant: Each network has multiple connection paths. If the BYNET detects an unusable path, it automatically reconfigures the network so that messages avoid it.
The BYNET hardware and software handle the communication between the vprocs and
the nodes.
Hardware: The nodes of an MPP system are connected with the BYNET
hardware, consisting of BYNET boards and cables.
Software: The BYNET driver (software) is installed on every node. This BYNET
driver is an interface between the PDE software and the BYNET hardware.
SMP systems do not contain BYNET hardware. The PDE and BYNET software
emulate BYNET activity in a single-node environment.
For more information on communication between the vprocs and nodes, click here.
(Note: You do not need to know this information for the certification exam.)
1. When a message is delivered to a node using BYNET hardware and software, PDE
software on the node has the ability to route the message to which three? (Choose
three.)
A. A single vproc on a node
B. A group of vprocs on a node
C. All vprocs on a node
D. All vprocs on all nodes
Feedback:
Check Answer Show Answer
Cliques
A clique (pronounced, "kleek") is a group of nodes that share access to the same disk
arrays. Each multi-node system has at least one clique. The cabling determines which
nodes are in which cliques -- the nodes of a clique are connected to the disk array
controllers of the same disk arrays.
In the event of a node failure, cliques provide for data access through vproc migration.
When a node resets, the following happens to the AMPs:
1. When the node fails, the Teradata Database restarts across all remaining nodes in
the system.
2. The vprocs (AMPs) from the failed node migrate to the operational nodes in its
clique.
3. The PE vprocs will migrate as follows: LAN attached PEs will migrate to other
nodes in the clique. Channel attached PEs will not migrate. While that node
remains down, that channel connection is not available.
4. Disks managed by the AMP remain available and processing continues while the
failed node is being repaired.
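The AMP migration in step 2 can be sketched as follows. The round-robin redistribution and the node/AMP names are assumptions for illustration; actual vproc migration is handled internally by the Teradata restart logic, not by user code:

```python
def migrate_on_failure(clique, failed_node):
    """Redistribute a failed node's AMPs across surviving clique nodes.

    clique: dict mapping node name -> list of AMP ids.
    Sketch only: spreads orphaned AMPs round-robin over survivors.
    """
    orphaned = clique.pop(failed_node)
    survivors = sorted(clique)
    for i, amp in enumerate(orphaned):
        clique[survivors[i % len(survivors)]].append(amp)
    return clique

clique = {"node_1": [0, 1], "node_2": [2, 3], "node_3": [4, 5]}
after = migrate_on_failure(clique, "node_2")
```

Every AMP survives the failure, so the data each AMP manages stays available while the failed node is repaired.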
Cliques in a System
Vprocs are distributed across all nodes in the system. Multiple cliques in the system
should have the same number of nodes.
The diagram below shows three cliques. The nodes in each clique are cabled to the
same disk arrays. The overall system is connected by the BYNET. If one node goes
down in a clique, the vprocs will migrate to the other nodes in the clique, so data remains
available. However, system performance decreases due to the loss of a node; the larger
the clique, the smaller the performance degradation from losing one node.
A Hot Standby Node (HSN) is a member of a clique that is not initially configured to
execute any Teradata vprocs. If a node in the clique fails, the AMPs from the failed
node move to the hot standby node, and the performance degradation is 0%.
When the failed node is recovered/repaired and restarted, it becomes the new hot
standby node. A second restart of Teradata is not needed.
Software Components
For each node in the system, you need both of the following:
Operating System
The Parallel Database Extensions (PDE) software layer was added to the operating
system to support the parallel software environment. The PDE controls the virtual
processor (vproc) resources.
A Trusted Parallel Application (TPA) uses PDE to implement virtual processors (vprocs).
The Teradata Database is classified as a TPA. The four components of the Teradata
Database TPA are:
A Parsing Engine (PE) is a virtual processor (vproc) that manages the dialogue between
a client application and the Teradata Database, once a valid session has been
established. Each PE can support a maximum of 120 sessions. The PE handles an
incoming request in the following manner:
1. The Session Control component verifies the request for session authorization
(user names and passwords), and either allows or disallows the request.
2. The Parser interprets the SQL request, checks it for proper syntax, and verifies
that the referenced objects exist and that the user has the access rights to them.
3. The Optimizer is cost-based and develops the least expensive plan (in terms of
time) to return the requested response set. Processing alternatives are evaluated
and the fastest alternative is chosen. This alternative is converted into executable
steps, to be performed by the AMPs, which are then passed to the Dispatcher.
The Optimizer is "parallel aware," meaning that it has knowledge of the system
components (how many nodes, vprocs, etc.), which enables it to determine the
fastest way to process the query. In order to maximize throughput and minimize
resource contention, the Optimizer must know about system configuration,
available units of parallelism (AMPs and PEs), and data demographics. The
Teradata Database Optimizer is robust and intelligent, and enables the Teradata
Database to handle multiple complex, ad-hoc queries efficiently.
4. The Dispatcher controls the sequence in which the steps are executed and
passes the steps received from the optimizer onto the BYNET for execution by the
AMPs.
5. After the AMPs process the steps, the PE receives their responses over the
BYNET.
6. The Dispatcher builds a response message and sends the message back to the
user.
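The Optimizer's cost-based choice in step 3 can be sketched as picking the cheapest of several candidate plans. The plan names and cost figures below are invented; a real cost model also weighs system configuration, available units of parallelism, and data demographics, as described above:

```python
def pick_plan(plans):
    """Choose the least expensive plan (in estimated time).

    plans: list of (name, estimated_seconds) pairs. A stand-in for the
    Optimizer's cost-based comparison of processing alternatives.
    """
    return min(plans, key=lambda p: p[1])

# Hypothetical alternatives for one query, with made-up cost estimates.
candidates = [("full-table scan", 12.0),
              ("index lookup", 0.4),
              ("redistribute + join", 3.5)]
best = pick_plan(candidates)
```

The chosen alternative is then converted into executable steps and handed to the Dispatcher.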
An AMP accesses data from its single associated vdisk, which is made up of multiple
ranks of disks. An AMP responds to Parser/Optimizer steps transmitted across the
BYNET by selecting data from or storing data to its disks. For some requests, the AMPs
may redistribute a copy of the data to other AMPs.
The Database Manager subsystem resides on each AMP. This subsystem will:
Earlier in this course, we discussed the logical organization of data into tables. The
Database Manager subsystem provides a bridge between that logical organization and
the physical organization of the data on disks. The Database Manager performs a
space-management function that controls the use and allocation of space.
To review the AMP software, click the buttons (rectangles) on the AMP.
Channel Driver software is the means of communication between an application and the
PEs assigned to channel-attached clients. There is one Channel Driver per node.
In the diagram below, the blue dots show the communication from the channel-attached
client, to the host channel adapter in the node, to the Channel Driver software, to the PE,
and back to the client.
In the diagram below, the blue dots show the communication from the network-attached
client, to the Ethernet card in the node, to the Teradata Gateway software, to the PE,
and back to the client.
Each platform is purpose built to meet different analytical requirements. They all
leverage the Teradata Database. Customers may easily migrate applications from one
platform to another without having to change data models, ETL, or underlying structures.
The Teradata Extreme Data Appliance 1550 provides for deep strategic intelligence from
extremely large amounts of detailed data. It supports very high-volume, non-enterprise
data/analysis requirements for a small number of power users in specific workgroups or
projects that are outside of the enterprise data warehouse (EDW).
This appliance is based on the field-proven Teradata Active Enterprise Data Warehouse
5550 processing nodes and provides the same scalability and data warehouse capabilities.
These models are targeted to the full-scale large data warehouse. They offer expansion
capabilities up to 1024 TPA and non-TPA nodes. The power of the Teradata Database combined
with the throughput, power and performance of both the Intel® Xeon™ quad-core processors and
BYNET V3 technologies offers unsurpassed performance and capacity within the scalable data
warehouse.
The Teradata Data Mart Appliance 2500 is a server that is optimized specifically for high
DSS performance. The Teradata Data Mart Appliance 2550 and 2555 have similar
characteristics to the 2500, but are approximately 40% - 45% faster on a per node basis.
These systems are optimized for fast scans and heavy “deep dive” analytics.
Characteristics of the Teradata Data Mart Appliance 2500/2550/2555 include:
Exercise 2.1
Select the answers from the options given in the drop-down boxes that correctly complete the
sentences.
Feedback:
Show Answers Reset
To review these topics, click Cliques Provide Resiliency, Using the BYNET, or Cliques.
Exercise 2.2
Which three statements about the Teradata Database are true? (Choose three.)
A. Runs on a foundation called a TPA.
B. PDE is a software layer that allows TPAs to run in a parallel software environment.
C. There are two types of virtual processors: AMPs and PEs.
D. Runs on UNIX MP-RAS (discontinued after Teradata 13), Windows 2000, and Linux.
Feedback:
Check Answer Show Answer
To review these topics, click Software Components, Parallel Database Extensions (PDE), A
Teradata Database System, or Operating System.
Exercise 2.3
Four of these components are contained in the TPA software. Click each of your choices and
check the Feedback box below each time to see if you are correct.
Feedback:
Show Answers Reset
Exercise 2.4
Select AMP, BYNET, or PE in the pull-down menu as the component responsible for the following
tasks:
Feedback:
Show Answers Reset
To review these topics, click Node Components, Communication Between Nodes, Communication
Between Vprocs, Teradata Database Software: PE, and Teradata Database Software: AMP.
Exercise 2.5
From the drop-down box below, select the answer that correctly completes the sentence.
Feedback:
Exercise 2.6
Select OLAP, OLTP, DSS or Data Mining (DM) in the pull-down menu as the appropriate type of
data processing for the following requests:
Feedback:
Show Answers Reset
Exercise 2.7
From the drop-down box below, select the answer that correctly completes the sentence.
A(n) may contain detail or summary data and is a special purpose subset
of enterprise data for a particular function or application, rather than for general use.
Feedback:
Exercise 2.8
From the drop-down box below, select the answer that correctly completes the sentence.
Feedback:
Exercise 2.9
From the drop-down box below, select the answer that correctly completes the sentence.
Feedback:
Exercise 2.10
Select Teradata Extreme Data Appliance (e.g., 1550), Teradata Active Enterprise Data
Warehouse (e.g., 5550), or Teradata Data Mart Appliance (e.g., 2550) in the pull-down
menu as the appropriate platform for each description:
Feedback:
Show Answers Reset
Exercise 2.11
Feedback:
Show Answers Reset
Exercise 2.12
A. True
B. False
Feedback:
Objectives
Client Connections
Users can access data in the Teradata Database through an application on both
channel-attached and network-attached clients. Additionally, the node itself can act as a
client. Teradata client software is installed on each client (channel-attached, network-
attached, or node) and communicates with RDBMS software on the node. You may hear
either type of client referred to by the term "host," though this term is not typically used in
documentation or product literature.
The client may be a mainframe system, such as IBM or Amdahl, which is channel-
attached to the Teradata Database, or it may be a PC, UNIX, or Linux-based system that
is LAN-attached.
The client application submits an SQL request to the database, receives the response,
and submits the response to the user. This application could be a business intelligence
(BI) tool or a data integration (DI/ETL/ELT) tool, submitting queries to Teradata or
loading/updating tables in the database.
Channel-Attached Client
Communication from client applications on the mainframe goes through the mainframe
channel, to the Host Channel Adapter on the node, to the Channel Driver software.
The Teradata Database supports network-attached clients connected to the node over a
LAN. The following software components installed on the network-attached client are
responsible for communication between client applications and the Teradata Gateway
on a Teradata Database node:
Communication from applications on the network-attached client goes over the LAN, to
the Ethernet card on the node, to the Teradata Gateway software.
On the database side, the Teradata Gateway software and the PE provide the
connection to the Teradata Database. The Teradata Database is configured with two
LAN connections for redundancy. This ensures high availability.
Node
As a review, answer this question: Which two can you use to run an application that is
installed on a node? (Choose two.)
A. Mainframe terminal
B. Bus terminal
C. System console
D. Network-attached workstation
Feedback:
Check Answer Show Answer
Request Processing
The steps for processing a request like the one above are somewhat different,
depending on whether the user is accessing the Teradata Database through a channel-
attached or network-attached client:
1. SQL request is sent from the client to the appropriate component on the node:
Channel-attached client: request is sent to Channel Driver (through the TDP).
Network-attached client: request is sent to Teradata Gateway (through CLIv2
or ODBC).
2. Request is passed to the PE(s).
3. PEs parse the request into AMP steps.
4. PE Dispatcher sends steps to the AMPs over the BYNET.
5. AMPs perform operations on data on the vdisks.
6. Response is sent back to PEs over the BYNET.
7. PE Dispatcher receives response.
8. Response is returned to the client (channel-attached or network-attached).
Teradata has a robust suite of client utilities that enable users and system administrators
to enjoy optimal response time and system manageability. Various client utilities are
available for tasks from loading data to managing the system.
Teradata utilities leverage the Teradata Database’s high performance capabilities and
are fully parallel and scalable. The same utilities run on small, entry-level systems and
on the largest MPP implementations.
Teradata Database client utilities include the following, described in this section:
The Teradata Database provides tools that are front-end interfaces for submitting SQL
queries. Two mentioned in this section are BTEQ and Teradata SQL Assistant.
BTEQ
In a data warehouse environment, the database tables are populated from a variety of
sources, such as mainframe applications, operational data marts, or other distributed
systems throughout a company. These systems are the source of data such as daily
transaction files, orders, usage records, ERP (enterprise resource planning) information,
and Internet statistics. Teradata provides a suite of data load and unload utilities
optimized for use with the Teradata Database. They run on any of the supported client
platforms:
Channel-attached client
Network-attached client
Node
Teradata load and unload utilities are fully parallel. Because the utilities are scalable,
they accommodate the size of the system. Performance is not limited by the capacity of
the load and unload tools.
The utilities have full restart capability. This feature means that if a load or unload job
should be interrupted for some reason, it can be restarted again from the last
checkpoint, without having to start the job from the beginning.
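Checkpoint-based restart can be sketched as follows. The row counts, checkpoint interval, and failure point are invented; the actual utilities record checkpoints in the database so that an interrupted job resumes from the last one rather than from the beginning:

```python
def load_with_checkpoints(rows, checkpoint, fail_at=None):
    """Load rows, recording a checkpoint every `checkpoint` rows.

    Returns (rows_loaded, last_checkpoint). Illustrative sketch of
    restartable loading; no real I/O is performed.
    """
    last_checkpoint = 0
    for i, _row in enumerate(rows, start=1):
        if fail_at is not None and i == fail_at:
            return i - 1, last_checkpoint   # interrupted: restart point is known
        if i % checkpoint == 0:
            last_checkpoint = i
    return len(rows), last_checkpoint

# A job interrupted at row 7 can restart from checkpoint 5,
# not from the beginning.
loaded, restart_from = load_with_checkpoints(range(10), checkpoint=5, fail_at=7)
```

Only the rows after the last checkpoint need to be reapplied on restart, which matters greatly for multi-hour loads.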
FastLoad
MultiLoad
TPump
FastExport
Teradata Parallel Transporter (TPT)
FastLoad
FastLoad loads data into a single empty table at a time, in parallel, using multiple
sessions to transfer blocks of data. FastLoad achieves high performance by fully
exploiting the resources of the system. After the data load is complete, the table can be
made available to users. A typical use is mini-batch or frequent batch loading, where
you load the data into an empty "staging" table and then use an SQL INSERT/SELECT
statement to move it into an existing table.
MultiLoad
MultiLoad can load multiple input files concurrently and work on up to five tables at a
time, using multiple sessions. MultiLoad is optimized to apply multiple rows in block-
level operations. MultiLoad usually is run during a batch window, and places a lock on
the destination table(s) to prevent user queries from getting inconsistent results before
the data load or update is complete. Access locks may be used to query tables being
maintained with MultiLoad.
TPump
TPump performs the same operations as MultiLoad. TPump updates a row at a time and
uses row hash locks, which eliminates the need for table locks and "batch windows"
typical with MultiLoad. Users can continue to run queries during TPump data loads. In
addition, TPump maintains up to 60 tables at a time.
TPump has a dynamic throttle that operators can set to specify the percentage of system
resources to be used for an operation. This enables operators to set when TPump
should run at full capacity during low system usage, or within limits when TPump may
affect other business users of the Teradata Database.
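The dynamic throttle can be modeled as a simple rate cap. The 50% "busy" cutoff, the base rate, and the throttle percentage below are all invented for illustration; actual TPump throttling is a percentage of system resources set by the operator:

```python
def allowed_statements(base_rate, system_busy_pct, throttle_pct):
    """Cap statements/sec by a throttle percentage of capacity.

    Hypothetical model of a dynamic throttle: at low system usage run
    at full capacity; under load, allow only the throttled share.
    """
    if system_busy_pct < 50:            # quiet system: run at full capacity
        return base_rate
    return int(base_rate * throttle_pct / 100)

quiet = allowed_statements(base_rate=1000, system_busy_pct=20, throttle_pct=30)
busy = allowed_statements(base_rate=1000, system_busy_pct=80, throttle_pct=30)
```

The point of the design is that the same load job can coexist with business queries by backing off when the system is busy.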
FastExport
Use the FastExport utility to export data from one or more tables or views on the
Teradata Database to a client-based application.
You can export data from any table or view on which you have the SELECT access
rights. The destination for the exported data can be a:
FastExport is a data extract utility. It transfers large amounts of data using block
transfers over multiple sessions and writes the data to a host file on the network-
attached or channel-attached client. Typically, FastExport is run during a batch window,
and the tables being exported are locked.
Using built-in operators, Teradata Parallel Transporter combines the functionality of the
Teradata utilities (FastLoad, MultiLoad, FastExport, and TPump) in a single parallel
environment. Its extensible environment supports FastLoad INMODs, FastExport
OUTMODs, and Access Modules to provide access to all the data sources you use
today. There is a set of open APIs (Application Programmer Interface) to add third party
or custom data transformation to Teradata Parallel Transporter scripts. Using multiple,
parallel tasks, a single Teradata Parallel Transporter script can load data from disparate
sources into the Teradata Database in the same job.
A single Teradata Parallel Transporter job can load data from multiple disparate
sources into the Teradata Database, as indicated by the green arrow.
The operators are components that "plug" into the Teradata Parallel Transporter
infrastructure and actually perform the functions.
The FastLoad INMOD and FastExport OUTMOD operators support the current
FastLoad and FastExport INMOD/OUTMOD features.
The Data Connector operator is an adapter for the Access Module or non-Teradata
files.
The SQL Select and Insert operators submit the Teradata SELECT and INSERT
commands.
The Load, Update, Export and Stream operators are similar to the current
FastLoad, MultiLoad, FastExport and TPump utilities, but built for the Teradata PT
parallel environment.
The INMOD and OUTMOD adapters, Data Connector operator, and the SQL
Select/Insert operators are included when you purchase the Infrastructure. The Load,
Update, Export and Stream operators are purchased separately.
To simplify these new concepts, let's compare the Teradata Parallel Transporter
Operators with the classic utilities that we just covered.
TPT Operator | Teradata Utility | Description
Administrative Utilities
Administrative utilities use a graphical user interface (GUI) to monitor and manage
various aspects of a Teradata Database system.
Workload Management:
Teradata Manager
Teradata Dynamic Workload Manager (TDWM)
Priority Scheduler
Database Query Log (DBQL)
Teradata Workload Analyzer
Performance Monitor
Teradata Active Systems Management (TASM)
Teradata Analyst Pack
Workload Management
Teradata Manager
Teradata Manager is a production and performance monitoring system that helps a DBA
or system manager monitor, control, and administer one or more Teradata Database
systems through a GUI. Running on LAN-attached clients, Teradata Manager has a
variety of tools and applications to gather, manipulate, and analyze information about
each Teradata Database being administered.
For examples of Teradata Manager functions, click here: Teradata Manager Examples
Teradata Dynamic Workload Manager (TDWM)
With TDWM, a request can be scheduled to run periodically or during a specified time
period. Results can be retrieved any time after the request has been submitted by
TDWM and executed.
Analysis control thresholds - TDWM can restrict requests that will exceed a
certain processing time, or whose expected result set size exceeds a specified
number of rows.
Object control thresholds - TDWM can limit access to and use of static criteria
such as database objects and other items. Object controls can control workload
requests based on user IDs, tables, views, date, time, macros, databases, and
group IDs.
Environmental factors - TDWM can manage requests based on dynamic
environmental factors, including database system CPU and disk utilization, network
activity, and number of users.
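These threshold rules can be sketched as an admission function. The limits and outcomes below are invented examples; real TDWM rules and their thresholds are defined by the administrator:

```python
def admit(request, max_seconds=60, max_rows=1_000_000):
    """Decide whether to run, defer, or reject a request.

    request: dict with estimated 'seconds' and 'rows'. Thresholds here
    are hypothetical stand-ins for administrator-defined criteria.
    """
    if request["seconds"] > max_seconds:
        return "reject"                  # estimated processing time too high
    if request["rows"] > max_rows:
        return "schedule off-peak"       # result set too large: defer it
    return "run"

decisions = [admit({"seconds": 5, "rows": 100}),
             admit({"seconds": 500, "rows": 100}),
             admit({"seconds": 5, "rows": 5_000_000})]
```

A cheap request runs immediately, an expensive one is rejected, and a large-result request is deferred to off-peak hours.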
The database administrator can use the following capabilities of TDWM to manage work
submitted to the database in order to maximize system resource utilization:
Query Management
Scheduled Requests
With Query Management, database query requests are intercepted within the Teradata
Database and their components are compared against criteria defined by the
administrator; each request is then run, suspended, scheduled for later, or rejected
accordingly.
With Scheduled Requests, clients can submit SQL requests to be executed at scheduled
off-peak times.
Priority Scheduler
Priority Scheduler is a resource management tool that is used to assign resources and
controls how computer resources, (e.g., CPU), are allocated to different users in a
Teradata system. This resource management function is based on scheduler parameters
that satisfy site-specific requirements and system parameters that depict the current
activity level of the Teradata Database system. You can provide Priority Scheduler
parameters to directly define a strategy for controlling resources.
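Weight-based allocation can be sketched as a proportional split of CPU among priority groups. The group names and weights are invented; actual Priority Scheduler parameters are richer (allocation groups, resource partitions, and site-specific settings):

```python
def cpu_shares(weights, total_cpu=100.0):
    """Split CPU among priority groups in proportion to their weights.

    weights: dict mapping group name -> relative weight.
    Simplified model of weighted resource allocation.
    """
    total = sum(weights.values())
    return {group: total_cpu * w / total for group, w in weights.items()}

# Hypothetical groups: short tactical queries get the largest share.
shares = cpu_shares({"tactical": 60, "batch": 30, "background": 10})
```

With these weights, tactical work receives 60% of CPU, so long-running batch jobs cannot starve short, high-priority queries.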
Teradata Workload Analyzer
Identifies classes of queries and candidate workloads for analysis and
recommends workload definitions and operating rules.
Recommends workload allocation group mappings and Priority Scheduler facility
(PSF) weights.
Provides the ability to migrate existing Priority Schedule Definitions (PD Sets) into
new workloads.
Provides recommendations for appropriate workload Service Level Goals (SLGs).
Establishes workload definitions from query history or directly.
Can be used “iteratively” to analyze and understand how well existing workload
definitions are working and modify them if necessary.
Teradata Workload Analyzer can also apply best practice standards to workload
definitions such as assistance in Service Level Goal (SLG) definition and priority
scheduler setting recommendations.
Performance Monitor
The Performance Monitor (formerly called PMON) collects near real-time system
configuration, resource usage, and session information from the Teradata Database
either directly or through Teradata Manager. Performance Monitor formats and displays
this information as requested. It allows you to analyze current performance and both
current and historical session information, and to abort sessions that are causing
system problems.
Priority Scheduler (resource management; resource control during query execution):
manages the level of resources allocated to different priorities of executing work.
Performance Monitor (during query execution): allows a DBA or user to examine the
active workload.
Database Query Log (application query post-execution): analyzes query performance
and behavior after completion.
Tools are also provided to monitor workloads in real time and to produce historical
reports of resource utilization by workloads. By analyzing this information, the workload
definitions can be adjusted to improve the allocation of system resources.
TASM primarily comprises three products that are used to create and manage
“workload definitions.”
Teradata Active Systems Management (TASM) allows you to perform the following:
Teradata Visual Explain allows you to capture and graphically represent the steps of a
query plan and to perform comparisons of two or more plans. It is intended for
application developers, database administrators
and database support personnel to better understand why the Teradata Database
Optimizer chooses a particular plan for a given SQL query. All information required for
query plan analysis such as database object definitions, data demographics and cost
and cardinality estimates is available through the Teradata Visual Explain interface. It is
helpful in identifying the performance implications of data skew and bad or missing
statistics. Visual Explain uses a Query Capture Database to store query plans which
can then be visualized or manipulated with other Teradata Analyst Pack tools.
The Teradata System Emulation Tool (Teradata SET) allows the user to capture the following by database, query, or workload:
As changes are made within a database, the Statistics Wizard identifies those changes
and recommends which tables should have statistics collected, based on age of data
and table growth, and which columns/indexes would benefit from having statistics
defined and collected for a specific workload. The DBA is then given the opportunity to
accept or reject the recommendations.
Archival Utilities
Teradata provides the Archive Recovery utility (ARC) to perform backup and restore
operations on tables, databases, and other objects.
In addition, ARC interfaces to third party products to support backup and restore
capabilities in a network-attached environment.
There are several scenarios where restoring objects from external media may be
necessary:
With the ARC utility you can copy a table and restore it to another Teradata Database. It
is scalable and parallel, and can run on a channel-attached client, network-attached
client, or a node.
ARC may run on the node or on the channel-attached client, and will back up data
directly across the channel into the mainframe-attached tape subsystem.
In a network-attached client environment, ARC is used to back up data, along with one
of the following tape management products:
These products provide modules for Teradata Database systems that run on network-
attached clients or a node (Microsoft Windows or UNIX MP-RAS). Data is backed up
through these interfaces into a tape storage subsystem using the ARC utility.
Exercise 3.1
Exercise 3.2
Select the appropriate Teradata load or unload utility from the pull-down menus.
Enables constant loading (streaming) of data into a table to keep data fresh.
Data extract utility that exports data from a Teradata table and writes it to a host file.
Updates, inserts, or deletes empty or populated tables (block level operation).
Uses parallel processing to load an empty table.
Performs the same function as the UPDATE Teradata Parallel Transporter operator.
Performs the same function as the STREAM Teradata Parallel Transporter operator.
Feedback:
Exercise 3.3
Move the software components required for a channel connection into the appropriate blue
squares. Correctly placed components will stay where you put them.
Exercise 3.4
A. Teradata SQL Assistant and TDWM are the two utilities used for Teradata system
management.
B. TDWM can reject a query based on current workload and set thresholds.
C. BTEQ runs on all client platforms to access the Teradata Database.
D. Archive Recovery (ARC) is used to copy and restore a table to another Teradata Database.
E. NetVault and Veritas NetBackup are utilities used for network management.
Feedback:
To review these topics, click BTEQ, Teradata SQL Assistant, Teradata Manager, TDWM, Archiving
on Channel-Attached Clients, and Archiving on Network-Attached Clients.
Exercise 3.5
Select the correct type of connection (network-attached client or channel-attached client) from the
drop-down boxes below that corresponds to the listed software and hardware components.
Teradata Gateway
Teradata Director Program
Channel Driver
Ethernet Card
"mainframe host"
Feedback:
To review this topic, click Channel Attached Client or Network Attached Client.
Exercise 3.6
Select the correct Teradata Analyst Pack tool from the drop-down menus below.
Feedback:
Exercise 3.7
A. Teradata Workload Analyzer
B. Database Query Log
C. Teradata Active Systems Manager
D. Performance Monitor
Feedback:
Exercise 3.8
A. True
B. False
Feedback:
Exercise 3.9
Feedback:
Objectives
When the Teradata Database software is first installed, all Permanent Space is assigned
to Database DBC (also a User in Teradata Database terminology, because you can log
on to it with a userid and password). During installation, the following Databases are
created:
Because Database DBC is the immediate parent of these child Databases, Permanent
Space limits for the children are subtracted from Database DBC.
After the initial installation, you will create your database hierarchy. One way to set up
this hierarchy would be to create a Database Administrator User directly subordinate to
Database DBC. Most of the system Permanent Space would be assigned to the
Database Administrator User. This setup gives you the freedom to have multiple
administrators logging on to the Database Administrator User, and limit the number of
people logging on directly to Database DBC (which has more access rights than any
other User).
Next, all other Users and Databases would be created from the database administrator
User, and their Permanent Space limits would be subtracted from the Database
Administrator User's space limit. Your hierarchy would look like this:
Database DBC at the highest level, the parent of all other Databases (including
Users).
User SysDBA (we called it SysDBA; you can assign it any name) with the majority
of the system's Perm Space assigned to it.
All Databases and Users in the system created from User SysDBA.
Each table, view, macro, stored procedure, and trigger is owned by a Database
(or User).
Data Layers
There are several “layers” built in to the EDW environment. These layers include:
Below is an example of how Permanent Space limits for Users and Databases come
from the immediate parent User or Database. In this case, the User SysDBA has 500 GB
of maximum Permanent Space assigned to it.
The User HR is created from SysDBA with 200 GB of maximum Permanent Space. The
200 GB for HR is subtracted from SysDBA, who now has 300 GB (500 GB minus 200
GB).
The User Payroll is created as a child of HR with 100 GB of Permanent Space. The 100
GB for Payroll is subtracted from HR, which now has 100 GB (200 GB minus 100 GB).
The User Marketing is created as a child of SysDBA with 100 GB of maximum
Permanent Space. The 100 GB for Marketing comes from its parent, SysDBA, which
now has 200 GB (300 GB minus 100 GB).
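The subtraction rule illustrated above can be sketched in a few lines of Python. This is a toy model of the accounting only (the class and attribute names are invented for illustration), not Teradata's implementation:

```python
# Toy model: creating a child Database/User subtracts its maximum
# Permanent Space from its immediate parent's allocation.
class DbObject:
    def __init__(self, name, max_perm_gb, parent=None):
        if parent is not None:
            if max_perm_gb > parent.max_perm_gb:
                raise ValueError("child cannot exceed parent's remaining Perm Space")
            parent.max_perm_gb -= max_perm_gb  # space comes out of the parent
        self.name = name
        self.max_perm_gb = max_perm_gb

sysdba = DbObject("SysDBA", 500)
hr = DbObject("HR", 200, parent=sysdba)               # SysDBA: 500 - 200 = 300
payroll = DbObject("Payroll", 100, parent=hr)         # HR: 200 - 100 = 100
marketing = DbObject("Marketing", 100, parent=sysdba) # SysDBA: 300 - 100 = 200
```

Running the sketch leaves SysDBA with 200 GB of maximum Permanent Space, matching the example.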
A Teradata Database
In Teradata Database systems, the words "database" and "user" have specific
definitions.
Note: A Database with no Perm Space can contain views, macros, and triggers, but no
tables or stored procedures.
These Teradata Database objects are created, maintained, and deleted using SQL.
A user may be a collection of tables, views, macros, triggers, and stored procedures. A
user is a specific type of database, and has attributes in addition to the ones listed
above:
User ID
Password
So, a user is the same as a database except that a user can actually log on to the
database. To log on to a Teradata Database, you need to specify a user (which is simply
a database with a password). You cannot log on to a database because it has no
password.
Note: In this course, we will use uppercase "U" for User and uppercase "D" for Database
when referring to these specific Teradata Database objects.
Spool Space
As mentioned previously in "Creating Databases and Users," Spool Space is work space
used to hold intermediate answer sets. Any Perm Space currently unassigned is
available as Spool Space.
Defining a Spool Space limit is not required when Users and Databases are created. If it
is not defined, the Spool Space limit for the User or Database is inherited from its parent.
Thus, if no Spool Space limit were defined for any Users or Databases, an erroneous
SQL request could create a "runaway transaction" that consumes all of the system's
resources. For this reason, defining Spool Space limits for a User or Database is highly
recommended.
The Spool Space limit for a Database or User is not subtracted from its immediate
parent, but the Database or User's maximum spool allocation can only be as large as its
immediate parent. For example:
Because Spool Space is work space, temporarily used and released by the system as
needed, the total maximum Spool Space allocated for all the Databases and Users on
the system can actually exceed the total system disk space. But this is not the amount of
Spool Space actually consumed.
The maximum Spool Space for a Database (or User) is merely an upper limit of the
Spool Space that the Database can use while processing a transaction. There are two
limits to Spool Space utilization:
Physical limitation of disk space. For a specific transaction, the system can only
use the amount of Spool Space actually available on the system at that
particular time, whether a maximum spool limit has been defined or not. If a job is
going to exceed the Spool Space available on the system, an error message is
given stating that there is not enough space to process the job.
As the amount of Permanent Space used to store data varies over a long period of time,
so will the amount of space available for spool (work space).
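The limits above combine simply: at any moment, a request can use no more spool than both the user's defined spool limit and the physically free space allow. A minimal sketch (the function name and figures are invented for illustration):

```python
def spool_available(user_spool_limit_gb, free_system_space_gb):
    """Spool a request can use: capped by the user's defined limit
    AND by the physical space actually free at this moment."""
    return min(user_spool_limit_gb, free_system_space_gb)

# A user allowed 50 GB of spool when only 30 GB is physically free
# can use at most 30 GB for the request.
```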
Temporary Space
Temporary Space is Permanent Space currently not being used. Temporary Space is
used for global temporary tables, and these results remain available to the user until
their session is terminated. Tables created in Temp Space will survive a restart.
A. The Spool Space used by a request is limited to the amount of Spool Space
assigned to the originating user and the physical space available on the system at that
point in time.
B. A request can use as much Spool Space as necessary as long as it does not exceed
the system’s total installed physical space limit.
C. A request can use as much Spool Space as necessary as long as it does not exceed
the Spool Space limit of the originating user, regardless of the space available on the
system.
D. The Spool Space used by a request is limited only by the maximum Perm Space of
the originating user.
Feedback:
Data Dictionary
The Data Dictionary is a set of relational tables that contains information about the
RDBMS and database objects within it. It is like the metadata or "data about the data" for
a Teradata Database (except that it does not contain business rules, like true metadata
does). The Data Dictionary resides in Database DBC. Some of the items it tracks are:
Disk space
Access rights
Ownership
Data definitions
Disk Space
The Data Dictionary stores information about how much space is allocated for perm and
spool for each Database and User. The table below shows an example of Data
Dictionary information for space allocations. In this example, the Users Payroll and
Benefits have no Permanent Space allocated or consumed because they do not contain
tables.
Access
The Data Dictionary also stores information about which Users can access which
database objects.
System Administrators are often responsible for archiving the system. In the example
below, it is likely that the SysAdm User would have access to the tables in the Employee
and Crashdumps databases, as well as other objects. When you grant and revoke
access to any User for any database object, privileges are stored in the AccessRights
table in the Data Dictionary.
Owners
The Data Dictionary also stores information about which Databases and Users own each
database object.
Definitions
The Data Dictionary stores definitions of all database objects, their names, and their
place in the hierarchy.
For macros, the Data Dictionary also stores the actual SQL statements of the macro.
While stored procedures also contain statements (SQL and SPL statements), the
statements for each stored procedure are kept in a separate table and distributed among
the AMPs (like regular user data), rather than in the Data Dictionary.
Database Security
LDAP
Single Sign-On
Passwords
Authentication
After users have logged on to Teradata Database and have been authenticated, they are
authorized access to only those objects allowed by their database privileges.
Privilege (access right) is the right to access or manipulate an object within Teradata.
Privileges control user activities such as creating, executing, inserting, viewing,
modifying, deleting, or tracking database objects and data. Privileges may also include
the ability to grant privileges to other users in the database.
In addition to access rights, the database hierarchy can be set up such that users
access tables or applications via the semantic layer, which could include Views, Macros,
Stored Procedures, and even UDFs.
Roles, which are a collection of access rights, can be granted to groups of users to
further protect the security of data and objects within Teradata.
Exercise 4.1
A. The path to the data.
B. A SELECT command.
C. A User and password.
D. An IP address.
Feedback:
Exercise 4.2
A. 300 GB
B. 500 GB
C. 600 GB
D. 700 GB
Feedback:
See Calculation
Exercise 4.3
Select the answers from the options given in the drop-down boxes that correctly complete the
sentences.
Feedback:
To review these topics, click Creating Databases and Users, Creating a New Database.
Exercise 4.4
Select the choice from the drop-down box that corresponds to each statement:
Feedback:
Exercise 4.5
A. Always
B. Sometimes
C. Never
Feedback:
Exercise 4.6
A. True
B. False
Feedback:
Exercise 4.7
Which three security mechanisms are used to authenticate access to the Teradata
Database? (Choose three.)
A. LDAP
B. Single Sign-On
C. User Defined Functions
D. Passwords
Feedback:
Exercise 4.8
Match the data “layers” built into the Teradata EDW environment to their definitions.
The primary purpose for this layer is to perform data transformation, either in the ETL
or ELT process.
This layer is where denormalizations that will make access more efficient occur; pre-
aggregations, summary tables, join indexes, etc. The purpose of this layer is to provide efficient,
friendly access to end users.
This is the “access” layer. Access is often provided via views and business intelligence
(BI) tools; whether a Teradata application or a 3rd party tool.
Feedback:
Objectives
Describe the types of data protection and fault tolerance used by the Teradata
Database.
Discuss the types of RAID protection used on Teradata Database systems.
Explain basic data storage concepts.
Explain the concept of Fallback tables.
List the types and levels of locking provided by the Teradata Database.
Describe the function of recovery journals, transient journals, and permanent
journals.
Protecting Data
Several types of data protection are available with the Teradata Database. All the data
protection methods shown on this page are covered in further detail later in this module.
RAID
Redundant Array of Inexpensive Disks (RAID) is a storage technology that provides data
protection at the disk drive level. It uses groups of disk drives called "arrays" to ensure
that data is available in the event of a failed disk drive or other component. The word
"redundant" implies that data, functions, and/or components have been duplicated in
the array's architecture. The industry has agreed on six RAID configuration levels
(RAID 0 through RAID 5). The classifications do not imply the superiority of one mode
over another, but differentiate how data is stored on the disk drives.
With the Teradata Database, the two RAID technologies that are supported are RAID 1
and RAID 5. On systems using EMC disk drives, RAID 5 is called RAID S.
SCSI bus
Physical disks
Disk array controllers
For maximum availability and performance, the Teradata Database uses dual redundant
disk array controllers. Having two disk array controllers provides a level of protection in
case one controller fails, and provides parallelism for disk access.
Fallback
Fallback is a Teradata Database feature that protects data against AMP failure. As
shown later in this module, Fallback uses clusters of AMPs that provide for data
availability and consistency if an AMP is unavailable.
Locks
Locks can be placed on database objects to prevent multiple users from simultaneously
changing them. The four types of locks are:
Exclusive
Write
Read
Access
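The four lock types differ in which concurrent requests they permit. The compatibility matrix below reflects commonly documented Teradata locking semantics (Access coexists with everything except Exclusive, Read with Read and Access, Write only with Access, Exclusive with nothing); treat it as a hedged sketch, not an authoritative specification:

```python
# Assumed lock-compatibility matrix; an illustration, not a spec quote.
COMPATIBLE = {
    "Access":    {"Access", "Read", "Write"},
    "Read":      {"Access", "Read"},
    "Write":     {"Access"},
    "Exclusive": set(),
}

def can_coexist(held_lock, requested_lock):
    """True if a new request can proceed while held_lock is in place."""
    return requested_lock in COMPATIBLE[held_lock]

can_coexist("Read", "Read")   # multiple readers are fine
can_coexist("Write", "Read")  # a reader must wait for a writer
```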
Journals
The Teradata Database has journals that are used for specific types of data or process
recovery:
Recovery Journals
Permanent Journals
RAID 1
RAID 1 is a data protection scheme that uses mirrored pairs of disks to protect data from
a single drive failure.
RAID 1 requires double the number of disks because every drive has an identical
mirrored copy. Recovery with RAID 1 is faster than with RAID 5. The highest level of
data protection is RAID 1 with Fallback.
RAID 1 protects against a single disk failure using the following principles:
Mirroring
Reading
Mirroring: RAID 1 maintains a mirrored disk for each disk in the system.
Note: If you configure more than one pair of disks per AMP, the RDAC stripes the data
across both the regular and mirror disks.
Reading: Using both copies of the data, the system reads data blocks from the first
available disk. This does not so much protect data as provide a performance benefit.
If a disk fails, the Teradata Database is unaffected and the following are each handled in
a different way:
Reads
Writes
Replacements
Reads: When a drive is down, the system reads the data from the other drive. There
may be a minor performance penalty because the read will occur from one drive instead
of both.
Writes: When a drive is down, the system writes to the functional drive. No mirror image
exists at this time.
Replacements: After you replace the failed disk, the disk array controller automatically
reconstructs the data on the new disk from the mirror image. Normal system
performance is affected during the reconstruction of the failed disk.
RAID 5
RAID 5 is a data protection scheme that uses parity striping in a disk array to protect
data from the failure of a single drive.
Note: RAID S is the name for RAID 5 implemented on EMC disk drives.
The number of disks per rank varies from vendor to vendor. The number of disks in a
rank impacts space utilization:
RAID 5 also uses some overhead during a write operation, because it has to read the
data, then calculate and write the parity.
Rank: For the Teradata Database, RAID 5 uses the concept of a rank, which is a set of
disks working together. Note that the disks in a rank are not directly cabled to each
other.
Data is striped across a rank of disks (spread across the disk drives) one segment
at a time, using a binary "exclusive-or" (XOR) algorithm.
Parity is also striped across all disk drives, interleaved with the data. A "parity
byte" is an extra byte written to a drive in a rank. The process of writing data and
parity to the disk drives includes a read-modify-write operation for each new
segment:
1. Read existing data on the disk drives in the rank.
2. Read existing parity in that rank for the corresponding segment.
3. Calculate the new parity: existing data XOR new data XOR existing parity = new parity.
4. Write new data.
5. Write new parity.
If one of the disk drives in the rank becomes unavailable, the system uses the
parity byte to calculate the missing data from the down drive so the system can
remain operational. With a rank of 4 disks, if a disk fails, any missing data block
may be reconstructed using the other 3 disks.
In the example below, data bytes are written to disk drives 1, 2, and 3. The system
calculates the parity byte using the binary XOR algorithm and writes it to disk drive 4.
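The read-modify-write steps above are plain XOR arithmetic. A short Python sketch (the byte values are invented for illustration):

```python
# RAID 5 parity: the parity byte is the XOR of the data bytes in a segment.
d1, d2, d3 = 0b10100001, 0b01101100, 0b00010111
parity = d1 ^ d2 ^ d3  # written to the fourth drive in the rank

# Read-modify-write: update d2 without rereading every drive.
new_d2 = 0b11110000
new_parity = d2 ^ new_d2 ^ parity      # existing data XOR new data XOR existing parity
assert new_parity == d1 ^ new_d2 ^ d3  # identical to recomputing from scratch

# If the second drive fails, its byte is rebuilt from the survivors.
recovered = d1 ^ d3 ^ parity
assert recovered == d2
```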
If a disk fails, the Teradata Database is unaffected and the following are each handled in
different ways:
Reads
Writes
Replacements
Reads: Data is reconstructed on-the-fly as users request data using the binary XOR
algorithm.
Writes: When a drive is down, the system writes to the functional drives, but not to the
failed drive.
Replacements: After you replace the failed disk, the disk array controller automatically
reconstructs the data on the new disk, using known data values to calculate the missing
data. Normal system performance is affected during reconstruction of the failed disk.
Give It a Try
In the example below, Disk 2 has experienced a failure. To allow users to still access the
data while Disk 2 is down, the system must calculate the data on the missing disk drive
using the parity byte. What would be the missing byte for this segment?
A. 1111 0011
B. 0111 1011
C. 0010 0110
D. 0000 1100
Feedback:
Fallback
Fallback is a Teradata Database feature that protects data in the case of an AMP vproc
failure. Fallback guarantees the maximum availability of data. You can specify Fallback
protection at the table or database level. It is especially useful in applications that require
high availability.
Fallback protects your data by storing a second copy of each row of a table on a
different AMP in the same cluster. If an AMP fails, the system accesses the Fallback
rows to meet requests. Fallback provides AMP fault tolerance at the table level. With
Fallback tables, if one AMP fails, all data is still available. Users may continue to use
Fallback tables without any loss of access to data.
During table creation or after a table is created, you may specify whether or not the
system should keep a Fallback copy of the table. If Fallback is specified, it is automatic
and transparent.
Fallback guarantees that the two copies of a row will always be on different AMPs. If
either AMP fails, the alternate row is still available on the other AMP.
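The placement guarantee can be sketched as follows. The real Teradata assignment is more involved; the next-AMP-in-cluster scheme here is purely illustrative:

```python
def fallback_amp(primary_amp, cluster):
    """Place the Fallback copy on a different AMP in the same cluster
    (illustrative next-AMP scheme, not Teradata's actual algorithm)."""
    i = cluster.index(primary_amp)
    return cluster[(i + 1) % len(cluster)]

cluster = [0, 1, 2, 3]  # a four-AMP cluster
# For every primary AMP, the Fallback AMP differs but stays in the cluster,
# so losing any single AMP still leaves one copy of every row reachable.
```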
Space
Performance
There is a benefit to protecting your data, but there are costs associated with that
benefit. With Fallback use, you need twice the disk space for storage and twice the I/O
required for INSERTs, UPDATEs, and DELETEs of rows in Fallback protected tables.
The Fallback option does not require any extra I/O for SELECTs, as the system will read
from one copy or the other, and the Fallback I/O is performed in parallel with the
primary I/O, so there is no performance hit.
Below is a cluster of four AMPs. Each AMP has a combination of Primary and Fallback
data rows:
Primary Data Row: A record in a database table that is used in normal system
operation.
Fallback Data Row: The online backup copy of a Primary data row that is used in
the case of an AMP failure.
Write: Each Primary data row has a duplicate Fallback row on another AMP. The
Primary and Fallback data rows are written in parallel.
P=Primary F=Fallback
Read: When an AMP is down with a table that is defined as Fallback, Teradata will
access the Fallback copies of the rows.
More Clusters: The diagram below shows how Fallback data is distributed among
multiple clusters.
P=Primary F=Fallback
If two physical disks fail in the same RAID 5 rank or RAID 1 mirrored pair, the associated
AMP vproc fails. Fallback protects against the failure of a single AMP in a cluster.
If two AMPs in a cluster fail, the system halts and must be restarted manually, after the
AMP is recovered by replacing the failed disk(s).
Reads: When an AMP fails, the system reads all rows it needs from the remaining AMPs
in the cluster. If the system needs to find a Primary row from the failed AMP, it reads the
Fallback copy of that row, which is on another AMP.
Writes: A failed AMP is not available, so the system cannot access any of that AMP's
disk space. Copies of its unavailable primary rows are available as Fallback rows on the
other AMPs in the cluster, and are updated there.
Replacement: Repairing the failed AMP requires replacing the failed physical disks and
bringing the AMP online. Once the AMP is online, the system uses the Fallback data on
the other AMPs to automatically reconstruct data on the newly replaced disks.
Disk Allocation
The operating system, PDE, and the Teradata Database do not recognize the physical
disk hardware. Each software component recognizes and interacts with different
components of the data storage environment:
Creating LUNs
Space on the physical disk drives is organized into LUNs. The RAID level determines
how the space is organized. For example, if you are using RAID 5, a LUN includes a
region of space from each of the physical disk drives in a rank.
In UNIX systems, a LUN consists of one partition, which is further divided into
slices:
Boot slice (a very small slice, taking up only 35 sectors)
User slices for storing data. These user slices are called "pdisks" in the
Teradata Database.
In summary, pdisks are the user slices (UNIX) or partitions (Microsoft Windows, Linux)
used for the storage of the tables in a database. A LUN may have one or more pdisks.
The pdisks (user slices or partitions, depending on the operating system) are assigned
to an AMP through the software. No cabling is involved.
The combined space on the pdisks is considered the AMP's vdisk. An AMP manages
only its own vdisk (disk space assigned to it), not the vdisk of any other AMP. All AMPs
then work in parallel, processing their portion of the data.
Each AMP in the system is assigned one vdisk. Although numerous configurations are
possible, generally all pdisks from a rank (RAID 5) or mirrored pair (RAID 1) are
assigned to the same AMP for optimal performance.
However, an AMP recognizes only the vdisk. The AMP has no control over the physical
disks or ranks that compose the vdisk.
To help review the terminology you just learned, choose the correct term from the pull-
down boxes next to each definition.
A logical unit that is composed of a region of space from each of the physical
disk drives in a rank. The operating system sees this logical unit as its "disk," and is not
aware that it is actually writing to spaces on multiple disk drives.
For a UNIX system, a portion of physical disk drive space that is used for
storing data. One of these from each disk drive in a rank composes a LUN.
For a Microsoft Windows system, a portion of physical disk drive space that
is used for storing data. One of these from each disk drive in a rank composes a LUN.
This is the collective name for all the logical disk space that an AMP
manages. Thus, it is composed of all the pdisks assigned to that AMP (as many as 64
pdisks).
Feedback:
The following journals are kept on the system to provide data availability in the event of a
component or process failure in the system:
Recovery Journals
Permanent Journals
Recovery Journals
The Teradata Database uses Recovery Journals to automatically maintain data integrity
in the case of:
Recovery Journals are created, maintained, and purged by the system automatically, so
no DBA intervention is required. Recovery Journals are tables stored on disk arrays like
user data is, so they take up disk space on the system.
Transient Journal
A Transient Journal maintains data integrity when in-flight transactions are interrupted
(due to aborted transactions, system restarts, and so on). Data is returned to its
original state after transaction failure.
A Transient Journal is used during normal system operation to keep "before images" of
changed rows so the data can be restored to its previous state if the transaction is not
completed. This happens on each AMP as changes occur. When a transaction is
started, the system automatically stores a copy of all the rows affected by the
transaction in the Transient Journal until the transaction is committed (completed). Once
the transaction is complete, the "before images" are purged. In the event of a transaction
failure, the "before images" are reapplied to the affected tables and deleted from the
journal, and the "rollback" operation is completed.
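The rollback behavior described above can be made visible with an explicit transaction. The sketch below assumes a hypothetical Employee table and a session running in Teradata (BTET) transaction mode:

```sql
-- Illustrative sketch only; table and column names are hypothetical.
BT;                                  -- Begin Transaction: before images start accumulating
UPDATE Employee
SET    Salary = Salary * 1.10
WHERE  Department_Number = 401;      -- each AMP journals a before image of every changed row
ABORT;                               -- before images are reapplied and purged; data is rolled back
```

Had the transaction ended with ET; (End Transaction) instead of ABORT;, the before images would simply have been purged.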
The Down-AMP Recovery Journal allows continued system operation while an AMP is
down (for example, when two disk drives fail in a rank or mirrored pair). A Down-AMP
Recovery Journal is used with Fallback-protected tables to maintain a record of write
transactions (updates, creates, inserts, deletes, etc.) on the failed AMP while it is
unavailable.
The Down-AMP Recovery Journal starts automatically after the loss of an AMP in a
cluster. Any changes to the data on the failed AMP are logged into the Down-AMP
Recovery Journal by the other AMPs in the cluster. When the failed AMP is brought back
online, the restart process includes applying the changes in the Down-AMP Recovery
Journal to the recovered AMP. The journal is discarded once the process is complete,
and the AMP is brought online, fully recovered.
Permanent Journals
Permanent Journals are an optional feature used to provide an additional level of data
protection. You specify the use of Permanent Journals at the table level. They provide
full-table recovery to a specific point in time and can also reduce the need for costly and
time-consuming full-table backups.
Permanent Journals are tables stored on disk arrays like user data, so they take up
additional disk space on the system. The Database Administrator maintains the
Permanent Journal entries (deleting, archiving, and so on).
When you create a table with Permanent Journaling, you must specify whether the
Permanent Journal will capture before images (a copy of a row taken before a change)
or after images (a copy taken after a change).
You can also specify that the system keep both before images and after images. In
addition, you can choose that the system captures:
Single images (the default) -- this means that the Permanent Journal table is not
Fallback protected.
Dual images -- this means that the Permanent Journal table is Fallback protected.
The Permanent Journal captures images concurrently with standard table maintenance
and query activity. The additional disk space required may be calculated in advance to
ensure adequate resources. Periodically, the Database Administrator must dump the
Permanent Journal to external media, thus reducing the need for full-table backups since
only changes are backed up rather than the entire database.
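The table-level opt-in works roughly as sketched below. All object names are hypothetical; in Teradata the journal table itself is defined at the database level, and individual tables reference it through CREATE TABLE options:

```sql
-- Hypothetical database with a default permanent journal table.
CREATE DATABASE Sales_DB AS
  PERM = 10000000000,
  DEFAULT JOURNAL TABLE = Sales_DB.SalesJrnl;

-- A table that opts in to permanent journaling.
CREATE TABLE Sales_DB.Orders,
  DUAL BEFORE JOURNAL,    -- dual images: the journal entries are Fallback protected
  AFTER JOURNAL           -- single after images
( Order_Id   INTEGER,
  Order_Date DATE )
PRIMARY INDEX (Order_Id);
```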
Locks
Locking prevents multiple users who are trying to access or change the same data
simultaneously from violating data integrity. This concurrency control is implemented by
locking the target data.
Locks are automatically acquired during the processing of a request and released when
the request is terminated.
Levels of Locking
Locks can be applied at three levels: database, table, and row hash.
Types of Locks
Exclusive
Exclusive locks are applied to databases or tables, never to rows. They are the most
restrictive type of lock. With an exclusive lock, no other user can access the database or
table. Exclusive locks are used when a Data Definition Language (DDL) command is
executed (e.g., CREATE TABLE). An exclusive lock on a database or table prevents
other users from obtaining any lock on the locked object.
Write
Write locks enable users to modify data while maintaining data consistency. While the
data has a write lock on it, other users can only obtain an access lock. During this time,
all other locks are held in a queue until the write lock is released.
Read
Read locks are used to ensure consistency during read operations. Several users may
hold concurrent read locks on the same data, during which time no data modification is
permitted. Read locks prevent other users from obtaining the following locks on the
locked data:
Exclusive locks
Write locks
Access
Access locks can be specified by users unconcerned about data consistency. The use of
an access lock allows for reading data while modifications are in process. Access locks
are designed for decision support on tables that are updated only by small, single-row
changes. Access locks are sometimes called "stale read" locks, because you may get
"stale data" that has not been updated. Access locks prevent other users from obtaining
the following locks on the locked data:
Exclusive locks
Allows other users to see a stable version of the data, but not make any
modifications.
Allows other users to obtain an access lock only, not any other type of
lock.
This kind of lock cannot be applied to rows.
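A user can request the "stale read" behavior explicitly with the LOCKING request modifier. A minimal sketch, assuming a hypothetical Orders table:

```sql
-- Downgrade the default read lock to an access lock for this request.
LOCKING TABLE Sales_DB.Orders FOR ACCESS
SELECT Order_Id, Order_Date
FROM   Sales_DB.Orders
WHERE  Order_Date > DATE '2010-01-01';
```

The trade-off: the query does not wait behind write locks, but it may return rows from in-flight updates.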
Exercise 5.1
A. True
B. False
Exercise 5.2
A. DARDAC
B. Mirroring
C. Parity Striping
D. Partitioning
Exercise 5.3
To review this topic, click Down-AMP Recovery Journal, Transient Journal, or Permanent Journals.
Exercise 5.4
A. Fallback protects data from the failure of one AMP per cluster.
B. A clique provides protection in the case of a node failure.
C. ARC protects disk arrays from electrostatic discharge.
D. Locks prevent multiple users from simultaneously changing the same data.
To review these topics, click Fallback, Cliques Provide Resiliency, Archival Utilities, or Locks.
Exercise 5.5
True or False: Restoration of Fallback-protected data starts automatically when a failed AMP is
brought online.
A. True
B. False
A. True
B. False
Exercise 5.6
From the drop-down boxes below, match the storage concepts to the descriptions:
The collection of pdisks used to store data. This space is assigned to an AMP.
A collection of areas across the disk drives in a rank. The operating system sees this as
its logical "disk."
A collection of AMPs that keeps Fallback copies of rows for each other in case one AMP
fails.
An area of a LUN (also known as a user slice in UNIX or partition in Microsoft Windows)
that stores user data.
A collection of disk drives used to provide data availability.
To review these topics, click Assigning Pdisks to AMPs, Creating LUNs, Fallback: How It Works,
Pdisks: User Data Space, or RAID 5: How It Works.
Mod 6 - Indexes
Objectives
Indexes are used to access rows from a table without having to search the whole table.
In the Teradata Database, an index is made up of one or more columns in a table. Once
Teradata Database indexes are selected, they are maintained by the system. While
other vendors may require data partitioning or index maintenance, these tasks are
unnecessary with the Teradata Database.
You specify which column or columns are used as the Primary Index when you create a
table. Secondary Index columns can be specified when you create a table or at any time
during the life of the table.
Data Distribution
When the Primary Index for a table is well chosen, the rows are evenly distributed across
the AMPs for the best performance. The way to guarantee even distribution of data is by
choosing a Primary Index whose columns contain unique values. The values do not
have to be evenly spaced or even truly random; they just have to be unique to be
evenly distributed.
Each AMP is responsible for a subset of the rows in a table. If the data is evenly
distributed, the work is evenly divided among the AMPs so they can work in parallel and
complete their processing about the same time. Even data distribution is critical to
performance because it optimizes the parallel access to the data.
Unevenly distributed data, also called "skewed data," causes slower response time as
the system waits for the AMP(s) with the most data to finish their processing. The
slowest AMP becomes a bottleneck. If distribution is skewed, an all-AMP operation will
take longer than if all AMPs were evenly utilized.
The system automatically distributes the data across the AMPs based on row
content (the Primary Index values).
The distribution is the same regardless of the data volume being loaded. In other
words, large tables are distributed the same way as small tables.
Data is not distributed in any particular order. The benefits of having unordered data are
that they don't need any maintenance to preserve order, and they are independent
of any query being submitted. The automatic, unordered distribution of data eliminates
tasks for a Teradata Database Administrator that are necessary with some other
relational database systems. The DBA does not waste time on labor-intensive data
maintenance tasks.
A key benefit of the Teradata Database is its manageability. The list of tasks that
Teradata Database Administrators do not have to do is long, and illustrates why the
Teradata Database system is so easy to manage and maintain compared to other
databases.
With the Teradata Database, the workload for creating a table of 100 rows is the same
as creating a table with 1,000,000,000 rows. Teradata Database Administrators know that
if data doubles, the system can expand easily to accommodate it. The Teradata
Database provides huge cost advantages, especially when it comes to staffing Database
Administrators. Customers tell us that their DBA staff requirements for administering
non-Teradata databases are three to four times higher.
Even data distribution is not easy for most databases to do. Many databases use range
distribution, which creates intensive maintenance tasks for the DBA. Others may use
indexes as a way to select a small amount of data to return the answer to a query. They
use them to avoid accessing the underlying tables if possible. The assumption is that the
index will be smaller than the tables so they will take less time to read. Because they
scan indexes and use only part of the data in the index to search for answers to a query,
they can carry extra data in the indexes, duplicating data in the tables. This way they do
not have to read the table at all in some cases. This is not as efficient as the Teradata
Database's method of data storage and access.
Many other databases require the DBAs to manually partition the data. They might
place an entire table in a single partition. The disadvantage of this approach is it creates
a bottleneck for all queries against that data. It is not the most efficient way to either
store or access data rows.
With other databases, adding, updating and deleting data affects manual data
distribution schemes thereby reducing query performance and requiring reorganization.
A Teradata Database provides high performance because it distributes the data evenly
across the AMPs for parallel processing. No partitioning or data re-organizations are
needed. With the Teradata Database, your DBA can spend more time with users
developing strategic applications to beat your competition!
Which two statements are true about data distribution and Teradata Database indexes?
(Choose two.)
A. If a table has 103 rows and there are 4 AMPs in the system, each AMP will not
have exactly the same number of rows from that table. However, if the Primary Index is
chosen well, each AMP will still contain some rows from that table.
B. The rows of a table are stored on a single disk for best access performance.
C. Skewed data leads to poor performance in processing data access requests.
Primary Index
A Primary Index (PI) is the physical mechanism for assigning a data row to an AMP
and a location on the AMP's disks. It is also used to access rows without having to
search the entire table. A Primary Index operation is always a one-AMP operation.
You specify the column(s) that comprise the Primary Index for a table when the table is
created. For a given row, the Primary Index value is the combination of the data values
in the Primary Index columns.
Choosing a Primary Index for a table is perhaps the most critical decision a database
designer makes, because this choice affects both data distribution and access.
The following rules govern how Primary Indexes in a Teradata Database must be
defined as well as how they function:
Each table must have a Primary Index. The Primary Index is the way the system
determines where a row will be physically stored. While a Primary Index may be
composed of multiple columns, the table can have only one (single- or multiple-column)
Primary Index.
Unique Primary Index (UPI) - For a given row, the combination of the data values
in the columns of a Unique Primary Index is not duplicated in other rows within
the table, so the combination is unique. This uniqueness guarantees even data
distribution and direct access. For example, in the case where old employee
numbers are sometimes recycled, the combination of the Social Security Number
and Employee Number columns would be a UPI. With a UPI, there is no duplicate
row checking done during a load, which makes it a faster operation.
Non-Unique Primary Index (NUPI) - For a given row, the combination of the data
values in the columns of a Non-Unique Primary Index can be duplicated in other
rows within the table. So, there can be more than one row with the same PI
value. A NUPI can cause skewed data, but in specific instances can still be a
good Primary Index choice. For example, either the Department Number column or
the Hire Date column might be a good choice for a NUPI if you will be accessing
the table most often via these columns.
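Both choices are made in the CREATE TABLE statement. The sketch below uses hypothetical HR tables and column names:

```sql
-- UPI: the column combination is unique for every row.
CREATE SET TABLE HR.Employee
( Employee_Number   INTEGER,
  SSN               CHAR(9),
  Department_Number INTEGER,
  Hire_Date         DATE )
UNIQUE PRIMARY INDEX (SSN, Employee_Number);

-- NUPI: duplicate values allowed; chosen because queries access by department.
CREATE SET TABLE HR.Assignment
( Employee_Number   INTEGER,
  Department_Number INTEGER )
PRIMARY INDEX (Department_Number);
```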
If the Primary Index is unique, you could have one row with a null value. If you have
multiple rows with a null value, the Primary Index must be Non-Unique.
The Primary Index value can be modified. In the table below, if Loretta Ryan changes
departments, the Primary Index value for her row changes.
When you update the index value in a row, the Teradata Database re-hashes it and
redistributes the row to its new location based on its new index value.
In the event that you need to change the Primary Index, you must drop the table,
recreate it with the new Primary Index, and reload the table.
The ALTER TABLE statement allows you to change the PI of a table if the table is
empty.
When a table is created, it must have a Primary Index specified. The Primary Index is
designated in the CREATE TABLE statement in SQL.
If you do not specify a Primary Index in the CREATE TABLE statement, the system
will use the Primary Key as the Primary Index. If a Primary Key has not been specified,
the system will choose the first unique column. If there are no unique columns, the
system will use the first column in the table and designate it as a Non-Unique Primary
Index.
As mentioned in the Primary Index rules, you cannot modify the Primary Index of a
populated table. To change the Primary Index, you must drop the table, recreate it
with the new Primary Index, and reload the data.
Data distribution
Data access
The Teradata Database uses hashing to randomly distribute data across all AMPs for
balanced performance. For example, in a two-clique system, data is hashed across all
AMPs in the system for even data distribution, which results in evenly distributed
workloads. Each AMP holds a portion of the rows of each table. An AMP is responsible
for the storage, maintenance, and retrieval of the data under its control. The Teradata
Database's automatic hash distribution eliminates costly data maintenance tasks. An
advantage of the Teradata Database is that the Teradata File System manages
data and disk space automatically, which eliminates the need to rebuild indexes
when tables are updated or structures change.
Loading data into a table (one or more rows, using a data loading utility)
Inserting or updating rows (one or more rows, using SQL)
Changing the system configuration (redistribution of data, caused by
reconfigurations to add or delete AMPs)
When loading data or inserting rows, the data being affected by the load or insert is not
available to other users until the transaction is complete. During a reconfiguration, no
data is accessible to users until the system is operational in its new configuration.
The process the system uses for inserting a row on an AMP is described below:
1. The system uses the Primary Index value in each row as input to the hashing
algorithm.
2. The output of the hashing algorithm is the row hash value (in this example, 646).
3. The system looks at the hash map, which identifies the specific AMP where the
row will be stored (in this example, AMP 3).
4. The row is stored on the target AMP.
It is possible for the hashing algorithm to end up with the same row hash value for two
different rows.
To differentiate each row in a table, every row is assigned a unique Row ID. The Row
ID is the combination of the row hash value and a uniqueness value.
The uniqueness value is used to differentiate between rows whose Primary Index
values generate identical row hash values. In most cases, only the row hash value
portion of the Row ID is needed to locate the row.
When each row is inserted, the AMP adds the row ID, stored as a prefix of the row. The
first row inserted with a particular row hash value is assigned a uniqueness value of 1.
The uniqueness value is incremented by 1 for any additional rows inserted with the
same row hash value.
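The distribution produced by a candidate Primary Index can be inspected with Teradata's hashing functions (HASHROW, HASHBUCKET, HASHAMP). A sketch against the hypothetical Employee table from earlier:

```sql
-- Row counts per AMP for a candidate Primary Index column.
-- A heavily uneven result indicates skew.
SELECT HASHAMP(HASHBUCKET(HASHROW(Department_Number))) AS amp_no,
       COUNT(*)                                        AS row_count
FROM   HR.Employee
GROUP BY 1
ORDER BY 1;
```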
Duplicate Rows
A duplicate row is a row in a table whose column values are identical to another row in
the same table. In other words, the entire row is the same, not just the index. Although
duplicate rows are not allowed in the relational model (because every Primary Key must
be unique), the ANSI Standard does allow duplicate rows and the Teradata Database
supports that.
Because duplicate rows are allowed in the Teradata Database, how does it affect the
UPI, which, by definition, is unique? When you create a table, the following definitions
determine whether or not it can contain duplicate rows:
MULTISET tables: May contain duplicate rows. The Teradata Database will not
check for duplicate rows.
SET tables: The default. The Teradata Database checks for and does not permit
duplicate rows. If a SET table is created with a Unique Primary Index, the check for
duplicate rows is replaced by a check for duplicate index values.
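The choice is made with a single keyword in the CREATE TABLE statement; names below are hypothetical:

```sql
-- SET (the default): the system rejects fully duplicate rows; with a UPI,
-- the duplicate-row check becomes a duplicate-index-value check.
CREATE SET TABLE Stage.Customer_Set
( Cust_Id   INTEGER,
  Cust_Name VARCHAR(30) )
UNIQUE PRIMARY INDEX (Cust_Id);

-- MULTISET: duplicate rows are permitted and not checked for.
CREATE MULTISET TABLE Stage.Customer_Multi
( Cust_Id   INTEGER,
  Cust_Name VARCHAR(30) )
PRIMARY INDEX (Cust_Id);
```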
When a user submits an SQL request against a table using a Primary Index, the request
becomes a one-AMP operation, which is the most direct and efficient way for the
system to find a row. The process is explained below.
Hashing Process
A NUPI with few duplicate values could provide good (if not perfectly uniform)
distribution, and might meet the other criteria better.
Known Access Paths - Use in value access: Retrievals, updates, and deletes
that specify the Primary Index are much faster than those that do not. Because a
Primary Index is a known access path to the data, it is best to choose column(s)
that will be frequently used for access. For example, the following SQL statement
would directly access a row based on the equality WHERE clause:
Join Performance - Use in join access: SQL requests that use a JOIN statement
perform the best when the join is done on a Primary Index. Consider Primary Key
and Foreign Key columns as potential candidates for Primary Indexes. For
example, if the Employee table and the Payroll table are related by the Employee
ID column, then the Employee ID column could be a good Primary Index choice for
one or both of the tables.
Non-volatile values: Look for columns where the values do not change frequently.
For example, in an Invoicing table, the outstanding balance column for all
customers probably has few duplicates, but probably changes too frequently to
make a good Primary Index. A customer ID, statement number, or other more
stable columns may be better choices.
When choosing a Primary Index, try to find the column(s) that best fit these criteria and
the business need.
Which three are key considerations in choosing a Primary Index? (Choose three.)
A. Column(s) containing unique (or nearly unique) values for uniform distribution.
B. Column(s) with values in sequential order for best load and access performance.
C. Column(s) frequently used in queries to access data or to join tables.
D. Column(s) with values that are stable (do not change frequently), to minimize
redistribution of table rows.
E. Column(s) with many duplicate values for redundancy.
A Partitioned Primary Index (PPI) can:
Improve performance for large tables when you submit queries that specify a
range constraint.
Reduce the number of rows to be processed by using a technique called partition
elimination.
Increase performance for incremental data loads, deletes, and data access when
working with large tables with range constraints.
Instantly drop old data and rapidly add new data.
Avoid full-table scans without the overhead of a Secondary Index.
With PPI, the order in which the rows are stored on the AMP is affected. Using the
traditional method, a Non-Partitioned Primary Index (NPPI), the rows are stored in row
hash order.
Using PPI, the rows are stored first by partition and then by row hash. In our example,
there are four partitions. Within the partitions, the rows are stored in row hash order.
With PPI, the Optimizer uses partition elimination to eliminate partitions that are not
included in the query. This reduces the number of partitions to be accessed and rows to
be processed. For example, in the table above, a query specifying the date 02/09 allows
the Optimizer to eliminate the other partitions so each AMP can access just the 02/09
partition to retrieve the rows.
For example, an insurance claims table could be partitioned by claim date and then sub-
partitioned by state. The analysis performed for a specific state (such as Connecticut)
within a date range that is a small percentage of the many years of claims history in the
data warehouse (such as March 2006) would take advantage of partition elimination for
faster performance.
Similarly, a retailer may commonly run an analysis of retail sales for a particular district
(such as eastern Canada ) for a specific timeframe (such as the first quarter of 2004) on
a table partitioned by date of sale and sub-partitioned by sales district.
To store rows using PPI, specify partitioning in the CREATE TABLE statement. Each
row runs through the hashing algorithm as normal, and is stored based on the Base
Table ID, the partition number(s), the row hash, and the Primary Index values.
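A sketch of the DDL, using a hypothetical retail sales table partitioned by month (RANGE_N is the Teradata partitioning expression):

```sql
-- One partition per month of 2004; rows are stored by partition, then row hash.
CREATE TABLE Retail.Sales
( Store_Id  INTEGER,
  Sale_Date DATE,
  Amount    DECIMAL(10,2) )
PRIMARY INDEX (Store_Id)
PARTITION BY RANGE_N(
  Sale_Date BETWEEN DATE '2004-01-01' AND DATE '2004-12-31'
            EACH INTERVAL '1' MONTH );
```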
Data Storage Using PPI
Let's say you have a table with Store information by Location and did not use a PPI. If
you query on Location 3 in this NPPI table, the entire table must be scanned to find the
records for Location 3 (a full-table scan).
Access Without a PPI
In the same example for a PPI table, you would partition the table with as many
Locations as you have (or soon will have). Then if you query on Location 3, each AMP
uses partition elimination and scans only partition 3. This query runs much faster than
the full-table scan in the previous example.
Access With a PPI
QUERY:
SELECT * FROM Store
WHERE Location_Number = 3;
PLAN:
All-AMPs - single-partition scan
Multi-level partitioning allows each partition of a PPI table to be sub-partitioned. With MLPPI you
can use multiple partitioning expressions, instead of only one, for a table or a non-compressed
join index. Each partitioning level is defined independently using a RANGE_N or CASE_N expression.
With a multi-level PPI (MLPPI), you create multiple access paths to the rows in the base table that
the Optimizer can choose from. This improves response to business questions by improving
the performance of queries which take advantage of partition elimination.
Note: an MLPPI table must have at least two partition levels defined.
Syntax:
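A sketch of the two-level DDL for the insurance claims example; table and column names are hypothetical:

```sql
-- Level 1: claim month (RANGE_N); level 2: state (CASE_N). A query on one
-- state within a narrow date range touches only the matching sub-partitions.
CREATE TABLE Ins.Claims
( Claim_Id   INTEGER,
  Claim_Date DATE,
  State_Code CHAR(2) )
PRIMARY INDEX (Claim_Id)
PARTITION BY (
  RANGE_N(Claim_Date BETWEEN DATE '2000-01-01' AND DATE '2006-12-31'
          EACH INTERVAL '1' MONTH),
  CASE_N(State_Code = 'CT',
         State_Code = 'NY',
         NO CASE) );
```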
A NoPI Table is simply a table without a primary index. It is a Teradata 13.00 feature. As
rows are inserted into a NoPI table, rows are always appended at the end of the table
and never inserted in the middle of a hash sequence. Organizing/sorting rows based
on row hash is therefore avoided.
Prior to Teradata Database 13.00, Teradata tables required a primary index. The primary
index was primarily used to hash and distribute rows to the AMPs according to hash
ownership. The objective was to divide data as evenly as possible among the AMPs to
make use of Teradata's parallel processing. Each row stored in a table has a RowID
which includes the row hash that is generated by hashing the primary index value. This
allows the optimizer, for example, to choose an efficient single-AMP execution plan for
SQL requests that specify values for the columns of the primary index.
Starting with Teradata Database 13.00, a table can be defined without a primary index.
This feature is referred to as the NoPI Table feature. NoPI stands for No Primary Index.
Without a PI, the hash value and AMP ownership of a row are arbitrary. Within the
AMP, there are no row-ordering constraints and therefore rows can be appended to the
end of the table as if it were a spool table. Each row in a NoPI table has a hash bucket
value that is internally generated. A NoPI table is internally treated as a hashed table; it
is just that typically all the rows on one AMP will have the same hash bucket value.
Benefits:
A NoPI table will reduce skew in intermediate ETL tables which have no natural PI.
Loads (FastLoad and TPump array insert) into a NoPI staging table are faster.
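The DDL difference is a single clause. A sketch with a hypothetical staging table:

```sql
-- NoPI tables are always MULTISET; rows are appended rather than
-- maintained in row hash order, which speeds staging loads.
CREATE MULTISET TABLE Stage.Load_Buffer
( Src_Key INTEGER,
  Payload VARCHAR(200) )
NO PRIMARY INDEX;
```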
Secondary Index
A Secondary Index (SI) is an alternate data access path. It allows you to access the data
without having to do a full-table scan. Secondary indexes do not affect how rows are
distributed among the AMPs.
You can drop and recreate secondary indexes dynamically, as they are needed. Unlike
Primary Indexes, Secondary Indexes are stored in separate subtables that require extra
overhead in terms of disk space, and maintenance which is handled automatically by the
system. So, Secondary Indexes do require some system resources.
In what instances would it be a good idea to define a Secondary Index for a table? (This
information will be covered in this module, but here is a preview.)
A. The Primary Index exists for even data distribution and access, but a Secondary
Index is defined to efficiently generate reports based on a different set of columns.
B. The Product table is accessed by the retailer (who accesses data based on the
retailer's product code column), and by a vendor (who accesses the same data based on
the vendor's product code column).
C. The table already has a Unique Primary Index, but a second column must also have
unique values. The column is specified as a Unique Secondary Index (USI) to enforce
uniqueness on the second column.
D. All of the above.
Several rules that govern how Secondary Indexes must be defined and how they
function are:
Rule 1: Optional SI
While a Primary Index is required, a Secondary Index is optional. If one path to the data
is sufficient, no Secondary Index need be defined.
You can define 0 to 32 Secondary Indexes on a table for multiple data access paths.
Different groups of users may want to access the data in various ways. You can define a
Secondary Index for each heavily used access path.
As with the Primary Index, the Secondary Index column may contain NULL values.
Secondary Indexes can be changed. Secondary Indexes can be created and dropped
dynamically as needed. When the index is dropped, the system physically drops the
subtable that contained it.
You can designate a Secondary Index that is composed of 1 to 64 columns. To use the
Secondary Index below, the user would specify both Budget and Manager Employee
Number.
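Secondary Indexes are created (and dropped) with their own DDL statements, separate from CREATE TABLE. A sketch reusing hypothetical HR tables:

```sql
-- USI: enforces uniqueness on a column that is not the Primary Index.
CREATE UNIQUE INDEX (SSN) ON HR.Employee;

-- Two-column NUSI, as in the Budget / Manager Employee Number example;
-- a query must supply both values to use this index.
CREATE INDEX (Budget, Mgr_Employee_Number) ON HR.Department;

-- Dropped dynamically when no longer needed; the subtable is removed.
DROP INDEX (Budget, Mgr_Employee_Number) ON HR.Department;
```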
Join Index
You can define a join index on one or several tables. Single-table join index functionality
is an extension of the original intent of join indexes, hence the confusing adjective "join"
used to describe a single-table join index.
Sparse Index
Any join index, whether simple or aggregate, multi-table or single-table, can be sparse. A
sparse join index uses a constant expression in the WHERE clause of its definition to
narrowly filter its row population.
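A sketch of a sparse join index, using a hypothetical Orders table; only rows satisfying the constant WHERE condition are materialized in the index:

```sql
CREATE JOIN INDEX Sales_DB.Orders_2004_JI AS
SELECT Order_Id, Order_Date, Cust_Id
FROM   Sales_DB.Orders
WHERE  Order_Date BETWEEN DATE '2004-01-01' AND DATE '2004-12-31';
```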
Hash Index
Hash indexes are used for the same purposes as single-table join indexes. Hash
indexes create a full or partial replication of a base table with a primary index on a
foreign key column to facilitate joins of very large tables by hashing them to the same
AMP.
You can only define a hash index on a single table. Hash indexes are not indexes in the
usual sense of the word. They are base tables that cannot be accessed directly by a
query.
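A hash index definition might be sketched as follows, assuming a hypothetical Orders table joined to Customer on a foreign key:

```sql
-- Replicate selected Orders columns, hashed on the foreign key
-- o_custkey, so joins to the Customer table become AMP-local.
CREATE HASH INDEX Ord_HI
  (o_custkey, o_orderdate, o_totalprice) ON Orders
  BY (o_custkey)
  ORDER BY HASH (o_custkey);
```

The Optimizer, not the user, decides when to use the hash index; queries continue to reference the base table.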
Value-Ordered NUSI
Value-ordered NUSIs are very efficient for range constraints and conditions with an
inequality on the secondary index column set. Because the NUSI rows are sorted by
data value, it is possible to search only a portion of the index subtable for a given range
of key values. Thus, the major advantage of a value-ordered NUSI is in the performance
of range queries.
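A value-ordered NUSI and a range query that can benefit from it might look like this sketch (names are illustrative):

```sql
-- Sort the index subtable rows by data value rather than by hash,
-- so a range predicate scans only the qualifying portion.
CREATE INDEX (Hire_Date) ORDER BY VALUES (Hire_Date) ON Employee;

-- A range constraint that can use the value-ordered subtable:
SELECT *
FROM Employee
WHERE Hire_Date BETWEEN DATE '2001-01-01' AND DATE '2001-06-30';
```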
Join Indexes
A Join Index is an optional index that a user may create. Join indexes provide
additional processing efficiencies.
The three basic types of join indexes commonly used with Teradata will be described
first:
Single-table join index: Distributes the rows of a single table on the hash value of a
foreign key column. This facilitates joining the foreign key table with the primary key
table without redistributing the data, and is useful for resolving joins on large
tables without having to redistribute the joined rows across the AMPs.
Multi-table join index: Pre-joins multiple tables; stores and maintains the result from
joining two or more tables. This facilitates join operations by possibly eliminating
join processing or by reducing/eliminating join data redistribution.
Aggregate join index (AJI): Aggregates one or more columns of a single table or
multiple tables into a summary table. This facilitates aggregation queries by
eliminating aggregation processing; the pre-aggregated values are contained in the
AJI instead of relying on base table calculations.
A join index is a system-maintained index table that stores and maintains the joined rows
of two or more tables (multiple table join index) and, optionally, aggregates selected
columns, referred to as an aggregate join index.
Join indexes are defined in a way that allows join queries to be resolved without
accessing or joining their underlying base tables. A join index is useful for queries where
the index structure contains all the columns referenced by one or more joins, thereby
allowing the index to cover all or part of the query. For obvious reasons, such an index is
often referred to as a covering index. Join indexes are also useful for queries that
aggregate columns from tables with large cardinalities. These indexes play the role of
pre-join and summary tables without denormalizing the logical design of the database
and without incurring the update anomalies presented by denormalized tables.
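A multi-table join index that could cover such queries might be sketched as follows (table and column names are illustrative):

```sql
-- Pre-join Customer and Orders; queries that reference only these
-- columns can be resolved from the join index without touching or
-- joining the base tables.
CREATE JOIN INDEX Cust_Ord_JI AS
  SELECT c.cust_id, c.cust_name, o.order_id, o.order_date
  FROM Customer c
  INNER JOIN Orders o ON c.cust_id = o.cust_id;
```

The Optimizer chooses the join index automatically when it covers the query; users never reference it directly.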
In the table below, users will be accessing data based on the Department Name column.
The values in that column are unique, so it has been made a USI for efficient access. In
addition, the company wants reports on how many departments each manager is
responsible for, so the Manager Employee Number can also be made a secondary
index. It has duplicate values, so it is a NUSI.
Secondary indexes are stored in index subtables. The subtables for USIs and NUSIs are
distributed differently:
USI: The Unique Secondary Indexes are hash distributed separately from the data
rows, based on their USI value. (As you remember, the base table rows are
distributed based on the Primary Index value). The subtable row may be stored on
the same AMP or a different AMP than the base table row, depending on the hash
value.
NUSI: The Non-Unique Secondary Indexes are stored in subtables on the same
AMPs as their data rows. This reduces activity on the BYNET and essentially
makes NUSI queries an AMP-local operation - the processing for the subtable and
base table are done on the same AMP. However, in all NUSI access requests, all
AMPs are activated because the non-unique value may be found on multiple
AMPs.
You can submit a request without specifying a Primary Index and still access the data.
The following access methods do not use a Primary Index:
When a user submits an SQL request using the table name and a Unique Secondary
Index, the request becomes a one- or two-AMP operation, as explained below.
USI Access
1. The SQL is submitted, specifying a USI (in this case, a customer number of 56).
2. The hashing algorithm calculates a row hash value (in this case, 602).
3. The hash map points to the AMP containing the subtable row corresponding to the
row hash value (in this case, AMP 2).
4. The subtable indicates where the base row resides (in this case, row 778 on AMP
4).
5. The message goes back over the BYNET to the AMP with the row and the AMP
accesses the data row (in this case, AMP 4).
6. The row is sent over the BYNET to the PE, and the PE sends the answer set on to
the client application.
As shown in the example above, accessing data with a USI is typically a two-AMP
operation. However, it is possible that the subtable row and base table row could end up
being stored on the same AMP, because both are hashed separately. If both were on
the same AMP, the USI request would be a one-AMP operation.
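A request that follows the USI access path above might look like this sketch, assuming a USI has been defined on the customer number column:

```sql
-- Equality predicate on the USI column: typically a two-AMP
-- operation (subtable AMP, then base-row AMP).
SELECT *
FROM Customer
WHERE cust_no = 56;
```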
When a user submits an SQL request using the table name and a Non-Unique
Secondary Index, the request becomes an all-AMP operation, as explained below.
NUSI Access
1. The SQL is submitted, specifying a NUSI (in this case, a last name of "Adams").
2. The hashing algorithm calculates a row hash value for the NUSI (in this case, 567).
3. All AMPs are activated to find the hash value of the NUSI in their index subtables.
The AMPs whose subtables contain that value become the participating AMPs in
this request (in this case, AMP1 and AMP2). The other AMPs discard the
message.
4. Each participating AMP locates the row IDs (row hash value plus uniqueness
value) of the base rows corresponding to the hash value (in this case, the base
rows corresponding to hash value 567 are 640, 222, and 115).
5. The participating AMPs access the base table rows, which are located on the same
AMP as the NUSI subtable (in this case, one row from AMP 1 and two rows from
AMP 2).
6. The qualifying rows are sent over the BYNET to the PE, and the PE sends the
answer set on to the client application (in this case, three qualifying rows are
returned).
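A request that follows the NUSI access path above might look like this sketch, assuming a NUSI has been defined on the last name column:

```sql
-- Equality predicate on the NUSI column: an all-AMP operation,
-- with subtable and base-row processing local to each AMP.
SELECT *
FROM Customer
WHERE last_name = 'Adams';
```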
In the Teradata Database, you can access data on any column, whether that column is
an index or not. You can ask any question, of any data, at any time.
If the request does not use a defined index, the Teradata Database does a full-table
scan. A full-table scan is another way to access data without using Primary or
Secondary Indexes. In evaluating an SQL request, the Optimizer examines all possible
access methods and chooses the one it believes to be the most efficient.
While Secondary Indexes generally provide a more direct access path, in some cases
the Optimizer will choose a full-table scan because it is more efficient. A request could
turn into a full-table scan when:
An SQL request searches on a NUSI column with many duplicates. For example, if
a request using last names in a Customer database searched on the very
prevalent "Smith" in the United States, then the Optimizer may choose a full table
scan to efficiently find all the matching rows in the result set.
An SQL request uses a range WHERE clause on an index column. For example, if
a request searched an Employee database for all employees hired between
January 2001 and June 2001, then a full-table scan would be used, even if the
Hire_Date column is an index.
To use a multi-column index, a request must specify a value for each column in the index,
or the Teradata Database will do a full-table scan. A full-table scan is an all-AMP
operation: every data block must be read, and each data row is accessed only once. As
long as the choice of Primary Index has caused the table rows to distribute evenly across
all of the AMPs, the parallel processing of the AMPs working simultaneously can
accomplish the full-table scan quickly. However, if the Primary Index causes skewed data
distribution, all-AMP operations will take longer.
While full-table scans are impractical and even disallowed on some commercial
database systems, the Teradata Database routinely permits ad-hoc queries with full-
table scans.
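You can see which access method the Optimizer chose by prefixing a request with EXPLAIN, which reports the plan without executing it (the table and column names here are illustrative):

```sql
-- A predicate on a non-indexed column forces a full-table scan;
-- EXPLAIN shows the Optimizer's chosen access method and cost
-- estimates without running the query.
EXPLAIN
SELECT *
FROM Customer
WHERE city = 'San Diego';   -- city is assumed to be non-indexed
```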
When choosing between a NUSI and a full-table scan, if the optimizer determines that
there is no selective SI, hash or join index and that most of the rows in the table would
qualify for the answer set if a NUSI were used, it would most likely choose the full-table
scan as the most efficient access method.
If statistics are stale or have not been collected on the NUSI column(s), the optimizer
may choose to do a full-table scan, as it does not have updated data demographics.
Some fundamental differences between Keys and Indexes are shown below:
Keys:
- A relational modeling convention used in a logical data model.
- Uniquely identify a row (Primary Key).
- Establish relationships between tables (Foreign Key).
Indexes:
- A Teradata Database mechanism used in a physical database design.
- Used for row distribution (Primary Index).
- Used for row access (Primary Index and Secondary Index).
While most commercial database systems use the Primary Key as a way to retrieve
data, a Teradata Database system does not. In the Teradata Database, you use the
Primary Key only when designing a database, as a mechanism for maintaining
referential integrity according to relational theory. The Teradata Database itself does not
require keys in order to manage the data, and can function fully with no awareness of
Primary Keys.
The Teradata Database's parallel architecture uses Primary Indexes to distribute and
access the data rows. A Primary Index is always required when creating a Teradata
Database table.
A Primary Index may include the same columns as the Primary Key, but does not have
to. In some cases, you may want the Primary Key and Primary Index to be different. For
example, a credit card account number may be a good Primary Key, but customers may
prefer to use a different kind of identification to access their accounts.
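In a table definition, the Primary Index is simply whatever the PRIMARY INDEX clause names; it need not match the logical Primary Key. A sketch along the lines of the credit card example (all names are illustrative):

```sql
-- card_no is the logical Primary Key, but rows are distributed and
-- accessed by the phone number customers actually give when calling.
CREATE TABLE Credit_Account
  ( card_no    DECIMAL(16,0) NOT NULL  -- logical Primary Key
  , cust_phone CHAR(10)                -- customer-facing identifier
  , balance    DECIMAL(12,2)
  )
PRIMARY INDEX (cust_phone);            -- NUPI chosen for the access path
```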
A summary of the rules for keys (in the relational model) and indexes (in the Teradata
Database) is shown below.
Rule 4 (changing values): Primary Key values should not change. Foreign Key values
may be changed. Primary Index values may be changed (redistributes the row).
Secondary Index values may be changed.
Rule 5 (changing columns): A Primary Key column should not change. A Foreign Key
column should not change. A Primary Index column cannot be changed (drop and
recreate the table). A Secondary Index may be changed (drop and recreate the index).
Rule 6 (column limit): Primary Keys and Foreign Keys have no column limit. Primary
Indexes and Secondary Indexes have a 64-column limit.
Although Primary Indexes are required and Primary Keys are not, you do have the
option to define a Primary Key or Foreign Key for any table. When you define a Primary
Key in a Teradata Database table, the RDBMS will implement the specified column(s) as
an index. Because a Primary Key requires unique values, a defined Primary Key is
implemented as one of the following:
Unique Primary Index (If the DBA did not specify the Primary Index in the CREATE
TABLE statement)
Unique Secondary Index (If the PK was not chosen to be the PI)
When a Primary Key is defined in Teradata SQL and implemented as an index, the rules
that govern that type of index now apply to the Primary Key. For example, in relational
theory, there is no limit to the number of columns in a Primary Key. However, if you
specify a Primary Key in Teradata SQL, the 64-column limit for indexes now applies to
that Primary Key.
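A table that declares both might be sketched as follows (names are illustrative):

```sql
-- The declared PRIMARY KEY is not the PRIMARY INDEX here, so the
-- Teradata Database implements it as a Unique Secondary Index.
CREATE TABLE Department
  ( dept_no   INTEGER NOT NULL
  , dept_name CHAR(30)
  , PRIMARY KEY (dept_no)       -- implemented as a USI in this table
  )
PRIMARY INDEX (dept_name);      -- rows are distributed on dept_name
```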
A. A Primary Index is used to distribute data, while a Primary Key is used to uniquely
identify a row.
B. A Primary Key is used to access data, while a Primary Index is used to uniquely
identify a row.
C. In a Teradata Database system, "Primary Key" means the same thing as "Primary
Index."
D. A Primary Index is used to distribute data, while a Primary Key is converted to a
hash map.
Feedback:
Exercise 6.1
A. UPI
B. NUPI
C. Both UPI and NUPI
D. Neither UPI nor NUPI
Feedback:
Exercise 6.2
A. hash map
B. uniqueness value
C. row ID
D. row hash
Feedback:
Exercise 6.3
Choose the appropriate answers from the drop-down boxes that complete each sentence:
Accessing a row with a Unique Secondary Index (USI) typically requires ___ AMP(s).
Accessing a row with a Non-Unique Secondary Index (NUSI) requires ___ AMP(s).
A full-table scan accesses ___ row(s).
Accessing a row with a Unique Primary Index (UPI) accesses ___ row(s) on one AMP.
Accessing a row with a Non-Unique Primary Index (NUPI) accesses multiple rows on
___ AMP(s).
Feedback:
To review these topics, click Accessing a Row With a Primary Index, Accessing Data with a USI,
Accessing Data with a NUSI, Full-Table Scan - Accessing Data Without Indexes.
Exercise 6.4
Which column should be selected as the Primary Index in the CUSTOMER table below? The table
contains information on 50,000 customers of this regional telecommunication services company.
Whenever a customer calls, the call center operator should be able to easily access and confirm
customer information. In addition, the company wants to track all service activities on a per-
household basis. Select the best Primary Index for the business use.
A. Column 4, because each address is clearly a household, which is what is being tracked.
B. Column 5, because it is nearly unique, easy to remember and input, and can be used for
householding.
C. Column 2, because most of the customers with the same last name belong to a single
household.
D. Columns 2 and 3 together, because the combination is nearly unique, and it is easy for the
customer to remember.
E. Column 1, because it is the Primary Key and its unique values will cause table rows to be
distributed evenly for best performance. Customers must give their Customer ID when calling for
service.
Feedback:
Exercise 6.5
A. even distribution of rows
B. Unique Primary Index
C. multi-AMP request
D. hash synonym
Feedback:
To review this topic, click Distributing Rows to AMPs or Accessing a Row With a Primary Index.
Exercise 6.6
A. Select Primary Indexes
B. Re-organize data
C. Pre-prepare data for loading
D. Pre-allocate table space
Feedback:
Exercise 6.7
elimination?
A. Multi-Level Partitioned Primary Index (MLPPI)
B. NUPI
C. Partitioned Primary Index (PPI)
D. NoPI
Feedback:
Exercise 6.8
A. True
B. False
Feedback:
Exercise 6.9
A. True
B. False
Feedback:
Teradata Certification
Now that you have learned about the Teradata Database basics, consider the first
level of Teradata Certification, Teradata Certified Professional. Information on the
Teradata Certified Professional Program (TCPP) including exam objectives,
practice questions, test center locations, and registration information is located
on the Teradata Certified Professional Program (TCPP) website. Candidates for
the Teradata Certified Professional Certification must pass the Teradata 12 Basics
Certification exam administered at Prometric testing centers listed on the TCPP
website.
We recommend you review the WBT content and the practice questions located
on the TCPP website before signing up for the official Teradata 12 Basics
Certification exam.