
Data Warehousing

Data Mining Primer


for the Data Warehouse Professional

By:
Arlene Zaima
Data Mining Marketing Manager
Teradata
Contributor:
James Kashner
CTO
Teradata Data Mining Lab

Table of Contents

Executive Summary
What Exactly is Data Mining?
Data Mining Makes Its Way to the Business World
What Can Data Mining Do for Your Business?
The Difference Between OLAP and Data Mining
How Does Data Mining Work?
The Data Mining Process
The Relationship Between Data Mining and Data Warehousing
Data Mining Terms and Techniques
Data Mining Challenges
Data Mining with Teradata
Teradata Warehouse Miner
Teradata Data Mining Labs
The Data Mining Lab Engagement
How to Get Started with Data Mining
Driving Higher ROI
Summary

Executive Summary

By now, you've probably heard or read about the rewards that data mining can bring to your business. But very little has been written to explain the challenges facing many Information Technology (IT) organizations as they try to make data mining part of their business intelligence operations. This paper explores data mining from the IT perspective, giving a quick overview of the data mining technology, technical challenges, and solutions for implementing successful data mining projects.

This white paper explains data mining in terms that can be understood by data warehouse professionals. These explanations include:
> How data mining is used for business advantages today
> The integral relationship between data mining and data warehousing
> The challenges that may be encountered with data mining
> The details about how to get started with data mining

What Exactly is Data Mining?

Data mining is a powerful technology that converts detail data into competitive intelligence that businesses can use to proactively predict future trends and behaviors. Some vendors define data mining as a tool or as the application of an algorithm to data. But the truth is, data mining is not just a tool or an algorithm. Data mining is a process of discovering and interpreting previously unknown patterns in data to solve business problems. Data mining is an iterative process in which each cycle further refines the result set. This can be a complex process, but there are tools available today to help you navigate through the steps of the data mining process.

From an IT perspective, the data mining process requires exploration of data, creating the analytic data set, building and testing the model, and integrating the results into business applications. Therefore, the IT organization must provide an environment capable of addressing the following challenges:
> Exploring and preprocessing of large data volumes
> Sufficient processing power to efficiently analyze many variables (columns) and rows in a timely manner
> Integrating data mining results into the business process
> Creating an extensible and manageable data mining environment

Figure 1. Business Value of Analytic Applications.

Analytic Application | Business Question | Business Value
Customer Segmentation | What market segments do my customers fall into and what are their characteristics? | Personalize customer relationships for higher customer satisfaction and retention.
Propensity to Buy | Which customers are most likely to respond to my promotion? | Target customers based on their need to increase their loyalty to your product line. Also, increase campaign profitability by focusing on the most likely to buy.
Customer Profitability | What is the lifetime profitability of my customer? | Make individual business interaction decisions based on the overall profitability of customers.
Fraud Detection | How can I tell which transactions are likely to be fraudulent? | Quickly determine fraud and take immediate action to minimize cost.
Customer Attrition | Which customer is at risk of leaving? | Prevent loss of high-value customers and let go of lower value customers.
Channel Optimization | What is the best channel to reach my customer in each segment? | Interact with customers based on their preference and your need to manage cost.

Data Mining Makes Its Way to the Business World

Since the mid 1980s, data mining has been very effective in select and focused situations such as medical diagnosis, scientific research, and behavioral profiling. In the past ten years, data mining technology has journeyed from the scientific and academic worlds into the business world where it adds a new dimension of predictive


analysis. To be effective in the business world, the data mining process had to be adapted to deliver models in a more time-sensitive manner. Today, with the advent of in-database data mining techniques, businesses have finally found it possible to benefit from the complex, predictive characteristics of a very powerful technology.

What Can Data Mining Do for Your Business?

For years, businesses have relied on reports and ad hoc query tools to glean useful information from their data. However, as data volumes continue to increase, finding valuable information becomes a daunting task. Data mining technology was designed to sift through detailed historical data to identify hidden patterns that are not obvious to humans or query tools. Many of these previously hidden patterns reveal intelligence that can be integrated into business processes to provide predictive capabilities that lead to strategic business decision-making.

Data mining makes analytical business applications, such as CRM, smarter by providing insight that goes beyond just the obvious knowledge. By making your applications smarter, data mining translates into a higher return on your warehouse investment. (See Figure 1).

The Difference between OLAP and Data Mining

A commonly asked question is "What is the difference between data mining and on-line analytical processing (OLAP)?" OLAP is a business intelligence tool that allows you to analyze and understand particular business drivers. Typically, a specific descriptive or factual question is formulated and either validated or refuted through ad hoc queries. OLAP results are factual results. For example, you may ask, "How many size 7 shoes did I sell in the past three months?" The results are factual answers that enable you to validate your hypothesis or order decision. But what happens if you have hundreds of variables to analyze? It becomes difficult to formulate a good hypothesis or relationship among your data. In addition, OLAP tools don't produce predictive or estimated values with associated accuracy expectations.

Figure 2. Differences between OLAP and Data Mining.

OLAP | Data Mining
Typically focuses on current facts | Typically focuses on future outcomes or trends
Commonly uses predefined aggregate data | Requires detail data
Verification driven/Factual results | Discovery driven
Ad hoc queries and reports | Statistical and machine learning techniques

Data mining, on the other hand, is a form of discovery driven analysis where statistical and machine learning techniques are used to make predictions or estimates about outcomes or traits before knowing their true values. Data mining techniques are used to find meaningful, often complex, and previously unknown patterns in data. For example, you may ask, "How many size 7 shoes should I order for the next season?" Data mining techniques can be used to build models based on detail data to predict the number of size 7 shoes sold within a given time period. Typically, OLAP analyses use predefined, summarized or aggregated data, such as multi-dimensional cubes, where data mining requires detail data that is aggregated to optimal levels and analyzed at the individual record level.

Although these technologies are used for different purposes, OLAP and data mining are complementary. During the data mining exploration phase, you may


use OLAP technology to help you understand your data. Data mining results can also be used in OLAP applications by incorporating new predictive variables or scores as dimensions or attributes in your OLAP tool. For example, if you calculate a new predictive variable called "Customer Value" that characterizes the value of a customer to your business in terms of profitability, you can include this new variable as an attribute in your OLAP tool. When retailers analyze which products to stock, they can consider products that attract high-value or profitable customers. (See Figure 2).

How Does Data Mining Work?

Data mining leverages artificial intelligence and statistical techniques to build models. Data mining models are built from situations where you know the outcome. These models are then applied to other situations where you don't know the outcome. For example, if your data warehouse identifies customers who have responded to past marketing campaigns, you can create a model that identifies the characteristics of those customers. This model can be applied to a wider customer database, identifying customers who demonstrate the same characteristics, allowing you to target those likely to respond, thereby improving response rates and reducing marketing costs.

Business problems that lend themselves to data mining are predictive and descriptive in nature. Predictive models are used to predict an outcome, referred to as the dependent or target variable, based on the value of other variables in the data set. For example, a predictive model could determine the likelihood that a customer will purchase a product based on her income, number of children, current product ownership, or debt. Predictive techniques build models based on a training set of data with a known outcome, such as prior buying patterns. The algorithm analyzes the values of all input variables and identifies which variables are significant as predictors for a desired outcome.

Unlike predictive models, descriptive models do not predict variables based on known outcomes, but rather, describe a particular pattern that has no known outcome. Common techniques include data visualization where large volumes of data are reduced to a picture that can be easily understood. Another common descriptive technique is clustering, where data are grouped into subsets based on common attributes. For example, you may use descriptive techniques to determine customer segments and their attributes.

Figure 3. The Data Mining Process.

Define the Business Problem
> Define the business objectives
> Examine the data
> Define initial approach
> Scope project

Explore and Preprocess Data
> Data acquisition
> Data exploration
> Data selection
> Data transformation

Develop Model
> Design model
> Train, test and validate model
> Interpret and evaluate

Deploy Knowledge
> Deploy model
> Reports
> Application integration

In many cases, both descriptive and predictive models are used to solve business problems. A descriptive technique may identify customer segments based on value in terms of profitability to


your business, and a predictive technique may identify the likelihood that a particular segment will defect to your competitor. By combining results of the descriptive technique to predict customer defection, you can act to prevent attrition of your high-value customers.

The Data Mining Process

You cannot buy a data mining product, apply it to data, and expect to generate a meaningful model. Data mining models are built as part of a data mining process, an on-going process requiring maintenance throughout the life of the model. The data mining process is not linear, but an iterative process where you loop back to the previous phase. For example, the initial model you create may lead to insight requiring you to return to the data preprocessing phase to create new analytical variables.

The data mining process contains four high-level steps: Define the Business Problem, Explore and Preprocess the Data, Develop the Data Model, and Deploy Knowledge (See Figure 3). Tasks for each step are listed in the diagram to provide a brief overview of the data mining process. We will discuss the data mining process in-depth when we define the Teradata data mining methodology. Although each step is important, most of your time will be spent in the data exploration and preprocessing phase. A well-structured data warehouse can significantly reduce the pain felt in this phase.

The Relationship between Data Mining and Data Warehousing

Data mining is all about data. You can mine inconsistent or dirty data, and find patterns. However, the patterns will be meaningless if your data do not accurately reflect the business you are modeling. The key to data mining is ensuring that you have a foundation of good, quality data that is cleansed, consistent, and accurate. A data warehouse provides the right foundation for data mining. Although data mining can be done without having a warehouse in place, the process of gathering, cleansing, and transforming the data from multiple data sources can be arduous. Once the process has been completed for one model, you must repeat the process for subsequent data mining projects. Approximately 70% of the data mining process involves accessing, exploring, and preparing the data. The data warehouse makes data mining more viable by removing many of the data redundancy and system management issues. This allows people to focus on analysis.

Data Mining Terms and Techniques

This section briefly describes a few data mining terms and techniques commonly used to solve predictive and descriptive analytical problems.

Analytic Model
A model is a set of logical rules or a mathematical formula that represents patterns found in data that are useful for a business purpose. Once a model has been built based on one set of data, it can be reused to search for the discovered patterns in other similar data. Models are sometimes called predictive models since they can be used to predict behaviors that relate to the discovered patterns.

Association
This modeling technique is commonly referred to as affinity analysis and is used to identify items that occur together during a particular event. For example, affinity analysis is commonly used to study market baskets by identifying which combinations of products are most likely to be purchased together. Another form of this technique is sequence analysis, a variation on affinity analysis. Using sequence analysis, you could begin to understand the order in which customers tend to purchase specific products. These results may be helpful in the early phases of establishing a potential cross-selling strategy.

Clustering
Clustering is a type of modeling technique that can be used to place items into groups based on like characteristics. The goal of clustering is to create groups of items that are similar based on their attributes within a given group, but which are very different


from items in other groups. Clustering is frequently used to create customer segments based on a customer's behavior or other characteristics. Customers in the same segment share similar characteristics and tend to behave consistently. Knowledge of the typical behavior of a particular segment can be powerful information if you want to predict the behavior of an unknown member of that segment.

Data Visualization
This process takes large amounts of data and reduces them into more easily interpreted graphs, charts, or tables. Instead of large sets of numbers, colored pictures tell the story with clarity.

Decision Tree
This technique produces a tree-shaped structure that represents a set of decisions to predict a value of the target variable. This algorithm leverages a variety of techniques to separate or classify data based upon rules. Decision Trees are commonly used to model good/bad risk or loan approval/rejection because the models are represented by rules that humans easily understand. Although each rule might be easily understood, some decision trees contain thousands of rules, requiring data mining tools with good visualization techniques to interpret many rules appropriately.

Linear Regression
A statistical technique used to find the best-fitting linear relationship between a numeric target variable and its set of predictor variables. Linear regression can be used to predict the amount of overdraft protection to offer a customer based on their account balances, years of service, and other characteristics.

Logistic Regression
A statistical technique used to find the best-fitting linear relationship between a categorical target variable and a set of predictors. It is commonly used to predict "Yes or No" questions, such as whether or not a particular transaction is likely to be fraudulent.

Neural Networks
This is a non-linear predictive modeling technique, loosely based on the structure of the human brain, that learns through training. This technique is commonly used to predict a future outcome based on historic data. However, it frequently requires substantial expertise to understand the rationale for the decisions and predictions it makes. The Neural Network is sometimes referred to as a "black box" because it produces a model that is less understandable, but often more accurate.

Score
A score is an outcome of a model that represents a predicted or inferred value on some trait or characteristic of interest. You can think of a score as the result of the model. If your model calculates the customer value, the score for each customer may be a number that indicates a value of a particular customer.

Data Mining Challenges

Although it may appear that data mining is the next logical step for companies that have already implemented their data warehouse, the reality is that many businesses struggle with getting their data mining projects to deliver meaningful results. To be successful, data mining requires the right team, the right methodology, the right architecture and the right technology.

The Right Team

A big challenge to bringing data mining into the company as an internal corporate service is developing the skill sets required by the data mining team. Data mining projects must be a collaborative effort driven by business experts, developed by analytic modelers, and supported by IT. Your internal skill sets may be developed over time, which may mean initially hiring data mining consultants to develop your data mining capability with the ultimate objective of transferring knowledge to your team. To ensure a successful data mining outcome, you will need the following experts on the team:

Business Domain Experts
It's imperative to have the business analysts involved in the data mining project. They should be the champions and drivers of every data mining project. They are the ones who need the answers that result from the project, and therefore, they are the people who must clarify the business issues to be solved by the project.


The business experts should ultimately be held accountable for the results of the data mining project.

The skill sets needed by the business domain experts include:
> Ability to ask and answer strategic questions
> Intimacy with enterprise data (accessing and manipulating it for analysis and forecasting)
> Ability to clarify outcomes and expectations for thorough evaluation and validation of analytic models
> Expertise with certain data analysis tools (Excel, OLAP)
> Background in statistical techniques for forecasting and strategic planning

Information Technology Support
Information technology support is critical to the success of the data mining project. The IT organization responsible for the data warehouse provides the bulk of the IT support; however, other groups may be called upon to assist with data cleansing and model integration.

The skills needed by an information technologist include:
> Data expertise combined with business understanding
> The ability to find, access, and manipulate data
> Detailed understanding of data structure and transformations
> Technical expertise for evaluating, installing, and maintaining the tool environment
> Application expertise for effectively deploying analytic models into the business environment, the data warehouse, and the operational and application environments

Analytic Modelers/Data Miners
Analytic Modelers/Data Miners are responsible for preparing the data, designing the model, building the model, and deploying it against the data. The analytic modeler works with the IT organization to integrate the model into the decision support infrastructure and business processes.

The skills needed by a Data Miner/Analytic Modeler include:
> Expertise in statistics and/or artificial intelligence
> Successful application of advanced algorithms in a real-world setting
> An understanding of the business domain (otherwise business domain experts can provide this support)

The Right Methodology

Data mining, like data warehousing, is an ongoing process that must be maintained and changed as business drivers change. The key to a successful data mining project is to base it on a proven methodology. Teradata's data mining methodology has delivered successful models that have uncovered millions of dollars in revenue and cost savings for customers. This section defines the Teradata data mining methodology. Although all tasks are equally important, for the purpose of this paper, we will focus primarily on the activities that affect the data warehouse. (See Figure 4).

Figure 4. Teradata Data Mining Methodology.
[Diagram labels: Project Management; Business Problem Definition; Architecture and Technology Preparation; Data Preparation; Model Development, Test and Validation; Knowledge Discovery and Deployment; Knowledge Transfer]


Project Management
Every successful project requires clearly defined objectives, requirements, deliverables and resources. Data mining projects are no exception. Project management activities are required throughout the project's life. The project manager ensures the project will produce satisfactory deliverables from both a technical and business perspective. Basic project management tasks include:
> Align the scope and expectations of the project
> Ensure communication among team members
> Develop a project plan
> Coordinate documentation and interim deliverables
> Coordinate application development activities and tasks
> Assess project effectiveness
> Close out the project

Business Problem Definition
Successful data mining begins with a clearly defined business objective. Without clearly defined business objectives, the data mining project will likely lead nowhere. For example, increasing your customer base is a very different objective than increasing the number of your most valuable customers. Everything from data preprocessing to model selection is driven by your business objective. The business problem is described in operational terms so you can determine initial data availability and the analytic approach.

Architecture and Technology Preparation
Before tackling a data mining problem, you must understand the development and implementation requirements for the analytic models. These requirements determine how the models are built, what software is required, and whether or not new hardware is required. In most cases, your development and production environments will be different. However, you may leverage the same environment with appropriate resources. There are several techniques for building models. Based on your environment and requirements, the right balance of client/server and/or in-database mining must be chosen.

Data Preparation
This is the most time consuming step, but also the most critical. You must first collect all the data necessary for your project. If you have an enterprise warehouse, you're in luck. However, you may still need to pull data from different sources. First, examine your data sources to see what is available to address the business problem. Second, ensure that your data is computationally valid and consistent. For example, if you are pulling from different data sources, you must resolve conflicts among data, which can be a daunting task. To avoid these issues, we highly recommend starting with a data warehouse where these conflicts are resolved.

Once you have gathered data from the different sources, your next step is to explore your data. This is often called exploratory data analysis. Data visualization and descriptive statistical techniques are used to uncover data quality issues and to better understand the characteristics of your data. You may uncover data quality issues or missing data, which can jeopardize the integrity of any analytic model, so you must compensate for, if not correct, the data issue. For example, if you are missing values, you must determine the best method for filling in missing data. You could consider using a data mining technique to predict the value of a missing variable based on other data points.

Next, you must isolate and prepare your data for the particular model. You may exclude outliers for some models, whereas you may build a model based on outliers. For example, if you were predicting baseball attendance and revenue, you would need to exclude abnormal attendance data, such as attendance data from 1994, the year of the baseball players' strike. In other cases, such as fraud detection, you should include outliers since they may represent fraudulent transactions.

Once you have selected your data, some level of transformations may be required. Detail data, as they exist in the data warehouse, are not necessarily ready for data mining. You may want to derive optimal aggregations or new analytic variables to build a better model. For example, debt-income ratio may be a better predictor than just debt or income. Some statistical techniques and algorithms


also require numeric data or data within a certain range. For those variables, you need to recode or transform them into the appropriate input variable for the data mining technique.

Model Development, Test, and Validation
The next step is to build an analytical model, an iterative process of applying analytical techniques to the analytical data set and interpreting mathematical equations. The resulting equations are refined as iterations are performed. Each iteration provides higher statistical and conceptual confidence in the results.

Earlier in the process, you identified a preliminary analytical approach required to solve the business problem. Now you must select the specific analytical algorithms or statistical techniques that are most appropriate for building your model. Your selection of specific analytical techniques often requires you to revisit some aspects of data preprocessing that you performed in the previous step.

Once you've selected the algorithms, it's time to build the model. Building an analytical model requires at least three broad steps: (a) training or fitting (b) testing (c) validation. This requires you to segment your data into at least the following three different data sets: (a) training (b) test (c) validation. Your model is built using the training data, and then tested using the test data to assess the model accuracy. The data mining tool you use should have sufficient model, parameter, and row-level diagnostics that allow you to identify and understand specific strengths and weaknesses in your model during these first two steps. After you've refined your model based upon the diagnostics, it's time to validate your model.

Model validation is a process by which an analytical modeler attempts to establish and maximize a generalizable model beyond the data set with which the model was created. The validation data are used as an independent source of information to assess the degree to which your model's accuracy might be overstated. Overstated accuracy is frequently referred to as overfitting, a case where a model is built to closely fit the training and test data, but not the data that you intend to score. Overfitting has a direct and adverse effect on the usefulness, or validity, of your model. For example, if you build a granular model where the rules or formulas are so specific to a single instance (e.g., income=$50,000, gender=F, marital status=divorced, age=28, first name=June, hair color=red, number of children=3, cat=0, dogs=2, etc.) your accuracy for the training and test sets can be 100%. However, when this model is applied to another data set, accuracy of results is almost guaranteed to be horrendous. If the rules or formulas in a model are so tightly bound to any particular data set, then you won't be able to use the model for the purpose for which you built it: to produce scores for data with unknown outcomes that you want to predict with high confidence. The amount of effort put into maximizing the validity of a model is directly proportional to its business value.

The analytical models are tested using statistical techniques, models developed from different analytical techniques are compared, and the results for these models are further validated against the business criteria for the project. Once you develop the model, you must also establish a process to validate and to refresh the model as the data changes. It's also necessary to monitor the continuing business validity of the analytical models.

Knowledge Delivery and Deployment


Knowledge derived through analytical
models unlocks the ROI from your
warehouse. There are several methods
for deploying the models. Your IT organization may run the model and deliver the
results to your business users for business
decisions. The model or intelligence
generated from the model can also be
integrated into your customer relationship
management (CRM) or analytical applications to facilitate business user access to
the results. Regardless of your implementation, data mining adds intelligence to
your business in the form of scores,
predictions, descriptions, and profiles.
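
The training, test, and validation sets described under Model Development, Test, and Validation, and the overfitting risk they guard against, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the synthetic customer records, the 60/20/20 split, and the two toy models (CutoffModel and MemorizingModel are hypothetical names, not part of any Teradata product). The memorizing model keeps one rule per exact training record, much like the over-specific rule in the example above, so it only looks accurate on the data it was built from.

```python
import random


def make_customers(n, seed=0):
    """Build synthetic, purely illustrative customer records with a known outcome."""
    rng = random.Random(seed)
    rows = []
    for customer_id in range(n):
        income = rng.randint(20, 120)  # annual income in thousands (assumed scale)
        children = rng.randint(0, 4)
        responded = income >= 70       # assumed underlying pattern
        if rng.random() < 0.10:        # 10% noise: some outcomes break the pattern
            responded = not responded
        rows.append({"id": customer_id, "income": income,
                     "children": children, "responded": responded})
    return rows


def split(rows, train_share=0.6, test_share=0.2):
    """Segment the data into training, test, and validation sets."""
    n_train = int(len(rows) * train_share)
    n_test = int(len(rows) * test_share)
    return (rows[:n_train],
            rows[n_train:n_train + n_test],
            rows[n_train + n_test:])


def accuracy(model, rows):
    """Share of rows whose predicted outcome matches the known outcome."""
    hits = sum(model.predict(row) == row["responded"] for row in rows)
    return hits / len(rows)


class CutoffModel:
    """One general rule: respond if income is at or above a cutoff."""

    def __init__(self, cutoff=0):
        self.cutoff = cutoff

    def fit(self, rows):
        # Try every candidate cutoff and keep the most accurate on the training set.
        candidates = [CutoffModel(c) for c in range(20, 121)]
        self.cutoff = max(candidates, key=lambda m: accuracy(m, rows)).cutoff
        return self

    def predict(self, row):
        return row["income"] >= self.cutoff


class MemorizingModel:
    """Overfit on purpose: one rule per exact training record."""

    def fit(self, rows):
        self.rules = {(r["id"], r["income"], r["children"]): r["responded"]
                      for r in rows}
        return self

    def predict(self, row):
        key = (row["id"], row["income"], row["children"])
        return self.rules.get(key, False)  # no matching rule: assume no response


def run_demo():
    """Fit both models on the training set and report accuracy on all three sets."""
    training, test, validation = split(make_customers(1000))
    report = {}
    for name, model in (("cutoff", CutoffModel()),
                        ("memorizing", MemorizingModel())):
        model.fit(training)
        report[name] = {"training": accuracy(model, training),
                        "test": accuracy(model, test),
                        "validation": accuracy(model, validation)}
    return report


if __name__ == "__main__":
    for name, scores in run_demo().items():
        print(name, {k: round(v, 3) for k, v in scores.items()})
```

In this sketch the memorizing model is perfect on the training set because every training record has its own rule, while its accuracy on records it has never seen falls well below that of the single-cutoff model. A gap of that kind between training accuracy and accuracy on held-out data is the signal that the validation step is meant to surface before a model is deployed for scoring.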

Data Mining Primer


Knowledge Transfer
One of the unique components of the Teradata data mining methodology is
knowledge transfer. Knowledge transfer spans the entire data mining project,
beginning with the initial interviews with each data mining team member to
determine their professional knowledge transfer objectives for the project.
Mentoring and education throughout the data mining project arm the data mining
team with the necessary modeling and process knowledge to interpret results,
maintain the modeling environment, and monitor the analytical model.

The Right Architecture
There are several data mining architectures commonly used today. They include the
distributed, independent data mart; the data warehouse with dependent data
marts; and the centralized data warehouse and mining architectures (see Figure 5).
Each architecture is described below.

Figure 5. Data Mining Architecture. (Diagram panels: Distributed Sources and
Analytic Data Marts; Data Warehouse with Analytic Data Marts; Centralized Data
Mining. Components shown: Sources, Data Warehouse, Analytic Data Marts, Desktop
Client.)

Distributed, Independent Data Marts
The Distributed Sources with Analytic Data Marts method requires data to
be extracted from multiple sources to analytical servers. Data gathered from
various sources must be converted into a common and consistent format, then
merged together into an analytic data mart. Data mining is an iterative process.
It's true that you don't need a data warehouse to mine data; however, the data
movement and data management can add months to your data mining project. Data
mining tool and database vendors highly recommend beginning with a data
warehouse if you're planning to integrate data mining into your business
intelligence strategy. Another reason an analyst may opt for a distributed data
mart model is for data autonomy. Once you extract data from your sources, you
have full control over your analytical environment. The second scenario, Data
Warehouse with Analytic Data Mart, allows you to achieve autonomy with a data
warehouse.

Data Warehouse with Dependent (Analytic) Data Marts
Using a data warehouse simplifies the data management issues since the data have
already been gathered, cleansed, and transformed to meet your warehouse
criteria. Although you're pulling from a single source, you must still contend with
the data movement from your warehouse to your analytical server, potential human
error that can occur with sampling, and analytic server management issues. In
addition to data movement, you must ensure the data you select are a sample
that accurately reflects the business environment. Building models against
samples that don't represent your data will produce poor models. Remember that it's
all about your data. There are other, more efficient alternatives.
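The sampling risk described above can be made concrete with a small check that compares how customer segments are represented in a sample with how they are represented in the full data. The sketch below is only an illustration under stated assumptions: the segment names, the population mix, the 5% sampling fraction, and the helper functions are all hypothetical and do not come from the paper.

```python
import random
from collections import Counter, defaultdict

def segment_shares(rows, key):
    """Share of rows that fall into each segment."""
    counts = Counter(row[key] for row in rows)
    return {segment: count / len(rows) for segment, count in counts.items()}

def stratified_sample(rows, key, fraction, seed=7):
    """Draw the same fraction from every segment so the mix is preserved."""
    rng = random.Random(seed)
    by_segment = defaultdict(list)
    for row in rows:
        by_segment[row[key]].append(row)
    sample = []
    for segment_rows in by_segment.values():
        k = max(1, round(len(segment_rows) * fraction))
        sample.extend(rng.sample(segment_rows, k))
    return sample

def max_share_gap(population, sample, key):
    """Largest difference in segment share between population and sample."""
    pop = segment_shares(population, key)
    smp = segment_shares(sample, key)
    return max(abs(pop[s] - smp.get(s, 0.0)) for s in pop)

# Hypothetical population: 90% consumer, 8% small business, 2% enterprise.
segments = ["consumer"] * 9000 + ["small_business"] * 800 + ["enterprise"] * 200
population = [{"customer_id": i, "segment": s} for i, s in enumerate(segments)]

# A convenience sample (the first 500 rows) versus a stratified 5% sample.
convenience = population[:500]
stratified = stratified_sample(population, "segment", 0.05)

print("convenience gap:", max_share_gap(population, convenience, "segment"))
print("stratified gap:", max_share_gap(population, stratified, "segment"))
```

The convenience sample contains only one segment, so a model built on it would never see small business or enterprise customers. The stratified sample keeps the population mix. A check of this kind is cheap to run before modeling, although it only covers the attributes you choose to compare.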



Centralized Data Warehouse and Mining
As data mining projects are implemented across the enterprise, the number of users
leveraging the data mining models continues to grow, as does the need to access large
data infrastructures. Data warehouse solution providers recognize this situation
and are incorporating data mining extensions within the database to offer a
centralized data mining architecture. The analytic processing performed within the
database minimizes data movement in and out of the database and leverages the
parallelism of the database. A massively parallel database provides a massively
parallel analytical engine that you can use to build, test, and deploy analytical models.

The data warehouse becomes a centralized repository for your analytical data, data
mining models, and data mining results, providing an ideal foundation for data
mining projects. Data are available for multiple mining projects across your
entire enterprise, and your analytical models can be run against your entire
customer table within your warehouse. Data mining models and results combined
with your detailed customer records give you insight about customer value, buying
patterns, and preferences.

The Data Warehouse with Analytical Data Marts architecture is the most commonly
used architecture today because of the limitations of databases and data mining
tools. Most data mining tool vendors require data to be converted into their
proprietary format for efficient processing. Technology limitations are discussed in
the next section.

The Right Technology
The right technology begins with the right foundation: the right data warehouse.
Effective data mining depends on a comprehensive and robust data warehouse,
not a summarized data mart, because it's difficult to predict the attributes that will
contribute to the data mining model. In addition, you must select a warehouse
that is built on the right foundation. Some companies are trying to do data
warehousing with a database that was designed for OLTP operational processing of
high-speed transactions. The functions performed in OLTP (adding, deleting, and
modifying records, or row-level functions) are entirely different from analyzing large
volumes of historical data and require very different database capabilities.

Scalability and Performance
To get a higher return on their data warehousing investments, data warehouse
users are asking more complex questions that require access to large amounts of
data. As data volumes and the complexity of the business problems grow, analyses
will inevitably take longer to process, requiring acceleration of the data mining
process. Users who analyze data warehouses that scale to the multi-terabyte
range struggle with desktop and client/server data mining tools that don't
scale to meet their requirements. This has required data mining to move from
desktop and general-purpose toolboxes on client/server configurations to
enterprise applications on massively parallel processing (MPP) configurations.

Unfortunately, most tool vendors fail to leverage parallel technology for efficient
data processing. However, some database vendors are in a unique position to
provide an in-database approach to data mining to answer this need. Mining
directly in the database streamlines the data mining process by eliminating data
movement and leveraging the parallelism of the database engine for the
performance and data scalability required to analyze large volumes of detail data.

Data I/O
As large volumes of data are processed and models are deployed across the enterprise,
the I/O required by most tools creates a network bandwidth problem. As gigabytes
and even terabytes are moved from database to analytic server to business
server, the I/O puts a strain on the entire enterprise network. In-database mining
eliminates the I/O issues by moving the functions to the data versus moving data
to the functions.

Tools
The right technology includes tools that provide a comprehensive set of statistical
and machine learning functions along with visualization and data preprocessing
techniques. Many tools provide a sophisticated set of analytical algorithms
and graphical interfaces. However, they fail to provide a robust set of data visualization
and data preprocessing functions. Since the bulk of the data mining process is spent
exploring and conditioning data, you need tools that will facilitate data exploration,
visualization, transformation, and data management. Tools must also process large
data volumes and provide an interface that enables integration of analytical models
into business applications.

Data Mining with Teradata
Data warehouse solution providers, such as Teradata, a division of NCR, fully
understand the data mining challenges and issues facing companies today.
Teradata's in-database data mining approach sets us apart from other data
mining solution providers in the industry. Our centralized solution permits users to
do data exploration, data preprocessing, analytic modeling, scoring, and deployment
all within the database using SQL, taking advantage of Teradata Database's unlimited
scalability and exceptional performance. Teradata Warehouse Miner's analytic
operations can be performed on the data within the Teradata Database. Results
from the analysis are stored within your enterprise data warehouse, providing access
to all users as necessary. (See Figure 6.)

Figure 6. Benefits of In-DBS Mining. (Diagram panels: Traditional Approach;
Teradata's In-Database Mining. Labels: Teradata Database, Analytical Data, Results.)
Benefits of In-DBS:
> Eliminates data movement
> Minimizes data redundancy
> Reduces cost of system and data management
> Eliminates potential sampling errors
> Leverages parallel database engines

Teradata Warehouse Miner
Teradata Warehouse Miner dynamically generates Teradata SQL statements and
executes them from a Windows client. The SQL is constructed from options,
tables, and columns selected by the user in the graphical Windows interface. In some
cases, Teradata Warehouse Miner breaks the algorithms into steps so that the steps
which require data access are performed via SQL, while other steps requiring
numerical processing are handled by the Teradata Warehouse Miner client.
Teradata Warehouse Miner processes functions in the optimal manner,
leveraging the parallelism of Teradata Database whenever possible.

Traditionally, data mining technologies require that you move data out of the
centralized data warehouse and into proprietary or flat file structures. With
this technique, many copies of the data will reside in various analytical servers
or data marts. Imagine how much time it could take to create 20 samples of a
terabyte-sized database, extract them into different locations, convert them into
different formats, and finally import them into applications. Can you afford the time
and inefficiencies of this method? Performing data mining in the database
streamlines the process by eliminating data movement and the overhead associated
with managing the data and the systems involved in a distributed environment.
In-database mining also reduces data redundancy and improves data reliability.
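The in-database idea in this section (SQL built from the tables and columns a user selects, and functions moved to the data instead of data moved to the functions) can be sketched as follows. This is a hedged illustration only: Python's built-in sqlite3 module stands in for a data warehouse, the transactions table and its columns are hypothetical, and build_summary_sql is an invented helper that does not reproduce the SQL Teradata Warehouse Miner actually generates.

```python
import sqlite3

def build_summary_sql(table, group_column, value_column):
    """Build one aggregate query from user-selected table and column names.

    Identifiers cannot be bound as query parameters, so they are validated first.
    """
    for name in (table, group_column, value_column):
        if not name.isidentifier():
            raise ValueError(f"unsafe identifier: {name!r}")
    return (
        f"SELECT {group_column}, COUNT(*) AS n, AVG({value_column}) AS mean_value "
        f"FROM {table} GROUP BY {group_column} ORDER BY {group_column}"
    )

# Stand-in warehouse with a small, hypothetical detail table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (customer_id INTEGER, segment TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [(1, "consumer", 20.0), (2, "consumer", 40.0),
     (3, "enterprise", 900.0), (4, "enterprise", 1100.0)],
)

# Functions to the data: the database aggregates and returns one row per segment.
sql = build_summary_sql("transactions", "segment", "amount")
in_database = conn.execute(sql).fetchall()

# Data to the functions: every detail row leaves the database before any analysis.
detail_rows = conn.execute("SELECT segment, amount FROM transactions").fetchall()
totals = {}
for segment, amount in detail_rows:
    count, total = totals.get(segment, (0, 0.0))
    totals[segment] = (count + 1, total + amount)
client_side = sorted((s, n, total / n) for s, (n, total) in totals.items())

print("rows returned by in-database query:", len(in_database))
print("rows moved for client-side analysis:", len(detail_rows))
```

Both routes produce the same summary, but the first returns one row per segment while the second moves every detail row across the connection. With four rows the difference is invisible; the argument in the text is that at gigabyte and terabyte volumes the data movement dominates.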



Teradata Data Mining Labs
Where Advanced Analytics Come to Life
Teradata's Data Mining Services help many customers leverage data mining
to grow their business, reduce costs, and better serve their customers, giving them
the competitive edge. Our worldwide Data Mining Labs and the San Diego-based
Data Mining Center of Expertise are uniquely qualified to offer clients a secure
environment where they can investigate how data mining will help them solve
their most complex business problems.

Teradata's Data Mining Lab consultants are experts in analytical modeling with
a strong background in statistics and artificial intelligence. Technological
expertise combined with business knowledge is their forte: they know how to
help customers leverage a sophisticated technology to solve their business
challenges with analytical solutions.

A Data Mining Lab engagement offers consulting services, educational workshops,
and analytic model development to help you integrate predictive models into your
business process.

The Data Mining Lab Engagement: Low Risk, High Value
Data Mining Lab engagements have been used by Teradata customers to help them
assess the potential of data mining in their environment. This controlled, highly
secure Proof of Concept (POC) is a low-risk, high-value engagement that shows
how data mining can be applied to answer your complex business questions.

The project length for a data mining POC varies, but typically it takes two
weeks to perform business problem qualification and clarification and data
discovery for the business questions. Preparing and analyzing the data and
developing the analytic models takes from four to eight weeks. The length of the
engagement depends on many variables, but the three main factors are data cleanliness,
data availability, and the clarity of the business problem to be solved. The data
mining engagement can be used in the lab as a Proof of Concept or in a client's
live production environment.

How to Get Started with Data Mining: the Accelerator Packages from Teradata
Many organizations are interested in data mining, but don't know what the next
steps are to successfully integrate data mining with their business intelligence
strategy. Teradata puts data mining into your hands with a full complement of
data mining services built around a mentoring program that ensures that
you learn how to mine your own data.

Teradata Data Mining Accelerator Packages help you get started with data
mining through special educational offerings and pricing incentives. Here's
a brief overview of the Data Mining Accelerator Packages:

> Exploration Package: This package is designed for Teradata customers who
want to understand what data mining can contribute to their business. They're
already analyzing data in the warehouse using query, reporting, and OLAP tools
but want to explore the possibility of including predictive analysis.
> Expansion Package: This package is designed for Teradata customers
who are ready to integrate data mining into their business processes. They're
already using their data warehouse for business intelligence and are ready to
expand their analytic capabilities with data mining.
> Expert Package: This package enables customers who have a staff of
analytic modelers experienced in the use of client/server-based data mining
tools to integrate in-database mining to leverage the best of both worlds.


Driving Higher ROI: Data Mining Customer Experiences
Ever-increasing global economic challenges are prompting companies to
explore new ways to get more from their data warehousing investment.
Technologies that offer valuable insight and predictive capabilities to drive
business growth and improve their ROI are a great next step after the data warehouse
is in place.

Data mining is the right technology for supercharging CRM and analytic
applications by inserting intelligence in the form of predictions, scores, descriptions,
and profiles (where data mining excels). Volumes of historical data containing
facts about what occurred in business operations can be analyzed and used to
predict what will happen in the future.

Data mining is one of the fastest growing business intelligence technologies because it
pays off in quantitative value. For example, here are a few facts from some of the first
Teradata data mining implementers:

> A European financial institution saved $8.2 million by gaining a better
understanding of their customers' ATM behavior. They were able to
strategically place ATMs to reduce fees and increase loyalty.
> A South American telecommunications provider retained 98% of their
high-value customers during deregulation. They identified who their high-value
customers were, understood their profile and customer satisfaction level, and
marketed to their customer segments.
> A U.S. telecommunications provider improved their targeted marketing
response rate tenfold by targeting customers identified through data mining.

Not Just Better, but the Best: Our Benchmarks Prove It!
A packaged goods manufacturer had a partner Customer Loyalty program in
which they collected data from their retail partners, analyzed the data comparing
their product sales with other products, and sent this information back to the
partners. This program was based on five analytical programs including market
basket analysis and promotion monitoring. The analysis was performed on an
IBM AIX server and a data mining analytic server. The entire process took 312
hours, not including data extract, coding, or data copy, making the application
too costly to operate. The Teradata benchmarking team created programs that
used Teradata SQL and ran everything directly in Teradata Database. With
Teradata, the process ran in only 12 hours, saving the Customer Loyalty program.

Summary
To develop analytic solutions that can be applied throughout your enterprise, you
need a powerful infrastructure that is built for analytic processing. The volume of
data being created and captured and the amount of transaction data can cause
massive bottlenecks in your decision flow: thousands of variables, millions of
transactions per day, and millions of customers. You require timely, accurate, and
sophisticated analysis of your data to maintain your competitive advantage. Reports and
OLAP techniques provide the capabilities for navigating massive data
warehouses but not the insight required to stay ahead of your competitors. Data
mining with Teradata Database offers the analytic foundation to unlock the
intelligence from your enterprise data warehouse.

This paper was written by Arlene Zaima, Data Mining Marketing Manager, with
contributions from James Kashner, CTO of the Teradata Data Mining Lab. For
more information, visit our web site at Teradata.com.
