PROJECT REPORT ON “DATA MINING TECHNIQUES”

(Submitted to Symbiosis Centre for Distance Learning, in partial fulfillment of PGDITM programme)

Submitted by:
Ashutosh Bhardwaj
Reg 201515474

DATA MINING-The Knowledge Discovery in Database

PREFACE

“You have no choice but to operate in a world shaped by globalization and
the information revolution. There are two options: adapt or die.”
-Andy Grove, Chairman, Intel

First of all, I want to thank CONVERGENCE for giving me the opportunity to
show our ability in front of the students and experts in the paper presentation.

The last few years have seen a growing recognition of information as a key
business tool. Those who successfully gather, analyze, understand, and act upon the
information are among the winners in this new “information age”.

In any business, just gathering information is not sufficient; it must also
be stored for future use. Database management systems are great tools for
defining data and responding to questions or queries. Still, some of the
information we want is not directly accessible from the database. Just by looking
at the database, you cannot make certain decisions, and you may not be able to
predict the behavior of the market or of customers.

At this point data mining becomes very important for the user. Though data
mining is not a magic wand, it can find the “hidden” information in your database,
and it can predict the market as well as the customer up to a certain level of
accuracy. Data mining is also able to draw results from multiple databases, which
may be in different DBMSs or belong to different companies.

We have tried to give a detailed introduction to data mining and its main
features, and also to give details of the development of data mining.

We hope that after reading this report one can better understand data mining
and can develop an application, or say data mining software, which provides
facilities for mining the database in the real sense. In the real sense, such
software should also have artificial intelligence to detect models or methods for
data mining.

INDEX
CONTENTS

• ABSTRACT
• INTRODUCTION TO DATA MINING
• LEARNING FROM PAST MISTAKES?
• INTRODUCTION TO DATA WAREHOUSES
• DATA MINING AND DATA WAREHOUSING
• FOUNDATION OF DATA MINING
• ARCHITECTURE OF DATA MINING
• PHYSICAL STRUCTURE OF DATA WAREHOUSING
• CHARACTERISTICS OF DATA WAREHOUSING
• TYPES OF DATA MINING
• HOW DATA MINING WORKS?
• GOALS OF DATA MINING
• INTEGRATED DATA MINING AND CAMPAIGN MANAGEMENT
• THE INTEGRATED DATA MINING AND CAMPAIGN MANAGEMENT PROCESS
• BENEFITS OF INTEGRATING DATA MINING AND CAMPAIGN MANAGEMENT
• TEN STEPS OF DATA MINING
• EVALUATING BENEFITS OF DATA MINING MODEL
• DATA MINING SUITE
• SCOPE OF DATA MINING
• PROFITABLE APPLICATION OF DATA MINING
• TYPICAL FUNCTIONALITY OF DATA WAREHOUSES
• WHAT DATA MINING CAN’T DO?
• DIFFICULTIES IN WORKING WITH DATA WAREHOUSING
• GLOSSARY
• CONCLUSION
• BIBLIOGRAPHY

ABSTRACT
Data mining gains its name, and to some degree its popularity, by playing
off the idea that the data you have stored is much like a “mountain” and that
buried within the mountain (just as buried within your data) are certain “gems” of great
value. The problem is that there are also lots of non-valuable rocks and rubble in the
mountain that need to be mined through and discarded in order to get to that which is
valuable. The trick is that for both mountains of rock and mountains of data you need
some power tools to unearth the value. For rock, this means earthmovers and
dynamite; for data, this means powerful computers and data mining software.

Data mining is a process by which organizations uncover patterns hidden in
their data that can be used to predict the behavior of customers, products and processes.

The database can be global, or there may be more than one database on different
DBMSs; data mining can still extract from all of them and give you the results
you want. This process gives you information from the database that may not be
directly visible.

Data mining can yield results, combinations, or specific characteristics of
customers, products or processes that are useful for further work. It can be said
that there is some artificial intelligence in data mining.

Data mining is the tool that can give your data intelligence for particular
models or work. Building data mining software is much easier if you follow the
proper steps.

Data mining is often referred to as KDD (Knowledge Discovery in Databases),
because in the process of data mining we are mining the data, that is, initiating
the process of knowledge discovery in the database.

Introduction to Data Mining


Discovering hidden value in your data warehouse

Databases today can range in size into the terabytes - more than
1,000,000,000,000 bytes of data. Within these masses of data lies hidden information of
strategic importance. But when there are so many trees, how do you draw meaningful
conclusions about the forest?

The newest answer is Data Mining, which is being used both to increase
revenues and to reduce costs. The potential returns are enormous. Innovative
organizations worldwide are already using data mining to locate and appeal to higher-
value customers, to reconfigure their product offerings to increase sales, and to minimize
losses due to error or fraud.

Data mining, the extraction of hidden predictive information from large
databases, is a powerful new technology with great potential to help companies focus on
the most important information in their data warehouses. Data mining tools predict future
trends and behaviors, allowing businesses to make proactive, knowledge-driven
decisions. The automated, prospective analyses offered by data mining move beyond the
analyses of past events provided by retrospective tools typical of decision support
systems. Data mining tools can answer business questions that traditionally were too
time-consuming to resolve. They scour databases for hidden patterns, finding predictive
information that experts may miss because it lies outside their expectations.

Most companies already collect and refine massive quantities of data. Data
mining techniques can be implemented rapidly on existing software and hardware
platforms to enhance the value of existing information resources, and can be integrated
with new products and systems as they are brought on-line. When implemented on high
performance client/server or parallel processing computers, data mining tools can analyze
massive databases to deliver answers to questions such as, "Which clients are most likely
to respond to my next promotional mailing, and why?"

Data mining is a process that uses a variety of data analysis tools to
discover patterns and relationships in data that may be used to make valid predictions.

The first and simplest analytical step in data mining is to describe the data:
summarize its statistical attributes (such as means and standard deviations), visually
review it using charts and graphs, and look for potentially meaningful links among
variables (such as values that often occur together). Collecting, exploring and selecting
the right data are critically important.
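
The descriptive step above can be sketched in a few lines of Python. This is only an illustration: the purchase records are invented, and the sketch uses the standard library instead of any particular data mining product.

```python
from collections import Counter
from itertools import combinations
from statistics import mean, stdev

# Hypothetical purchase records: (amount spent, items bought together).
purchases = [
    (20.0, {"bread", "milk"}),
    (35.0, {"bread", "milk", "eggs"}),
    (15.0, {"milk"}),
    (50.0, {"bread", "eggs"}),
]

# Summarize a statistical attribute: mean and sample standard deviation.
amounts = [amount for amount, _ in purchases]
summary = {"mean": mean(amounts), "stdev": stdev(amounts)}

# Look for values that often occur together: count each pair of items
# that appears in the same purchase.
pair_counts = Counter(
    pair
    for _, items in purchases
    for pair in combinations(sorted(items), 2)
)

print(summary)
print(pair_counts.most_common())
```

A summary like this does not predict anything by itself; it only shows which variables and combinations may be worth modeling.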

But data description alone cannot provide an action plan. You must build a
predictive model based on patterns determined from known results, then test the model on
results outside the original sample. A good model should never be confused with reality,
but it can be a useful guide to understanding your business.

The final step is to empirically verify the model. For example, from a
database of customers who have already responded to a particular offer, you have built a
model predicting which prospects are likeliest to respond to the same offer. Can you rely on
this prediction?
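
One way to approach that question is to hold back some known results and check the model against them. The following is a minimal sketch, assuming invented customer records and a deliberately simple one-variable threshold model; it is not the method of any specific tool.

```python
# Hypothetical records: (number of past purchases, responded to the offer?).
known = [(1, False), (2, False), (3, False), (6, True), (7, True), (8, True)]
# Results outside the original sample, kept aside for verification.
holdout = [(0, False), (2, False), (5, True), (9, True), (4, True)]


def fit_threshold(records):
    """Place the cut-off midway between the mean of each class."""
    yes = [x for x, responded in records if responded]
    no = [x for x, responded in records if not responded]
    return (sum(yes) / len(yes) + sum(no) / len(no)) / 2


def predict(threshold, purchases):
    """Predict a response when past purchases exceed the cut-off."""
    return purchases > threshold


threshold = fit_threshold(known)
hits = sum(predict(threshold, x) == responded for x, responded in holdout)
accuracy = hits / len(holdout)

print(threshold, accuracy)
```

The accuracy on the held-back records, not on the records used to build the model, is the figure that indicates how far the prediction can be relied upon.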

Data mining is often referred to as KDD (Knowledge Discovery in Databases),
because in the process of data mining we are mining the data, that is, initiating the
process of knowledge discovery in the database.

The knowledge discovery process comprises six phases:

• Data Selection
• Data Cleansing
• Enrichment
• Data Transformation or Data Encoding
• Data Mining
• Reporting and display of discovered information

As an example, consider a transaction database maintained by a specialty
consumer goods retailer. Suppose the client data includes a customer name, zip code,
phone number, date of purchase, item codes, quantity, and total amount. A variety of new
knowledge can be discovered by KDD processing on this client database.

Data mining must be preceded by significant data preparation before it can
yield useful information that can directly influence business decisions. The results of data
mining may be reported in a variety of formats such as listings, graphic outputs, summary
tables, or visualizations.
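
The phases can be walked through on a toy version of the retailer's data. Everything in this sketch is an assumption made for illustration: the rows, the zip-to-region lookup, the age-group lookup, and the use of a simple total as a stand-in for a real mining algorithm.

```python
from collections import Counter

# Hypothetical raw rows: (customer, zip code, item code, quantity, amount).
raw = [
    ("Ann", "30301", "A1", 2, 40.0),
    ("Bob", " 30301", "A1", 1, 20.0),   # stray space in the zip code
    ("Cy", "", "B7", 1, 15.0),          # missing zip code
    ("Dee", "10001", "B7", 4, 60.0),
    ("Ann", "30301", "A1", 2, 40.0),    # duplicate row
]
REGION_BY_ZIP = {"30301": "South", "10001": "Northeast"}      # assumed lookup
AGE_GROUP = {"Ann": "30-39", "Bob": "20-29", "Dee": "30-39"}  # assumed external data

# 1. Selection: keep only the fields this question needs.
selected = [(name, z, item, qty) for name, z, item, qty, _ in raw]

# 2. Cleansing: trim zip codes, drop rows without one, remove exact duplicates.
trimmed = [(name, z.strip(), item, qty) for name, z, item, qty in selected]
cleaned = list(dict.fromkeys(row for row in trimmed if row[1]))

# 3. Enrichment: attach an attribute from an outside source.
enriched = [row + (AGE_GROUP.get(row[0], "unknown"),) for row in cleaned]

# 4. Transformation/encoding: replace zip codes with coarser regions.
encoded = [(REGION_BY_ZIP.get(z, "Other"), item, qty, age)
           for _, z, item, qty, age in enriched]

# 5. Data mining (trivial stand-in): total quantity per region and item.
totals = Counter()
for region, item, qty, _ in encoded:
    totals[(region, item)] += qty

# 6. Reporting: a listing ordered from largest to smallest total.
report = sorted(totals.items(), key=lambda pair: pair[1], reverse=True)
for (region, item), qty in report:
    print(region, item, qty)
```

Most of the lines belong to preparation rather than mining, which matches the point that significant data preparation comes first.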

Learning from Past Mistakes?

“Those who cannot remember the past are condemned to repeat it.”
-G. Santayana

Data mining works the same way a human being does. It uses historical
information (experience) to learn from the past. However, in order for the data mining
technology to pull the “gold” out of your database, you do have to tell it what the gold
looks like (what business problem you would like to solve). It then uses the description of
that “gold” to look for similar examples in the database, and uses these pieces of
information from the past to develop a predictive model of what will happen in the future.

Does data mining replace skilled business analysts?

Data mining does not replace skilled business analysts or managers, but
rather gives them a powerful new tool to improve the job they are doing. Any company
that knows its business and its customers is already aware of many important, high-payoff
patterns that its employees have observed over the years. What data mining can do is
confirm such empirical observations and find new, subtle patterns that yield steady
incremental improvement.

Introduction to data warehousing

Data warehousing is the integration of information to boost an organization's
decision support system. A data warehouse is a subject-oriented, integrated, nonvolatile,
time-variant collection of data in support of management’s decisions. It provides
an architecture for building an environment in which users can access every piece of the
organization's information. It is a way to design a very large database with historical and
summarized data. This nonvolatile data is collected from heterogeneous sources and
analyzed by data warehouse components, so the data stored in the warehouse is generally
read-only or not modified frequently.

Data warehouses support high-performance demands on an organization's data and
information. Several types of applications are supported: OLAP, DSS, and data mining
applications. OLAP (On-Line Analytical Processing) is a term used to describe the analysis
of complex data from the data warehouse. In the hands of skilled knowledge workers, OLAP
tools use distributed computing capabilities for analyses that require more storage and
processing power than can be economically and efficiently located on an individual desktop.
DSS (Decision Support Systems), also known as EIS (Executive Information Systems), support
an organization's leading decision makers with higher-level data for complex and important
decisions.

Traditional databases support On-Line Transaction Processing (OLTP), which
includes insertions, updates and deletions, while also supporting information query
requirements. Traditional relational databases are optimized to process queries that
touch a small part of the database and transactions that deal with insertions or updates of a
few tuples per relation. Thus, they cannot be optimized for OLAP, DSS or data
mining. By contrast, data warehouses are designed precisely to support efficient
extraction, processing and presentation for analytic and decision-making purposes. Data
warehouses generally contain large amounts of data from multiple sources, which may
include databases using different data models and sometimes files acquired from
independent systems and platforms.
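
The difference between the two workloads can be seen even on a tiny table. The sketch below uses Python's built-in sqlite3 module purely as a convenient stand-in; the table and its rows are invented, and a real warehouse would of course use a separate, much larger store.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)

# OLTP-style work: short transactions that insert or update a few tuples.
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("East", 100.0), ("West", 50.0), ("East", 25.0)],
)
conn.execute("UPDATE sales SET amount = 60.0 WHERE id = 2")
conn.commit()

# Analytic-style work: one query that scans and summarizes every row.
totals = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
)
print(totals)
conn.close()
```

The update touches a single row by key, while the summary has to read the whole table, which is why the two kinds of work are tuned so differently.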

Data mining and Data warehousing


In modern organizations, users of data are often completely removed from the data
sources. Many people only need read access to data, but still need very rapid access to
a larger volume of data than can conveniently be downloaded to the desktop. Often such
data comes from multiple databases. Because many of the analyses performed are
recurrent and predictable, software vendors and systems support staff have begun to design
systems to support these functions. At present there is a need to provide decision makers
from middle management upward with information at the correct level of detail to support
decision making. Data warehousing, OLAP (On-Line Analytical Processing), and data
mining provide this functionality.

The data to be mined is first extracted from an enterprise data warehouse into a data
mining database or data mart. There is some real benefit if your data is already part of a
data warehouse. The problems of cleansing data for a data warehouse and for data mining
are very similar. If the data has already been cleansed for a data warehouse, then it most
likely will not need further cleaning in order to be mined.

The data mining database may be a logical rather than a physical subset of your data
warehouse, provided that the data warehouse DBMS can support the additional resource
demands of data mining. If it cannot, then you will be better off with a separate data
mining database.

A data warehouse is not a requirement for data mining. Setting up a large data
warehouse that consolidates data from multiple sources, resolves data integrity problems,
and loads the data into a query database can be an enormous task, sometimes taking years
and costing millions of dollars. You could, however, mine data from one or more
operational or transactional databases by simply extracting it into a read-only database.

The goal of a data warehouse is to support decision making with data. Data mining
can be used in conjunction with a data warehouse to help with certain types of decisions.
Data mining can be applied to operational databases with individual transactions. To make
data mining more efficient, the data warehouse should hold an aggregated or summarized
collection of data. Data mining helps in extracting meaningful new patterns that cannot
necessarily be found by merely querying or processing data in the data warehouse. Data
mining applications should therefore be strongly considered early, during the design of a
data warehouse. Also, data mining tools should be designed to facilitate their use in
conjunction with data warehouses. In fact, for very large databases running into terabytes
of data, successful use of data mining applications will depend first
on the construction of a data warehouse.

Data mining and OLAP


One of the most common questions from data processing professionals is about
the difference between data mining and OLAP (On-Line Analytical Processing).
As we shall see, they are very different tools that can complement each other.

OLAP is part of the spectrum of decision support tools. Traditional query and
report tools describe what is in a database. OLAP goes further; it is used to answer why
certain things are true. The user forms a hypothesis about a relationship and verifies it
with a series of queries against the data. For example, an analyst might want to determine
the factors that lead to loan defaults. He or she might initially hypothesize that people
with low incomes are bad credit risks and analyze the database with OLAP to verify (or
disprove) this assumption. If that hypothesis were not borne out by the data, the analyst
might then look at high debt as the determinant of risk. If the data did not support this
guess either, he or she might then try debt and income together as the best predictor of
bad credit risks.

In other words, the OLAP analyst generates a series of hypothetical patterns and
relationships and uses queries against the database to verify or disprove them.
OLAP analysis is essentially a deductive process. But what happens when the number of
variables being analyzed is in the dozens or even hundreds? It becomes much more difficult
and time-consuming to find a good hypothesis (let alone be confident that there is not a
better explanation than the one found), and to analyze the database with OLAP to verify or
disprove it.

Data mining is different from OLAP because rather than verify hypothetical
patterns, it uses the data itself to uncover such patterns. It is essentially an inductive
process. For example, suppose the analyst who wanted to identify the risk factors for
loan default were to use a data mining tool. The data mining tool might discover that
people with high debt and low incomes were bad credit risks (as above) , but it might go
further and also discover a pattern the analyst did not think to try, such as that age is also
a determinant of risk.
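
The contrast between the two approaches can be sketched with the loan example. The records and attribute names below are invented for illustration, and ranking attributes by default rate is only a crude stand-in for what a real data mining tool would do.

```python
# Hypothetical loan records: (low_income, high_debt, under_25, defaulted).
FIELDS = ("low_income", "high_debt", "under_25", "default")
rows = [
    (True,  True,  True,  True),
    (True,  False, False, False),
    (False, True,  True,  True),
    (False, True,  False, False),
    (True,  False, True,  True),
    (False, False, False, False),
]
loans = [dict(zip(FIELDS, row)) for row in rows]


def default_rate(records, attribute):
    """Share of defaults among the records where the attribute holds."""
    group = [r for r in records if r[attribute]]
    return sum(r["default"] for r in group) / len(group)


# Deductive (OLAP-style): the analyst chooses one hypothesis and checks it.
hypothesis_rate = default_rate(loans, "low_income")

# Inductive (mining-style): scan every attribute and let the data rank them.
attributes = [name for name in FIELDS if name != "default"]
ranked = sorted(attributes, key=lambda a: default_rate(loans, a), reverse=True)

print(hypothesis_rate, ranked)
```

In this made-up data the scan puts the age attribute first, a factor the analyst in the example had not thought to test.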

Here is where data mining and OLAP can complement each other. Before acting
on the pattern, the analyst needs to know what the financial implications would be of
using the discovered pattern to govern who gets credit. The OLAP tool can allow the
analyst to answer those kinds of questions. Furthermore, OLAP is also complementary in
the early stages of the knowledge discovery process because it can help you explore your
data, for instance by focusing attention on important variables, identifying exceptions, or
finding interactions. This is important because the better you understand your data, the
more effective the knowledge discovery process will be.

The Foundations of Data Mining

Data mining techniques are the result of a long process of research and product
development. This evolution began when business data was first stored on computers,
continued with improvements in data access, and more recently, generated technologies
that allow users to navigate through their data in real time. Data mining takes this
evolutionary process beyond retrospective data access and navigation to prospective and
proactive information delivery. Data mining is ready for application in the business
community because it is supported by three technologies that are now sufficiently mature:
• Massive data collection
• Powerful multiprocessor computers
• Data mining algorithms

Commercial databases are growing at unprecedented rates. A recent META
Group survey of data warehouse projects found that 19% of respondents are beyond the
50 gigabyte level, while 59% expect to be there by the second quarter of 1996. In some
industries, such as retail, these numbers can be much larger. The accompanying need for
improved computational engines can now be met in a cost-effective manner with parallel
multiprocessor computer technology. Data mining algorithms embody techniques that
have existed for at least 10 years, but have only recently been implemented as mature,
reliable, understandable tools that consistently outperform older statistical methods.

In the evolution from business data to business information, each new step has built upon
the previous one. For example, dynamic data access is critical for drill-through in data
navigation applications, and the ability to store large databases is critical to data mining.
From the user’s point of view, the four steps listed in Table 1 were revolutionary because
they allowed new business questions to be answered accurately and quickly.

Evolutionary Step | Business Question | Enabling Technologies | Product Providers | Characteristics
------------------|-------------------|-----------------------|-------------------|----------------
Data Collection (1960s) | "What was my total revenue in the last five years?" | Computers, tapes, disks | IBM, CDC | Retrospective, static data delivery
Data Access (1980s) | "What were unit sales in New England last March?" | Relational databases (RDBMS), Structured Query Language (SQL), ODBC | Oracle, Sybase, Informix, IBM, Microsoft | Retrospective, dynamic data delivery at record level
Data Warehousing & Decision Support (1990s) | "What were unit sales in New England last March? Drill down to Boston." | On-line analytic processing (OLAP), multidimensional databases, data warehouses | Pilot, Comshare, Arbor, Cognos, Microstrategy | Retrospective, dynamic data delivery at multiple levels
Data Mining (Emerging Today) | "What’s likely to happen to Boston unit sales next month? Why?" | Advanced algorithms, multiprocessor computers, massive databases | Pilot, Lockheed, IBM, SGI, numerous startups (nascent industry) | Prospective, proactive information delivery

Table 1. Steps in the Evolution of Data Mining.

The core components of data mining technology have been under
development for decades, in research areas such as statistics, artificial intelligence, and
machine learning. Today, the maturity of these techniques, coupled with high-
performance relational database engines and broad data integration efforts, make these
technologies practical for current data warehouse environments.

An Architecture for Data Mining

To best apply these advanced techniques, they must be fully integrated with a data
warehouse as well as flexible interactive business analysis tools. Many data mining tools
currently operate outside of the warehouse, requiring extra steps for extracting,
importing, and analyzing the data. Furthermore, when new insights require operational
implementation, integration with the warehouse simplifies the application of results from
data mining. The resulting analytic data warehouse can be applied to improve business
processes throughout the organization, in areas such as promotional campaign
management, fraud detection, new product rollout, and so on. Figure 1 illustrates an
architecture for advanced analysis in a large data warehouse.

Figure 1 - Integrated Data Mining Architecture

The ideal starting point is a data warehouse containing a combination of internal
data tracking all customer contact coupled with external market data about competitor
activity. Background information on potential customers also provides an excellent basis
for prospecting. This warehouse can be implemented in a variety of relational database
systems: Sybase, Oracle, Redbrick, and so on, and should be optimized for flexible and
fast data access.

An OLAP (On-Line Analytical Processing) server enables a more sophisticated
end-user business model to be applied when navigating the data warehouse. The
multidimensional structures allow users to analyze the data as they want to view their
business: summarizing by product line, region, and other key perspectives of their
business. The Data Mining Server must be integrated with the data warehouse and the
OLAP server to embed ROI-focused business analysis directly into this infrastructure.
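
Summarizing along a chosen dimension and then drilling down can be illustrated with a small sketch. The fact rows and the `summarize` helper are hypothetical and only imitate what an OLAP server does over multidimensional structures.

```python
from collections import defaultdict

# Hypothetical fact rows: (product line, region, city, units sold).
PRODUCT, REGION, CITY, UNITS = 0, 1, 2, 3
facts = [
    ("Shoes", "New England", "Boston", 120),
    ("Shoes", "New England", "Hartford", 80),
    ("Bags", "New England", "Boston", 40),
    ("Shoes", "Midwest", "Chicago", 100),
]


def summarize(rows, *dims):
    """Total units grouped by the chosen dimension columns."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[d] for d in dims)] += row[UNITS]
    return dict(totals)


# Roll up: unit sales by region.
by_region = summarize(facts, REGION)

# Drill down: within New England, unit sales by city.
new_england = [row for row in facts if row[REGION] == "New England"]
by_city = summarize(new_england, CITY)

# A different perspective on the same facts: product line by region.
by_product_region = summarize(facts, PRODUCT, REGION)

print(by_region, by_city, by_product_region)
```

The same helper serves every view; only the dimensions passed to it change, which is the essence of navigating a multidimensional model.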


An advanced, process-centric metadata template defines the data mining
objectives for specific business issues like campaign management, prospecting, and
promotion optimization. Integration with the data warehouse enables operational
decisions to be directly implemented and tracked. As the warehouse grows with new
decisions and results, the organization can continually mine the best practices and apply
them to future decisions.

This design represents a fundamental shift from conventional decision support
systems. Rather than simply delivering data to the end user through query and reporting
software, the Advanced Analysis Server applies users’ business models directly to the
warehouse and returns a proactive analysis of the most relevant information. These
results enhance the metadata in the OLAP Server by providing a dynamic metadata layer
that represents a distilled view of the data. Reporting, visualization, and other analysis
tools can then be applied to plan future actions and confirm the impact of those plans.

Physical structure of data warehouse:

A data warehouse is a central repository for data. There are three basic
architectures for constructing a data warehouse. In the first type there is only one central
location to store data, which we call the data warehouse's physical storage medium. In this
type of construction, data is gathered from heterogeneous data sources, such as different
types of files, local database systems, and other external sources. Because the data is
stored in a central place, access to it is easy and simple. The disadvantage of this
construction is the loss of performance.

In the second type of construction the data is decentralized. The data is not stored
physically together, but it is consolidated logically in the data warehouse environment. In
this construction, department-wise and site-wise data is stored at its local place. Local
application data and other generated data are stored in a local database, but information
about the data, called metadata (data about data), is stored at the central site. The local
databases can also maintain their metadata locally, for their local work, in addition to the
central site. Such a local database with metadata is called a "data mart".

An advantage of this architecture is that the logical data warehouse is only virtual.
The central data warehouse does not store any actual data, only information about the data,
so any user who wants to access data can send a query to the central site, and the central
site prepares the resulting data for the user. This entire process of collecting data from
the physical databases is transparent to the user.

The third and last type of construction creates a hierarchical view of data. Here the
central data warehouse also stores actual data, and data marts on the next level store a
copy or summary of the physical central data warehouse. Local data marts store only the
data that is related to their local site.

The advantages of the distributed and hierarchical constructions are that (1) the retrieval
time of data from the data warehouse is lower and (2) the volume of data is also reduced.
Because the data is integrated through metadata, anyone from anywhere can access the data,
and processing is divided among different physical machines. For better data retrieval
response, a scalable data warehouse architecture is very important. Data warehouse response
also depends on metadata, so the design of the metadata is very important for every data
warehouse.
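
The second (decentralized) construction can be sketched as a central site that holds only metadata and forwards each query to the data mart that owns the rows. All names below (`data_marts`, `metadata`, `query`) are hypothetical and chosen only for this illustration.

```python
# Hypothetical data marts: the actual rows live at the local sites.
data_marts = {
    "sales_mart": [
        {"region": "East", "amount": 100},
        {"region": "West", "amount": 50},
    ],
    "hr_mart": [
        {"employee": "Ann", "site": "East"},
    ],
}

# Central site: metadata only (which mart holds which subject), no rows.
metadata = {"sales": "sales_mart", "employees": "hr_mart"}


def query(subject, predicate):
    """Resolve the subject through the metadata, then fetch from that mart."""
    mart_name = metadata[subject]
    return [row for row in data_marts[mart_name] if predicate(row)]


# The user asks the central site; where the rows live stays transparent.
east_sales = query("sales", lambda row: row["region"] == "East")
print(east_sales)
```

Because every lookup passes through `metadata`, a poorly designed metadata layer slows down every query, which is why its design matters so much.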

Issues in integration of data in data warehouse:

As discussed above, you can physically design your data warehouse using any of the
three construction types. But integrating data into a data warehouse requires some
procedures: data extraction and data migration, data cleansing (data scrubbing), and data
integration.

Data extraction and data migration:

To extract data from operational databases, files and other external sources, extraction
tools are required. This process should be detailed and documented correctly. If it is
not properly documented, it will create problems during integration with other data and
difficulties at later stages. Data extraction should therefore provide a high
level of integration and produce efficient metadata for the data warehouse.

Data migration is the task of converting data from one system to another. It should
provide type checking against the integrity constraints of the data warehouse. It should
also detect inconsistencies and missing values during conversion, and record metadata for
the entire process so that problems in the migration process can be easily identified.
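
A minimal sketch of such a migration step is shown below. The target schema, the field names and the sample rows are assumptions for illustration; the point is only that each row is type-checked, missing values are caught, and every problem is recorded with its row number.

```python
# Assumed target schema: field name -> converter that enforces the type.
SCHEMA = {"customer_id": int, "amount": float, "zip": str}


def migrate(rows):
    """Convert rows to the target types; log rows that cannot be migrated."""
    migrated, problems = [], []
    for number, row in enumerate(rows, start=1):
        try:
            converted = {}
            for field, convert in SCHEMA.items():
                value = row.get(field)
                if value is None or value == "":
                    raise ValueError(f"missing {field}")
                converted[field] = convert(value)
            migrated.append(converted)
        except ValueError as error:
            # Keep a record of what went wrong and where.
            problems.append((number, str(error)))
    return migrated, problems


source_rows = [
    {"customer_id": "17", "amount": "19.99", "zip": "30301"},
    {"customer_id": "x9", "amount": "5.00", "zip": "10001"},   # bad type
    {"customer_id": "18", "amount": "", "zip": "10001"},       # missing value
]
migrated, problems = migrate(source_rows)
print(migrated)
print(problems)
```

The `problems` list plays the role of the migration metadata: it lets someone trace each rejected row back to its cause.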

Data cleansing / Data scrubbing:

A data warehouse collects data from heterogeneous sources in the organization. This data is
integrated in such a manner that any end user can access it easily. To facilitate
end users, the DWA (Data Warehouse Administration) must be aware of the right approach to
building the warehouse: data has to be collected from different operating systems, different
networks, different application files (such as those written by C, COBOL and FORTRAN
programs) and different operational databases. So the first step is to design a platform
from which we can access data from every system and put it together in the warehouse.
Before data is transferred from one system to another, it must be standardized. This
standard always relates to the format of the data, the structure of the data, and
information collection.
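
Standardization of format can be illustrated with two sources that store the same fact differently. The source names and their date formats below are invented; a real warehouse would document the actual format used by each source system.

```python
from datetime import datetime

# Assumed: each source system documents the date format it uses.
SOURCE_DATE_FORMAT = {
    "billing_files": "%Y%m%d",    # e.g. 19980315
    "crm_database": "%d/%m/%Y",   # e.g. 15/03/1998
}


def standardize(source, record):
    """Return the record with a trimmed, upper-case name and an ISO date."""
    parsed = datetime.strptime(record["date"], SOURCE_DATE_FORMAT[source])
    return {
        "name": record["name"].strip().upper(),
        "date": parsed.date().isoformat(),
    }


from_files = standardize("billing_files", {"name": " smith ", "date": "19980315"})
from_crm = standardize("crm_database", {"name": "Smith", "date": "15/03/1998"})
print(from_files, from_crm)
```

After standardization the two records are identical, so they can be recognized as the same customer event when the data is put together in the warehouse.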

Characteristics of Data warehousing:

• multidimensional conceptual view
• generic dimensionality
• unlimited dimensions and aggregation levels
• unrestricted cross-dimensional operations
• dynamic sparse matrix handling
• client-server architecture
• multi-user support
• accessibility
• transparency
• intuitive data manipulation
• consistent reporting performance
• flexible reporting

Because they encompass large volumes of data, data warehouses are generally an
order of magnitude larger than the source databases. The sheer volume of data is an issue
that has been dealt with through enterprise-wide data warehouses, virtual data warehouses,
and data marts:
• Enterprise-wide data warehouses are huge projects requiring massive investment of
time and resources.
• Virtual data warehouses provide views of operational databases that are materialized for
efficient access.
• Data marts are targeted at a subset of the organization, such as a department, and are
more tightly focused.

Types of Data Mining:

The term “knowledge” is very broadly interpreted as involving some degree of
intelligence. Knowledge is often classified as inductive or deductive. Knowledge can be
represented in many forms: in an unstructured sense, it can be represented by rules or
propositional logic. In a structured form, it may be represented in decision trees,
semantic networks, neural networks, or hierarchies of classes or frames. The knowledge
discovered during data mining can be described in five ways, as follows.

1. Association rules: These rules correlate the presence of a set of items
with a range of values for another set of variables. Examples:
(1) When a female retail shopper buys a handbag, she is likely to buy
shoes.
(2) An X-ray image containing characteristics a and b is likely to
also exhibit characteristic c.

2. Classification hierarchies: The goal is to work from an existing set of
events or transactions to create a hierarchy of classes. Examples:
(1) A population may be divided into five ranges of credit
worthiness based on a history of previous credit transactions.
(2) A model may be developed for the factors that determine the
desirability of the location of a store on a 1-10 scale.
(3) Mutual funds may be classified based on performance data using
characteristics such as growth, income, and stability.

3. Sequential patterns: A sequence of actions or events is sought. Example:
If a patient underwent cardiac bypass surgery for blocked arteries
and an aneurysm and later developed high blood urea within a year
of surgery, he or she is likely to suffer from kidney failure within the next
18 months. Detection of sequential patterns is equivalent to
detecting associations among events with certain temporal relationships.

4. Patterns within time series: Similarities can be detected within positions
of a time series. Three examples follow, beginning with stock market price data as
a time series:

(1) Stocks of a utility company, ABC Power, and a financial company,
XYZ Securities, show the same pattern during 1998 in terms of
closing stock price.
(2) Two products show the same selling pattern in summer but a
different one in winter.
(3) A pattern in the solar magnetic wind may be used to predict changes
in Earth's atmospheric conditions.


5. Categorization and segmentation-A given population of events or items
can be partitioned (segmented) into sets of “similar” elements.
Examples:
(1) An entire population of treatment data on a disease may be
divided into groups based on similarities of side effects
produced.
(2) The adult population may be categorized into five groups from
“most likely to buy” to “least likely to buy” a new product.
(3) The web accesses made by a collection of users against a set of documents
may be analyzed in terms of the keywords of the documents to
reveal clusters or categories of users.
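
The first kind of knowledge listed above, the association rule, is commonly quantified by two measures: support (how often the items occur together) and confidence (how often the consequent appears when the antecedent does). The sketch below computes both for the handbag and shoes example. The transactions and item names are made up for illustration and are not taken from any real data set.

```python
# Hypothetical market-basket data; item names are illustrative only.
transactions = [
    {"handbag", "shoes"},
    {"handbag", "shoes", "scarf"},
    {"handbag", "belt"},
    {"shoes"},
    {"handbag", "shoes", "belt"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent): support(A and C) / support(A)."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

rule_support = support({"handbag", "shoes"}, transactions)
rule_confidence = confidence({"handbag"}, {"shoes"}, transactions)
print(f"handbag -> shoes: support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

A rule is usually reported only when both measures exceed thresholds chosen by the analyst; the thresholds themselves are a business decision rather than part of the calculation.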


How Data Mining Works

How exactly is data mining able to tell you important things that you didn't
know or what is going to happen next? The technique that is used to perform these feats
in data mining is called modeling. Modeling is simply the act of building a model in one
situation where you know the answer and then applying it to another situation that you
don't. For instance, if you were looking for a sunken Spanish galleon on the high seas the
first thing you might do is to research the times when Spanish treasure had been found by
others in the past. You might note that these ships often tend to be found off the coast of
Bermuda and that there are certain characteristics to the ocean currents, and
certain routes that have likely been taken by the ship’s captains in that era. You note these
similarities and build a model that includes the characteristics that are common to the
locations of these sunken treasures. With these models in hand you sail off looking for
treasure where your model indicates it most likely might be given a similar situation in
the past. Hopefully, if you've got a good model, you find your treasure.

This act of model building is thus something that people have been doing for
a long time, certainly before the advent of computers or data mining technology. What
happens on computers, however, is not much different than the way people build models.
Computers are loaded up with lots of information about a variety of situations where an
answer is known and then the data mining software on the computer must run through
that data and distill the characteristics of the data that should go into the model. Once the
model is built it can then be used in similar situations where you don't know the answer.
For example, say that you are the director of marketing for a telecommunications
company and you'd like to acquire some new long distance phone customers. You could
just randomly go out and mail coupons to the general population - just as you could
randomly sail the seas looking for sunken treasure. In neither case would you achieve the
results you desired and of course you have the opportunity to do much better than random
- you could use your business experience stored in your database to build a model.

As the marketing director you have access to a lot of information about all
of your customers: their age, sex, credit history and long distance calling usage. The good
news is that you also have a lot of information about your prospective customers: their
age, sex, credit history etc. Your problem is that you don't know the long distance calling
usage of these prospects (since they are most likely now customers of your competition).
You'd like to concentrate on those prospects who have large amounts of long distance
usage. You can accomplish this by building a model. Table 2 illustrates the data used for
building a model for new customer prospecting in a data warehouse.


                                                      Customers  Prospects
General information (e.g. demographic data)           Known      Known
Proprietary information (e.g. customer transactions)  Known      Target

Table 2 - Data Mining for Prospecting

The goal in prospecting is to make some calculated guesses about the
information in the lower right hand quadrant based on the model that we build going
from Customer General Information to Customer Proprietary Information. For instance, a
simple model for a telecommunications company might be:

    98% of my customers who make more than $60,000/year spend more than $80/month on
    long distance

This model could then be applied to the prospect data to try to tell something
about the proprietary information that this telecommunications company does not
currently have access to. With this model in hand new customers can be selectively
targeted.
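
As a minimal sketch of this idea, the rule can be measured on customer data, where both income and usage are known, and then applied to prospects, where only income is known. The records, identifiers, and resulting percentage below are invented for illustration; only the two cutoffs come from the example rule above.

```python
# Hypothetical customer records: (annual_income_usd, monthly_long_distance_usd).
customers = [
    (75_000, 95), (82_000, 120), (64_000, 85), (61_000, 70),
    (40_000, 20), (35_000, 90), (52_000, 30),
]
INCOME_CUTOFF = 60_000   # dollars per year, from the rule in the text
SPEND_CUTOFF = 80        # dollars per month, from the rule in the text

def rule_strength(rows):
    """Share of high-income customers who are also heavy long distance users."""
    high_income = [r for r in rows if r[0] > INCOME_CUTOFF]
    heavy = [r for r in high_income if r[1] > SPEND_CUTOFF]
    return len(heavy) / len(high_income)

# Prospects: only general information (income) is known; usage is the target.
prospects = {"P1": 90_000, "P2": 45_000, "P3": 61_500}
targets = sorted(pid for pid, income in prospects.items() if income > INCOME_CUTOFF)

print(f"rule holds for {rule_strength(customers):.0%} of high-income customers")
print("prospects to target:", targets)
```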
Test marketing is an excellent source of data for this kind of modeling. Mining the results
of a test market representing a broad but relatively small sample of prospects can provide
a foundation for identifying good prospects in the overall market. Table 3 shows another
common scenario for building models: predict what is going to happen in the future.

                                                  Yesterday  Today  Tomorrow
Static information and current plans (e.g.        Known      Known  Known
demographic data, marketing plans)
Dynamic information (e.g. customer transactions)  Known      Known  Target

Table 3 - Data Mining for Predictions

If someone told you that he had a model that could predict customer usage, how
would you know if he really had a good model? The first thing you might try would be to
ask him to apply his model to your customer base, where you already know the answer.
With data mining, the best way to accomplish this is by setting aside some of your data in
a vault to isolate it from the mining process. Once the mining is complete, the results can
be tested against the data held in the vault to confirm the model’s validity. If the model
works, its observations should hold for the vaulted data.
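
The vault idea can be sketched as follows, assuming a simple rule-based model and a fixed rule for which records go into the vault. The records, the every-third-record split, and the 10-point tolerance are all hypothetical choices made for this illustration.

```python
# Hypothetical customer records: (annual_income_usd, monthly_long_distance_usd).
records = [
    (75_000, 95), (82_000, 120), (64_000, 85), (61_000, 70),
    (90_000, 110), (68_000, 88), (72_000, 60), (66_000, 99),
    (40_000, 20), (35_000, 90), (52_000, 30), (48_000, 25),
]

def rule_hit_rate(rows, income_cutoff=60_000, spend_cutoff=80):
    """Share of high-income rows that also exceed the spend cutoff."""
    high_income = [r for r in rows if r[0] > income_cutoff]
    if not high_income:
        return None
    return sum(1 for r in high_income if r[1] > spend_cutoff) / len(high_income)

# Set every third record aside in the "vault" before any mining takes place.
vault = records[::3]
mining_set = [r for i, r in enumerate(records) if i % 3 != 0]

mined_rate = rule_hit_rate(mining_set)
vault_rate = rule_hit_rate(vault)

# Assumed tolerance: the rule "holds" if the two rates differ by at most 0.10.
holds = abs(mined_rate - vault_rate) <= 0.10
print(f"mined: {mined_rate:.0%}, vault: {vault_rate:.0%}, holds: {holds}")
```

In this made-up data the rule looks perfect on the mined records but fails on the vaulted ones, which is exactly the kind of over-optimistic model the vault is meant to expose.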


Goals of Data Mining

• Prediction: Data mining can show how certain attributes within the data will
behave in the future. Examples of predictive data mining include analyzing
buying transactions to predict what consumers will buy under certain discounts,
how much sales volume a store will generate in a given period, and whether deleting
a product line would yield more profits. In such applications, business logic is
used together with data mining. In a scientific context, certain seismic wave
patterns may predict an earthquake with high probability.

• Identification: Data patterns can be used to identify the existence of an item, an
event, or an activity. For example, intruders trying to break into a system may be
identified by the programs executed, files accessed, and CPU time per session. In
biological applications, the existence of a gene may be identified by certain sequences
of nucleotide symbols in the DNA sequence. The area known as authentication is
a form of identification. It ascertains whether a user is indeed a specific user or
one from an authorized class; it involves a comparison of parameters, images, or
signals against a database.

• Classification: Data mining can partition data so that different classes or
categories can be identified based on combinations of parameters. For example,
customers in a supermarket can be categorized as discount-seeking shoppers,
shoppers in a rush, loyal regular shoppers, and infrequent shoppers. This
classification may be used in the analysis of customer buying transactions as a
post-mining activity. Classification based on common domain knowledge is also used
as an input to decompose the mining problem and make it simpler. For instance, health
foods, party foods, and school lunch foods are distinct categories in the supermarket
business. It makes sense to analyze relationships within and across categories as
separate problems. Such categorization may be used to encode the data appropriately
before subjecting it to further data mining.

• Optimization: One eventual goal of data mining may be to optimize the use of
limited resources such as time, space, money, or materials and to maximize output
variables such as sales or profits under a given set of constraints. This goal of data
mining resembles the objective function used in operations research problems that
deal with optimization under constraints.


Integrating Data Mining and Campaign Management


The closer Data Mining and Campaign
Management work together, the better the business results.
Today, Campaign Management software uses the scores
generated by the Data Mining model to sharpen the focus of
targeted customers or prospects, thereby increasing response
rates and campaign effectiveness.

Unfortunately, the use of a model within Campaign Management today is often a manual,
time-intensive process. When someone in marketing wants to run a campaign that uses
model scores, he or she usually calls someone in the modeling group to get a file
containing the database scores. With the file in hand, the marketer must then solicit the
help of someone in the information technology group to merge the scores with the
marketing database.

This disjointed process is fraught with problems:

• The large numbers of campaigns that run on a daily or weekly basis can be
difficult to schedule and can swamp the available resources.
• The process is error prone; it is easy to score the wrong database or the wrong
fields in a database.
• Scoring is typically very inefficient. Entire databases are usually scored, not just
the segments defined for the campaign. Not only is effort wasted, but the manual
process may also be too slow to keep up with campaigns run weekly or daily.

The solution to these problems is the tight integration of Data Mining and
Campaign Management technologies. Integration is crucial in two areas:
First, the Campaign Management software must share the definition of the defined
campaign segment with the Data Mining application to avoid modeling the entire
database. For example, a marketer may define a campaign segment of high-income males
between the ages of 25 and 35 living in the northeast. Through the integration of the two

applications, the Data Mining application can automatically restrict its analysis to
database records containing just those characteristics.

Second, selected scores from the resulting predictive model must flow
seamlessly into the campaign segment in order to form targets with the highest profit
potential.


The integrated Data Mining and Campaign Management process

This section examines how to apply the integration of Data Mining and Campaign
Management to benefit the organization. The first step creates a model using a Data
Mining tool. The second step takes this model and puts it to use in the production
environment of an automated database marketing campaign.

Step 1: Creating the model

An analyst or user with a background in modeling creates a predictive model
using the Data Mining application. This modeling is usually completely separate from
campaign creation. The complexity of the model creation typically depends on many
factors, including database size, the number of variables known about each customer, the
kind of Data Mining algorithms used and the modeler’s experience.

Interaction with the Campaign Management software begins when a model
of sufficient quality has been found. At this point, the Data Mining user exports his or her
model to a Campaign Management application, which can be as simple as dragging and
dropping the data from one application to the other.
This process of exporting a model tells the Campaign Management software that the
model exists and is available for later use.

Step 2: Dynamically scoring the data

Dynamic scoring allows you to score an already-defined customer segment within
your Campaign Management tool rather than in the Data Mining tool. Dynamic scoring
both avoids mundane, repetitive manual chores and eliminates the need to score an
entire database. Instead, dynamic scoring scores only relevant customer subsets
and only when needed.

Scoring only the relevant customer subset and eliminating the manual process
shrinks cycle times. Scoring data only when needed assures "fresh," up-to-date results.
Once a model is in the Campaign Management system, a user (usually someone other
than the person who created the model) can start to build marketing campaigns using the
predictive models. Models are invoked by the Campaign Management System.

When a marketing campaign invokes a specific predictive model to perform
dynamic scoring, the output is usually stored as a temporary score table. When the score
table is available in the data warehouse, the Data Mining engine notifies the Campaign
Management system and the marketing campaign execution continues.


Here is how a dynamically scored customer segment might be defined:

    where Length_of_service = 9
      and Average_balance > 150
      and In_Model(promo9).score > 0.80

In this example:
• Length_of_service = 9 limits the application of the model to those customers in
the ninth month of their 12-month contracts, thus targeting customers only at the
most vulnerable time. (In reality, there is likely a variety of contract lengths
to consider when formulating the selection criteria.)
• Average_balance > 150 selects only customers spending, on average, more than
$150 each month. The marketer deemed that it would be unprofitable to send the
offer to less valuable customers.
• promo9 is the name of a logged predictive model that was created with a Data
Mining application. This criterion includes a threshold score, 0.80, which a
customer must surpass to be considered "in the model." This third criterion limits
the campaign to just those customers in the model, i.e. those customers most
likely to require an inducement to prevent them from switching to a competitor.
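
The same segment definition can be sketched in code to show why dynamic scoring saves work: the two cheap filters run first, and the model is invoked only for records that pass them. The `Customer` class, the `promo9_score` function, and its scoring formula are hypothetical stand-ins; a real score would come from the model exported by the Data Mining application.

```python
from dataclasses import dataclass

@dataclass
class Customer:
    customer_id: str
    length_of_service: int   # months into the contract
    average_balance: float   # average monthly spend in dollars

def promo9_score(customer):
    """Stand-in for the logged model; the formula is purely illustrative."""
    return min(1.0, customer.average_balance / 400)

def select_segment(customers, score_fn, threshold=0.80):
    """Apply the cheap filters first, then score only the remaining records."""
    selected = []
    for c in customers:
        if c.length_of_service == 9 and c.average_balance > 150:
            if score_fn(c) > threshold:
                selected.append(c.customer_id)
    return selected

customers = [
    Customer("A", 9, 380.0),   # passes both filters, illustrative score 0.95
    Customer("B", 9, 200.0),   # passes both filters, illustrative score 0.50
    Customer("C", 4, 390.0),   # wrong contract month, never scored
    Customer("D", 9, 120.0),   # balance too low, never scored
]
print(select_segment(customers, promo9_score))
```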

Data Mining and Campaign Management in the real world


Ideally, marketers who build campaigns should be able to apply any model logged in the
Campaign Management system to a defined target segment. For example, a marketing
manager at a cellular telephone company might be interested in high-value customers
likely to switch to another carrier. This segment might be defined as customers who are
nine months into a twelve-month contract, and whose average monthly balance is more
than $150.

The easiest approach to retain these customers is to offer all of them a new
high-tech telephone. However, this is expensive and wasteful since many customers
would remain loyal without any incentive.


The Benefits of integrating Data Mining and Campaign Management

For marketers:
• Improved campaign results through the use of model scores that further
refine customer and prospect segments.
Records can be scored when campaigns are ready to run, allowing the use of the
most recent data. "Fresh" data and the selection of "high" scores within defined
market segments improve direct marketing results.
• Accelerated marketing cycle times that reduce costs and increase the
likelihood of reaching customers and prospects before competitors.
Scoring takes place only for records defined by the customer segment, eliminating
the need to score an entire database. This is important to keep pace with
continuously running marketing campaigns with tight cycle times.
Accelerated marketing "velocity" also increases the number of opportunities used
to refine and improve campaigns. The end of each campaign cycle presents
another chance to assess results and improve future campaigns.
• Increased accuracy through the elimination of manually induced errors. The
Campaign Management software determines which records to score and
when.

For statisticians:

• Less time spent on mundane tasks of extracting and importing files, leaving
more time for creative work: building and interpreting models. Statisticians have
a greater impact on the corporate bottom line.

As a database marketer, you understand that some customers present much greater profit
potential than others. But, how will you find those high-potential customers in a database
that contains hundreds of data items for each of millions of customers?

Data Mining software can help find the "high-profit" gems buried in mountains of
information. However, merely identifying your best prospects is not enough to improve
Instead, to reduce costs and improve results, the marketer could use a predictive model to
select only those valuable customers who would likely defect to a competitor unless they
receive the offer.


The Ten Steps of Data Mining

Here is a process for extracting hidden knowledge from your data warehouse, your
customer information file, or any other company database.

1. Identify The Objective

Before you begin, be clear on what you hope to accomplish with your
analysis. Know in advance the business goal of the data mining. Establish whether or
not the goal is measurable. Some possible goals are to
- Find sales relationships between specific products or services
- Identify specific purchasing patterns over time
- Identify potential types of customers
- Find product sales trends.

2. Select The Data

Once you have defined your goal, your next step is to select the data to
meet this goal. This may be a subset of your data warehouse or a data mart that
contains specific product information. It may be your customer information file.
Narrow the scope of the data to be mined as much as possible.

Here are some key issues.

- Are the data adequate to describe the phenomena the data mining analysis
is attempting to model?
- Can you enhance internal customer records with external lifestyle and
demographic data?
- Are the data stable; will the mined attributes be the same after the analysis?
- If you are merging databases, can you find a common field for linking
them?
- How current and relevant are the data to the business goal?


3. Prepare The Data

Once you’ve assembled the data, you must decide which attributes to
convert into usable formats. Consider the input of domain experts-creators and
users of the data.
- Establish strategies for handling missing data, extraneous noise, and outliers
- Identify redundant variables in the dataset and decide which fields to exclude
- Decide on a log or square transformation, if necessary
- Visually inspect the dataset to get a feel for the database
- Determine the distribution frequencies of the data
You can postpone some of these decisions until you select a data-
mining tool. For example, if you need a neural network or polynomial
network you may have to transform some of your fields.
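
Three of the preparation decisions above (missing data, outliers, and a log transformation) can be sketched with the Python standard library. The spend values are hypothetical, and median fill and the 1.5 x IQR rule are only common conventions assumed here, not methods prescribed by this report.

```python
import math
import statistics

# Hypothetical monthly spend values; None marks a missing entry.
raw_spend = [42.0, 55.0, None, 61.0, 48.0, 900.0, 50.0]

# 1. Missing data: fill gaps with the median of the observed values.
observed = [v for v in raw_spend if v is not None]
median = statistics.median(observed)
filled = [median if v is None else v for v in raw_spend]

# 2. Outliers: flag values more than 1.5 * IQR outside the quartiles.
q1, _, q3 = statistics.quantiles(filled, n=4)
iqr = q3 - q1
outliers = [v for v in filled if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]

# 3. Transformation: a log transform compresses the long right tail.
log_spend = [math.log(v) for v in filled]

print("median used for fill:", median)
print("flagged outliers:", outliers)
```

Whether a flagged value is dropped, capped, or kept is a judgment for the domain experts mentioned above.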

4. Audit The Data

Evaluate the structure of your data in order to determine the
appropriate tools.
- What is the ratio of categorical/binary attributes in the database?
- What is the nature and structure of the database?
- What is the overall condition of the dataset?
- What is the distribution of the dataset?

Balance the objective assessment of the structure of your data against your
users’ need to understand the findings. Neural nets, for example, don’t explain their results.

5. Select The Tools

Two concerns drive the selection of the appropriate data mining
tool: your business objectives and your data structure. Both should guide you to the
same tool. Consider these questions when evaluating a set of potential tools.

- Is the data set heavily categorical?
- What platforms do your candidate tools support?
- Are the candidate tools ODBC-compliant?
- What data formats can the tools import?

No single tool is likely to provide the answer to your data mining
project. Some tools integrate several technologies into a suite of statistical
analysis programs, a neural network, and a symbolic classifier.


6. Format The Solution

In conjunction with your data audit, your business objective and
the selection of your tool determine the format of your solution. The key questions
are:
- What is the optimum format of the solution: decision tree, rules, C code,
or SQL syntax?
- What are the available format options?
- What is the goal of the solution?
- What do the end-users need: graphs, reports, code?

7. Construct The Model

At this point the data mining processing begins. Usually the
first step is to use a random number seed to split the data into a training set and a
test set, and then to construct and evaluate a model. The generation of the classification rules,
decision trees, clustering sub-groups, scores, code, weights and evaluation data/error
rates takes place at this stage. Resolve these issues:
- Are error rates at acceptable level? Can you improve them?
- What extraneous attributes did you find? Can you purge them?
- Is additional data or a different methodology necessary?
- Will you have to train and test a new data set?
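
A minimal sketch of this step, assuming a toy data set and a one-attribute threshold model in place of a real mining algorithm: a fixed seed makes the split reproducible, the model is fitted on the training set only, and the error rate is then measured on the test set. All names and values here are hypothetical.

```python
import random

# Hypothetical labelled records: (monthly_minutes, churned) with churned in {0, 1}.
records = [(m, 1 if m < 200 else 0) for m in range(50, 450, 10)]

def split(records, test_fraction=0.25, seed=42):
    """Shuffle with a fixed seed so the split is reproducible, then cut."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def error_rate(cut, rows):
    """Share of rows where 'churn if minutes < cut' disagrees with the label."""
    return sum((1 if m < cut else 0) != label for m, label in rows) / len(rows)

def fit_threshold(train_rows):
    """Pick the cutoff that misclassifies the fewest training rows."""
    candidates = sorted({m for m, _ in train_rows})
    return min(candidates, key=lambda cut: error_rate(cut, train_rows))

train_set, test_set = split(records)
cutoff = fit_threshold(train_set)
print(f"cutoff={cutoff}, test error={error_rate(cutoff, test_set):.2%}")
```

If the test error is much worse than the training error, that is a signal to revisit the questions above before moving on to validation.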

8. Validate The Findings

Share and discuss the results of the analysis with the business
client or domain expert. Ensure that the findings are correct and appropriate to the
business objectives.
- Do the findings make sense?
- Do you have to return to any prior steps to improve results?
- Can you use other data mining tools to replicate the findings?

9. Deliver The Findings

Provide a final report to the business unit or client. The report
should include source code and rules. Some of the issues are:
- Will additional data improve the analysis?
- What strategic insight did you discover and how is it applicable?
- What proposals can result from the data mining analysis?
- Do your findings meet the business objective?


10. Integrate The Solution


Share the findings with all interested end-users in the appropriate
business units. You might wind up incorporating the results of the analysis into the
company’s business procedures. Some of the data mining solutions may involve
- SQL syntax for distribution to end-users
- C code incorporated into a production system
- Rules integrated into a decision support system.

Although data mining tools automate database analysis, they can lead
to faulty findings and erroneous conclusions if you’re not careful. Bear in mind that data
mining is a business process with a specific goal- to extract a competitive insight from
historical records in a database.


Evaluating the Benefits of a Data Mining Model

Other representations of the model often incorporate expected costs and expected
revenues to provide the most important measure of model quality: profitability. A
profitability graph like the one shown below can help determine how many prospects to
include in a campaign. In this example, it is easy to see that contacting all customers will
result in a net loss. However, selecting a threshold score of approximately 0.8 will
maximize profitability.

For a closer look at how the use of model scores can improve profitability, consider
an example campaign with the following assumptions:

• Database size: 2,000,000
• Maximum possible response: 40,000
• Cost to reach one customer: $1.00
• Profit margin from a positive response: $40.00

As the table below shows, a random sampling of the full customer/prospect database
produces a loss regardless of the campaign target size. However, by targeting customers
using a Data Mining model, the marketer can select a smaller target that includes a higher
percentage of good prospects. This more focused approach generates a profit until the
target becomes too large and includes too many poor prospects.

                           Random Selection                  Targeted Selection
Campaign Size  Cost        Response  Revenue     Net         Response  Revenue     Net
100,000        $100,000    2,000     $80,000     ($20,000)   4,000     $160,000    $60,000
400,000        $400,000    8,000     $320,000    ($80,000)   30,000    $1,200,000  $800,000
1,000,000      $1,000,000  20,000    $800,000    ($200,000)  35,000    $1,400,000  $400,000
2,000,000      $2,000,000  40,000    $1,600,000  ($400,000)  40,000    $1,600,000  ($400,000)
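
The figures in the table follow from the stated assumptions: revenue is the number of responses times $40.00, cost is the campaign size times $1.00, and net is revenue minus cost. Random selection loses money at every size because its response rate (40,000 / 2,000,000 = 2%) is below the break-even rate of $1.00 / $40.00 = 2.5%. The short sketch below recomputes the net column; the function and variable names are illustrative.

```python
COST_PER_CONTACT = 1.00
PROFIT_PER_RESPONSE = 40.00

# (campaign size, responses with random selection, responses with targeted selection)
scenarios = [
    (100_000, 2_000, 4_000),
    (400_000, 8_000, 30_000),
    (1_000_000, 20_000, 35_000),
    (2_000_000, 40_000, 40_000),
]

def net_profit(size, responses):
    """Revenue from responses minus the cost of contacting every target."""
    return responses * PROFIT_PER_RESPONSE - size * COST_PER_CONTACT

for size, random_resp, targeted_resp in scenarios:
    print(f"{size:>9,}  random: {net_profit(size, random_resp):>12,.0f}  "
          f"targeted: {net_profit(size, targeted_resp):>12,.0f}")

# Campaign size with the highest net profit under targeted selection.
best = max(scenarios, key=lambda s: net_profit(s[0], s[2]))
print("most profitable targeted size:", f"{best[0]:,}")
```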

The data mining suite

The Data Mining Suite™ is truly unique, providing the most powerful,
complete and comprehensive solution for enterprise-wide, large scale decision
support. It leads the world of discovery with the exceptional ability to directly
mine large multi-table SQL databases.

The Data Mining Suite works directly on large SQL repositories with no need
for sampling or extract files. It accesses large volumes of multi-table relational
data on the server, incrementally discovers powerful patterns and delivers
automatically generated English text and graphs as explainable documents on
the intranet.

The Data Mining Suite is based on a solid foundation with a total vision for
decision support. The three-tiered, server-based implementation provides
highly scalable discovery on huge SQL databases with well over 90% of the
computations performed directly on the server, in parallel if desired.


The Data Mining Suite relies on the genuinely unique mathematical
foundation we pioneered to usher in a new level of functionality for decision
support. This mathematical foundation has given rise to novel algorithms that
work directly on very large datasets, delivering unprecedented power and
functionality. The power of these algorithms allows us to discover rich
patterns of knowledge in huge databases that could have never been found
before.

With server-based discovery, the Data Mining Suite performs over 90%
of the analyses on the server, with SQL, C programs and Java. Discovery takes
place simultaneously along multiple dimensions on the server, and is not
limited by the power of the client. The system analyzes both relational and
multi-dimensional data, discovering highly refined patterns that reveal the real
nature of the dataset. Using built-in advanced mathematical techniques, these
findings are carefully merged by the system and the results are delivered to the
user in plain English, accompanied by tables and graphs that highlight the key
patterns.

The Data Mining Suite pioneered multi-dimensional data mining.
Before this, OLAP had usually been a multi-dimensional manual endeavor,
while data mining had been a single dimensional automated activity. The
Rule-based Influence Discovery System™ bridged the gap between OLAP and
data mining. This dramatic new approach forever changed the way
corporations use decision support. No longer are OLAP and data mining
viewed as separate activities, but are fused to deliver maximum benefit. The
patterns discovered by the system include multi-dimensional influences and
contributions, OLAP affinities and associations, comparisons, trends and
variations. The richness of these patterns delivers unparalleled business
benefits to users, allowing them to make better decisions than ever before.

The Data Mining Suite also pioneered the use of incremental pattern-
base population. With incremental data mining, the system automatically
discovers changes in patterns as well as the patterns of change. For instance,
each month sales data is mined and the changes in the sales trends as well as
the trends of change in how products sell together are added to the pattern-
base. Over time, this knowledge becomes a key strategic asset to the
corporation.


The Data Mining Suite currently consists of these modules:

• Rule-based Influence Discovery
• Dimensional Affinity Discovery
• Trend Discovery Module
• Incremental Pattern Discovery
• Forensic Discovery
• The Predictive Modeler

These truly unique products are all designed to work together, and in concert
with the Knowledge Access Suite™.

Rule-based Influence Discovery

The Rule-based Influence Discovery System is aware of both influences
and contributions along multiple dimensions and merges them in an intelligent
manner to produce very rich and powerful patterns that can not be obtained by
either OLAP or data mining alone. The system performs multi-table,
dimensional data mining at the server level, providing the best possible results.
The Rule-based Influence Discovery System is not a multi-dimensional
repository, but a data mining system. It accesses granular data in a large
database via standard SQL and reaches for multi-dimensional data via a
ROLAP approach of the user's choosing.

Dimensional Affinity Discovery

The Affinity Discovery System automatically analyzes large datasets
and finds association patterns that describe how various items "group
together" or "happen together". Flat affinity just tells us how items group
together, without providing logical conditions for the association. Dimensional
(OLAP) affinity is more powerful and describes the dimensional conditions
under which stronger item groupings take place. The Affinity Discovery
System includes a number of useful features that make it a unique industrial
strength product. These features include hierarchy and cluster definitions,
exclusion lists, unknown-value management, among others.


The OLAP Discovery System

The OLAP Discovery System is aware of both influences and
contributions along multiple dimensions and merges them in an intelligent
manner to produce very rich and powerful patterns that can not be obtained by
either OLAP or data mining alone. The system merges OLAP and data mining
at the server level, providing the best possible results. The OLAP Discovery
System is not an OLAP engine or a multi-dimensional repository, but a data
mining system. It accesses granular data in a large database via standard SQL
and reaches for multi-dimensional data via an OLAP/ROLAP engine of the
user's choosing.

Incremental Pattern Discovery

Incremental Pattern Discovery deals with temporal data segments that
gradually become available over time, e.g. once a week, once a month, etc.
Data is periodically supplied to the Incremental Discovery System in terms of a
"data snap-shot" which corresponds to a given time-segment, e.g. monthly
sales figures. Patterns in the data snap-shot are found on a monthly basis and
are added to the pattern-base. As new data becomes available (say once a
month) the system automatically finds new patterns, merges them with the
previous patterns, stores them in the pattern-base and notes the differences
from the previous time-periods.

Trend Discovery

Trend Discovery with the Data Mining Suite uncovers time-related
patterns that deal with change and variation of quantities and measures. The
system expresses trends in terms of time-grains, time-windows, slopes and
shapes. The time-grain defines the smallest grain of time to be considered, e.g.
a day, a week or a month. Time-windows define how time grains are grouped
together, e.g. we may look at daily trends with weekly windows, or we may
look at weekly grains with monthly windows. Slopes define how quickly a
measure is increasing or decreasing, while shapes give us various categories of
trend behavior, e.g. smoothly increasing vs. erratically changing.
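
The notions of time-grain, time-window, and slope can be illustrated with a small sketch, assuming daily grains grouped into weekly windows and a least-squares line fitted within each window. This is a generic illustration with made-up sales figures; it is not the Data Mining Suite's own algorithm, which the source does not describe.

```python
def slope(values):
    """Least-squares slope of values against their positions 0, 1, 2, ..."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den

def window_slopes(daily_values, window=7):
    """Group daily grains into fixed windows and return one slope per full window."""
    return [
        slope(daily_values[i:i + window])
        for i in range(0, len(daily_values) - window + 1, window)
    ]

# Hypothetical daily sales: one steadily rising week, then one flat week.
daily_sales = [10, 12, 14, 16, 18, 20, 22] + [30] * 7
print(window_slopes(daily_sales))
```

Classifying the shape of a trend (for example, smooth versus erratic) would need an additional measure such as the spread of the residuals around the fitted line.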


Forensic Discovery

Forensic Discovery with the Data Mining Suite relies on automatic


anomaly detection. The system first identifies what is usual and establishes a
set of norms through pattern discovery. The transactions or activities that
deviate from the norm are then identified as unusual. Business users can
discover where unusual activities may be originating and the proper steps can
be taken to remedy and control the problems. The automatic discovery of
anomalies is essential in that the ingenious tactics used to spread activities
across multiple transactions usually cannot be guessed beforehand.
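
The two steps (establish a norm, then flag deviations from it) can be sketched as follows. The report says the norms come from pattern discovery; this example substitutes a much simpler stand-in, a mean and standard deviation over past transaction amounts, and all names, the threshold and the figures are hypothetical.

```python
from statistics import mean, stdev

def establish_norm(history):
    """Summarise what is usual as a mean and a standard deviation."""
    return mean(history), stdev(history)

def unusual(transactions, norm, threshold=3.0):
    """Return transactions lying more than `threshold` deviations from the norm."""
    centre, spread = norm
    return [t for t in transactions if abs(t - centre) > threshold * spread]

# Hypothetical past transaction amounts used to learn the norm.
history = [100, 102, 98, 101, 99, 100, 103, 97]
norm = establish_norm(history)

# New activity is compared against the norm, not against itself.
flagged = unusual([101, 250, 96, 5], norm)
print(flagged)
```

Learning the norm from historical data and scoring new activity separately keeps a burst of unusual transactions from shifting the norm that is meant to expose them.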

Predictive Modeler

The Data Mining Suite Predictive Modeler makes predictions and
forecasts by using the rules and patterns which the data mining process
generates. While induction performs pattern discovery to generate rules, the
Predictive Modeler performs pattern matching to make predictions based on
the application of these rules. The predictive models produced by the system
have higher accuracy because the discovery process works on the entire
dataset and need not rely on sampling.
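
A minimal sketch of prediction by pattern matching, assuming rules are stored as simple IF/THEN records with a confidence value. The rule format, the confidence figures and the function names are illustrative assumptions; the report does not specify how the Predictive Modeler represents or ranks its rules.

```python
def matches(rule, record):
    """A rule matches when every one of its conditions holds for the record."""
    return all(record.get(field) == value for field, value in rule["if"].items())

def predict(rules, record, default=None):
    """Apply the highest-confidence rule that matches the record."""
    candidates = [r for r in rules if matches(r, record)]
    if not candidates:
        return default
    best = max(candidates, key=lambda r: r["confidence"])
    return best["then"]

# Hypothetical rules of the kind a discovery step might have generated.
rules = [
    {"if": {"day": "Saturday", "bought": "paintbrush"}, "then": "paint", "confidence": 0.8},
    {"if": {"bought": "paintbrush"}, "then": "tape", "confidence": 0.4},
]

print(predict(rules, {"day": "Saturday", "bought": "paintbrush"}))
print(predict(rules, {"day": "Monday", "bought": "paintbrush"}))
```

Discovery and prediction stay separate here: the rules are data produced earlier, and prediction only matches them against new records.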

The output from the seven component products of the Data Mining Suite is
stored within the pattern-base and is accessible with PQL: The Pattern Query
Language. Readable English text and graphs are automatically generated in
ASCII and HTML formats for delivery over the Internet or an intranet.


The Data Mining Suite is Unique


The Reasons for the Multi-faceted Power

The products in the Data Mining Suite deliver the most advanced and scalable
technologies within a user-friendly environment. The specific reasons draw on the solid
mathematical foundation that Information Discovery, Inc. pioneered and on a highly
scalable implementation.

The Data Mining Suite is distinguished by the following unique capabilities:

Direct Access to Very Large SQL Databases

The Data Mining Suite works directly on very large SQL databases and does not
require samples, extracts and/or flat files. This alleviates the problems associated
with flat files which lose the SQL engine's power (e.g. parallel execution) and
which provide marginal results. Another advantage of working on an SQL
database is that the Data Mining Suite has the ability to deal with both numeric
and non-numeric data uniformly. The Data Mining Suite does not fix the ranges in
numerical data beforehand, but finds ranges in the data dynamically by itself.

Multi-Table Discovery

The Data Mining Suite discovers patterns in multi-table SQL databases without
having to join and build an extract file. This is a key issue in mining large
databases. The world is full of multi-table databases which can not be joined and
meshed into a single view. In fact, the theory of normalization came about
because data needs to be in more than one table. Using single tables is an affront
to all the work of E.F. Codd on database design. If you challenge the DBA in a
really large database to put things in a single table you will either get a laugh or a
blank stare -- in many cases the database size will balloon beyond control. In fact,
there are many cases where no single view can correctly represent the semantics
of influence because the ratios will always be off regardless of how you join. The
Data Mining Suite leads the world of discovery with the unique ability to mine
large multi-table databases.
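
The remark that ratios go wrong after a join can be seen in a small example. The sketch below uses Python's built-in sqlite3 module with two hypothetical tables; the schema, the data and the helper name are assumptions chosen only to show the effect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'East'), (2, 'West');
    INSERT INTO orders VALUES (1, 1), (2, 1), (3, 1), (4, 2);
""")

def share_east(sql):
    """Fraction of returned rows whose region is 'East'."""
    rows = conn.execute(sql).fetchall()
    return sum(1 for (region,) in rows if region == "East") / len(rows)

# Measured on the customer table, half of the customers are in the East.
true_share = share_east("SELECT region FROM customers")

# After joining to orders, each customer is repeated once per order,
# so the same ratio computed on the joined view is distorted.
joined_share = share_east(
    "SELECT c.region FROM customers c JOIN orders o ON o.customer_id = c.id"
)
print(true_share, joined_share)
```

The joined view counts the eastern customer three times, once per order, so a pattern measured on it describes orders rather than customers.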


No Sampling or Extracts

Sampling theory was invented because one could not have access to the
underlying population being analyzed. But a warehouse is there to provide such
access.

General and Powerful Patterns

The format of the patterns discovered by the Data Mining Suite is very general
and goes far beyond decision trees or simple affinities. The advantage to this is
that the general rules discovered are far more powerful than decision trees.
Decision trees are very limited in that they cannot find all the information in the
database. Being rule-based keeps the Data Mining Suite from being constrained to
one part of a search space and makes sure that many more clusters and patterns
are found -- allowing the Data Mining Suite to provide more information and
better predictions.

Language of Expression

The Data Mining Suite has a powerful language of expression, going several
times beyond what most other systems can handle. For instance, for logical
statements it can express statements such as "IF Destination State = Departure
State THEN..." or "IF State is not Arizona THEN ...". Surprisingly, most other data
mining systems cannot express these simple patterns. The Data Mining Suite also
pioneered dimensional affinities such as "IF Day = Saturday WHEN PaintBrush is
purchased ALSO Paint is purchased". Again, most other systems cannot handle
this obvious logic.

Uniform Treatment of Numeric and Non-numeric Data

The Data Mining Suite is unique in its ability to deal with various data types in a
uniform manner. It can smoothly deal with a large number of non-numeric values
and also automatically discovers ranges within numeric data. Moreover, the Data
Mining Suite does not fix the ranges in numerical data but discovers interesting
ranges by itself. For example, given the field Age, the Data Mining Suite does not
expect this to be broken into 3 segments of (1-30), (31-60), (61 and above).
Instead it may find two ranges such as (27-34) and (48-61) as important in the
data set and will use these in addition to the other ranges.


Use of Data Dependencies

Should a data mining system be aware of the functional (and other) dependencies
that exist in a database? Yes, very much so. The use of these dependencies
can significantly enhance the power of a discovery system -- in fact ignoring them
can lead to confusion. The Data Mining Suite takes advantage of data
dependencies.

Server-based Architectures

The Data Mining Suite has a three level client server architecture whereby the
user interface runs on a thin intranet client and the back-end process for analysis
is done on a Unix server. The majority of the processing time is spent on the
server, where the computations run using both parallel SQL and non-SQL calls
managed by the Data Mining Suite itself. Only about 50% of the computations on
the server are SQL-based; the other statistical computations are
managed by the Data Mining Suite program itself, at times by starting separate
processes on different nodes of the server.

System Initiative

The Data Mining Suite uses system initiative in the data mining process. It forms
hypotheses automatically based on the character of the data and converts the
hypotheses into SQL statements forwarded to the RDBMS for execution. The
Data Mining Suite then selects the significant patterns and filters out the
unimportant trends.
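
The loop described above (form hypotheses from the data, turn them into SQL, keep only the significant results) might look roughly like this. The sketch uses Python's built-in sqlite3 module in place of a production RDBMS; the table, the support and confidence thresholds and the function names are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day TEXT, item TEXT, returned INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("Sat", "paint", 0), ("Sat", "paint", 0), ("Sat", "brush", 1),
     ("Mon", "paint", 1), ("Mon", "brush", 1), ("Mon", "brush", 1)],
)

def hypotheses(conn, table, fields):
    """Form one 'field = value' hypothesis per distinct value found in the data."""
    # Table and field names come from the known schema, never from user input.
    for field in fields:
        for (value,) in conn.execute(f"SELECT DISTINCT {field} FROM {table}"):
            yield field, value

def evaluate(conn, table, field, value, target):
    """Let the database compute support and confidence for one hypothesis."""
    sql = f"SELECT COUNT(*), AVG({target}) FROM {table} WHERE {field} = ?"
    return conn.execute(sql, (value,)).fetchone()

def significant(conn, table, fields, target, min_support=2, min_conf=0.9):
    """Keep only hypotheses that are both frequent and reliable."""
    kept = []
    for field, value in hypotheses(conn, table, fields):
        support, confidence = evaluate(conn, table, field, value, target)
        if support >= min_support and confidence >= min_conf:
            kept.append((field, value, support, confidence))
    return sorted(kept)

found = significant(conn, "sales", ["day", "item"], "returned")
print(found)
```

The counting is done by the database engine; the program only generates the statements and filters the answers.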

Transparent Discovery and Predictions

The Data Mining Suite provides explanations as to how the patterns are being
derived. This is unlike neural nets and other opaque techniques in which the
mining process is a mystery. Also, when performing predictions, the results are
transparent. Many business users insist on understandable and transparent results.

Not Noise Sensitive

The Data Mining Suite is not sensitive to noise because internally it uses fuzzy
logic analysis. As the data gathers noise, the Data Mining Suite will only reduce
the level of confidence associated with the results provided. However, it will still
produce the most significant findings from the data set.


Analysis of Large Databases

The Data Mining Suite has been specifically tuned to work on databases with an
extremely large number of rows. It can deal with data sets of 50 to 100 million
records on parallel machines. It derives its capabilities from the fact that it does
not need to write extracts and uses SQL statements to perform its process.
Generally the analyses performed in the Data Mining Suite are performed on
about 50 to 120 variables and 30 to 100 million records directly. The number of
records can, however, be increased by using the specific optimization options that
the Data Mining Suite provides for very large databases.

These unique features and benefits make the Data Mining Suite the ideal solution for
large-scale Data Mining in business and industry.


What Data Mining Can’t Do

Data mining is a tool, not a magic wand. It won’t sit in your database watching
what happens and send you e-mails to get your attention when it sees an interesting
pattern. It doesn’t eliminate the need to know your business, to understand your data, or
to understand analytical methods. Data mining assists business analysts with finding
patterns and relationships in the data; it does not tell you the value of the patterns to
the organization. Furthermore, the patterns uncovered by data mining must be verified in
the real world.

Remember that the predictive relationships found via data mining are not
necessarily causes of an action or behavior. For example, data mining might determine
that males with incomes between $50,000 and $65,000 who subscribe to certain
magazines are likely purchasers of a product you want to sell. While you can take
advantage of this pattern, say by aiming your marketing at people who fit the pattern,
you should not assume that any of these factors cause them to buy your product.

To ensure meaningful results, it’s vital that you understand your data. The quality
of your output will often be sensitive to outliers (data values that are very different from
the typical values in your database), irrelevant columns or columns that vary together
(such as age and date of birth), the way you encode your data, and the data you leave in
and the data you exclude. Algorithms vary in their sensitivity to such data issues, but it is
unwise to depend on a data-mining product to make all the right decisions on its own.
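
One of the checks mentioned above, finding columns that vary together, can be done before any mining starts. The sketch below computes the Pearson correlation between every pair of numeric columns and lists the pairs that are nearly redundant. The column names, the 0.95 threshold and the sample values are assumptions for this example.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def redundant_pairs(columns, threshold=0.95):
    """Return column pairs whose absolute correlation reaches the threshold."""
    names = sorted(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(columns[a], columns[b])) >= threshold:
                pairs.append((a, b))
    return pairs

# Hypothetical customer columns; age and birth_year carry the same information.
columns = {
    "age": [25, 40, 31, 58, 47],
    "birth_year": [1999, 1984, 1993, 1966, 1977],
    "income": [30, 52, 75, 41, 58],
}
redundant = redundant_pairs(columns)
print(redundant)
```

A pair reported here is a candidate for dropping one of its columns, a decision that still rests with someone who understands the data.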

Data mining will not automatically discover solutions without guidance. Rather
than setting the vague goal, “Help improve the response to my direct mail solicitation”,
you might use data mining to find the characteristics of people who (1) respond to your
solicitation, or (2) respond AND make a large purchase. The patterns data mining finds for
those two goals may be very different.

Although a good data mining tool shelters you from the intricacies of statistical
techniques, it requires you to understand the working of the tools you choose and the
algorithms on which they are based. The choices you make in setting up your data mining
tool and the optimizations you choose will affect the accuracy and speed of your models.


Difficulties of Implementing Data Warehouses

Some significant operational issues arise with data warehousing: construction,
administration, and quality control. Project management (the design, construction, and
implementation of the warehouse) is an important and challenging consideration that
should not be underestimated. The building of an enterprise-wide warehouse in a large
organization is a major undertaking, potentially taking years from conceptualization to
implementation. Because of the difficulty and amount of lead time required for such an
undertaking, the widespread development and deployment of data marts may provide an
attractive alternative, especially to those organizations with urgent needs for OLAP,
DSS, and/or data mining support.

The administration of a data warehouse is an intensive enterprise, proportional to
the size and complexity of the warehouse. An organization that attempts to administer a
data warehouse must realistically understand the complex nature of its administration.
Although designed for read access, a data warehouse is no more a static structure than
any of its information sources. Source databases can be expected to evolve. The warehouse
schema and acquisition component must be expected to be updated to handle these
evolutions.

A significant issue in data warehousing is the quality control of data. Both quality
and consistency of data are major concerns. Although the data passes through a cleaning
function during acquisition, quality and consistency remain significant issues for the
database administrator. Melding data from heterogeneous and disparate sources is a major
challenge given differences in naming, domain definitions, identification numbers, and
the like. Every time a source database changes, the warehouse administrator must
consider the possible interactions with other elements of the warehouse.

Usage projections should be estimated conservatively prior to construction of the
data warehouse and should be revised continually to reflect current requirements. As
utilization patterns become clear and change over time, storage and access paths can be
tuned to remain optimized for support of the organization’s use of its warehouse. This
activity should continue throughout the life of the warehouse in order to remain ahead of
demand. The warehouse should also be designed to accommodate addition and attrition
of data sources without major redesign.


Administration of a data warehouse will require far broader skills than are needed
for traditional database administration. A team of highly skilled technical experts with
overlapping areas of expertise will likely be needed, rather than a single individual. Like
database administration, data warehouse administration is only partly technical; a large
part of the responsibility requires working efficiently with all the members of the
organization with an interest in the data warehouse. However difficult that can be at times
for database administrators, it is that much more challenging for data warehouse
administrators, as the scope of their responsibilities is considerably broader.

Design of the management function and selection of the management team for a
data warehouse are crucial. Managing the data warehouse in a large organization will
surely be a major task. Many commercial tools are already available to support
management functions. Effective data warehouse management will certainly be a team
function, requiring a wide set of technical skills, careful coordination, and effective
leadership. Just as we must prepare for the evolution of the warehouse, we must also
recognize that the skills of the management team will, of necessity, evolve with it.


The Scope of Data Mining

Data mining derives its name from the similarities between searching for valuable
business information in a large database — for example, finding linked products in
gigabytes of store scanner data — and mining a mountain for a vein of valuable ore. Both
processes require either sifting through an immense amount of material, or intelligently
probing it to find exactly where the value resides. Given databases of sufficient size and
quality, data mining technology can generate new business opportunities by providing
these capabilities:

• Automated prediction of trends and behaviors. Data mining automates the process
of finding predictive information in large databases. Questions that traditionally
required extensive hands-on analysis can now be answered quickly and directly from
the data. A typical example of a predictive problem is targeted marketing. Data
mining uses data on past promotional mailings to identify the targets most likely
to maximize return on investment in future mailings. Other predictive problems
include forecasting bankruptcy and other forms of default, and identifying
segments of a population likely to respond similarly to given events.

• Automated discovery of previously unknown patterns. Data mining tools sweep
through databases and identify previously hidden patterns in one step. An example
of pattern discovery is the analysis of retail sales data to identify seemingly
unrelated products that are often purchased together. Other pattern discovery
problems include detecting fraudulent credit card transactions and identifying
anomalous data that could represent data entry keying errors.

Data mining techniques can yield the benefits of automation on existing software and
hardware platforms, and can be implemented on new systems as existing platforms are
upgraded and new products developed. When data mining tools are implemented on high
performance parallel processing systems, they can analyze massive databases in minutes.
Faster processing means that users can automatically experiment with more models to
understand complex data. High speed makes it practical for users to analyze huge
quantities of data. Larger databases, in turn, yield improved predictions.

Databases can be larger in both depth and breadth:
• More columns. Analysts must often limit the number of variables they examine
when doing hands-on analysis due to time constraints. Yet variables that are
discarded because they seem unimportant may carry information about unknown
patterns. High performance data mining allows users to explore the full depth of a
database, without preselecting a subset of variables.
• More rows. Larger samples yield lower estimation errors and variance, and allow
users to make inferences about small but important segments of a population.


A recent Gartner Group Advanced Technology Research Note listed data mining
and artificial intelligence at the top of the five key technology areas that "will clearly
have a major impact across a wide range of industries within the next 3 to 5 years." [2]
Gartner also listed parallel architectures and data mining as two of the top 10 new
technologies in which companies will invest during the next 5 years. According to a
recent Gartner HPC Research Note, "With the rapid advance in data capture, transmission
and storage, large-systems users will increasingly need to implement new and innovative
ways to mine the after-market value of their vast stores of detail data, employing MPP
[massively parallel processing] systems to create new sources of business advantage (0.9
probability)." [3]

The most commonly used techniques in data mining are:


• Artificial neural networks: Non-linear predictive models that learn through
training and resemble biological neural networks in structure.
• Decision trees: Tree-shaped structures that represent sets of decisions. These
decisions generate rules for the classification of a dataset. Specific decision tree
methods include Classification and Regression Trees (CART) and Chi Square
Automatic Interaction Detection (CHAID).
• Genetic algorithms: Optimization techniques that use processes such as genetic
combination, mutation, and natural selection in a design based on the concepts of
evolution.
• Nearest neighbor method: A technique that classifies each record in a dataset
based on a combination of the classes of the k record(s) most similar to it in a
historical dataset (where k ≥ 1). Sometimes called the k-nearest neighbor
technique.
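
Of the techniques listed above, the nearest neighbor method is the simplest to show in full. The sketch below classifies a record by majority vote among the k most similar records in a historical dataset, using Euclidean distance as the similarity measure. The record layout (age, income in thousands), the labels and the choice of distance are assumptions for this example.

```python
from collections import Counter
from math import dist

def knn_classify(history, query, k=3):
    """Classify `query` by majority vote among its k nearest historical records."""
    nearest = sorted(history, key=lambda rec: dist(rec[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical historical records: (age, income in thousands) -> responded?
history = [
    ((25, 30), "no"), ((27, 35), "no"), ((30, 32), "no"),
    ((52, 80), "yes"), ((55, 90), "yes"), ((60, 85), "yes"),
]

print(knn_classify(history, (28, 33)))
print(knn_classify(history, (58, 88)))
```

In practice the fields would usually be rescaled first, so that a field with a large numeric range does not dominate the distance.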


Profitable Applications

A wide range of companies have deployed successful applications of data
mining. While early adopters of this technology have tended to be in
information-intensive industries such as financial services and direct mail marketing,
the technology is applicable to any company looking to leverage a large data warehouse
to better manage its customer relationships. Two critical factors for success with data
mining are: a large, well-integrated data warehouse and a well-defined understanding of
the business process within which data mining is to be applied (such as customer
prospecting, retention, campaign management, and so on).

Some successful application areas include:

• A pharmaceutical company can analyze its recent sales force activity and their
results to improve targeting of high-value physicians and determine which
marketing activities will have the greatest impact in the next few months. The
data needs to include competitor market activity as well as information about the
local health care systems. The results can be distributed to the sales force via a
wide-area network that enables the representatives to review the
recommendations from the perspective of the key attributes in the decision
process. The ongoing, dynamic analysis of the data warehouse allows best
practices from throughout the organization to be applied in specific sales
situations.

• A credit card company can leverage its vast warehouse of customer transaction
data to identify customers most likely to be interested in a new credit product.
Using a small test mailing, the attributes of customers with an affinity for the
product can be identified. Recent projects have indicated more than a 20-fold
decrease in costs for targeted mailing campaigns over conventional approaches.

• A diversified transportation company with a large direct sales force can apply data
mining to identify the best prospects for its services. Using data mining to analyze
its own customer experience, this company can build a unique segmentation
identifying the attributes of high-value prospects. Applying this segmentation to a
general business database such as those provided by Dun & Bradstreet can yield a
prioritized list of prospects by region.


• A large consumer packaged goods company can apply data mining to improve its
sales process to retailers. Data from consumer panels, shipments, and competitor
activity can be applied to understand the reasons for brand and store switching.
Through this analysis, the manufacturer can select promotional strategies that best
reach its target customer segments.

Each of these examples has a clear common ground. They leverage the
knowledge about customers implicit in a data warehouse to reduce costs and improve the
value of customer relationships. These organizations can now focus their efforts on the
most important (profitable) customers and prospects, and design targeted marketing
strategies to best reach them.

Functionality of Data Warehouses

Data warehouses exist to facilitate complex, data-intensive, and frequent ad hoc queries.
Accordingly, data warehouses must provide far greater and more efficient query support
than is demanded of transactional databases. The data warehouse access component
supports enhanced spreadsheet functionality, efficient query processing, structured
queries, ad hoc queries, data mining, and materialized views. In particular, enhanced
spreadsheet functionality includes support for state-of-the-art spreadsheet
applications (e.g., MS Excel) as well as for OLAP application programs. These offer
preprogrammed functionalities such as the following:
• Roll-up: Data is summarized with increasing generalization (weekly to quarterly
to annually).
• Drill-down: Increasing levels of detail are revealed (the complement of roll-up).
• Pivot: Cross tabulation (also referred to as rotation) is performed.
• Slice and dice: Projection operations are performed on the dimensions.
• Sorting: Data is sorted by ordinal value.
• Selection: Data is available by value or range.
• Derived (computed) attributes: Attributes are computed by operations on stored
and derived values.
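
Several of the operations listed above reduce to grouping the same facts by different sets of dimensions. The following sketch shows roll-up, drill-down, pivot and a simple slice on a tiny in-memory fact table. The table, its dimension names and the helper functions are hypothetical, and a real OLAP engine would work from precomputed aggregates rather than rescanning rows.

```python
from collections import defaultdict

# Hypothetical fact table: one row per (quarter, month, region) with a sales measure.
facts = [
    {"quarter": "Q1", "month": "Jan", "region": "East", "sales": 100},
    {"quarter": "Q1", "month": "Feb", "region": "East", "sales": 120},
    {"quarter": "Q1", "month": "Jan", "region": "West", "sales": 80},
    {"quarter": "Q2", "month": "Apr", "region": "West", "sales": 90},
]

def aggregate(rows, dims, measure="sales"):
    """Sum the measure over rows grouped by the given dimensions."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[d] for d in dims)] += row[measure]
    return dict(totals)

def slice_rows(rows, **fixed):
    """Keep only the rows where each named dimension has the given value."""
    return [r for r in rows if all(r[d] == v for d, v in fixed.items())]

roll_up = aggregate(facts, ["quarter"])                 # coarser grain
drill_down = aggregate(facts, ["quarter", "month"])     # finer grain
pivot = aggregate(facts, ["region", "quarter"])         # cross tabulation
east_by_month = aggregate(slice_rows(facts, region="East"), ["month"])

print(roll_up)
print(drill_down)
```

Roll-up and drill-down differ only in how many dimensions are kept in the grouping key, which is why the text calls one the complement of the other.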


Glossary of Data Mining Terms

Analytical model:-
A structure and process for analyzing a dataset. For example, a decision tree is a model
for the classification of a dataset.

Data cleansing:-
The process of ensuring that all values in a dataset are consistent and correctly recorded.

Anomalous data:-
Data that result from errors (for example, data entry keying errors) or that represent
unusual events. Anomalous data should be examined carefully because they may carry
important information.

Artificial neural networks:-
Non-linear predictive models that learn through training and resemble biological neural
networks in structure.

Data visualization:-
The visual interpretation of complex relationships in multidimensional data.

CHAID:-
Chi Square Automatic Interaction Detection. A decision tree technique used for
classification of a dataset. Provides a set of rules that you can apply to a new
(unclassified) dataset to predict which records will have a given outcome. Segments a
dataset by using chi square tests to create multi-way splits. Preceded, and requires more
data preparation than, CART.

Data mining:-
The extraction of hidden predictive information from large databases.

CART:-
Classification and Regression Trees. A decision tree technique used for classification of a
dataset. Provides a set of rules that you can apply to a new (unclassified) dataset to
predict which records will have a given outcome. Segments a dataset by creating 2-way
splits. Requires less data preparation than CHAID.

Data warehouse:-
A system for storing and delivering massive quantities of data.


Multidimensional database:-
A database designed for on-line analytical processing. Structured as a multidimensional
hypercube with one axis per dimension.

Nearest neighbor:-
A technique that classifies each record in a dataset based on a combination of the classes
of the k record(s) most similar to it in a historical dataset (where k ≥ 1). Sometimes called
a k-nearest neighbor technique.

OLAP:-
On-line analytical processing. Refers to array-oriented database applications that allow
users to view, navigate through, manipulate, and analyze multidimensional databases.

Predictive model:-
A structure and process for predicting the values of specified variables in a dataset.

Prospective data analysis:-
Data analysis that predicts future trends, behaviors, or events based on historical data.

RAID:-
Redundant Array of Inexpensive Disks. A technology for the efficient parallel storage of
data for high-performance computer systems.


Conclusion:

Data mining offers great promise in helping organizations uncover patterns hidden
in their data that can be used to predict the behavior of customers, products and
processes. However, data mining tools need to be guided by users who understand the
business, the data, and the general nature of the analytical methods involved. Realistic
expectations can yield rewarding results across a wide range of applications, from
improving revenues to reducing costs.

Building models is only one step in knowledge discovery. It’s vital to properly
collect and prepare the data, and to check the models against the real world. The “best”
model is often found after building models of several different types, or by trying
different technologies or algorithms.

Choosing the right data mining products means finding a tool with good basic
capabilities, an interface that matches the skill level of the people who’ll be using it, and
features relevant to your specific business problems. After you have narrowed down the
list of potential solutions, get a hands-on trial of the likeliest ones.

Comprehensive data warehouses that integrate operational data with customer,
supplier, and market information have resulted in an explosion of information.
Competition requires timely and sophisticated analysis on an integrated view of the data.
However, there is a growing gap between more powerful storage and retrieval systems
and the users’ ability to effectively analyze and act on the information they contain. Both
relational and OLAP technologies have tremendous capabilities for navigating massive
data warehouses, but brute force navigation of data is not enough. A new technological
leap is needed to structure and prioritize information for specific end-user problems. The
data mining tools can make this leap. Quantifiable business benefits have been proven
through the integration of data mining with current information systems, and new
products are on the horizon that will bring this integration to an even wider audience of
users.


BIBLIOGRAPHY

Introduction to Data Mining and Knowledge Discovery
By: Two Crows Corporation
Fundamentals of Database Systems
By: Ramez Elmasri
www.datamining.com
www.data-mine.com
www.threatling.com
www.research.microsoft.com/datamine
www.dwinfocenter.org