A REPORT ON

St. Francis Institute of Management and Research
Mount Poinsur, S.V.P. Road, Borivali (West),
Mumbai-400103

Winter Project (Information Technology Studies)


Report

Title:

Prepared for Mumbai University in partial fulfillment
of the requirement for the award of the degree of
MASTER OF MANAGEMENT STUDIES

SUBMITTED BY

Patel Pramod Rameshchandra


Roll No: 38 Year: 2009-11

Under the Guidance of

Prof. Manoj Mathew.

St. Francis Institute of
Management and Research

Certificate of Merit

This is to certify that the work presented in this project is the individual work of Mr. Patel Pramod Rameshchandra, Roll No. 38, MMS-II, who has worked during Semester IV of the year 2010-11 in the college.

Date:

[Internal In-charge]          [External In-charge]

[College Stamp]               [Director]
Acknowledgment

I would like to express my sincere gratitude toward the MBA department of the St. Francis Institute of Management and Research for encouraging me in the development of this project. I would like to thank our Director, Dr. Thomas Mathew, my internal project guide, Prof. Manoj Mathew, and our faculty coordinator, Prof. Vaishali Kulkarni, for all their help and co-operation.

Above all, I would not like to miss this precious opportunity to thank Prof. Thomas Mathew, Prof. Sinimole, M. F. Kumbar, Sherli Biju, Mohini Ozarkar and Steve Halge (our librarians), my friends Mr. Subandu K. Maity, Mr. Durgesh Tanna, Miss Hiral Shah, Mr. Narinder Singh Kabo, Miss Radhika S. Appaswamy, Miss Payal P. Patel, Miss Bhagyalaxmi Subramaniam, Mrs. Soma L. Joshua, and my parents for helping, guiding and supporting me in all problems.
Executive Summary

Data mining is a process that uses a variety of data analysis tools to discover knowledge, patterns and relationships in data that may be used to make valid predictions. With the popularity of object-oriented database systems in database applications, it is important to study data mining methods for object-oriented databases. Traditional Database Management Systems (DBMSs) have limitations when handling complex information and user-defined data types, which can be addressed by incorporating object-oriented programming concepts into existing databases. Classification is a well-established data mining task that has been extensively studied in statistics, decision theory and the machine learning literature. This study focuses on the design of an object-oriented database through the incorporation of object-oriented programming concepts, namely inheritance and polymorphism, into existing relational databases. The object-oriented database is designed in such a way that the design itself aids efficient data mining. Our main objective is to reduce the implementation overhead and the memory space required for storage when compared to traditional databases.

Purpose of the study

The purpose of this study is to find an effective way of performing data mining using an object-oriented database, and to improve CRM using data mining.

Significance of the study

This work will provide additional information for the database administrator who is engaged in improving the way data is mined from a data warehouse and in handling data mining effectively. This research is not intended to replace or duplicate existing work; rather, its outcome can help to complement the work of the business analyst.
Objective of the project

The general objective of this project is to investigate and recommend a suitable way of data mining. The data mining solution proposed in this study could help support a data mining process as well as contribute to building a smooth way of data handling within an organization. In this work, the data mining implementations of other companies were investigated through CRM magazine issues from the current and previous year.

In order to meet the general objective of this project, the following key activities must be carried out:

• To study and understand the basic concepts of databases, data warehouses and data mining
• To study and understand the object-oriented database
• To design a simple object-oriented database
• To do effective data mining in the designed object-oriented database
• To hit upon an effective, memory-saving way of data mining using an object-oriented database
• To apply effective data mining to succeed in CRM
• To build profitable customer relationships with data mining
Limitation of the project

This project does not focus on the whole database design; it focuses only on three tables, namely the Customers, Suppliers and Employees tables. In a real scenario, however, a database contains not just three but many tables.
Need for study

Data mining's roots are traced back along three family lines. The longest of these three lines is classical statistics. Without statistics, there would be no data mining, as statistics are the foundation of most technologies on which data mining is built. Classical statistics embraces concepts such as regression analysis, standard distribution, standard deviation, standard variance, discriminant analysis, cluster analysis, and confidence intervals, all of which are used to study data and data relationships. These are the very building blocks with which more advanced statistical analyses are underpinned. Certainly, within the heart of today's data mining tools and techniques, classical statistical analysis plays a significant role.

Data mining's second longest family line is artificial intelligence, or AI. This discipline, which is built upon heuristics as opposed to statistics, attempts to apply human-thought-like processing to statistical problems. Because this approach requires vast computer processing power, it was not practical until the early 1980s, when computers began to offer useful power at reasonable prices. AI found a few applications in the very high-end scientific/government markets, but the required supercomputers of the era priced AI out of the reach of virtually everyone else. The notable exceptions were certain AI concepts which were adopted by some high-end commercial products, such as query optimization modules for Relational Database Management Systems (RDBMS).
The third family line of data mining is machine learning, which is
more accurately described as the union of statistics and AI. While
AI was not a commercial success, its techniques were largely co-
opted by machine learning. Machine learning, able to take
advantage of the ever-improving price/performance ratios offered
by computers of the 80s and 90s, found more applications because
the entry price was lower than AI. Machine learning could be
considered an evolution of AI, because it blends AI heuristics with
advanced statistical analysis. Machine learning attempts to let computer programs learn about the data they study, such that programs make different decisions based on the qualities of the studied data, using statistics for fundamental concepts and adding more advanced AI heuristics and algorithms to achieve its goals.

Data mining, in many ways, is fundamentally the adaptation of machine learning techniques to business applications. Data mining is best described as the union of historical and recent developments in statistics, AI, and machine learning. These techniques are then used together to study data and find previously hidden trends or patterns within. Data mining is finding increasing acceptance in science and business areas which need to analyze large amounts of data to discover trends which analysts could not otherwise find.
Methodology

This is primary research. In this project I used the exploratory research technique. This technique is closely related to tracking and is used in qualitative research projects. Exploratory research provides insights into and comprehension of an issue or situation; definitive conclusions should be drawn from it only with extreme caution. The exploratory research technique was used because the problem has not been clearly defined. The secondary data was collected by reviewing magazines and articles.

The Internet was used as a source for most of the material relevant to the issues involved in the study.
Analysis

The following are the major activities of this project:

Task I – The Literature / Computer Weekly Magazines / Articles Review

• To study the significance of having a good object-oriented database design
• Review the literature, computer monthly newspapers, CRN magazines and articles
• Review other relevant ways of data mining an object-oriented database

Task II – Problem Analysis

This is the first and base stage of the project. At this stage, requirement elicitation is conducted. Potential problem areas in designing the database are identified. Technological, social, and educational elements are identified and examined. Alternatives are explored.

• Information and data collected is analyzed.
• An object-oriented database design criterion is developed.
• An effective way of data mining using an object-oriented database is evaluated.

Task III – Proposed Effective Way of Data Mining Using Object-Oriented Database

Propose an effective way of data mining in an object-oriented database.
Introduction to Database

The Database Management System

A Database Management System is a collection of software tools intended for the efficient storage and retrieval of data in a computer system. Some of the important concepts involved in the design and implementation of a Database Management System are discussed below.

The Database

"A database is an integrated collection of automated data files related to one another in the support of a common purpose."

"A database is a collection of information that is organized so that it can easily be accessed, managed, and updated. In one view, databases can be classified according to types of content: bibliographic, full-text, numeric, and images."

Each file in a database is made up of data elements – numbers, dates, amounts, quantities, names, addresses and other identifiable items of data.

The smallest component of data in a computer is the bit, a binary element with the values 0 and 1. Bits are used to build bytes, which are used to build data elements. Data files contain records that are made up of data elements, and a database consists of files. Starting from the highest level, the hierarchy is as follows:

1. Database
2. File
3. Record
4. Data element
5. Character (byte)
6. Bit
The Data Element

A data element is a place in a file used to store an item of information that is uniquely identifiable by its purpose and contents. A data value is the information stored in a data element. The data element has functional relevance to the application being supported by the database.

The Data Element Dictionary

"A data element dictionary is a table of data elements including at least the names, data types and lengths of every data element in the subject database."

The data element dictionary is central to the application of the database management tools. It forms the basic database schema, or the meta-data, which is the description of the database. The DBMS constantly refers to this data element dictionary when interpreting the data stored in the database. A minimal sketch of such a dictionary is given at the end of this section.

The Data Element Types

A database management system supports a variety of data types. Examples of common data element types are numeric, alphanumeric, character strings, date and time.
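
Tying these ideas together, here is a minimal sketch in Python (the field names are invented for the illustration) of a data element dictionary recording the name, data type and length of each data element:

    # A minimal, hypothetical data element dictionary: for each data
    # element we record at least its name, data type and length.
    data_element_dictionary = {
        "customer_id":   {"type": "numeric",      "length": 10},
        "customer_name": {"type": "alphanumeric", "length": 40},
        "birth_date":    {"type": "date",         "length": 8},
        "balance":       {"type": "numeric",      "length": 12},
    }

    def describe(element_name):
        """Look up an element, much as a DBMS consults its meta-data."""
        meta = data_element_dictionary[element_name]
        return f"{element_name}: {meta['type']}({meta['length']})"

    print(describe("customer_name"))  # customer_name: alphanumeric(40)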

Files

A database contains a set of files related to one another by a common purpose. A file is a collection of records. The records are alike in format but each record is unique in content; therefore the records in a file have the same data elements but different data element values.

"A file is a set of records where the records have the same data elements in the same format."

The organization of the file provides functional storage of data related to the purpose of the system that the database supports. Interfile relationships are based on the functional relationships of their purposes.
Database Schemas

"A schema is the expression of the database in terms of the files it stores, the data elements in each file, the key data elements used for record identification, and the relationships between files."

The translation of a schema into a database management software system usually involves using a language to describe the schema to the database management system.

The Key Data Elements

"The primary key data element in a file is the data element used to uniquely describe and locate a desired record. The key can be a combination of more than one data element."

The definition of the file includes the specification of the data element or elements that are the key to the file. A file key logically points to the record that it indexes.

An Interfile Relationship

In a database, it is possible to relate one file to another in one of the following four ways:

• One to one
• Many to one
• One to many
• Many to many

In such interfile relationships, the database management system may or may not enforce a form of data integrity called referential integrity.

Mapping cardinalities, or cardinality ratios, express the number of entities to which another entity can be associated via a relationship set. Mapping cardinalities are most useful in describing binary relationship sets, although they can contribute to the description of relationship sets that involve more than two entity sets. A sketch of a one-to-many relationship expressed in a database follows this list.

• One to one: An entity in A is associated with at most one entity in B, and an entity in B is associated with at most one entity in A.

• One to many: An entity in A is associated with any number (zero or more) of entities in B. An entity in B, however, can be associated with at most one entity in A.

• Many to one: An entity in A is associated with at most one entity in B. An entity in B, however, can be associated with any number (zero or more) of entities in A.

• Many to many: An entity in A is associated with any number (zero or more) of entities in B, and an entity in B is associated with any number (zero or more) of entities in A.
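
As an illustration, here is a minimal sketch, using Python's built-in sqlite3 module and table names invented for this example, of how a one-to-many relationship (one customer, many rentals) can be declared so that the database enforces referential integrity:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # have SQLite enforce referential integrity

    # One customer (the "one" side) ...
    conn.execute("CREATE TABLE Customers (cust_id INTEGER PRIMARY KEY, name TEXT)")
    # ... can have many rentals (the "many" side); each rental points to one customer.
    conn.execute("""CREATE TABLE Rentals (
                        rental_id INTEGER PRIMARY KEY,
                        cust_id   INTEGER REFERENCES Customers(cust_id),
                        item      TEXT)""")

    conn.execute("INSERT INTO Customers VALUES (1, 'John Smith')")
    conn.execute("INSERT INTO Rentals VALUES (10, 1, 'video')")
    conn.execute("INSERT INTO Rentals VALUES (11, 1, 'game')")
    # Inserting a rental for a non-existent customer now raises an IntegrityError.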

The Data Models

The data in a database may be organized in three principal models:

• Hierarchical Data Model: The relationships between the files form a hierarchy.

• Network Data Model: This model is similar to the hierarchical model except that a file can have multiple parents.

• Relational Data Model: Here, the files have no parents and no children; the relationships are explicitly defined by the user and maintained internally by the database.

The Data Definition Language

The structure of the database and of its tables must be expressed in a form that the computer can translate into the actual physical storage characteristics of the data. The Data Definition Language (DDL) is used for such a specification.

{CREATE, ALTER, DROP}

The Data Manipulation Language

The Data Definition Language is used to describe the database to the DBMS; a corresponding language is needed for programs to use to communicate with the DBMS. Such a language is called the Data Manipulation Language (DML). The DDL describes the records to the application programs and the DML provides an interface to the DBMS: the first uses the record format and the second uses external function calls.

{SELECT, INSERT, UPDATE, DELETE}
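
Putting the two languages together, the following minimal sketch (again using Python's sqlite3 module, with an Employees table invented for the example) shows DDL statements defining a table and DML statements manipulating its contents:

    import sqlite3

    conn = sqlite3.connect(":memory:")

    # DDL: describe the table to the DBMS.
    conn.execute("CREATE TABLE Employees (emp_id INTEGER PRIMARY KEY, name TEXT, dept TEXT)")
    conn.execute("ALTER TABLE Employees ADD COLUMN salary NUMERIC")

    # DML: manipulate the stored data through the DBMS interface.
    conn.execute("INSERT INTO Employees VALUES (1, 'Asha', 'Sales', 30000)")
    conn.execute("UPDATE Employees SET salary = 32000 WHERE emp_id = 1")
    for row in conn.execute("SELECT name, dept, salary FROM Employees"):
        print(row)  # ('Asha', 'Sales', 32000)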

The Query Language

The Query Language is used primarily for the retrieval of data stored in a database. The data is retrieved by issuing query commands to the DBMS, which in turn interprets and appropriately processes them.

Figure 1: The Database System

Introduction to Data Warehouse and Data Mining

The Data Warehouse

"A data warehouse is a central repository for all or significant parts of the data that an enterprise's various business systems collect."

The term was coined by W. H. Inmon. IBM sometimes uses the term "information warehouse."

"A single, complete and consistent store of data obtained from a variety of different sources made available to end users in a way they can understand and use in a business context."
– Barry Devlin

Typically, a data warehouse is housed on an enterprise mainframe server. Data from various online transaction processing (OLTP) applications and other sources is selectively extracted and organized in the data warehouse database for use by analytical applications and user queries. Data warehousing emphasizes the capture of data from diverse sources for useful analysis and access, but does not generally start from the point of view of the end user or knowledge worker who may need access to specialized, sometimes local databases. The latter idea is known as the data mart.

Applications of data warehouses include data mining, web mining, and decision support systems (DSS).

The Data Mining

"Data mining is sorting through data to identify patterns and establish relationships."

"Looking for the hidden patterns and trends in data that are not immediately apparent from summarizing the data."

Data mining parameters include:

• Association: Looking for patterns where one event is connected to another event

• Sequence or path analysis: Looking for patterns where one event leads to another later event

• Classification: Looking for new patterns (may result in a change in the way the data is organized)

• Clustering: Finding and visually documenting groups of facts not previously known

• Forecasting: Discovering patterns in data that can lead to reasonable predictions about the future (this area of data mining is known as predictive analytics)

Data mining techniques are used in many research areas, including mathematics, cybernetics, genetics and marketing. Web mining, a type of data mining used in customer relationship management (CRM), takes advantage of the huge amount of information gathered by a Web site to look for patterns in user behavior.

We are in an age often referred to as the "information age". In this information age, because we believe that information leads to power and success, and thanks to sophisticated technologies such as computers, satellites, etc., organizations have been collecting tremendous amounts of information. Initially, with the advent of computers and means for mass digital storage, organizations started collecting and storing all sorts of data, counting on the power of computers to help sort through this amalgam of information. Unfortunately, these massive collections of data stored on disparate structures very rapidly became overwhelming. This initial chaos led to the creation of structured databases and Database Management Systems (DBMS).

Efficient Database Management Systems have been very important assets for the management of a large corpus of data and especially for the effective and efficient retrieval of particular information from a large collection whenever needed. The proliferation of Database Management Systems has also contributed to the recent massive gathering of all sorts of information. Today, organizations have far more information than they can handle: from business transactions and scientific data, to satellite pictures, text reports and military intelligence. Information retrieval is simply not enough anymore for decision-making.

Confronted with huge collections of data, organizations now need new kinds of help to make better managerial choices: automatic summarization of data, extraction of the "essence" of the information stored, and the discovery of patterns in raw data.

What kind of information is being collected?

Organizations have been collecting a myriad of data, from simple numerical measurements and text documents, to more complex information such as spatial data, multimedia channels, and hypertext documents. Here is a non-exclusive list of the variety of information collected in digital form in databases and in flat files.

• Business Transactions: Every transaction in the business industry is (often) "memorized" for perpetuity. Such transactions are usually time related and can be inter-business deals such as purchases, exchanges, banking, stock, etc., or intra-business operations such as the management of in-house wares and assets. Large department stores, for example, thanks to the widespread use of bar codes, store millions of transactions daily, often representing terabytes of data. Storage space is not the major problem, as the price of hard disks is continuously dropping, but the effective use of the data in a reasonable time frame for competitive decision making is definitely the most important problem to solve for businesses that struggle to survive in a highly competitive world.

• Scientific Data: Whether in a Swiss nuclear accelerator laboratory counting particles, in a Canadian forest studying readings from a grizzly bear radio collar, on a South Pole iceberg gathering data about oceanic activity, or in an American university investigating human psychology, our society is amassing colossal amounts of scientific data that need to be analyzed. Unfortunately, we can capture and store new data faster than we can analyze the old data already accumulated.
• Medical and Personal Data: From government censuses to personnel and customer files, very large collections of information are continuously gathered about individuals and groups. Governments, companies and organizations such as hospitals are stockpiling very large quantities of personal data to help them manage human resources, better understand a market, or simply assist clientele. Regardless of the privacy issues this type of data often raises, this information is collected, used and even shared. When correlated with other data, this information can shed light on customer behavior and the like.

• Surveillance Video and Pictures: With the amazing collapse of video camera prices, video cameras are becoming ubiquitous. Video tapes from surveillance cameras are usually recycled and thus their content is lost. However, there is a tendency today to store the tapes and even digitize them for future use and analysis.

• Satellite Sensing: There are countless satellites around the globe: some are geo-stationary above a region, and some are orbiting around the Earth, but all are sending a non-stop stream of data to the surface. NASA, which controls a large number of satellites, receives more data every second than all NASA researchers and engineers can cope with. Many satellite pictures and data are made public as soon as they are received, in the hope that other researchers can analyze them.

• Games: Our society is collecting a tremendous amount of data and statistics about games, players and athletes. From hockey scores, basketball passes and car-racing laps, to swimming times, boxers' punches and chess positions, all the data are stored. Commentators and journalists use this information for reporting, but trainers and athletes would want to exploit this data to improve performance and better understand opponents.
• Digital Media: The proliferation of cheap scanners, desktop video cameras and digital cameras is one of the causes of the explosion in digital media repositories. In addition, many radio stations, television channels and film studios are digitizing their audio and video collections to improve the management of their multimedia assets.

• CAD and Software Engineering Data: There is a multitude of Computer Assisted Design (CAD) systems for architects to design buildings or engineers to conceive system components or circuits. These systems are generating a tremendous amount of data. Moreover, software engineering is a source of considerable similar data with code, function libraries, objects, etc., which need powerful tools for management and maintenance.

• Virtual Worlds: There are many applications making use of three-dimensional virtual spaces. These spaces and the objects they contain are described with special languages such as VRML. Ideally, these virtual spaces are described in such a way that they can share objects and places. There is a remarkable number of virtual reality object and space repositories available. Management of these repositories as well as content-based search and retrieval from these repositories are still research issues, while the size of the collections continues to grow.

• Text Reports and Memos (E-mail Messages): Most of the communications within and between companies or research organizations, and even private people, are based on reports and memos in textual form, often exchanged by e-mail. These messages are regularly stored in digital form for future use and reference, creating formidable digital libraries.

• The World Wide Web Repositories: Since the inception of the World Wide Web in 1993, documents of all sorts of formats, content and description have been collected and inter-connected with hyperlinks, making it the largest repository of data ever built. Despite its dynamic and unstructured nature, its heterogeneous characteristics, and its very frequent redundancy and inconsistency, the World Wide Web is the most important data collection regularly used for reference because of the broad variety of topics covered and the infinite contributions of resources and publishers. Many believe that the World Wide Web will become the compilation of human knowledge.

What are Data Mining and Knowledge Discovery?

With the enormous amount of data stored in files, databases, and other repositories, it is increasingly important, if not necessary, to develop powerful means for the analysis and perhaps interpretation of such data, and for the extraction of interesting knowledge that could help in decision-making.

Data Mining, also popularly known as "Knowledge Discovery in Databases" (KDD), refers to the nontrivial extraction of implicit, previously unknown and potentially useful information from data in databases. While data mining and knowledge discovery in databases (or KDD) are frequently treated as synonyms, data mining is actually part of the knowledge discovery process.

The following Figure 2 shows data mining as a step in an iterative knowledge discovery process.

Figure 2: Data Mining is the core of the Knowledge Discovery Process

The Knowledge Discovery in Databases process comprises a few steps leading from raw data collections to some form of new knowledge.

The iterative process consists of the following steps:

• Data Cleaning: Also known as data cleansing, this is a phase in which noisy data and irrelevant data are removed from the collection.

• Data Integration: At this stage, multiple data sources, often heterogeneous, may be combined into a common source.

• Data Selection: At this step, the data relevant to the analysis is decided on and retrieved from the data collection.

• Data Transformation: Also known as data consolidation, this is a phase in which the selected data is transformed into forms appropriate for the mining procedure.

• Data Mining: This is the crucial step in which clever techniques are applied to extract potentially useful patterns.

• Pattern Evaluation: In this step, strictly interesting patterns representing knowledge are identified based on given measures.

• Knowledge Representation: This is the final phase in which the discovered knowledge is visually represented to the user. This essential step uses visualization techniques to help users understand and interpret the data mining results.

It is common to combine some of these steps. For instance, data cleaning and data integration can be performed together as a pre-processing phase to generate a data warehouse. Data selection and data transformation can also be combined, where the consolidation of the data is the result of the selection or, as in the case of data warehouses, the selection is done on transformed data. A small sketch of such a pre-processing pipeline is given below.
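
As an illustration of the first combination (cleaning plus integration as a pre-processing phase), here is a minimal sketch in Python using the pandas library; the tables and column names are invented for the example:

    import pandas as pd

    # Two heterogeneous sources describing the same customers.
    store_a = pd.DataFrame({"cust_id": [1, 2, 3], "rentals": [12, None, 30]})
    store_b = pd.DataFrame({"cust_id": [1, 2, 3], "city": ["Mumbai", "Pune", "Mumbai"]})

    # Data cleaning: remove records with missing (noisy) values.
    store_a = store_a.dropna()

    # Data integration: combine the sources into one common table.
    warehouse = store_a.merge(store_b, on="cust_id")

    # Data selection and transformation: keep heavy renters, rescale the measure.
    selected = warehouse[warehouse["rentals"] > 10].copy()
    selected["rentals_per_month"] = selected["rentals"] / 12

    print(selected)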

KDD is an iterative process. Once the discovered knowledge is presented to the user, the evaluation measures can be enhanced, the mining can be further refined, new data can be selected or further transformed, or new data sources can be integrated, in order to get different, more appropriate results.

Data mining derives its name from the similarities between searching for valuable information in a large database and mining rocks for a vein of valuable ore. Both imply either sifting through a large amount of material or ingeniously probing the material to pinpoint exactly where the values reside. It is, however, a misnomer, since mining for gold in rocks is usually called "gold mining" and not "rock mining"; thus by analogy, data mining should have been called "knowledge mining" instead. Nevertheless, data mining became the accepted customary term, and very rapidly a trend that even overshadowed more general terms such as knowledge discovery in databases (KDD) that describe a more complete process. Other similar terms referring to data mining are: data dredging, knowledge extraction and pattern discovery.

What kind of Data can be mined?

In principle, data mining is not specific to one type of media or data; it should be applicable to any kind of information repository. However, algorithms and approaches may differ when applied to different types of data. Indeed, the challenges presented by different types of data vary significantly.

Data mining is being put into use and studied for databases, including relational databases, object-relational databases and object-oriented databases, data warehouses, transactional databases, unstructured and semi-structured repositories such as the World Wide Web, advanced databases such as spatial databases, multimedia databases, time-series databases and textual databases, and even flat files. Here are some examples in more detail:

• Flat Files: Flat files are actually the most common data source for data mining algorithms, especially at the research level. Flat files are simple data files in text or binary format with a structure known by the data mining algorithm to be applied. The data in these files can be transactions, time-series data, scientific measurements, etc.

• Relational Databases: Briefly, a relational database consists of a set of tables containing either values of entity attributes, or values of attributes from entity relationships. Tables have columns and rows, where columns represent attributes and rows represent tuples. A tuple in a relational table corresponds to either an object or a relationship between objects and is identified by a set of attribute values representing a unique key.

In Figure 3, some relations, Customer, Items and Borrow, representing business activity in a fictitious video store, VideoStore, are presented. These relations are just a subset of what could be a database for the video store and are given as an example.

Figure 3: Fragments of some relations from a relational database for VideoStore

The most commonly used query language for relational databases is SQL, which allows retrieval and manipulation of the data stored in the tables, as well as the calculation of aggregate functions such as average, sum, min, max and count. For instance, an SQL query to count the videos grouped by category would be:

SELECT category, COUNT(*) FROM Items WHERE type = 'video' GROUP BY category;

Data mining algorithms using relational databases can be more versatile than data mining algorithms specifically written for flat files, since they can take advantage of the structure inherent to relational databases. While data mining can benefit from SQL for data selection, transformation and consolidation, it goes beyond what SQL can provide, such as predicting, comparing, detecting deviations, etc. A runnable version of the query above is sketched below.
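
The following minimal sketch, using Python's sqlite3 module and a toy Items table invented for the purpose, shows the query above in action:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Items (item_id INTEGER PRIMARY KEY, type TEXT, category TEXT)")
    conn.executemany("INSERT INTO Items (type, category) VALUES (?, ?)",
                     [("video", "comedy"), ("video", "drama"),
                      ("video", "comedy"), ("game", "arcade")])

    # Count the videos grouped by category.
    for category, n in conn.execute(
            "SELECT category, COUNT(*) FROM Items WHERE type = 'video' GROUP BY category"):
        print(category, n)  # comedy 2 / drama 1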

• Data Warehouses: A data warehouse is a repository of data collected from multiple data sources (often heterogeneous) that is intended to be used as a whole under the same unified schema. A data warehouse gives the option to analyze data from different sources under the same roof.

Let us suppose that VideoStore becomes a franchise in New York. Many video stores belonging to the VideoStore company may have different databases and different structures. If the executives of the company want to access the data from all stores for strategic decision-making, future direction, marketing, etc., it would be more appropriate to store all the data in one site with a homogeneous structure that allows interactive analysis.

In other words, data from the different stores would be loaded, cleaned, transformed and integrated together. To facilitate decision making and multi-dimensional views, data warehouses are usually modeled by a multi-dimensional data structure. Figure 4 shows an example of a three-dimensional subset of a data cube structure used for the VideoStore data warehouse.

Figure 4: A multi-dimensional data cube structure commonly used in data warehousing

The figure shows summarized rentals grouped by film categories, then a cross table of summarized rentals by film categories and time (in quarters). The data cube gives the summarized rentals along three dimensions: category, time, and city. A cube contains cells that store values of some aggregate measure (in this case rental counts), and special cells that store summations along dimensions. Each dimension of the data cube contains a hierarchy of values for one attribute.

Because of their structure, the pre-computed summarized data they contain, and the hierarchical attribute values of their dimensions, data cubes are well suited for fast interactive querying and analysis of data at different conceptual levels, known as On-Line Analytical Processing (OLAP). OLAP operations allow the navigation of data at different levels of abstraction, such as drill-down, roll-up, slice, dice, etc. Figure 5 illustrates the drill-down (on the time dimension) and roll-up (on the location dimension) operations, and a small sketch of such summarization follows.

Figure 5: Summarized data from VideoStore before and after drill-down and roll-up operations
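
To give a flavor of these summarizations, here is a minimal sketch in Python with pandas (the rental figures are invented); grouping by fewer columns plays the role of a roll-up, and grouping by more columns the role of a drill-down:

    import pandas as pd

    rentals = pd.DataFrame({
        "category": ["comedy", "comedy", "drama", "drama"],
        "quarter":  ["Q1", "Q2", "Q1", "Q2"],
        "city":     ["New York", "New York", "Boston", "Boston"],
        "count":    [120, 135, 80, 95],
    })

    # Roll-up: summarized rentals by category only.
    print(rentals.groupby("category")["count"].sum())

    # Drill-down: rentals as a cross table of category and quarter.
    print(rentals.pivot_table(index="category", columns="quarter",
                              values="count", aggfunc="sum"))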

• Transaction Databases: A transaction database is a set of records representing transactions, each with a time stamp, an identifier and a set of items. Associated with the transaction files there could also be descriptive data for the items.

For example, in the case of the video store, a rentals table such as the one shown in Figure 6 represents the transaction database. Each record is a rental contract with a customer identifier, a date, and the list of items rented (i.e. video tapes, games, VCR, etc.).

Since relational databases do not allow nested tables (i.e. a set as an attribute value), transactions are usually stored in flat files or in two normalized transaction tables, one for the transactions and one for the transaction items. One typical data mining analysis on such data is the so-called market basket analysis, or association rules, in which associations between items occurring together or in sequence are studied. A small sketch of this analysis follows the figure.

Figure 6: Fragment of a transaction database for the rentals at VideoStore
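
As a minimal illustration of market basket analysis over such transactions (the baskets below are invented), the following Python sketch counts how often each pair of items is rented together:

    from collections import Counter
    from itertools import combinations

    # Each transaction is the set of items on one rental contract.
    transactions = [
        {"video", "popcorn"},
        {"video", "game"},
        {"video", "popcorn", "game"},
        {"game"},
    ]

    pair_counts = Counter()
    for basket in transactions:
        # Count every unordered pair of items occurring together.
        for pair in combinations(sorted(basket), 2):
            pair_counts[pair] += 1

    for pair, n in pair_counts.most_common():
        print(pair, n)  # e.g. ('popcorn', 'video') 2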

• Multimedia Databases: Multimedia databases include video, image, audio and text media. They can be stored on extended object-relational or object-oriented databases, or simply on a file system. Multimedia data is characterized by its high dimensionality, which makes data mining even more challenging. Data mining from multimedia repositories may require computer vision, computer graphics, image interpretation, and natural language processing methodologies.

• Spatial Databases: Spatial databases are databases that, in addition to the usual data, store geographical information like maps, and global or regional positioning. Such spatial databases present new challenges to data mining algorithms.

Figure 7: Visualization of spatial OLAP (from the GeoMiner system)

• Time-Series Databases: Time-series databases contain time-related data such as stock market data or logged activities. These databases usually have a continuous flow of new data coming in, which sometimes creates the need for challenging real-time analysis. Data mining in such databases commonly includes the study of trends and correlations between the evolutions of different variables, as well as the prediction of trends and movements of the variables in time. Figure 8 shows some examples of time-series data.

Figure 8: Examples of Time-Series Data (Source: Thompson Investors Group)

• World Wide Web: The World Wide Web is the most heterogeneous and dynamic repository available. A very large number of authors and publishers are continuously contributing to its growth and metamorphosis, and a massive number of users are accessing its resources daily. Data in the World Wide Web is organized in inter-connected documents. These documents can be text, audio, video, raw data, and even applications. Conceptually, the World Wide Web comprises three major components: the content of the Web, which encompasses the documents available; the structure of the Web, which covers the hyperlinks and the relationships between documents; and the usage of the Web, describing how and when the resources are accessed. A fourth dimension can be added relating to the dynamic nature or evolution of the documents. Data mining in the World Wide Web, or web mining, tries to address all these issues and is often divided into web content mining, web structure mining and web usage mining.

What can be discovered?

The kinds of patterns that can be discovered depend upon the data mining tasks employed. By and large, there are two types of data mining tasks: descriptive data mining tasks that describe the general properties of the existing data, and predictive data mining tasks that attempt to make predictions based on inference from the available data.

The data mining functionalities, and the variety of knowledge they discover, are briefly presented in the following list:

• Characterization: Data characterization is a summarization of the general features of objects in a target class, and produces what are called characteristic rules. The data relevant to a user-specified class are normally retrieved by a database query and run through a summarization module to extract the essence of the data at different levels of abstraction.

For example, one may want to characterize the VideoStore customers who regularly rent more than 30 movies a year. With concept hierarchies on the attributes describing the target class, the attribute-oriented induction method can be used, for example, to carry out data summarization. Note that with a data cube containing a summarization of data, simple OLAP operations fit the purpose of data characterization.

• Discrimination: Data discrimination produces what are called discriminant rules and is basically the comparison of the general features of objects between two classes referred to as the target class and the contrasting class. For example, one may want to compare the general characteristics of the customers who rented more than 30 movies in the last year with those whose rental account is lower than 5. The techniques used for data discrimination are very similar to those used for data characterization, with the exception that data discrimination results include comparative measures.

• Association analysis: Association analysis is the discovery of what are commonly called association rules. It studies the frequency of items occurring together in transactional databases, and based on a threshold called support, identifies the frequent item sets. Another threshold, confidence, which is the conditional probability that an item appears in a transaction when another item appears, is used to pinpoint association rules. Association analysis is commonly used for market basket analysis.

For example, it could be useful for the VideoStore manager to know what movies are often rented together, or whether there is a relationship between renting a certain type of movie and buying popcorn or pop. The discovered association rules are of the form P → Q [s, c], where P and Q are conjunctions of attribute-value pairs, s (for support) is the probability that P and Q appear together in a transaction, and c (for confidence) is the conditional probability that Q appears in a transaction when P is present. For example, the hypothetical association rule

RentType(X, "game") ∧ Age(X, "13-19") → Buys(X, "pop") [s = 2%, c = 55%]

would indicate that 2% of the transactions considered are of customers aged between 13 and 19 who are renting a game and buying pop, and that there is a certainty of 55% that teenage customers who rent a game also buy pop.
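
Both measures are straightforward to compute directly. The following Python sketch (the transactions are invented) counts the support and confidence of a rule P → Q, with P and Q given as sets of items:

    # Invented transactions: each is the set of items in one basket.
    transactions = [
        {"game", "pop"}, {"game", "pop"}, {"game"},
        {"video"}, {"video", "pop"},
    ]

    def support_confidence(P, Q, transactions):
        """Support s = P(P and Q together); confidence c = P(Q given P)."""
        n = len(transactions)
        n_P  = sum(1 for t in transactions if P <= t)        # P present
        n_PQ = sum(1 for t in transactions if (P | Q) <= t)  # P and Q present
        return n_PQ / n, (n_PQ / n_P if n_P else 0.0)

    s, c = support_confidence({"game"}, {"pop"}, transactions)
    print(f"s = {s:.0%}, c = {c:.0%}")  # s = 40%, c = 67%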

• Classification: Classification analysis is the organization of data in given classes. Also known as supervised classification, classification uses given class labels to order the objects in the data collection. Classification approaches normally use a training set where all objects are already associated with known class labels. The classification algorithm learns from the training set and builds a model. The model is used to classify new objects.

For example, after starting a credit policy, the VideoStore managers could analyze the customers' behavior vis-à-vis their credit, and label accordingly the customers who received credit with three possible labels: "safe", "risky" and "very risky". The classification analysis would generate a model that could be used to either accept or reject credit requests in the future.
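
A minimal sketch of this train-then-classify cycle, using the scikit-learn library (an assumption of this illustration, as are the features and labels):

    from sklearn.tree import DecisionTreeClassifier

    # Training set: [age, yearly_rentals, balance] per customer, with a
    # known credit label for each (0 = safe, 1 = risky, 2 = very risky).
    X_train = [[45, 40, 900], [30, 35, 700], [22, 5, 50], [19, 2, 10]]
    y_train = [0, 0, 1, 2]

    # The algorithm learns from the training set and builds a model...
    model = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)

    # ...which is then used to classify new objects (credit requests).
    print(model.predict([[28, 30, 650]]))  # e.g. [0], i.e. "safe"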

• Prediction: Prediction has attracted considerable attention given the potential implications of successful forecasting in a business context. There are two major types of prediction: one can either try to predict some unavailable data values or pending trends, or predict a class label for some data. The latter is tied to classification. Once a classification model is built based on a training set, the class label of an object can be foreseen based on the attribute values of the object and the attribute values of the classes. Prediction, however, more often refers to the forecast of missing numerical values, or of increase/decrease trends in time-related data. The major idea is to use a large number of past values to estimate probable future values.

• Clustering: Similar to classification, clustering is the organization of data in classes. However, unlike classification, in clustering the class labels are unknown and it is up to the clustering algorithm to discover acceptable classes. Clustering is also called unsupervised classification, because the classification is not dictated by given class labels. There are many clustering approaches, all based on the principle of maximizing the similarity between objects in the same class (intra-class similarity) and minimizing the similarity between objects of different classes (inter-class similarity). A small clustering sketch follows this list.

• Outlier analysis: Outliers are data elements that cannot be grouped in a given class or cluster. Also known as exceptions or surprises, outliers are often very important to identify. While outliers can be considered noise and discarded in some applications, they can reveal important knowledge in other domains, and thus can be very significant and their analysis valuable.

• Evolution and deviation analysis: Evolution and deviation analysis pertain to the study of time-related data that changes over time. Evolution analysis models evolutionary trends in data, which allows the characterizing, comparing, classifying or clustering of time-related data. Deviation analysis, on the other hand, considers the differences between measured values and expected values, and attempts to find the cause of the deviations from the anticipated values.

It is common that users do not have a clear idea of the kind of patterns they can discover or need to discover from the data at hand. It is therefore important to have a versatile and inclusive data mining system that allows the discovery of different kinds of knowledge at different levels of abstraction. This also makes interactivity an important attribute of a data mining system.
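
As the clustering sketch promised above: a minimal example using scikit-learn's k-means implementation (the library choice and the customer data are assumptions of this illustration). No class labels are supplied; the algorithm discovers the groups itself:

    from sklearn.cluster import KMeans

    # [yearly_rentals, average_spend] per customer; no class labels given.
    X = [[40, 12.0], [38, 11.5], [5, 3.0], [4, 2.5], [20, 7.0], [22, 7.5]]

    # Ask for three clusters; k-means maximizes intra-class similarity and
    # minimizes inter-class similarity in the Euclidean sense.
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    print(labels)  # e.g. [1 1 0 0 2 2]: heavy, light and medium renters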

Is all that is Discovered Interesting and Useful?

Data mining allows the discovery of knowledge that is potentially useful and unknown. Whether the knowledge discovered is new, useful or interesting is very subjective and depends upon the application and the user. It is certain that data mining can generate, or discover, a very large number of patterns or rules.

In some cases the number of rules can reach the millions. One can even think of a meta-mining phase to mine the oversized data mining results. To reduce the number of discovered patterns or rules that have a high probability of being non-interesting, one has to put a measurement on the patterns. However, this raises the problem of completeness. The user would want to discover all rules or patterns, but only those that are interesting. The measurement of how interesting a discovery is, often called interestingness, can be based on quantifiable objective elements, such as the validity of the patterns when tested on new data with some degree of certainty, or on subjective depictions, such as the understandability of the patterns, their novelty, or their usefulness.

Discovered patterns can also be found interesting if they confirm or validate a hypothesis sought to be confirmed, or unexpectedly contradict a common belief. This brings up the issue of describing what is interesting to discover, such as meta-rule-guided discovery that describes forms of rules before the discovery process, and interestingness refinement languages that interactively query the results for interesting patterns after the discovery phase. Typically, measurements for interestingness are based on thresholds set by the user. These thresholds define the completeness of the patterns discovered.

Identifying and measuring the interestingness of patterns and rules discovered, or to be discovered, is essential for the evaluation of the mined knowledge and of the KDD process as a whole. While some concrete measurements exist, assessing the interestingness of discovered knowledge is still an important research issue.

How do we Categorize Data Mining Systems?

There are many data mining systems available or being developed. Some are specialized systems dedicated to a given data source or confined to limited data mining functionalities; others are more versatile and comprehensive. Data mining systems can be categorized according to various criteria, among which are the following:

• Classification according to the type of data source mined: This classification categorizes data mining systems according to the type of data handled, such as spatial data, multimedia data, time-series data, text data, the World Wide Web, etc.

• Classification according to the data model drawn on: This classification categorizes data mining systems based on the data model involved, such as relational database, object-oriented database, data warehouse, transactional database, etc.

• Classification according to the kind of knowledge discovered: This classification categorizes data mining systems based on the kind of knowledge discovered or the data mining functionalities, such as characterization, discrimination, association, classification, clustering, etc. Some systems tend to be comprehensive systems offering several data mining functionalities together.

• Classification according to the mining techniques used: Data mining systems employ and provide different techniques. This classification categorizes data mining systems according to the data analysis approach used, such as machine learning, neural networks, genetic algorithms, statistics, visualization, database-oriented or data-warehouse-oriented, etc. The classification can also take into account the degree of user interaction involved in the data mining process, such as query-driven systems, interactive exploratory systems, or autonomous systems. A comprehensive system would provide a wide variety of data mining techniques to fit different situations and options, and offer different degrees of user interaction.

What are the Issues in Data Mining?

Data mining algorithms embody techniques that have sometimes existed for many years, but have only lately been applied as reliable and scalable tools that time and again outperform older classical statistical methods. While data mining is still in its infancy, it is becoming a trend and ubiquitous. Before data mining develops into a conventional, mature and trusted discipline, many still pending issues have to be addressed. Some of these issues are addressed below. Note that these issues are not exclusive and are not ordered in any way.

• Security and Social Issues: Security is an important issue with any data collection that is shared and/or is intended to be used for strategic decision-making. In addition, when data is collected for customer profiling, user behavior understanding, and the correlation of personal data with other information, large amounts of sensitive and private information about individuals or companies is gathered and stored. This becomes controversial given the confidential nature of some of this data and the potential illegal access to the information. Moreover, data mining could disclose new implicit knowledge about individuals or groups that could be against privacy policies, especially if there is potential dissemination of the discovered information. Another issue that arises from this concern is the appropriate use of data mining. Due to the value of data, databases of all sorts of content are regularly sold, and because of the competitive advantage that can be attained from implicit knowledge discovered, some important information could be withheld, while other information could be widely distributed and used without control.

• User Interface Issues: The knowledge discovered by data mining tools is useful only so long as it is interesting, and above all understandable by the user. Good data visualization eases the interpretation of data mining results, as well as helps users better understand their needs. Many data exploratory analysis tasks are significantly facilitated by the ability to see data in an appropriate visual presentation. There are many visualization ideas and proposals for effective graphical data presentation. However, there is still much research to accomplish in order to obtain good visualization tools for large datasets that could be used to display and manipulate mined knowledge. The major issues related to user interfaces and visualization are "screen real estate", information rendering, and interaction. Interactivity with the data and the data mining results is crucial, since it provides means for the user to focus and refine the mining tasks, as well as to picture the discovered knowledge from different angles and at different conceptual levels.

• Mining Methodology Issues: These issues pertain to the data mining approaches applied and their limitations. Topics such as the versatility of the mining approaches, the diversity of data available, the dimensionality of the domain, the broad analysis needs (when known), the assessment of the knowledge discovered, the exploitation of background knowledge and metadata, and the control and handling of noise in data are all examples that can dictate mining methodology choices.

For instance, it is often desirable to have different data mining methods available, since different approaches may perform differently depending upon the data at hand. Moreover, different approaches may suit and solve users' needs differently.

Most algorithms assume the data to be noise-free. This is of course a strong assumption. Most datasets contain exceptions, invalid or incomplete information, etc., which may complicate, if not obscure, the analysis process and in many cases compromise the accuracy of the results. As a consequence, data preprocessing (data cleaning and transformation) becomes vital. It is often seen as lost time, but data cleaning, as time-consuming and frustrating as it may be, is one of the most important phases in the knowledge discovery process. Data mining techniques should be able to handle noise in data or incomplete information.

More than the size of the data, the size of the search space is even more decisive for data mining techniques. The size of the search space often depends upon the number of dimensions in the domain space, and it usually grows exponentially when the number of dimensions increases. This is known as the curse of dimensionality. This "curse" affects the performance of some data mining approaches so badly that it is becoming one of the most urgent issues to solve.
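
A back-of-the-envelope illustration of this exponential growth (the ten bins per dimension are an arbitrary choice for the example):

    # With b bins per dimension, a d-dimensional grid has b**d cells:
    # the search space grows exponentially with the dimensionality d.
    bins = 10
    for dims in (2, 5, 10):
        print(f"{dims} dimensions -> {bins ** dims:,} cells")
    # 2 dimensions -> 100 cells
    # 5 dimensions -> 100,000 cells
    # 10 dimensions -> 10,000,000,000 cells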

• Performance Issues: Many artificial intelligence and statistical methods exist for data analysis and interpretation. However, these methods were often not designed for the very large data sets data mining is dealing with today. Terabyte sizes are common. This raises the issue of the scalability and efficiency of the data mining methods when processing considerably large data. Algorithms with exponential and even medium-order polynomial complexity cannot be of practical use for data mining. Linear algorithms are usually the norm. In the same vein, sampling can be used for mining instead of the whole dataset. However, concerns such as completeness and the choice of samples may arise. Other topics under performance are incremental updating and parallel programming. There is no doubt that parallelism can help solve the size problem if the dataset can be subdivided and the results merged later. Incremental updating is important for merging results from parallel mining, or for updating data mining results when new data becomes available, without having to re-analyze the complete dataset.

• Data Source Issues: There are many issues related to the data sources; some are practical, such as the diversity of data types, while others are philosophical, like the data glut problem. We certainly have an excess of data, since organizations already have more data than they can handle and are still collecting data at an even higher rate. If the spread of Database Management Systems has helped increase the gathering of information, the advent of data mining is certainly encouraging more data harvesting. The current practice is to collect as much data as possible now and process it, or try to process it, later. The concern is whether organizations are collecting the right data in the appropriate amount, whether they know what the business wants to do with it, and whether they distinguish between what data is important and what data is insignificant. Regarding the practical issues related to data sources, there is the subject of heterogeneous databases and the focus on diverse complex data types. Organizations are storing different types of data in a variety of repositories. It is difficult to expect a data mining system to effectively and efficiently achieve good mining results on all kinds of data and sources; different kinds of data and sources may require distinct algorithms and methodologies. Currently, there is a focus on relational databases and data warehouses, but other approaches need to be pioneered for other specific complex data types. A versatile data mining tool, for all sorts of data, may not be realistic. Moreover, the proliferation of heterogeneous data sources, at structural and semantic levels, poses important challenges not only to the database community but also to the data mining community.

Better Data Mining Models

This implies that the business analyst can measure the quality of the models. And whatever lift adepts and AUC adepts may say, only one quality measure really matters: what is in it for the business, in money?

Following are some points to remember for effective data mining models:

• There is no data like more data

 Observations

Push the data mining tool to its maximum limits: the more data in use, the better the model created.

The best models are “ensembles” of weak learners, as in bagging. Instead of feeding one data file to the algorithm and letting it do the sampling, learning and averaging, the business analyst may prefer to make the samples personally and feed them one at a time to the algorithm. That way it is possible to use a lot more data before the tool crashes; each individual model is pushed to its maximum limits, and the averaging can be done afterward, as sketched below.

A second advantage of making the samples oneself is that the analyst can choose to generate non-overlapping samples as far as possible. That way the total number of distinct observations used in model building reaches much higher levels than by feeding only one file to the modeling tool.
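A minimal Java sketch of this manual sampling-and-averaging idea is given below. The WeakModel interface and all names are illustrative assumptions, not tied to any particular data mining tool.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch of manual bagging: split the full dataset into non-overlapping
// samples, fit one weak model per sample, and average the predictions.
public final class ManualBagging {
    interface WeakModel { double predict(double[] x); }

    /** Partition rows into k disjoint samples (instead of letting the tool sample). */
    static List<List<double[]>> partition(List<double[]> rows, int k) {
        Collections.shuffle(rows);                  // rows must be a mutable list
        List<List<double[]>> samples = new ArrayList<>();
        int size = rows.size() / k;
        for (int i = 0; i < k; i++) {
            samples.add(new ArrayList<>(rows.subList(i * size, (i + 1) * size)));
        }
        return samples;                             // one weak model is fitted per sample
    }

    /** Average the ensemble's predictions for one record. */
    static double ensemblePredict(List<WeakModel> models, double[] x) {
        return models.stream().mapToDouble(m -> m.predict(x)).average().orElse(0.0);
    }
}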

 Variables

Calculate additional (derived) fields. This is fairly easy: the business analyst can multiply, subtract, divide and add numbers, as long as the result has some business meaning.

Find additional information, inside or outside the company.

• Find the best algorithm

It is tempting to state that for each problem there is probably one best algorithm. So all the data miner has to do is try a handful of really different algorithms to find out which one is best for the problem. Different data miners will use the same algorithm differently, according to their taste, experience, mood and preference.

So find out which algorithm works best for the data miner and the business problem at hand.

• Zoom in on the business targets

When data miners want to use a data mining model to select the customers who are most likely to buy the business's outstanding product XYZ, it is reasonable to use past buyers of XYZ as the positive targets in the model. The data miner gets a model with an excellent lift and uses it for a mailing.

When the mailing campaign is over, the data miner has all the data needed to create a new, better model for product XYZ: the targets are now the past buyers of XYZ who bought in response to the mailing. With this new model, the data miner takes into account not only the customers' "natural" propensity to buy, but also their willingness to respond to a mailing.

If the databases contain far more observations than the data mining tool can handle, the only thing the data miner can do is use samples. Calculate the model and use it, but push it a bit further: use the model to score the entire customer base, and then zoom in on the customers with the best scores, say the top 10%. Use them to calculate a new, second model which exploits the far more subtle differences in customer information to find the really promising ones, as sketched below.
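A rough Java sketch of this two-stage idea follows; Model is a stand-in interface and the 10% cut-off mirrors the example above.

import java.util.Comparator;
import java.util.List;

// Two-stage selection: score the whole base with the first model,
// keep the top decile, then fit a finer second model on that segment.
public final class TwoStageSelection {
    interface Model { double score(double[] customer); }

    /** Returns the top 10% of customers by the first model's score. */
    static List<double[]> topDecile(List<double[]> customers, Model first) {
        customers.sort(Comparator.comparingDouble(first::score).reversed());
        // The second model is then trained using only this segment.
        return customers.subList(0, Math.max(1, customers.size() / 10));
    }
}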

• Make it simple

Nevertheless, the data miner has to keep the data mining work as simple as possible, because the business that pays the bills wants good models delivered on time for its campaigns.

• Automate as much as possible

The data miner should not try out every possible algorithm in each data mining project. If problem A was best solved with algorithm X, then problem B, which is very similar to A, should probably be tackled with algorithm X as well. There is no need to waste time checking out other algorithms.

Introduction to Object-Oriented Database

In the modern computing world, the amount of data generated and stored in the databases of organizations is vast and continues to grow at a rapid pace.
at a rapid pace. The data stored in these databases possess valuable
hidden knowledge. The discovery of such knowledge can be very
fruitful for taking effective decisions. Thus the need for developing
methods for extracting knowledge from data is quite evident. Data
mining, a promising approach to knowledge discovery, is the use of
pattern recognition technologies with statistical and mathematical
techniques for discovering meaningful new correlations, patterns
and trends by analyzing large amounts of data stored in
repositories. Data mining has made its impact on many applications
such as marketing, customer relationship management,
engineering, medicine, crime analysis, expert prediction, Web
mining, and mobile computing, among others. In general, data
mining tasks can be classified into two categories: Descriptive
mining and Predictive mining.

“Descriptive Mining” is the process of extracting the vital characteristics of data from databases. Some descriptive mining techniques are clustering, association rule mining and sequential mining.

“Predictive Mining” is the process of deriving hidden patterns and trends from data in order to make predictions. The predictive mining techniques comprise a series of tasks, namely classification, regression and deviation detection.

One of the important tasks of data mining is data classification, which is the process of finding a valuable set of models that are self-descriptive and can distinguish data classes or concepts, in order to predict the class of records whose class label is unknown.

For example, in the transportation network, all highways with the
same structural and behavioral properties can be classified as a
class highway. From the application point of view, Classification
helps in credit approval, product marketing, and medical diagnosis.
Techniques such as decision trees, neural networks, nearest neighbor methods and rough set-based methods enable the creation of classification models. Regardless of the potential effectiveness of data mining to appreciably enhance data analysis, it will remain a niche technology unless an effort is taken to integrate it with traditional database systems. Database systems offer a uniform framework for data mining by proficiently administering large datasets, integrating different data types and storing the discovered knowledge.

For over a decade, Relational Databases (RDBs) have been the accepted solution for efficient storage and retrieval of huge volumes of data. Relational Databases are based on tables, which are static components of organizational information. In addition, Relational Databases can handle only simple predefined data types and face problems when dealing with complex data types, user-defined data types and multimedia. Thus, Relational Database technology fails to meet the needs of complex information systems. Often the semantics of relational databases are left unexplored within many relationships, which cannot be extracted without the users' help. Object-Oriented Databases (OODBs) solve many of these problems. Based on the concepts of abstraction and generalization, object-oriented models capture the semantics and complexity of the data. Therefore, many research organizations are employing Object-Oriented Databases (OODBs) to solve their problems of data storage, retrieval and processing.

An Object-Oriented Database (OODB) is a database in which the concepts of object-oriented languages are utilized. The principal strength of the OODB is its ability to handle applications involving complex and interrelated information. But in the current scenario, the existing Object-Oriented Database Management System (OODBMS) technologies are not efficient enough to compete in the market with their relational counterparts. Apart from that, there are numerous applications built on existing Relational Database Management Systems (RDBMSs), and it is difficult, if not impossible, to move off those relational databases. Hence, database administrators intend to incorporate object-oriented concepts into the existing RDBMSs, thereby exploiting the features of both RDBMSs and Object-Oriented (OO) concepts. Undoubtedly, one of the most significant characteristics of object-oriented programming is inheritance.

“Inheritance” is the concept by which the variables and methods defined in a parent class (super class) are automatically inherited by its child class (sub class). There are two ways to represent class relationships between objects in object-oriented programming: “is-a” and “has-a” relationships.

In an “is-a” relationship, an object of a sub class can also be thought of as an object of its super class. For instance, a class named Car exhibits an “is-a” relationship with a base class named Vehicle, since a car is a vehicle.

In a “has-a” relationship, which is also known as “Composition”, a class object holds one or more object references as data members.

For example, a bicycle has a steering wheel and, in the same way, a wheel has spokes. Both relationships can be sketched briefly, as shown below. Inheritance can also be stated as generalization, because the “is-a” relationship represents a hierarchy between the classes of objects.
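Both relationships can be sketched in a few lines of Java; the class and field names follow the examples above and are purely illustrative.

// "is-a": inheritance
class Vehicle {
    int maxSpeed;
}

class Car extends Vehicle {          // a Car is a Vehicle
    int numberOfDoors;
}

// "has-a": composition via object references held as data members
class Wheel {
    int spokes;                      // a wheel has spokes
}

class Bicycle {
    Wheel frontWheel = new Wheel();  // a bicycle has wheels
    Wheel rearWheel  = new Wheel();
}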

In generalization hierarchies, the data members and methods of the super class are inherited by the subclasses, and the objects of a subclass can use the common properties of the super class without redefinition. For example, “fruit” is a generalization of “apple”, “orange”, “mango” and many others; similarly, one can consider fruit to be an abstraction of apple, orange, etc. Conversely, since apples are fruits (i.e., an apple is-a fruit), apples are bound to contain all the attributes common to a fruit. This concept of generalization is very powerful, because it reduces redundancy and maintains “Integrity”.

“Polymorphism” is another important object-oriented programming concept. The term stands for ‘many forms’; in brief, polymorphism can be defined as “One Interface, Many Implementations”. It is the property of being able to assign a different meaning or usage to something in different contexts; in particular, it allows an entity such as a variable, a function, or an object to take more than one form. Polymorphism is broader than method overloading or method overriding alone. In the literature, polymorphism is classified into three kinds, namely pure, static, and dynamic.

• “Pure Polymorphism” refers to a function which can take parameters of several data types.

• “Static Polymorphism” can be stated as function and operator overloading.

• “Dynamic Polymorphism” is achieved by employing inheritance and virtual functions.

Dynamic binding, or runtime binding, allows one to substitute polymorphic objects for each other at run-time. Polymorphism has a number of advantages; its chief advantage is that it simplifies the definition of clients, as it allows the client to substitute, at run-time, an instance of one class for an instance of another class that has the same “Polymorphic Interface”, as illustrated below.
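A small Java sketch of dynamic polymorphism and run-time substitution follows; Shape, Circle and Square are illustrative names.

// One interface (area), many implementations; the method actually run
// is chosen at run-time by the class of the calling object.
abstract class Shape {
    abstract double area();
}

class Circle extends Shape {
    double radius;
    Circle(double r) { radius = r; }
    @Override double area() { return Math.PI * radius * radius; }
}

class Square extends Shape {
    double side;
    Square(double s) { side = s; }
    @Override double area() { return side * side; }
}

public class PolymorphismDemo {
    public static void main(String[] args) {
        Shape s = new Circle(2.0);    // substitute any subclass instance at run-time
        System.out.println(s.area()); // dispatches to Circle.area()
        s = new Square(3.0);
        System.out.println(s.area()); // dispatches to Square.area()
    }
}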

It is becoming increasingly important to extend the domain of study from relational database systems to object-oriented database systems and to probe the knowledge discovery mechanisms in object-oriented databases, because object-oriented database systems have emerged as a popular and influential setting for advanced database applications. Given that standards are still not defined for Object-Oriented Database Management Systems (OODBMSs) as they are for relational Database Management Systems (DBMSs), and that most organizations have information systems based on relational DBMS technology, incorporating object-oriented programming concepts into the existing Relational Database Management Systems (RDBMSs) is an ideal choice for designing a database that best suits advanced database applications.

A novel and innovative approach for the design of an object-oriented database is presented in my study. The design of the Object-Oriented Database (OODB) is carried out with the intention of achieving efficient classification on the database. In my proposed approach, I have utilized the object-oriented programming concepts of inheritance and polymorphism to achieve the above stated goals. Chiefly, I have extended existing relational databases by incorporating object-oriented programming concepts, to obtain an object-oriented database. The OODB is structured mainly by employing the class hierarchies of inheritance: the class relationships “is-a” and “has-a” are used to represent a class hierarchy in the proposed OODB. Another object-oriented programming concept, polymorphism, is utilized to achieve better classification; polymorphism enables the use of simple SQL queries to classify the designed OODB. The experimental results stated portray the efficiency of the proposed approach. The designed OODB demands less implementation overhead and saves considerable memory space compared to Relational Databases (RDBs), while exploiting their essential features.

Object-Oriented Database (OODB)

The chief advantage of an Object-Oriented Database (OODB) is its ability to represent real-world concepts as data models in an effective and presentable manner. The OODB is optimized to support object-oriented applications and different types of structures, including trees, composite objects and complex data relationships. The OODB system handles complex databases efficiently, and it allows the users to define a database, with features for creating, altering and dropping tables and establishing constraints. From the user's perspective, an OODB is just a collection of objects and the inter-relationships among them. Objects that resemble each other in properties and behavior are organized into classes, and every class is a container for a set of common attributes and methods shared by similar objects.

• The “Attributes or Instance Variables” define the “Properties of a Class”.

• The “Methods” describe the “Behavior of the Objects associated with the Class”.

• A “Class/Subclass Hierarchy” is used to represent “Complex Objects, where the Attributes of an Object may themselves contain Complex Objects”.

The most important object-oriented concepts employed in an Object-Oriented Database (OODB) model include the inheritance mechanism and composite object modeling. In order to cope with the increased complexity of the object-oriented model, one can divide class features as follows: simple attributes (attributes with scalar types); complex attributes (attributes with complex types); simple methods (methods accessing only the local simple attributes of the class); and complex methods (methods that return or refer to instances of other classes). The object-oriented approach uses two important abstraction principles for structuring designs: Classification and Generalization.

“Classification” is defined as, “An abstraction principle by which objects with similar properties are grouped into classes defining the structure and behavior of their instances.”

“Generalization” is defined as, “An abstraction principle by which all the common properties shared by several classes are organized into a single super class to form a Class Hierarchy.”

From the very outset of the first Object-Oriented Database Management System (OODBMS), GemStone, in the mid-eighties, a dozen other commercial OODBMSs have joined the fierce competition in the market. Regarding the applications of the OODB, its vendors have laid their focus on Computer Aided Design (CAD), Computer Aided Manufacturing (CAM) and Computer Aided Software Engineering (CASE). All these user applications are meant to handle complex information, and OODB systems promise to offer efficient solutions to these problems. Factory and office automation are other application areas of object-oriented database technology.

New Approach to the Design of Object Oriented
Database

In general, the computer literature defines three approaches to building an Object-Oriented Database Management System (OODBMS): extending an Object-Oriented Programming Language (OOPL), extending a Relational Database Management System (RDBMS), and starting from scratch.

The “First” approach develops an OODBMS by adding to an Object-Oriented Programming Language (OOPL) persistent storage and multiple concurrent access with transaction support.

The “Second” is an extended relational approach: an OODBMS is built by augmenting an existing Relational Database Management System (RDBMS) with object-oriented features such as classes and inheritance, methods and encapsulation, polymorphism and complex objects.

The “Third” approach aims to revolutionize database technology in the sense that an OODBMS is designed from the ground up, as represented by UniSQL / UniOracle and OpenOODB (Open Object-Oriented Database).

In my design, I have employed the second approach, which extends relational databases by utilizing Object-Oriented Programming (OOP) concepts.

The proposed approach makes use of the Object-Oriented Programming (OOP) concepts of “Inheritance” and “Polymorphism” to design an Object-Oriented Database (OODB) and to perform classification in it, respectively. Normally, a database is a collection of tables; hence, when I consider a database, it is bound to contain a number of tables with common fields. In my approach, I have grouped together such common sets of fields to form a single generalized table. The newly created table resembles the base class in an inheritance hierarchy; this ability to represent classes in a hierarchy is one of the eminent OOP concepts. Next, I have employed another important object-oriented characteristic, dynamic polymorphism, where different classes have methods of the same name and structure that perform different operations based on the “Calling Object”. Polymorphism is specifically employed to achieve classification in a simple and effective manner. The use of these object-oriented concepts in the design of the OODB ensures that even complex queries can be answered more efficiently; in particular, the data mining task of classification can be achieved in an effective manner.

Let T denote the set of all tables in a database D, and let t ⊆ T, where t represents the set of tables that have some fields in common. I now create a generalized table composed of all those common fields from the table set t. To portray the efficiency of my proposed approach, I consider a traditional example. A traditional database for a large business organization will have a number of tables, but to best illustrate the OOP concepts employed in my approach, I have concentrated on three tables, namely Employees, Suppliers and Customers, represented as Table 1, Table 2 and Table 3 respectively.

Table 1: Example of Employees Table

Table 2: Example of Customers Table

Table 3: Example of Suppliers Table
The above set of tables, namely Employees, Suppliers and Customers, can be represented equivalently as classes. The class structure may look as in Figure 9.

Figure 9: Class Structure of the Employees, Suppliers and Customers Tables

From the above class structure, it is understood that every table has a set of general or common fields (the highlighted ones) and table-specific fields. Considering the Employee table, it has general fields like Name, Age and Gender, and table-specific fields like Title, HireDate, etc. These general fields occur repeatedly in most tables. This causes redundancy and thereby increases space complexity. Moreover, if a query is given to retrieve a set of records for the whole organization satisfying a particular rule, all the tables may need to be searched separately. So this replication of general fields across the tables leads to a poor design, which hinders effective data classification. To perform better classification, I have designed an Object-Oriented Database (OODB) by incorporating the inheritance concept of Object-Oriented Programming (OOP).

 Design of the Object-Oriented Database

First, in my proposed approach, I have designed an Object-Oriented Database (OODB) by utilizing the inheritance concept of Object-Oriented Programming (OOP), which eliminates the problem of redundancy. I first located all the general or common fields from the table set t. Then, all these common fields were fetched and stored in a single table, which all the related tables can inherit. Thus the generalized table resembles the base class of the OOP paradigm. In my approach, I have created a new table called ‘Person’, which contains all those common fields, and the other tables like Employees and Customers inherit the Person table without redefining it.

Here, I have used two important mechanisms, namely “Generalization” and “Composition”. Generalization depicts an “is-a” relation and composition represents a “has-a” relation. Both these relationships can be illustrated as follows: the generalized table “Person” contains all the common fields, and the tables “Employees”, “Suppliers” and “Customers”, which inherit the table “Person”, are said to have an “is-a” relationship with it, i.e., an Employee is a Person, a Supplier is a Person and a Customer is a Person. Similarly, to exemplify the composition relation, the table Person contains an object reference to the “Places” table as one of its fields. The table Person is then said to have a “has-a” relationship with the table Places, i.e., a Person has a Place and, similarly, a Place has a Postal Code. Figure 10 represents the inheritance class hierarchy of the proposed Object-Oriented Database (OODB) design. In the pictured design, the small hollow triangle (▷) represents the “is-a” relationship and the arrow (→) represents the “has-a” relationship.

Figure 10: Inheritance Hierarchy of Classes in the Proposed OODB Design
The generalized table ‘Person’ is considered as the base class ‘Person’, and its fields are considered as the attributes of the base class. Therefore, the base class ‘Person’, which contains all the common attributes, is inherited by the other classes, namely Employees, Suppliers and Customers, which contain only the specialized attributes.

Moreover, inheritance allows me to define the generalized methods in the base class and the specialized methods in the sub classes.

For example, if there is a need to get the contact numbers of all the people associated with the organization, one can define a method getContactNumbers() in the base class ‘Person’, and it can be shared by its subclasses. In addition, the generalized class ‘Person’ exhibits a composition relationship with another two classes, ‘Places’ and ‘PostalCodes’: the class ‘Person’ uses instance variables which are object references to the classes ‘Places’ and ‘PostalCodes’. A minimal sketch of this hierarchy is given below; the tables in the proposed OODB design are then shown in Tables 4 to 9.
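The following Java sketch assumes the field names of the tables above; all details are illustrative, not a complete implementation of the design.

class PostalCode {
    String code;
}

class Place {
    String city;
    PostalCode postalCode;          // composition: a Place has a PostalCode
}

class Person {                      // generalized base class with the common fields
    String name;
    int age;
    String gender;
    String contactNumber;
    Place place;                    // composition: a Person has a Place

    String getContactNumbers() {    // generalized method, shared by all subclasses
        return contactNumber;
    }
}

class Employee extends Person {     // an Employee is a Person
    String title;
    String hireDate;
}

class Supplier extends Person { /* supplier-specific fields only */ }
class Customer extends Person { /* customer-specific fields only */ }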

Table 4: Example of Persons Table

Table 5: Example of Extended Employees Table

Table 6: Example of Extended Suppliers Table

Table 7: Example of Extended Customers Table

Table 8: Example of Extended Places Table

Table 9: Example of Extended PostalCodes Table
Owing to the incorporation of the inheritance concept in the proposed design, the database designer can extend the database effortlessly by adding new tables that merely inherit the common fields from the generalized table.

 Data Mining in the Designed Object-Oriented
Database

“Dynamic Polymorphism” or “Late Binding” allows the programmer to define methods with the same name in different classes; the method to be called is decided at runtime based on the calling object. This Object-Oriented Programming (OOP) concept, together with simple SQL/ORACLE queries, can be used to perform classification in the designed Object-Oriented Database (OODB). Here, a single method can perform the classification process for all the tables. The uniqueness of my concept is that the classification process can be performed using simple SQL/ORACLE queries, while the existing classification approaches for OODBs employ complex techniques such as decision trees, neural networks, nearest neighbor methods and more. The database administrator can also invoke the method specifically for the individual entities, namely Employees, Suppliers and Customers. By integrating the polymorphism concept, the code is simpler to write and easier to manage. As a result of the designed OODB, the task of classification can be carried out effectively using simple SQL/ORACLE queries. Thus, by incorporating the OOP concepts in the design of the OODB, I have exploited the maximum advantages of OOP, and the task of classification is performed more effectively. A minimal sketch of this mechanism is given below.
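The sketch below uses standard JDBC; the table names and the counting rule are illustrative assumptions, and in production code the rule should be parameterized rather than concatenated into the query.

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// One inherited classify() method serves every entity; the overridden
// tableName() of the calling object decides, at run-time, which table
// the simple SQL query is run against.
abstract class EntityClassifier {
    abstract String tableName();

    int classify(Connection conn, String rule) throws SQLException {
        String sql = "SELECT COUNT(*) FROM " + tableName() + " WHERE " + rule;
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            rs.next();
            return rs.getInt(1);    // size of the class satisfying the rule
        }
    }
}

class EmployeeClassifier extends EntityClassifier {
    @Override String tableName() { return "Employees"; }
}

class CustomerClassifier extends EntityClassifier {
    @Override String tableName() { return "Customers"; }
}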

Implementation and Results

In this section, I present the experimental results of my approach. The proposed approach for the design of the Object-Oriented Database (OODB) and for classification has been implemented with ORACLE as the database. I have considered only three tables for experimentation, but in general an organization may have a large number of tables to manage, and the number of records in each table is typically enormous. The incorporation of the Object-Oriented Programming (OOP) concepts into such databases greatly reduces the implementation overhead incurred. Moreover, the memory space occupied is reduced to a great extent as the size of the tables increases. These are some of the eminent benefits of the proposed approach. I performed a comparative analysis, drawing on reviews from Computer Reseller News (CRN) magazine and the COMPUTER monthly newspaper, of the space utilized before and after generalization of the tables, and thus computed the saved memory space. The comparison is performed with a varying number of records in the tables (1000, 2000, 3000, 4000 and 5000), and the results are stated below in Tables 10, 11, 12, 13 and 14 respectively.

                          Normalized                                   Un-normalized
    Table           Fields  Records  Total records  Memory size   Fields  Total records  Memory size
                                     of table       of table              of table       of table
1   Customers       4       1000     4000           40000         15      15000          150000
2   Employees       5       1000     5000           50000         16      16000          160000
3   Suppliers       5       1000     5000           50000         16      16000          160000
4   Persons         8       3000     24000          240000
5   Places          3       500      1500           15000
6   Postalcodes     4       250      1000           10000
    Total                            40500          405000                47000          470000

Saved Memory (KB): 63.47656

Table 10: Saved Memory Table {Source: Computer Reseller News (CRN) Magazines}
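As a quick arithmetic check on Table 10, assuming the memory figures are in bytes and 1 KB = 1024 bytes, the tabulated saving follows directly:

(470000 − 405000) / 1024 = 65000 / 1024 ≈ 63.47656 KB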

                          Normalized                                   Un-normalized
    Table           Fields  Records  Total records  Memory size   Fields  Total records  Memory size
                                     of table       of table              of table       of table
1   Customers       4       2000     8000           80000         15      30000          300000
2   Employees       5       2000     10000          100000        16      32000          320000
3   Suppliers       5       2000     10000          100000        16      32000          320000
4   Persons         8       6000     48000          480000
5   Places          3       1000     3000           30000
6   Postal codes    4       500      2000           20000
    Total                            81000          810000                94000          940000

Saved Memory (KB): 126.9531

Table 11: Saved Memory Table {Source: Computer Reseller News (CRN) Magazines}

                          Normalized                                   Un-normalized
    Table           Fields  Records  Total records  Memory size   Fields  Total records  Memory size
                                     of table       of table              of table       of table
1   Customers       4       3000     12000          120000        15      45000          450000
2   Employees       5       3000     15000          150000        16      48000          480000
3   Suppliers       5       3000     15000          150000        16      48000          480000
4   Persons         8       9000     72000          720000
5   Places          3       1500     4500           45000
6   Postal codes    4       750      3000           30000
    Total                            121500         1215000               141000         1410000

Saved Memory (KB): 190.4297

Table 12: Saved Memory Table {Source: Computer Reseller News (CRN) Magazines}

                          Normalized                                   Un-normalized
    Table           Fields  Records  Total records  Memory size   Fields  Total records  Memory size
                                     of table       of table              of table       of table
1   Customers       4       4000     16000          160000        15      60000          600000
2   Employees       5       4000     20000          200000        16      64000          640000
3   Suppliers       5       4000     20000          200000        16      64000          640000
4   Persons         8       12000    96000          960000
5   Places          3       2000     6000           60000
6   Postal codes    4       1000     4000           40000
    Total                            162000         1620000               188000         1880000

Saved Memory (KB): 253.9063

Table 13: Saved Memory Table {Source: Computer Reseller News (CRN) Magazines}

                          Normalized                                   Un-normalized
    Table           Fields  Records  Total records  Memory size   Fields  Total records  Memory size
                                     of table       of table              of table       of table
1   Customers       4       5000     20000          200000        15      75000          750000
2   Employees       5       5000     25000          250000        16      80000          800000
3   Suppliers       5       5000     25000          250000        16      80000          800000
4   Persons         8       15000    120000         1200000
5   Places          3       2500     7500           75000
6   Postal codes    4       1250     5000           50000
    Total                            202500         2025000               235000         2350000

Saved Memory (KB): 317.3828

Table 14: Saved Memory Table {Source: Computer Reseller News (CRN) Magazines}

The graphical representation of the results is illustrated in Figure 11. From the graph, it is clear that the saved memory space increases as the number of records in each table increases.

Figure 11: Graph Demonstrating the above Evaluation Results

Moreover, in the proposed approach I have placed the common methods in the generalized class and the entity-specific methods in the subclasses. Because of this design, considerable memory space is saved.

For instance, in the case of a traditional database, if a method getContactNumbers() is defined to get the contact numbers, the method has to be defined in all the classes, and all the results have to be combined to obtain the final result. But in the proposed approach, I have generalized all the classes, so the redefinition of the method for all the related classes is not needed. If there are 'n' classes, placing a common method once in the base class instead of repeating it in each class can save a memory space of

Saved memory = (n − 1) × S_method, where S_method is the memory occupied by one method definition

Equation 1: To Determine the Memory Size

Building Profitable Customer Relationships with
Data Mining

Once an organization has built its customer information and marketing data warehouse, how can it make good use of the data the warehouse contains?

“Customer Relationship Management” (CRM) helps companies improve the profitability of their interactions with customers while at the same time making those interactions appear friendlier through individualization. To succeed with CRM, companies need to match products and campaigns to prospects and customers; in other words, to intelligently manage the “Customer Life Cycle”. Until recently, most CRM software has focused on simplifying the organization and management of customer information. Such software, called “Operational CRM”, has focused on creating a customer database that presents a consistent picture of the customer's relationship with the company, and on providing that information in specific applications, such as sales force automation and customer service, in which the company “touches” the customer. However, the sheer volume of customer information and the increasingly complex interactions with customers have propelled data mining to the forefront of making customer relationships profitable. Data mining is a process that uses a variety of data analysis and modeling techniques to discover patterns and relationships in data that may be used to make accurate predictions. It can help select the right prospects on whom to focus, offer the right additional products to existing customers, and identify good customers who may be about to leave. The result is improved revenue, because of a greatly improved ability to respond to each individual contact in the best way, and reduced costs, due to properly allocating business resources. CRM applications that use data mining are called Analytic CRM.

This section of the project will describe the various aspects of
analytic CRM and show how it is used to manage the customer life
cycle more cost-effectively. The case histories of these fictional
companies are composites of real-life data mining applications.

Data Mining in Customer Relationship Management

The first and simplest analytical step in data mining is to "describe the data": summarize its statistical attributes (such as means and standard deviations), visually review it using charts and graphs, and look at the distribution of the values of the fields in the organization's data.

But data description alone cannot provide an action plan. An organization must "build a predictive model" based on patterns determined from known results, and then test that model on results outside the original sample. A good model should never be confused with reality (a road map is not a perfect representation of the actual road), but it can be a useful guide to understanding the business.

Data mining can be used for both classification and regression problems.

In "classification problems" the business analyst predicts which category something will fall into; for example, whether a person will be a good credit risk or not, or which of several offers someone is most likely to accept.

In "regression problems" the business analyst predicts a number, such as the probability that a person will respond to an offer.

In CRM, data mining is frequently used to assign a score to a particular customer or prospect, indicating the likelihood that the individual will behave in the way the business wants. For example, a score could measure the propensity to respond to a particular offer or to switch to a competitor's product. Data mining is also frequently used to identify a set of characteristics (called a profile) that segments customers into groups with similar behaviors, such as buying a particular product.

A special type of classification can recommend items based on similar interests held by groups of customers; this is sometimes called "Collaborative Filtering".

The data mining technology used for solving classification, regression and collaborative filtering problems is briefly described in the Appendix at the end of the project.

Defining CRM

"Customer Relationship Management" in its broadest sense simply


means managing all customer interactions. In practice, this requires
using information about the Business customers and prospects to
more effectively interact with Business customers in all stages of
Business relationship with them. I have refer to these stages as the
customer life cycle.

The customer life cycle has three stages:

• Acquiring customers
• Increasing the value of the customer
• Retaining good customers

Data mining can improve the business's profitability in each of these stages, through integration with operational CRM systems or as independent applications.

Applying Data Mining to CRM

In order to build good models for the business's CRM system, there are a number of steps the business must follow.

The Two Crows data mining process model described below is similar to other process models, such as the CRISP-DM model, differing mostly in the emphasis it places on the different steps. Keep in mind that while the steps appear in a list, the data mining process is not linear: the CRM implementer will inevitably need to loop back to previous steps. For example, what the implementer learns in the "explore data" step may require adding new data to the data mining database, and the initial models built may provide insights that lead to creating new variables.

The basic steps of data mining for effective CRM are:

• Define business problem
• Build marketing database
• Explore data
• Prepare data for modeling
• Build model
• Evaluate model
• Deploy model and results

• Define the business problem.

Each CRM application will have one or more business objectives for which the business analyst will need to build an appropriate model. Depending on the specific goal, such as "increasing the response rate" or "increasing the value of a response," the analyst will build a very different model. An effective statement of the problem will include a way of measuring the results of the CRM project.

• Build a Marketing Database.

Steps two through four constitute the core of the data preparation. Together, they take more time and effort than all the other steps combined. There may be repeated iterations of the data preparation and model building steps as the business analyst learns something from the model that suggests modifying the data. These data preparation steps may take anywhere from 50% to 90% of the time and effort of the entire data mining process!

The business analyst will need to build a marketing database because the operational databases and the corporate data warehouse will often not contain the data the business needs in the required format. Furthermore, CRM applications may interfere with the speedy and effective execution of these operational systems.

When building the marketing database, the data miner will need to clean it up: good models require clean data. The data needed may reside in multiple databases, such as the customer database, the product database, and transaction databases. This means the data must be integrated and consolidated into a single marketing database, reconciling differences in data values from the various sources. Improperly reconciled data is a major source of quality problems. There are often large differences in the way data is defined and used in different databases. Some inconsistencies may be easy to uncover, such as different addresses for the same customer. Making these problems more difficult to resolve is that they are often subtle; for example, the same customer may have different names or, worse, multiple customer identification numbers.

• Explore the data.

Before good predictive models can be built, the business analyst must understand the data. Start by gathering a variety of numerical summaries (including descriptive statistics such as averages, standard deviations and so forth) and looking at the distribution of the data.

The analyst may want to produce cross-tabulations (pivot tables) for multi-dimensional data. Graphing and visualization tools are a vital aid in data preparation, and their importance to effective data analysis cannot be overemphasized; data visualization most often provides the lead to new insights and success. Some common and very useful graphical displays of data are histograms and box plots, which display distributions of values. The analyst may also want to look at scatter plots, in two or three dimensions, of different pairs of variables. The ability to add a third, overlay variable greatly increases the usefulness of some types of graphs.

• Prepare data for modeling.

This is the final data preparation step before building models and
the step where the most “art” comes in. There are four main parts to
this step:

First, the business analyst wants to select the variables on which to build the model. Ideally, the analyst would take all the variables available, feed them to the data mining tool and let it find the best predictors. In practice, this does not work very well. One reason is that the time it takes to build a model increases with the number of variables. Another is that blindly including extraneous columns can lead to models with less, rather than more, predictive power.

The next step is to construct new predictors derived from the raw data. For example, forecasting credit risk using a debt-to-income ratio, rather than just debt and income as predictor variables, may yield more accurate results that are also easier to understand.
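A one-method Java sketch of such a derived field (the names and the guard are illustrative):

// Derive a debt-to-income ratio predictor from the raw debt and income fields.
public final class DerivedFields {
    static double debtToIncome(double debt, double income) {
        return income > 0 ? debt / income : 0.0;   // guard against division by zero
    }
}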

Next, the analyst may decide to select a subset or sample of the data on which to build models. If there is a lot of data, using all of it may take too long or require buying a bigger computer than the business would like. Working with a properly selected random sample usually results in no loss of information for most CRM problems. Given a choice between investigating a few models built on all the data and investigating more models built on samples, the latter approach will usually help the analyst develop a more accurate and robust model of the problem.

Last, the analyst will need to transform variables in accordance with the requirements of the algorithm chosen for building the model.

• Data mining model building.

The most important thing to remember about model building is that it is an iterative process. The business analyst will need to explore alternative models to find the one that is most useful in solving the business problem. What is learned while searching for a good model may lead the analyst to go back and make changes to the data being used, or even to modify the problem statement.

Most CRM applications are based on a protocol called supervised learning: the analyst starts with customer information for which the desired outcome is already known. For example, the marketer may have historical data from a previous mailing to a list very similar to the one now being used, or may conduct a test mailing to determine how people will respond to an offer. The marketer then splits this data into two groups: on the first group the model is trained (estimated), and it is then tested on the remainder of the data, as sketched below. A model is built when the cycle of training and testing is completed.
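A minimal Java sketch of this train/test protocol follows; the Example record and the 70/30 split ratio are illustrative assumptions.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Shuffle the labeled history, train on one part, test on the remainder.
public final class TrainTestSplit {
    record Example(double[] features, boolean responded) {}

    static void split(List<Example> history) {
        Collections.shuffle(history);
        int cut = (int) (history.size() * 0.7);
        List<Example> train = history.subList(0, cut);               // estimate the model here
        List<Example> test  = history.subList(cut, history.size());  // then evaluate it here
        System.out.println("train=" + train.size() + " test=" + test.size());
    }
}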

• Evaluate the business results

Perhaps the most overrated metric for evaluating business results is accuracy. Suppose marketers have an offer to which only 1% of the people will respond: a model that predicts "nobody will respond" is 99% accurate and 100% useless. Another measure that is frequently used is lift. Lift measures the improvement achieved by a predictive model over the baseline response rate. However, lift does not take cost and revenue into account, so it is often preferable to look at profit or Return on Investment (ROI). Depending on whether the marketer chooses to maximize lift, profit, or ROI, a different percentage of the mailing list will be selected to receive solicitations.
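For reference, lift is simply the ratio of the response rate in the targeted segment to the overall response rate:

lift = (response rate in the targeted segment) / (overall response rate)

For example, if the model's top decile responds at 3% against a 1% overall rate, the lift is 3; note that this still says nothing about mailing cost or revenue per response, which is why profit and ROI are often preferred.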

• Incorporating Data Mining in the business CRM solution

In building a CRM application, data mining is often only a small, albeit critical, part of the final product. For example, predictive patterns found through data mining may be combined with the knowledge of domain experts and incorporated in a large application used by many different kinds of people.

The way data mining is actually built into the application is determined by the nature of the customer interaction. There are two main ways a business interacts with its customers: the business contacts the customer (outbound) or the customer contacts the business (inbound). The deployment requirements are quite different.

Outbound interactions are characterized by the company originating the contact, such as in a direct mail campaign; the marketers select the people to whom they mail by applying the model to the customer database. Another type of outbound campaign is an advertising campaign, in which case the marketers match the profiles of good prospects shown by the model to the profile of the people the advertisement would reach.

For inbound transactions, such as a telephone order, an Internet order, or a customer service call, the application must respond in real time. Therefore the data mining model is embedded in the application and actively recommends an action.

In either case, one of the key issues the business must deal with in applying a model to new data is reproducing the transformations used in building the model. Thus, if the input data (whether from a transaction or a database) contains age, income, and gender fields, but the model requires the age-to-income ratio and gender split into two binary variables, the input data must be transformed accordingly, as sketched below. The ease with which these transformations can be embedded becomes one of the most important productivity factors when marketers want to rapidly deploy many models.
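A Java sketch of re-applying such model-time transformations to new input records, per the example above (the field names are illustrative):

// Turn raw age, income and gender into the inputs the model expects:
// an age-to-income ratio plus two binary gender variables.
public final class ScoringTransforms {
    static double[] transform(double age, double income, String gender) {
        double ageToIncome = income > 0 ? age / income : 0.0;
        double isFemale = "F".equals(gender) ? 1.0 : 0.0;
        double isMale   = "M".equals(gender) ? 1.0 : 0.0;
        return new double[] { ageToIncome, isFemale, isMale };
    }
}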

How to Data Mine for Future Trends

The exciting phenomenon of data mining has taken the business world by storm. The technologies now available on the open market enable all kinds of speculation and observation based on quantifiable data that may be housed in databases or other computerized resources. Data mining can be used to track what customers are doing now, and possibly even what customers will do in the future.

Steps to follow:

• Collect a solid table of data going back several years from the present. A future trend projection is only as good as the data it is based on, and more accurate future projections require longer histories in the database. The more past data the data miner has, the more can be told about the future.

• Set up algorithms to search through the existing data looking for behaviors, sales, or other trends that are currently rising (a minimal sketch follows this list).

• Set up visual graphing that shows how different behaviors have occurred over time. The graph will include a line for each product, behavior or trend under consideration. With visual graphing, the results of the data mining can be seen at a glance.

• Choose a long-term or short-term analysis. As stated, it is good to have a long data history, but that is not necessarily the only criterion for future trends. The analysis may focus only on the short term, on trends that are currently spiking above the norm.

• Add other mitigating data as it is found. Write reports supplementary to the data mining graphs that "explain" a trend and evaluate the chances of its continuance. Having background data and reasonable mitigating factors at hand will help the business analyst make better decisions about the future.
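A minimal Java sketch of flagging "currently rising" series, as promised in the second step above; the series values and the least-squares test are illustrative.

// Fit a least-squares slope to a series' recent history; a positive
// slope marks the trend as currently rising.
public final class TrendScan {
    static double slope(double[] y) {
        int n = y.length;
        double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
        for (int i = 0; i < n; i++) {
            sumX += i; sumY += y[i]; sumXY += i * y[i]; sumXX += (double) i * i;
        }
        return (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
    }

    public static void main(String[] args) {
        double[] monthlySales = {100, 104, 103, 110, 118, 125};  // illustrative series
        System.out.println(slope(monthlySales) > 0 ? "rising" : "not rising");
    }
}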

The 10 Secrets to Using Data Mining to Succeed at CRM

• Planning is the key to a successful data mining project

As with any worthwhile endeavor, planning is half the battle. It is critical that any organization considering a data mining project first define the project objectives from a business perspective, and then convert this knowledge into a coherent data mining strategy and a well-defined project plan.

Plan for data mining success by following these three steps:

 Start with the end in mind. Avoid the "ad hoc trap" of mining data without defined business objectives. Prior to modeling, define a project that supports the organization's strategic objectives. For example, the business objective might be to attract additional customers who are similar to the most valuable current customers, or it might be to keep the most profitable customers longer.

 Get buy-in from business stakeholders. Be sure to involve all those who have a stake in the project. Typically, Finance, Sales, and Marketing are concerned with devising cost-effective CRM strategies, but Database and Information Technology managers are also "interested parties", since their teams are often called upon to support the execution of those strategies.

 Define an executable data mining strategy. Plan how to achieve the business objective by capitalizing on the available resources; both technical and staff resources must be taken into account.

• Set specific goals for the business data mining project

Before an organization begins a data mining project, it should clarify just how data mining can help achieve the business goal. For instance, if reducing customer defection or "churn" is a strategic objective, what level of improvement does the organization want to see? Next, commit to a standard data mining process, such as CRISP-DM (the Cross-Industry Standard Process for Data Mining). Then create a project plan for achieving the business goals, including a clear definition of what will constitute "success". Finally, complete a cost-benefit analysis, taking care to include the cost of any resources that will be required.

• Recruit a broad-based project team

One of the most common mistakes made by those new to data mining is to simply pass responsibility for a data mining initiative to a data miner. Because successful data mining requires a clear understanding of the business problem being addressed, and because in most organizations elements of that business understanding are dispersed among different disciplines or departments, it is important to recruit a broad-based team for the project.

For instance, to evaluate the factors involved in customer churn, the organization may need staff members from Customer Service, Market Research, or even Billing, as well as those with specialized knowledge of the data resources and of data mining. Depending upon the objective, the organization may want representatives in some or all of the following roles: executive sponsor, project leader, business expert, data miner, data expert, and IT sponsor. Some projects may require two or three people; other projects may require more.

• Line up the right data

To help ensure success, it is critical to understand what kinds of data are available and what condition that data is in. Begin with data that is readily accessible; it does not need to be a large amount or organized in a data warehouse. Many useful data mining projects are performed on small or medium-sized datasets, some containing only a few hundred or a few thousand records. For example, the analyst may be able to determine, from a sample of customer records, which of the company's products are typically purchased by customers fitting a certain demographic profile. This enables the organization to predict what other customers might purchase, or what offers they might find most appealing.

• Secure IT buy-in

IT is an important component of any successful data mining initiative. Keep in mind that the data mining tool the organization selects will play an important role in securing buy-in from the IT department. The tool should integrate with the existing data infrastructure (the relevant databases, data warehouses, and data marts) and should provide open access to data, as well as the capability to enhance existing databases with the scores and predictions generated by data mining.

• Select the right data mining solution

Successful, efficient data mining requires data mining solutions that are open and well integrated. Organizations save time and improve the flow of analysis by selecting solutions that support every step of the process. An integrated solution is particularly important when incorporating additional types of data, such as text, Web, or survey data, because each type of data is likely to originate in a different system and exist in a variety of formats. Using an integrated solution enables business analysts to follow a train of thought efficiently, regardless of the type of data involved in the analysis. Integration is also important during the "decision optimization" phase of predictive analytics. Decision optimization determines which actions will drive optimal outcomes, and then delivers those recommended actions to the systems or people that can effectively implement them. To support decision optimization, the business will want a solution that links to operational systems, such as the call center or marketing automation software. Such a solution supports more widespread and rapid, even real-time, delivery of predictive insight.

• Consider mining other types of data to increase the return on the data mining investment

When the business analyst combines text, Web, or survey data with the structured data used in building models, the information available for prediction is enriched. Even adding only one type of additional data will improve the generated results, and incorporating multiple types will provide even greater improvements. To determine whether the company might benefit from incorporating additional types of data, begin by asking the following questions:

 What kinds of business problems are we trying to solve?

 What kinds of data do we have that might address these problems?

The answers to these questions will help the business analyst determine what kinds of data to include, and why. If the analysts are trying to learn why long-time customers are leaving, for example, they may want to analyze text from call center notes combined with the results of customer focus groups or customer satisfaction surveys.

• Expand the scope of data mining to achieve even greater
results

One way the business analyst can increase the Return on Investment (ROI) generated by data mining is by expanding the number of projects undertaken. With the right data mining solution, one that helps automate routine tasks, this can be done without increasing staff.

Gain more from the investment in data mining either by addressing additional related business challenges or by applying data mining in different departments or geographic regions. If the company has already made progress on the top-priority challenge, such as increasing the conversion rate for cross-selling campaigns, consider whether there are secondary challenges that might now be addressed, such as trimming the cost of customer acquisition programs.

• Consider all available deployment options

When mining data, organizations that efficiently deploy results consistently achieve a higher ROI. In early implementations of data mining, deployment consisted of providing analysts with models and managers with reports; models and reports had to be interpreted by managers or staff before strategic or tactical plans could be developed. Later, many companies used batch scoring, often conducted at off-peak hours, to incorporate updated predictions into their databases more efficiently. It even became possible to automate the scheduling of updates and to embed scoring engines within existing applications.

Today, using the latest data mining technologies, massive datasets containing billions of scores can be updated in just a few hours. The data miner can also update models in real time and deploy results to customer-contact staff as the organization interacts with customers. In addition, models or scores can be deployed in real time to systems that generate sales offers automatically or make product suggestions to Web site visitors, to name just two possibilities.

• Increase collaboration and efficiency through model
management

Look into data mining solutions that enable the business analyst to centralize the management of data mining models and that support the automation of processes such as the updating of customer scores. These solutions foster greater collaboration and enterprise efficiency. Central model management also helps the organization avoid wasted or duplicated effort while ensuring that the most effective predictive models are applied to the business challenges. Model management also provides a way to document model creation, usage, and application.

Suggestions for Analytics, Business Intelligence, and Performance Management

In the wake of the long-running massive industry consolidation in


the Enterprise Software industry that reached its zenith with the
acquisitions of Business Intelligence market leaders Hyperion,
Cognos, and Business Objects in 2007, one could certainly have
been forgiven for being less than optimistic about the prospects of
innovation in the Analytics, Business Intelligence, and
Performance Management markets. This is especially true given
the dozens of innovative companies that each of these large best of
breed vendors themselves had acquired before being acquired in
turn. While the pace of innovation has slowed to a crawl as the
large vendors are midway through digesting the former best of
breed market leaders, thankfully for the health of the industry,
nothing could be further from the truth in the market overall. This
market has in fact shown itself to be very vibrant, with a resurgence
of innovative offerings springing up in the wake of the fall of the
largest best of breed vendors.
So what are the trends, and where do we see the industry evolving? Few of these trends are mutually exclusive, but to give the discussion some structure, they are broken down as follows:

• The business analyst will witness the emergence of packaged
strategy-driven execution applications. As we discussed
in Driven to Perform: Risk-Aware Performance
Management From Strategy Through Execution (Nenshad
Bardoliwalla, Stephanie Buscemi, and Denise Broady,
New York, NY, Evolved Technologist Press, 2009), the
end state for next-generation business applications is not
merely to align the transactional execution processes
contained in applications like ERP, CRM, and SCM with
the strategic analytics of performance and risk
management of the organization, but for those strategic
analytics to literally drive execution. We called this
“Strategy-Driven Execution”, the complete fusion of
goals, initiatives, plans, forecasts, risks, controls,
performance monitoring, and optimization with
transactional processes. Visionary applications such as
those provided by Workday and SalesForce.com with
embedded real-time contextual reporting available
directly in the application (not as a bolt-on), and Oracle’s
entire Fusion suite layering Essbase and OBIEE
capabilities tightly into the applications’ logic, clearly
portend the increasing fusion of analytic and transactional
capability in the context of business processes and this
will only increase.

• The holy grail of the predictive, real-time enterprise will
start to deliver on its promises. While classic analytic
tools and applications have always done a good job of
helping users understand what has happened and then
analyze the root causes behind this performance, the
value of this information is often stale before it reaches its
intended audience. The holy grail of analytic technologies
has always been the promise of being able to predict
future outcomes by sensing and responding, with minimal
latency between event and decision point. This has
become manifested in the resurgence of interest in event-
driven architectures that leverage a technology known as
Complex Event Processing and predictive analytics. The
predictive capabilities appear to be on their way to breakout market acceptance, as witnessed by IBM's significant investment in setting up its Business Analytics and Optimization practice with 4,000 dedicated consultants, combined with the massive product portfolio of the Cognos and recently acquired SPSS assets. Similarly, Complex Event
Processing capabilities, a staple of extremely data-
intensive, algorithmically-sophisticated industries such as
financial services, have also become interesting to a
number of other industries that cannot deal with the
amount of real-time data being generated and need to be
able to capture value and decide instantaneously.
Combining these capabilities will lead to new classes of
applications for business management that were
unimaginable a decade ago.
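To make the event-driven idea concrete, here is a minimal sketch of the kind of pattern a CEP engine evaluates continuously; the threshold, window length, and event stream are invented for illustration:

from collections import deque

# Sketch: a toy complex-event-processing rule -- flag any point where
# more than 3 events arrive within a 10-second sliding window.
WINDOW_SECONDS = 10
THRESHOLD = 3
recent = deque()

def on_event(timestamp):
    recent.append(timestamp)
    while recent and timestamp - recent[0] > WINDOW_SECONDS:
        recent.popleft()          # drop events outside the window
    if len(recent) > THRESHOLD:
        print(f"pattern detected at t={timestamp}")

for t in [1, 2, 3, 4, 20, 21, 22, 23, 24]:
    on_event(t)                   # fires at t=4, 23, and 24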

• The industry will put reporting and slice-and-dice
capabilities in their appropriate places and return to its
decision-centric roots with a healthy dose of Web 2.0
style collaboration. It was clear to the pioneers of this
industry, beginning as early as H.P. Luhn’s brilliant
visionary piece A Business Intelligence System from
1958 that the goal of these technologies was to support
business decision-making activities, and we can trace the
roots of modern analytics, business intelligence, and
performance management to the decision-support notion
of decades earlier. But somewhere along the way,
business intelligence became synonymous with reporting
and slicing-and-dicing, which is a metaphor that suits
analysts, but not the average end-user. This has
contributed to the paltry BI adoption rates of
approximately 25% bandied about in the industry, despite
the fact that investment in BI and its priority for
companies has never been higher over the last five years.
Making report production cheaper, to the point of nearly being free (something BI is poised to do), is still unlikely to improve this situation much. Instead, we will see a
resurgence in collaborative decision-centric business
intelligence offerings that make decisions the central
focus of the offerings. From an operational perspective,
this is certainly in evidence with the proliferation of rules-
based approaches that can automate thousands of
operational decisions with little human intervention.
However, for more tactical and strategic decisions, mash-
ups will allow users to assemble all of the relevant data
for making a decision, social capabilities will allow users
to discuss this relevant data to generate “crowd sourced”
wisdom, and explicit decisions, along with automated
inferences, will be captured and correlated against
outcomes. This will allow decision-centric business
intelligence to make recommendations within process
contexts for what the appropriate next action should be,
along with confidence intervals for the expected outcome, as well as being able to tell the user what the risks of her decisions are and how they will impact both the company's and her own personal performance.

• Performance, risk, and compliance management will
continue to become unified in a process-based framework
and make the leap out of the CFO’s office. The
disciplines of performance, risk, and compliance
management have been considered separate for a long
time, but the walls are breaking down, as we documented
thoroughly in Driven to Perform. Performance
management begins with the goals that the organization is
trying to achieve, and as risk management has evolved
from its siloed roots into Enterprise Risk Management, it
has become clear that risks must be identified and
assessed in light of this same goal context. Similarly, in
the wake of Sarbanes-Oxley, as compliance has become
an extremely thorny and expensive issue for companies of
all sizes, modern approaches suggest that compliance is
ineffective when cast as a process of signing off on thousands of individual item checklists, but rather should
be based on an organization’s risks. All three of these
disciplines need to become unified in a process-based
framework that allows for effective organizational
governance. And while financial performance, risk, and
compliance management are clearly the areas of most
significant investment for most companies, it is clear that
these concerns are now finally becoming enterprise-level
plays that are escaping the confines of the Office of the
CFO. We will continue to witness significant investment
in sales and marketing performance management, with vendors like Right90 continuing to gain traction in improving the sales forecasting process and vendors like Varicent receiving hefty $35 million venture rounds this year, no doubt thanks to over 100% year-over-year growth in the burgeoning Sales Performance
Management category. My former Siebel colleague,
Bruce Cleveland, now a partner at Interwest, makes the
case for this market expansion of performance
management into the front-office rather convincingly and
has invested correspondingly.

• Cloud Business Intelligence tools will steal significant revenue from on-premise vendors but also fight for limited oxygen amongst themselves. From many accounts, this was the year that cloud-based BI offerings hit the mainstream due to their numerous advantages over on-premise offerings, and this certainly was in evidence with the significant uptick in investment and market visibility of cloud BI vendors. Although much was made of the folding of LucidEra, one of the original pioneers in the space, and while other vendors like BlinkLogic folded as well, vendors like Birst, PivotLink, Good Data, Indicee and others continue to announce wins at a fair clip, along with innovations at a fraction of the cost of their on-premise brethren. From a functionality perspective, these tools offer great usability, some collaboration features, strong visualization capabilities, and an ease of use not seen in their on-premise equivalents, whereby users are able to manage the system in a self-sufficient fashion without significant IT involvement. Cloud BI vendors have long argued that basic reporting and analysis is now a commodity, so there is little reason for any customer to invest in on-premise capabilities at the price/performance ratio the cloud vendors are offering. Expect continued diminution of the on-premise vendors' BI revenue streams as the cloud BI value proposition goes mainstream, although it wouldn't be surprising to see acquisitions by the large vendors to stem the tide. However, with so many small players in the market offering largely similar capabilities, the cloud BI tools vendors may wind up starving themselves for oxygen as companies put price pressure on one another to gain new customers. Only vendors whose offerings were designed from the beginning for cloud-scale architecture, and whose marginal cost per additional user thus approaches zero, will succeed in such a commodity pricing environment; alternatively, these vendors can go upstream and try to compete in the enterprise, where the risks and rewards of competition are much higher. On the other hand, packaged cloud BI applications such as those offered by Host Analytics, Adaptive Planning, and new entrant Anaplan, while showing promising growth, have yet to reach mainstream adoption, but are poised to do so in the coming years. As with all such applications, addressing key integration and security concerns will remain crucial to driving adoption.

• The undeniable arrival of the era of big data will lead to further proliferation in data management alternatives. While analytic-centric OLAP databases such as Oracle Express, Hyperion Essbase, and Microsoft Analysis Services have been around for decades, they have never held the same dominant market share, from an applications consumption perspective, that the RDBMS vendors have enjoyed over the last few decades. No matter what the application type, the RDBMS seemed to be the answer. However, we have witnessed an explosion of exciting data management offerings in the last few years that have reinvigorated the information management sector of the industry. The largest web players such as Google (BigTable), Yahoo (Hadoop), Amazon (Dynamo), and Facebook (Cassandra) have built their own solutions to handle their own incredible data volumes, with the open source Hadoop ecosystem and commercial offerings like Cloudera leading the charge in broad awareness. Additionally, a whole new industry of DBMSs dedicated to analytic workloads has sprung up, with flagship vendors like Netezza, Greenplum, Vertica, Aster Data, and the like delivering significant innovations in in-memory processing, exploiting parallelism, columnar storage options, and more. We are already starting to see hybrid approaches between the Hadoop players and the ADBMS players, and even the largest vendors, like Oracle with their Exadata offering, are excited enough to make significant investments in this space. Additionally, significant opportunities to push application processing into the databases themselves are manifesting. There has never been such a plethora of choices available, as new entrants to the market seem to crop up weekly. Visionary applications of this technology in areas like meteorological forecasting and genomic sequencing with massive data volumes will become possible at hitherto unimaginable price points.

• Advanced visualization will continue to increase in depth and relevance to broader audiences. Visionary vendors like Tableau, QlikTech, and Spotfire (now Tibco) made their mark by providing significantly differentiated visualization capabilities compared with the trite bar and pie charts of most business intelligence players' reporting tools. The latest advances in state-of-the-art user interface technologies, such as Microsoft's Silverlight, Adobe Flex, and AJAX via frameworks like Google's Web Toolkit, augur a revolution in state-of-the-art visualization capabilities. With consumers broadly aware of the power of capabilities like Google Maps or the tactile manipulations possible on the iPhone, these capabilities will find their way into enterprise offerings at a rapid pace, lest the gap between the consumer and enterprise realms become too large and lead to large-scale adoption revolts as a younger generation enters the workforce having never known the green screens of yore.

• Open source offerings will continue to make inroads against on-premise offerings. Much as cloud offerings are doing, open source offerings in the larger business intelligence market are disrupting the incumbent, closed-source, on-premise vendors. Vendors like Pentaho and JasperSoft are really starting to hit their stride, with growth percentages well above the industry average, offering complete end-to-end business intelligence stacks at a fraction of the cost of their competitors and thus seeing good bottom-up adoption rates. This is no doubt a function of the brutal economic times companies find themselves experiencing. Individual parts of the stacks can also be assembled into compelling offerings and receive valuable innovations from both corporate entities and dedicated committers: JFreeChart for charting, Actuate's BIRT for reporting, Mondrian and Jedox's Palo for OLAP servers, DynamoBI's LucidDB for ADBMS, Revolution Computing's R for statistical manipulation, Cloudera's enterprise Hadoop for massive data, EsperTech for CEP, Talend for data integration / data quality / MDM, and the list goes on. These offerings have absolutely reached a level of maturity where they are capable of being deployed in the enterprise right alongside any other commercial closed-source vendor offering.

• Data quality, data integration, and data virtualization will merge with master data management to form a unified information management platform for structured and unstructured data. Data quality has been the bane of information systems for as long as such systems have existed, causing many an IT analyst to obsess over it, and data quality issues contribute to significant losses in system adoption, productivity, and time spent addressing them. Increasingly, data quality and data integration will be interlocked hand-in-hand to ensure the right, cleansed data is moved to downstream sources by attacking the problem at its root. Vendors including SAP BusinessObjects, SAS, Informatica, and Talend are all providing these capabilities to some degree today. Of course, with the number of relevant data sources exploding in the enterprise and no way to integrate all the data sources into a single physical location while maintaining agility, vendors like Composite Software are providing data virtualization capabilities, whereby canonical information models can be overlaid on top of information assets regardless of where the data are located, capable of addressing the federation of batch, real-time, and event data sources. These disparate data sources will need to be harmonized by strong master data management capabilities, whereby the definitions of key entities in the enterprise, like customers, suppliers, and products, can be used to provide semantic unification over distributed data sources. Finally, structured, semi-structured, and unstructured information will all be able to be extracted, transformed, loaded, and queried from this ubiquitous information management platform, by leveraging text analytics capabilities that continue to grow in importance and combining them with data virtualization capabilities.

Excel will continue to provide the dominant paradigm for end-user business intelligence consumption. For Excel specifically, by far the number one analytic tool, with a home on hundreds of millions of personal desktops, Microsoft has invested significantly in ensuring its continued viability as it moves past its second decade of existence, and its adoption shows absolutely no sign of abating any time soon. With Excel 2010's arrival, this includes significantly enhanced charting capabilities, a server-based mode first released in 2007 called Excel Services, first-class citizenship in SharePoint, and the biggest disruptor, the launch of PowerPivot, an extremely fast, scalable, in-memory analytic engine that allows Excel analysis on millions of rows of data at sub-second speeds. While many vendors have tried in vain to displace Excel from the desktops of business users for more than two decades, none will be any closer to succeeding any time soon. Microsoft will continue to make sure of that.

Success Stories of Implementing Data Mining in Business

 Acquiring new customers via Data Mining

The first step in CRM is to "identify prospects and convert them to customers."

For example, look at how data mining can help manage the costs and improve the effectiveness of a customer acquisition campaign.

Big Bank and Credit Card Company (BB&CC) annually conducts 25 direct mail campaigns, each of which offers one million people the opportunity to apply for a credit card. The conversion rate measures the proportion of people who become credit card customers, which for BB&CC is about 1% per campaign.

Getting people to fill out an application for the credit card is only
the first step. Then Big Bank and Credit Card Company (BB&CC)
must decide whether the applicant is a good risk and accept them as
a customer. Not surprisingly, poor credit risks are more likely to
accept the offer than are good credit risks. So while 6% of the
people on the mailing list respond with an application, only about
16% of those are suitable credit risks, for a net of about 1% of the
mailing list becoming customers.
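As a quick check of that funnel arithmetic (the report rounds the final figures), a short Python sketch:

mailing_list = 1_000_000
response_rate = 0.06        # 6% send back an application
approval_rate = 0.16        # 16% of applicants are acceptable risks

applications = mailing_list * response_rate   # 60,000
customers = applications * approval_rate      # 9,600 -- "about 1%"
print(customers / mailing_list)               # 0.0096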

BB&CC's experience of a 6% response rate means that within the million names are 60,000 people who will respond to the solicitation. Unless BB&CC changes the nature of the solicitation (using different mailing lists, reaching customers in different ways, or altering the terms of the offer), the company is not going to get more than 60,000 responses. And of those 60,000 responses, only 10,000 will be good enough risks to become customers. The challenge BB&CC faces is getting to those 10,000 people most efficiently.

The cost of mailing the solicitations is about $1.00 per piece for a
total cost of $1,000,000. Over the next couple of years, these
customers will generate about $1,250,000 in profit for the bank (or
about $125 each) for a net return from the mailing of $250,000.

Data mining can improve this return. Although it won't precisely identify the 10,000 eventual credit card customers, it will help focus marketing efforts much more cost-effectively.

First Big Bank and Credit Card Company (BB&CC) did a test
mailing of 50,000 and carefully analyzed the results, building a
predictive model of who would respond (using a decision tree) and
a credit scoring model (using a neural net). It then combined these
two models to find the people who were both good credit risks and
most likely to respond to the offer.
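The sketch below illustrates one way the two-model combination just described could look in practice; the column names, the scikit-learn library choice, and the 0.5 cutoffs are assumptions for illustration, not details from the case:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

test = pd.read_csv("test_mailing.csv")     # the analyzed 50,000 test records
features = ["age", "income", "tenure"]     # assumed predictor columns

# Response model (decision tree) and credit-scoring model (neural net).
response_model = DecisionTreeClassifier(max_depth=5)
response_model.fit(test[features], test["responded"])
credit_model = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500)
credit_model.fit(test[features], test["good_risk"])

rest = pd.read_csv("remaining_list.csv")   # the other 950,000 names
p_respond = response_model.predict_proba(rest[features])[:, 1]
p_good = credit_model.predict_proba(rest[features])[:, 1]

# Mail only people who are both likely responders and good risks;
# the 0.5 cutoffs are placeholders one would tune to select ~700,000.
selected = rest[(p_respond > 0.5) & (p_good > 0.5)]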

The model was applied to the remaining 950,000 people in the mailing list, from which 700,000 people were selected for the mailing. The result was that from the 750,000 pieces mailed overall (including the test mailing), 9,000 acceptable applications for credit cards were received. In other words, the response rate had risen from 1% to 1.2%, a 20% increase. The targeted mailing reached only 9,000 of the 10,000 prospects (no model is perfect), but reaching the remaining 1,000 prospects is not profitable: had BB&CC mailed the other 250,000 people on the list, the $250,000 cost would have yielded only another $125,000 of gross profit, for a net loss of $125,000.

The following table summarizes the results:

Items                       Old          New          Difference
Number of pieces mailed     1,000,000    750,000      (250,000)
Cost of mailing             $1,000,000   $750,000     ($250,000)
Number of responses         10,000       9,000        (1,000)
Gross profit per response   $125         $125         $0
Gross profit                $1,250,000   $1,125,000   ($125,000)
Net profit                  $250,000     $375,000     $125,000
Cost of model               $0           $40,000      $40,000
Final profit                $250,000     $335,000     $85,000

Table 15: Cost Sheet of Mailing System Table {Source: Computer Reseller News (CRN) Magazines}

Notice that the net profit from the mailing increased by $125,000. Even after including the $40,000 cost of the data mining software, computing, and people resources used for this modeling effort, the net profit increased by $85,000. This translates to a return on investment for modeling of over 200%, which far exceeded BB&CC's Return on Investment (ROI) requirement for this project.
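The ROI figure follows directly from the table; a two-line check:

extra_net_profit = 125_000   # increase in net profit from targeting
model_cost = 40_000          # software, computing, and staff
print((extra_net_profit - model_cost) / model_cost)   # 2.125 -> over 200%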

 Increasing the Value of Existing Customers: Cross-Selling via Data Mining

Guns and Roses (G&R) is a company that specializes in selling antique mortars and cannons as outdoor flower pots. G&R also offers a line of indoor flower pots made from large-caliber antique pistols, and a collection of muskets that have been converted into unique holders for long-stemmed flowers. Their catalog is sent to about 12 million homes.

When a customer calls in to place an order, G&R identifies the caller using caller ID when possible; otherwise they ask for a phone number or the customer number from the catalog mailing label. Next, they look up the customer in the database and proceed to take the order.

G&R has an excellent chance of selling the caller something additional: cross-selling. But they had found that if the first suggestion fails and they try to suggest a second item, the customer may get irritated and hang up without ordering anything. And some customers resent any attempt at all to cross-sell them additional products.

Before trying data mining, G&R had been reluctant to cross-sell at all. Without a model, the odds of making the right recommendation were one in three. And because making any recommendation at all is unacceptable to some customers, G&R wanted to be exceptionally sure they never made a recommendation when they should not. In a trial campaign, G&R had less than a 1% sales rate and received a substantial number of complaints. They were reluctant to continue for such a small gain.

The situation changed dramatically with the use of data mining. Now a data mining model operates on the data: using the customer information in the database and the new order, it tells the customer service representative what to recommend. G&R successfully sold an additional product to 2% of their customers, with virtually no complaints.

Developing this capability involved a process similar to solving the credit card customer acquisition problem. As with that situation, two models were needed.

The first model predicted whether someone would be offended by recommendations. G&R found out how their customers would react by conducting a very short telephone survey. To be conservative, they counted anyone who declined to participate in the survey as someone who would find recommendations intrusive. Later on, to verify this assumption, they made recommendations to a small but statistically significant subset of those who had refused to answer the survey questions. To their surprise, they found that the assumption was not warranted, which enabled them to make more recommendations and further increase profits.

The second model predicted which offer would be most acceptable.

In summary, data mining helped G&R better understand their customers' needs. When the data mining models were incorporated into a typical cross-selling CRM campaign, they helped the company increase its profitability by 2%.

 Increasing the Value of Existing Customers: Personalization via Data Mining

Big Sam's Clothing Company has set up a website to supplement their catalog. Whenever customers go to the site it greets them with "Howdy Pardner!", but once customers have ordered or registered, the site greets them by name. If a customer has a record of ordering, the site will also mention any new products that might be of particular interest. And when a customer looks at a particular product, such as a waterproof down parka, the site will suggest other items that might complement such a purchase.

When Big Sam's first put up the site, there was none of this personalization. It was just an on-line version of their catalog, nicely and efficiently done, but not taking advantage of the sales opportunities presented by the Web.

Data mining greatly increased the sales at their website. Catalogs frequently group products by type to simplify the user's task of selecting products. In an on-line store, however, the product groups may be quite different, often based on complementing the item under consideration. In particular, the site can take into account not only the items the customer is looking at, but what is in the shopping cart as well, leading to even more customized recommendations.
First, Big Sam's used clustering to discover which products grouped together naturally. Some of the clusters were obvious, such as shirts and pants. Others were surprising, such as books about desert hiking and snakebite kits. The site used these groupings to make recommendations whenever someone looked at a product.
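A minimal sketch of how such product groupings might be discovered; the toy co-purchase matrix and the choice of K-means are illustrative assumptions, not details from the case:

import numpy as np
from sklearn.cluster import KMeans

# Rows = products, columns = customers; 1 means the customer
# bought that product (a toy co-purchase matrix).
purchases = np.array([
    [1, 1, 0, 0, 1],   # shirts
    [1, 1, 0, 0, 1],   # pants
    [0, 0, 1, 1, 0],   # desert-hiking book
    [0, 0, 1, 1, 0],   # snakebite kit
])

labels = KMeans(n_clusters=2, n_init=10).fit_predict(purchases)
print(labels)   # e.g. [0 0 1 1]: the book and the kit group together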

Big Sam's then built a customer profile to help identify those customers who would be interested in the new products the company was always adding to its catalog. They found that steering people to these selected products not only resulted in significant incremental sales, but also solidified their relationship with the customer. Surveys established that Big Sam's was viewed as a trusted advisor for clothing and gear.

To extend their reach further, Big Sam's started a program through which customers could elect to receive e-mail about new products that the data mining models predicted would interest them. While customers viewed this as another example of proactive customer service, Big Sam's found it to be a profit improvement program.

The effort in personalization paid off for Big Sam's with significant, measurable increases in repeat sales, average number of sales per customer, and average size of a sale.

 Retaining Good Customers Via Data Mining

For almost every company, the cost of acquiring a new customer exceeds the cost of keeping good customers. This was the challenge facing Know Service, an Internet Service Provider (ISP) whose attrition rate was the industry average of 8% per month. Since Know Service has one million customers, this means 80,000 customers left each month. The cost to replace these customers is $200 each, or $16,000,000 per month: plenty of incentive to start an attrition management program.

The first thing Know Service needed to do was prepare the data for predicting which customers would leave. They needed to select the variables from their customer database and perhaps transform them. The bulk of their users were dial-in clients (as opposed to clients who are always connected through a T1 or DSL line), so Know Service knew how long each user was connected to the Web. They also knew the volume of data transferred to and from a user's computer, the number of e-mail accounts a user had, the number of e-mail messages sent and received, and each customer's service and billing history. In addition, Know Service had demographic data that customers provided at sign-up.

Next, Know Service needed to identify who their "good" customers were. This is not a data mining question but a business definition (such as profitability) followed by a calculation. Know Service built a model to profile their profitable customers and their unprofitable customers. They used this model not only for customer retention but also to identify customers who were not yet profitable but might become so in the future.

Know Service then built a model to predict who among their profitable customers would leave. As in most data mining problems, determining what data to use and how to combine existing data is where much of the challenge of model development lies.

For example, Know Service needed to look at time-series data such as monthly usage. Rather than use the raw time-series data, they smoothed it by taking rolling three-month averages. They also calculated the change in the three-month average and tried that as a predictor. Some of the factors that were good predictors, such as declining usage, were symptoms rather than causes that could be directly addressed. Other predictors, such as the average number of service calls and the change in that average, were indicative of customer satisfaction problems worth investigating. Predicting who would churn, however, wasn't enough. Based on the results of their modeling, Know Service identified some potential programs and offers that they believed would entice people to stay.
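A minimal sketch of the smoothing step described above; the file and column names are assumptions for illustration:

import pandas as pd

usage = pd.read_csv("monthly_usage.csv")    # one row per customer-month
usage = usage.sort_values(["customer_id", "month"])

g = usage.groupby("customer_id")["hours_online"]
usage["avg_3m"] = g.transform(lambda s: s.rolling(3).mean())
usage["avg_3m_change"] = usage.groupby("customer_id")["avg_3m"].diff()
# A sustained negative avg_3m_change flags the declining-usage
# symptom mentioned above.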

For example, some churners were exceeding even the largest amount of usage available for a fixed fee and were paying substantial incremental usage fees. Know Service tried offering these users a higher-fee service that included more bundled time. Other users were offered extras such as more free disk space for personal web pages. Know Service then built models to predict which offer would be effective for a particular user.

To summarize, the churn project made use of all three models. One
model identified likely churners, the next model picked out the
profitable ones worth keeping, and the third model matched the
potential churners with the most appropriate offer. The net result
was a reduction in their churn rate from 8% to 7.5%, for a savings
in customer acquisition costs of $1,000,000 per month.
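The quoted savings follow from the numbers in the case:

customers = 1_000_000
saved_customers = (0.08 - 0.075) * customers   # 5,000 fewer churners/month
print(saved_customers * 200)                   # $1,000,000 per month at $200 each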

Know Service found that their investment in data mining paid off
by improving their customer relationships and dramatically
increasing their profitability.

Conclusion

Data mining can be beneficial for businesses, governments, and society, as well as for the individual person. However, a major flaw of data mining is that it increases the risk of privacy invasion. Currently, business organizations do not have sufficient security systems to protect the information they obtain through data mining from unauthorized access, so the use of data mining should be restricted. In the future, when companies are willing to spend money to develop sufficient security systems to protect consumer data, broader use of data mining can be supported.

Comprehensive data warehouses that integrate operational data with customer, supplier, and market information have resulted in an explosion of information. Competition requires timely and sophisticated analysis of an integrated view of the data. However, there is a growing gap between more powerful storage and retrieval systems and the users' ability to effectively analyze and act on the information they contain. Both relational and OLAP technologies have tremendous capabilities for navigating massive data warehouses, but brute-force navigation of data is not enough. A new technological leap is needed to structure and prioritize information for specific end-user problems. Data mining tools can make this leap. Quantifiable business benefits have been proven through the integration of data mining with current information systems, and new products are on the horizon that will bring this integration to an even wider audience of users.

Data mining has been gaining tremendous interest, and research on data mining has mushroomed within the last few decades. A promising approach for managing complex information and user-defined data types is to incorporate object-orientation concepts into relational database management systems. In this research, I have presented an approach for the design of an object-oriented database and for performing classification effectively in it. The object-oriented programming concepts of inheritance and polymorphism have been utilized in the presented approach. Owing to this design of the Object-Oriented Database (OODB), an efficient classification task has been achieved using simple SQL/Oracle queries.

The experimental results have demonstrated the effectiveness of the presented approach. This approach will successfully reduce the implementation overhead incurred in the design of an OODB. It will also reduce the amount of memory space required for storing databases that grow in size.

Customer Relationship Management is essential to compete effectively in today's marketplace. The more effectively a business can use the information about its customers to meet their needs, the more profitable the business will be. But operational CRM needs analytical CRM, with predictive data mining models at its core. The route to a successful business requires understanding customers and their requirements, and data mining is the essential guide.

In spite of the often uncanny accuracy of insight that data mining provides, it is not magic. It's a valuable business tool that organizations around the globe are successfully using to make critical business decisions about customer acquisition and retention, customer value management, marketing optimization, and other customer-related issues.

Similarly, the keys to effectively using data mining are not secret
or mysterious. With a solid understanding of the issues to be
addressed, appropriate resources and support, and the right
solution, the business analyst, too, can experience the business
benefits that other organizations are reaping from data mining.

APPENDIX – I

List of Figures

Figure 1: The Database System

Figure 2: Data Mining is the core of the Knowledge Discovery Process

Figure 3: Fragments of some relations from a relational database for VideoStore

Figure 4: A multi-dimensional data cube structure commonly used in data warehousing

Figure 5: Summarized data from VideoStore before and after drill-down and roll-up operations

Figure 6: Fragment of a transaction database for the rentals at VideoStore

Figure 7: Visualization of spatial OLAP (from the GeoMiner system)

Figure 8: Examples of Time-Series Data (Source: Thompson Investors Group)

Figure 9: Class Structure of the Employees, Suppliers and Customers Tables

Figure 10: Inheritance Hierarchy of Classes in the Proposed OODB Design

Figure 11: Graph Demonstrating the Evaluation Results

Figure 12: Decision Tree

Figure 13: Generalization / Specialization

APPENDIX – II

List of Tables

Table 1: Example of Employees Table

Table 2: Example of Customers Table

Table 3: Example of Suppliers Table

Table 4: Example of Persons Table

Table 5: Example of Extended Employees Table

Table 6: Example of Extended Suppliers Table

Table 7: Example of Extended Customers Table

Table 8: Example of Extended Places Table

Table 9: Example of Extended PostalCodes Table

Table 10: Saved Memory Table {Source: Computer Reseller News (CRN) Magazines}

Table 11: Saved Memory Table {Source: Computer Reseller News (CRN) Magazines}

Table 12: Saved Memory Table {Source: Computer Reseller News (CRN) Magazines}

Table 13: Saved Memory Table {Source: Computer Reseller News (CRN) Magazines}

Table 14: Saved Memory Table {Source: Computer Reseller News (CRN) Magazines}

Table 15: Cost Sheet of Mailing System Table {Source: Computer Reseller News (CRN) Magazines}

APPENDIX – III

List of Equations

Equation 1: To Determine the Memory Size

APPENDIX – IV

SQL Queries

Create Type Statements

Create Type PostalCodes

Create type PostalCodes as Object(PostalCode number(6), City Varchar2(15), Region Varchar2(15), Country Varchar2(15));

Create Type Places

Create type Places as Object(PlaceID number(6), Street Varchar2(15), PostalCode PostalCodes);

Create Type Persons

Create type Persons as Object(PersonID number(6), Name varchar2(15), Age number(3), Gender varchar2(6), MaritalStatus varchar2(9), BirthDate date, Place Places, Phone number(10));

Create Type Employees

Create type Employees as Object(EmployeeID Persons, Title varchar2(10), TitleofCourtesy Varchar2(4), HireDate date, Extension date);

Create Type Suppliers

Create type Suppliers as Object(SupplierID Persons, CompanyName varchar2(20), ContactTitle varchar2(15), Fax number(10), HomePage varchar2(20));

Create Type Customers

Create type Customers as Object(CustomerID Persons, CompanyName varchar2(20), ContactTitle varchar2(15), Fax number(10));

APPENDIX - V

Abbreviation and Synonyms

RDBMS: Relational Database Management Systems

OODB: Object-Oriented Database

OOP: Object-Oriented Programming

OODBMS: Object-Oriented Database Management Systems

OOPL: Object-Oriented Programming Language

VRML: Virtual Reality Markup Language / Virtual Reality Modeling Language

NASA: National Aeronautics and Space Administration (USA)

CAD: Computer Aided Design

CAM: Computer Aided Manufacturing

CASE: Computer Aided Software Engineering

ROI: Return on Investment

CRISP-DM: Cross-Industry Standard Process for Data Mining

CRM: Customer Relationship Management

BB&CC: Big Bank and Credit Card Company

G&R: Guns and Roses Company

ISP: Internet Service Provider

BI: Business intelligence

Repository:

A repository is a collection of resources that can be accessed to retrieve information. Repositories often consist of several databases tied together by a common search engine.

Customer Relationship Management:

Customer Relationship Management is a broadly recognized, widely implemented strategy for managing and nurturing a company's interactions with clients and sales prospects.

Noise Data:

Noise data is meaningless data. The term has often been used as a
synonym for corrupt data. However, its meaning has expanded to
include any data that cannot be understood and interpreted
correctly by machines, such as unstructured text. Any data that has
been received, stored, or changed in such a manner that it cannot be
read or used by the program that originally created it can be
described as noisy.

Noise data unnecessarily increases the amount of storage space required and can also adversely affect the results of any data mining analysis. Statistical analysis can use information gleaned from historical data to weed out noisy data and facilitate data mining.

Noise data can be caused by hardware failures, programming errors, and gibberish input from speech or optical character recognition (OCR) programs. Spelling errors, industry abbreviations, and slang can also impede machine reading.

Web Mining:

Web Mining is the application of data mining techniques to discover patterns from the Web. According to the analysis target, Web mining can be divided into three different types: Web usage mining, Web content mining, and Web structure mining.

Mobile Computing:

Mobile Computing is a generic term describing one's ability to use technology while moving, as opposed to portable computers, which are only practical for use while deployed in a stationary configuration.

Nearest Neighbor Method:

Nearest Neighbor Search (NNS), also known as proximity search, similarity search, or closest point search, is an optimization problem for finding closest points in metric spaces. The problem is: given a set S of points in a metric space M and a query point q belonging to M, find the closest point in S to q.
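For Example (a brute-force sketch; real systems use index structures such as k-d trees):

import math

def nearest_neighbor(S, q):
    # Scan all points and keep the one closest to q (Euclidean metric).
    return min(S, key=lambda p: math.dist(p, q))

points = [(0, 0), (3, 4), (1, 1)]
print(nearest_neighbor(points, (2, 2)))   # (1, 1)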

Niche:

In ecology, a niche is a term describing the relational position of a species or population in its ecosystem; e.g., a dolphin could potentially occupy a different ecological niche from one that travels in a different pod if the members of these pods utilize significantly different resources.

Semantics:

Semantics is the study of meaning, usually in language. The word "semantics" itself denotes a range of ideas, from the popular to the highly technical. It is often used in ordinary language to denote a problem of understanding that comes down to word selection or connotation.

Specialization:

An entity set may include subgroupings of entities that are distinct in some way from other entities in the set. For instance, a subset of entities within an entity set may have attributes that are not shared by all the entities in the entity set. The E-R model provides a means for representing these distinctive entity groupings. Consider an entity set person, with attributes name, street, and city. A person may be further classified as one of the following:

1. Customer

2. Employee

Each of these person types is described by a set of attributes that includes all the attributes of entity set person plus possibly additional attributes. For example, customer entities may be described further by the attribute customer-id, whereas employee entities may be described further by the attributes employee-id and salary. The process of designating subgroupings within an entity set is called specialization. The specialization of person allows us to distinguish among persons according to whether they are employees or customers.

Generalization:

The design process may also proceed in a bottom-up manner, in which multiple entity sets are synthesized into a higher-level entity set on the basis of common features. The database designer may have first identified a customer entity set with the attributes name, street, city, and customer-id, and an employee entity set with the attributes name, street, city, employee-id, and salary. There are similarities between the customer entity set and the employee entity set in the sense that they have several attributes in common. This commonality can be expressed by generalization, which is a containment relationship that exists between a higher-level entity set and one or more lower-level entity sets. In our example, person is the higher-level entity set, and customer and employee are lower-level entity sets.

Higher- and lower-level entity sets may also be designated by the terms superclass and subclass, respectively. The person entity set is the superclass of the customer and employee subclasses. For all practical purposes, generalization is a simple inversion of specialization. We apply both processes, in combination, in the course of designing the E-R schema for an enterprise. In terms of the E-R diagram itself, we do not distinguish between specialization and generalization. New levels of entity representation are distinguished (specialization) or synthesized (generalization) as the design schema comes to express fully the database application and the user requirements of the database.

Differences in the two approaches may be characterized by their starting point and overall goal. Generalization proceeds from the recognition that a number of entity sets share some common features (namely, they are described by the same attributes and participate in the same relationship sets).

Figure 13: Generalization / Specialization

Method Override:

Method overriding, in object-oriented programming, is a language feature that allows a subclass to provide a specific implementation of a method that is already provided by one of its superclasses. The implementation in the subclass overrides (replaces) the implementation in the superclass.

For Example:

class Base
{
public:
    // Declared virtual so that derived classes may override it.
    virtual void DoSomething() { x = x + 5; }
private:
    int x;
};

class Derived : public Base
{
public:
    // Overrides Base::DoSomething, then reuses the base implementation.
    virtual void DoSomething() { y = y + 5; Base::DoSomething(); }
private:
    int y;
};

Method Overload:

Method overloading allows us to write different versions of the same method in a class or derived class. The compiler automatically selects the most appropriate method based on the parameters supplied.

For Example:

public class MultiplyNumbers
{
    // Two overloads of Multiply; the compiler picks one by parameter count.
    public int Multiply(int a, int b) { return a * b; }

    public int Multiply(int a, int b, int c) { return a * b * c; }

    public static void Main()
    {
        MultiplyNumbers mn = new MultiplyNumbers();
        int number = mn.Multiply(2, 3);     // result = 6
        int number1 = mn.Multiply(2, 3, 4); // result = 24
    }
}

Polymorphic:

In computer science, polymorphism is a programming language feature that allows values of different data types to be handled using a uniform interface. The concept of parametric polymorphism applies to both data types and functions. The word itself means "able to have several shapes or forms."

Interface:

An interface in the Java programming language is an abstract type that is used to specify a contract (an interface in the generic sense of the term) that classes must implement. In computer science more generally, an interface refers to a set of named operations that can be invoked by clients; it is an abstraction that an entity provides of itself to the outside.

Inheritance:

Inheritance is the practice of passing on property, titles, debts, and obligations upon the death of an individual; it has long played an important role in human societies, and the rules of inheritance differ between societies and have changed over time. In object-oriented programming (OOP), inheritance is a way to form new classes (instances of which are called objects) using classes that have already been defined. Inheritance is employed to help reuse existing code with little or no modification.

Composite Object Modeling:

In programming languages, composite objects are usually expressed by means of references from one object to another; depending on the language, such references may be known as fields, members, properties, or attributes, and the resulting composition as a structure, storage record, tuple, user-defined type (UDT), or composite type. Fields are given unique names so that each one can be distinguished from the others. However, having such references doesn't necessarily mean that an object is a composite: it is only called composite if the objects it refers to are really its parts, i.e., have no independent existence.

Decision trees:

A decision tree (or tree diagram) is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility.

Decision trees are a way of representing a series of rules that lead to a class or value. For example, a business may wish to offer a prospective customer a particular product. The figure shows a simple decision tree that solves this problem while illustrating all the basic components of a decision tree: the decision node, branches, and leaves.

[Figure: A simple classification tree]

The first component is the top decision node, or root node, which specifies a test to be carried out. Each branch leads either to another decision node or to the bottom of the tree, called a leaf node. By navigating the decision tree, a business analyst can assign a value or class to a case by deciding which branch to take, starting at the root node and moving to each subsequent node until a leaf node is reached. Each node uses the data from the case to choose the appropriate branch. Decision tree models are commonly used in data mining to examine the data and induce a tree and its rules that will be used to make predictions. A number of different algorithms may be used for building decision trees, including CHAID (Chi-squared Automatic Interaction Detection), CART (Classification and Regression Trees), QUEST, and C5.0.
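For Example (a minimal sketch: scikit-learn's CART-style tree stands in for the algorithms named above, and the toy data is invented):

from sklearn.tree import DecisionTreeClassifier, export_text

X = [[25, 30_000], [40, 80_000], [35, 60_000], [50, 20_000]]  # age, income
y = [0, 1, 1, 0]                                              # 1 = make offer

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["age", "income"]))     # the induced rules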

Reluctant: not eager; unwilling; disinclined

Warranted: authorized or certified; sanctioned, as by a superior.

Attrition Rate: The rate of shrinkage in size or number.

Solicitation: To seek to obtain by persuasion, entreaty, or formal application: a candidate who solicited votes among the factory workers.

Clustering:

Clustering divides a database into different groups. The goal of clustering is to find groups that are very different from each other and whose members are very similar to each other. Unlike classification, marketers don't know what the clusters will be when they start, or by which attributes the data will be clustered. Consequently, someone who is knowledgeable in the business must interpret the clusters. After marketers have found clusters that reasonably segment the database, these clusters may then be used to classify new data. Some of the common algorithms used to perform clustering include Kohonen feature maps and K-means. Don't confuse clustering with segmentation: segmentation refers to the general problem of identifying groups that have common characteristics. Clustering is a way to segment data into groups that are not previously defined, whereas classification is a way to segment data by assigning it to groups that are already defined.

Neural Networks:

A computer architecture in which processors are connected in a manner suggestive of connections between neurons, and which can learn by trial and error.

Neural networks are of particular interest because they offer a means of efficiently modeling large and complex problems in which there may be hundreds of predictor variables that have many interactions. (Actual biological neural networks are incomparably more complex.) Neural nets are most commonly used for regression but may also be used in classification problems.
