
CS2032 DATA WAREHOUSING AND DATA MINING

Department of Information Technology


UNIT I
DATA WAREHOUSING
Data Warehouse Introduction
A data warehouse is a collection of data marts representing historical data from different
operations in the company. This data is stored in a structure optimized for querying and data analysis.
Table design, dimensions and organization should be consistent throughout a data
warehouse so that reports or queries across the data warehouse are consistent. A data warehouse can also
be viewed as a database for historical data from different functions within a company.
The term Data Warehouse was coined by Bill Inmon in 1990, which he defined in the following
way: "A warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data
in support of management's decision making process."
He defined the terms in the sentence as follows:
Subject Oriented: Data that gives information about a particular subject instead of about a company's
ongoing operations.
Integrated: Data that is gathered into the data warehouse from a variety of sources and merged into a
coherent whole.
Time-variant: All data in the data warehouse is identified with a particular time period.
Non-volatile: Data is stable in a data warehouse. More data is added, but data is never removed.
This enables management to gain a consistent picture of the business. A data warehouse is a single,
complete and consistent store of data obtained from a variety of different sources, made available to end
users in a form they can understand and use in a business context. It can be
• Used for decision support
• Used to manage and control business
• Used by managers and end users to understand the business and make judgments
Data Warehousing is an architectural construct of information systems that provides users with current
and historical decision support information that is hard to access or present in traditional operational data
stores.
Other important terminology
Enterprise Data warehouse: It collects all information about subjects (customers, products, sales,
assets, personnel) that span the entire organization.
Data Mart: Departmental subsets that focus on selected subjects. A data mart is a segment of a data
warehouse that can provide data for reporting and analysis on a section, unit, department or operation in
the company, e.g. sales, payroll, production. Data marts are sometimes complete individual data
warehouses which are usually smaller than the corporate data warehouse.
Decision Support System (DSS): Information technology to help the knowledge worker (executive,
manager, and analyst) make faster and better decisions.
Drill-down: Traversing the summarization levels from highly summarized data to the underlying
current or old detail.
Metadata: Data about data. It contains the location and description of warehouse system components:
names, definitions, structure, etc.
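The drill-down idea above can be sketched in plain Python: the same detail records are viewed first at a highly summarized level, then at the underlying detail. The region/city names and figures are invented for illustration.

```python
from collections import defaultdict

# Hypothetical detail-level sales records: (region, city, amount).
sales = [
    ("North", "Oslo", 100), ("North", "Bergen", 50),
    ("South", "Rome", 80), ("South", "Naples", 70),
]

def summarize(records, level):
    """Aggregate amounts at a given summarization level (key-prefix length)."""
    totals = defaultdict(int)
    for *keys, amount in records:
        totals[tuple(keys[:level])] += amount
    return dict(totals)

# Highly summarized view (by region only) ...
by_region = summarize(sales, 1)
# ... drilled down to the underlying detail (by region and city).
by_city = summarize(sales, 2)
```

Each drill-down step simply lengthens the grouping key, exposing more of the stored detail.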
Benefits of data warehousing
• Data warehouses are designed to perform well with aggregate queries running on large
amounts of data.
• The structure of data warehouses is easier for end users to navigate, understand and query
against, unlike the relational databases primarily designed to handle lots of transactions.
• Data warehouses enable queries that cut across different segments of a company's operation.
E.g. production data could be compared against inventory data even if they were originally
stored in different databases with different structures.
• Queries that would be complex in very normalized databases can be easier to build and
maintain in data warehouses, decreasing the workload on transaction systems.
• Data warehousing is an efficient way to manage and report on data that comes from a variety of
sources, is non-uniform, and is scattered throughout a company.
• Data warehousing is an efficient way to manage demand for lots of information from lots of
users.
• Data warehousing provides the capability to analyze large amounts of historical data for
nuggets of wisdom that can provide an organization with a competitive advantage.
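The aggregate queries mentioned in the first point can be illustrated with Python's built-in sqlite3 module. The table name, columns and figures are illustrative only, not taken from the text.

```python
import sqlite3

# A tiny warehouse-style table of hypothetical sales facts.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (product TEXT, region TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                [("widget", "east", 10.0), ("widget", "west", 20.0),
                 ("gadget", "east", 5.0)])

# An aggregate query of the kind warehouses are optimized for:
# total sales per product across all regions.
rows = con.execute(
    "SELECT product, SUM(amount) FROM sales GROUP BY product ORDER BY product"
).fetchall()
```

In a real warehouse the same GROUP BY pattern runs over millions of historical rows rather than three.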
Operational and informational Data
Operational Data:
• Focusing on transactional functions such as bank card withdrawals and deposits
• Detailed
• Updateable
• Reflects current data
Informational Data:
• Focusing on providing answers to problems posed by decision makers
• Summarized
• Non-updateable
Data Warehouse Characteristics
A data warehouse can be viewed as an information system with the following attributes:
• It is a database designed for analytical tasks
• Its content is periodically updated
• It contains current and historical data to provide a historical perspective of information
Operational data store (ODS)
• ODS is an architecture concept to support day-to-day operational decision support and contains
current-value data propagated from operational applications
• ODS is subject-oriented, similar to the classic definition of a data warehouse
• ODS is integrated

ODS                       DATA WAREHOUSE
Volatile                  Non-volatile
Very current data         Current and historical data
Detailed data             Pre-calculated summaries
1. Data warehouse Architecture and its seven components
1. Data sourcing, cleanup, transformation, and migration tools
2. Metadata repository
3. Warehouse/database technology
4. Data marts
5. Data query, reporting, analysis, and mining tools
6. Data warehouse administration and management
7. Information delivery system

A data warehouse is an environment, not a product. It is based on a relational database
management system that functions as the central repository for informational data.
The central repository of information is surrounded by a number of key components designed to make
the environment functional, manageable and accessible.
The data for the data warehouse comes from operational applications. The data entered into
the data warehouse is transformed into an integrated structure and format. The transformation process
involves conversion, summarization, filtering and condensation. The data warehouse must be capable of
holding and managing large volumes of data as well as different data structures over time.
1. Data warehouse database
This is the central part of the data warehousing environment (item number 2 in the
architecture diagram above). It is implemented based on RDBMS technology.
2. Sourcing, Acquisition, Clean up, and Transformation Tools
These are item number 1 in the architecture diagram above. They perform conversions, summarization,
key changes, structural changes and condensation. The data transformation is required so that the
information can be used by decision support tools. The transformation produces programs, control statements, JCL
code, COBOL code, UNIX scripts, and SQL DDL code, etc., to move the data into the data warehouse from
multiple operational systems.
The functionalities of these tools are listed below:
• Remove unwanted data from the operational db
• Convert to common data names and attributes
• Calculate summaries and derived data
• Establish defaults for missing data
• Accommodate source data definition changes
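A minimal sketch of three of the clean-up rules above (common names, defaults for missing data, dropping unwanted values). The field names, name map and default values are assumptions invented for illustration, not part of any real tool.

```python
# Hypothetical mapping from source attribute names to common warehouse names.
NAME_MAP = {"cust_nm": "customer_name", "CUSTNAME": "customer_name"}
# Hypothetical defaults established for missing data.
DEFAULTS = {"customer_name": "UNKNOWN", "amount": 0.0}

def transform(record):
    """Convert to common attribute names, drop null values, apply defaults."""
    out = {NAME_MAP.get(k, k): v for k, v in record.items() if v is not None}
    for field, default in DEFAULTS.items():
        out.setdefault(field, default)
    return out
```

A real transformation tool would generate such logic (as COBOL, scripts or SQL) rather than hand-code it, but the shape of the rules is the same.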
Issues to be considered during data sourcing, cleanup, extract and transformation:
Database heterogeneity: It refers to the differing nature of the DBMSs involved: they may use
different data models, different access languages, different data navigation methods, operations,
concurrency, integrity and recovery processes, etc.
Data heterogeneity: It refers to the different ways the data is defined and used in different modules.
Some vendors involved in the development of such tools:
Prism Solutions, Evolutionary Technology Inc., Vality, Praxis and Carleton
3. Meta data
It is data about data. It is used for maintaining, managing and using the data warehouse. It is
classified into two:
Technical Meta data: It contains information about warehouse data used by the warehouse designer and
administrator to carry out development and management tasks. It includes:
• Info about data stores
• Transformation descriptions, that is, mapping methods from operational db to warehouse db
• Warehouse object and data structure definitions for target data
• The rules used to perform clean up and data enhancement
• Data mapping operations
• Access authorization, backup history, archive history, info delivery history, data acquisition
history, data access, etc.
Business Meta data: It contains info that describes the data stored in the data warehouse for users. It includes:
• Subject areas and info object types including queries, reports, images, video, audio clips, etc.
• Internet home pages
• Info related to the info delivery system
• Data warehouse operational info such as ownerships, audit trails, etc.
Meta data helps the users to understand content and find data. Meta data is stored in a
separate data store known as the informational directory or Meta data repository, which helps to
integrate, maintain and view the contents of the data warehouse. The following lists the characteristics of
the info directory / Meta data:
• It is the gateway to the data warehouse environment
• It supports easy distribution and replication of content for high performance and availability
• It should be searchable by business-oriented key words
• It should act as a launch platform for end users to access data and analysis tools
• It should support the sharing of info
• It should support scheduling options for requests
• It should support and provide interfaces to other applications
• It should support end user monitoring of the status of the data warehouse environment
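The "searchable by business-oriented key words" requirement can be sketched as a tiny catalog search. The catalog entries below are invented for illustration; a real repository would hold many more attributes (structure, lineage, access rights).

```python
# A toy metadata repository: each entry describes one warehouse object.
catalog = [
    {"name": "sales_fact", "description": "daily sales by product and store"},
    {"name": "customer_dim", "description": "customer names and addresses"},
]

def search(keyword):
    """Return the names of entries whose description mentions the keyword."""
    kw = keyword.lower()
    return [e["name"] for e in catalog if kw in e["description"].lower()]
```

Business users query the directory with terms like "sales" or "customer" rather than needing to know physical table names in advance.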
4. Access tools
Their purpose is to provide info to business users for decision making. There are five categories:
• Data query and reporting tools
• Application development tools
• Executive info system tools (EIS)
• OLAP tools
• Data mining tools
Query and reporting tools are used to generate queries and reports. There are two types of reporting tools.
They are:
• Production reporting tools, used to generate regular operational reports
• Desktop report writers, inexpensive desktop tools designed for end users
Managed Query tools: used to generate SQL queries. They use a Meta layer software in between users
and databases which offers point-and-click creation of SQL statements. This tool is a preferred choice of
users to perform segment identification, demographic analysis, territory management and preparation of
customer mailing lists, etc.
Application development tools: This is a graphical data access environment which integrates
OLAP tools with the data warehouse and can be used to access all db systems.
OLAP Tools: are used to analyze the data in multidimensional and complex views. To enable
multidimensional properties they use MDDB and MRDB, where MDDB refers to multidimensional
databases and MRDB refers to multirelational databases.
Data mining tools: are used to discover knowledge from the data warehouse data; they can also be used
for data visualization and data correction purposes.
5. Data marts
Departmental subsets that focus on selected subjects. They are independent and used by a
dedicated user group. They are used for rapid delivery of enhanced decision support functionality
to end users. A data mart is used in the following situations:
• Extremely urgent user requirements
• The absence of a budget for a full-scale data warehouse strategy
• The decentralization of business needs
• The attraction of easy-to-use tools and mind-sized projects
Data marts present two problems:
1. Scalability: A small data mart can grow quickly in multiple dimensions, so while
designing it, the organization has to pay attention to system scalability, consistency
and manageability issues.
2. Data integration
6. Data warehouse admin and management
The management of the data warehouse includes:
• Security and priority management
• Monitoring updates from multiple sources
• Data quality checks
• Managing and updating meta data
• Auditing and reporting data warehouse usage and status
• Purging data
• Replicating, subsetting and distributing data
• Backup and recovery
• Data warehouse storage management, which includes capacity planning, hierarchical storage
management, purging of aged data, etc.
7. Information delivery system
• It is used to enable the process of subscribing for data warehouse info.
• Delivery to one or more destinations according to a specified scheduling algorithm.
2. Building a Data warehouse
There are two reasons why organizations consider data warehousing a critical need. In
other words, there are two factors that drive you to build and use a data warehouse. They are:
Business factors:
• Business users want to make decisions quickly and correctly using all available data.
Technological factors:
• To address the incompatibility of operational data stores
• IT infrastructure is changing rapidly. Its capacity is increasing and its cost is decreasing, so that
building a data warehouse is easy
There are several things to be considered while building a successful data warehouse.
Business considerations:
Organizations interested in the development of a data warehouse can choose one of the following
two approaches:
1. Top-Down Approach (suggested by Bill Inmon)
2. Bottom-Up Approach (suggested by Ralph Kimball)
1. Top-Down Approach
In the top-down approach suggested by Bill Inmon, we build a centralized repository to house
corporate-wide business data. This repository is called the Enterprise Data Warehouse (EDW). The data in
the EDW is stored in a normalized form in order to avoid redundancy.
The central repository for corporate-wide data helps us maintain one version of the truth of the
data. The data in the EDW is stored at the most detailed level. The reason to build the EDW at the most
detailed level is to leverage
1. Flexibility to be used by multiple departments.
2. Flexibility to cater for future requirements.
The disadvantages of storing data at the detail level are
1. The complexity of design increases with increasing level of detail.
2. It takes a large amount of space to store data at the detail level, hence increased cost.
Once the EDW is implemented we start building subject-area-specific data marts which contain
data in a denormalized form, also called a star schema. The data in the marts is usually summarized based
on the end users' analytical requirements. The reason to denormalize the data in the mart is to provide
faster access to the data for the end users' analytics. If we were to query a normalized schema for the
same analytics, we would end up with complex multi-level joins that would be much slower compared
to the same query on the denormalized schema.
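The single-join star-schema query described above can be sketched with the built-in sqlite3 module. The table names, keys and amounts are invented for illustration.

```python
import sqlite3

# A minimal star schema: one fact table keyed to one dimension table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE product_dim (product_key INTEGER, name TEXT)")
con.execute("CREATE TABLE sales_fact (product_key INTEGER, amount REAL)")
con.execute("INSERT INTO product_dim VALUES (1, 'widget'), (2, 'gadget')")
con.executemany("INSERT INTO sales_fact VALUES (?, ?)",
                [(1, 10.0), (1, 15.0), (2, 7.0)])

# The typical mart query: a single join from fact to dimension,
# instead of the multi-level joins a fully normalized schema would need.
rows = con.execute("""
    SELECT d.name, SUM(f.amount)
    FROM sales_fact f JOIN product_dim d ON f.product_key = d.product_key
    GROUP BY d.name ORDER BY d.name
""").fetchall()
```

In a normalized EDW the product attributes might be spread over several related tables, turning this one join into a chain of joins.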
We should implement the top-down approach when
1. The business has complete clarity on all or multiple subject areas' data warehouse requirements.
2. The business is ready to invest considerable time and money.
The advantage of using the top-down approach is that we build a centralized repository to cater
for one version of the truth for business data. This is very important for the data to be reliable and consistent
across subject areas, and for reconciliation in case of data-related contention between subject areas.
The disadvantage of using the top-down approach is that it requires more time and initial
investment. The business has to wait for the EDW to be implemented, followed by building the data marts,
before they can access their reports.
2. Bottom-Up Approach
The bottom-up approach suggested by Ralph Kimball is an incremental approach to building a data
warehouse. Here we build the data marts separately at different points of time, as and when the specific
subject area requirements are clear. The data marts are integrated or combined together to form a data
warehouse. Separate data marts are combined through the use of conformed dimensions and conformed
facts. A conformed dimension or a conformed fact is one that can be shared across data marts.
A conformed dimension has consistent dimension keys, consistent attribute names and consistent
values across separate data marts. A conformed dimension means the exact same thing with every fact table
it is joined to. A conformed fact has the same definition of measures, the same dimensions joined to it, and
the same granularity across data marts.
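The conformance condition above (same keys carrying identical attribute values in every mart) can be sketched as a simple check. The two mart dimension tables are invented for illustration.

```python
# Hypothetical copies of a product dimension held by two different data marts,
# keyed by dimension key.
sales_mart_dim = {1: {"name": "widget"}, 2: {"name": "gadget"}}
inventory_mart_dim = {1: {"name": "widget"}, 2: {"name": "gadget"}}

def is_conformed(dim_a, dim_b):
    """True if every dimension key shared by both marts carries
    identical attribute values, i.e. the dimension is conformed."""
    shared = dim_a.keys() & dim_b.keys()
    return all(dim_a[k] == dim_b[k] for k in shared)
```

Only when such checks pass for the shared dimensions can the separate marts be combined into one coherent warehouse.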
The bottom-up approach helps us incrementally build the warehouse by developing and integrating
data marts as and when the requirements are clear. We don't have to wait to know the overall
requirements of the warehouse. We should implement the bottom-up approach when
1. We have initial cost and time constraints.
2. The complete warehouse requirements are not clear. We have clarity on only one data mart.
The advantage of using the bottom-up approach is that it does not require high initial costs and
has a faster implementation time; hence the business can start using the marts much earlier compared
to the top-down approach.
The disadvantage of using the bottom-up approach is that it stores data in a denormalized
format, hence there will be high space usage for detailed data. We have a tendency of not keeping
detailed data in this approach, hence losing out on the advantage of having detailed data (i.e. flexibility to
easily cater to future requirements). The bottom-up approach is more realistic, but the complexity of the
integration may become a serious obstacle.
DESIGN CONSIDERATIONS
To be successful, a data warehouse designer must adopt a holistic approach: consider all
data warehouse components as parts of a single complex system, and take into account all possible data
sources and all known usage requirements.
Most successful data warehouses that meet these requirements have these common characteristics:
• Are based on a dimensional model
• Contain historical and current data
• Include both detailed and summarized data
• Consolidate disparate data from multiple sources while retaining consistency
A data warehouse is difficult to build due to the following reasons:
• Heterogeneity of data sources
• Use of historical data
• Growing nature of the database
The data warehouse design approach must be a business-driven, continuous and iterative engineering
approach. In addition to the general considerations, the following specific points are relevant to data
warehouse design:
Data content
The content and structure of the data warehouse are reflected in its data model. The data model is
the template that describes how information will be organized within the integrated warehouse framework.
The data warehouse data must be detailed data. It must be formatted, cleaned up and transformed to fit
the warehouse data model.
Meta data
It defines the location and contents of data in the warehouse. Meta data is searchable by users to
find definitions or subject areas. In other words, it must provide decision-support-oriented pointers to
warehouse data, and thus provides a logical link between warehouse data and decision support applications.
Data distribution
One of the biggest challenges when designing a data warehouse is the data placement and
distribution strategy. Data volumes continue to grow. Therefore, it becomes necessary to know
how the data should be divided across multiple servers and which users should get access to which types of
data. The data can be distributed based on the subject area, location (geographical region), or time (current,
month, year).
Tools
A number of tools are available that are specifically designed to help in the
implementation of the data warehouse. All selected tools must be compatible with the given data
warehouse environment and with each other. All tools must be able to use a common Meta data
repository.
Design steps
The following nine-step method is followed in the design of a data warehouse:
1. Choosing the subject matter
2. Deciding what a fact table represents
3. Identifying and conforming the dimensions
4. Choosing the facts
5. Storing pre-calculations in the fact table
6. Rounding out the dimension table
7. Choosing the duration of the db
8. The need to track slowly changing dimensions
9. Deciding the query priorities and query models
TECHNICAL CONSIDERATIONS
A number of technical issues are to be considered when designing a data warehouse
environment. These issues include:
• The hardware platform that will house the data warehouse
• The DBMS that supports the warehouse data
• The communication infrastructure that connects data marts, operational systems and end
users
• The hardware and software to support the meta data repository
• The systems management framework that enables admin of the entire environment
IMPLEMENTATION CONSIDERATIONS
The following logical steps are needed to implement a data warehouse:
• Collect and analyze business requirements
• Create a data model and a physical design
• Define data sources
• Choose the db technology and platform
• Extract the data from operational db, transform it, clean it up and load it into the warehouse
• Choose db access and reporting tools
• Choose db connectivity software
• Choose data analysis and presentation s/w
• Update the data warehouse
Access tools
Data warehouse implementation relies on selecting suitable data access tools. The best way to choose
these is based on the type of data that can be selected using the tool and the kind of access it permits for a
particular user. The following lists the various types of data that can be accessed:
• Simple tabular form data
• Ranking data
• Multivariable data
• Time series data
• Graphing, charting and pivoting data
• Complex textual search data
• Statistical analysis data
• Data for testing of hypotheses, trends and patterns
• Predefined repeatable queries
• Ad hoc user-specified queries
• Reporting and analysis data
• Complex queries with multiple joins, multi-level subqueries and sophisticated search criteria
Data extraction, clean up, transformation and migration
Proper attention must be paid to data extraction, which represents a success factor for a data
warehouse architecture. When implementing a data warehouse, the following selection criteria that
affect the ability to transform, consolidate, integrate and repair the data should be considered:
• Timeliness of data delivery to the warehouse
• The tool must have the ability to identify the particular data so that it can be read by the conversion tool
• The tool must support flat files and indexed files, since much corporate data is still of this type
• The tool must have the capability to merge data from multiple data stores
• The tool should have a specification interface to indicate the data to be extracted
• The tool should have the ability to read data from a data dictionary
• The code generated by the tool should be completely maintainable
• The tool should permit the user to extract the required data
• The tool must have the facility to perform data type and character set translation
• The tool must have the capability to create summarization, aggregation and derivation of records
• The data warehouse database system must be able to load data directly from these tools
Data placement strategies
• As a data warehouse grows, there are at least two options for data placement. One is to put some of
the data in the data warehouse onto another storage medium.
• The second option is to distribute the data in the data warehouse across multiple servers.
User levels
The users of data warehouse data can be classified on the basis of their skill level in accessing the
warehouse. There are three classes of users:
Casual users: are most comfortable retrieving info from the warehouse in predefined formats and
running pre-existing queries and reports. These users do not need tools that allow for building standard and
ad hoc reports.
Power users: can use predefined as well as user-defined queries to create simple and ad hoc
reports. These users can engage in drill-down operations. These users may have experience using
reporting and query tools.
Expert users: These users tend to create their own complex queries and perform standard analysis
on the info they retrieve. These users have knowledge about the use of query and report tools.
Benefits of data warehousing
Data warehouse usage includes:
• Locating the right info
• Presentation of info
• Testing of hypotheses
• Discovery of info
• Sharing the analysis
The benefits can be classified into two:
Tangible benefits (quantified / measurable): include
• Improvement in product inventory
• Decrement in production cost
• Improvement in selection of target markets
• Enhancement in asset and liability management
Intangible benefits (not easy to quantify): include
• Improvement in productivity by keeping all data in a single location and eliminating rekeying of
data
• Reduced redundant processing
• Enhanced customer relations
3. Mapping the data warehouse architecture to Multiprocessor architecture
The functions of the data warehouse are based on relational database technology, which is
implemented in a parallel manner. There are two advantages of having parallel relational database
technology for a data warehouse:
Linear Speed up: refers to the ability to increase the number of processors to reduce response time.
Linear Scale up: refers to the ability to provide the same performance on the same requests as the
database size increases.
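The two metrics above can be made concrete with a small calculation. The timing figures are invented for illustration.

```python
def speed_up(time_serial, time_parallel):
    """Speed-up factor: linear speed-up means N processors give a factor of N."""
    return time_serial / time_parallel

def scale_up(perf_small, perf_large):
    """Scale-up ratio: linear scale-up means performance stays flat (ratio 1.0)
    as database size and hardware grow together."""
    return perf_large / perf_small

# E.g. 4 processors turning a 100 s query into a 25 s query is linear speed-up;
# the same query rate on a database 4x the size (with 4x the hardware) is
# linear scale-up.
four_way = speed_up(100.0, 25.0)
flat = scale_up(1.0, 1.0)
```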
Types of parallelism
There are two types of parallelism:
Inter-query Parallelism: in which different server threads or processes handle multiple requests at
the same time.
Intra-query Parallelism: This form of parallelism decomposes the serial SQL query into lower-level
operations such as scan, join, sort, etc. Then these lower-level operations are executed concurrently, in
parallel.
Intra-query parallelism can be done in either of two ways:
Horizontal parallelism: which means that the database is partitioned across multiple disks, and
parallel processing occurs within a specific task that is performed concurrently on different processors
against different sets of data.
Vertical parallelism: This occurs among different tasks. All query components such as scan, join,
sort, etc. are executed in parallel in a pipelined fashion. In other words, an output from one task becomes an
input into another task.
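The pipelined dataflow of vertical parallelism can be sketched with Python generators: each stage consumes the previous stage's output as it is produced. This models only the dataflow; a real DBMS runs the stages as concurrent tasks. The table and query are invented for illustration.

```python
# Stage 1: scan produces rows one at a time.
def scan(table):
    for row in table:
        yield row

# Stage 2: filter consumes scanned rows as they arrive.
def filter_stage(rows, predicate):
    for row in rows:
        if predicate(row):
            yield row

# Stage 3: aggregate consumes filtered rows as they arrive.
def aggregate(rows):
    return sum(r["amount"] for r in rows)

table = [{"region": "east", "amount": 10}, {"region": "west", "amount": 20},
         {"region": "east", "amount": 5}]
# scan -> filter -> aggregate, each stage feeding the next.
total_east = aggregate(filter_stage(scan(table), lambda r: r["region"] == "east"))
```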
Data partitioning:
Data partitioning is the key component for effective parallel execution of database operations.
Partitioning can be done randomly or intelligently.
Random partitioning includes random data striping across multiple disks on a single server. Another
option for random partitioning is round-robin partitioning, in which each record is placed on the next
disk assigned to the database.
Intelligent partitioning assumes that the DBMS knows where a specific record is located and does not
waste time searching for it across all disks. The various intelligent partitioning schemes include:
Hash partitioning: A hash algorithm is used to calculate the partition number based on the value of
the partitioning key for each row.
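A minimal sketch of hash partitioning: the partition number is computed from the partitioning key, so the DBMS knows exactly which partition holds a row without searching. The key values are invented for illustration.

```python
def hash_partition(key, num_partitions):
    """Map a partitioning-key value to a partition number via a hash."""
    return hash(key) % num_partitions

# Distribute 10 hypothetical rows across 4 partitions (e.g. 4 disks).
rows = [{"cust_id": i} for i in range(10)]
partitions = [[] for _ in range(4)]
for row in rows:
    partitions[hash_partition(row["cust_id"], 4)].append(row)
```

To locate a row later, the DBMS recomputes the same hash on the key instead of scanning all partitions.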
Key range partitioning: Rows are placed and located in the partitions according to the value of the
partitioning key. That is, all the rows with key values from A to K are in partition 1, L to T are in
partition 2, and so on.
Schema partitioning: an entire table is placed on one disk; another table is placed on a different disk,
etc. This is useful for small reference tables.
User-defined partitioning: It allows a table to be partitioned on the basis of a user-defined
expression.
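Key range partitioning as described above can be sketched with a sorted list of range boundaries (here zero-based partition numbers; the boundary letters are illustrative only).

```python
import bisect

# Upper bounds (exclusive) of partitions 0 and 1: keys below "L" go to
# partition 0 (the A-K range), keys from "L" up to "U" go to partition 1
# (the L-T range), and everything above goes to the last partition.
BOUNDARIES = ["L", "U"]

def range_partition(key):
    """Locate the partition whose key range contains the given key."""
    return bisect.bisect_right(BOUNDARIES, key)
```

So "Baker" lands in the first range, "Smith" in the second, and "Young" in the last; lookup by key is a binary search over the boundaries rather than a scan of all partitions.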
Database architectures for parallel processing
There are three DBMS software architecture styles for parallel processing:
1. Shared memory or shared everything architecture
2. Shared disk architecture
3. Shared nothing architecture
Shared Memory Architecture
Tightly coupled shared memory systems, illustrated in the following figure, have the following
characteristics:
• Multiple CPUs share memory.
• Each CPU has full access to all shared memory through a common bus.
• Communication between nodes occurs via shared memory.
• Performance is limited by the bandwidth of the memory bus.
Symmetric multiprocessor (SMP) machines are often nodes in a cluster. Multiple SMP nodes can be
used with Oracle Parallel Server in a tightly coupled system, where memory is shared among the multiple
CPUs, and is accessible by all the CPUs through a memory bus. Examples of tightly coupled systems include
the Pyramid, Sequent, and Sun SparcServer.
Performance is potentially limited in a tightly coupled system by a number of factors. These include
various system components such as the memory bandwidth, CPU-to-CPU communication bandwidth, the
memory available on the system, the I/O bandwidth, and the bandwidth of the common bus.
Parallel processing advantages of shared memory systems are these:
• Memory access is cheaper than inter-node communication. This means that internal
synchronization is faster than using the Lock Manager.
• Shared memory systems are easier to administer than a cluster.
A disadvantage of shared memory systems for parallel processing is as follows:
• Scalability is limited by bus bandwidth and latency, and by available memory.
Shared Disk Architecture
Shared disk systems are typically loosely coupled. Such systems, illustrated in the following figure, have
the following characteristics:
• Each node consists of one or more CPUs and associated memory.
• Memory is not shared between nodes.
• Communication occurs over a common high-speed bus.
• Each node has access to the same disks and other resources.
• A node can be an SMP if the hardware supports it.
• Bandwidth of the high-speed bus limits the number of nodes (scalability) of the system.
The cluster illustrated in the figure is composed of multiple tightly coupled nodes. The Distributed Lock
Manager (DLM) is required. Examples of loosely coupled systems are VAXclusters or Sun clusters.
Since the memory is not shared among the nodes, each node has its own data cache. Cache consistency
must be maintained across the nodes, and a lock manager is needed to maintain the consistency.
Additionally, instance locks using the DLM on the Oracle level must be maintained to ensure that all nodes
in the cluster see identical data.
There is additional overhead in maintaining the locks and ensuring that the data caches are consistent.
The performance impact is dependent on the hardware and software components, such as the bandwidth of
the high-speed bus through which the nodes communicate, and DLM performance.
Parallel processing advantages of shared disk systems are as follows:
• Shared disk systems permit high availability. All data is accessible even if one node dies.
• These systems have the concept of one database, which is an advantage over shared nothing
systems.
• Shared disk systems provide for incremental growth.
Parallel processing disadvantages of shared disk systems are these:
• Inter-node synchronization is required, involving DLM overhead and greater dependency on the
high-speed interconnect.
• If the workload is not partitioned well, there may be high synchronization overhead.
• There is operating system overhead in running shared disk software.
Shared Nothing Architecture
Shared nothing systems are typically loosely coupled. In shared nothing systems only one CPU is
connected to a given disk. If a table or database is located on that disk, access depends entirely on the CPU
which owns it. Shared nothing systems can be represented as follows:
Shared nothing systems are concerned with access to disks, not access to memory. Nonetheless,
adding more CPUs and disks can improve scale up. Oracle Parallel Server can access the disks on a shared
nothing system as long as the operating system provides transparent disk access, but this access is
expensive in terms of latency.
Shared nothing systems have advantages and disadvantages for parallel processing:
Advantages
Shared nothing systems provide for incremental growth.
System growth is practically unlimited.
MPPs are good for read-only databases and decision support applications.
Failure is local: if one node fails, the others stay up.
Disadvantages
More coordination is required.
More overhead is required for a process working on a disk belonging to another node.
If there is a heavy workload of updates or inserts, as in an online transaction processing system, it
may be worthwhile to consider data-dependent routing to alleviate contention.
Parallel DBMS features
Scope and techniques of parallel DBMS operations
Optimizer implementation
Application transparency
Parallel environment which allows the DBMS server to take full advantage of the existing facilities
on a very low level
DBMS management tools help to configure, tune, administer and monitor a parallel RDBMS as
effectively as if it were a serial RDBMS
Price/Performance: the parallel RDBMS can demonstrate a non-linear speed-up and scale-up at
reasonable costs.
Parallel DBMS vendors
Oracle: Parallel Query Option (PQO)
Architecture: shared disk architecture
Data partition: key range, hash, round robin
Parallel operations: hash joins, scan and sort
Informix: eXtended Parallel Server (XPS)
Architecture: shared memory, shared disk and shared nothing models
Data partition: round robin, hash, schema, key range and user defined
Parallel operations: INSERT, UPDATE, DELETE
IBM: DB2 Parallel Edition (DB2 PE)
Architecture: shared nothing models
Data partition: hash
Parallel operations: INSERT, UPDATE, DELETE, load, recovery, index creation, backup, table
reorganization
SYBASE: SYBASE MPP
Architecture: shared nothing models
Data partition: hash, key range, schema
Parallel operations: horizontal and vertical parallelism
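The partitioning schemes these vendors support (hash, round robin, key range) can be sketched in a few lines. The routing functions below are a minimal illustration, not taken from any vendor's implementation; the customer keys and range boundaries are invented.

```python
import hashlib

def hash_partition(key, n_nodes):
    """Hash partitioning: the same key always maps to the same node.
    An md5-based hash is used here only to keep the result deterministic."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % n_nodes

def round_robin_partition(row_index, n_nodes):
    """Round-robin partitioning: rows are spread evenly in arrival order."""
    return row_index % n_nodes

def key_range_partition(key, boundaries):
    """Key-range partitioning: node i holds keys below boundaries[i];
    the last node takes everything from the final boundary upward."""
    for node, upper in enumerate(boundaries):
        if key < upper:
            return node
    return len(boundaries)

keys = ["cust_001", "cust_350", "cust_999"]
for i, k in enumerate(keys):
    print(k,
          hash_partition(k, 4),
          round_robin_partition(i, 4),
          key_range_partition(k, ["cust_300", "cust_700"]))
```

Note the trade-off the section describes: hash placement supports data-dependent routing (the same key always goes to the same node), while round robin balances load but gives no locality.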
4. DBMS schemas for decision support
The basic concepts of dimensional modeling are: facts, dimensions and measures. A fact is a
collection of related data items, consisting of measures and context data. It typically represents business
items or business transactions. A dimension is a collection of data that describes one business dimension.
Dimensions determine the contextual background for the facts; they are the parameters over which we
want to perform OLAP. A measure is a numeric attribute of a fact, representing the performance or
behavior of the business relative to the dimensions.
In the relational context, there are three basic schemas that are used in dimensional
modeling:
1. Star schema
2. Snowflake schema
3. Fact constellation schema
Star schema
The multidimensional view of data that is expressed using relational database semantics is
provided by the database schema design called the star schema. The basic idea of the star schema is that
information can be classified into two groups:
Facts
Dimensions
A star schema has one large central table (the fact table) and a set of smaller tables (the dimensions)
arranged in a radial pattern around the central table.
Facts are the core data elements being analyzed, while dimensions are attributes about the facts.
The determination of which schema model should be used for a data warehouse should be based
upon the analysis of project requirements, accessible tools and project team preferences.
What is a star schema? The star schema architecture is the simplest data warehouse schema. It is
called a star schema because the diagram resembles a star, with points radiating from a center. The center
of the star consists of the fact table and the points of the star are the dimension tables. Usually the fact tables in
a star schema are in third normal form (3NF) whereas dimension tables are de-normalized. Despite the
fact that the star schema is the simplest architecture, it is most commonly used nowadays and is
recommended by Oracle.
Fact Tables
A fact table is a table that contains summarized numerical and historical data (facts) and a
multipart index composed of foreign keys from the primary keys of related dimension tables. A fact table
typically has two types of columns: foreign keys to dimension tables, and measures, which contain the
numeric facts. A fact table can contain facts at a detail or aggregated level.
Dimension Tables
Dimensions are categories by which summarized data can be viewed. E.g. a profit
summary in a fact table can be viewed by a Time dimension (profit by month, quarter, year), a
Region dimension (profit by country, state, city), or a Product dimension (profit for product1,
product2).
A dimension is a structure usually composed of one or more hierarchies that categorize data. If a
dimension does not have hierarchies and levels it is called a flat dimension or list. The primary keys of each of
the dimension tables are part of the composite primary key of the fact table. Dimensional attributes help to
describe the dimensional value. They are normally descriptive, textual values. Dimension tables are
generally smaller in size than fact tables.
Typical fact tables store data about sales, while dimension tables store data about geographic regions
(markets, cities), clients, products, times, channels.
Measures
Measures are numeric data based on columns in a fact table. They are the primary data in which
end users are interested. E.g. a sales fact table may contain a profit measure which represents the profit on
each sale.
Aggregations are pre-calculated numeric data. By calculating and storing the answer to a query before
users ask for it, the query processing time can be reduced. This is key in providing fast query performance
in OLAP.
Cubes are data processing units composed of fact tables and dimensions from the data
warehouse. They provide multidimensional views of data, and querying and analytical capabilities to clients.
The main characteristics of a star schema:
Simple structure -> easy-to-understand schema
Great query effectiveness -> small number of tables to join
Relatively long time of loading data into dimension tables -> de-normalization and redundant
data mean the tables can grow large.
The most commonly used in data warehouse implementations -> widely supported by a
large number of business intelligence tools
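A minimal star schema can be sketched with SQLite: one central fact table holding measures and foreign keys, surrounded by de-normalized dimension tables. All table names, column names and data below are invented for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_time    (time_id INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE fact_sales (
    time_id    INTEGER REFERENCES dim_time(time_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    units_sold INTEGER,               -- measure
    profit     REAL,                  -- measure
    PRIMARY KEY (time_id, product_id) -- composite key built from dimension keys
);
""")
con.executemany("INSERT INTO dim_time VALUES (?,?,?,?)",
                [(1, "2024-01-05", "Jan", 2024), (2, "2024-02-10", "Feb", 2024)])
con.executemany("INSERT INTO dim_product VALUES (?,?,?)",
                [(10, "Laptop", "Electronics"), (11, "Desk", "Furniture")])
con.executemany("INSERT INTO fact_sales VALUES (?,?,?,?)",
                [(1, 10, 3, 900.0), (1, 11, 1, 120.0), (2, 10, 2, 600.0)])

# A typical star join: profit (a measure) viewed by the Time dimension,
# one join per dimension used.
for row in con.execute("""
    SELECT t.month, SUM(f.profit)
    FROM fact_sales f JOIN dim_time t ON f.time_id = t.time_id
    GROUP BY t.month ORDER BY t.month"""):
    print(row)
```

The small number of joins in the query is exactly the "great query effectiveness" characteristic listed above.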
Snowflake schema
The snowflake schema is an extension of the star schema, where each point of the star explodes
into more points. In a star schema, each dimension is represented by a single dimensional table, whereas in
a snowflake schema, that dimensional table is normalized into multiple lookup tables, each representing a
level in the dimensional hierarchy.
For example, consider a Time dimension that consists of 2 different hierarchies:
1. Year -> Month -> Day
2. Week -> Day
We will have 4 lookup tables in a snowflake schema: a lookup table for year, a lookup table for
month, a lookup table for week, and a lookup table for day. Year is connected to Month, which is then
connected to Day. Week is only connected to Day.
The main advantage of the snowflake schema is the improvement in query performance due to
minimized disk storage requirements and joining smaller lookup tables.
The main disadvantage of the snowflake schema is the additional maintenance effort needed due
to the increased number of lookup tables.
It is the result of decomposing one or more of the dimensions. The many-to-one relationships
among sets of attributes of a dimension can separate new dimension tables, forming a hierarchy. The
decomposed snowflake structure visualizes the hierarchical structure of dimensions very well.
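The Time example above (four lookup tables, with Year connected to Month, Month to Day, and Week to Day) can be sketched in SQLite as follows; table names and data are illustrative.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE lu_year  (year_id INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE lu_month (month_id INTEGER PRIMARY KEY, month TEXT,
                       year_id INTEGER REFERENCES lu_year(year_id));
CREATE TABLE lu_week  (week_id INTEGER PRIMARY KEY, week_no INTEGER);
-- each day row points at its month (Year -> Month -> Day hierarchy)
-- and at its week (Week -> Day hierarchy)
CREATE TABLE lu_day   (day_id INTEGER PRIMARY KEY, day TEXT,
                       month_id INTEGER REFERENCES lu_month(month_id),
                       week_id  INTEGER REFERENCES lu_week(week_id));
""")
con.execute("INSERT INTO lu_year VALUES (1, 2024)")
con.execute("INSERT INTO lu_month VALUES (1, 'Jan', 1)")
con.execute("INSERT INTO lu_week VALUES (1, 2)")
con.execute("INSERT INTO lu_day VALUES (1, '2024-01-08', 1, 1)")

# Resolving one day up both hierarchies now takes extra joins --
# the maintenance cost the section mentions, traded for normalization.
row = con.execute("""
    SELECT d.day, m.month, y.year, w.week_no
    FROM lu_day d
    JOIN lu_month m ON d.month_id = m.month_id
    JOIN lu_year  y ON m.year_id  = y.year_id
    JOIN lu_week  w ON d.week_id  = w.week_id""").fetchone()
print(row)  # ('2024-01-08', 'Jan', 2024, 2)
```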
Fact constellation schema: For each star schema it is possible to construct a fact constellation
schema (for example by splitting the original star schema into more star schemas, each of them describing
facts on another level of the dimension hierarchies). The fact constellation architecture contains multiple fact
tables that share many dimension tables.
The main shortcoming of the fact constellation schema is a more complicated design, because
many variants for particular kinds of aggregation must be considered and selected. Moreover, dimension
tables are still large.
5. Data Extraction, Cleanup, and Transformation Tools
ETL stands for Extract, Transform, Load; it is the data warehouse acquisition process that involves:
Extracting the data from outside sources,
Transforming the data to fit business needs, and ultimately
Loading the transformed data into the data warehouse.
For example:
1. Informatica.
2. DataStage.
3. Oracle Warehouse Builder.
4. Ab Initio.
ETL can also be used for integration with legacy systems. ETL is the data warehouse
acquisition process of Extracting, Transforming and Loading data from source systems into the data
warehouse.
Extraction
Extraction is the operation of extracting data from a source system for further use in a data
warehouse environment. This is the first step of the ETL process. After the extraction, this data can be
transformed and loaded into the data warehouse.
Introduction to Extraction Methods in Data Warehouses
The extraction method you should choose is highly dependent on the source system and also on
the business needs in the target data warehouse environment. Very often, there is no possibility to add
additional logic to the source systems to enable an incremental extraction of data, due to the performance
impact or the increased workload of these systems. Sometimes even the customer is not allowed to add anything
to an out-of-the-box application system.
You have to decide how to extract data logically and physically.
Logical Extraction: There are two kinds of logical extraction:
1. Full extraction, 2. Incremental extraction
Physical Extraction: There are two kinds of physical extraction:
1. Online extraction 2. Offline extraction
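A toy sketch of the three ETL steps, including the full vs. incremental extraction distinction above. The source rows, field names, and timestamp-based cutoff are all invented for illustration; real tools such as Informatica or DataStage do far more.

```python
# Hypothetical source rows as a change-tracked list of dicts.
source = [
    {"id": 1, "name": " alice ", "amount": "100", "updated": "2024-01-01"},
    {"id": 2, "name": "BOB",     "amount": "250", "updated": "2024-03-01"},
]

def extract(rows, since=None):
    """Full extraction when since is None; incremental extraction
    otherwise (only rows changed after the cutoff)."""
    return [r for r in rows if since is None or r["updated"] > since]

def transform(rows):
    """Cleanup/transformation: trim and normalize names, cast amounts."""
    return [{"id": r["id"],
             "name": r["name"].strip().title(),
             "amount": float(r["amount"])} for r in rows]

def load(rows, warehouse):
    """Load: upsert into the warehouse, keyed by the business id."""
    for r in rows:
        warehouse[r["id"]] = r

warehouse = {}
load(transform(extract(source)), warehouse)                      # full load
load(transform(extract(source, since="2024-02-01")), warehouse)  # incremental
print(warehouse[1]["name"], warehouse[2]["amount"])  # Alice 250.0
```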
Transformation tools
Their purpose is to provide information to business users for decision making. There are five categories:
Data query and reporting tools
Application development tools
Executive information system (EIS) tools
OLAP tools
Data mining tools
Query and reporting tools are used to generate queries and reports. There are two types of reporting tools.
They are:
Production reporting tools, used to generate regular operational reports
Desktop report writers, inexpensive desktop tools designed for end users.
Managed query tools: used to generate SQL queries. They use meta layer software between users
and databases, which offers point-and-click creation of SQL statements. This tool is a preferred choice of
users to perform segment identification, demographic analysis, territory management, preparation of
customer mailing lists, etc.
Application development tools: a graphical data access environment which integrates
OLAP tools with the data warehouse and can be used to access all DB systems.
OLAP tools: used to analyze the data in multidimensional and complex views. To enable
multidimensional properties they use MDDB and MRDB, where MDDB refers to multidimensional databases
and MRDB refers to multirelational databases.
Data mining tools: used to discover knowledge from the data warehouse data; they can also be used
for data visualization and data correction purposes.
6. Metadata
Metadata: data about data.
Metadata in the Data Warehouse
Metadata is one of the most important aspects of data warehousing. It is the data about the data stored
in the data warehouse and its users.
Metadata provides decision-support-oriented pointers to warehouse data and thus provides a logical
link between warehouse data and the decision support application.
Metadata is the key to providing users and applications with a road map to the information stored
in the warehouse.
Metadata can define all attributes, data sources and timing, and the rules that govern data use and
data transformation of all data elements.
Metadata (metacontent) is defined as data providing information about one or more aspects of the
data, such as:
Means of creation of the data
Purpose of the data
Time and date of creation
Creator or author of the data
Location on a computer network where the data was created
Standards used
Types:
1. Technical Metadata:
It contains information about data warehouse data used by warehouse designers and administrators to carry out
development and management tasks. It includes:
Info about data stores
Transformation descriptions, that is, mapping methods from the operational DB to the warehouse DB
Warehouse object and data structure definitions for target data
The rules used to perform clean-up and data enhancement
Data mapping operations
Access authorization, backup history, archive history, info delivery history, data acquisition history,
data access, etc.
2. Business Metadata:
It contains information that gives users access to the information stored in the data warehouse. It includes:
Subject areas and info object types, including queries, reports, images, video, audio clips, etc.
Internet home pages
Info related to the info delivery system
Data warehouse operational info such as ownerships, audit trails, etc.
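As an illustration, the technical and business metadata items listed above for a single warehouse column might be recorded as one metadata entry; every field name and value below is hypothetical.

```python
column_metadata = {
    # technical metadata: source, transformation rule, load timing, access
    "target": "fact_sales.profit",
    "source": "orders.gross_amount - orders.cost_amount",
    "transformation": "currency converted to USD; nulls mapped to 0",
    "load_schedule": "nightly batch, 02:00 UTC",
    "access": ["analyst", "finance_admin"],
    # business metadata: subject area, business name, ownership
    "subject_area": "Sales",
    "business_name": "Profit per sale",
    "owner": "Finance department",
}

def describe(meta):
    """Render the business-facing 'road map' view of a metadata record."""
    return f"{meta['business_name']} ({meta['target']}) - owned by {meta['owner']}"

print(describe(column_metadata))
# Profit per sale (fact_sales.profit) - owned by Finance department
```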
"ther *#pes1
Structural metadata is used to describe the structure of computer systems such as tables,
columns and indexes. Guide metadata is used to help humans find specific items and is usually expressed
as a set of keywords in a natural language.
According to Ralph Kimball, metadata can be divided into 2 similar categories: technical
metadata and business metadata. Technical metadata corresponds to internal metadata, business
metadata to external metadata.
Kimball adds a third category named Process metadata. On the other hand, NISO distinguishes
between three types of metadata: descriptive, structural and administrative.
Descriptive metadata is the information used to search and locate an object, such as title, author,
subjects, keywords, publisher; structural metadata gives a description of how the components of the
object are organized; and administrative metadata refers to the technical information, including file type.
Two sub-types of administrative metadata are rights management metadata and preservation metadata.
Types of Data Warehouse
There are mainly three types of data warehouse:
1). Enterprise Data Warehouse.
2). Operational Data Store.
3). Data Mart.
An Enterprise Data Warehouse provides a central database for decision support throughout the
enterprise.
An Operational Data Store has a broad enterprise-wide scope but, unlike a real enterprise DW, its data
is refreshed in near real time and used for routine business activity.
A Data Mart is a sub-part of a data warehouse. It supports a particular region, or it is designed for
particular lines of business such as sales, marketing or finance; in any organization, the documents of a
particular department will be a data mart.
UNIT II
BUSINESS ANALYSIS
1. Reporting and Query Tools and Applications - Tool Categories - the Need for
Applications
Data query and reporting tools
Query and reporting tools are divided into two parts:
Reporting tools
Managed query tools
Reporting tools are further divided into two parts:
Production reporting tools let companies generate regular operational reports or support
high-volume batch jobs, such as calculating and printing paychecks.
Report writers, on the other hand, are inexpensive desktop tools designed for end users.
Managed query tools protect end users from the complexities of SQL and database structure by inserting
a meta layer between the user and the database.
The meta layer is software that provides a subject-oriented view of the database and supports point-and-click
creation of SQL.
These tools are designed for easy-to-use point-and-click and visual navigation operations that either
accept SQL or generate SQL statements to query relational data stored in the warehouse.
Some of these tools are used to format the received data into easy-to-read reports.
Data Warehouse Access Tools
The principal purpose of data warehousing is to provide information to business users for
strategic decision making.
These users interact with the data warehouse using front-end tools. Although regular reports and
custom reports are the primary delivery vehicles for the analysis done in most data warehouses, many
development efforts in the data warehouse arena are focusing on exceptional reporting, also known as alerts.
Example: If the data warehouse is designed for assessing the risk of currency trading, an alert can
be activated when a certain currency rate drops below a predefined threshold.
Access tools can be divided into five main groups:
Data query and reporting tools.
Application development tools.
Executive information system (EIS) tools.
OLAP tools.
Data mining tools.
2. Cognos Impromptu
What is Impromptu?
Impromptu is an interactive database reporting tool. It allows Power Users to query data without
programming knowledge. When using the Impromptu tool, no data is written or changed in the database. It
is only capable of reading the data.
Impromptu's main features include:
• Interactive reporting capability
• Enterprise-wide scalability
• Superior user interface
• Fastest time to result
• Lowest cost of ownership
Catalogs
Impromptu stores metadata in subject-related folders. This metadata is what will be used to
develop a query for a report. The metadata set is stored in a file called a 'catalog'. The catalog does not
contain any data. It just contains information about connecting to the database and the fields that will be
accessible for reports.
A catalog contains:
• Folders: meaningful groups of information representing columns from one or more tables
• Columns: individual data elements that can appear in one or more folders
• Calculations: expressions used to compute required values from existing data
• Conditions: used to filter information so that only a certain type of information is displayed
• Prompts: pre-defined selection criteria prompts that users can include in reports they create
• Other components, such as metadata, a logical database name, join information, and user classes
You can use catalogs to
• view, run, and print reports
• export reports to other applications
• disconnect from and connect to the database
• create reports
• change the contents of the catalog
• add user classes
Prompts
You can use prompts to
• filter reports
• calculate data items
• format data
Picklist Prompts
A picklist prompt presents you with a list of data items from which you select one or more values,
so you need not be familiar with the database. The values listed in picklist prompts can be retrieved from
a database via a catalog, when you want to select information that often changes, or from
a column in another saved Impromptu report, a snapshot, or a HotFile.
A report can include a prompt that asks you to select a product type from a list of those available in
the database. Only the products belonging to the product type you select are retrieved and displayed in
your report.
Reports
Reports are created by choosing fields from the catalog folders. This process builds a SQL
(Structured Query Language) statement behind the scenes. No SQL knowledge is required to use
Impromptu. The data in the report may be formatted, sorted and/or grouped as needed. Titles, dates,
headers and footers and other standard text formatting features (italics, bolding, and font size) are also
available. Once the desired layout is obtained, the report can be saved to a report file.
This report can then be run at a different time, and the query will be sent to the database. It is also
possible to save a report as a snapshot. This will provide a local copy of the data. This data will not be
updated when the report is opened.
Cross-tab reports, similar to Excel pivot tables, are also easily created in Impromptu.
Frame-Based Reporting
Frames are the building blocks of all Impromptu reports and templates. They may contain report
objects, such as data, text, pictures, and charts.
There are no limits to the number of frames that you can place within an individual report or
template. You can nest frames within other frames to group report objects within a report.
Different types of frames and their purpose for creating frame-based reporting:
Form frame: An empty form frame appears.
List frame: An empty list frame appears.
Text frame: The flashing I-beam appears where you can begin inserting text.
Picture frame: The Source tab (Picture Properties dialog box) appears. You can use this tab to select
the image to include in the frame.
Chart frame: The Data tab (Chart Properties dialog box) appears. You can use this tab to select the
data item to include in the chart.
OLE Object: The Insert Object dialog box appears, where you can locate and select the file you want
to insert, or you can create a new object using the software listed in the Object Type box.
Impromptu features:
Unified query and reporting interface: It unifies both query and reporting interfaces in a single user
interface
Object-oriented architecture: It enables inheritance-based administration, so that more than 1000
users can be accommodated as easily as a single user
Complete integration with PowerPlay: It provides an integrated solution for exploring trends and
patterns
Scalability: Its scalability ranges from a single user to 1000 users
Security and Control: Security is based on user profiles and their classes.
Data presented in a business context: It presents information using the terminology of the business.
Over 45 pre-defined report templates: It allows users to simply supply the data to create an
interactive report
Frame-based reporting: It offers a number of objects to create a user-designed report
Business-relevant reporting: It can be used to generate a business-relevant report through filters, pre-
conditions and calculations
Database-independent catalogs: Since catalogs are independent in nature, they require minimum
maintenance
A5">
3. OLAP
OLAP stands for Online Analytical Processing. It uses database tables (fact and dimension tables)
to enable multidimensional viewing, analysis and querying of large amounts of data. E.g. OLAP
technology could provide management with fast answers to complex queries on their operational data or
enable them to analyze their company's historical data for trends and patterns.
Online Analytical Processing (OLAP) applications and tools are those that are designed to ask
"complex queries of large multidimensional collections of data." Because of that, OLAP is accompanied by
data warehousing.
4. Needs of OLAP
The key driver of OLAP is the multidimensional nature of the business problem. These problems
are characterized by retrieving a very large number of records, which can reach gigabytes and terabytes, and
summarizing this data into a form of information that can be used by business analysts.
One of the limitations of SQL is that it cannot represent these complex problems. A query will be
translated into several SQL statements. These SQL statements will involve multiple joins, intermediate
tables, sorting, aggregations and a huge temporary memory to store these tables. These procedures
require a lot of computation and take a long time. The second limitation of SQL is
its inability to use mathematical models in these SQL statements. Even if an analyst could create these complex
statements using SQL, there would still be a large number of computations and huge memory
needed. Therefore the use of OLAP is preferable to solve this kind of problem.
5. Categories of OLAP Tools
MOLAP
This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a
multidimensional cube. The storage is not in the relational database, but in proprietary formats. That is,
data is stored in array-based structures.
Advantages:
Excellent performance: MOLAP cubes are built for fast data retrieval, and are optimal for slicing
and dicing operations.
Can perform complex calculations: All calculations have been pre-generated when the cube is
created. Hence, complex calculations are not only doable, but they return quickly.
Disadvantages:
Limited in the amount of data it can handle: Because all calculations are performed when the cube
is built, it is not possible to include a large amount of data in the cube itself. This is not to say that
the data in the cube cannot be derived from a large amount of data. Indeed, this is possible. But in
this case, only summary-level information will be included in the cube itself.
Requires additional investment: Cube technology is often proprietary and does not already exist in
the organization. Therefore, to adopt MOLAP technology, chances are additional investments in
human and capital resources are needed.
Examples: Hyperion Essbase, Fusion (Information Builders)
ROLAP
This methodology relies on manipulating the data stored in the relational database to give the
appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and
dicing is equivalent to adding a "WHERE" clause in the SQL statement. Data is stored in relational tables.
Advantages:
Can handle large amounts of data: The data size limitation of ROLAP technology is the limitation
on data size of the underlying relational database. In other words, ROLAP itself places no
limitation on data amount.
Can leverage functionalities inherent in the relational database: Often, the relational database already
comes with a host of functionalities. ROLAP technologies, since they sit on top of the relational
database, can therefore leverage these functionalities.
Disadvantages:
Performance can be slow: Because each ROLAP report is essentially a SQL query (or multiple
SQL queries) in the relational database, the query time can be long if the underlying data size is
large.
Limited by SQL functionalities: Because ROLAP technology mainly relies on generating SQL
statements to query the relational database, and SQL statements do not fit all needs (for example, it
is difficult to perform complex calculations using SQL), ROLAP technologies are therefore
traditionally limited by what SQL can do. ROLAP vendors have mitigated this risk by building
into the tool out-of-the-box complex functions as well as the ability to allow users to define their
own functions.
Examples: Microstrategy Intelligence Server, MetaCube (Informix/IBM)
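The point above, that each ROLAP slice or dice amounts to adding a WHERE predicate to a SQL statement over ordinary relational tables, can be sketched with SQLite; the schema and data are illustrative, and the query builder is a deliberately minimal stand-in for what a ROLAP engine generates.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, product TEXT, year INTEGER, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?,?,?,?)", [
    ("East", "Laptop", 2023, 500.0), ("East", "Laptop", 2024, 700.0),
    ("West", "Laptop", 2024, 300.0), ("West", "Desk", 2024, 150.0),
])

def rolap_query(filters):
    """Each slice/dice the user performs becomes one WHERE predicate.
    `filters` maps trusted column names to values (illustration only)."""
    sql = "SELECT SUM(amount) FROM sales"
    params = []
    if filters:
        sql += " WHERE " + " AND ".join(f"{col} = ?" for col in filters)
        params = list(filters.values())
    return con.execute(sql, params).fetchone()[0]

print(rolap_query({}))                                # whole cube: 1650.0
print(rolap_query({"year": 2024}))                    # slice:      1150.0
print(rolap_query({"year": 2024, "region": "West"}))  # dice:       450.0
```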
HOLAP (MQE: Managed Query Environment)
HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For summary-
type information, HOLAP leverages cube technology for faster performance. It stores only the indexes and
aggregations in multidimensional form, while the rest of the data is stored in the relational database.
Examples: PowerPlay (Cognos), Brio, Microsoft Analysis Services, Oracle Advanced Analytic
Services
6. Multidimensional Versus Multirelational OLAP
These relational implementations of multidimensional database systems are sometimes referred to
as multirelational database systems. To achieve the required speed, these products use the star or snowflake
schemas: specially optimized and denormalized data models that involve data restructuring and
aggregation. (The snowflake schema is an extension of the star schema that supports multiple fact tables
and joins between them.)
One benefit of the star schema approach is reduced complexity in the data model, which increases
data "legibility," making it easier for users to pose business questions of an OLAP nature. Data warehouse
queries can be answered up to 10 times faster because of improved navigations.
Two types of database activity:
1. OLTP: On-Line Transaction Processing
Short transactions, both queries and updates (e.g., update account balance, enroll in course)
Queries are simple (e.g., find account balance, find grade in course)
Updates are frequent (e.g., concert tickets, seat reservations, shopping carts)
2. OLAP: On-Line Analytical Processing
Long transactions, usually complex queries (e.g., all statistics about all sales, grouped by dept and month)
"Data mining" operations
Infrequent updates
OLTP vs OLAP
OLTP stands for On-Line Transaction Processing and is a data modeling approach typically used to
facilitate and manage usual business applications. Most of the applications you see and use are OLTP based.
OLTP technology is used to perform updates on operational or transactional systems (e.g., point of
sale systems).
OLAP stands for On-Line Analytical Processing and is an approach to answering multi-
dimensional queries. OLAP was conceived for Management Information Systems and Decision
Support Systems. OLAP technology is used to perform complex analysis of the data in a data
warehouse.
The following table summarizes the major differences between OLTP and OLAP system design.
(OLTP System = Online Transaction Processing, an operational system; OLAP System = Online
Analytical Processing, a data warehouse.)
Source of data: OLTP - operational data; OLTPs are the original source of the data. OLAP -
consolidation data; OLAP data comes from the various OLTP databases.
Purpose of data: OLTP - to control and run fundamental business tasks. OLAP - to help with
planning, problem solving, and decision support.
What the data reveals: OLTP - a snapshot of ongoing business processes. OLAP - multi-dimensional
views of various kinds of business activities.
Inserts and updates: OLTP - short and fast inserts and updates initiated by end users. OLAP -
periodic long-running batch jobs refresh the data.
Queries: OLTP - relatively standardized and simple queries returning relatively few records. OLAP -
often complex queries involving aggregations.
Processing speed: OLTP - typically very fast. OLAP - depends on the amount of data involved;
batch data refreshes and complex queries may take many hours; query speed can be improved by
creating indexes.
Space requirements: OLTP - can be relatively small if historical data is archived. OLAP - larger due
to the existence of aggregation structures and history data; requires more indexes than OLTP.
Database design: OLTP - highly normalized with many tables. OLAP - typically de-normalized with
fewer tables; use of star and/or snowflake schemas.
Backup and recovery: OLTP - back up religiously; operational data is critical to run the business, and
data loss is likely to entail significant monetary loss and legal liability. OLAP - instead of regular
backups, some environments may consider simply reloading the OLTP data as a recovery method.
7. The Multidimensional Data Model
The multidimensional data model is an integral part of On-Line Analytical Processing, or OLAP.
Because OLAP is on-line, it must provide answers quickly; analysts pose iterative queries during
interactive sessions, not in batch jobs that run overnight. And because OLAP is also analytic, the queries
are complex. The multidimensional data model is designed to solve complex queries in real time.
One way to view the multidimensional data model is as a cube. The table at the left contains detailed sales
data by product, market and time. The cube on the right associates sales numbers (units sold) with
dimensions: product type, market and time, with the unit variables organized as cells in an array.
This cube can be expanded to include another array, price, which can be associated with all or only some dimensions. As the number of dimensions increases, the number of cube cells increases exponentially.
Dimensions are hierarchical in nature; e.g., the time dimension may contain hierarchies for years, quarters, months, weeks and days, and GEOGRAPHY may contain country, state, city, etc.
In this cube we can observe that each side of the cube represents one of the elements of the question. The x-axis represents time, the y-axis represents the products and the z-axis represents the different centers. The cells of the cube represent the number of products sold, or can represent the price of the items.
This figure also gives a different understanding of the drill-down operation. The relations defined need not be direct; they may be related indirectly.
As the size of the dimensions increases, the size of the cube also increases exponentially. The response time of the cube depends on the size of the cube.
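The cube description above can be made concrete; a minimal sketch in Python, where the cube is a sparse mapping from (product, market, time) coordinates to a units-sold measure (all names and numbers are invented for illustration):

```python
# A toy 3-D sales cube: each cell coordinate (product, market, time)
# maps to a measure (units sold). Missing coordinates are empty cells,
# so the cube is stored sparsely.
cube = {
    ("TV",    "East", "Q1"): 120,
    ("TV",    "West", "Q1"): 80,
    ("Radio", "East", "Q1"): 50,
    ("TV",    "East", "Q2"): 130,
}

def cell(product, market, time):
    """Look up one cell; empty cells read as 0."""
    return cube.get((product, market, time), 0)

# Each added dimension multiplies the number of potential cells,
# which is why the cell count grows exponentially with dimensionality.
n_products, n_markets, n_times = 2, 2, 2
potential_cells = n_products * n_markets * n_times  # 2 * 2 * 2 = 8
```

Storing only the non-empty cells keeps the exponential number of potential cells from having to be materialized.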
Operations in the Multidimensional Data Model:
- Aggregation (roll-up)
  - dimension reduction: e.g., total sales by city
  - summarization over an aggregate hierarchy: e.g., total sales by city and year -> total sales by region and by year
- Selection (slice) defines a subcube
  - e.g., sales where city = Palo Alto and date = 1/15/96
- Navigation to detailed data (drill-down)
  - e.g., (sales - expense) by city, top 3% of cities by average income
- Visualization operations (e.g., pivot or dice)
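The roll-up and slice operations above can be sketched against such a cell mapping; a minimal illustration (the cube contents are invented):

```python
# A sparse cube: (product, market, year) -> units sold.
cube = {
    ("TV",    "Palo Alto", "1996"): 10,
    ("Radio", "Palo Alto", "1996"): 4,
    ("TV",    "Chicago",   "1996"): 7,
    ("TV",    "Palo Alto", "1997"): 12,
}

def roll_up(cube, dim_index):
    """Aggregation (roll-up): sum out one dimension, reducing the cube."""
    out = {}
    for coords, units in cube.items():
        key = coords[:dim_index] + coords[dim_index + 1:]
        out[key] = out.get(key, 0) + units
    return out

def slice_(cube, dim_index, value):
    """Selection (slice): keep only the cells where one dimension
    equals a fixed value, i.e. a subcube."""
    return {c: u for c, u in cube.items() if c[dim_index] == value}

by_product_year = roll_up(cube, 1)        # drop the market dimension
palo_alto = slice_(cube, 1, "Palo Alto")  # subcube for one city
# Drill-down is the inverse of roll-up: returning to the detailed cube.
```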
OLAP Guidelines
Dr. E.F. Codd, the "father" of the relational model, created a list of rules to deal with OLAP systems. Users should prioritize these rules according to their needs to match their business requirements (Reference 3). These rules are:
1) Multidimensional conceptual view: The OLAP should provide an appropriate multidimensional business model that suits the business problems and requirements.
2) Transparency: The OLAP tool should provide transparency of the input data to the users.
3) Accessibility: The OLAP tool should access only the data required for the analysis needed.
4) Consistent reporting performance: The size of the database should not affect the performance in any way.
5) Client/server architecture: The OLAP tool should use the client-server architecture to ensure better performance and flexibility.
6) Generic dimensionality: Data entered should be equivalent to the structure and operation requirements.
7) Dynamic sparse matrix handling: The OLAP tool should be able to manage the sparse matrix and so maintain the level of performance.
8) Multi-user support: The OLAP should allow several users to work together concurrently.
9) Unrestricted cross-dimensional operations: The OLAP tool should be able to perform operations across the dimensions of the cube.
10) Intuitive data manipulation: "Consolidation path re-orientation, drilling down across columns or rows, zooming out, and other manipulation inherent in the consolidation path outlines should be accomplished via direct action upon the cells of the analytical model, and should neither require the use of a menu nor multiple trips across the user interface." (Reference 4)
11) Flexible reporting: It is the ability of the tool to present the rows and columns in a manner suitable to be analyzed.
12) Unlimited dimensions and aggregation levels: This depends on the kind of business; multiple dimensions and defined hierarchies can be made.
In addition to these guidelines, an OLAP system should also support:
Comprehensive database management tools: This gives the database management the ability to control distributed businesses.
The ability to drill down to detail (source record) level: this requires that the OLAP tool allow smooth transitions in the multidimensional database.
Incremental database refresh: The OLAP tool should provide partial refresh.
Structured Query Language (SQL) interface: The OLAP system should be able to integrate effectively in the surrounding enterprise environment.
UNIT III
DATA MINING
1. Data Mining (Knowledge Discovery in Databases)
Data mining is the extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases.
Data mining is the practice of automatically searching large stores of data to discover patterns and trends that go beyond simple analysis. Data mining uses sophisticated mathematical algorithms to segment the data and evaluate the probability of future events. Data mining is also known as Knowledge Discovery in Data (KDD).
The key properties of data mining are:
Automatic discovery of patterns
Prediction of likely outcomes
Creation of actionable information
Focus on large data sets and databases
Data mining can answer questions that cannot be addressed through simple query and reporting techniques.
2. Data Mining Functions
A basic understanding of data mining functions and algorithms is required for using Oracle Data Mining. This section introduces the concept of data mining functions. Algorithms are introduced in "Data Mining Algorithms".
Each data mining function specifies a class of problems that can be modeled and solved. Data mining functions fall generally into two categories: supervised and unsupervised. Notions of supervised and unsupervised learning are derived from the science of machine learning, which has been called a sub-area of artificial intelligence.
Artificial intelligence refers to the implementation and study of systems that exhibit autonomous intelligence or behavior of their own. Machine learning deals with techniques that enable devices to learn from their own performance and modify their own functioning. Data mining applies machine learning concepts to data.
Supervised Data Mining:
Supervised learning is also known as directed learning. The learning process is directed by a previously known dependent attribute or target. Directed data mining attempts to explain the behavior of the target as a function of a set of independent attributes or predictors.
Supervised learning generally results in predictive models. This is in contrast to unsupervised learning, where the goal is pattern detection.
The building of a supervised model involves training, a process whereby the software analyzes many cases where the target value is already known. In the training process, the model "learns" the logic for making the prediction. For example, a model that seeks to identify the customers who are likely to respond to a promotion must be trained by analyzing the characteristics of many customers who are known to have responded or not responded to a promotion in the past.
Unsupervised Data Mining:
Unsupervised learning is non-directed. There is no distinction between dependent and independent attributes. There is no previously known result to guide the algorithm in building the model.
Unsupervised learning can be used for descriptive purposes. It can also be used to make predictions.
Data pre-processing
Data pre-processing is an often neglected but important step in the data mining process. The phrase "garbage in, garbage out" is particularly applicable to data mining and machine learning projects. Data-gathering methods are often loosely controlled, resulting in out-of-range values (e.g., Income: -100), impossible data combinations (e.g., Gender: Male, Pregnant: Yes), missing values, etc. Analyzing data that has not been carefully screened for such problems can produce misleading results. Thus, the representation and quality of data comes first and foremost before running an analysis.
If there is much irrelevant and redundant information present, or noisy and unreliable data, then knowledge discovery during the training phase is more difficult. Data preparation and filtering steps can take a considerable amount of processing time. Data pre-processing includes cleaning, normalization, transformation, feature extraction and selection, etc. The product of data pre-processing is the final training set. Kotsiantis et al. (2006) present a well-known algorithm for each step of data pre-processing.
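The screening described above can be automated with simple range and consistency checks; a minimal sketch (the field names and validation rules are invented for illustration):

```python
# Toy records containing the kinds of problems described above:
# an out-of-range value and an impossible combination.
records = [
    {"income": 52000, "gender": "F", "pregnant": "yes"},
    {"income": -100,  "gender": "M", "pregnant": "no"},   # out-of-range value
    {"income": 41000, "gender": "M", "pregnant": "yes"},  # impossible combination
]

def screen(record):
    """Return a list of data-quality problems found in one record."""
    problems = []
    if record["income"] < 0:
        problems.append("income out of range")
    if record["gender"] == "M" and record["pregnant"] == "yes":
        problems.append("impossible gender/pregnant combination")
    return problems

flagged = [r for r in records if screen(r)]  # records needing attention
```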
3. Classification of Data Mining Systems
Data mining classification scheme:
1. Decisions in data mining
- Kinds of databases to be mined
- Kinds of knowledge to be discovered
- Kinds of techniques utilized
- Kinds of applications adapted
2. Data mining tasks
- Descriptive data mining
- Predictive data mining
1. Decisions in data mining
- Databases to be mined
  o Relational, transactional, object-oriented, object-relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW, etc.
- Knowledge to be mined
  o Characterization, discrimination, association, classification, clustering, trend, deviation and outlier analysis, etc.
  o Multiple/integrated functions and mining at multiple levels
- Techniques utilized
  o Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, neural network, etc.
- Applications adapted
  o Retail, telecommunication, banking, fraud analysis, DNA mining, stock market analysis, Web mining, Weblog analysis, etc.
2. Data mining tasks
- Prediction tasks
  o Use some variables to predict unknown or future values of other variables
- Description tasks
  o Find human-interpretable patterns that describe the data.
Common data mining tasks
- Classification [Predictive]
- Clustering [Descriptive]
- Association Rule Discovery [Descriptive]
- Sequential Pattern Discovery [Descriptive]
- Regression [Predictive]
- Deviation Detection [Predictive]
Classifications of data mining systems:
Supervised learning (classification)
  Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  New data is classified based on the training set
Unsupervised learning (clustering)
  The class labels of the training data are unknown
  Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.
Classification
  predicts categorical class labels (discrete or nominal)
  classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses it in classifying new data
Numeric Prediction
  models continuous-valued functions, i.e., predicts unknown or missing values
Typical applications
  Credit/loan approval
  Medical diagnosis: if a tumor is cancerous or benign
  Fraud detection: if a transaction is fraudulent
  Web page categorization: which category it belongs to
4. Data Mining Task Primitives
The set of task-relevant data to be mined: This specifies the portions of the database or the set of data in which the user is interested. This includes the database attributes or data warehouse dimensions of interest (referred to as the relevant attributes or dimensions).
The kind of knowledge to be mined: This specifies the data mining functions to be performed, such as characterization, discrimination, association or correlation analysis, classification, prediction, clustering, outlier analysis, or evolution analysis.
The background knowledge to be used in the discovery process: This knowledge about the domain to be mined is useful for guiding the knowledge discovery process and for evaluating the patterns found. Concept hierarchies are a popular form of background knowledge, which allow data to be mined at multiple levels of abstraction. An example of a concept hierarchy for the attribute (or dimension) age is shown in the figure. User beliefs regarding relationships in the data are another form of background knowledge.
The interestingness measures and thresholds for pattern evaluation: These may be used to guide the mining process or, after discovery, to evaluate the discovered patterns. Different kinds of knowledge may have different interestingness measures. For example, interestingness measures for association rules include support and confidence. Rules whose support and confidence values are below user-specified thresholds are considered uninteresting.
The expected representation for visualizing the discovered patterns: This refers to the form in which discovered patterns are to be displayed, which may include rules, tables, charts, graphs, decision trees, and cubes.
5. Data Preprocessing
The real-world data that is to be analyzed by data mining techniques may be:
1) Incomplete: lacking attribute values or certain attributes of interest, or containing only aggregate data. Missing data, particularly for tuples with missing values for some attributes, may need to be inferred.
2) Noisy: containing errors, or outlier values that deviate from the expected. Incorrect data may also result from inconsistencies in naming conventions or data codes used, or inconsistent formats for input fields, such as date. It is hence necessary to use some techniques to replace the noisy data.
3) Inconsistent: containing discrepancies between different data items. Some attributes representing a given concept may have different names in different databases, causing inconsistencies and redundancies. Naming inconsistencies may also occur for attribute values. The inconsistency in data needs to be removed.
Preprocessing also serves further goals:
4) Aggregate information: It would be useful to obtain aggregate information, such as the sales per customer region, something that is not part of any pre-computed data cube in the data warehouse.
5) Enhancing the mining process: A large number of data sets may make the data mining process slow. Hence, reducing the number of data sets to enhance the performance of the mining process is important.
6) Improving data quality: Data preprocessing techniques can improve the quality of the data, thereby helping to improve the accuracy and efficiency of the subsequent mining process. Data preprocessing is an important step in the knowledge discovery process, because quality decisions must be based on quality data. Detecting data anomalies, rectifying them early, and reducing the data to be analyzed can lead to huge payoffs for decision making.
Different Forms of Data Preprocessing
Data Cleaning:
Data cleaning routines work to "clean" the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.
If users believe the data are dirty, they are unlikely to trust the results of any data mining that has been applied to it. Also, dirty data can cause confusion for the mining procedure, resulting in unreliable output, since mining procedures are not always robust.
Therefore, a useful preprocessing step is to run some data-cleaning routines.
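One common cleaning routine is filling in a missing numeric value with the attribute mean; a minimal sketch (the values are invented):

```python
# A numeric attribute with missing values represented as None.
values = [4.0, None, 6.0, None, 5.0]

# Mean of the observed (non-missing) values.
known = [v for v in values if v is not None]
mean = sum(known) / len(known)

# Fill each missing entry with the attribute mean.
cleaned = [mean if v is None else v for v in values]
```

Other fill strategies (a global constant, the class-conditional mean, or an inferred value) follow the same pattern with a different replacement rule.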
Data Integration:
Data integration involves integrating data from multiple databases, data cubes, or files.
Some attributes representing a given concept may have different names in different databases, causing inconsistencies and redundancies. For example, the attribute for customer identification may be referred to as customer_id in one data store and cust_id in another. Naming inconsistencies may also occur for attribute values.
Also, some attributes may be inferred from others (e.g., annual revenue).
Having a large amount of redundant data may slow down or confuse the knowledge discovery process. Additional data cleaning can be performed to detect and remove redundancies that may have resulted from data integration.
Data Transformation:
Data transformation operations, such as normalization and aggregation, are additional data preprocessing procedures that contribute toward the success of the mining process.
Normalization: Normalization is scaling the data to be analyzed to a specific range, such as [0.0, 1.0], for providing better results.
Aggregation: It would also be useful for data analysis to obtain aggregate information, such as the sales per customer region. As this is not part of any pre-computed data cube, it would need to be computed. This process is called aggregation.
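The scaling to [0.0, 1.0] mentioned above is usually done by min-max normalization; a minimal sketch (the input values are invented, and the function assumes the values are not all equal):

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so they span [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    span = old_max - old_min  # assumed non-zero
    return [new_min + (v - old_min) * (new_max - new_min) / span
            for v in values]

scaled = min_max_normalize([200, 300, 400, 600, 1000])
```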
Data Reduction:
Data reduction obtains a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results. There are a number of strategies for data reduction:
data aggregation (e.g., building a data cube),
attribute subset selection (e.g., removing irrelevant attributes through correlation analysis),
dimensionality reduction (e.g., using encoding schemes such as minimum length encoding or wavelets),
numerosity reduction (e.g., "replacing" the data by alternative, smaller representations such as clusters or parametric models), and
generalization with the use of concept hierarchies, organizing the concepts into varying levels of abstraction.
Data discretization is very useful for the automatic generation of concept hierarchies from numerical data.
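Discretization of numerical data, mentioned above, can be done with equal-width binning, one simple way to generate the lowest level of a concept hierarchy; a sketch (the ages and the bin count are invented):

```python
def equal_width_bins(values, n_bins):
    """Assign each value the index of its equal-width interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    labels = []
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)  # clamp the maximum value
        labels.append(idx)
    return labels

# E.g., ages binned into three ranges that could be labeled
# "young", "middle-aged", "senior" in a concept hierarchy.
ages = [13, 15, 16, 19, 20, 21, 25, 30, 33, 36, 40, 46, 52, 70]
bins = equal_width_bins(ages, 3)
```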
UNIT IV
ASSOCIATION RULE MINING AND CLASSIFICATION
1. Frequent Pattern Analysis
Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set.
First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining.
Motivation: finding inherent regularities in data
  What products were often purchased together? Beer and diapers?!
  What are the subsequent purchases after buying a PC?
  What kinds of DNA are sensitive to this new drug?
  Can we automatically classify web documents?
Applications:
  Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis.
Why Is Frequent Pattern Mining Important?
  Discloses an intrinsic and important property of data sets
  Forms the foundation for many essential data mining tasks:
    Association, correlation, and causality analysis
    Sequential, structural (e.g., sub-graph) patterns
    Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
    Classification: associative classification
    Cluster analysis: frequent pattern-based clustering
    Data warehousing: iceberg cube and cube-gradient
    Semantic data compression: fascicles
    Broad applications
Basic Concepts: Frequent Patterns and Association Rules
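The figure that accompanied this heading does not survive in this copy; as a substitute, here is a minimal sketch of the two basic measures the later sections rely on, support and confidence, computed over an invented transaction list:

```python
# Toy market-basket transactions, each a set of items.
transactions = [
    {"beer", "diapers", "milk"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"beer", "bread"},
]

def support(itemset):
    """Fraction of all transactions that contain every item in the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """Confidence of the rule antecedent -> consequent:
    support(A union B) / support(A)."""
    return support(antecedent | consequent) / support(antecedent)

s = support({"beer", "diapers"})       # 2 of 4 transactions
c = confidence({"beer"}, {"diapers"})  # 2 of the 3 beer transactions
```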
2. Constraint-Based (Query-Directed) Mining
Finding all the patterns in a database autonomously? Unrealistic!
  o The patterns could be too many but not focused!
Data mining should be an interactive process
  o The user directs what is to be mined using a data mining query language (or a graphical user interface)
Constraint-based mining
  o User flexibility: provides constraints on what is to be mined
  o System optimization: explores such constraints for efficient, constraint-based mining
Constraints in Data Mining
  Knowledge type constraint:
    o classification, association, etc.
  Data constraint (using SQL-like queries):
    o find product pairs sold together in stores in Chicago in Dec. '02
  Dimension/level constraint:
    o in relevance to region, price, brand, customer category
  Rule (or pattern) constraint:
    o small sales (price < $10) triggers big sales (sum > $200)
  Interestingness constraint:
    o strong rules: min_support 3%, min_confidence 60%
Constrained Mining vs. Constraint-Based Search
  Constrained mining vs. constraint-based search/reasoning
    o Both are aimed at reducing the search space
    o Finding all patterns satisfying constraints vs. finding some (or one) answer in constraint-based search in AI
    o Constraint-pushing vs. heuristic search
    o It is an interesting research problem how to integrate them
  Constrained mining vs. query processing in DBMS
    o Database query processing requires finding all answers
    o Constrained pattern mining shares a similar philosophy with pushing selections deeply into query processing
The Apriori Algorithm: Example
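The worked figure for this example is not reproduced here; as a substitute, a compact sketch of the Apriori level-wise loop (the transactions and the minimum support count are invented):

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Level-wise frequent-itemset mining: frequent k-itemsets are joined
    to form candidate (k+1)-itemsets, which are pruned by support count."""
    def frequent_of(candidates):
        return {c for c in candidates
                if sum(1 for t in transactions if c <= t) >= min_count}

    # Level 1: frequent single items.
    level = frequent_of({frozenset([i]) for t in transactions for i in t})
    frequent = set(level)
    k = 2
    while level:
        # Join step: unions of frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == k}
        level = frequent_of(candidates)
        frequent |= level
        k += 1
    return frequent

ts = [{"beer", "diapers"}, {"beer", "diapers", "milk"},
      {"beer", "milk"}, {"diapers"}]
result = apriori(ts, min_count=2)  # itemsets in at least 2 transactions
```

This sketch omits the subset-pruning refinement of the full algorithm and relies on the support count alone, which still yields exactly the frequent itemsets.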
3. Decision Tree Induction
Information produced by data mining techniques can be represented in many different ways. Decision tree structures are a common way to organize classification schemes. In classification tasks, decision trees visualize what steps are taken to arrive at a classification. Every decision tree begins with what is termed a root node, considered to be the "parent" of every other node. Each node in the tree evaluates an attribute in the data and determines which path it should follow. Typically, the decision test is based on comparing a value against some constant. Classification using a decision tree is performed by routing from the root node until arriving at a leaf node.
The illustration provided here is a canonical example in data mining, involving the decision to play or not play based on climate conditions. In this case, outlook is in the position of the root node. The branches of the node are attribute values. In this example, the child nodes are tests of humidity and windy, leading to the leaf nodes which are the actual classifications. This example also includes the corresponding data, also referred to as instances. In our example, there are 9 "play" days and 5 "no play" days.
Decision trees can represent diverse types of data. The simplest and most familiar is numerical data. It is often desirable to organize nominal data as well. Nominal quantities are formally described by a discrete set of symbols. For example, weather can be described in either numeric or nominal fashion: we can quantify the temperature by saying that it is 11 degrees Celsius or 52 degrees Fahrenheit. We could also say that it is cold, cool, mild, warm or hot. The former is an example of numeric data, and the latter is a type of nominal data. More accurately, the example of cold, cool, mild, warm and hot is a special type of nominal data, described as ordinal data. Ordinal data carries an implicit assumption of ordered relationships between the values. Continuing with the weather example, we could also have a purely nominal description like sunny, overcast and rainy. These values have no relationships or distance measures.
The type of data organized by a tree is important for understanding how the tree works at the node level. Recalling that each node is effectively a test, numeric data is often evaluated in terms of a simple mathematical inequality. For example, numeric weather data could be tested by finding whether it is greater than 10 degrees Fahrenheit. Nominal data is tested in Boolean fashion; in other words, whether or not it has a particular value. The illustration shows both types of tests. In the weather example, outlook is a nominal data type. The test simply asks which attribute value is represented and routes accordingly. The humidity node reflects numeric tests, with an inequality of less than or equal to 70, or greater than 70.
Decision tree induction algorithms function recursively. First, an attribute must be selected as the root node. In order to create the most efficient (i.e., smallest) tree, the root node must effectively split the data. Each split attempts to pare down a set of instances (the actual data) until they all have the same classification. The best split is the one that provides what is termed the most information gain.
Information in this context comes from the concept of entropy in information theory, as developed by Claude Shannon. Although "information" has many contexts, it has a very specific mathematical meaning relating to certainty in decision making. Ideally, each split in the decision tree should bring us closer to a classification. One way to conceptualize this is to see each step along the tree as removing randomness, or entropy. Information, expressed as a mathematical quantity, reflects this. For example, consider a very simple classification problem that requires creating a decision tree to decide yes or no based on some data. This is exactly the scenario visualized in the decision tree. Each attribute's values will have a certain number of yes or no classifications. If there are equal numbers of yeses and no's, then there is a great deal of entropy in that value, and information reaches a maximum. Conversely, if there are only yeses or only no's, the entropy is zero, and the attribute value is very useful for making a decision.
The formula for calculating intermediate values is as follows:
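The formula does not survive in this copy; from the surrounding discussion it is presumably the Shannon entropy, Entropy(S) = -sum(p_i * log2(p_i)), with the information gain of a split being the parent's entropy minus the size-weighted entropy of the branches. A sketch using the 9 "play" / 5 "no play" counts quoted above, with the per-branch counts taken from the classic weather-data split on outlook:

```python
from math import log2

def entropy(counts):
    """Shannon entropy, in bits, of a class distribution given as counts."""
    total = sum(counts)
    probs = [c / total for c in counts if c]
    return -sum(p * log2(p) for p in probs)

def information_gain(parent_counts, branches):
    """Parent entropy minus the size-weighted entropy of the branches."""
    total = sum(parent_counts)
    weighted = sum(sum(b) / total * entropy(b) for b in branches)
    return entropy(parent_counts) - weighted

# 9 "play" vs 5 "no play" days; splitting on outlook gives the branches
# sunny [2, 3], overcast [4, 0], rainy [3, 2].
gain = information_gain([9, 5], [[2, 3], [4, 0], [3, 2]])
```

Note how a pure branch such as overcast [4, 0] contributes zero entropy, exactly the "only yeses" case described above.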
Machine Learning
The general problem of machine learning is to search a (usually very large) space of potential hypotheses to determine the one that will best fit the data and any prior knowledge. The data may be labelled or unlabelled. If labels are given, then the problem is one of supervised learning, in that the true answer is known for a given set of data. If the labels are categorical, then the problem is one of classification, e.g. predicting the species of a flower given petal and sepal measurements. If the labels are real-valued, the problem is one of regression, e.g. predicting property values from crime, pollution, etc. statistics. If labels are not given, then the problem is one of unsupervised learning, and the aim is to characterize the structure of the data, e.g. by identifying groups of examples in the data that are collectively similar to each other and distinct from the other data.
Supervised Learning
Given some examples, we wish to predict certain properties. In the case where a set of examples whose properties have already been characterized is available, the task is to learn the relationship between the two. One common early approach was to present the examples in turn to a learner. The learner makes a prediction of the property of interest, the correct answer is presented, and the learner adjusts its hypothesis accordingly. This is known as learning with a teacher, or supervised learning.
In supervised learning there is necessarily the assumption that the available descriptors are in some way related to a quantity of interest. For instance, suppose that a bank wishes to detect fraudulent credit card transactions. In order to do this, some domain knowledge is required to identify factors that are likely to be indicative of fraudulent use. These may include frequency of usage, amount of transaction, spending patterns, type of business engaging in the transaction, and so forth. These variables are the predictive, or independent, variables X. It would be hoped that these were in some way related to the target, or dependent, variable Y. Deciding which variables to use in a model is a very difficult problem in general; this is known as the problem of feature selection and is NP-complete. Many methods exist for choosing the predictive variables; if domain knowledge is available, it can be very useful in this context. Here we assume that at least some of the predictive variables are in fact predictive. Assume, then, that the relationship between X and Y is given by the joint probability density p(X, Y).
UNIT V
CLUSTERING AND APPLICATIONS AND TRENDS IN DATA MINING
1. Cluster Analysis
Data clustering is a method in which we make clusters of objects that are somehow similar in characteristics. The criterion for checking the similarity is implementation dependent.
Clustering is often confused with classification, but there is some difference between the two. In classification the objects are assigned to pre-defined classes, whereas in clustering the classes are also to be defined.
Precisely, data clustering is a technique in which the information that is logically similar is physically stored together. In order to increase the efficiency in database systems, the number of disk accesses is to be minimized. In clustering, the objects of similar properties are placed in one class of objects, and a single access to the disk makes the entire class available.
1.1 Example to Elaborate the Idea of Clustering
In order to elaborate the concept a little, let us take the example of a library system. In a library, books concerning a large variety of topics are available. They are always kept in the form of clusters. The books that have some kind of similarity among them are placed in one cluster. For example, books on databases are kept in one shelf and books on operating systems are kept in another cupboard, and so on. To further reduce the complexity, the books that cover the same kind of topics are placed on the same shelf. Then the shelves and the cupboards are labeled with the relevant names. Now when a user wants a book of a specific kind on a specific topic, he or she only has to go to that particular shelf and check for the book, rather than checking the entire library.
2. DEFINITIONS
In this section some frequently used terms are defined.
2.1 Cluster
A cluster is an ordered list of objects which have some common characteristics. The objects belong to an interval [a, b]; in our case [0, 1].
2.2 Distance Between Two Clusters
The distance between two clusters involves some or all elements of the two clusters. The clustering
method determines how the distance should be computed.
2.3 Similarity
A similarity measure SIMILAR(Di, Dj) can be used to represent the similarity between two documents. Typical similarity measures generate values of 0 for documents exhibiting no agreement among the assigned indexed terms, and 1 when perfect agreement is detected. Intermediate values are obtained for cases of partial agreement.
2.4 Average Similarity
If the similarity measure is computed for all pairs of documents (Di, Dj) except when i = j, an average value AVERAGE SIMILARITY is obtainable. Specifically, AVERAGE SIMILARITY = CONSTANT * SIMILAR(Di, Dj), where i = 1, 2, ..., n and j = 1, 2, ..., n and i <> j.
2.5 Threshold
The lowest possible input value of similarity required to join two objects in one cluster.
2.6 Similarity Matrix
Similarity between objects calculated by the function SIMILAR(Di, Dj), represented in the form of a matrix, is called a similarity matrix.
2.7 Dissimilarity Coefficient
The dissimilarity coefficient of two clusters is defined to be the distance between them. The smaller the value of the dissimilarity coefficient, the more similar two clusters are.
2.8 Cluster Seed
The first document or object of a cluster is defined as the initiator of that cluster, i.e. every incoming object's similarity is compared with the initiator. The initiator is called the cluster seed.
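The definitions above can be tied together; a minimal sketch that computes a similarity matrix, using set overlap of indexed terms (Jaccard) as one possible SIMILAR function (the documents are invented):

```python
def similar(d1, d2):
    """Similarity in [0, 1]: fraction of agreement among assigned index
    terms (Jaccard overlap; 0 = no agreement, 1 = perfect agreement)."""
    if not d1 and not d2:
        return 1.0
    return len(d1 & d2) / len(d1 | d2)

def similarity_matrix(docs):
    """Matrix of SIMILAR(Di, Dj) for every pair of documents."""
    return [[similar(a, b) for b in docs] for a in docs]

# Each document represented by its set of indexed terms.
docs = [{"data", "mining"}, {"data", "warehouse"}, {"data", "mining"}]
matrix = similarity_matrix(docs)
```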
3. TYPES OF CLUSTERING METHODS
There are many clustering methods available, and each of them may give a different grouping of a dataset. The choice of a particular method will depend on the type of output desired, the known performance of the method with particular types of data, the hardware and software facilities available, and the size of the dataset. In general, clustering methods may be divided into two categories based on the cluster structure which they produce. The non-hierarchical methods divide a dataset of N objects into M clusters, with or without overlap.
These methods are sometimes divided into partitioning methods, in which the classes are mutually exclusive, and the less common clumping methods, in which overlap is allowed. Each object is a member of the cluster with which it is most similar; however, the threshold of similarity has to be defined. The hierarchical methods produce a set of nested clusters in which each pair of objects or clusters is progressively nested in a larger cluster until only one cluster remains. The hierarchical methods can be further divided into agglomerative and divisive methods. In agglomerative methods, the hierarchy is built up in a series of N-1 agglomerations, or fusions, of pairs of objects, beginning with the un-clustered dataset. The less common divisive methods begin with all objects in a single cluster and at each of N-1 steps divide some cluster into two smaller clusters, until each object resides in its own cluster.
Some of the important Data Clustering Methods are described below.
Partitioning Methods
The partitioning methods generally result in a set of M clusters, each object belonging to one
cluster. Each cluster may be represented by a centroid or a cluster representative; this is some sort of
summary description of all the objects contained in a cluster. The precise form of this description will
depend on the type of the object which is being clustered. In cases where real-valued data is available, the
arithmetic mean of the attribute vectors for all objects within a cluster provides an appropriate
representative; alternative types of centroid may be required in other cases, e.g., a cluster of documents can
be represented by a list of those keywords that occur in some minimum number of documents within the
cluster. If the number of clusters is large, the centroids can be further clustered to produce a hierarchy
within the dataset.
Single Pass: A very simple partitioning method, the single pass method creates a partitioned dataset as
follows:
1. Make the first object the centroid for the first cluster.
2. For the next object, calculate the similarity, S, with each existing cluster centroid, using some
similarity coefficient.
3. If the highest calculated S is greater than some specified threshold value, add the object to the
corresponding cluster and recompute the centroid; otherwise, use the object to initiate a new
cluster. If any objects remain to be clustered, return to step 2.
As its name implies, this method requires only one pass through the dataset; the time requirements are
typically of order O(NlogN) for order O(logN) clusters. This makes it a very efficient clustering method for
a serial processor. A disadvantage is that the resulting clusters are not independent of the order in which
the documents are processed, with the first clusters formed usually being larger than those created later in
the clustering run.
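The three steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not an implementation from the notes: the use of cosine similarity as the similarity coefficient, the sample vectors, and the threshold value are all assumptions made for the example.

```python
import math

def cosine(a, b):
    # cosine similarity coefficient between two attribute vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def single_pass_cluster(objects, threshold):
    """Single pass method: one scan over the dataset, assigning each object
    to the most similar existing centroid or starting a new cluster."""
    clusters = []  # each cluster: {"members": [...], "centroid": [...]}
    for obj in objects:
        # step 2: compare the object with every existing cluster centroid
        best, best_sim = None, -1.0
        for c in clusters:
            sim = cosine(obj, c["centroid"])
            if sim > best_sim:
                best, best_sim = c, sim
        # step 3: join the best cluster if similar enough, else seed a new one
        if best is not None and best_sim >= threshold:
            best["members"].append(obj)
            n = len(best["members"])
            # recompute the centroid as the arithmetic mean of the members
            best["centroid"] = [sum(v[i] for v in best["members"]) / n
                                for i in range(len(obj))]
        else:
            # step 1 (and new-cluster case): the object becomes the centroid
            clusters.append({"members": [obj], "centroid": list(obj)})
    return clusters

data = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
result = single_pass_cluster(data, threshold=0.8)
```

Note how the outcome depends on the order of the input, as the text warns: the first object always seeds the first cluster.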
Hierarchical Agglomerative methods
The hierarchical agglomerative clustering methods are most commonly used. The construction of a
hierarchical agglomerative classification can be achieved by the following general algorithm.
1. Find the 2 closest objects and merge them into a cluster
2. Find and merge the next two closest points, where a point is either an individual object or a cluster
of objects.
3. If more than one cluster remains, return to step 2.
Individual methods are characterized by the definition used for identification of the closest pair of points,
and by the means used to describe the new cluster when two clusters are merged.
There are two general approaches to the implementation of this algorithm, the stored matrix approach and
the stored data approach, discussed below.
In the stored matrix approach, an N*N matrix containing all pairwise distance values is first
created, and updated as new clusters are formed. This approach has at least an O(N^2) time
requirement, rising to O(N^3) if a simple serial scan of the dissimilarity matrix is used to identify the
points which need to be fused in each agglomeration, a serious limitation for large N.
The stored data approach requires the recalculation of pairwise dissimilarity values for each of the
N-1 agglomerations; the O(N) space requirement is therefore achieved at the expense of an
O(N^3) time requirement.
The Single Link Method (SLINK)
The single link method is probably the best known of the hierarchical methods and operates by
joining, at each step, the two most similar objects, which are not yet in the same cluster. The name single
link thus refers to the joining of pairs of clusters by the single shortest link between them.
The Complete Link Method (CLINK)
The complete link method is similar to the single link method except that it uses the least similar
pair between two clusters to determine the inter-cluster similarity (so that every cluster member is more
like the furthest member of its own cluster than the furthest item in any other cluster). This method is
characterized by small, tightly bound clusters.
The Group Average Method
The group average method relies on the average value of the pairwise similarities within a cluster, rather than
the maximum or minimum similarity as with the single link or the complete link methods. Since all objects
in a cluster contribute to the inter-cluster similarity, each object is, on average, more like every other
member of its own cluster than the objects in any other cluster.
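The three linkage definitions above can be compared side by side in a naive stored-data Python sketch of the general agglomerative algorithm (the O(N^3) variant discussed earlier). The Euclidean distance function, the sample points, and the target cluster count are illustrative assumptions, not part of the notes.

```python
import math
from itertools import combinations

def dist(a, b):
    # Euclidean distance between two points (math.dist needs Python 3.8+)
    return math.dist(a, b)

# inter-cluster distance under each linkage; A and B are lists of points
LINKAGES = {
    "single":   lambda A, B: min(dist(p, q) for p in A for q in B),
    "complete": lambda A, B: max(dist(p, q) for p in A for q in B),
    "average":  lambda A, B: sum(dist(p, q) for p in A for q in B)
                             / (len(A) * len(B)),
}

def agglomerate(points, k, linkage="single"):
    """Repeatedly fuse the closest pair of clusters until only k remain."""
    link = LINKAGES[linkage]
    clusters = [[p] for p in points]  # begin with N singleton clusters
    while len(clusters) > k:
        # find the closest pair of clusters under the chosen linkage
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: link(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]  # fuse the pair
        del clusters[j]
    return clusters

pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
out = agglomerate(pts, k=2, linkage="complete")
```

Swapping linkage="single" or linkage="average" changes only the definition used to identify the closest pair, exactly as the text describes.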
Text Based Documents
For text-based documents, the clusters may be formed by using as the similarity criterion a set of
keywords, each of which occurs at least some minimum number of times in a document. When a query arrives
for a particular word, instead of checking the entire database, only the cluster which
has that word in its list of keywords is scanned, and the result is returned. The order of the documents in the
result depends on the number of times that keyword appears in each document.
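A tiny sketch of this idea follows. The cluster index layout, the cluster and document names, and the sample texts are all hypothetical, invented only to illustrate scanning a single keyword cluster and ranking by term frequency.

```python
# Hypothetical cluster index: each cluster records the keywords its documents share.
clusters = {
    "c1": {"keywords": {"warehouse", "olap"},
           "docs": {"d1": "warehouse olap warehouse", "d2": "olap cube"}},
    "c2": {"keywords": {"cluster", "mining"},
           "docs": {"d3": "cluster mining cluster cluster", "d4": "mining rules"}},
}

def search(term):
    """Scan only the clusters whose keyword list contains the term,
    then rank the member documents by how often the term appears."""
    hits = []
    for cid, c in clusters.items():
        if term in c["keywords"]:                 # skip irrelevant clusters
            for doc_id, text in c["docs"].items():
                freq = text.split().count(term)
                if freq:
                    hits.append((doc_id, freq))
    hits.sort(key=lambda h: -h[1])                # most frequent first
    return [doc_id for doc_id, _ in hits]

print(search("cluster"))  # only cluster c2 is scanned
```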
3. APPLICATIONS
Data clustering has an immense number of applications in every field of life. One has to cluster a lot
of things on the basis of similarity, either consciously or unconsciously, so the history of data clustering is
as old as the history of mankind.
In the field of computing, too, data clustering has its own value. Especially in the field of
information retrieval, data clustering plays an important role. Some of the applications are listed below.
Similarity searching in Medical Image Database
This is a major application of the clustering technique. In order to detect many diseases such as tumors,
the scanned pictures or the X-rays are compared with the existing ones and the dissimilarities are
recognized.
We have clusters of images of different parts of the body. For example, the images of the CT Scan
of brain are kept in one cluster. To further arrange things, the images in which the right side of the brain is
damaged are kept in one cluster. The hierarchical clustering is used. The stored images have already been
analyzed and a record is associated with each image. In this form a large database of images is maintained
using the hierarchical clustering.
Now when a new query image arrives, it is first determined to which particular cluster this image
belongs, and then by similarity matching with a healthy image of that specific cluster the main damaged
or diseased portion is identified. The image is then sent to that specific cluster and matched
with all the images in that particular cluster. The image with which the query image has the most
similarities is retrieved, and the record associated with that image is also associated with the query image. This
means that the disease of the query image has now been detected.
Using this technique and some very precise methods for pattern matching, diseases such as very
fine tumors can also be detected.
So by using clustering, an enormous amount of time in finding the exact match from the database is
saved.
Data Mining
Another important application of clustering is in the field of data mining. Data mining is defined as
follows.
Definition 1: "Data mining is the process of discovering meaningful new correlations, patterns and trends
by sifting through large amounts of data, using pattern recognition technologies as well as statistical and
mathematical techniques."
Definition 2: Data mining is a "knowledge discovery process of extracting previously unknown, actionable
information from very large databases."
Use of Clustering in Data Mining: Clustering is often one of the first steps in data mining analysis. It
identifies groups of related records that can be used as a starting point for exploring further relationships.
This technique supports the development of population segmentation models, such as demographic-based
customer segmentation. Additional analyses using standard analytical and other data mining techniques
can determine the characteristics of these segments with respect to some desired outcome. For example,
the buying habits of multiple population segments might be compared to determine which segments to
target for a new sales campaign.
For example, a company that sells a variety of products may need to know about the sales of all of its
products in order to check which products are selling extensively and which are lagging. This is done by
data mining techniques. But if the system clusters the products that are selling poorly, then only the
cluster of such products has to be checked, rather than comparing the sales values of all the
products. This actually facilitates the mining process.
Windows NT
Another major application of clustering is in newer versions of Windows NT. Windows NT uses
clustering to determine the nodes that are using the same kind of resources and to group them into one
cluster. This new cluster can then be controlled as one node.
Partitioning methods:
Given a database of n objects or data tuples, a partitioning method constructs k partitions of the
data, where each partition represents a cluster and k <= n. That is, it classifies the data into k groups, which
together satisfy the following requirements:
(1) each group must contain at least one object, and
(2) each object must belong to exactly one group. Notice that the second requirement can be relaxed in
some fuzzy partitioning techniques.
Given k, the number of partitions to construct, a partitioning method creates an initial partitioning.
It then uses an iterative relocation technique that attempts to improve the partitioning by moving
objects from one group to another. The general criterion of a good partitioning is that objects in the
same cluster are "close" or related to each other, whereas objects of different clusters are "far apart" or
very different. There are various other kinds of criteria for judging the quality of partitions.
To achieve global optimality, partitioning-based clustering would require the exhaustive
enumeration of all of the possible partitions. Instead, most applications adopt one of two popular
heuristic methods:
1. the k-means algorithm, where each cluster is represented by the mean value of the objects in the
cluster, and
2. the k-medoids algorithm, where each cluster is represented by one of the objects located near the
center of the cluster. These heuristic clustering methods work well for finding spherical-shaped clusters
in small to medium-sized databases. To find clusters with complex shapes and for clustering very large
data sets, partitioning-based methods need to be extended. Partitioning-based clustering methods are
studied in depth later.
Hierarchical methods:
A hierarchical method creates a hierarchical decomposition of the given set of data objects. A
hierarchical method can be classified as being either agglomerative or divisive, based on how the
hierarchical decomposition is formed. The agglomerative approach, also called the bottom-up approach,
starts with each object forming a separate group. It successively merges the objects or groups close to one
another, until all of the groups are merged into one (the topmost level of the hierarchy), or until a
termination condition holds.
The divisive approach, also called the top-down approach, starts with all the objects in the same
cluster. In each successive step a cluster is split into smaller clusters, until eventually each object is in its
own cluster, or until a termination condition holds. Hierarchical
methods suffer from the fact that once a step (merge or split) is done, it can never be undone. This rigidity
is useful in that it leads to smaller computation costs by not worrying about a combinatorial number of
different choices. However, a major problem of such techniques is that they cannot correct erroneous
decisions. There are two approaches to improving the quality of hierarchical clustering: (1) perform
careful analysis of object "linkages" at each hierarchical partitioning, such as in CURE
and Chameleon, or (2) integrate hierarchical agglomeration and iterative relocation by first using a
hierarchical agglomerative algorithm and then refining the result using iterative relocation, as in BIRCH.
Density-based methods:
Most partitioning methods cluster objects based on the distance between objects. Such methods can
find only spherical-shaped clusters and encounter difficulty in discovering clusters of arbitrary shapes.
Other clustering methods have been developed based on the notion of density. Their general idea is to
continue growing the given cluster as long as the density (number of objects or data points) in the
"neighborhood" exceeds some threshold; that is, for each data point within a given cluster, the
neighborhood of a given radius has to contain at least a minimum number of points. Such a method
can be used to filter out noise (outliers) and discover clusters of arbitrary shape.
DBSCAN is a typical density-based method that grows clusters according to a density threshold. OPTICS
is a density-based method that computes an augmented cluster ordering for automatic and interactive
cluster analysis.
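The density idea underlying DBSCAN can be sketched as follows. This is only an illustrative, naive version: the eps radius, the min_pts threshold, and the sample points are assumed values, and the neighborhood search rescans all points (an O(N^2) approach) instead of using a spatial index as a real implementation would.

```python
import math

def dbscan(points, eps, min_pts):
    """Grow a cluster while the eps-neighborhood of each core point
    contains at least min_pts points; leftover points are noise (-1)."""
    labels = {i: None for i in range(len(points))}

    def neighbors(i):
        # all points (including i itself) within radius eps of point i
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue                       # already assigned or marked noise
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1                 # not dense enough: noise
            continue
        labels[i] = cluster_id             # i is a core point: start a cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id     # noise reachable from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            jn = neighbors(j)
            if len(jn) >= min_pts:         # j is itself a core point: expand
                queue.extend(jn)
        cluster_id += 1
    return labels

pts = [(0, 0), (0, 0.5), (0.5, 0), (10, 10)]
labels = dbscan(pts, eps=1.0, min_pts=3)
```

With these assumed parameters the three nearby points form one dense cluster and the distant point is filtered out as noise, illustrating how a density threshold separates clusters of arbitrary shape from outliers.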
Grid-based methods:
Grid-based methods quantize the object space into a finite number of cells that form a grid
structure. All of the clustering operations are performed on the grid structure (i.e., on the quantized
space). The main advantage of this approach is its fast processing time, which is typically independent of the
number of data objects and dependent only on the number of cells in each dimension of the quantized
space. STING is a typical example of a grid-based method. CLIQUE and WaveCluster are two clustering
algorithms that are both grid-based and density-based.
Model-based methods:
Model-based methods hypothesize a model for each of the clusters and find the best fit of the data to the given
model. A model-based algorithm may locate clusters by constructing a density function that reflects the
spatial distribution of the data points. It also leads to a way of automatically determining the number of
clusters based on standard statistics, taking "noise" or outliers into account and thus yielding robust
clustering methods. Model-based clustering methods are studied below.
Some clustering algorithms integrate the ideas of several clustering methods, so that it is sometimes
difficult to classify a given algorithm as uniquely belonging to only one clustering method category.
Furthermore, some applications may have clustering criteria that require the integration of several
clustering techniques.
In the following sections, we examine each of the above five clustering methods in detail. We also
introduce algorithms that integrate the ideas of several clustering methods. Outlier analysis, which
typically involves clustering, is described at the end of this section.
Partitioning Methods
Given a database of n objects and k, the number of clusters to form, a partitioning algorithm
organizes the objects into k partitions (k <= n), where each partition represents a cluster. The clusters are
formed to optimize an objective partitioning criterion, often called a similarity function, such as distance,
so that the objects within a cluster are "similar," whereas the objects of different clusters are "dissimilar" in
terms of the database attributes.
Classical Partitioning Methods: k-Means and k-Medoids
The most well-known and commonly used partitioning methods are k-means, k-medoids, and their
variations.
Centroid-Based Technique: The k-Means Method
The k-means algorithm takes the input parameter, k, and partitions a set of n objects into k
clusters so that the resulting intracluster similarity is high but the intercluster similarity is low. Cluster
similarity is measured in regard to the mean value of the objects in a cluster, which can be viewed as
the cluster's center of gravity.
"How does the k-means algorithm work?" The k-means algorithm proceeds as follows. First, it
randomly selects k of the objects, each of which initially represents a cluster mean or center. Each of
the remaining objects is assigned to the cluster to which it is the most similar, based on the
distance between the object and the cluster mean. It then computes the new mean for each cluster. This
process iterates until the criterion function converges. Typically, the squared-error criterion is
used, defined as
E = sum_{i=1..k} sum_{p in Ci} |p - mi|^2
where E is the sum of squared error for all objects in the database, p is the point in space
representing a given object, and mi is the mean of cluster Ci (both p and mi are multidimensional). This
criterion tries to make the resulting k clusters as compact and as separate as possible. The algorithm
attempts to determine k partitions that minimize the squared-error function. It works well when the clusters are
compact clouds that are rather well separated from one another. The method is relatively scalable and
efficient in processing large data sets because the computational complexity of the algorithm is
O(nkt), where n is the total number of objects, k is the number of clusters, and t is the number of
iterations. Normally, k << n and t << n. The method often terminates at a local optimum.
The k-means method, however, can be applied only when the mean of a cluster is defined. This may
not be the case in some applications, such as when data with categorical attributes are involved. The
necessity for users to specify k, the number of clusters, in advance can be seen as a disadvantage. The k-
means method is not suitable for discovering clusters with nonconvex shapes or clusters of very different
size. Moreover, it is sensitive to noise and outlier data points, since a small number of such data can
substantially influence the mean value.
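The procedure just described can be sketched directly in Python, including the squared-error criterion E. The sample points, the fixed random seed, and the iteration cap are illustrative assumptions for the sketch.

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    """k-means: assign each object to the nearest cluster mean,
    recompute the means, and repeat until assignments stop changing."""
    rng = random.Random(seed)
    means = rng.sample(points, k)          # randomly select k objects as centers
    assign = None
    for _ in range(iters):
        # assign every point to the cluster with the closest mean
        new_assign = [min(range(k), key=lambda c: math.dist(p, means[c]))
                      for p in points]
        if new_assign == assign:
            break                          # criterion function has converged
        assign = new_assign
        # recompute each cluster mean from its current members
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                means[c] = tuple(sum(x) / len(members)
                                 for x in zip(*members))
    # squared-error criterion: E = sum over clusters, p in Ci of |p - mi|^2
    E = sum(math.dist(p, means[a]) ** 2 for p, a in zip(points, assign))
    return means, assign, E

pts = [(0, 0), (0, 1), (9, 9), (9, 10)]
means, assign, E = kmeans(pts, k=2)
```

Each loop iteration costs O(nk) work, matching the O(nkt) complexity stated above, and the final E is what the algorithm has (locally) minimized.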
Hierarchical Methods
A hierarchical clustering method works by grouping data objects into a tree of clusters. Hierarchical
clustering methods can be further classified into agglomerative and divisive hierarchical
clustering, depending on whether the hierarchical decomposition is formed in a bottom-up or top-down
fashion. The quality of a pure hierarchical clustering method suffers from its inability to perform
adjustment once a merge or split decision has been executed. Recent studies have emphasized the
integration of hierarchical agglomeration with iterative relocation methods.
Agglomerative and Divisive Hierarchical Clustering
In general, there are two types of hierarchical clustering methods:
Agglomerative hierarchical clustering: This bottom-up strategy starts by placing each object in
its own cluster and then merges these atomic clusters into larger and larger clusters, until all of the objects
are in a single cluster or until certain termination conditions are satisfied. Most hierarchical clustering
methods belong to this category. They differ only in their definition of inter-cluster similarity.
Divisive hierarchical clustering: This top-down strategy does the reverse of agglomerative
hierarchical clustering by starting with all objects in one cluster. It subdivides the cluster into smaller and
smaller pieces, until each object forms a cluster on its own or until it satisfies certain termination
conditions, such as a desired number of clusters being obtained or the distance between the two closest
clusters being above a certain threshold. Four widely used measures for distance between clusters are
as follows, where |p - p'| is the distance between two objects or points p and p', mi is the mean for cluster Ci,
and ni is the number of objects in Ci:
Minimum distance: d_min(Ci, Cj) = min over p in Ci, p' in Cj of |p - p'|
Maximum distance: d_max(Ci, Cj) = max over p in Ci, p' in Cj of |p - p'|
Mean distance: d_mean(Ci, Cj) = |mi - mj|
Average distance: d_avg(Ci, Cj) = (1 / (ni * nj)) * sum over p in Ci, p' in Cj of |p - p'|
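The four measures can be computed side by side for two small example clusters; the two clusters below are illustrative, and Euclidean distance is assumed for |p - p'|.

```python
import math

def cluster_distances(Ci, Cj):
    """Compute the four standard inter-cluster distance measures."""
    pairs = [(p, q) for p in Ci for q in Cj]
    # cluster means mi and mj for the mean-distance measure
    mi = tuple(sum(x) / len(Ci) for x in zip(*Ci))
    mj = tuple(sum(x) / len(Cj) for x in zip(*Cj))
    return {
        "minimum": min(math.dist(p, q) for p, q in pairs),
        "maximum": max(math.dist(p, q) for p, q in pairs),
        "mean":    math.dist(mi, mj),
        "average": sum(math.dist(p, q) for p, q in pairs)
                   / (len(Ci) * len(Cj)),
    }

d = cluster_distances([(0, 0), (0, 2)], [(4, 0), (4, 2)])
```

Note that the minimum and maximum measures correspond to the single link and complete link criteria discussed earlier, while the average measure underlies the group average method.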
DATA MINING APPLICATIONS
Science: Chemistry, Physics, Medicine
o Biochemical analysis
o Remote sensors on a satellite
o Telescopes - star galaxy classification
o Medical image analysis
Bioscience
o Sequence-based analysis
o Protein structure and function prediction
o Protein family classification
o Microarray gene expression
Pharmaceutical companies, Insurance and Health care, Medicine
o Drug development
o Identify successful medical therapies
o Claims analysis, fraudulent behavior
o Medical diagnostic tools
o Predict office visits
Financial Industry, Banks, Businesses, E-commerce
o Stock and investment analysis
o Identify loyal customers vs. risky customers
o Predict customer spending
o Risk management
o Sales forecasting