Data Warehousing and Data Mining

Dr. G. Raghavendra Rao,
Professor and Head,
Dept. of CS&E, NIE, Mysore.

Sri. K. Raghuveer,
Asst. Professor,
Dept. of CS&E, NIE, Mysore.


Data Warehousing and Data Mining


Course Introduction
In this course, you will learn about the concepts of data warehousing and data
mining. This is one of the recent additions to the IT area and a lot of work is still going
on. You will be introduced to the present status of the topic and will also be introduced
to the future trends.
The concept of data warehousing is a logical extension of the DBMS concept. In
the case of databases, static data is stored and user-generated queries are answered by
the system. Data warehousing is a more complete process, wherein, apart from getting
answers, you can also feed these answers into standard models, like business models,
economic models etc., study their performance and modify the parameters suitably. The
query answers in a database are an end in themselves, but in a data warehouse they serve
as valuable inputs to the Decision Support System (DSS).
The first unit speaks about the characteristics of a typical data warehouse, data
marts, the types of data models, and other details. Most importantly, you will also be
introduced to some of the existing software support in the area of data warehousing. At
the end of this unit, you should be able to design your own data warehouse, given the
requirements.
The second unit is about data mining. The concept is simple: given a large
repository of data, such as a data warehouse, you should be able to search for the desired
data with utmost efficiency. This concept is called mining of data. You will be taught
some of the algorithms in this context.
The third unit is a case study of the concepts involved. The case study will help
you to understand the fundamentals in a better way.



UNIT I
UNIT INTRODUCTION

In this unit, you are introduced to the basic concepts of a data warehouse.
The first block deals with the fundamental definitions of a data warehouse, how it
extends the database concept while still differing from a DBMS, and the various
terminologies used in the context of a data warehouse.
The second block describes, in detail, a typical data warehouse system and the
environment in which it operates. The various data models for the data warehouse, the
various software tools used and the two broad classifications of data warehouses, namely
the relational and the multidimensional warehouses, are discussed. You will also be
introduced to some of the software available to a data warehouse designer.
The third block gives a step by step method of developing a typical data
warehouse. It begins with the choice of the data to be put in the data warehouse, the
concept of metadata, the various hardware and software options and the role of various
access tools. It also gives a step by step algorithm for the actual development process
of a data warehouse.



BLOCK I

BLOCK INTRODUCTION
In this block, you are briefly introduced to the concept of a data warehouse. It is
basically a large storehouse of data, from which users can access and view data in
various formats. Further, it allows users to perform computations, so that they can derive
maximum benefit from the data. For example, having data about the previous 5 years'
sales is good, but the ability to perform computations on them so that you can predict the
sales for the next year is much more welcome. Similarly, if you can derive business
patterns by comparing your business performance with that of your competitors, it will
be excellent. A data warehouse provides exactly such opportunities.
You will also be introduced to the concept of a datamart, which can be thought of
as a subsection of a warehouse that allows you to view only the data of interest to you,
irrespective of how big the actual database is. You will also be introduced to the basic
requirements that the data warehouse or the datamart should satisfy, and to some of the
implementation issues involved.

Contents
1. Data warehousing
2. Datamart
3. Types of data warehouses
4. Loading of data into a mart
5. Data model for data warehouse
6. Maintenance of data warehouse
7. Metadata
8. Software components
9. Security of data warehouse
10. Monitoring of a data warehouse
11. Block summary
BLOCK I

1. DATA WAREHOUSING
A Data Warehouse can be thought of on lines similar to other ware house i.e. a
place where selected and (sometimes) modified operationa l data are stored. This data can
answer any query which may be complex, statistical or analytical. Data warehouse is
normally the heart of a decision support system of an organization (DSS). It is essential
for effective business operation, manipulation, strategy planning and historical data
recording. With the business scenario becoming more and more competitive and the
amount of data to be processed, according to a rough and estimation, getting doubled
every 2 years, the need for a very fast, accurate and reliable data warehouse need not be
over emphasized. A proper organization and fast retrieval of data are the keys for the
effective operation of a data warehouse.
Unfortunately, such rhetoric explains the situation only up to a certain degree.
When one gets down to brass tacks on the actual organization of such a warehouse,
questions arise: are there any simple rules that govern data warehousing operations? Do
the rules remain the same for all types of data and all types of analysis? What are the
tradeoffs involved? How reliable and effective are such warehouses? This course
answers some of these questions. However, the concept of a data warehouse is a relatively
new one and is still undergoing a lot of transformation. Hence, in this work you will find
only pointers and guidelines for the effective operation of such a warehouse; the actual
operation needs a lot more skill than simple knowledge of the ground rules.
Before we venture into the warehouse details, let us see what types of data will
essentially be handled there. Normally, they are classified into 3 groups.
i. Transaction and reference data: These are the original data, arising out of the
various source systems, and are comparable to the data in a database. They are
loaded at regular intervals and are also removed (purged) when their useful lifespan
is over. However, the purged data is normally archived onto tape or other such devices.

ii. Derived data (or secondary data), as the name suggests, are derived from the
reference data (normally as a result of certain computations).

iii. Denormalised data is prepared periodically for online processing, but unlike
derived data, it is based directly on the transaction data (not on computations).

2. DATA MART

A data warehouse holds the entire data, of which the various departments of the
decision support system would need only portions. These portions are drawn into data
marts. They can be viewed as subsets of a data warehouse. They have certain advantages
over a central data warehouse. The latter keeps growing and becomes complex, unwieldy
and difficult to understand and manage after a time. It becomes difficult to customize,
maintain and keep track of. Further, as the volume of data increases, the software needed
to access the data becomes more complex and the time complexities also increase. (It can
be compared to a very large library, where searching for the book you need becomes that
much more difficult.)

On the other hand, a data mart can be thought of as a small departmental library,
which is small, elegant and easy to handle and customize. It is easy to sort, search or
structure the data elements without any global considerations. Obviously, the hardware
and software demands are manageable. Once in a while, some data may not be available
in the datamart, but it can easily be traced back to the data warehouse.

It is to be noted that the type of data that flows from the warehouse to the data
mart is of the current-level type. The derived data and denormalised data are to be
prepared at the data mart level itself.

However, many of the issues that affect the warehouse affect the data marts also.
For simplicity, you can view the data warehouse as a collection of several data marts.
Hence, whatever we say about a data mart can be extended to a warehouse and vice versa,
unless specified otherwise.

3. TYPES OF DATA WAREHOUSES




Data warehouses are basically of two types: multidimensional and relational. In a
multidimensional data warehouse, the data is stored in such a way that its
multidimensionality is not lost. Contrast this with the RDBMS method of storage,
wherein the data is essentially stored as tables and its multidimensionality is lost. Thus,
in a multidimensional warehouse, queries can be asked on the multidimensionality of the
data. At this stage, we cannot describe the operation of such warehouses, but it suffices
to say that specialized search engines are needed to support such models.

On the other hand, relational warehouses contain both text and numeric data and
are supported by an RDBMS. They are used for general-purpose analysis.

4. LOADING OF DATA INTO A DATA MART

The data mart is loaded from the data warehouse using a load program. The load
program takes care of the following factors before loading the data from the data
warehouse into the data mart (a minimal sketch of such a load step follows the list):

i) Frequency and schedule, i.e. when and how often the data is to be loaded.

ii) Total or partial refreshment, i.e. whether all the data in the data mart is
modified (replaced) or only a part of it.

iii) Selection, resequencing and merging of data, when required.

iv) Efficiency (or speed) of loading.

v) Integrity of data, i.e. the data should not get unintentionally modified during
the transfer, and the data in the data mart should, at all times, match that in the
warehouse.
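The following is a minimal sketch, in Python with SQLite, of how such a load step might be written. The table names (warehouse_sales, mart_sales), the column list and the row-count integrity check are illustrative assumptions, not part of any particular product.

import sqlite3

def load_mart(warehouse_db, mart_db, full_refresh=True, since=None):
    """Copy current-level data from the warehouse into the data mart.

    full_refresh=True replaces the whole mart table (total refreshment);
    otherwise only rows newer than 'since' are appended (partial refreshment).
    """
    wh = sqlite3.connect(warehouse_db)
    mart = sqlite3.connect(mart_db)

    if full_refresh:
        mart.execute("DELETE FROM mart_sales")          # total refreshment
        rows = wh.execute("SELECT sale_date, region, amount FROM warehouse_sales")
    else:                                               # partial refreshment
        rows = wh.execute(
            "SELECT sale_date, region, amount FROM warehouse_sales WHERE sale_date > ?",
            (since,))

    mart.executemany("INSERT INTO mart_sales VALUES (?, ?, ?)", rows)
    mart.commit()

    # Integrity check: after a full refresh the mart should match the warehouse.
    wh_count = wh.execute("SELECT COUNT(*) FROM warehouse_sales").fetchone()[0]
    mart_count = mart.execute("SELECT COUNT(*) FROM mart_sales").fetchone()[0]
    if full_refresh and wh_count != mart_count:
        raise RuntimeError("Load failed integrity check: row counts differ")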

5. DATA MODEL FOR DATA WAREHOUSE


A model is to be built into the data warehouse when large amounts of data are to
be stored. Though the data model for the data warehouse need not correspond to any of
the standard RDBMS models, it is desirable to choose one that is similar to a standard
one.
6. MAINTENANCE OF A DATA WAREHOUSE:



All data warehouses (as also DBMSs, for that matter) need periodic maintenance,
i.e. loading, refreshing and purging of data. Loading can be from the data warehouse (in
the case of a data mart) or from the systems producing the data (in the case of a data
warehouse). Refreshing means updating the data, may be on a daily, weekly or monthly
basis. Purging of data means reading the data periodically and weeding out old data. The
data to be weeded out may be totally removed, archived or condensed, depending on the
nature of the purging, but most often it is archived.
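As a rough illustration only, a purge step of the kind described above might look like the sketch below; the table names, the two-year retention period and the archive-then-delete policy are assumptions made for this example.

import sqlite3
from datetime import date, timedelta

RETENTION_DAYS = 730   # assumed two-year useful lifespan

def purge_old_data(db_path):
    """Weed out old rows: archive them first, then remove them from the live table."""
    cutoff = (date.today() - timedelta(days=RETENTION_DAYS)).isoformat()
    con = sqlite3.connect(db_path)

    # Archive rather than delete outright, as is most often done in practice.
    con.execute(
        "INSERT INTO sales_archive SELECT * FROM sales WHERE sale_date < ?", (cutoff,))
    con.execute("DELETE FROM sales WHERE sale_date < ?", (cutoff,))
    con.commit()
    con.close()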

7. METADATA
Most data warehouses and data marts come with metadata (data about data).
The metadata is a description of the contents and sources of the data in the warehouse,
the type of customization applied to the data, descriptions of the data marts of the
warehouse, its tables, relationships etc., and any other relevant details about the data
warehouse. The metadata is created and updated from time to time and is very useful
for the data analyst or the systems manager when manipulating the warehouse.
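Purely as an illustration, a metadata entry for one table of the warehouse could be recorded as a simple structure like the following; real metadata repositories are far richer, and every field name and value here is an assumption.

# A minimal, hypothetical metadata record for one table of the warehouse.
sales_metadata = {
    "table": "sales",
    "source": "branch order-entry system, extracted nightly",
    "columns": {
        "sale_date": "DATE, local calendar",
        "region":    "TEXT, one of the company's sales regions",
        "amount":    "REAL, value after currency conversion",
    },
    "customization": "amounts converted to a single currency during load",
    "related_marts": ["sales_mart", "finance_mart"],
    "last_refreshed": "2003-01-31",
}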

8. SOFTWARE COMPONENTS
The software that goes into the data warehouse varies depending on the context
and purpose of the warehouse, but normally includes a DBMS along with access, creation
and management software.

9. SECURITY OF A DATA WAREHOUSE:


The data of the warehouse needs to be protected against physical as well as
software intrusions. If unauthorized access, modification and deletion are to be
prevented, security only at the warehouse level does not suffice; it has to be implemented
at the datamart level also, i.e. a person authorized for one data mart may be prevented
from approaching some other data mart, or the authorization may be valid only for
portions of a data mart. The warehouse administrator is responsible for implementing
these security measures. The normal methods used are: i) firewalls, which are software
that prevents unauthorized access into the data warehouse or data mart; ii) logon/logoff
passwords, which prevent unauthorised login or logout; iii) application-based security
procedures; and iv) encryption and decryption, where the appearance of the data is
modified to prevent unauthorized users from reading it.

10. MONITORING THE REQUIREMENTS OF A DATA WAREHOUSE

The performance and contents of the warehouse need to be monitored closely,
usually by the system manager or data administrator. Monitoring is normally done by
"data content tracking". It keeps track of the actual contents of the data mart, accesses to
invalid or obsolete data in the warehouse, the rate and kind of warehouse growth,
consistency issues (between the previously present data and the newly acquired data) etc.
While the monitoring of data is often a transparent operation, its success is very
important for ensuring the continued usefulness and reliability of the warehouse to the
common user.
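The sketch below hints at what a very small data content tracking routine might look like; the table name, the growth-rate threshold and the "no future-dated rows" consistency check are purely illustrative assumptions.

import sqlite3

def track_content(db_path, previous_count, max_growth=2.0):
    """Report the current size of the mart, its growth since the last check,
    and a simple consistency signal (no rows dated in the future)."""
    con = sqlite3.connect(db_path)
    current = con.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
    future_rows = con.execute(
        "SELECT COUNT(*) FROM sales WHERE sale_date > DATE('now')").fetchone()[0]
    con.close()

    growth = current / previous_count if previous_count else float("inf")
    return {
        "rows": current,
        "growth_factor": growth,
        "growth_suspicious": growth > max_growth,   # unusually fast growth
        "inconsistent_rows": future_rows,           # simple consistency issue
    }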

11. SUMMARY
In this block, you were introduced to the basic concepts of a data mart, a data
warehouse, the types of data warehouses and how a data warehouse differs from a
database: essentially, the warehouse is a multidimensional concept whereas a database is
a relational one.

You were also introduced to the concept of metadata, which contains data about
the data of the warehouse and helps the future users as well as developers to deal with the
warehouse. You were also introduced to the issues of data warehouse security,
consistency of data, and data integrity.

Many of these aspects will be elaborated in the next blocks.

Review Questions
1) A _________________ holds only relevant portions of the data held by a data
warehouse.

2) A data warehouse normally holds _________________ data, whereas an
RDBMS handles relational data.

3) Replacement of old data values by the latest values is called _________________

4) Removal of obsolete data is called _________________

5) Data about the data available in a datamart is called _________________

6) Monitoring of data content is called _________________

7) Obsolete data in an RDBMS is deleted, whereas in a data mart it is usually
_________________

8) Data derived after computations on the primary data is called
_________________

9) The compatibility between the previously available data and new data is
called _________________

10) The main difference between transaction data and derived data lies in
_________________

Answers
1. Datamart
2. Multidimensional
3. Refreshing
4. Purging
5. Metadata
6. Data content tracking
7. Archived
8. Derived data
9. Consistency
10. Computations


BLOCK - II
A TYPICAL DATA WAREHOUSE SYSTEM

In this block, we introduce you to the fundamentals of an actual data warehouse.
It is presumed that you have at least a preliminary knowledge of the concept of a
database; the concepts of the data warehouse are developed in relation to it.

The concept of the star model is introduced, which is central to the concept of data
warehouse development.

Most importantly, you will be introduced to some of the existing tools available in
the market, like IBI Focus Fusion, Cognos PowerPlay and Pilot Software. While it is
unlikely that you will be using any of them in this course, their study will help you
understand the complexities involved, the various tradeoffs available and the concept of
step by step development of the warehouse environment.

Contents
1. A typical data warehouse
2. A typical data warehousing environment
3. Data Modeling star schema for multidimensional view
4. Various Data models
5. Various OLAP tools
6. Relational OLAP
7. Managed query environment
8. Data warehousing products: state of the art
IBI Focus Fusion
Cognos Power Play
Pilot Software
9. Summary

1. A TYPICAL DATA WAREHOUSE SYSTEM

A data warehouse system should be able to accept queries from the user, analyze
them and give the results. Such systems are sometimes called Online Analytical
Processing (OLAP) systems. Contrast this with the conventional online systems, which,
for distinguishing purposes, we call Online Transaction Processing (OLTP) systems. An
OLTP system will most often be capable of only a few kinds of transactions, like entering
data, accessing records etc., whereas an OLAP system, or a warehousing system, should
be capable of analyzing, online, a large number of records with varying interrelationships
and of summarizing the results. The data is usually multidimensional in nature, i.e. each
data item is linked to several other data items in several other directions.
To make this concept clear, consider the following example. A student record has
his name, address, and marks scored in different subjects in different years, forming a
two-dimensional record. Now consider the following: using his address, say his city
name, you can find out more about his city, its tourist spots etc. Using the field of, say,
mathematics, you can find out how many students had registered for mathematics, what
their addresses are, and so on. In this way, each link takes you around the scenario as a
huge, multidimensional space. The system should be able not only to facilitate such
traversals, but also to consolidate the results of such traversals and present them in a
suitable format, all in real time. One thing you can be sure of is that such systems need
the capacity to store and process enormous amounts of data at very high speeds.
Now about the software. It is obvious that any conventional DBMS would,
theoretically, be able to do the desired operations, but with the increase in dimensional
complexity there will be a literal explosion in the SQL statements required to build the
query, with multiple joins, scans, aggregations and what not, and the operation would
need large amounts of space and time.
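To give a flavour of this point, the fragment below shows the kind of SQL that even one fairly simple analytical request (total and average marks per school, per subject, per year) expands into when the data sits in flat relational tables. The table and column names, and the database file, are invented for illustration only.

import sqlite3

# One analytical question, phrased against flat relational tables, already needs
# multiple joins and aggregations; each extra dimension multiplies such statements.
query = """
SELECT sc.school_name,
       su.subject_name,
       st.exam_year,
       SUM(st.marks) AS total_marks,
       AVG(st.marks) AS average_marks
FROM   student_marks st
       JOIN schools  sc ON sc.school_id  = st.school_id
       JOIN subjects su ON su.subject_id = st.subject_id
WHERE  st.exam_year IN (2001, 2002)
GROUP  BY sc.school_name, su.subject_name, st.exam_year
ORDER  BY sc.school_name, su.subject_name, st.exam_year
"""

con = sqlite3.connect("regional_results.db")   # assumed database file with these tables
for row in con.execute(query):
    print(row)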


Thus a typical data warehousing system, at first approximation, can be said to be a
"resource hungry" system, which cannot be handled by any conventional database
system.

Comparison between a Database System and a Data Warehouse System

1. Data warehouse: can handle both current and historic data, since the current data is
   appended to the historic data during updates.
   Database: can handle only current data; whenever an update is done, the new data
   replaces the existing data, so that no historic data is available.

2. Data warehouse: transactions can be very long and complex.
   Database: transactions are short, or at most combinations of such short transactions.

3. Data warehouse: the volume of transactions (number of transactions over a period of
   time) is low; data are periodically refreshed.
   Database: the volume of transactions is very high.

4. Data warehouse: no concurrent transactions are allowed, i.e. only one query can
   access the data at a time; hence transaction recovery is not an issue.
   Database: concurrent transactions are allowed; since multiple users can
   simultaneously access/update the data, the transactions may lead to erroneous results,
   and hence transaction recovery procedures need to be followed.

5. Data warehouse: queries are often predetermined, needing a high level of indexing.
   Database: transactions need a low level of indexing and hence can be large and online.

2. A TYPICAL DATA WAREHOUSING ENVIRONMENT




We study most database/data warehousing operations in terms of views at
different levels. Though the hardware/software views are exhaustive, most users would
like to be shielded from such details and would be happy to deal with the system at the
"user's level", i.e. they would hand over queries at the "analytical level", the system
would operate on them at the "operational level" and would hand over the required
results again at the "analytical level". The analytical level can be considered the level of
logical relationships, while the operational level is the level corresponding to computer
operations.

Look at the following schema


[Figure: Different views of a data warehouse. Data from Source A, Source B, Source C
and Source D at the operational level passes through extraction and transformation
before reaching the user's view.]

The different sources may be available on different storage devices. They are
extracted, put in a common place (only the relevant portions) and transformed, so that
they contain the required results. These are then presented to the user in response to his
queries.

3. DATA MODELING: STAR SCHEMA FOR MULTIDIMENSIONAL VIEW


In a data warehousing environment, since the data warehouse comprises a central
repository of data collected from different sources, maybe even at different periods of
time, the problem of presenting an integrated view to the user assumes prime importance.
Obviously, such divergent information cannot be integrated and used unless the sources
are modelled as independent entities. Further, such modelling should keep the end user's
perspective in mind. The better the understanding of the user's perspective, the more
effective and efficient the data warehouse operations will be. The warehouse designer
should have a thorough knowledge of the various requirements and their commonalities
in order to effectively capture the data model for the warehouse.

One obvious way of doing this is to view the divergent data entities that go into
the warehouse as individual tables. Such data sets can be thought of as forming a "star
schema", i.e. the individual tables beam into the warehouse, where they are denormalised
and integrated so as to be fit to be presented to the end user.

[Figure: One typical warehouse example. Subject-wise student results from School 1 and
School 2 feed into a regional result warehouse.]



While the data warehouse in the above example contains the results of the region,
the viewer may like to view them school-wise, subject-wise or student-wise. Each of
these views adds a 'dimension' to the view. Often, in large warehouses, the number of
possible dimensions will be too large to be comprehended at one go.

But how does the star schema actually work? From the database administrator's
angle, it is a relational schema (a table, in simple terms). A simple star schema has a
central "fact table" which contains raw, numeric facts. These facts are additive and are
accessed through the various dimensions. Since the fact tables contain the entire data of
the warehouse, they will normally be huge.

[Figure: Star schema for the above example. A central fact table holding the marks of the
region is surrounded by dimension tables such as Subject and School.]
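A minimal sketch of such a star schema, using SQLite from Python, is given below; the table and column names are assumptions chosen to match the regional results example, not a prescribed design.

import sqlite3

con = sqlite3.connect(":memory:")

# Dimension tables are small and descriptive, one per axis of analysis;
# the fact table holds the raw, additive, numeric facts keyed by the dimensions.
con.executescript("""
CREATE TABLE dim_school  (school_id  INTEGER PRIMARY KEY, school_name  TEXT);
CREATE TABLE dim_subject (subject_id INTEGER PRIMARY KEY, subject_name TEXT);
CREATE TABLE dim_student (student_id INTEGER PRIMARY KEY, student_name TEXT, city TEXT);

CREATE TABLE fact_marks (
    school_id  INTEGER REFERENCES dim_school(school_id),
    subject_id INTEGER REFERENCES dim_subject(subject_id),
    student_id INTEGER REFERENCES dim_student(student_id),
    exam_year  INTEGER,
    marks      REAL
);
""")

# A typical dimensional query: roll the facts up school-wise and subject-wise.
rollup = """
SELECT s.school_name, j.subject_name, SUM(f.marks) AS total_marks
FROM   fact_marks f
       JOIN dim_school  s ON s.school_id  = f.school_id
       JOIN dim_subject j ON j.subject_id = f.subject_id
GROUP  BY s.school_name, j.subject_name
"""
print(list(con.execute(rollup)))   # empty until facts are loaded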

4. THE VARIOUS DATA MODELS

Now, anybody with a little imagination can visualise that the simple star
structure depicted above suffices only for reasonably simple and straightforward
databases. A practical data warehouse will be much more complex, in terms of the
diversity of the subjects covered, their interrelationships and also the various
perspectives. In such a scenario, adding more dimensions, thus increasing the scope of
the attributes of the star schema, can solve the problem only up to an extent. But sooner
rather than later, this structure collapses. To avoid such breakdowns, a better technique
called the multifact star schema, or the snowflake schema, is used. The main problem of
the simple star schema, its inability to grow beyond reasonable dimensions, is overcome
by providing aggregations at different levels of hierarchies in a given dimension, i.e. a
given dimension is no longer just a collection of entities, but a collection of hierarchies,
each hierarchy being a collection of entities. This goal is achieved by normalising the
respective hierarchical dimensions into more detailed data sets to facilitate the
aggregation of fact data. The data warehouse itself may be a collection of different
groups, each group addressing specific performance requirements, thus catering to the
needs of specific users or user groups. Each group of fact data can be modelled using a
separate star schema.

In essence, we are not abandoning the star schema concept, but building on it.
Simply put, we are dividing the complex schema into smaller schemas (and these into
still smaller schemas, if necessary) and combining these back to get the complete data
model. Hence the name "multifact star schema" or "snowflake schema".
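The fragment below hints at how one dimension of the earlier star schema sketch could be normalised into a hierarchy (school, district, region) in the snowflake style; the particular hierarchy levels are an assumption made for illustration.

import sqlite3

con = sqlite3.connect(":memory:")

# In the snowflake (multifact star) style, the school dimension is no longer a
# single flat table but a small hierarchy of normalised tables.
con.executescript("""
CREATE TABLE dim_region   (region_id   INTEGER PRIMARY KEY, region_name   TEXT);
CREATE TABLE dim_district (district_id INTEGER PRIMARY KEY, district_name TEXT,
                           region_id   INTEGER REFERENCES dim_region(region_id));
CREATE TABLE dim_school   (school_id   INTEGER PRIMARY KEY, school_name   TEXT,
                           district_id INTEGER REFERENCES dim_district(district_id));

CREATE TABLE fact_marks (
    school_id INTEGER REFERENCES dim_school(school_id),
    exam_year INTEGER,
    marks     REAL
);
""")

# Aggregation can now be asked for at any level of the hierarchy,
# e.g. total marks per district rather than per school.
per_district = """
SELECT d.district_name, SUM(f.marks)
FROM   fact_marks f
       JOIN dim_school   s ON s.school_id   = f.school_id
       JOIN dim_district d ON d.district_id = s.district_id
GROUP  BY d.district_name
"""
print(list(con.execute(per_district)))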

5. THE VARIOUS ONLINE ANALYTICAL PROCESSING (OLAP) TOOLS


The online analytical processing tools, which form the backbone of any data
warehouse system, can be broadly classified into two categories: the multidimensional
(MOLAP) and the relational (ROLAP) tools. As the names suggest, the multidimensional
tools look at the database as a multidimensional entity. The addition of each new set of
entities increases the dimensions of the schema. It also implies that such tools can work
only in the presence of a multidimensional database (MDDB). A simple example is the
case of a student database, wherein each facet of the student, like his academic
performance, his extracurricular activities, his financial commitments vis-a-vis the
college etc., adds a new dimension to his database. Now, if somebody is interested in his
academics only, he will search only the academic dimension, while the college office is
concerned with his financial dimension only, and so on.

A relational OLAP, in contrast, looks at the database essentially as a relation,
i.e. a table. However complex the database may be, it has to be converted to a tabular
form. Then the standard relational operations can be made use of. Those familiar with
the operations of relational databases would recollect that this method, though simple,
involves a lot of redundancy and also makes the processing of the data cumbersome.
However, a relational OLAP has the advantages of simplicity and uniformity.

There are several hybrid approaches also, i.e. they integrate the two methods at
various levels, for example by having a table of hierarchies; these are usually
multirelational systems. However, all these tools basically implement the star schema.
As a thumb rule, one can say that relational OLAP is used where the complexity of the
application is at the lower end and the performance expectations are limited, i.e. in
situations where the effort available for system development is limited. However, as the
complexity increases, the relational models generally become both unwieldy and less
efficient performance-wise, and the choice is normally the multidimensional OLAP.

With this introduction, we now look into the typical multidimensional and
relational architectures.

6. RELATIONAL OLAP
The main strength of this architecture is its simplicity and universality.

[Figure: Relational OLAP architecture. A front-end tool sends an information request to
the ROLAP server, which performs metadata request processing and issues SQL to the
database server; the result set flows back through the ROLAP server to the front-end
tool.]

It can support any number of layers of data, and the main advantage is that new data can
be added as additional layers without affecting the existing data, i.e. the database can be
thought of as a collection of two-dimensional relational tables that can be used to
produce multidimensional views.


the existing data. I.e the database can be thought of as a collection of two dimensional
relational tables that can be used to produce multidimensional views.

The other

advantage is the availability of a several strong SQL engines to support the complexity of
multidimensional analysis. The Relational databases have grown over several years and
hence the entire expertise available can be made use of to provide powerful search
engines to support the complexity of multidimensional analysis. These include creating
multiple SQL statements to handle complex user requests, optimizing these statements
using standard RDBMS techniques and searching the database from multiple points.
But, what makes the relational OLAP most attractive, possibly is its flexibility and also
the availably of products that can work on un-normalised database designs efficiently.
However, the Relational OLAP comes with its standard operations( inspite of the
optimizations at the DBMS level, described above). Thus, recently the Relational OLAP
are shifting towards the concepts of middleware technology to simplify its visualization,
design and applications. Also, instead of pure relational OLAP, hybrid systems, whic h
make use of the relational operations only to the extent to which it remains convenient
and beyond that level make use of other methods are coming into existence.
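As a rough illustration of the "multiple SQL statements from one user request" idea, a ROLAP-style layer might translate a dimensional request into SQL along the following lines. This is only a sketch under assumed table names (matching the earlier star schema sketch), not the behaviour of any particular product.

# Assumed mapping from a dimension name to its table (matches the star schema sketch).
DIMENSION_TABLES = {"school": "dim_school", "subject": "dim_subject"}

def build_rolap_sql(measures, dimensions, fact="fact_marks"):
    """Translate a simple dimensional request into one SQL statement.
    A real ROLAP engine would generate and optimize many such statements."""
    select = [f"{d}.{d}_name" for d in dimensions] + \
             [f"SUM({fact}.{m}) AS total_{m}" for m in measures]
    joins = [f"JOIN {DIMENSION_TABLES[d]} AS {d} ON {d}.{d}_id = {fact}.{d}_id"
             for d in dimensions]
    group_by = ", ".join(f"{d}.{d}_name" for d in dimensions)
    return (f"SELECT {', '.join(select)} FROM {fact} "
            f"{' '.join(joins)} GROUP BY {group_by}")

# Example: marks totalled school-wise and subject-wise.
print(build_rolap_sql(["marks"], ["school", "subject"]))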

The multidimensional OLAP:

[Figure: Multidimensional OLAP architecture. Data is loaded from the database server
(via SQL and result sets) into the MOLAP server, which performs metadata request
processing; the front-end tool sends information requests to the MOLAP server and
receives result sets.]

This is the other style of design available to a data warehouse designer. Here, the
data is basically organized in an aggregate form, i.e. instead of the simple two-dimensional
relational model, each data object interacts with the others in a variety of ways, and
obviously capturing these multidimensional interactions needs a multidimensional
database operation with tight coupling between the applications. Efficient
implementations store the data in the form in which it is utilized most of the time during
the search process, i.e. at the design stage itself, the programmer should not only
visualize the various interactions between the data elements, but should also have an idea
of what the end user will expect the search engines to search for, at least most of the
time. He has to work under the dual constraints of visualizing the multiple, often
invisible at first sight, relationships on one hand and capturing such relationships to
ensure the most optimal pattern of storage on the other. Most commercial products in
this category incorporate the concept of time in their operation, i.e. it is not sufficient if
the data is made available; it should be made available within specified times.

This discussion brings out one major limitation of such systems, namely
maintainability. Any database worth its name will not be static. New data is added, old
data is deleted and, what is more, certain relationships may get modified. Incorporating
such frequent changes into a multidimensional OLAP can be tricky. Several suppliers
provide standard tools which, to some extent, take care of such modifications, but they
are useful only if the changes are not too drastic.

Thus, multidimensional OLAP is best utilized for applications that require
iterative operations and comprehensive time-series analysis of trends.
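A toy illustration of the pre-aggregation idea described above, storing the data already rolled up along the combinations of dimension values most likely to be asked for, is sketched below; the dimensions and the sample figures are invented.

from itertools import product
from collections import defaultdict

# Raw facts: (school, subject, year, marks). Invented sample data.
facts = [
    ("School1", "Maths",   2002, 78),
    ("School1", "Physics", 2002, 65),
    ("School2", "Maths",   2002, 82),
]

# Pre-compute totals for every combination of dimension values, including an
# "ALL" level, so that later queries become simple dictionary look-ups.
cube = defaultdict(float)
for school, subject, year, marks in facts:
    for s, sub, y in product((school, "ALL"), (subject, "ALL"), (year, "ALL")):
        cube[(s, sub, y)] += marks

print(cube[("ALL", "Maths", 2002)])     # total Maths marks across all schools: 160
print(cube[("School1", "ALL", "ALL")])  # everything for School1: 143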

7. MANAGED QUERY ENVIRONMENT (MQE)


The latest OLAP tools provide the users with the capability of performing analysis
directly against the database, albeit in a limited sense. They can bring in a limited
multidimensional OLAP server to simplify the query processing environment, hence the
name managed query environment (MQE). Though the actual implementations differ,
the concept can be highlighted in the following manner:



Instead of the user going to the database every time, he brings an ad hoc data cube
to his local system. (The idea can be imagined as follows: there is a huge cube of data
available in the database, out of which you will frequently be making use of some
portions, so you make a copy of that section of data on your machine.) This can be done
by first developing a query to select the data from the DBMS, having it delivered to your
desktop, after which it can be manipulated (accessed, updated, modified) so as to reduce
the overhead required to create the structure each time the query is executed. In another
approach, these tools can work with multidimensional OLAP servers, so that the data
from the RDBMS first goes to the multidimensional OLAP server and then on to the
desktop.

This approach provides ease of operation and administration, especially when the
end user is reasonably familiar with RDBMS operations. It is cost effective and efficient
at the same time.
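A minimal sketch of the local data cube idea follows: one query pulls the subset of interest from the server into memory, and subsequent slicing is done locally without going back to the database. The table name, columns and the slicing helper are assumptions for illustration.

import sqlite3

def fetch_local_cube(db_path):
    """Run one query against the server-side database and keep the result
    locally, so that repeated analysis does not hit the database each time."""
    con = sqlite3.connect(db_path)
    rows = con.execute(
        "SELECT region, product, sale_year, SUM(amount) "
        "FROM sales GROUP BY region, product, sale_year").fetchall()
    con.close()
    return rows          # the local, ad hoc "data cube"

def slice_cube(cube, region=None, year=None):
    """Slice the local cube without another round trip to the server."""
    return [r for r in cube
            if (region is None or r[0] == region)
            and (year is None or r[2] == year)]

# Example usage (assuming 'warehouse.db' holds a 'sales' table):
# cube = fetch_local_cube("warehouse.db")
# south_2002 = slice_cube(cube, region="South", year=2002)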

However, certain shortcomings persist with the data cubes being built and
maintained on separate desktops. The factors of data redundancy and data integrity need
to be addressed more effectively. Also, in multiuser systems, if each user chooses to
maintain his own data cubes, the system will come under a lot of strain and data
consistency may take a beating. Thus, this method can be effective only when the data
volumes are small.
8. DATA WAREHOUSING PRODUCTS: STATE OF THE ART

Data mining and data warehousing being a relatively new and fast-growing field,
any statement on the state of the art of the technology is hazardous. Further, because of
the intuitive approach taken to the problem by corporates, newer and better tools keep
flooding the market. What we discuss in the next few pages can be taken as a guideline
to the type of products available; it should not be taken as an exhaustive list of the
options available.



Basically, all the data warehousing tools available in the market allow the user to
aggregate the data along common dimensions, i.e. the user can choose to group the data
in various forms so that, at a later time, he can navigate along these dimensions with the
click of a button. However, while the tools provide the facility to integrate data, the
actual choice of integration methods is still with the user, and as such an insight into the
type and suitability of the data is essential for the effective use of these tools.

Roughly, these tools work in two different ways. One set of tools, like Oracle's
Express, pre-aggregate the data into multidimensional databases. Other tools work
directly against relational data. While the merits of each of these approaches are
debatable, leading database vendors like Oracle and Microsoft have taken steps to
incorporate some of the features of OLAP into their normal database software. So, a
time may come when data warehousing tools will not be separate tools, but will form a
part of the database software.

Now, to the features of some of the tools available. (You may not be able to work
on them, but the discussion tells you about the fascinating number of options they
provide.)

IBI Focus Fusion

Focus Fusion from Information Builders Inc. uses multidimensional database
technology together with business application analysis packages. It combines a
high-performance, parallel-capable engine with administrative, copy management and
access tools. Some of the features are:

1. Fast query and reporting: advanced indexing, parallel query and rollup
facilities provide high performance. Indexing ensures an appropriate
arrangement of data to allow direct access. Parallel query facilities ensure that
several sections of the query are searched independently of one another, so
that the overall search time is minimized. The software also ensures
scalability, so that it is possible to use the solution for data warehouses of
varying sizes, or for those built in stages. Scalability would be a critical factor
when dealing with the warehousing of data of large corporates.
2. A comprehensive GUI (Graphical User Interface) to facilitate ease of
administration.
3. A complete portfolio of business intelligence applications that span a wide
range of reporting, query, decision support and EIS needs, with fully
integrated models.
4. Integrated copy management facilities, which schedule automatic data
refreshes from any source into Fusion. This ensures data scalability and
integrity.
5. Fusion works with a wide variety of desktop tools, including world wide web
browsers. This is ensured through open access via industry-standard protocols
like SQL, ODBC and HTTP.
6. A three-tiered reporting architecture ensures high performance.
7. Provides several classes of precalculated summaries, along with data
manipulation capabilities that allow for every conceivable type of
manipulation.
8. For any large-scale database operation, it is desirable that the data is
partitioned into several classes. Fusion provides fully transparent partitioning
without disrupting the users.
9. Support for a parallel computing environment.
10. Seamless integration with more than 60 different databases on various
platforms.

Cognos PowerPlay


This is an OLAP tool that can interface and operate with a wide variety of
software tools, databases and applications. The highlight of the product is that it stores
data in multidimensional data sets called power cubes. These power cubes are stored
either on a Cognos universal client or on a server. They can also reside on the LAN or
inside popular relational databases. PowerPlay features fast installation and deployment
capabilities, scalability and economical costs.

1. Supports data cubes of more than 20 million records.
2. Supports scatter charts that let the users show data across two measures, so
that comparisons can be made (e.g. budgeted values and actual figures can be
shown side by side for comparison).
3. Supports linked displays, i.e. multiple views of the same data in a report.
4. Has a large number of formatting features for financial reports, like single and
double underlining, brackets for negative numbers etc.
5. Unlimited levels of undo operations and customizable tool bars.
6. Word processing, spreadsheet and presentation software features.
7. 32-bit architecture, so it can work better with the latest operating systems.
8. Can create power cubes from existing databases managed by packages like
Oracle, SYBASE etc.
9. Schedules power cube creation for off-peak processing times.
10. Advanced security features to lock by dimension or category, either on the
client or the server or both.
11. Users can pull subsets of information from the server to process on the client.
12. Database management and database security features are integrated.
13. Data cubes can be created by drawing data from different data sources.
14. Multidimensional data cubes can be created and processed.

In principle, PowerPlay manages query analysis as a process that runs on a
population of data cubes.

Pilot Software

It is a package of several PILOT decision support tools built around a high-speed
multidimensional database. Some of the software packages that form the core of the
offering are:



1. PILOT Analysis Server: a multidimensional database with a GUI. It includes
the latest version of the expert-level interface and a multidimensional
relational data store.
2. PILOT Link: a database connectivity tool that provides ODBC connectivity
via specialized drivers to most relational databases. It also comes with a GUI.
3. PILOT Designer: used to develop applications speedily.
4. PILOT Desktop: used for navigation and search between several
multidimensional databases.
5. PILOT Sales and Marketing Analysis Library: provides applications that
allow sophisticated sales and marketing models to be visualized. It also allows
the user to modify the tools to satisfy specific deviations.
6. PILOT Internet Publisher: allows users to access PILOT databases via
browsers on the internet.
The main advantage of having such differentiated tools is that it is easy to modify and
customize the applications.
The other features that are common to the PILOT software are:

1. Many of them provide time as one of the dimensions, so that periodic reports,
updates and shifting from one time base to another become straightforward.
2. They provide integrated, predictive data mining in a multidimensional
environment.
3. They provide compression of sparse cells (those cells which have no value,
but still form part of the matrix), compression of old (time-based) cells,
implicit declaration of some dimensions (they need not be explicitly specified
in the query, but are automatically calculated, as long as they are defined as
attributes of certain other dimensions), creation of dynamic variables etc. All
these features decrease the total size of the database and hence reduce the time
for navigation, without actually losing data.
4. They allow seamless integration with existing OLTP systems. The users can
also specify the views of the database that they frequently refer to, and the
system self-optimizes the relevant queries.


The above is not an exhaustive list of tools, nor are the features completely listed. They
only indicate the type of support one can expect from such tools, and they would be
useful when deciding on one tool over another for actual implementation.

Summary
In this block, you were introduced to the differences between the OLTP
(database) and OLAP (warehouse) concepts. Some of the concepts underlying a typical
data warehouse were discussed in brief. You also learnt about star schema modelling and
about three commonly used tools: IBI Focus Fusion, Cognos PowerPlay and Pilot
Software.

Review Questions
1. OLAP stands for _________________
2. OLTP stands for _________________
3. In an OLAP system, the volume of transactions is _________________
4. A _________________ manages both current and historic transactions.
5. A star schema is organised around a central table called _________________ table.
6. _________________ are locally situated multidimensional data sets, which form
subsets of the data warehouse.
7. _________________ is the ability of the application to grow over a period of time
8. _________________ software comes with special, business oriented applications.
9. Power cube creations are normally scheduled for _________________ periods to
reduce the load on the system
10. DSS stands for _________________
Answers:
1. On Line Analytical Processing
2. On Line Transaction Processing
3. Low
4. OLAP
5. Fact Table


6. Power cubes
7. Scalability
8. PILOT
9. Offpeak
10. Decision Support Systems



BLOCK - III
THE PROCESS OF A DATA WAREHOUSE DEVELOPMENT
In this block, you will be introduced to the step by step methodology of
developing a data warehouse. Beginning with the choice of the subject matter, a brief
introduction to the various stages of development, the tradeoffs involved and the pitfalls
in each is given. You are advised to go through the material in detail and ensure that you
understand the various terminology involved. It is needless to say, however, that the
development of a data warehouse is both an art and a science. While the science portion
can be taught, the art portion is to be developed by practice.
Contents:
1. When do we go for a data warehouse?
2. The basic strategy for a data warehouse
3. Design of a warehouse
4. Data content
5. Metadata
6. The actual development process
7. The process of a data warehouse design
8. Considerations of technology
   i. Hardware platforms
   ii. The DBMS
   iii. Networking capabilities
9. Role of access tools
10. A data warehouse implementation algorithm
11. Summary



THE PROCESS OF A DATA WAREHOUSE DEVELOPMENT

1. WHEN DO WE GO FOR A DATA WAREHOUSE?

As we have seen earlier, a data warehouse is usually built to get answers to
strategic questions regarding policies and strategies, based on past (historical) data.
From the business perspective, it is a tool in the quest for survival in a competitive
environment. Decisions that previously took weeks or months to arrive at are now to be
taken within hours, if not minutes. Added to the demands on speed is the increase in the
volume of data available to be processed. Since the available data in most business areas
is predicted to double every two years, the need for efficient and reliable data
warehousing cannot be overemphasized.

Add to this the changes that keep taking place: entire business models keep
getting modified, if not totally discarded, and we get a reasonable perspective on the
need for efficient data warehousing.

Hence, the need to organize and maintain large amounts of data, so that they can
be analyzed within minutes in the manner and depth desired, becomes important. Thus,
one cannot fail to identify the need for efficient data warehousing strategies.

Before we start looking into the actual design aspects of a data warehouse, let us
also see why conventional information systems could not meet these requirements.
Conventional DBMS systems originated basically for homogeneous and
platform-dependent applications. Also, they were designed for data that changes slowly
and for situations where the acceptable search times were reasonably high. But with the
advent of very fast CPUs and larger and cheaper disk space, the ability, and the need, to
work on very large, dynamic databases was felt. (The concept of networking with ever
increasing bandwidths made the available data, as well as the results, highly dynamic.)
Thus, the need for an alternative, online analytical processing, as opposed to online
transaction processing, was felt. And hence the OLAP systems.

Having once again assured ourselves about the basic features involved in data
warehouses, in the following sections we survey the issues involved in building a
warehouse, beginning from the design approaches, architectures, design tradeoffs, the
concept of metadata, data rearrangement, tools and, finally, the various performance
considerations.

2. THE BASIC STRATEGY FOR A DATA WAREHOUSE


Just like any other software, a data warehouse can be built using either a top-down
or a bottom-up approach, i.e. one can begin with the overall structure required and break
it into modules, submodules etc. That is, we can begin at the level of a global data
warehouse for the entire organisation, split it into individual warehouses (data marts) for
the departments, and break it further based on products, locations etc., until we arrive at
modules which are small enough to be handled independently. Each of these can be built
by one or more project groups (often in parallel) and they can be integrated to suit the
original needs.

Alternatively, we can begin at the lower end and combine sub-data marts into data
marts, and data marts into the data warehouse, to get all possible analyses that one can
get from the warehouse.

However, the discussion is not just about systems and programming. One will also
have to look into the location of the various departments, the levels of interaction
between them, the paths of data flow, the sources of data, the demand centres of analysed
information etc., and arrive at a suitable model. Often, a suitable combination of
top-down and bottom-up designs (or further combinations thereof) is used.

3. THE DESIGN OF A WAREHOUSE


As you know, the very first stage in any software project is the design. In the case
of a data warehouse, the problem is a little more complex, because of the volume of data
and its dynamic nature. However, the very first stage is, definitely, to take a holistic view
of the proposed data warehouse: identify all possible sources of data (present and future),
their possible utility for the various departments, the possible paths of data travel etc.,
and arrive at a comprehensive single, complex system that effectively captures all
possible user requirements. Any failure at this stage would result in a skewed
(unbalanced) data warehouse that caters to only a few requirements, shutting out others.
This may, in the long run, undermine the utility of the warehouse itself. The main
difficulty lies in identifying future trends and making room for them. Further, to enhance
data accessibility, especially in organisations that are geographically spread out, web
enablement is highly desirable.

However, there are three major issues in the development of a data warehouse
that need very careful consideration.

1. The available data will, more often than not, be heterogeneous in nature, i.e.
since the data comes from various, unconnected sources, it needs to be converted
to some standard format, with reference to a uniformly recognised base. This
requires a fair amount of effort and ingenuity. Also, the data needs to be
maintained, i.e. with the passage of time the data becomes obsolete and requires
updating. Again, because the various pieces of data are from different sources, a
substantial amount of effort is required to upgrade them uniformly to maintain
data integrity. Since important decisions are taken based on the data values, their
reliability and authenticity should be beyond doubt at all times.

2. Unlike in databases, in data warehouses historic data cannot be scrapped, but
has to be arranged in a format that is both concise and precise on the one hand
and cost effective on the other. This is a very fundamental challenge in any data
warehouse operation and needs to be addressed at the design level itself.

3. Mainly because of the above considerations, and also because of the constant
inflow of the latest data, the warehouse tends to grow out of proportion very
quickly. Specific instructions are to be put in place to identify and weed out old
data, subject to the constraints imposed by condition (2) above.

Thus, one can safely presume that the design of a warehouse is definitely more
complex and tricky compared to a database design. Also, since it is business driven and
business requirements keep changing, one can safely say that it is not a one-time job, but
a continuous process.

4. DATA CONTENT
Compared to a database, a warehouse contains data which needs to be constantly
monitored and modified if found obsolete. Also, the level of abstraction in a data
warehouse is more detailed, partly to facilitate ease of analysis and partly to ensure ease
of maintenance.

Thus, the data models used in a data warehouse are to be chosen based on the
nature, content and processing pattern of the data warehouse. Before the data is actually
stored, one will have to clearly identify the major components of the model and their
relationships, including the entities, attributes, their values and the possible keys.

But the more difficult task is for the designer to identify the query processes and
the paths travelled by queries. Because of the varying nature of queries, this is more
easily said than done. Visualising all possible query combinations, their frequency etc.
before arriving at the most optimal storage pattern is the key to a successful design. In
addition to optimising the data storage for high query performance, one should also keep
in mind the data storage requirements and the data loading performance of the system.

Thus, no specific rules for the design can be prescribed, and a lot of fine-tuning
based on experience needs to be done. Further, since the data handled will normally be
voluminous, a decision on its actual distribution, whether on a single server, on several
servers on the network etc., has to be taken. It can also be divided based on region, time
or subject. Of course, it is needless to say that each of these needs to be optimised
individually, as well as in combination.

5. METADATA
Since the data in a warehouse is voluminous content-wise and varied in terms of
models, the relationships between the databases, amongst themselves and with the
warehouse as a whole, need to be made known to the end users and the end-user tools.
The metadata defines the contents and the location of the data in the warehouse. This
facilitates further updating and maintenance of the data warehouse. It is used by the
users to find the subject areas and the definitions of the data. It also helps the users to
modify and update the data and the data models. It essentially acts as a logical link
between the decision support system applications and the data warehouse.

Thus, a data warehouse designer would also create a metadata repository which
has access paths to all important parts of the data warehouse at all points of time. The
metadata works like an access buffer between the tools and the data, and no user or tool
can directly meddle with the data warehousing environment. The actual choice of the
format for the metadata, of course, is left to the designer.

6. THE ACTUAL DEVELOPMENT PROCESS


As we have seen earlier, a number of tools are available for each phase of
development. They provide facilities for defining the transformation and cleanup, data
movement, query processing, reporting and analysis. They differ in capabilities and
compatibilities, and it is left to the designer to choose appropriate tools and also to
modify his design modules to fit the capabilities of these tools.

No doubt, the metadata should be able to effectively address the database and the
tools that are used. Further, an injudicious choice of tools, or diluting the design
specifications to accommodate the tools, may result in inefficient data warehouses which
will soon become unmanageable.



Having seen the various stages of a data warehousing design, we will look at an
actual step by step procedure to design workable data warehouses.

7. THE PROCESS OF A DATA WAREHOUSE DESIGN:


The process of data warehouse design is complex because of the vague nature of
the goals available. Quite often, the only guideline available to a data warehouse
designer is: take all the enterprise data and build a data warehouse, so that the
management can get answers to their questions.

In such a situation, all that the designer can do is to start somewhere and get
going. The most common technique is to develop a datamart and gradually grow it into a
full-fledged data warehouse.

Ralph Kimball identifies a nine-step strategy to build a data mart (steps 10 and 11
below extend it towards a full warehouse). They are:
1. Choose the subject matter (one subject at a time)
2. Decide what the fact table represents
3. Identify and conform the dimensions
4. Choose the facts
5. Store pre-calculations in the fact table
6. Define the dimensions and tables
7. Decide the duration of the database and the periodicity of updation
8. Track the slowly changing dimensions
9. Decide the query priorities and query models
10. Build a few simple data marts and
11. Integrate them in stages

Let us briefly look into the details of the above steps


1. Often, even people who have worked with the organisation for several years
will find it difficult to clearly identify the areas of activity and partition them.
Hence the warehouse designer, normally an outsider, would find it quite
difficult to decide on the various subject matters to deal with. Of course, he
will interact with the users of the proposed warehouse at various levels and
elicit their views through interviews and questionnaires, by going through the
various documents or by simply watching the procedures. If there is already
some level of computerisation, the DBAs would give invaluable information
regarding the sources of data, their quality and validity.

Armed with this information, the designer will have to decide on his own
how to partition the activities into subject matters and which of them should
be implemented to begin with. Normally, the "hot" subjects will be given
priority, i.e. those which are likely to interest most people or those which are
likely to immediately benefit the organisation.

2. A fact table is a large central table in a dimensional design that has a
multipart key. The parts of the key can be combined to form query keys for
the data mart. For example, for a student database, the fact table may contain
the student's various particulars and each of them can be a part of the key. Converting
the facts into the fact table is a very crucial step and involves several
brainstorming sessions.
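As a rough illustration of this idea, the following Python sketch models a small fact table
for the student example, keyed by a multipart key; the attribute names (student_id,
course_id, semester_id, marks, attendance_pct) are hypothetical and chosen only for
illustration.

    # A minimal sketch of a fact table with a multipart key, for a
    # hypothetical student data mart (all names are invented).
    fact_table = {
        # key = (student_id, course_id, semester_id) -> facts
        ("S001", "CS101", "2003-1"): {"marks": 78, "attendance_pct": 91},
        ("S001", "CS102", "2003-1"): {"marks": 64, "attendance_pct": 85},
        ("S002", "CS101", "2003-1"): {"marks": 83, "attendance_pct": 96},
    }

    # Parts of the key can be combined to form query keys for the data mart,
    # e.g. all facts for a given student in a given semester.
    def facts_for_student(student_id, semester_id):
        return {key: facts for key, facts in fact_table.items()
                if key[0] == student_id and key[2] == semester_id}

    print(facts_for_student("S001", "2003-1"))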

3. The dimension table design is the next important step, which converts the fact
table to a multidimensional table. Each dimension normally refers to a set
of related activities and would lead to a multidimensional or relational database,
as the case may be. The dimensions are the source of the new headers in the
user's final reports. Since the choice of the dimensions freezes the data
warehouse specifications to some extent, sufficient thought must be given to the
future growth of the warehouse or of the organisation itself.
Duplicate or superfluous dimensions should be avoided, without
compromising the long range perspectives of the warehouse. However,
if two data marts end up having the same dimensions, they should conform to
each other. This would ensure ease of standardising the queries.
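A small companion sketch of the dimension side, again with invented names: the dimension
tables supply the report headers, and reusing the same student dimension in a second data
mart is one simple way of keeping the dimensions conformed.

    # Hypothetical dimension tables for the same student data mart.
    student_dim = {"S001": {"name": "Anil", "department": "CS&E"},
                   "S002": {"name": "Bina", "department": "CS&E"}}
    course_dim = {"CS101": {"title": "Data Structures", "credits": 4},
                  "CS102": {"title": "DBMS", "credits": 3}}

    # A report row joins a fact row to its dimensions; the dimension
    # attributes become the headers of the user's final report.
    def report_row(key, facts):
        student_id, course_id, _ = key
        return {"student": student_dim[student_id]["name"],
                "course": course_dim[course_id]["title"],
                **facts}

    print(report_row(("S001", "CS101", "2003-1"), {"marks": 78}))
    # A second data mart that also needs a student dimension can reuse
    # student_dim as-is, which keeps the two data marts conformed.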



The remaining steps are the logical follow-up of the first three stages.

4. The choice of the facts, though it appears simple, can sometimes be tricky,
especially if step 1 above is not carried out properly. All facts that pertain to
the dimensions should be correctly identified and also their links to other data
items are to be ascertained.

5. The relations between the various entities are expressed in terms of precalculations and are stored in the fact tables.

6. This stage involves the choice of the number, content and dimensions of the
various tables used in the operation. While the selection may appear simple,
one has to note that a choice of too few tables would make each of them too
voluminous and hence the query processing becomes inefficient. On the other
hand, too many small tables would create problems of storage, consistency
and data integration.

7. The duration of the databases and the periodicity of updation are decided mainly by
the type of operations of the organisation, the frequency of data sampling and
to some extent the time & space constraints of the software programmer. As
already indicated, any updation would mean the previous data is stored as
historic data, in a suitable format, depending on its importance.
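One simple way such an updation policy might be realised is sketched below: before a record
is refreshed, the previous version is archived as historic data together with its validity
period. The record keys, fields and dates are invented for illustration.

    from datetime import date

    current = {}   # key -> record that is valid now
    history = []   # archived versions with their validity period

    def refresh(key, new_record, as_of):
        if key in current:
            old = dict(current[key])
            old["valid_until"] = as_of          # close out the previous version
            history.append((key, old))
        current[key] = dict(new_record, valid_from=as_of)

    refresh("CUST-42", {"segment": "regular"}, date(2003, 1, 1))
    refresh("CUST-42", {"segment": "premium"}, date(2003, 7, 1))
    print(current["CUST-42"])
    print(history)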

Steps 8 and 9 require several iterations, spread over a period of time, and would possibly
involve accommodating conflicting priorities.

Steps 10 and 11 are self-explanatory.

8. CONSIDERATIONS OF TECHNOLOGY:



While the above discussion talks of the implementation issues, several
technological issues also need to be addressed. Some of them are:

i) The hardware platforms
ii) The DBMS
iii) Networking infrastructure
iv) Operating systems and system management platforms
v) Software tools

i) Hardware platforms: While implementing a data warehouse, the existing hardware
can be utilized, provided the disk storage space is sufficient: usually of the GB order.
Apart from the record size, sufficient space for processing, indexing, swapping etc. needs
to be made available, apart from, of course, the space required for the system software.
Further, because of the sensitivity of data, sufficient scope for backup of the data is to be
built in to safeguard against crashes. Though any reasonably fast processor can be
used, the trend is to go for a dedicated data warehouse server. Such servers, apart from
being able to support large data volumes and fast operations, are scalable, a very
important characteristic for a warehouse, as a practical data warehouse keeps growing
throughout its life cycle. In fact, as the data volume increases, the capabilities need to
increase more than proportionately, to take into account the more complex indexing &
computational aspects. Further, if the querying is to go on a public data network (like the
internet), a multiprocessor configuration with sufficient I/O bandwidth is essential and a
balance between the I/O and computational capabilities of the server is to be achieved. If
this is not done, the I/O processing could end up as a bottleneck. This balance is achieved
by choosing different types of processors (not just a multiprocessor system, but a multi-type
processor system) and also by having one or more disk controllers to control the required
number of disks.

Needless to say, for maximum efficiency, each major component of
the system should be selected such that optimum performance and scalability are achieved.

Otherwise, one or the other component will end up blocking further innovations to the
warehouse.

ii) Choice of the DBMS: This is as important as, if not more important than, the hardware
selection, as it determines the performance of the warehouse to no lesser extent. Again, the
parameters remain the same: scalability, the ability to efficiently handle large volumes of
data and the speed of processing.
Almost all the well known DBMSs (Oracle, Sybase, DB2) support parallel
database processing. Some of them also provide special features for operating on datacubes
(described in the previous chapter).

iii) Networking capabilities: Most data warehousing applications work on an intranet
(within the organisation) and a few may also work in the internet environment (web
enabled). The choice to put it in a network itself is decided by various factors like
security and privacy on one hand, counterbalanced by accessibility & spread on the other.
While not much extra hardware for networking may be needed (apart from that normally
used) for warehousing, software considerations & the planning process tend to become
definitely more complex.

9. ROLE OF ACCESS TOOLS


Though readymade data warehouses to suit every need are hard to get, several
tools are available to ease the implementation of the warehouse. However, care is to be
exercised to choose the best suitable tools (note the words best suitable tools, not the
best tool, for no such best tool exists). To compare and understand their capabilities, a few
of the following reports are generated on a trial basis:

1. Statistical analysis
2. Data visualisation, production of graphical reports
3. General statistical analysis
4. Complex textual search (text mining)
5. Generation of user specific reports
6. Complex queries which span multiple tables, involve multilevel subqueries &
sophisticated computations.

10. A DATA WAREHOUSE IMPLEMENTATION ALGORITHM

Step 1: Define the data sources
Step 2: Create a data model, decide on the appropriate hardware and software
platforms
Step 3: Choose the DBMS & other tools
Step 4: Extract the data from the sources, and load it into the model
Step 5: Create the database connectivity software, using the various tools chosen
in steps 2 & 3
Step 6: Define / choose suitable GUI (presentation) software
Step 7: Devise ways of updating the data, by channelising the data from the data
sources, periodically.
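A toy sketch of how these steps might be strung together in Python is given below; the
source rows, cleanup rules and table name are all assumptions made only for illustration,
with SQLite standing in for the chosen DBMS.

    import sqlite3

    def extract(rows):                      # Step 4: extract from the defined sources
        return [r for r in rows if r.get("amount") is not None]

    def transform(rows):                    # cleanup to fit the data model (Step 2)
        return [{"region": r["region"].strip().upper(),
                 "amount": float(r["amount"])} for r in rows]

    def load(rows, db_path="warehouse.db"): # load into the chosen DBMS (Step 3)
        con = sqlite3.connect(db_path)
        con.execute("CREATE TABLE IF NOT EXISTS sales_fact (region TEXT, amount REAL)")
        con.executemany("INSERT INTO sales_fact VALUES (:region, :amount)", rows)
        con.commit()
        con.close()

    source = [{"region": " south ", "amount": "1200.50"},
              {"region": "north", "amount": None}]     # Step 1: a defined data source
    load(transform(extract(source)))                   # Step 7: rerun periodically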

SUMMARY

You have been briefly introduced to the various stages of a data warehouse
development, with an algorithm emerging out of the discussions. The stages, namely
collection of requirements, creating a data model, indicating data sources and data users,
choice of hardware and software platforms, choice of reporting tools, connectivity tools
and GUI and refreshment of data periodically form the core of any data warehouse
development.
The next unit, which is a case study, is to be studied, bearing in mind these
fundamentals.



Review Questions

1. The process of removing the deficiencies and loopholes in the data is called
____________ of data.
2. The design of the method of information storage in the data warehouse is defined by
the ___________.
3. ___________ provides pointers to the data of the data warehouse.
4. A reasonable prediction of the type of queries, that are likely to arise, help in
improving the ___________ of search
5. A balance between the ___________ processors and ___________ processors is
necessary for better performance of the data warehouse.
6. Name any two methods of identifying the business requirements, ____________ and
______________.
7. GUI stands for _________________
8. The two basic design strategies of OLTP are ___________ and _____________.

Answers:
1. Cleaning up
2. Data model
3. Metadata
4. Efficiency
5. Input/output, computational
6. Interviews and questionnaires.
7. Graphical User Interface.
8. Top Down and Bottom up

Reference Books:
1. CSR Prabhu, 'Data Warehousing: Concepts, Techniques, Products and Applications',
PHI, New Delhi - 2001.



UNIT II
DATA MINING

COURSE INTRODUCTION

We know lots of data is being collected and warehoused. Data is collected and
stored at enormous speeds. Data mining is a technique for the semi-automatic discovery of
patterns, associations, changes, anomalies and rules in data. Data mining is
interdisciplinary in nature. In this course you will study the importance of data mining,
techniques used for data mining, web data mining and knowledge discovery in databases.


BLOCK - 1

DATA MINING
Data Mining - An Introduction

1.0 Introduction
1.1 What is data mining?
1.2 Few applications
1.3 Extraction Methods
1.4 Trends that affect data mining
1.5 Summary

1.0 Introduction

The field of data mining is emerging as a new, fundamental area with important
applications to science, engineering, medicine, business and education. Data mining
attempts to formulate, analyze and implement basic induction processes that facilitate the
extraction of meaningful information and knowledge from unstructured data. Data
mining extracts patterns, changes, associations and anomalies from large data sets. Work
in data mining ranges from theoretical work on the principles of learning and
mathematical representation of data to building advanced engineering systems that
perform information filtering on the web. Data mining is also a promising computational
paradigm that enhances traditional approaches to discovery and increases the
opportunities for breakthroughs in the understanding of complex physical and biological
systems.


1.1 What is data mining?

Data mining is the semi-automatic discovery of patterns, associations, changes,
anomalies, rules and statistically significant structures and events in data; i.e., data
mining attempts to extract knowledge from data.

Data mining is an interactive, semi-automated process that begins with raw data.
Results of the data mining process may be insights, rules or predictive models.

The focus on large data sets is not just an engineering challenge, it is an
essential feature of induction of expressive representations from raw data. It is only by
analyzing large data sets that we can produce accurate logical descriptions that can be
translated automatically into powerful predictive mechanisms.

1.2 Few applications

The opportunities today in data mining rest on a variety of applications. Many
are interdisciplinary in nature.

a) Neural Networks - Neural networks are systems inspired by the human brain. A
basic example is provided by a back propagation network which consists of input
nodes, output nodes and intermediate nodes called hidden nodes. Initially, the nodes
are connected with random weights. During the training, a gradient descent algorithm
is used to adjust the weights so that the output nodes correctly classify data presented
to the input nodes.
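A very small numerical sketch of this idea is given below: one hidden layer trained by
gradient descent on the XOR problem. The layer sizes, learning rate, iteration count and
bias handling are arbitrary choices made only for illustration, not a recommended design.

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)

    def with_bias(a):                    # append a constant-1 bias column
        return np.hstack([a, np.ones((a.shape[0], 1))])

    W1 = rng.normal(size=(3, 3))         # (2 inputs + bias) -> 3 hidden nodes
    W2 = rng.normal(size=(4, 1))         # (3 hidden + bias) -> 1 output node
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    for _ in range(20000):
        h = sigmoid(with_bias(X) @ W1)                 # hidden activations
        out = sigmoid(with_bias(h) @ W2)               # output activations
        grad_out = (out - y) * out * (1 - out)         # gradient at the output nodes
        grad_h = (grad_out @ W2[:-1].T) * h * (1 - h)  # gradient propagated back
        W2 -= 0.5 * with_bias(h).T @ grad_out          # adjust weights by gradient descent
        W1 -= 0.5 * with_bias(X).T @ grad_h

    print(np.round(out, 2))   # should move towards the XOR targets 0, 1, 1, 0,
                              # though convergence depends on the random initial weights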

b) Tree-based classifiers - A tree is a convenient way to break a large data set into
smaller ones. By presenting a learning set to the root and asking questions at each
interior node, the data at the leaves can often be analyzed very simply. Tree-based
classifiers were independently invented in information theory, statistics, pattern
recognition and machine learning.

c) Graphical Models and Hierarchical Probabilistic representation - A directed
graph is a good means of organizing information about qualitative knowledge about
conditional independence and causality gleaned from domain experts. Graphical
models were independently invented by computational probability and artificial
intelligence researchers studying uncertainty.

d) Ensemble learning - Rather than use data mining to build a single predictive model,
it is often better to build a collection or ensemble of models and to combine them, say
with a simple, efficient voting strategy. This simple idea has now been applied in a
wide variety of contexts and applications.
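A minimal sketch of the voting idea, with three hand-written stand-in classifiers (the rules,
feature names and labels are invented purely for illustration):

    from collections import Counter

    def model_a(x): return "spam" if x["exclaims"] > 3 else "ham"
    def model_b(x): return "spam" if x["links"] > 2 else "ham"
    def model_c(x): return "spam" if x["length"] < 20 else "ham"

    def ensemble_predict(x, models=(model_a, model_b, model_c)):
        votes = Counter(m(x) for m in models)   # collect each model's vote
        return votes.most_common(1)[0][0]       # simple majority voting

    print(ensemble_predict({"exclaims": 5, "links": 1, "length": 15}))  # 'spam'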

e) Linear algebra - Scaling data mining algorithms often depends critically upon
scaling the underlying computations in linear algebra. Recent work in parallel algorithms
for solving linear systems and algorithms for solving sparse linear systems in high
dimensions is important for a variety of data mining applications, ranging from text
mining to detecting network intrusions.

f) Large scale optimization - some data mining algorithms can be expressed as large
scale, often non-convex, optimization problems.

g) Databases, Data Warehouses and Digital Libraries - The most time consuming
part of the data mining process is preparing data for data mining. This step can be
streamlined in part if the data is already in a database, data warehouse or digital
library, although mining data across different databases remains a challenge.

h) Visualization of Massive data sets: Massive data sets, often generated by complex
simulation programs, require graphical visualization methods for best comprehension.


i) Multi-media documents: Few people are satisfied with today's technology for
retrieving documents on the web, yet the number of documents and the number of
people accessing these documents are growing explosively. In addition, it is becoming
easier and easier to archive multi-media data, including audio, images and video data,
but harder and harder to extract meaningful information from the archives as the
volume grows.

j) Electronic commerce - Not only does electronic commerce produce large data sets in
which the analysis of marketing patterns and risk patterns is critical, but unlike some
of the applications above, it is also important to do this in real or near-real time, in
order to meet the demands of on-line transactions.

1.3 Extraction Methods

Information extraction is an important part of any knowledge management
system. Working in conjunction with information retrieval and organization tools,
machine-driven extraction is a powerful means of finding content on the web.

The precision and efficiency of information access improve when digital content
is organized into tables within a relational database. The two main methods of
information extraction technology are

- Natural language processing
- Wrapper induction

Information extraction identifies and extracts relevant information from texts,
pulling information from a variety of sources and aggregating it to create a single view.



1.4 Trends that affect data mining

The following are a few trends which promise to have a fundamental impact on data
mining.

Data trends: Perhaps the most fundamental external trend is the explosion of digital
data during the past two decades. During this period, the amount of data probably has
grown by between six and ten orders of magnitude. Much of this data is accessible via
networks.

Hardware trends: Data mining requires numerically and statistically intensive
computations on large data sets. The increasing memory and processing speed of
workstations enables the mining of data sets, using current algorithms and techniques, that
were too large to be mined just a few years ago. In addition, the commoditization of high
performance computing through workstations and high performance workstation clusters
enables attacking data mining problems that were accessible using only the largest
supercomputers a few years ago.

Scientific computing trends: Data mining and knowledge discovery serve an
important role in linking the three modes of science, theory, experiment and simulation,
especially for those cases in which the experiment or simulation results in large data sets.

Business trends: Today businesses must be more profitable, react more quickly and offer
higher quality services than ever before, and do it all using fewer people and at lower cost.
With these types of expectations and constraints, data mining becomes a fundamental
technology, enabling businesses to more accurately predict opportunities and risks
generated by their customers and their customers' transactions.

1.5 Summary
Data mining is the semi-automatic discovery of patterns, associations, changes,
anomalies, rules and statistically significant structures and events in data. Data mining
can be applied with neural networks, tree-based classifiers, ensemble learning, linear
algebra, optimization, databases and more.

1.6 Question / Answer Key

1.Data mining attempts to extract ______________ from data.


2. _____________ are systems inspired by the human brain and are used for data
mining
3. The two main methods of information extraction are ____________ and
________________

Answers

1. Knowledge
2. Neural networks
3. Natural language processing, wrapper induction

BLOCK - 2

DATA MINING FUNCTIONS


2.0 Introduction
2.1 Classification
2.2 Associations
2.3 Sequential patterns
2.4 Clustering/Segmentation
2.5 Summary

2.0 Introduction

In this unit you are going to study various data mining functions. Data mining
methods may be classified by the function they perform or according to the class of
application they can be used in. Data mining functions are helpful in solving real
world problems.

2.1 Classification
Data mine tools have to infer a model from the database, and in the case of
supervised learning this requires the user to define one or more classes. The database
contains one or more attributes that denote the class of a tuple and these are known as
predicted attributes whereas the remaining attributes are called predicting attributes. A
combination of values for the predicted attributes defines a class.

When learning classification rules the system has to find the rules that predict the
class from the predicting attributes; so firstly the user has to define conditions for each
class, and the data mine system then constructs descriptions for the classes. Basically,
given a case or tuple with certain known attribute values, the system should be able to
predict what class this case belongs to.

Once classes are defined the system should infer rules that govern the
classification; therefore the system should be able to find the description of each class.
The descriptions should only refer to the predicting attributes of the training set, so that
the positive examples satisfy the description and none of the negative examples do. A rule
is said to be correct if its description covers all the positive examples and none of the
negative examples of a class.
A rule is generally presented as: if the left hand side (LHS) then the right hand
side (RHS), so that in all instances where the LHS is true, the RHS is also true or very
probable. The categories of rules are:

exact rule - permits no exceptions so each object of LHS must be an element of RHS
strong rule - allows some exceptions, but the exceptions have a given limit
probabilistic rule - relates the conditional probability P(RHS|LHS) to the probability
P(RHS)

Other types of rules are classification rules where LHS is a sufficient condition to
classify objects as belonging to the concept referred to in the RHS.

2.2 Associations
Given a collection of items and a set of records, each of which contains some
number of items from the given collection, an association function is an operation against
this set of records which returns affinities or patterns that exist among the collection of
items. These patterns can be expressed by rules such as "72% of all the records that
contain items A, B and C also contain items D and E." The specific percentage of
occurrences (in this case 72) is called the confidence factor of the rule. Also, in this rule,
A, B and C are said to be on the opposite side of the rule to D and E. Associations can
involve any number of items on either side of the rule.
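As a rough sketch of how such a confidence factor could be computed, the snippet below
checks the rule {A, B, C} -> {D, E} against a handful of invented records.

    records = [
        {"A", "B", "C", "D", "E"},
        {"A", "B", "C", "D"},
        {"A", "B", "C", "D", "E", "F"},
        {"B", "C"},
    ]

    def confidence(lhs, rhs, records):
        covered = [r for r in records if lhs <= r]   # records containing the LHS
        if not covered:
            return 0.0
        return sum(1 for r in covered if rhs <= r) / len(covered)

    print(confidence({"A", "B", "C"}, {"D", "E"}, records))  # 2 of 3, about 0.67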

Another example of the use of associations is the analysis of the claim forms
submitted by patients to a medical insurance company. Every claim form contains a set of
medical procedures that were performed on a given patient during one visit. By defining
the set of items to be the collection of all medical procedures that can be performed on a
patient and the records to correspond to each claim form, the application can find, using
the association function, relationships among medical procedures that are often
performed together.


2.3 Sequential/Temporal patterns

Sequential/temporal pattern functions analyze a collection of records over a
period of time, for example to identify trends. Where the identity of a customer who made
a purchase is known an analysis can be made of the collection of related records of the
same structure (i.e. consisting of a number of items drawn from a given collection of
items). The records are related by the identity of the customer who did the repeated
purchases. Such a situation is typical of a direct mail application where for example a
catalogue merchant has the information, for each customer, of the sets of products that
the customer buys in every purchase order. A sequential pattern function will analyze
such collections of related records and will detect frequently occurring patterns of
products bought over time. A sequential pattern operator could also be used to discover
for example the set of purchases that frequently precedes the purchase of a microwave
oven. Sequential pattern mining functions are quite powerful and can be used to detect
the set of customers associated with some frequent buying patterns. Use of these
functions on for example a set of insurance claims can lead to the identification of
frequently occurring sequences of medical procedures applied to patients which can help
identify good medical practices as well as to potentially detect some medical insurance
fraud.
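One very simple sequential question of this kind, counting which items most often precede
a given target purchase, is sketched below; the customer purchase histories are invented
for illustration.

    from collections import Counter

    histories = {
        "cust1": ["kettle", "toaster", "microwave oven"],
        "cust2": ["toaster", "microwave oven", "mixer"],
        "cust3": ["kettle", "mixer"],
    }

    def items_preceding(target, histories):
        counts = Counter()
        for orders in histories.values():
            if target in orders:
                counts.update(orders[:orders.index(target)])  # items bought earlier
        return counts

    print(items_preceding("microwave oven", histories))
    # Counter({'toaster': 2, 'kettle': 1})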

2.4 Clustering/Segmentation

Clustering and segmentation are the processes of creating a partition so that all the
members of each set of the partition are similar according to some metric. A cluster is a
set of objects grouped together because of their similarity or proximity. Objects are often
decomposed into an exhaustive and/or mutually exclusive set of clusters.

Clustering according to similarity is a very powerful technique, the key to it being
to translate some intuitive measure of similarity into a quantitative measure. When
learning is unsupervised, the system has to discover its own classes, i.e. the system
clusters the data in the database. The system has to discover subsets of related objects in
the training set and then it has to find descriptions that describe each of these subsets.

There are a number of approaches for forming clusters. One approach is to form
rules which dictate membership in the same group based on the level of similarity
between members. Another approach is to build set functions that measure some property
of partitions as functions of some parameter of the partition.
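A minimal sketch of the first approach, using the well-known k-means procedure to group
points by Euclidean similarity (the points, the value of k and the initial centres are
arbitrary choices for illustration):

    import numpy as np

    points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                       [8.0, 8.0], [8.2, 7.9], [7.8, 8.3]])
    k = 2
    centres = points[[0, 3]].copy()   # arbitrary initial centres (real
                                      # implementations usually choose them randomly)

    for _ in range(10):
        # assign each point to its nearest centre
        dists = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each centre to the mean of the points assigned to it
        centres = np.array([points[labels == j].mean(axis=0) for j in range(k)])

    print(labels)    # e.g. [0 0 0 1 1 1], two clusters of similar points
    print(centres)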

2.5 Summary :

In this unit you studied the classification, association, sequential/temporal pattern
and clustering/segmentation data mining functions. Supervised and unsupervised learning
techniques play a vital role in data mining.

2.6 Question / Answer Key

1. ________________ rule permits no exceptions so each object of LHS must be
an element of RHS
2. ________________ pattern functions analyze a collection of records over a
period of time
3. A ________________ is a set of objects grouped together because of their
similarity or proximity.

Answers

1. Exact
2. Sequential / Temporal
3. Cluster.

BLOCK - 3

DATA MINING TECHNIQUES


3.0 Introduction
3.1 Cluster Analysis
3.2 Induction
3.3 Neural Networks
3.4 On-line Analytical processing
3.5 Data Visualization
3.6 Summary

3.0 Introduction

Learning procedures can be classified into two categories: supervised learning and
unsupervised learning. In the case of supervised learning we know the target value; the
output is compared with this target value and the procedure is repeated until the desired
value is obtained. In the case of unsupervised learning we extract new facts without
knowing the target value. In this unit you will learn different data mining techniques.

3.1 Cluster Analysis



In an unsupervised learning environment the system has to discover its own classes
and one way in which it does this is to cluster the data in the database. The first step is
to discover subsets of related objects and then find descriptions, e.g. D1, D2, D3 etc.,
which describe each of these subsets.

Clustering and segmentation basically partition the database so that each partition or
group is similar according to some criteria or metric. Clustering according to similarity is
a concept which appears in many disciplines. If a measure of similarity is available there
are a number of techniques for forming clusters. Membership of groups can be based on
the level of similarity between members and from this the rules of membership can be
defined. Another approach is to build set functions that measure some property of
partitions, i.e. groups or subsets, as functions of some parameter of the partition. This latter
approach achieves what is known as optimal partitioning.

Many data mining applications make use of clustering according to similarity for
example to segment a client/customer base. Clustering according to optimization of set
functions is used in data analysis e.g. when setting insurance tariffs the customers can be
segmented according to a number of parameters and the optimal tariff segmentation
achieved.

Clustering/segmentation in databases are the processes of separating a data set into
components that reflect a consistent pattern of behavior. Once the patterns have been
established they can then be used to "deconstruct" data into more understandable subsets
and also they provide sub-groups of a population for further analysis or action, which is
important when dealing with very large databases. For example a database could be used
for profile generation for target marketing, where previous response to mailing campaigns
can be used to generate a profile of people who responded, and this can be used to predict
response and filter mailing lists to achieve the best response.

3.2 Induction


A database is a store of information, but more important is the information which
can be inferred from it. There are two main inference techniques available, i.e. deduction
and induction.

Deduction is a technique to infer information that is a logical consequence of the
information in the database, e.g. the join operator applied to two relational tables,
where the first concerns employees and departments and the second departments and
managers, infers a relation between employees and managers.

Induction has been described earlier as the technique to infer information that is
generalised from the database, as in the example mentioned above, to infer that each
employee has a manager. This is higher level information or knowledge in that it is a
general statement about objects in the database. The database is searched for
patterns or regularities.
Induction has been used in the following ways within data mining.
3.2.1 Decision trees

Decision trees are a simple knowledge representation and they classify examples into
a finite number of classes; the nodes are labeled with attribute names, the edges are
labeled with possible values for the attribute and the leaves are labeled with different classes.
Objects are classified by following a path down the tree, taking the edges
corresponding to the values of the attributes in the object.

The following is an example of objects that describe the weather at a given time.
The objects contain information on the outlook, humidity etc. Some objects are positive
examples, denoted by P, and others are negative, denoted by N. Classification in this case
is the construction of a tree structure which can be used to
classify all the objects correctly.
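A small sketch of such a tree construction on hypothetical weather records is shown below,
using scikit-learn (assumed to be available). The records and the integer encoding of the
attribute values are invented, and encoding categories as integers is a simplification made
only for this illustration.

    from sklearn.tree import DecisionTreeClassifier, export_text

    # outlook: 0=sunny, 1=overcast, 2=rain;  humidity: 0=normal, 1=high
    X = [[0, 1], [0, 0], [1, 1], [2, 1], [2, 0], [1, 0]]
    y = ["N",    "P",    "P",    "N",    "P",    "P"]

    tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
    print(export_text(tree, feature_names=["outlook", "humidity"]))
    print(tree.predict([[0, 1]]))   # classify a new object: sunny, high humidity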

3.2.2 Rule induction



A data mine system has to infer a model from the database; that is, it may define
classes such that the database contains one or more attributes that denote the class of a
tuple, i.e. the predicted attributes, while the remaining attributes are the predicting
attributes. A class can then be defined by conditions on the attributes. When the classes are
defined the system should be able to infer the rules that govern classification; in other
words the system should find the description of each class.

Production rules have been widely used to represent knowledge in expert systems
and they have the advantage of being easily interpreted by human experts because of their
modularity, i.e. a single rule can be understood in isolation and doesn't need reference to
other rules. The propositional-like structure of such rules has been described earlier but
can be summed up as if-then rules.

3.3 Neural networks

Neural networks are an approach to computing that involves developing
mathematical structures with the ability to learn. The methods are the result of academic
investigations to model nervous system learning. Neural networks have the remarkable
ability to derive meaning from complicated or imprecise data and can be used to extract
patterns and detect trends that are too complex to be noticed by either humans or other
computer techniques. A trained neural network can be thought of as an "expert" in the
category of information it has been given to analyze. This expert can then be used to
provide projections given new situations of interest and answer "what if" questions.

Neural networks have broad applicability to real world business problems and
have already been successfully applied in many industries. Since neural networks are best
at identifying patterns or trends in data, they are well suited for prediction or forecasting
needs including:
sales forecasting
industrial process control
customer research
data validation
risk management
target marketing etc.

Neural networks use a set of processing elements (or nodes) analogous to neurons in
the brain. These processing elements are interconnected in a network that can then
identify patterns in data once it is exposed to the data, i.e. the network learns from
experience just as people do. This distinguishes neural networks from traditional
computing programs, which simply follow instructions in a fixed sequential order.
The structure of a neural network typically consists of a layer of input nodes, one or
more layers of hidden nodes and a layer of output nodes, with weighted connections
between the successive layers.

3.4 On-line Analytical processing

A major issue in information processing is how to process larger and larger
databases, containing increasingly complex data, without sacrificing response time. The
client/server architecture gives organizations the opportunity to deploy specialized
servers which are optimized for handling specific data management problems. Until
recently, organizations have tried to target relational database management systems
(RDBMSs) for the complete spectrum of database applications. It is however apparent
that there are major categories of database applications which are not suitably serviced by
relational database systems. Oracle, for example, has built a totally new Media Server for
handling multimedia applications. Sybase uses an object-oriented DBMS (OODBMS) in
its Gain Momentum product which is designed to handle complex data such as images
and audio. Another category of applications is that of on-line analytical processing
(OLAP). OLAP was a term coined by E F Codd (1993) and was defined by him as:

the dynamic synthesis, analysis and consolidation of large volumes of
multidimensional data

Codd has developed rules or requirements for an OLAP system:



multidimensional conceptual view
transparency
accessibility
consistent reporting performance
client/server architecture
generic dimensionality
dynamic sparse matrix handling
multi-user support
unrestricted cross dimensional operations
intuitive data manipulation
flexible reporting
unlimited dimensions and aggregation levels

An alternative definition of OLAP has been supplied by Nigel Pendse who, unlike
Codd, does not mix technology prescriptions with application requirements. Pendse
defines OLAP as Fast Analysis of Shared Multidimensional Information, which means:

Fast in that users should get a response in seconds and so don't lose their chain of
thought;

Analysis in that the system can provide analysis functions in an intuitive manner and
that the functions should supply business logic and statistical analysis relevant to the
user's application;

Shared from the point of view of supporting multiple users concurrently;

Multidimensional as a main requirement so that the system supplies a
multidimensional conceptual view of the data including support for multiple hierarchies;

Information is the data and the derived information required by the user application.



One question is: what is multidimensional data and when does it become OLAP? It is
essentially a way to build associations between dissimilar pieces of information using
predefined business rules about the information you are using. Kirk Cruikshank of Arbor
Software has identified three components to OLAP, in an issue of UNIX News on data
warehousing:

A multidimensional database must be able to express complex business calculations
very easily. The data must be referenced and mathematics defined. In a relational
system there is no relation between line items, which makes it very difficult to express
business mathematics.
Intuitive navigation in order to `roam around' data, which requires mining
hierarchies.
Instant response, i.e. the need to give the user the information as quickly as possible.

Dimensional databases are not without problems, as they are not suited to storing all
types of data, such as lists, for example customer addresses and purchase orders etc.
Relational systems are also superior in security, backup and replication services, as these
tend not to be available at the same level in dimensional systems. The advantage of a
dimensional system is the freedom it offers, in that the user is free to explore the data
and receive the type of report they want without being restricted to a set format.
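A rough illustration of a multidimensional view, using pandas as a stand-in for a real
OLAP engine: sales facts are summarised along two dimensions (region and quarter), with
the margins acting as simple roll-ups. The figures and dimension names are invented.

    import pandas as pd

    sales = pd.DataFrame({
        "region":  ["North", "North", "South", "South", "South"],
        "quarter": ["Q1", "Q2", "Q1", "Q1", "Q2"],
        "amount":  [100, 120, 80, 90, 150],
    })

    cube = pd.pivot_table(sales, values="amount", index="region",
                          columns="quarter", aggfunc="sum", margins=True)
    print(cube)   # rows and columns are dimensions; 'All' gives the roll-ups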

3.5 Data Visualization

Data visualisation makes it possible for the analyst to gain a deeper, more
intuitive understanding of the data and as such can work well alongside data mining.
Data mining allows the analyst to focus on certain patterns and trends and explore them
in depth using visualisation. On its own, data visualisation can be overwhelmed by the
volume of data in a database, but in conjunction with data mining it can help with
exploration.

3.6 Summary

In this unit, you studied various data mining techniques. Each method has
its own advantages and drawbacks. Depending on the application, one should choose the
appropriate method.

BLOCK - 4


KNOWLEDGE DISCOVERY FROM DATABASES (KDD)

4.0 Introduction
4.1 View points
4.2 Classification Method
4.3 Steps of a KDD process
4.4 KDD Application
4.5 Related Fields
4.6 Summary
4.7 Question/Answer key

4.0 Introduction

We know lots of data is being collected and warehoused. Data is collected and
stored at enormous speeds. Traditional techniques are infeasible for such raw data. Hence
data mining is used for data reduction.

4.1 View points

From a commercial point of view, data mining provides better, customized services
for the user. Information is becoming a product in its own right. We know traditional
techniques are not suitable because of the enormity of data, the high dimensionality of data,
and the heterogeneous, distributed nature of data. Hence we can use prediction methods,
i.e. we find human-interpretable patterns that describe the data.

Knowledge Discovery in Databases (KDD) is an emerging field that combines
techniques from machine learning, pattern recognition, statistics, databases and
visualization to automatically extract concepts, concept interrelations and patterns of
interest from large databases. The basic task is to extract knowledge (or information)
from lower level data (databases). The basic tools used to extract patterns from data are
called data mining methods, while the process surrounding the usage of these tools
(including pre-processing, selection and transformation of the data) and the interpretation
of patterns into knowledge is the KDD process.

This extracted knowledge is subsequently used to support human decision
making. The use of KDD systems alleviates the problem of manually analyzing the large
amounts of collected data which decision makers currently face. KDD systems have been
implemented and are currently in use in finance, fraud detection, market data analysis,
astronomy, etc. Problems in KDD include representation of the extracted knowledge,
search complexity, the use of prior knowledge to improve the discovery process,
controlling the discovery operation, statistical inference and selecting the most appropriate
data mining method(s) to apply to a particular data set.

4.2 Classification Method

In this approach, a collection of records (the training set) is given; each record
contains a set of attributes, one of which is the class. After that we should find a
model for the class attribute as a function of the values of the other attributes.

Building accurate and efficient classifiers for large databases is one of the
essential tasks of data mining and machine learning research. Given a set of cases with
class labels as a training set, classification is to build a model (called a classifier) to predict
future data objects for which the class label is unknown.

Recent studies propose the extraction of a set of high quality association rules
from the training data set which satisfy certain user specified frequency and confidence
thresholds.

Suppose a data object obj = {a1, a2, ..., an} follows the schema (A1, A2, ..., An), where
A1, ..., An are called attributes. Attributes can be categorical or continuous. For a
categorical attribute, we assume that all the possible values are mapped to a set of
consecutive positive integers. For a continuous attribute, we assume that its value range is
discretized into intervals and the intervals are also mapped to consecutive positive
integers.
Let C = {c1, ..., cm} be a finite set of class labels. A training data set is a set
of data objects such that, for each object obj, there exists a class label c_obj in C
associated with it. A classifier is a function c from (A1, ..., An) to C. Given a data
object obj, c(obj) returns a class label.

In general, given a training data set, the task of classification is to build a classifier
from the training data set such that it can be used to predict class labels of unknown
objects with high accuracy.
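A minimal sketch tying this definition to code: categorical attribute values are mapped to
consecutive positive integers, a continuous attribute is discretized into intervals, and a
simple 1-nearest-neighbour rule plays the role of the classifier c(obj). The attributes,
mappings and objects are invented for illustration.

    income_map = {"low": 1, "medium": 2, "high": 3}                  # categorical -> integers
    age_bin = lambda age: 1 if age < 30 else (2 if age < 60 else 3)  # discretized intervals

    training = [((income_map["low"],    age_bin(25)), "c1"),
                ((income_map["high"],   age_bin(45)), "c2"),
                ((income_map["medium"], age_bin(70)), "c1")]

    def c(obj):
        # return the label of the closest training object (1-nearest-neighbour)
        dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
        return min(training, key=lambda pair: dist(pair[0], obj))[1]

    print(c((income_map["high"], age_bin(50))))   # -> 'c2'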

4.3 Steps of a KDD process

a) Learning the application domain - relevant prior knowledge and goals of the application.
b) Creating a target data set - data selection.
c) Data cleaning and pre-processing.
d) Data reduction and transformation - find useful features, dimensionality / variable
reduction, invariant representation.
e) Choosing the function of data mining - summarization, classification, regression.
f) Choosing the mining algorithms.
g) Data mining - search for patterns of interest.
h) Pattern evaluation and knowledge presentation - visualization, transformation,
removing redundant patterns.
i) Use of discovered knowledge.

The following figure shows the KDD process:

data --(Selection)--> target data --(Preprocessing)--> preprocessed data
--(Transformation)--> transformed data --(Data mining)--> patterns
--(Interpretation / Evaluation)--> knowledge

4.4 KDD Application

The rapidly emerging field of knowledge discovery in databases (KDD) has
grown significantly in the past few years. This growth is driven by a mix of daunting
practical needs and strong research interest. The technology for computing and storage
has enabled people to collect and store information from a wide range of sources at rates
that were, only a few years ago, considered unimaginable. Although modern database
technology enables economical storage of these large streams of data, we do not yet have
the technology to help us analyze, understand, or even visualize this stored data.

Examples of this phenomenon abound in a wide spectrum of fields: finance,
banking, retail sales, manufacturing, monitoring and diagnosis (be it of humans or
machines), health care, marketing, and science data acquisition, among others.

Why are today's database and automated match and retrieval technologies not
adequate for addressing the analysis needs? The answer lies in the fact that the patterns to
be searched for, and the models to be extracted, are typically subtle and require
significant specific domain knowledge. For example, consider a credit card company
wishing to analyze its recent transactions to detect fraudulent use or to use the individual
history of customers to decide on-line whether an incoming new charge is likely to be
from an unauthorized user. This is clearly not an easy classification problem to solve.


One can imagine constructing a set of selection filters that trigger a set of queries
to check if a particular customer has made similar purchases in the past, or if the amount
or the purchase location is unusual, for example. However, such a mechanism must
account for changing tastes, shifting trends, and perhaps travel or change of residence.
Such a problem is inherently probabilistic and would require a reasoning-with-uncertainty scheme to properly handle the trade-offs between disallowing a charge and
risking a false alarm, which might result in the loss of a sale (or even a customer).

In the past, we could rely on human analysts to perform the necessary analysis.
Essentially, this meant transforming the problem into one of simply retrieving data,
displaying it to an analyst, and relying on expert knowledge to reach a decision.
However, with large databases, a simple query can easily return hundreds or thousands
(or even more) matches. Presenting the data, letting the analyst digest it, and enabling a
quick (and correct) decision becomes infeasible. Data visualization techniques can
significantly assist this process, but ultimately the reliance on the human in the loop
becomes a major bottleneck. (Visualization works only for small sets and a small number
of variables. Hence, the problem becomes one of finding the appropriate transformations
and reductions--typically just as difficult as the original problem.)

Finally, there are situations where one would like to search for patterns that
humans are not well-suited to find. Typically, this involves statistical modeling, followed
by "outlier" detection, pattern recognition over large data sets, classification, or
clustering. (Outliers are data points that do not fit within a hypothesis's probabilistic
model and hence are likely the result of interference from another process.) Most database
management systems (DBMSs) do not allow the type of access and data manipulation
that these tasks require; there are also serious computational and theoretical problems
attached to performing data modeling in high-dimensional spaces and with large amounts
of data.

4.5 Related fields




By definition, KDD is an interdisciplinary field that brings together researchers
and practitioners from a wide variety of fields. The major related fields include statistics,
machine learning, artificial intelligence and reasoning with uncertainty, databases,
knowledge acquisition, pattern recognition, information retrieval, visualization,
intelligent agents for distributed and multimedia environments, digital libraries, and
management information systems.

The remainder of this article briefly outlines how some of these relate to the
various parts of the KDD process. I focus on the main fields and hope to clarify to the
reader the role of each of the fields and how they fit together naturally when unified
under the goals and applications of the overall KDD process. A detailed or
comprehensive coverage of how they relate to the KDD process would be too lengthy and
not very useful because ultimately one can find relations to every step from each of the
fields. The article aims to give a general review and paint with a broad brush. By no
means is this intended to be a guide to the literature, nor do I aim at being
comprehensive in any sense of the word.
Statistics. Statistics plays an important role primarily in data selection and sampling, data
mining, and evaluation of extracted knowledge steps. Historically, most statistics work
has focused on evaluation of model fit to data and on hypothesis testing. These are clearly
relevant to evaluating the results of data mining to filter the good from the bad, as well as
within the data- mining step itself in searching for, parametrizing, and fitting models to
data. On the front end, sampling schemes play an important role in selecting which data
to feed to the data- mining step. For the data-cleaning step, statistics offers techniques for
detecting "outliers," smoothing data when necessary, and estimating noise parameters. To
a lesser degree, estimation techniques for dealing with missing data are also available.
Finally, for exploratory data analysis, some techniques in clustering and design of
experiments come into play. However, the focus of research has dealt primarily with
small data sets and addressing small sample problems.

65
Downloaded from www.pencilji.com

Downloaded from www.pencilji.com


On the limitations front, work in statistics has focused mostly on theoretical
aspects of techniques and models. Thus, most work focuses on linear models, additive
Gaussian noise models, parameter estimation, and parametric methods for a fairly
restricted class of models. Search has received little emphasis, with emphasis on closed-form analytical solutions whenever possible. While the latter is very desirable both
computationally and theoretically, in many practical situations a user might not have the
necessary background statistics knowledge (which can often be substantial) to
appropriately use and apply the methods. Furthermore, the typical approaches require an
a priori model and significant domain knowledge of the data as well as of the underlying
mathematics for proper use and interpretation. In addition, issues having to do with
interfaces to databases, dealing with massive data sets, and techniques for efficient data
management have only recently begun to receive attention in statistics.

Pattern recognition, machine learning, and artificial intelligence. In pattern
recognition, work has historically focused on practical techniques with an appropriate
mix of rigor and formalism. The major applicable techniques fall under the category of
classification learning and clustering. Hence, most pattern-recognition work contributes
to the data-mining step in the process. Significant work in dimensionality reduction,
transformations, and projections has relevance to the corresponding step in the KDD
process.

Within the data-mining step, pattern-recognition contributions are distinguished
from statistics by their emphasis on computational algorithms, more sophisticated data
structures, and more search, both parametric and nonparametric. Given its strong ties to
image analysis and problems in 2D signal processing, work in pattern recognition did not
emphasize algorithms for dealing with symbolic and categorical data. Classification
techniques applied to categorical data typically take the approach of mapping the data to
a metric space with norms. Such a mapping is often not easy to formulate meaningfully: Is
the distance between the values "square" and "circle" for the variable shape greater than
the distance between "male" and "female" for the variable sex?



Databases and data warehouses. The relevance of the field of databases to KDD is
obvious from the name. Databases provide the necessary infrastructure to store, access,
and manipulate the raw data. With parallel and distributed database management systems,
they provide the essential layers to insulate the analysis from the extensive details of how
the data is stored and retrieved. I focus here only on the aspects of database research
relevant to the data-mining step. A strongly related term is on-line analytical processing,
which mainly concerns providing new ways of manipulating and analyzing data using
multidimensional methods. This has been primarily driven by the need to overcome
limitations posed by SQL and relational DBMS schemes for storing and accessing data.
The efficiencies achieved via relational structure and normalization can pose significant
challenges to algorithms that require special access to the data: in data mining, one would
need to collect statistics and counts based on various partitioning of the data, which
would require excessive joins and new tables to be generated. Supporting operations from
the data- mining perspective is an emerging research area in the database community. In
the data-mining step itself, new approaches for functional dependency analysis and
efficient methods for finding association rules directly from databases have emerged and
are starting to appear as products. In addition, classical database techniques for query
optimization and new object-oriented databases make the task of searching for patterns in
databases much more tenable.

An emerging area in databases is data warehousing, which is concerned with
schemes and methods of integrating legacy databases, on-line transaction databases, and
various nonhomogeneous RDBMSs so that they can be accessed in a uniform and easily
managed framework. Data warehousing primarily involves storage, data selection, data
cleaning, and infrastructure for updating databases once new knowledge or
representations are developed.


4.6 Summary
Knowledge discovery in databases (KDD) is an emerging field that combines
techniques from machine learning, pattern recognition, statistics and databases. In this unit
you studied the steps that are involved in the KDD process. A few applications are
challenging and a lot of research work is going on today.




4.7 Question / Answer Keys

1. The basic tools used to extract patterns from data are called _______methods.
2. In the classification method a collection of records (training set) is given; each
record contains a set of __________________, one of which is the class.
3. ______________ plays an important role primarily in data selection and
sampling, data mining, and evaluation of extracted knowledge steps.

Answers

1. data mining
2. attributes
3. Statistics

BLOCK - 5

WEB DATA MINING




5.0 Introduction
5.1 Methods
5.2 Web content Mining
5.3 Web structure Mining
5.4 Web usage Mining
5.5 The usage mining on the web
5.6 Privacy on the web
5.7 Summary
5.8 Question and Answers key

5.0 INTRODUCTION

Web data mining is the use of data mining techniques to automatically discover
and extract information from world wide web documents and services. Today, with the
tremendous growth of the data sources available on the web and the dramatic popularity
of e-commerce in the business community, web data mining has become an increasingly
important area.

5.1 Methods

Web mining is a technique to discover and analyze useful information from
web data. Web mining is decomposed into the following tasks:
a) Resource discovery : the task of retrieving the intended information from the
Web.
b) Information Extraction : automatically selecting and preprocessing specific
information from the retrieved web resources.
c) Generalization : automatically discovering general patterns at both
individual web sites and across multiple sites.
d) Analysis : analyzing the mined patterns.



5.2 Web Content Mining

Web content mining describes the automatic search of information resources
available online and involves mining web data contents. In the web mining domain, web
content mining essentially is an analog of data mining techniques for relational data
bases, since it is possible to find similar types of knowledge from the unstructured data
residing in web documents. The web document usually contains several types of data,
such as text, image, audio, video, meta data and hyperlinks. Some of them are semi
structured such as HTML documents or a more structured data like the data in the tables
or database generated HTML pages, but most of the data is unstructured text data. The
unstructured characteristic of web data forces web content mining towards a more
complicated approach.

Web content mining is based on statistics about single words in isolation to
represent unstructured text, taking single words found in the training corpus as features.
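A tiny sketch of this single-word (bag-of-words) representation, with an invented training
corpus:

    from collections import Counter

    corpus = ["data mining on the web",
              "mining web usage data",
              "multimedia data on the web"]

    vocabulary = sorted({w for doc in corpus for w in doc.split()})

    def features(doc):
        counts = Counter(doc.split())
        return [counts[w] for w in vocabulary]   # one feature per known word

    print(vocabulary)
    print(features("web data mining and web search"))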

Multimedia data mining is part of content mining, which is engaged in mining
high-level information and knowledge from large online multimedia sources.
Multimedia data mining on the web has gained many researchers' attention recently.
Working towards a unifying framework for representation, problem solving and learning
from multimedia is really a challenge; this research area is still in its infancy and indeed
much work is waiting to be done.

5.3 Web structure Mining

Most of the web information retrieval tools only use the textual information,
while ignoring the link information, which could be very valuable. The goal of web structure
mining is to generate a structural summary about the web site and web page.



Technically, web content mining mainly focuses on the structures of inner
documents, while web structure mining tries to discover the link structure of the
hyperlinks at the inter-document level. Based on the topology of the hyperlinks, web
structure mining will categorize the web pages and generate information, such as the
similarity and relationship between different web sites.

If a web page is linked to another web page directly, or the web pages are neighbours, we would like to discover the relationships among those web pages. The relations may fall into one of several types: the pages may be related by synonymy or antonymy, they may have similar contents, or both of them may sit on the same web server and therefore have been created by the same person. Another task of web structure mining is to discover the nature of the hierarchy or network of hyperlinks in the web sites of a particular domain. This may help to generalize the flow of information in web sites representing that domain, so that query processing becomes easier and more efficient.
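
One widely used way of exploiting hyperlink topology is a PageRank-style scoring of pages. The sketch below is an illustration under assumed data - the tiny link graph is invented and the iteration is deliberately simplified - and is not part of the original case material.

# A simplified PageRank-style scoring of pages from their hyperlink structure.
# The link graph below is invented for illustration.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

def rank(graph, damping=0.85, iterations=50):
    pages = list(graph)
    score = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outlinks in graph.items():
            # Each page shares its current score among the pages it links to.
            share = score[page] / len(outlinks) if outlinks else 0.0
            for target in outlinks:
                new[target] += damping * share
        score = new
    return score

for page, value in sorted(rank(links).items(), key=lambda kv: -kv[1]):
    print(page, round(value, 3))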

5.4 Web usage Mining

Web usage mining tries to discover useful information from the secondary data derived from the interactions of users while surfing the web. It focuses on techniques that can predict user behaviour while the user interacts with the web. In the process of data preparation for web usage mining, the web content and the web site topology are used as information sources, which connects web usage mining with web content mining and web structure mining. The clustering performed during pattern discovery is a further bridge from usage mining to content and structure mining.

5.5. The usage Mining on the web

Web usage mining is the application of data mining techniques to discover usage patterns from web data, in order to understand and better serve the needs of web-based applications.



Web usage mining is divided into three distinct phases: preprocessing, pattern discovery, and pattern analysis.

Preprocessing : Web usage mining applies data mining techniques to usage logs (secondary web data) held in large web data repositories. The purpose is to produce results that can be used in design tasks such as web site design, web server design and navigation through a web site. Before applying the data mining algorithms, we must perform data preparation to convert the raw data into the data abstraction necessary for further processing.
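
As a hedged illustration of this preparation step (the log records and the 30-minute session timeout are assumptions, not details from the text), the sketch below parses simplified access-log lines and groups them into per-user sessions:

# Converting raw web-log records into per-user sessions (data abstraction step).
from datetime import datetime, timedelta

raw_log = [                       # invented, simplified log records
    "10.0.0.1 2024-01-05T10:00:00 /index.html",
    "10.0.0.1 2024-01-05T10:05:00 /products.html",
    "10.0.0.2 2024-01-05T10:06:00 /index.html",
    "10.0.0.1 2024-01-05T11:30:00 /index.html",   # later visit: new session
]

TIMEOUT = timedelta(minutes=30)   # assumed session timeout

def sessionize(lines):
    sessions = {}
    for line in lines:
        ip, stamp, page = line.split()
        when = datetime.fromisoformat(stamp)
        user = sessions.setdefault(ip, [])
        # Start a new session if the gap since the last request exceeds the timeout.
        if not user or when - user[-1][-1][0] > TIMEOUT:
            user.append([])
        user[-1].append((when, page))
    return sessions

for ip, user_sessions in sessionize(raw_log).items():
    print(ip, [[page for _, page in s] for s in user_sessions])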

Pattern discovery : Pattern discovery draws on algorithms and techniques from several research areas, such as data mining, machine learning, statistics and pattern recognition.
Pattern analysis : Pattern analysis is the final stage of web usage mining. The goal of this process is to eliminate the irrelevant rules or patterns and to extract the interesting rules or patterns from the output of the pattern discovery process. There are two common approaches to pattern analysis. One is to use a knowledge query mechanism such as SQL; the other is to construct a multidimensional data cube and then perform OLAP operations on it.
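
The SQL-based approach can be pictured with the following sketch; the table, values and interestingness threshold are all invented for illustration. Discovered patterns are loaded into a table and the uninteresting ones are filtered out by a query.

# Pattern analysis via a knowledge query mechanism (SQL over discovered patterns).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visit_pattern (page TEXT, visits INTEGER)")
conn.executemany(
    "INSERT INTO visit_pattern VALUES (?, ?)",
    [("/index.html", 940), ("/products.html", 310), ("/contact.html", 12)],
)  # invented pattern-discovery output

# Keep only the patterns interesting enough to report (threshold is an assumption).
for page, visits in conn.execute(
    "SELECT page, visits FROM visit_pattern WHERE visits >= 100 ORDER BY visits DESC"
):
    print(page, visits)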

5.6 Privacy on the Web

Due to the massive growth of e-commerce, privacy has become a sensitive topic and has attracted more and more attention recently. The basic goal of web mining is to extract information from data sets for business needs, which means its applications are highly customer related. The lack of regulations on the use and deployment of web mining systems, together with widely reported privacy abuses related to data mining, has made privacy a burning issue like never before. Privacy touches a central nerve with people and there are no easy solutions.



5.7 Summary

In this unit, you studied the area of web data mining, with a focus on web usage mining. Web usage mining involves three stages: preprocessing, pattern discovery and pattern analysis.

5.8 Question/Answer Keys


1. _________________ : the task of retrieving the intended information from the Web.

2. Web content mining describes the ____________________ of information resources available online and involves mining web data contents.

3. Web content mining mainly focuses on the structures of inner documents, while web structure mining tries to discover the _________ of the hyperlinks at the inter-document level.

Answer
1. Resource discovery
2. automatic search
3. link structure



UNIT III

Unit Introduction
Having learnt the fundamentals of warehousing, in this unit we list out the various areas in which warehousing becomes useful, especially in the central and state government sectors.

You will also be introduced to a case study: that of the Andhra Pradesh information warehouse. This case study is expected to underline the various concepts discussed in the previous unit and provide a practical bias to the entire concept of data warehousing.

A term project is suggested to further drive home the complexities and intricacies
involved in the process.
Since this unit is to be studied in totality, no summary or review questions are
included.



BLOCK I
Block Introduction
In this block, you will be briefly introduced to the various possible applications of the data warehouse concept. Because of their familiarity, the present and suggested applications of warehousing techniques at the government level have been briefly described. This block is expected to give you some insight into the practical applications of the data warehouse.
Contents:
1. Areas of applications of data warehousing
2. Data warehousing technologies in the government
3. Government of India warehouses
   i. Data warehousing of census data
   ii. Monitoring of essential commodities
   iii. Ministry of Commerce
   iv. Ministry of Education



1. AREAS OF APPLICATION OF DATA WAREHOUSING & DATA MINING

Having seen so much about data mining and data warehousing, the question arises as to their areas of application. Of course, the business community is the first user of the technique: they feed in their own and their competitors' results, trends etc. and come out with tangible strategies for the future. Obviously, there can be as many variations and modifications of the warehousing concept as there are types of business. However, to learn about them, one should first know the types of business practices, their various strategies etc. before one can appreciate the warehousing techniques. Instead, in this block, we choose the safer option of going through the various applications at the government level. This promises to be a good procedure for two reasons: one, all of us have some idea of how the government machinery works; two, a lot of literature is available about the implementations. However, we should underline the fact that we will be bothered more about the techniques and technologies, rather than the actual outputs and results.

2. DATA WAREHOUSING TECHNOLOGIES IN THE GOVERNMENT


It is obvious that in a large country like India, data mining and data warehousing technologies have extensive potential for use in a variety of activities in several central government sectors like agriculture, commerce, rural development, health, tourism and so on. In fact, even before the advent of data warehousing technologies, there were attempts to computerize the available data and use it to facilitate the decision making process. However, there have been several attempts to work more meticulously with the available data over the last decade.


Similarly, several state governments, especially those which are forerunners in the IT industry, have tried to exploit the technology in various areas. Needless to say, a lot more needs to be done than what has been achieved. In the next sections, we briefly list the various areas that have been identified for data warehousing applications. In the next block, we see a detailed case study of the Andhra Pradesh Information Warehouse, which should give the learners a grasp over the concepts we have studied earlier.


3. GOVERNMENT OF INDIA WAREHOUSES:
i) A data warehouse of census data

The Government of India conducts a census of the population of the country, which is a storehouse of all types of information for the planning process of the country. Though the census data is presently processed manually or with database technologies, the data available is so varied and complex that it is ideally suited to data warehousing techniques. Information about wide-ranging areas at various levels (village, district, state etc.) can be extracted and compiled using OLAP techniques.

In fact, village level census analysis software has been developed by the National Informatics Centre (NIC). This software gives details in two parts: the primary census abstract and the details about the various amenities. It has been used on a trial basis to get various views of the development scenario of selected villages in the country, using the 1991 census data. Efforts are on to use the technology on a much larger scale for subsequent census data.

It is easy to see why the census data is ideally suited to data warehousing applications. Firstly, it is reasonably static data, which is updated only once in ten years. Secondly, since unbelievably large volumes of complex data are available, the benefit of the technology over other methods of extracting information is obvious even at first sight. Thirdly, almost all the concepts of data warehousing become applicable in this application.
ii) Monitoring of essential commodities
The Government of India compiles data on the prices of essential commodities like rice, wheat, pulses, edible oils etc. The prices of these commodities on every day of the previous week are collected at every week end from selected centres across the country. These are consolidated to give the government an insight into the trends and also allow the government to devise strategies on various agricultural policies. Again, because of the geographical spread, the periodicity of updating etc., this becomes



an ideal case for OLAP technology application, especially because of the network facility available on a countrywide basis.

iii) Ministry of Commerce

The Ministry of Commerce has to constantly monitor the prices, quantum of exports, imports and stock levels of several commodities in order to take appropriate steps to boost exports and also to devise an EXIM (Export-Import) policy that suits the country's industries in both the short and long terms. It should constantly take into account the various trends (both at the national and international levels) so that our exports continue to be competitive in the global markets, while at the same time ensuring that our industries are not swamped by foreign goods. The situation became more complex after the opening up of our economy to global influences in 1991.

To ensure this, the Ministry of Commerce has set up several export processing zones across the country, which compile data about the various export-import transactions in their selected regions. These are then compiled to produce data for decision making by the commerce ministry on a regular basis.

This being again a fit case for data warehousing operations, the government has drawn up a plan to make use of OLAP decision support tools for decision making. In fact, the data collection centres are already computerised, and in the second phase of computerisation, the decision making process is expected to be based on the principles of data mining and data warehousing.

iv) Ministry of Education

The latest All India Education Survey, which has given rise to a treasure house of valuable data about the status of education across the country, has been converted into a data warehouse. This warehouse supports various decision making queries. In addition, several other departments are ideally suited to make use of data warehousing and data mining technologies. Some of them have already initiated action in this direction as well. To list a few of them:

i) The Ministry of Rural Development: Detailed surveys on the availability of drinking water, the number of people below the poverty line, the available surplus land for distribution etc. have been computerised at various stages over the last decade. A consolidation of these into a warehouse is being contemplated.
ii) The Ministry of Tourism has already collected valuable data regarding the pattern of tourist arrivals, their choices, spending patterns etc. Details about primary tourist spots are also available. These can be combined to produce a data warehouse to support decision making.

iii) The Ministry of Agriculture conducts an agricultural census, based on remote sensing, random sampling etc., to compile data about cropping patterns, expected yields, inputs of seeds and fertilizers, livestock data etc. Areas under irrigation, rainfall patterns, forecasts etc. are also routinely compiled. These can be combined into a data warehouse to aid decision making.

In addition, several areas like planning, health, economic affairs etc. are ideally suited to make use of OLAP tools. Conventionally, many of these departments are computerised, routinely produce MIS reports and hence maintain medium to large databases. The next logical step is to convert these databases and MIS know-how into full-fledged data warehouses. This would result in a paradigm shift as far as data utilisation is concerned. Since the utility of most types of data is time bound, enormous delays in extracting information out of them would make the information time-barred and hence of little use. Further, such warehouses, when they come into existence, would release the expert manpower now spent on processing the data for data analysis and decision making.

The next stage, obviously, is to link these departmental warehouses. It is obvious, even to an outsider, that most of these departments cannot work in isolation. Hence, unless the departments can avail themselves of selected data from the warehouses of other departments, their decision making remains incomplete. Of course, a lot of checks and balances need to be put in place before such huge, multidimensional warehouses are made functional. But the goal should be to have a consolidated central government warehouse and corresponding state government warehouses.



Block II
Data Warehouse - A Case Study

You will be introduced to a practical data warehouse design case - that of the Andhra Pradesh Information Warehouse. The various stages of the development are described, in the context of what has already been discussed in the previous sections, and the various tradeoffs involved are discussed in as much detail as possible. At the end of the block, you are expected to have become more comfortable with the practical aspects of warehouse techniques.
Contents:
1. Introduction
2. Concepts used in developing the warehouse
3. Data sources
   a) MPHS
   b) Land suite of applications
   c) Maps and dictionaries
4. Possible users of information
   i. Policy planners
   ii. Custodians
   iii. Warehouse developers
   iv. Citizens
5. Conversion of data to information
   i. Data conversion
   ii. Data scrubbing
   iii. Data transformation
   iv. Web publishing
6. Identifying hardware and software
7. Type of expected queries
8. Choice of data structures and dimensions
9. Term exercise



1. INTRODUCTION
In this block, we look in detail at the process of development of the Andhra Pradesh Information Warehouse. As specified earlier, we will be more interested in the technological and technical aspects than in the administrative details.

The Andhra Pradesh government has undertaken a project to develop a state data warehouse, with the 'person' identified as the smallest entity of the data repository. Put the other way, the state government extracted information from its Multipurpose Household Survey (MPHS) and its computerised land records. The idea was to link the 'land' and 'people' entities to produce a conceptually clean data warehouse.

This data warehouse is expected to provide planners with sufficient inputs to assess the impact of their various welfare schemes on various sections of society. It is possible for them to choose different target groups like urban slum dwellers, industrial workers, agricultural labourers etc. and review their status with reference to various parameters like economy, education, housing, health etc. The data so generated can be used for planning schemes specifically targeted towards any one or more of these groups. By a logical extension of the concept, it should be possible for the policy makers to assess the impact of their welfare programs on these target groups during the progress of the programs. Such a scenario is expected to help the policy planners and executives to keep their decisions purposeful and focussed.

The actual warehouse was developed by C-DAC (Centre for Development of Advanced Computing). In the next few sections, we see the basic concepts and schema used by them.



2. THE CONCEPTS USED IN DEVELOPING THE WAREHOUSE:
These concepts have been discussed in detail in the earlier unit, but are included here to make the case study self-contained and also to serve as a ready reckoner.

The type of processing is typical data warehouse processing. The data is stored in the form of tables (relations), and can be accessed based on keys.

Some queries are taken up by the OnLine Analytical Processing (OLAP) system, which is designed as a multidimensional database (or a collection of them), and the user can pose complex analytical queries. The databases are normally optimised based on previously known patterns of data entry and data retrieval.

Drill down and rollup analysis: Data available in the database is normally arranged in several layers. The upper layers contain single data entities and their details are hidden in the lower levels, each successive layer holding more detailed data about the entities above it, with the details of the present layer hidden in the next lower layers. It is for the user to decide at what level he wants to see the data. The process of beginning at a higher level and viewing data at progressively lower levels is called "drilling down" on the data. Conversely, one can view data beginning at a detailed (lower) level and move up to concise (higher) levels. This is called "rolling up" the data.

With this terminology in place, we go about "designing" the data warehouse. Though several alternative methods are possible, and exactly how C-DAC went about doing the same cannot be duplicated here, this exercise is meant to give an idea of the actual development process in a nutshell.



Now we see the various stages of development as follows:

1) Identify the data sources and the type of data available from them
2) Identify the users of the warehouse and the type of queries you can expect
from them
3) Identify the methods of converting the data from the sources in (1) into information for the users in (2) above
4) Identify the hardware and software components
5) Finalise the type of queries that arise and ways of combining / standardising
them.
6) Look at the ways of storing data in such a format so that it can be efficiently
searched by most of the queries.
7) Finalise the data structures, analysis variables and methods of a calculation.

3. THE DATA SOURCES:


The Andhra Pradesh government basically decided to link the data entities "Land" and "Person" and build the warehouse. Hence, the primary sources of data were the land records (which had already been computerised) and the person-related data collected through the Multi-Purpose Household Survey (MPHS) suite of applications.

a) MPHS: The government of Andhra Pradesh collected data from each house
hold regarding the socio-economic status of each family. This data, collected
originally for a different purpose, was available as MPHS suite of applications
in an electronic format. Relevant portions of this suite were made use of by
the government for building the warehouse.
b) Land suite of applications: This data, again, was already available, in which
land was the core entity of information. Again, relevant portions of these
records, were used for constructing the warehouse.



c) Maps and dictionaries: Since the number of entities entering any reasonably useful database is very large, codes, instead of names, are normally used. Dictionaries are maintained that relate the names to the unique codes. Of course, depending on the entities and their applications, the codes are allotted. These dictionaries are maintained by different custodians. Depending on whether the data is land related, person related, social or educational, different custodians allot and maintain these codes. Needless to say, the use of codes greatly simplifies the handling of the entities.

However, depending on the area of application, each entity may be allotted a different code. For example, a school building may be given a different code depending on whether the dictionary pertains to the educational, land use or social aspect. Thus, there should be a mapping between the various dictionaries based on these different classification schemes. Maps are therefore maintained to interrelate one set of codes to another, their validity being checked and updated at regular intervals. Again, only authorised custodians are allowed to maintain and modify such maps.

These dictionaries and maps are essential to store and manipulate the data objects.
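
The following sketch (with invented codes and names) illustrates what such dictionaries and maps amount to in practice: one dictionary per classification scheme, plus a custodian-maintained map relating codes in one scheme to codes in another.

# Dictionaries relate names to codes; maps relate one coding scheme to another.
education_dict = {"EDU-017": "Govt. High School, Village1"}    # invented codes
land_use_dict = {"LU-0432": "Public building, Survey No. 88"}

# Map maintained by an authorised custodian: education code -> land-use code.
edu_to_land_use = {"EDU-017": "LU-0432"}

def land_record_for_school(edu_code):
    # Resolve a school's education code to its land-use entry via the map.
    land_code = edu_to_land_use[edu_code]
    return land_code, land_use_dict[land_code]

print(education_dict["EDU-017"], "->", land_record_for_school("EDU-017"))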

4. POSSIBLE USERS OF INFORMATION


The next phase is to identify the users of the information that the warehouse generates. We briefly discuss the proposed users of the information in the present case.

i) Policy planners: These are the primary users of the information generated from the warehouse. Since they are expected to use it to the maximum extent, the warehouse queries need to be optimised to suit the type and pattern of queries generated by them. Though it may not be possible to anticipate their queries fully, a reasonable guess about what type of conclusions and decisions they would like to draw can be ascertained, possibly through interviews and questionnaires. Also, since they are likely to be distributed all over the state, and maybe even outside it, the warehouse needs to be web-enabled. They should also be able to copy sections of information (data cubes) onto their own machines and operate on them. Since most of them are not likely to be computer professionals, the entire operation should be seamless and transparent. All these factors should be taken into account while finalising the optimisation parameters.

ii) Custodians: As seen earlier, the dictionaries and maps are maintained by custodians. In addition, the object entities themselves need to be maintained by custodians. All these dictionary custodians, map custodians and entity custodians will be responsible for maintaining the entities and also for incorporating changes from time to time. For example, the way the government treats a particular caste (SC/ST/backward), a village (backward/forward/sensitive) or even persons may change from time to time. All that the government does is to issue a notification to that effect. The concerned custodians will be responsible for maintaining the validity of the entities of the warehouse. However, again, they are not likely to be computer professionals (at least the map and dictionary custodians), and hence they should be able to view the entities in the way they are accustomed to and be able to manage them.

iii) Warehouse developers, administrators and database administrators: They are the persons who are actually responsible for the day to day working of the warehouse. They will be able to look at the repository from the practical point of view and decide about its capabilities and limitations. Their views are most sought after when deciding about the viability or otherwise of the warehouse.

iv) Citizens: The Government plans to make certain categories of data available to
ordinary citizens on the web. Since their background, type of information they are
looking for and their abilities to interact are not homogeneous, generalised assumptions
are to be made about their needs and suitable queries made available.



5. CONVERSION OF DATA TO INFORMATION:
Once the sources and users are identified, methods of converting raw data into
useful information are to be explored. Needless to say, this is the key to the success of
the warehouse.
The normal methods employed are:
i. Data Conversion
ii. Data Scrubbing
iii. Data Transformation
iv. Web Publishing

i. Data Conversion: The different inputs to the warehouse come from various data capture systems - online, disks, tapes etc. Such information, coming from different OLTP systems, needs to be accepted and converted into suitable formats before loading onto the warehouse (called the Core Object Repository). Standard software like Oracle's SQL loader can do the job.
Once the data becomes available on tape, floppy or any other input form, the warehouse manager checks its authenticity, then executes the routines to store it in the warehouse memory. He may even take printouts of the same. Barring the warehouse manager, other users and custodians are not allowed to modify the data in the warehouse. They can only send the data to be updated to the manager, who will make the necessary updates. Typically, the data in the warehouse becomes unavailable to all or a set of users during such updates.
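
Although the case study used Oracle's loader utilities, the general idea of accepting flat input and loading it into a staging table of the repository can be sketched with standard Python and SQLite; the record layout below is an assumption for illustration only.

# Accepting flat-file input and loading it into a staging table of the repository.
import csv
import io
import sqlite3

incoming = io.StringIO("ssid,name,occupation\n1001,Ravi,Farmer\n1002,Lakshmi,Teacher\n")
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person_staging (ssid INTEGER, name TEXT, occupation TEXT)")

reader = csv.DictReader(incoming)     # the incoming medium could be tape, disk or file
conn.executemany(
    "INSERT INTO person_staging VALUES (:ssid, :name, :occupation)",
    list(reader),
)
conn.commit()
print(conn.execute("SELECT COUNT(*) FROM person_staging").fetchone()[0], "rows loaded")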

ii. Data Scrubbing: Data scrubbing is the process of checking the validity of the data arriving at the data warehouse from different sources, to ensure its quality and accuracy as well as its completeness. Since the data originates from different sources, it is possible that some of the key data may be ambiguous, incomplete or missing altogether. Further, since the data keeps arriving at periodic intervals, its consistency with respect to the previously stored data is not always guaranteed. Such invalid data, needless to say, leads to false comparisons.



Further, over a period of time, simple inconsistencies like misspelt names, missing fields and inconsistent data (like place name and PIN) may accrue. No single method is available for dealing with all such shortcomings. Several algorithms, ranging from simple to fairly complex ones, are used to filter out such inconsistencies. In extreme cases, the sources of data may have to be requested for resubmission.
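
A very small, hedged sketch of such a scrubbing pass is given below; the records, the reference PIN table and the two rules are invented, and real scrubbing involves many more checks.

# A simple scrubbing pass: flag records with missing or inconsistent key fields.
records = [                                   # invented incoming records
    {"ssid": "1001", "name": "Ravi", "village": "Village1", "pin": "571401"},
    {"ssid": "",     "name": "Sita", "village": "Village2", "pin": "571402"},
    {"ssid": "1003", "name": "Gopal", "village": "Village1", "pin": "999999"},
]
valid_pins = {"Village1": "571401", "Village2": "571402"}   # assumed reference data

def scrub(record):
    problems = []
    if not record["ssid"]:
        problems.append("missing ssid")
    if valid_pins.get(record["village"]) != record["pin"]:
        problems.append("place/PIN mismatch")
    return problems

for rec in records:
    issues = scrub(rec)
    print(rec["name"], "->", "clean" if not issues else issues)
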
iii. Data Transformation: This process involves the extraction of data from the information repository, scrubbing it, and loading it into the main database. The process includes identifying the dimensionalities, storing the data in appropriate formats, and may also involve indicating to the users that the data is ready for use.

iv. Web Publishing: This becomes important if the warehouse is to be web enabled. The web agent on the server, which interacts with the HTML templates, reads the data from the server and sends it to the web page. The agent, of course, has to resolve the access rights of the user before populating the information on the web page. This becomes extremely important when, for example, citizens are allowed to access certain sections of information, while many others are to be made inaccessible. The system administrator is expected to tackle the various issues regarding such selective access rights by suitably configuring the server.

6. IDENTIFYING THE HARDWARE AND SOFTWARE COMPONENTS:


While no specific guidelines regarding the hardware components can be given, it is desirable to store the data from external data sources separately, at least during the data conversion stage. In the present case, the two data sources, the Multipurpose Household Survey (MPHS) data and the land data extracted from land records, can be stored on two separate sets of storage devices.
The data, after the scrubbing operation, is normally stored in an RDBMS, like Oracle or Sybase. These are usually relational databases, from which the multidimensional database server (MDDB) receives its data. In the present case, Oracle8 Enterprise Edition was deployed, because it supports both relational and object-relational models.


The next important component is the multidimensional database server (MDDB).
This is a specialised data storage facility to store summarised data for fast access. Since,
unlike the two dimensional relational DBMS, this operates on a multidimensional logical
perspective, traversal along one/more of these dimensions either successively or in
parallel becomes easier and also faster.
Of course, the choice of the number of dimensions and their actual relationships forms an important design strategy. Too few dimensions can make the operational efficiency similar to that of simple relational models, whereas the choice of too many dimensions would make the operation complex (since physically the server and RDBMS still work as two-dimensional operators, the multidimensional operation being only a logical extension).
The concept can be extended to several levels. A given dimension can be a simple one, or can itself be made up of several dimensions. For example, the concept of person can be made up of the dimensions sex, age group, occupation, caste and income. The dimension age group may have dimensions along the actual age; income may have dimensions like assured income and non-assured income etc. The consolidation of data into several dimensions is a tricky job. Often data is collected at the lowest level and is aggregated into higher level totals for the sake of analysis.
Since the data is to be accessed on the web, a web server of suitable capacity is of prime importance. The web server receives query requests from the web, converts them into suitable queries, hands them to the MDDB server, and the replies from the MDDB server are sent back to the web, to be displayed to the person who raised the query request.
The query itself can be raised either by i) clients, which are computers connected physically to the servers, or ii) web clients, where the user requires the replies over the Internet. The government may also provide "kiosks", special terminals where users can get the required information through 'touch screen' technology.



7. TYPE OF EXPECTED QUERIES
The usually encountered types of queries are to be listed out next. This can easily be done by interacting with the potential users and noting down their expectations of the warehouse.
For example, in the present case, the person object may be used to answer questions such as:
1. Relationship amongst members
2. Educational levels of persons
3. No. of persons above poverty line
4. Average income per household
5. Percentage of persons owning houses etc. etc.

Similarly on the land object, questions normally asked are


1. Land under cultivation
2. Percentage of irrigated land
3. Percentage of crops in the land
4. Average land holdings of person
5. Yield seasonwise etc. etc.

While users like the planners are likely to come out with special and newer queries, average citizens often end up asking similar questions. This, apart from the fact that many of them may not be computer savvy, makes a case for producing several "canned query modules". That is, the user has no option of formulating his own queries, but can choose to get answers to one or more of the readymade questions. Such questions can be put on the kiosks, and the user gets the answer by choosing them with a suitable pointer device.
At the next level, the user may be provided with a "custom query module", which helps him formulate queries and get the answers. It may also help the user change certain parameters and get results that help in formulating policies. Further, such custom queries may be either summary ones or detailed. The latter help the users in micro-level analysis of information.
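
A canned query module can be pictured as little more than a numbered list of ready-made query sequences. The sketch below (with an invented table, invented figures and an assumed poverty-line threshold) shows the idea; the user only picks a number.

# A toy "canned query module": numbered, ready-made queries the user can only choose from.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (name TEXT, income INTEGER, owns_house INTEGER)")
conn.executemany("INSERT INTO person VALUES (?, ?, ?)",
                 [("Ravi", 24000, 1), ("Sita", 9000, 0), ("Gopal", 15000, 1)])

CANNED = {   # invented thresholds and queries
    1: ("Persons above the poverty line",
        "SELECT COUNT(*) FROM person WHERE income > 12000"),
    2: ("Average income per person",
        "SELECT AVG(income) FROM person"),
    3: ("Percentage of persons owning houses",
        "SELECT 100.0 * SUM(owns_house) / COUNT(*) FROM person"),
}

choice = 3   # the number picked on the kiosk with a pointer device
title, sql = CANNED[choice]
print(title, ":", conn.execute(sql).fetchone()[0])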



8. CHOICE OF DATA STRUCTURES AND DIMENSIONS
The next stage is the choice of data structures and dimensions. Dimensions are a series of values that provide information at various levels in the hierarchy. For example, in this particular case, the item person has been given 20 dimensions, which are listed below with their dimension numbers:

1. Occupation (D1)
2. Age (D2)
3. Sex (D3)
4. Caste (D4)
5. Religion (D5)
6. Shelter (D6)
7. SSID (D7)
8. House (D8)
9. Khata number (D9)
10. Crop (D10)
11. Season (D11)
12. Nature (D12)
13. Irrigation (D13)
14. Classification (D14)
15. Serial Number (D15)
16. Land (D16)
17. Area (D17)
18. Time (D18)
19. Occupant (D19)
20. Marital Status (D20)

Of course, there was no particular reason for this ordering of the dimensions, and any other order of dimensions would have been equally viable. Note that each of these dimensions can be considered to be at level 1, but they can have lower level values at level 2, level 3 etc.


For example:
Level 1: All Occupations
Level 2: Occupation 1, Occupation 2

Now if someone is searching for a person with a specific occupation (say Occupation 2), then one will search along the occupation dimension (D1) at level 2, and so on.
Now consider another case, that of castes:
Level 1: All Castes
Level 2: Forward, Backward, Scheduled
Level 3: the individual castes under each group (Caste 1, Caste 2, ...)

Now all castes lie along dimension D4. If one needs some details about all castes, one searches along D4 at level 1. If details about backward castes are needed, one goes along D4 at level 2. For a particular caste, say Caste 3, the search will be along level 3 of D4.
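
The level-wise search along a dimension can be sketched as follows; the groupings and caste names are invented placeholders. Searching at level 1 returns all castes, at level 2 one group, and at level 3 one specific caste.

# Searching the caste dimension (D4) at different levels of its hierarchy.
caste_dimension = {          # invented level-2 groups and level-3 members
    "Forward":   ["Caste1", "Caste2"],
    "Backward":  ["Caste3", "Caste4"],
    "Scheduled": ["Caste5", "Caste6"],
}

def search_d4(level, key=None):
    if level == 1:                      # level 1: all castes
        return [c for group in caste_dimension.values() for c in group]
    if level == 2:                      # level 2: one group, e.g. "Backward"
        return caste_dimension[key]
    if level == 3:                      # level 3: one specific caste
        return [key] if any(key in g for g in caste_dimension.values()) else []

print(search_d4(1))
print(search_d4(2, "Backward"))
print(search_d4(3, "Caste3"))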



Take the case of the area dimension, D17. Look at the hierarchy:
Level 1: State
Level 2: District 1, District 2
Level 3: Taluk 1, Taluk 2, Taluk 3, Taluk 4
Level 4: Village 1, Village 2, ...

Now any search along D17 for specific taluks will proceed along level 3, for a specific district at level 2, and so on.
Once the above structures are frozen, the analysis becomes simple. Most database packages provide i) specific queries to search along specific levels of a dimension in a truly multidimensional database, while ii) in a simple relational database, the multiple dimensions need to be searched as relations at the appropriate level.
Since the analysis part is software specific, it does not come under the purview of this case study, but it suffices to say that any general query can be broken into a sequence of search commands at the appropriate levels.
Canned query modules are simply a list of such sequential query combinations, each combination answering a particular 'canned query' and identified, possibly, by a number. Once the number is selected, the sequence of searches is made and the results displayed.
In the other case of "custom query modules", the GUI helps the user convert his queries into a sequence of system queries, so that they can be implemented.
While the above discussion provides a basic structure for the implementation, several details, like the handling of historic data, providing time-dimensioned reports etc., have been omitted. But it suffices to say that those details will be add-ons to the basic analysis package.

9. TERM EXERCISE
Suggest a suitable data warehouse design to maintain the various details of your college. While the actual query formulations are not very important, the various system requirements need to be worked out in detail and presented in a step by step manner.

