Sri. K. Raghuveer,
Asst. Professor
Dept. of CS&E, NIE, Mysore.
In this unit, you are introduced to the basic concepts of a data warehouse.
The first block deals with the fundamental definitions of a data warehouse, explains how
warehouses are an extension of the database concept while still being different from a DBMS,
and also depicts the various terminologies used in the context of a data warehouse.
The second block describes, in detail, a typical data warehouse system and the
environment in which it operates. The various data models for the data warehouse, the
various software tools used and also the two broad classifications of the data
warehouses- namely the relational and multidimensional warehouses are discussed. You
will also be introduced to some of the software available for a data warehouse designer.
The third block gives a step by step method of developing a typical data
warehouse. It begins with the choice of the data to be put in the data warehouse, the
concept of metadata, the various hardware and software options and the role of various
access tools. It also gives a step-by-step algorithm for the actual development process
of a data warehouse.
BLOCK INTRODUCTION
In this block, you are briefly introduced to the concept of a data warehouse. It is
basically a large storehouse of data, from which users can access and view data in
various formats. Further, warehouses allow the users to do computations, so that they can derive
maximum benefit from the data. For example, having data about the previous 5 years'
sales is good, but the ability to do computations on that data, so that you can predict the sales
for the next year, is far more valuable. Similarly, if by comparing your business
performance with your competitor's you can discover business patterns, that opens up excellent
opportunities.
You will also be introduced to the concept of a data mart, which can be thought of
as a subsection of a warehouse and which allows you to view only those data that are
of interest to you, irrespective of how big the actual database is. You will also be
introduced to the basic requirements that the data warehouse or the data mart should
satisfy, and to some of the implementation issues involved.
Contents
1. Data warehousing
2. Datamart
3. Types of data warehouses
4. Loading of data into a mart
5. Data model for data warehouse
6. Maintenance of data warehouse
7. Metadata
8. Software components
9. Security of data warehouse
10. Monitoring of a data warehouse
1. DATA WAREHOUSING
A data warehouse can be thought of along lines similar to any other warehouse, i.e. a
place where selected and (sometimes) modified operational data are stored. This data can
answer queries which may be complex, statistical or analytical. A data warehouse is
normally the heart of the decision support system (DSS) of an organization. It is essential
for effective business operation, manipulation, strategy planning and historical data
recording. With the business scenario becoming more and more competitive, and the
amount of data to be processed, by a rough estimate, doubling
every 2 years, the need for a very fast, accurate and reliable data warehouse cannot be
overemphasized. Proper organization and fast retrieval of data are the keys to the
effective operation of a data warehouse.
Unfortunately, such rhetoric can explain the situation only up to a certain degree.
When one gets down to brass tacks on the actual organization of such a warehouse,
questions arise: are there any simple rules that govern data warehousing operations? Do
the rules remain the same for all types of data and all types of analysis? What are the
tradeoffs involved? How reliable and effective are such warehouses? This course
answers some of these questions. However, the concept of a data warehouse is a relatively
new one and is still undergoing lots of transformation. Hence, in this work you find only
pointers and guidelines for the effective operation of such a warehouse. The actual
operation needs a lot more skill than simple knowledge of the ground rules.
Before we venture into the warehouse details, we see what types of data will
essentially be handled there. Normally, they are classified into 3 groups.
i. Transaction and reference data: These are the original data, arising out of the
various systems, and are comparable to the data in a database. They arrive at regular
intervals and are also removed (purged) when their useful lifespan is over.
However, the purged data is normally archived onto tape or other such devices.
ii.
iii.
2. DATA MART
A data warehouse holds the entire data, of which the various departments of the
decision support system would need only portions. These portions are drawn into data
marts. The full warehouse, on the other hand, tends to become
complex, unwieldy and difficult to understand and manage after a time. It becomes
difficult to customize, maintain and keep track of. Further, as the volume of data increases,
the software needed to access the data becomes more complex and the time complexities
increase. (It can be compared to a very large library, where searching for the book you need
becomes that much more difficult.)
A data mart, in contrast, can be thought of as a small departmental library,
which is small, elegant and easy to handle and customize. It is easy to sort, search or
structure the data elements without any global considerations. Obviously the hardware
and software demands are manageable. Once in a while, some data
may not be available in the data mart, in which case it can be traced back to the data warehouse.
It is to be noted that the type of data that flows from the warehouse to the data
mart is of the current level type. The derived data and denormalised data are to be prepared
at the data mart level itself.
However, many of the issues that affect the warehouse affect the data marts also.
For simplicity, you can view the data warehouse as a collection of several data marts.
Hence, whatever we say about a data mart can be extended to a warehouse, and vice
versa, unless specified otherwise.
The data mart is loaded from the data warehouse using a load program. The
loading program takes care of the following factors, before loading the data from the data
warehouse to the data mart.
i) Frequency and schedule, i.e. when and how often the data is to be loaded.
ii) Total or partial refreshment, i.e. whether all data in the data mart is
modified (replaced) or only part of it.
iii) Integrity of data, i.e. the data should not get unintentionally modified
during the transfer, and the data in the data mart should, at all times,
match that in the warehouse.
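To make these factors concrete, here is a minimal sketch of such a load program in Python. Everything in it is illustrative: the function names, the row format and the checksum-based integrity test are assumptions made for this example, not part of any particular warehouse product, and the frequency/schedule factor would in practice be handled by running such a program under a scheduler.

```python
import hashlib

def table_checksum(rows):
    """Hash the rows so the mart's copy can be verified against the warehouse."""
    digest = hashlib.sha256()
    for row in sorted(map(repr, rows)):
        digest.update(row.encode())
    return digest.hexdigest()

def load_data_mart(warehouse_rows, mart, department, full_refresh=False):
    """Copy the department's slice of the warehouse into the mart.

    full_refresh=True replaces all mart data (total refreshment);
    otherwise only new rows are appended (partial refreshment).
    """
    # Select only the portion of the warehouse this mart needs.
    slice_ = [r for r in warehouse_rows if r["dept"] == department]

    if full_refresh:
        mart["rows"] = list(slice_)
    else:
        known = {repr(r) for r in mart["rows"]}
        mart["rows"] += [r for r in slice_ if repr(r) not in known]

    # Integrity: the mart must match the warehouse slice after the load.
    if table_checksum(mart["rows"]) != table_checksum(slice_):
        raise RuntimeError("data mart diverged from warehouse during load")

mart = {"rows": []}
warehouse = [{"dept": "sales", "amount": 120}, {"dept": "hr", "amount": 40}]
load_data_mart(warehouse, mart, "sales", full_refresh=True)
print(mart["rows"])  # [{'dept': 'sales', 'amount': 120}]
```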
7. METADATA
Most data warehouses and data marts come with metadata (data about data).
The metadata is a description of the contents and sources of the data in the warehouse, the type of
customization done on the data, a description of the data marts of the warehouse, its tables,
relationships etc., and any other relevant details about the data warehouse.
The metadata is created and updated from time to time, and is very useful for
the data analyst or the systems manager in manipulating the warehouse.
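As a concrete and purely hypothetical illustration, a metadata entry can be as simple as a structured record describing one table of the warehouse; all the field and table names below are invented for the example, not a standard.

```python
# A minimal, hypothetical metadata record for one warehouse table.
sales_metadata = {
    "table": "regional_sales",
    "source": "branch OLTP systems, extracted nightly",
    "customization": "amounts converted to a common currency",
    "columns": {
        "region": "text, one of the sales regions",
        "month": "text, formatted YYYY-MM",
        "amount": "numeric, total sales for the month",
    },
    "related_marts": ["sales_mart", "finance_mart"],
    "last_updated": "2001-06-30",
}

# An analyst's tool can answer "what is in this warehouse?" from such records.
for column, description in sales_metadata["columns"].items():
    print(f"{column}: {description}")
```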
8. SOFTWARE COMPONENTS
The software that goes into the data warehouse varies depending on the context
and purpose of the warehouse, but normally includes a DBMS along with access, creation and
management software.

9. SECURITY OF DATA WAREHOUSE

The system manager or the data administrator is normally
responsible for implementing the security measures. The normal methods used are:
i) firewalls, which are software that prevents unauthorized access into the data warehouse/
data mart; ii) logon/logoff passwords, which prevent unauthorised logins; iii)
The performance and contents of the warehouse need to be monitored closely, usually
by the system manager or data administrator. Monitoring is normally done by
"data content tracking": it keeps track of the actual contents of the data mart, attempts at
accessing invalid/obsolete data in the warehouse, the rate and kind of warehouse growth,
consistency issues (between the previously present data and the newly acquired data) etc.
While the monitoring of data is often a transparent operation, its success is very
important for ensuring the continued usefulness and reliability of the warehouse to the
common user.
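A toy sketch of what a "data content tracker" might compute is given below; the snapshot structure and the chosen metrics (row count, growth, vanished keys) are assumptions made purely for illustration.

```python
def track_content(previous_snapshot, current_rows):
    """Report growth and possible consistency issues in the warehouse."""
    current_keys = {row["id"] for row in current_rows}
    return {
        "row_count": len(current_rows),
        "growth": len(current_rows) - previous_snapshot["row_count"],
        # Keys that existed before but have vanished may signal inconsistency
        # between the previously present data and the newly acquired data.
        "missing_keys": sorted(previous_snapshot["keys"] - current_keys),
    }

snapshot = {"row_count": 2, "keys": {1, 2}}
rows_today = [{"id": 2}, {"id": 3}, {"id": 4}]
print(track_content(snapshot, rows_today))
# {'row_count': 3, 'growth': 1, 'missing_keys': [1]}
```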
10. SUMMARY
In this block, you were introduced to the basic concepts of a data mart and a data
warehouse, the types of data warehouses, and how a data warehouse differs from a
database: essentially, the warehouse is a multidimensional concept whereas a database is a
relational one.
You were also introduced to the concept of metadata which contains data about
the data of the warehouse and helps the future users as well as developers to deal with the
warehouse.
Review Questions
1)
2)
3) Replacement of old data is called _________________
4)
5)
6)
7)
8)
9) The compatibility between the previously available data and new data is
called _________________
10) The main difference between transaction data and derived data lies in
_________________
Answers
1. Datamart
2. Multidimensional
3. Refreshing
4. Purging
5. Metadata
6. Data content tracking
7. Archived
8. Derived data
9. Consistency
10. Computations
BLOCK - II
A TYPICAL DATA WAREHOUSE SYSTEM
Contents
1. A typical data warehouse
2. A typical data warehousing environment
3. Data Modeling star schema for multidimensional view
4. Various Data models
5. Various OLAP tools
6. Relational OLAP
7. Managed query environment
8. Data warehousing products: state of the art
IBI Focus Fusion
Cognos Power Play
Pilot Software
9. Summary
A data warehouse system should be able to accept queries from the user, analyze
them and give the results. Such systems are sometimes called Online Analytical
Processing (OLAP) systems.
A database, by contrast:
1. Can handle only current data.
[Figure: Different views of a data warehouse - data from several sources (A, B, C, D) passes through extraction and transformation before reaching the user's view.]
The different sources may be available on different storage devices. They are
extracted, put in a common place (only the relevant portions) and transformed, so that
they conform to a common format.
[Figure: One typical warehouse example - student results for Sub 1, Sub 2 and Sub 3 from School 1 and School 2 are combined into a regional result.]
[Figure: A star schema - the fact table "Marks of the region" at the centre, surrounded by the Subject and School dimension tables.]
Now, anybody with a little imagination could visualise that the simple star
structure depicted above suffices only for reasonably simple and straightforward
databases. A practical data warehouse will be much more complex, in terms of the
diversity of the subjects covered, their interrelationships and also the various
perspectives. In such a scenario, adding more dimensions, thus increasing the scope of
the attributes of the star schema, could solve the problem only up to an extent. But
sooner rather than later, this structure collapses. To avoid such breakdowns, a better
technique called the multifact star schema or the snowflake schema is used. The main
problem of the simple star schema, its inability to grow beyond reasonable dimensions, is
overcome by providing aggregations at different levels of the hierarchies in a given
dimension.
In essence, we are not abandoning the star schema concept, but building
on it. Simply put, we are dividing the complex schema into smaller schemas (and these into
still smaller schemas, if necessary) and combining these back to get the complete
data model. Hence the name "multifact star schema" or "snowflake schema".
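To ground the idea, here is a minimal sketch of the school-results star schema from the figure above, written with Python's built-in sqlite3 module. The table and column names are illustrative choices, not prescribed by any tool; snowflaking would go one step further and normalize a dimension (for example, splitting region information out of the school table).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension tables surround the central fact table (hence the 'star').
cur.executescript("""
CREATE TABLE school  (school_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE subject (subject_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE marks (                    -- fact table: marks of the region
    school_id  INTEGER REFERENCES school(school_id),
    subject_id INTEGER REFERENCES subject(subject_id),
    marks      INTEGER
);
""")
cur.executemany("INSERT INTO school VALUES (?, ?)",
                [(1, "School 1"), (2, "School 2")])
cur.executemany("INSERT INTO subject VALUES (?, ?)",
                [(1, "Sub 1"), (2, "Sub 2"), (3, "Sub 3")])
cur.executemany("INSERT INTO marks VALUES (?, ?, ?)",
                [(1, 1, 70), (1, 2, 65), (2, 1, 80), (2, 3, 90)])

# A typical analytical query: the regional average for each subject.
for row in cur.execute("""
    SELECT subject.name, AVG(marks.marks)
    FROM marks JOIN subject USING (subject_id)
    GROUP BY subject.name
"""):
    print(row)
```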
There are several hybrid approaches also, i.e. approaches that integrate the two methods at
various levels, for example by having a table of hierarchies; such systems are
usually multirelational. However, all these tools basically implement the star
schema. As a rule of thumb, one can say that relational OLAP is used where the
complexity of the applications is at the lower end and the performance
expectations and development effort are limited. However, as the complexity increases, the relational models
generally become both unwieldy and less efficient performance-wise, and the choice is
normally multidimensional OLAP.
With this introduction, we now look into the typical multidimensional and
Relational Architecture.
6. RELATIONAL OLAP
The main strength of this architecture is its simplicity and universality.

[Figure: Relational OLAP - an info request goes to the ROLAP server, which, guided by metadata, issues SQL against the relational database and passes the result set back to the user.]

The architecture can support any number of layers of data, and its
main advantage is that new data can be added as additional layers without affecting
the existing ones.
The other
advantage is the availability of several strong SQL engines to support the complexity of
multidimensional analysis. Relational databases have matured over several years, and
hence all the available expertise can be used to provide powerful search
engines to support that analysis. This includes creating
multiple SQL statements to handle complex user requests, optimizing these statements
using standard RDBMS techniques, and searching the database from multiple points.
But what possibly makes relational OLAP most attractive is its flexibility, and also
the availability of products that can work efficiently on un-normalised database designs.
However, relational OLAP comes with its limitations (in spite of the
optimizations at the DBMS level described above). Thus, recently, relational OLAP
has been shifting towards middleware technology to simplify its visualization,
design and applications. Also, instead of pure relational OLAP, hybrid systems are
coming into existence which make use of relational operations only to the extent to
which they remain convenient, and beyond that level make use of other methods.
[Figure: Multidimensional OLAP - data is loaded into the MOLAP server, which, guided by metadata, processes info requests and returns result sets.]
This is the other style of design available to a data warehouse designer. Here, the
data is basically organized in an aggregate form, i.e. instead of the simple two-dimensional
tables of a relational system, the data is held in multidimensional structures.
This discussion brings out one major limitation of such systems, namely
maintainability. Any database worth its name will not be static. New data are added,
old ones are deleted and, what is more, certain relationships may get modified.
Incorporating such frequent changes into a multidimensional OLAP can be tricky.
Several suppliers provide standard tools which, to some extent, take care of such
modifications, but they can be useful only if the changes are not too drastic.
Thus, multidimensional OLAP is best utilized for applications that require
iterative operations and comprehensive time series analysis of trends.
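A toy illustration of what "organized in an aggregate form" means in practice: base cells keyed by dimension coordinates, with roll-ups precomputed so that analytical queries become lookups. The cube contents reuse the school/subject example; the "ALL" marker and the dictionary representation are assumptions made for this sketch only.

```python
from collections import defaultdict
from itertools import product

# Base cells keyed by (school, subject): a tiny multidimensional array.
cube = {
    ("School 1", "Sub 1"): 70, ("School 1", "Sub 2"): 65,
    ("School 2", "Sub 1"): 80, ("School 2", "Sub 3"): 90,
}

# Precompute aggregates over every dimension ('ALL' marks a rolled-up axis).
rollup = defaultdict(int)
for (school, subject), marks in cube.items():
    for s, t in product((school, "ALL"), (subject, "ALL")):
        rollup[(s, t)] += marks

# Queries against aggregates are simple lookups, not table scans.
print(rollup[("ALL", "Sub 1")])      # 150: regional total for Sub 1
print(rollup[("School 2", "ALL")])   # 170: total for School 2
print(rollup[("ALL", "ALL")])        # 305: grand total
```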
7. MANAGED QUERY ENVIRONMENT

In this approach, a desktop tool works with a relational and/or
multidimensional OLAP server to simplify the query processing environment, hence the
name managed query environment (MQE). Though the actual implementations differ,
the concept can be highlighted in the following manner:
This approach provides for ease of operation and administration especially when
the end user is reasonably familiar with RDBMS operations. It is cost effective and
efficient at the same time.
However, certain shortcomings persist with the data cubes being built and
maintained on separate desktops. The factors of data redundancy and data integrity need
to be addressed more effectively. Also, in multiuser systems, if each user chooses to
maintain his own data cubes, the system will come under a lot of strain and data
consistency may take a beating. Thus, this method can be effective only when the data
volumes are small.
8. DATA WAREHOUSING PRODUCTS: STATE OF THE ART
Data mining and data warehousing being relatively new and fast growing
fields, any commitment on the state of the art of the technology is hazardous. Further,
because of the intuitive approach to the problem by the corporates, newer and better tools
keep flooding the market. What we discuss in the next few pages can be taken as a
guideline to the type of products available; it should not be taken as an exhaustive list of
the options available.
Now, to the features of some of the tools available. (You may not be able to work
on them, but the discussions tell you about the fascinating number of options provided by
them.)
IBI Focus Fusion
It combines a high performance multidimensional database engine with, among others, the following features:
1. Fast query and reporting: advanced indexing, parallel query and roll-up
facilities provide high performance.
In principle, Power Play manages the query analysis as a process that runs on a
population of data cubes.
Pilot Software
It is a package of several PILOT decision support tools that form a high speed
multidimensional database. Some of the software that form the core of the offerings are:
4. PILOT Desktop: used for the navigation of multidimensional databases.
5. PILOT sales and marketing analysis library: Provides applications that allow
sophisticated sales & marketing models to be visualized. It also allows the
user to modify the tools to satisfy specific deviations.
6. PILOT internet publisher: allows users to access PILOT databases via
browsers on the internet.
The main advantage of having such differentiated tools is that it is easy to modify &
customize the applications.
The other features that are common to the PILOT software are:
1. Many of them provide time as one of the dimensions, so that periodic reports,
updations and shifting from one period to another become straightforward.
2. Provide integrated, predictive data mining in a multidimensional environment.
3. Provide for compression of sparse cells (those cells which have no value, but
still form part of the matrix), compression of old (time-based) cells, and
implicit declaration of some dimensions (they need not be explicitly specified in
the query, but are automatically calculated, as long as they are defined as
attributes of certain other dimensions), along with the creation of dynamic variables etc. All
these features decrease the total size of the database and hence reduce the time
for navigation, without actually losing data.
4. Allow for seamless integration with existing OLTP systems. The users can also specify
the views of the database that they frequently refer to, and the system self-optimizes
the relevant queries.
Summary
In this block, you were introduced to the differences between the OLTP
(database) and OLAP (Warehouse) concepts. Some of the concepts underlying a typical
data warehouse were discussed in brief. You also learnt about the star schema modelling
and about three commonly used tools: IBI Focus Fusion, Cognos Power Play and Pilot
Software.
Review Questions
1. OLAP stands for _________________
2. OLTP stands for _________________
3. In an OLAP system, the volume of transactions is _________________
4. A _________________ manages both current and historic transactions.
5. A star schema is organised around a central table called _________________ table.
6. _________________ are locally situated multidimensional data sets, which form
subsets of the data warehouse.
7. _________________ is the ability of the application to grow over a period of time
8. _________________ software comes with special, business oriented applications.
9. Power cube creations are normally scheduled for _________________ periods to
reduce the load on the system
10. DSS stands for _________________
Answers:
1. On Line Analytical Processing
2. On Line Transaction Processing
3. Low
4. OLAP
5. Fact Table
i. Hardware platforms
ii. The DBMS
iii. Networking capabilities
Add to this the changes that keep taking place. Entire business models keep
getting modified, if not totally being discarded and we get a reasonable perspective for
efficient data warehousing.
Hence, the need to organize, maintain large amounts of data, so that they can be
analyzed within minutes in the manner and depth desired becomes important. Thus, one
cannot fail to identify the need for efficient data warehousing strategies.
Before we start looking into the actual design aspects of a data warehouse, we
should also see why the conventional information systems could not meet these
requirements. The conventional DBMS systems originated basically for homogeneous
and platform dependent applications.
They were suited to environments where the data
changes slowly, and also to situations where the search times were reasonably high. But
with the advent of very fast CPUs and larger and cheaper disk space, the ability and the
need to work on very large, dynamic databases was felt. (The concept of
networking, with ever increasing bandwidths, made the available data, as well as the
demand for it, grow rapidly.)
Having once again assured ourselves about the basic features involved in data
warehouses, in the following sections we survey the issues involved in building a
warehouse, beginning with the design approaches, architectures, design trade-offs, the
concept of metadata, data rearrangement, tools and finally the various performance
considerations.
Alternatively, one can begin at the lower end, combining sub-data marts into data marts
and data marts into the data warehouse, to get all possible analyses out of the warehouse.
However, the discussion is not just about systems and programming. One will also
have to look into the location of the various departments, the levels of interaction
between them, the paths of data flow, the sources of data, the demand centres of
analysed information etc., and arrive at a suitable model. Often, a suitable combination of
top down and bottom up designs (or further combinations thereof) is used.
However, there are three major issues in the development of a data warehouse,
that need very careful consideration.
1. The available data will, more often than not, be heterogeneous in nature, i.e.
since the data came from various unconnected sources, it needs to be converted
to some standard format, with reference to a uniformly recognised base. This
requires a fair amount of effort and ingenuity.
2. The data has to be kept current and its integrity
maintained, i.e. with the passage of time, the data becomes obsolete and requires
updation. Again, because various pieces of data are from different sources, a
substantial amount of effort is required to upgrade them uniformly to maintain
data integrity. Since important decisions are taken based on the data values, their
reliability and authenticity should be beyond doubt at all times.
3. Mainly because of the above considerations, and also because of the constant
inflow of the latest data, the warehouse tends to grow out of proportion very quickly.
Thus, one can safely presume that the design of a warehouse is definitely more
complex and tricky than a database design. Also, since it is business driven and
business requirements keep changing, one can safely say it is not a one-time job but
a continuous process.
4. DATA CONTENT
Compared to a Database, a warehouse contains data which need to be constantly
monitored and modified, if found obsolete. Also, the level of abstraction in a data
warehouse is more detailed, partly to facilitate ease of analysis and partly to ensure ease
of maintenance.
Thus, the data models used in a data warehouse are to be chosen based on the
nature, content and the processing pattern of the data warehouse. Before the data is
actually stored, one will have to clearly identify the major components of the model,
their relationships, including the entities, attributes, their values and the possible keys.
But the more difficult task is for the designer to identify the query
processes and the path traveled by a query. Because of the varying nature of queries, this is
more easily said than done. A careful analysis of the query types, their
frequency etc., before arriving at the most optimal storage patterns, is the key to a
successful design. In addition to optimising the data storage for high query
performance, one should also keep in mind the data storage requirements and data
loading performance of the system.
Thus, no specific rules for the design can be prescribed, and a lot of fine-tuning
based on experience needs to be done. Further, since the data handled will normally be
voluminous, a decision on its actual distribution, whether on a single server or on several
servers on the network, is to be taken. It can also be divided based on region, time
period etc.
5. METADATA
Since the type of data in a warehouse in voluminous contentwise and varying in
terms of the models, the relationships between the databases, amongst the mselves and
with the warehouse in total, needs to be made known to the endusers and the endusers
tools. The metadata defines the contents and the location of the data in the warehouse.
This would facilitate further updating and maintenance of the data warehouse. It is
used by the users to find the subject areas and the definitions of data. It also helps the
users to modify and update the data and datamodels. It essentially acts as a logical link
between the decision support system application and the data warehouse.
Thus, a data warehouse designer would also create a metadata repository which
has access paths to all important parts of the data warehouse at all points of time. The
metadata works like an access buffer between the tools and the data, and no user or tool
can directly meddle with the data warehousing environment. The actual choice of the
format for the metadata, of course, is left to the designer.
No doubt, the metadata should be able to effectively address the database and the
tools that are used. Further, an injudicious choice of tools, or diluting the design
specifications to accommodate the tools, may result in an inefficient data warehouse
which will soon become unmanageable.
In such a situation, all that the designer has to do is to start somewhere and get
going. The most common technique is to develop a data mart and gradually grow it into a
full-fledged data warehouse.
3. The dimension table design is the next important step, which converts the fact
table into a multidimensional table. Each dimension normally refers to one set
of related activities, and leads to a multidimensional or relational database,
as the case may be. The dimensions are the source of new headers in the
users' final reports. Since the choice of the dimensions freezes the data
warehouse specifications to some extent, sufficient thought must be given to
the future growth of the warehouse, and of the organisation itself.
Duplicate or superfluous dimensions should be avoided, without
compromising the long range perspectives of the warehouse. However,
if two data marts end up having the same dimensions, they should conform to
each other. This ensures ease of standardising the queries.
4. The choice of the facts, though it appears simple, can sometimes be tricky,
especially if step 1 above is not carried out properly. All facts that pertain to
the dimensions should be correctly identified, and their links to other data
items are to be ascertained.
5. The relations between the various entities are expressed in terms of precalculations and are stored in the fact tables.
6. This stage involves the choice of the number, content and dimensions of the
various tables used in the operation. While the selection may appear simple,
one has to note that choosing too few tables would make each of them too
voluminous and hence make query processing inefficient. On the other
hand, too many small tables would create problems of storage, consistency
and data integration.
Steps 8 and 9 require several iterations, spread over a period of time, and possibly
would involve accommodating conflicting priorities.
8. CONSIDERATIONS OF TECHNOLOGY:
i) Hardware platforms
ii) The DBMS
iii) Networking infrastructure
iv)
v) Software tools
But it is needless to say that for maximum efficiency, each major component of
the system should be selected such that optimum performance and scalability are achieved.
ii) Choice of the DBMS: This is as important as the hardware selection, if not more
so, since it determines the performance of the warehouse to no lesser an extent.
Again, the parameters remain the same: scalability, the ability to efficiently handle
large volumes of data, and speed of processing.
Almost all the well known DBMSs (Oracle, Sybase, DB2) support parallel
database processing. Some of them also provide special features for operating data cubes
(described in the previous chapter).
iii) Networking: Most warehouses operate within a network
(within the organisation), and a few may also work in an internet environment (web
enabled). The choice to put the warehouse on a network is itself decided by various
factors, like security and privacy on one hand, counterbalanced by accessibility and
spread on the other. While not much extra networking hardware may be needed (apart
from that normally used) for warehousing, the software considerations and planning
process tend to become definitely more complex.
1. Statistical analysis
2. Data visualisation, production of graphical reports
3. General statistical analysis
4. Complex textual search (text mining)
SUMMARY
You have been briefly introduced to the various stages of a data warehouse
development, with an algorithm emerging out of the discussions. The stages, namely
collection of requirements, creating a data model, indicating data sources and data users,
choice of hardware and software platforms, choice of reporting tools, connectivity tools
and GUI and refreshment of data periodically form the core of any data warehouse
development.
The next unit, which is a case study, is to be studied, bearing in mind these
fundamentals.
1. The process of removing the deficiencies and loopholes in the data is called
____________ of data.
2. The design of the method of information storage in the data warehouse is defined by
the ___________.
3. ___________ provides pointers to the data of the data warehouse.
4. A reasonable prediction of the types of queries that are likely to arise helps in
improving the ___________ of search.
5. A balance between the ___________ processors and ___________ processors is
necessary for better performance of the data warehouse.
6. Name any two methods of identifying the business requirements, ____________ and
______________.
7. GUI stands for _________________
8. The two basic design strategies of OLTP are ___________ and _____________.
Answers:
1. Cleaning up
2. Data model
3. Metadata
4. Efficiency
5. Input/output, computational
6. Interviews and questionnaires.
7. Graphical User Interface.
8. Top Down and Bottom up
COURSE INTRODUCTION
We know lots of data is being collected and warehoused, with data collected and
stored at enormous speeds. Data mining is a technique for the semi-automatic discovery of
patterns, associations, changes, anomalies and rules in data. Data mining is
interdisciplinary in nature. In this course you study the importance of data mining,
the techniques used for data mining, web data mining and knowledge discovery in databases.
BLOCK - 1
DATA MINING
Data Mining - An Introduction
1.0 Introduction
1.1 What is data mining?
1.2 Few applications
1.3 Extraction Methods
1.4 Trends that Affect Data Mining
1.5 Summary
1.0 Introduction
The field of data mining is emerging as a new, fundamental area with important
applications to science, engineering, medicine, business and education. Data mining
attempts to formulate, analyze and implement basic induction processes that facilitate the
extraction of meaningful information and knowledge from unstructured data. Data
mining extracts patterns, changes, associations and anomalies from large data sets. Work
in data mining ranges from theoretical work on the principles of learning and the
mathematical representation of data to building advanced engineering systems that
perform information filtering on the web. Data mining is also a promising computational
paradigm that enhances traditional approaches to discovery and increases the
opportunities for breakthroughs in the understanding of complex physical and biological
systems.
Data mining is an interactive, semi-automated process that begins with raw data.
The results of the data mining process may be insights, rules or predictive models.
1.2
Few applications
a) Neural Networks - Neural networks are systems inspired by the human brain. A
basic example is provided by a back-propagation network, which consists of input
nodes, output nodes and intermediate nodes called hidden nodes. Initially, the nodes
are connected with random weights. During training, a gradient descent algorithm
is used to adjust the weights so that the output nodes correctly classify data presented
to the input nodes.
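As a much-reduced sketch of this idea, the following trains a single sigmoid unit (no hidden nodes, so not a full back-propagation network) by gradient descent, starting from random weights, until it classifies logical-AND data correctly. All details here (the AND task, learning rate, epoch count) are arbitrary choices made for illustration.

```python
import math, random

random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Training data for logical AND: inputs and target outputs.
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

# Initially the connections carry random weights (w[0] is the bias weight).
w = [random.uniform(-1, 1) for _ in range(3)]

for _ in range(5000):  # training epochs
    for (x1, x2), target in data:
        out = sigmoid(w[0] + w[1] * x1 + w[2] * x2)
        # Gradient of the squared error with respect to the unit's input.
        grad = (out - target) * out * (1 - out)
        for i, xi in enumerate((1, x1, x2)):
            w[i] -= 0.5 * grad * xi  # descend along the gradient

for (x1, x2), target in data:
    out = sigmoid(w[0] + w[1] * x1 + w[2] * x2)
    print((x1, x2), round(out, 2), "target:", target)
```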
b) Tree-based classifiers - A tree is a convenient way to break a large data set into
smaller ones. By presenting a learning set to the root and asking questions at each
interior node, the data at the leaves can often be analyzed very simply.
d) Ensemble learning - Rather than use data mining to build a single predictive model,
it is often better to build a collection or ensemble of models and to combine them, say
with a simple, efficient voting strategy. This simple idea has now been applied in a
wide variety of contexts and applications.
e) Linear algebra - Scaling data mining algorithms often depends critically upon
scaling the underlying computations in linear algebra. Recent work on parallel algorithms
for solving linear systems, and on algorithms for solving sparse linear systems in high
dimensions, is important for a variety of data mining applications, ranging from text
mining to detecting network intrusions.
f) Large scale optimization - some data mining algorithms can be expressed as large
scale, often non-convex, optimization problems.
g) Databases, Data Warehouses and Digital Libraries - The most time consuming
part of the data mining process is preparing data for data mining. This step can be
streamlined in part if the data is already in a database, data warehouse or digital
library, although mining data across different databases remains a challenge.
h) Visualization of Massive Data Sets - Massive data sets, often generated by complex
simulation programs, require graphical visualization methods for best comprehension.
j) Electronic commerce - Not only does electronic commerce produce large data sets in
which the analysis of marketing patterns and risk patterns is critical, but, unlike some
of the applications above, it is also important to do this in real or near-real time, in
order to meet the demands of on-line transactions.
1.3
Extraction Methods
Information extraction is an important part of any knowledge management
system.
The precision and efficiency of information access improves when digital content
is organized into tables within a relational database.
Wrapper induction
1.4 Trends that Affect Data Mining
Data trends: Perhaps the most fundamental external trend is the explosion of digital
data during the past two decades. During this period, the amount of data has probably
grown by six to ten orders of magnitude. Much of this data is accessible via
networks.
Hardware trends: The growing power and falling cost of
workstations enable the mining, with current algorithms and techniques, of data sets that
were too large to be mined just a few years ago. In addition, the commoditization of high
performance computing through workstations and high performance workstation clusters
enables attacking data mining problems that were accessible only to the largest
supercomputers a few years ago.
Business trends - Today businesses must be more profitable, react quicker and offer
higher quality services than ever before, and do it all using fewer people and at lower cost.
With these types of expectations and constraints, data mining becomes a fundamental
technology, enabling businesses to more accurately predict the opportunities and risks
generated by their customers and their customers' transactions.
1.5 Summary
Data mining is the semi-automatic discovery of patterns, associations, changes,
anomalies, rules and statistically significant structures and events in data.
Answers
1. Knowledge
2. Neural networks
3. Natural language processing, wrapper induction
BLOCK - 2
2.0 Introduction
In this unit you are going to study various data mining functions. Data mining
methods may be classified by the function they perform, or according to the class of
applications they can be used in. Data mining functions are helpful in solving real
world problems.
2.1 Classification
Data mine tools have to infer a model from the database, and in the case of
supervised learning this requires the user to define one or more classes. The database
contains one or more attributes that denote the class of a tuple and these are known as
predicted attributes whereas the remaining attributes are called predicting attributes. A
combination of values for the predicted attributes defines a class.
When learning classification rules, the system has to find the rules that predict the
class from the predicting attributes. So, firstly, the user has to define conditions for each
class; the data mine system then constructs descriptions for the classes. Basically, given
a case or tuple with certain known attribute values, the system should be able to predict
what class this case belongs to.
Once classes are defined, the system should infer the rules that govern the
classification; therefore the system should be able to find the description of each class.
The descriptions should only refer to the predicting attributes of the training set, so that
the positive examples satisfy the description and none of the negative examples do. A rule
is said to be correct if its description covers all the positive examples and none of the
negative examples of a class.
Exact rule - permits no exceptions, so each object of the LHS must be an element of the RHS.
Strong rule - allows some exceptions, but the exceptions have a given limit.
Probabilistic rule - relates the conditional probability P(RHS|LHS) to the probability P(RHS).
Other types of rules are classification rules, where the LHS is a sufficient condition to
classify objects as belonging to the concept referred to in the RHS.
2.2 Associations
Given a collection of items and a set of records, each of which contains some
number of items from the given collection, an association function is an operation against
this set of records which returns affinities or patterns that exist among the collection of
items. These patterns can be expressed by rules such as "72% of all the records that
contain items A, B and C also contain items D and E." The specific percentage of
occurrences (in this case 72) is called the confidence factor of the rule. Also, in this rule,
A, B and C are said to be on the opposite side of the rule to D and E. Associations can
involve any number of items on either side of the rule.
Another example of the use of associations is the analysis of the claim forms
submitted by patients to a medical insurance company. Every claim form contains a set of
medical procedures that were performed on a given patient during one visit. By defining
the set of items to be the collection of all medical procedures that can be performed on a
patient and the records to correspond to each claim form, the application can find, using
the association function, relationships among medical procedures that are often
performed together.
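A minimal sketch of how such a confidence factor can be computed over a set of records follows; the records and itemsets are invented for the example.

```python
# Each record is the set of items in one transaction (or one claim form).
records = [
    {"A", "B", "C", "D"},
    {"A", "B", "C", "D", "E"},
    {"A", "B", "C"},
    {"B", "C", "E"},
]

def confidence(lhs, rhs, records):
    """Fraction of the records containing lhs that also contain rhs."""
    containing_lhs = [r for r in records if lhs <= r]
    if not containing_lhs:
        return 0.0
    return sum(1 for r in containing_lhs if rhs <= r) / len(containing_lhs)

# "What fraction of records that contain A, B and C also contain D?"
print(confidence({"A", "B", "C"}, {"D"}, records))  # 2 of 3 -> 0.666...
```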
2.4 Clustering/Segmentation
Clustering and segmentation are the processes of creating a partition so that all the
members of each set of the partition are similar according to some metric. A cluster is a
set of objects grouped together because of their similarity or proximity. Objects are often
decomposed into an exhaustive and/or mutually exclusive set of clusters.
There are a number of approaches for forming clusters. One approach is to form
rules which dictate membership in the same group based on the level of similarity
between members. Another approach is to build set functions that measure some property
of partitions as functions of some parameter of the partition.
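One concrete way to realise similarity-based grouping is the classical k-means algorithm; the sketch below is a bare-bones version of it, with the points and the choice of k invented purely for illustration.

```python
import random

random.seed(1)

def kmeans(points, k, iterations=20):
    """Partition points into k clusters so that the members of each
    cluster are similar (close) under squared Euclidean distance."""
    centers = random.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:  # assign each point to its nearest center
            nearest = min(range(k), key=lambda i: sum(
                (a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):  # move centers to the means
            if members:
                centers[i] = tuple(sum(c) / len(members)
                                   for c in zip(*members))
    return clusters

points = [(1, 1), (1.2, 0.8), (0.9, 1.1), (8, 8), (8.2, 7.9), (7.8, 8.1)]
for cluster in kmeans(points, k=2):
    print(cluster)
```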
2.5 Summary:
Answers
1. Exact
2. Sequential / Temporal
3. Cluster.
BLOCK - 3
Introduction
Learning procedures can be classified into two categories: supervised learning and
unsupervised learning. In supervised learning we know the target value; the
output is compared with this target value, and the procedure is repeated until the desired
value is obtained. In unsupervised learning we extract new facts without knowing any
target value. In this unit you will learn different data mining techniques.
Clustering and segmentation basically partition the database so that each partition or
group is similar according to some criteria or metric. Clustering according to similarity is
a concept which appears in many disciplines. If a measure of similarity is available there
are a number of techniques for forming clusters. Membership of groups can be based on
the level of similarity between members and from this the rules of membership can be
defined. Another approach is to build set functions that measure some property of
partitions, i.e. groups or subsets, as functions of some parameter of the partition. This latter
approach achieves what is known as optimal partitioning.
Many data mining applications make use of clustering according to similarity for
example to segment a client/customer base. Clustering according to optimization of set
functions is used in data analysis e.g. when setting insurance tariffs the customers can be
segmented according to a number of parameters and the optimal tariff segmentation
achieved.
3.2 Induction
Deduction, by contrast, is a technique to infer information that is a logical consequence of the
information in the database, e.g. the join operator applied to two relational tables,
where the first concerns employees and departments and the second departments and
managers, infers a relation between employees and managers.
Induction has been described earlier as the technique to infer information that is
generalised from the database, as in the example mentioned above, to infer that each
employee has a manager. This is higher level information or knowledge in that it is a
general statement about objects in the database, obtained by searching for
patterns or regularities.
Induction has been used in the following ways within data mining.
3.2.1 Decision trees
Decision trees are a simple knowledge representation, and they classify examples into
a finite number of classes. The nodes are labeled with attribute names, the edges are
labeled with possible values of these attributes, and the leaves are labeled with the
different classes. Objects are classified by following a path down the tree, taking the
edges corresponding to the values of the attributes in the object.
The following is an example of objects that describe the weather at a given time.
The objects contain information on the outlook, humidity etc. Some objects are positive
examples, denoted by P, and others are negative, i.e. N. Classification is in this case the
construction of a tree structure, illustrated in the following diagram, which can be used to
classify all the objects correctly.
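The diagram itself has not survived in this text, but the classic weather tree it describes can be written down directly as data. In the sketch below the tree is hand-built rather than learned from the examples, and the particular attributes and branch values are the conventional ones from this textbook example.

```python
# Interior nodes are (attribute, branches) pairs; edges are attribute
# values; leaves are the classes P (positive) and N (negative).
tree = ("outlook", {
    "sunny":    ("humidity", {"high": "N", "normal": "P"}),
    "overcast": "P",
    "rain":     ("windy", {"true": "N", "false": "P"}),
})

def classify(obj, node):
    """Follow a path down the tree by taking the edges that correspond
    to the object's attribute values."""
    while isinstance(node, tuple):
        attribute, branches = node
        node = branches[obj[attribute]]
    return node

print(classify({"outlook": "sunny", "humidity": "high"}, tree))  # N
print(classify({"outlook": "rain", "windy": "false"}, tree))     # P
```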
Production rules have been widely used to represent knowledge in expert systems,
and they have the advantage of being easily interpreted by human experts because of their
modularity, i.e. a single rule can be understood in isolation and doesn't need reference to
other rules. The propositional-like structure of such rules has been described earlier, but
can be summed up as if-then rules.
Neural networks have broad applicability to real world business problems and
have already been successfully applied in many industries. Since neural networks are best
at identifying patterns or trends in data, they are well suited for prediction or forecasting
needs including:
sales forecasting
industrial process control
customer research
Neural networks use a set of processing elements (or nodes) analogous to neurons in
the brain. These processing elements are interconnected in a network that can then
identify patterns in data once it is exposed to the data, i.e. the network learns from
experience just as people do. This distinguishes neural networks from traditional
computing programs, which simply follow instructions in a fixed sequential order.

[Figure: the structure of a neural network - layers of interconnected input, hidden and output nodes.]

OLAP was defined by Codd as the dynamic synthesis, analysis and consolidation of
large volumes of multidimensional data.
An alternative definition of OLAP has been supplied by Nigel Pendse who, unlike
Codd, does not mix technology prescriptions with application requirements. Pendse
defines OLAP as Fast Analysis of Shared Multidimensional Information, which means:
Fast in that users should get a response in seconds, and so do not lose their chain of
thought;
Analysis in that the system can provide analysis functions in an intuitive manner, and
that the functions should supply the business logic and statistical analysis relevant to the
user's application;
Information is the data and the derived information required by the user's application.
Dimensional databases are not without problems, as they are not suited to storing all
types of data, such as lists - for example customer addresses and purchase orders.
Relational systems are also superior in security, backup and replication services, as these
tend not to be available at the same level in dimensional systems. The advantage of a
dimensional system is the freedom it offers: the user is free to explore the data
and receive the type of report they want without being restricted to a set format.
Data visualisation makes it possible for the analyst to gain a deeper, more
intuitive understanding of the data, and as such can work well alongside data mining.
Data mining allows the analyst to focus on certain patterns and trends and explore them
in depth using visualisation. On its own, data visualisation can be overwhelmed by the
volume of data in a large database.
3.6 Summary
In this unit, you studied various data mining techniques. Each method has
its own advantages and drawbacks; one should choose the method depending on the
application.
BLOCK - 4
4.0 Introduction
4.1 View points
4.2 Classification Method
4.3 Steps of a KDD process
4.4 KDD Application
4.5 Related Fields
4.6 Summary
4.7 Question/Answer key
4.0 Introduction
We know lots of data is being collected and warehoused, with data collected and
stored at enormous speeds. Traditional techniques are infeasible for raw data; hence data
mining is used for data reduction.
From a commercial point of view, data mining provides better, customized services
for the user. Information is becoming a product in its own right. We know traditional
techniques are not suitable because of the enormity of data, the high dimensionality of
data, and the heterogeneous, distributed nature of data. Hence we can use prediction
methods, i.e. we find human-interpretable patterns that describe the data.
Building accurate and efficient classifiers for large databases is one of the
essential tasks of data mining and machine learning research. Given a set of cases with
class labels as a training set, classification is to build a model (called a classifier) to
predict, for future data objects, the class label that is unknown.
Recent studies propose the extraction of a set of high quality association rules
from the training data set which satisfy certain user-specified frequency and confidence
thresholds.
In general, given a training data set, the task of classification is to build a classifier
from the training data set such that it can be used to predict the class labels of unknown
objects with high accuracy.
[Figure: the KDD process - raw data is selected into target data, preprocessed into
preprocessed data, transformed into transformed data, mined for patterns, and finally
interpreted and evaluated to yield knowledge.]

4.4 KDD Application
Why are today's database and automated match and retrieval technologies not
adequate for addressing the analysis needs? The answer lies in the fact that the patterns to
be searched for, and the models to be extracted, are typically subtle and require
significant specific domain knowledge. For example, consider a credit card company
wishing to analyze its recent transactions to detect fraudulent use or to use the individual
history of customers to decide on-line whether an incoming new charge is likely to be
from an unauthorized user. This is clearly not an easy classification problem to solve.
In the past, we could rely on human analysts to perform the necessary analysis.
Essentially, this meant transforming the problem into one of simply retrieving data,
displaying it to an analyst, and relying on expert knowledge to reach a decision.
However, with large databases, a simple query can easily return hundreds or thousands
(or even more) matches. Presenting the data, letting the analyst digest it, and enabling a
quick (and correct) decision becomes infeasible. Data visualization techniques can
significantly assist this process, but ultimately the reliance on the human in the loop
becomes a major bottleneck. (Visualization works only for small sets and a small number
of variables. Hence, the problem becomes one of finding the appropriate transformations
and reductions--typically just as difficult as the original problem.)
Finally, there are situations where one would like to search for patterns that
humans are not well-suited to find. Typically, this involves statistical modeling, followed
by "outlier" detection, pattern recognition over large data sets, classification, or
clustering. (Outliers are data points that do not fit within a hypothesized probabilistic
model and hence are likely the result of interference from another process.) Most database
management systems (DBMSs) do not allow the type of access and data manipulation
that these tasks require; there are also serious computational and theoretical problems
attached to performing data modeling in high-dimensional spaces and with large amounts
of data.
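As a minimal illustration of outlier detection, the sketch below uses the crudest possible "probabilistic model", a single mean and standard deviation, and flags points too many deviations away; the threshold and the toy charge amounts are invented for the example.

```python
import statistics

def outliers(values, threshold=2.0):
    """Flag values more than `threshold` standard deviations from the
    mean - a stand-in for 'does not fit the probabilistic model'."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > threshold * stdev]

charges = [22, 25, 19, 24, 21, 23, 20, 980]  # one suspicious transaction
print(outliers(charges))  # [980]
```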
Related fields include information retrieval, visualization,
intelligent agents for distributed and multimedia environments, digital libraries, and
management information systems.
The remainder of this article briefly outlines how some of these relate to the
various parts of the KDD process. I focus on the main fields and hope to clarify to the
reader the role of each of the fields and how they fit together naturally when unified
under the goals and applications of the overall KDD process. A detailed or
comprehensive coverage of how they relate to the KDD process would be too lengthy and
not very useful because ultimately one can find relations to every step from each of the
fields. The article aims to give a general review and paint with a broad brush. By no
means is this intended to be a guide to the literature, neither do I aim at being
comprehensive in any sense of the word.
Statistics. Statistics plays an important role primarily in data selection and sampling, data
mining, and evaluation of extracted knowledge steps. Historically, most statistics work
has focused on evaluation of model fit to data and on hypothesis testing. These are clearly
relevant to evaluating the results of data mining to filter the good from the bad, as well as
within the data-mining step itself in searching for, parametrizing, and fitting models to
data. On the front end, sampling schemes play an important role in selecting which data
to feed to the data-mining step. For the data-cleaning step, statistics offers techniques for
detecting "outliers," smoothing data when necessary, and estimating noise parameters. To
a lesser degree, estimation techniques for dealing with missing data are also available.
Finally, for exploratory data analysis, some techniques in clustering and design of
experiments come into play. However, the focus of research has dealt primarily with
small data sets and addressing small sample problems.
1. The basic tools used to extract patterns from data are called _______methods.
2. In the classification method, a collection of records (training set) is given; each
record contains a set of __________________; one of the attributes is the class.
3. ______________ plays an important role primarily in data selection and
sampling, data mining, and evaluation of extracted knowledge steps.
Answers
1. data mining
2. attributes
3. Statistics
BLOCK - 5
5.0 INTRODUCTION
Web data mining is the use of data mining techniques to automatically discover
and extract information from world wide web documents and services. It has grown in
importance with the tremendous growth of the data sources available on the web and the
dramatic popularity of e-commerce in the business community.
5.1 Methods
Web mining is a technique to discover and analyze useful information from
web data. Web mining is decomposed into the following tasks:
a) Resource discovery: the task of retrieving the intended information from the
Web.
b) Information Extraction: automatically selecting and preprocessing specific
information from the retrieved web resources.
c) Generalization: automatically discovering general patterns both at
individual web sites and across multiple sites.
d) Analysis: analyzing the mined patterns.
Web content mining is based on statistics about single words in isolation; to
represent unstructured text, it takes the single words found in the training corpus as features.
Multimedia data mining is part of content mining, and is engaged in mining
high-level information and knowledge from large online multimedia sources.
Multimedia data mining on the web has gained many researchers' attention recently.
Working towards a unifying framework for representation, problem solving and learning
from multimedia is a real challenge; this research area is still in its infancy, and
much work remains to be done.
Most web information retrieval tools use only the textual information,
while ignoring the link information, which can be very valuable. The goal of web structure
mining is to generate a structural summary about the web site and its web pages.
If a web page is linked to another web page directly, or the web pages are
neighbors, we would like to discover the relationships among those web pages. The
relations may fall into one of several types: the pages may be related by synonyms or
antonymy, they may have similar contents, or both may sit on the same web server and
therefore be created by the same person. Another task of web structure mining is to
discover the nature of the hierarchy or network of hyperlinks in the web sites of a
particular domain. This may help to generalize the flow of information in web sites
representing a particular domain, so that query processing becomes easier and more
efficient.
Web usage mining tries to discover useful information from the secondary
data derived from the interactions of the users while surfing the web. It focuses on
techniques that can predict user behavior while the user interacts with the web. In the
data preparation process of web usage mining, the web content and web site topology
are used as information sources, which connects web usage mining with web
content mining and web structure mining. Pattern
discovery is a bridge from usage mining to web content and structure mining.
Web usage mining is the application of data mining techniques to discover usage
patterns from web data, in order to understand and better serve the needs of web based
applications.
Preprocessing: Web usage mining is the application of data mining techniques to usage
logs (secondary web data) of large web data repositories. Its purpose is to produce
results that can be used in design tasks such as web site design, web server design and
navigation through a web site. Before applying the data mining algorithms, we must
perform data preparation to convert the raw data into the data abstraction necessary for
further processing.
Pattern discovery: Pattern discovery draws on algorithms and techniques from
several research areas, such as data mining, machine learning, statistics and pattern
recognition.
Pattern analysis: Pattern analysis is the final stage of web usage mining. The
goal of this process is to eliminate the irrelevant rules or patterns and to extract the
interesting ones from the output of the pattern discovery process, as in the small
filtering sketch below. There are two common approaches to pattern analysis: one is to
use a knowledge query mechanism such as SQL, while the other is to construct a
multidimensional data cube and then perform OLAP operations on it.
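For instance, a pattern analysis step may simply filter the discovered rules against interestingness thresholds. The rules and thresholds in this Python sketch are invented placeholders:

    # Pattern analysis sketch: keep only rules above invented
    # support/confidence thresholds, discarding irrelevant ones.
    rules = [
        {"rule": "home -> products", "support": 0.40, "confidence": 0.80},
        {"rule": "about -> contact", "support": 0.02, "confidence": 0.30},
    ]

    MIN_SUPPORT, MIN_CONFIDENCE = 0.10, 0.60

    interesting = [r for r in rules
                   if r["support"] >= MIN_SUPPORT
                   and r["confidence"] >= MIN_CONFIDENCE]
    for r in interesting:
        print(r["rule"])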
Due to the massive growth of e-commerce, privacy has become a sensitive topic and
has attracted more and more attention recently. The basic goal of web mining is to
extract information from data sets for business needs, which makes its applications
highly customer-related. The lack of regulations on the use and deployment of web
mining systems, and the widely reported privacy abuses related to data mining, have
made privacy a hot issue like never before. Privacy touches a central nerve with people,
and there are no easy solutions.
In this unit, you studied the area of web data mining, with the focus on web
usage mining. Web usage mining involves three stages: preprocessing, pattern discovery
and pattern analysis.
Answers
1. Resource discovery
2. automatic search
3. link structure
Unit Introduction
Having learnt the fundamentals of warehousing, in this unit we list out the
various areas in which data warehousing finds application, with particular emphasis on
the government sector.
You will also be introduced to a case study: that of the Andhra Pradesh
information warehouse. This case study is expected to underline the various concepts
discussed in the previous unit and provide a practical bias to the entire concept of
data warehousing.
A term project is suggested to further drive home the complexities and intricacies
involved in the process.
Since this unit is to be studied in totality, no summary or review questions are
included.
iii.
Ministry of Commerce
iv.
Ministry of Education
Having seen so much about data mining and data warehousing, the question arises
as to their areas of application. Of course, the business community is the first user of
these techniques: businesses feed in their own and their competitors' results, trends etc.
and come out with tangible strategies for the future. Obviously, there can be as many
variations and modifications of the warehousing concept as there are types of business.
However, to learn about them, one should first know the types of business practices,
their various strategies etc. before one can appreciate the warehousing techniques.
Instead, in this block, we choose the safer option of going through the various
applications at the government level. This promises to be a good procedure for two
reasons: first, all of us have some idea of how the government machinery works;
secondly, a lot of literature is available about the implementations. However, we should
underline the fact that we will be concerned more with the techniques and technologies
than with the actual outputs and results.
i) Census data: The census records should be a store house of all types of information
for the planning process of the country. Though they are presently processed manually
or with plain database technologies, the data available in them is so varied and complex
that it is ideally suited to data warehousing techniques. Information about wide-ranging
areas at various levels (village, district, state etc.) can be extracted and compiled using
OLAP techniques.
In fact, a village-level census analysis software has been developed by the
National Informatics Centre (NIC). This software gives details in two parts: the primary
census abstract and the details about the various amenities. It has been used on a trial
basis to get various views of the development scenario of selected villages in the
country, using the 1991 census data. Efforts are on to use the technology on a much
larger scale for subsequent census data.
It is easy to see why census data is ideally suited for data warehousing
applications. Firstly, it is reasonably static data, updated only once in ten years.
Secondly, since unbelievably large volumes of complex data are available, the benefit
of this technology over other methods of extracting information is obvious even at
first sight. Thirdly, almost all the concepts of data warehousing become applicable in
this application.
ii) Monitoring of essential commodities
The Government of India compiles data on the prices of essential commodities like
rice, wheat, pulses, edible oils etc. At every weekend, the prices of these commodities
on each day of the previous week are collected at selected centres across the country.
These are consolidated to give the government an insight into the trends and also to
allow the government to devise strategies on various agricultural policies. Again,
because of the geographical spread, the periodicity of updating etc., this becomes an
ideal candidate for data warehousing.
To ensure this, the Ministry of Commerce has set up several export processing
zones across the country, which compile data about the various export-import
transactions in their respective regions. These are then consolidated to produce data for
decision making by the commerce ministry on a regular basis.
This being again a fit case for a data warehousing operation, the government has
drawn up a plan to make use of OLAP decision support tools for decision making. In
fact, the data collection centres are already computerised, and in the second phase of
computerisation, the decision-making process is expected to be based on the principles
of data mining and warehousing.
i) ... contemplated.
ii) The Ministry of Tourism has already collected valuable data regarding the
pattern of tourist arrivals, their choices, spending patterns etc. Details
about primary tourist spots are also available. These can be combined to
produce a data warehouse to support decision making.
iii) In addition, several areas like planning, health, economic affairs etc. are ideally
suited to make use of OLAP tools. Conventionally, many of these departments are
computerised, routinely produce MIS reports and hence maintain medium to large
databases. The next logical step is to convert these databases and the MIS know-how
into full-fledged data warehouses. This would result in a paradigm shift as far as
data utilisation is concerned. Since the utility of most types of data is time-bound,
enormous delays in extracting information out of them would make the information
out of date and hence of little use. Further, such warehouses, when they come into
existence, would release the expert manpower now spent on processing the data for
data analysis and decision making.
Hence, unless the departments can avail selected data from the
development has been described in the context of what has already been
discussed in the previous sections, and is discussed to the lowest possible level of
detail. At the end of the block, you are expected to have become more comfortable
with the practical aspects of the warehouse techniques.
Contents:
1. Introduction
2. Concepts used in developing the warehouse
3. Data sources
   a) MPHS
   b) Land suite applications
   c) Maps and dictionaries
4. Possible users of information
   i. Policy planners
   ii. Custodians
   iii. Warehouse developers
   iv. Citizens
5.
   i. Data conversion
   ii. Data scrubbing
   iii. Data transformation
   iv. Web publishing
The data in the warehouse is arranged in several layers. The upper layers contain
single data entities, and their details are hidden in the lower levels; each successive
layer holds detailed data about the entities above it, with the details of the present
layer hidden in the next lower layer. It is for the user to decide at what level he wants
to see the data. The process of beginning at a higher level and viewing data at
progressively lower levels is called "drilling down" into the data. Conversely, one can
view data beginning at a detailed (lower) level and move up to concise (higher) levels.
This is called "rolling up" the data. A one-step illustration is sketched below.
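The following Python sketch shows one roll-up step on invented village-level population figures; viewing the same data village by village would be the corresponding drill-down:

    # Roll-up sketch: aggregate invented village-level populations
    # up to district totals; drilling down is the reverse view.
    village_data = [
        ("district1", "village1", 1200),
        ("district1", "village2",  800),
        ("district2", "village3", 1500),
    ]

    district_totals = {}
    for district, _village, population in village_data:
        district_totals[district] = district_totals.get(district, 0) + population

    print(district_totals)   # {'district1': 2000, 'district2': 1500}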
1) Identify the data sources and the type of data available from them.
2) Identify the users of the warehouse and the type of queries you can expect
from them.
3) Identify the methods of converting data from the sources in (1) to the users
in (2) above.
4) Identify the hardware and software components.
5) Finalise the types of queries that arise and ways of combining / standardising
them.
6) Look at ways of storing data in such a format that it can be efficiently
searched by most of the queries.
7) Finalise the data structures, analysis variables and methods of calculation.
a) MPHS: The Government of Andhra Pradesh collected data from each household
regarding the socio-economic status of each family. This data, collected
originally for a different purpose, was available as the MPHS suite of applications
in an electronic format. Relevant portions of this suite were made use of by
the government for building the warehouse.
b) Land suite of applications: This data, again, was already available, with
land as the core entity of information. Again, relevant portions of these
records were used for constructing the warehouse.
c) Maps and dictionaries: Again, only relevant portions were used. These
dictionaries and maps are essential to store and manipulate the data objects.
i) Policy planners: These are the primary users of the information generated from the
warehouse. Since they are expected to use it to the maximum extent, the warehouse
queries need to be optimised to suit the type and pattern of queries generated by them.
Though it may not be possible to fully anticipate their queries, a reasonable guess about
what type of conclusions and decisions they would like to draw can be ascertained,
possibly through interviews and questionnaires. Also, since they are likely to be
distributed all over the state, and maybe even outside it, the warehouse needs to be
accessible to them remotely.
ii) Custodians: As seen earlier, the dictionaries and maps are maintained by custodians.
In addition, the object entities themselves need to be maintained by custodians. All of
these custodians, whether of dictionaries, maps or entities, will be responsible for
maintaining the entities and also for incorporating changes from time to time. For
example, the way the government treats a particular caste (SC/ST/backward), a village
(backward/forward/sensitive) or even persons may change from time to time; all the
government does is issue a notification to that effect. The concerned custodians
will be responsible for maintaining the validity of the entities of the warehouse.
However, they are not likely to be computer professionals (at least the map and
dictionary custodians), and hence they should be able to view the entities in the way
they are accustomed to and be able to manage them.
iii) Warehouse developers: These are the persons who are actually responsible for the
day-to-day working of the warehouse. They will be able to look at the repository from
the practical point of view and decide about its capabilities and limitations. Their views
are most sought after when deciding on the viability or otherwise of the warehouse.
iv) Citizens: The government plans to make certain categories of data available to
ordinary citizens on the web. Since their backgrounds, the type of information they are
looking for and their abilities to interact are not homogeneous, generalised assumptions
have to be made about their needs, and suitable queries made available.
i. Data Conversion
ii. Data Scrubbing
iii. Data Transformation
iv. Web Publishing
i. Data Conversion: The different inputs to the warehouse come from various data
capture systems: online feeds, disks, tapes etc. Such information, coming from different
OLTP systems, needs to be accepted and converted into suitable formats before loading
it into the warehouse (called the core object repository). Standard software like
Oracle's SQL*Loader can do the job. A toy conversion step is sketched below.
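As a rough, software-neutral illustration (not the actual loader used), this Python sketch converts a comma-separated extract, with invented field names, into typed records ready for loading:

    # Data conversion sketch: turn a raw CSV-style extract (invented
    # format) into typed records ready for loading into the repository.
    import csv
    from io import StringIO

    raw = "household_id,district,income\nH001,district1,25000\nH002,district2,18000\n"

    records = []
    for row in csv.DictReader(StringIO(raw)):
        records.append({
            "household_id": row["household_id"],
            "district": row["district"],
            "income": int(row["income"]),   # convert text to a number
        })
    print(records)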
Once the data becomes available on tape, floppy or any other input medium,
the warehouse manager checks its authenticity and then executes the routines to
store it in the warehouse. He may even take printouts of the same. Barring the
warehouse manager, other users and custodians are not allowed to modify the data in
the warehouse; they can only send the data to be updated to the manager, who will
carry out the necessary updates. Typically, the data in the warehouse becomes
unavailable to all or a set of users during such updates.
ii. Data Scrubbing: Data scrubbing is the process of checking the validity of the
data arriving at the warehouse from different sources, to ensure its quality and
accuracy as well as its completeness. Since the data originates from different sources,
some of the key data may be ambiguous, incomplete or missing altogether. Further,
since the data keeps arriving at periodic intervals, its consistency with respect to the
previously stored data is not always guaranteed. Such invalid data, needless to say,
leads to false comparisons. A minimal scrubbing pass is sketched below.
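A scrubbing pass could, in spirit, look like this Python sketch, which rejects invented records with missing keys or implausible values:

    # Data scrubbing sketch: reject records with missing keys or
    # implausible values (all rules and records here are invented).
    records = [
        {"household_id": "H001", "income": 25000},
        {"household_id": None,   "income": 18000},   # missing key
        {"household_id": "H003", "income": -500},    # implausible value
    ]

    def is_valid(record):
        if not record["household_id"]:
            return False
        if record["income"] < 0:
            return False
        return True

    clean = [r for r in records if is_valid(r)]
    rejected = [r for r in records if not is_valid(r)]
    print(len(clean), "accepted,", len(rejected), "rejected")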
iii. Data Transformation: Data transformation involves extracting the data from the
information repository, scrubbing it and loading it into the main database. The
process includes identifying the dimensionalities, storing the data in appropriate
formats, and may also involve indicating to the users that the data is ready for use.
iv. Web Publishing: The warehouse is web enabled. The web agent on the server, which
interacts with the HTML templates, reads the data from the server and sends it to the
web page. The agent, of course, has to resolve the access rights of the user before
populating the web page with information. This becomes extremely important when,
for example, citizens are allowed to access certain sections of information while many
others are to be kept inaccessible. The system administrator is expected to tackle the
various issues regarding such selective access rights by suitably configuring the server.
The spirit of such a check is sketched below.
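The access-right resolution performed by the agent can be pictured as in this Python sketch; the roles, categories and pages are invented for illustration:

    # Access-rights sketch: the web agent serves only the categories
    # permitted for the requesting user's role (invented roles/data).
    permissions = {
        "citizen": {"public"},
        "planner": {"public", "restricted"},
    }

    data_pages = {
        "public":     "Village amenities summary",
        "restricted": "Household-level income details",
    }

    def render_page(role, category):
        if category in permissions.get(role, set()):
            return data_pages[category]
        return "Access denied"

    print(render_page("citizen", "restricted"))   # Access denied
    print(render_page("planner", "restricted"))   # served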
The data from the multi-purpose household survey (MPHS) and the land data extracted
from land records can be stored on two separate sets of storage devices.
The data, after the scrubbing operation, is normally stored in an RDBMS like
Oracle or Sybase. From this store, the multidimensional database server (MDDB)
receives the data. In the present case, Oracle8 Enterprise Edition was deployed, because
it supports both relational and object-relational models.
The data can be consolidated into dimensions along actual age and income, or along
derived dimensions like assured income and non-assured income etc. The consolidation
of data into several dimensions is a tricky job. Often data is collected at the lowest
level and aggregated into higher-level totals for the sake of analysis.
Since the data is to be accessed on the web, a web server of suitable capacity is of
prime importance. The web server receives query requests from the web, converts them
into suitable queries and hands them to the MDDB server; the replies from the MDDB
server are sent back to the web, to be displayed to the person who raised the query
request.
The query itself can be raised either by i) clients, which are computers
connected physically to the servers, or ii) web clients, where the user requires the
replies over the Internet. The government may also provide "kiosks", special terminals
where users can get the required information through 'touch screen' technology.
While users like the planners are likely to come out with special and newer
queries, average citizens often end up asking similar questions. This, apart
from the fact that many of them may not be computer savvy, makes a case for
producing several "canned query modules": the user has no option of
formulating his own queries, but can choose to get answers to one or more of the
readymade questions. Such questions can be put on the kiosks, and the user gets
the answer by choosing them with a suitable pointing device.
At the next level, the user may be provided with a "custom query
module", which helps him formulate queries and get the answers. It may also
help the user to change certain parameters and get suitable results that help in
formulating policies. Further, such custom queries may be either summary or
detailed ones; the latter help the users in micro-level analysis of information.
1. Occupation (D1)
2. Age (D2)
3. Sex (D3)
4. Caste (D4)
5. Religion (D5)
6. Shelter (D6)
7. SSID (D7)
8. House (D8)
9. Khata number (D9)
10. Crop (D10)
11. Season (D11)
12. Nature (D12)
13. Irrigation (D13)
14. Classification (D14)
15. Serial Number (D15)
16. Land (D16)
17. Area (D17)
18. Time (D18)
19. Occupant (D19)
20. Marital Status (D20)
Of course, there was no special reason for this particular ordering of the dimensions,
and any other order would have been equally viable. Note that each of these
dimensions can be considered to be at level 1, but they can have lower-level values at
level 2, level 3 etc.
[Figure: Hierarchies of the Occupation and Caste dimensions. For the Caste dimension (D4), level 1 is "All castes"; level 2 splits into Forward, Backward and Scheduled; level 3 holds the individual castes under each category. The Occupation dimension has a similar layered structure.]
Now "All castes" lies along dimension D4. If one needs details about all
castes, he will search along D4 at level 1. If details about the backward castes are
needed, one goes along D4 at level 2. For a particular caste, say caste 3, the search will
be along level 3 of D4. The idea is sketched below.
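The search along the levels of D4 can be pictured as walking down a nested structure, as in this Python sketch; the caste names are placeholders:

    # Dimension-level search sketch for D4 (castes); names are placeholders.
    d4 = {                      # level 1: all castes
        "Forward":   ["caste1", "caste2"],    # level 2 -> level 3
        "Backward":  ["caste3", "caste4"],
        "Scheduled": ["caste5", "caste6"],
    }

    # Level 1 query: everything.
    all_castes = [c for group in d4.values() for c in group]
    # Level 2 query: one category.
    backward = d4["Backward"]
    # Level 3 query: one particular caste.
    caste3_found = "caste3" in d4["Backward"]

    print(len(all_castes), backward, caste3_found)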
[Figure: Hierarchy of the geographic dimension. Level 1: State; level 2: Districts (District 1, District 2); level 3: Taluks (Taluk 1 to Taluk 4); level 4: Villages (Village 1, Village 2).]
Now any search along D17 for specific taluks will proceed along level 3, for a specific
district along level 2, and so on.
Once the above structures are frozen, the analysis becomes simple. Most
database packages provide i) specific queries to search along specific levels of a
dimension in a truly multidimensional database, while ii) in a simple relational database,
the dimensions need to be searched as relations at the appropriate level.
Since this analysis part is software specific, it does not come under the purview
of this case study; it suffices to say that any general query can be broken into
a sequence of search commands at the appropriate levels.
Canned query modules are simply a list of such sequential query combinations,
each combination answering a particular 'canned query' and identified, possibly, by a
number. Once the number is selected, the sequence of searches is made and the results
displayed, as in the sketch below. In the other case of custom query modules, the GUI
helps the user convert his queries into a sequence of system queries, so that they can
be executed.
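A canned query module can thus be organised as a numbered table of fixed search sequences, as in this Python sketch; the queries and their numbering are invented:

    # Canned query sketch: each number maps to a fixed sequence of
    # elementary (dimension, level, value) searches; all invented.
    canned_queries = {
        1: [("D4", 2, "Backward"), ("D17", 3, "taluk1")],  # backward castes in taluk 1
        2: [("D17", 2, "district1")],                      # district-level summary
    }

    def run_canned_query(number):
        for dimension, level, value in canned_queries[number]:
            # In a real system this would issue a search command;
            # here we only show the decomposition into searches.
            print("search", dimension, "level", level, "for", value)

    run_canned_query(1)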
While the above discussion provides a basic structure for the implementation,
several details, like the handling of historic data, the provision of time-dimensioned
reports etc., remain to be worked out.
9. TERM EXERCISE
Suggest a suitable data warehouse design to maintain the various details of your
college. While the actual query formulations are not very important, the various
system requirements need to be studied and worked out in detail, and presented in a
step-by-step manner.