
III C.S E DWBI
UNIT-I
Syllabus:
Introduction to Data Mining: Motivation for Data Mining, Data Mining-Definition &
Functionalities, Classification of DM systems, DM task primitives, Integration of a Data Mining
system with a Database or a Data Warehouse, Major issues in Data Mining.

Data Warehousing (Overview Only): Overview of concepts like star schema, fact and
dimension tables, OLAP operations, From OLAP to Data Mining.

1.1 DATA MINING:

1.1.1 What motivated data mining? Why is it important?


Data mining has attracted a great deal of attention in the information industry in recent years because of the wide availability of huge amounts of data and the imminent need for turning such data into useful information and knowledge.
 The information and knowledge gained can be used for applications ranging from business
management, production control, and market analysis, to engineering design and science
exploration.

The evolution of database system technology

Mallikarjun, Assoc Prof MLWEC

1.1.2 What is data mining?

 Data mining refers to extracting or "mining" knowledge from large amounts of data.
 There are many other terms related to data mining, such as knowledge mining, knowledge extraction, data/pattern analysis, data archaeology, and data dredging.
 Many people treat data mining as a synonym for another popularly used term, "Knowledge Discovery in Databases" (KDD).

Essential step in the process of knowledge discovery in databases


Knowledge discovery as a process is depicted in following figure and consists of an iterative
sequence of the following steps:
 Data cleaning: removes noise and irrelevant data
 Data integration: multiple data sources may be combined
 Data selection: data relevant to the analysis task are retrieved from the database
 Data transformation: data are transformed or consolidated into forms appropriate for mining, e.g., by performing summary or aggregation operations
 Data mining: an essential process where intelligent methods are applied in order to extract data patterns
 Pattern evaluation: identifies the truly interesting patterns representing knowledge, based on interestingness measures
 Knowledge presentation: visualization and knowledge representation techniques are used to present the mined knowledge to the user
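The steps above can be sketched as a small pipeline. The records, attribute names, and spending threshold below are hypothetical illustrations, not part of any standard library:

```python
# A minimal sketch of the KDD pipeline steps listed above.
# Records, attributes, and the threshold are invented for illustration.

def clean(records):
    """Data cleaning: drop records with missing (None) values."""
    return [r for r in records if None not in r.values()]

def select(records, attrs):
    """Data selection: keep only attributes relevant to the task."""
    return [{a: r[a] for a in attrs} for r in records]

def transform(records):
    """Data transformation: aggregate purchase amounts per customer."""
    totals = {}
    for r in records:
        totals[r["customer"]] = totals.get(r["customer"], 0) + r["amount"]
    return totals

def mine(totals, threshold):
    """Data mining: flag high-spending customers (a trivial 'pattern')."""
    return {c for c, t in totals.items() if t >= threshold}

records = [
    {"customer": "A", "amount": 120, "store": "S1"},
    {"customer": "B", "amount": None, "store": "S2"},  # noisy record
    {"customer": "A", "amount": 80,  "store": "S1"},
    {"customer": "C", "amount": 30,  "store": "S3"},
]

cleaned    = clean(records)
selected   = select(cleaned, ["customer", "amount"])
aggregated = transform(selected)
high_value = mine(aggregated, threshold=100)
print(high_value)  # {'A'} — customer A spent 200 in total
```

Pattern evaluation and knowledge presentation would then judge and display the discovered set.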

Data mining as a step in the process of knowledge discovery.


How is a data warehouse different from a database? How are they similar?
 Differences between a data warehouse and a database:

Data Warehouse:
1. A data warehouse is a repository of information collected from multiple sources, over a history of time, stored under a unified schema.
2. It is used for data analysis and decision support.

Database:
1. A database is a collection of interrelated data that represents the current status of the stored data. In multiple heterogeneous databases, the schema of one database may not agree with the schema of another.
2. A database system supports ad-hoc query and on-line transaction processing.

Similarities between a data warehouse and a database: Both are repositories of information,
storing huge amounts of persistent data.

1.1.3 Data mining functionalities/Data mining tasks: what kinds of patterns can be
mined?

Data mining functionalities are used to specify the kind of patterns to be found in data
mining tasks. In general, data mining tasks can be classified into two categories:
• Descriptive
• Predictive
Descriptive mining tasks characterize the general properties of the data in the database.
Predictive mining tasks perform inference on the current data in order to make predictions.

Q) Describe data mining functionalities, and the kinds of patterns they can discover
(or)
Q) Define each of the following data mining functionalities: characterization,
discrimination, association and correlation analysis, classification, prediction, clustering,
and evolution analysis. Give examples of each data mining functionality, using a real-life
database that you are familiar with.

1. Concept/class description: characterization and discrimination


Data can be associated with classes or concepts. It describes a given set of data in a
concise and summative manner, presenting interesting general properties of the data.
These descriptions can be derived via
1. Data characterization, by summarizing the data of the class under study (often
called the target class) in general terms.

2. Data discrimination, by comparison of the target class with one or a set of comparative classes
3. Both data characterization and discrimination
Data characterization
It is a summarization of the general characteristics or features of a target class of data.
Example:
A data mining system should be able to produce a description summarizing the
characteristics of a student who has obtained more than 75% in every semester; the result could
be a general profile of the student.

The output of data characterization can be presented in various forms like


 Pie charts
 Bar charts
 Curves
 Multidimensional cubes
 Multidimensional tables, etc.

The resulting descriptions can be presented as generalized relations or in rule form, called characteristic rules.
Data Discrimination is a comparison of the general features of target class data objects
with the general features of objects from one or a set of contrasting classes.

Example
The general features of students with high GPAs may be compared with the general features of students with low GPAs. The resulting description could be a general comparative profile of the students, such as: 75% of the students with high GPAs are fourth-year computing science students, while 65% of the students with low GPAs are not.
Discrimination descriptions expressed in rule form are referred to as discriminant rules.

2. Mining Frequent patterns, Associations and correlations.


Frequent patterns, as the name suggests, are patterns that occur frequently in data.
 There are many kinds of frequent patterns, including frequent item sets, frequent
subsequences (also known as sequential patterns), and frequent substructures.
 A frequent item set typically refers to a set of items that often appear together in a
transactional data set—for example, milk and bread, which are frequently bought together in
grocery stores by many customers.
 A frequently occurring subsequence, such as the pattern that customers tend to purchase first a laptop, followed by a digital camera, and then a memory card, is a (frequent) sequential pattern.

 A substructure can refer to different structural forms (e.g., graphs, trees, or lattices) that may
be combined with item sets or subsequences. If a substructure occurs frequently, it is called
a (frequent) structured pattern.
 Mining frequent patterns leads to the discovery of interesting associations and correlations
within data.
It is the discovery of association rules showing attribute-value conditions that occur
frequently together in a given set of data. For example, a data mining system may find
association rules like
major(X, “computing science”) ⇒ owns(X, “personal computer”)
[support = 12%, confidence = 98%]
where X is a variable representing a student. The rule indicates that of the students under study, 12% (support) major in computing science and own a personal computer. There is a 98% probability (confidence, or certainty) that a student in this group owns a personal computer.
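The support and confidence of such a rule can be computed directly from their definitions. A minimal sketch, using invented student records:

```python
# Computing support and confidence for a rule A => B
# (A: majors in computing science; B: owns a personal computer).
# The student records below are invented for illustration.

students = [
    {"major": "computing science", "owns_pc": True},
    {"major": "computing science", "owns_pc": True},
    {"major": "biology",           "owns_pc": False},
    {"major": "computing science", "owns_pc": False},
    {"major": "history",           "owns_pc": True},
]

a = [s["major"] == "computing science" for s in students]
b = [s["owns_pc"] for s in students]

# support(A => B) = P(A and B); confidence(A => B) = P(B | A)
both = sum(1 for x, y in zip(a, b) if x and y)
support = both / len(students)
confidence = both / sum(a)

print(support, confidence)  # support = 0.4, confidence = 2/3
```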

Correlation analysis
Correlation analysis is a technique used to measure the association between two variables.
 Correlation is the degree or type of relationship between two or more quantities (variables).
 A correlation coefficient (r) is a statistic used for measuring the strength of a supposed linear association between two variables. Correlations range from -1.0 to +1.0 in value.
 A correlation coefficient of +1.0 indicates a perfect positive relationship, in which two or more variables fluctuate together (as one increases/decreases, the other also increases/decreases).
 A correlation coefficient of 0.0 indicates no relationship between the two variables. That is, one cannot use the scores on one variable to tell anything about the scores on the second variable.
 A correlation coefficient of -1.0 indicates a perfect negative relationship, in which one variable increases as the other decreases.

3. Classification and prediction


Classification:
 It predicts categorical class labels
 It classifies data (constructs a model) based on the training set and the values (class
labels) in a classifying attribute and uses it in classifying new data
 Typical Applications
 credit approval
 target marketing
 medical diagnosis
 treatment effectiveness analysis

 Classification can be defined as the process of finding a model (or function) that describes
and distinguishes data classes or concepts, for the purpose of being able to use the model to
predict the class of objects whose class label is unknown.
 The derived model is based on the analysis of a set of training data (i.e., data objects whose
class label is known).
Example:
An airport security screening station is used to determine whether passengers are potential terrorists or criminals. To do this, the face of each passenger is scanned and its basic pattern (distance between the eyes, size and shape of the mouth, head, etc.) is identified. This pattern is compared with entries in a database to see whether it matches any patterns associated with known offenders.
 A classification model can be represented in various forms, such as
1) IF-THEN rules, e.g.,
student(class, "undergraduate") AND concentration(level, "high") ==> class A
student(class, "undergraduate") AND concentration(level, "low") ==> class B
student(class, "post graduate") ==> class C
2) Decision tree

3) Neural network (mathematical formulae)
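The IF-THEN representation above can be executed directly as a small rule-based classifier; the attribute names and class labels follow the example rules:

```python
# A minimal rule-based classifier mirroring the IF-THEN rules above.

def classify(student):
    if student["class"] == "undergraduate" and student["concentration"] == "high":
        return "A"
    if student["class"] == "undergraduate" and student["concentration"] == "low":
        return "B"
    if student["class"] == "post graduate":
        return "C"
    return None  # no rule fires: class label unknown

print(classify({"class": "undergraduate", "concentration": "high"}))  # A
print(classify({"class": "post graduate", "concentration": "low"}))   # C
```

A decision tree or neural network would learn such a mapping from training data instead of having it hand-written.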

Prediction:
Prediction finds missing or unavailable numerical data values rather than class labels.
 Although prediction may refer to both numerical prediction and class label prediction, it is
usually confined to numerical data value prediction and thus is distinct from classification.
 Prediction also encompasses the identification of distribution trends based on the available
data.
 Regression analysis is a statistical methodology that is most often used for numerical
prediction.

 Classification and prediction may need to be preceded by a relevance analysis, which
attempts to identify attributes that do not contribute to the classification or prediction
process. These attributes can then be excluded.

Example:
Predicting flooding is a difficult problem. One approach uses monitors placed at various points in the river. These monitors collect data relevant to flood prediction: water level, rainfall amount, time, humidity, etc. The water level at a potential flooding point in the river can then be predicted from the data collected by the sensors upriver from that point. The prediction must be made with respect to the time the data were collected.
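Since regression analysis is the method most often used for numerical prediction, this example can be sketched with a simple least-squares fit; the rainfall and water-level figures below are invented:

```python
# Simple least-squares linear regression for numerical prediction.
# The river data are invented for illustration.

def fit_line(xs, ys):
    """Return (slope, intercept) minimizing squared error."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Upriver rainfall (mm) vs. water level (m) at a downstream point:
rain  = [0, 10, 20, 30, 40]
level = [1.0, 1.5, 2.0, 2.5, 3.0]
slope, intercept = fit_line(rain, level)

# Predict the water level for 25 mm of rain:
print(slope * 25 + intercept)  # 2.25
```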

Classification vs. Prediction


Classification differs from prediction in that the former is to construct a set of models
(or functions) that describe and distinguish data class or concepts, whereas the latter is to predict
some missing or unavailable, and often numerical, data values.
Their similarity is that they are both tools for prediction: Classification is used for
predicting the class label of data objects and prediction is typically used for predicting missing
numerical data values.

4. Clustering analysis

Clustering analyzes data objects without consulting a known class label.

 The objects are clustered or grouped based on the principle of maximizing the intra-class
similarity and minimizing the interclass similarity.

 Cluster of objects are formed so that objects within a cluster have high similarity to one
another, but are very dissimilar to objects in other clusters.

 Each cluster that is formed can be viewed as a class of objects.

Clustering can also facilitate taxonomy formation, that is, the organization of observations into
a hierarchy of classes that group similar events together as shown below:

A 2-D plot of customer data with respect to customer locations in a city, showing three data
Clusters.
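Grouping such customer locations can be sketched with a simple k-means procedure, one common clustering algorithm; the points and starting centers below are invented:

```python
# A minimal k-means sketch: group 2-D customer locations into clusters
# by maximizing intra-cluster similarity. Points and centers are invented.

def kmeans(points, centers, iterations=10):
    for _ in range(iterations):
        # Assign each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            d = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers]
            clusters[d.index(min(d))].append(p)
        # Move each center to the mean of its cluster.
        centers = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else c
            for cl, c in zip(clusters, centers)
        ]
    return centers, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centers, clusters = kmeans(points, centers=[(0, 0), (10, 10)])
print(centers)  # roughly (1.33, 1.33) and (8.33, 8.33)
```

No class labels are consulted at any point: the grouping emerges purely from the similarity of the objects.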
Classification vs. Clustering
 In general, in classification you have a set of predefined classes and want to know which
class a new object belongs to.
 Clustering tries to group a set of objects and find whether there is some relationship between
the objects.
In the context of machine learning, classification is supervised learning and clustering is
unsupervised learning

5. Outlier analysis:
A database may contain data objects that do not comply with the general behavior or model of the data.
 These data objects are outliers. In other words, data objects that do not fall within any cluster are called outlier data objects.
 Noisy or exceptional data are also called outlier data. The analysis of outlier data is referred to as outlier mining.
Example
Outlier analysis may uncover fraudulent usage of credit cards by detecting purchases of
extremely large amounts for a given account number in comparison to regular charges incurred
by the same account. Outlier values may also be detected with respect to the location and type
of purchase, or the purchase frequency.
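The credit-card example can be sketched with a simple statistical test that flags charges far from the account's mean; the amounts and the two-standard-deviation threshold below are illustrative choices, not a prescribed method:

```python
# Flag charges more than z standard deviations from the account's mean.
# The charge amounts and threshold are invented for illustration.
import math

def outliers(amounts, z=2.0):
    n = len(amounts)
    mean = sum(amounts) / n
    std = math.sqrt(sum((a - mean) ** 2 for a in amounts) / n)
    return [a for a in amounts if abs(a - mean) > z * std]

charges = [25, 30, 22, 28, 31, 27, 24, 2900]  # one extreme purchase
print(outliers(charges))  # [2900]
```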

6. Data evolution analysis


It describes and models regularities or trends for objects whose behavior changes over
time.
 Although this may include characterization, discrimination, association, classification, or
clustering of time-related data, distinct features of such an analysis include time-series data
analysis, sequence or periodicity pattern matching, and similarity-based data analysis.
Example:
The results data of a college over the last several years would give an idea of the quality of the graduates it produces.

1.1.4 Which Technologies Are Used? (Or) Classification of Data Mining Systems

As a highly application-driven domain, data mining has incorporated many techniques from other domains such as statistics, machine learning, pattern recognition, database and data warehouse systems, information retrieval, visualization, algorithms, high performance computing, and many application domains.

Statistics studies the collection, analysis, interpretation or explanation, and presentation of data. Data mining has an inherent connection with statistics.


Data mining adopts techniques from many domains.

A statistical model is a set of mathematical functions that describe the behavior of the
objects in a target class in terms of random variables and their associated probability
distributions. Statistical models are widely used to model data and data classes.
For example, in data mining tasks like data characterization and classification, statistical
models of target classes can be built. In other words, such statistical models can be the outcome
of a data mining task.
Alternatively, data mining tasks can be built on top of statistical models. For example,
we can use statistics to model noise and missing data values. Then, when mining patterns in a
large data set, the data mining process can use the model to help identify and handle noisy or
missing values in the data.
Statistics research develops tools for prediction and forecasting using data and statistical
models. Statistical methods can be used to summarize or describe a collection of data.
Statistics is useful for mining various patterns from data as well as for understanding the underlying mechanisms generating and affecting the patterns.
Inferential statistics (or predictive statistics) models data in a way that accounts for
randomness and uncertainty in the observations and is used to draw inferences about the process
or population under investigation.
Statistical methods can also be used to verify data mining results. For example, after a
classification or prediction model is mined, the model should be verified by statistical
hypothesis testing. A statistical hypothesis test (sometimes called confirmatory data analysis)
makes statistical decisions using experimental data. A result is called statistically significant if it
is unlikely to have occurred by chance. If the classification or prediction model holds true, then
the descriptive statistics of the model increases the soundness of the model.


Machine Learning
Machine learning investigates how computers can learn (or improve their performance)
based on data. A main research area is for computer programs to automatically learn to
recognize complex patterns and make intelligent decisions based on data. For example, a typical
machine learning problem is to program a computer so that it can automatically recognize
handwritten postal codes on mail after learning from a set of examples.
Machine learning is a fast-growing discipline. Here, we illustrate classic problems in
machine learning that are highly related to data mining.
Supervised learning is basically a synonym for classification. The supervision in the learning
comes from the labeled examples in the training data set. For example, in the postal code
recognition problem, a set of handwritten postal code images and their corresponding machine-
readable translations are used as the training examples, which supervise the learning of the
classification model.

Unsupervised learning is essentially a synonym for clustering. The learning process is unsupervised since the input examples are not class labeled. Typically, we may use clustering to discover classes within the data.
For example, an unsupervised learning method can take, as input, a set of images of
handwritten digits. Suppose that it finds 10 clusters of data. These clusters may correspond to
the 10 distinct digits of 0 to 9, respectively. However, since the training data are not labeled, the
learned model cannot tell us the semantic meaning of the clusters found.
Semi-supervised learning is a class of machine learning techniques that make use of both
labeled and unlabeled examples when learning a model. In one approach, labeled examples are
used to learn class models and unlabeled examples are used to refine the boundaries between
classes. For a two-class problem, we can think of the set of examples belonging to one class as
the positive examples and those belonging to the other class as the negative examples. In the
following Figure, if we do not consider the unlabeled examples, the dashed line is the decision
boundary that best partitions the positive examples from the negative examples.

Semi-supervised learning


Using the unlabeled examples, we can refine the decision boundary to the solid line. Moreover,
we can detect that the two positive examples at the top right corner, though labeled, are likely
noise or outliers.

Active learning is a machine learning approach that lets users play an active role in the learning
process. An active learning approach can ask a user (e.g., a domain expert) to label an example,
which may be from a set of unlabeled examples or synthesized by the learning program. The
goal is to optimize the model quality by actively acquiring knowledge from human users, given
a constraint on how many examples they can be asked to label.

Information Retrieval

Information retrieval (IR) is the science of searching for documents or information in documents. Documents can be text or multimedia, and may reside on the Web. The differences
between traditional information retrieval and database systems are twofold: Information
retrieval assumes that (1) the data under search are unstructured; and (2) the queries are formed
mainly by keywords, which do not have complex structures (unlike SQL queries in database
systems).

The typical approaches in information retrieval adopt probabilistic models. For example,
a text document can be regarded as a bag of words, that is, a multi-set of words appearing in the
document. The document’s language model is the probability density function that generates the
bag of words in the document. The similarity between two documents can be measured by the
similarity between their corresponding language models.
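The bag-of-words view above can be made concrete by comparing word-count vectors with cosine similarity, one common way to compare such representations; the two documents below are invented:

```python
# Document similarity via the bag-of-words model: each document becomes
# a multiset (Counter) of its words, compared with cosine similarity.
import math
from collections import Counter

def cosine_sim(doc_a, doc_b):
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(c * c for c in a.values()))
    nb = math.sqrt(sum(c * c for c in b.values()))
    return dot / (na * nb)

d1 = "data mining extracts knowledge from data"
d2 = "data mining discovers patterns in data"
print(round(cosine_sim(d1, d2), 3))  # 0.625
```

Probabilistic language models refine this idea by smoothing the counts into probability distributions.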

Furthermore, a topic in a set of text documents can be modeled as a probability distribution over the vocabulary, which is called a topic model. A text document, which may involve one or multiple topics, can be regarded as a mixture of multiple topic models. By integrating information retrieval models and data mining techniques, we can find the major topics in a collection of documents and, for each document, the major topics involved.

1.1.5 Data Mining Task Primitives:

A data mining query is defined in terms of data mining task primitives. These primitives
allow the user to interactively communicate with the data mining system during discovery in
order to direct the mining process, or examine the findings from different angles or depths.

The set of task-relevant data to be mined: This specifies the portions of the database or the set of
data in which the user is interested. This includes the database attributes or data warehouse
dimensions of interest (referred to as the relevant attributes or dimensions).


The kind of knowledge to be mined: This specifies the data mining functions to be performed,
such as characterization, discrimination, association or correlation analysis, classification,
prediction, clustering, outlier analysis, or evolution analysis.

The background knowledge to be used in the discovery process: This knowledge about the
domain to be mined is useful for guiding the knowledge discovery process and for evaluating
the patterns found. Concept hierarchies are a popular form of background knowledge, which
allow data to be mined at multiple levels of abstraction.

The interestingness measures and thresholds for pattern evaluation: They may be used to guide
the mining process or, after discovery, to evaluate the discovered patterns. Different kinds of
knowledge may have different interestingness measures. For example, interestingness measures
for association rules include support and confidence.
Rules whose support and confidence values are below user-specified thresholds are considered uninteresting.

The expected representation for visualizing the discovered patterns: This refers to the form in
which discovered patterns are to be displayed, which may include rules, tables, charts, graphs,
decision trees, and cubes.

1.1.6 Integration of Data Mining System with a Database or Data warehouse System
A good system architecture will enable the data mining system to make best use of the software environment, accomplish data mining tasks efficiently and in a timely manner, and interoperate with other information systems.

 When designing a data mining system, it should be integrated or coupled with a database (DB) system and/or a data warehouse (DW) system.

 There are different integration schemes. Those are


1. No coupling
2. Loose coupling
3. Semi-tight coupling
4. Tight coupling

1. No coupling: If a DM system works as a stand-alone system or is embedded in an application program, there are no DB or DW systems with which it has to communicate. This scheme is called no coupling.
No coupling means that a DM system will not utilize any function of a DB or DW system.
 It may fetch data from a particular source (such as a file system), process the data using some data mining algorithms, and then store the mining results in another file.
 Drawbacks: Without using a DB/DW system, a DM system may spend a great deal of time finding, collecting, cleaning, and transforming data.
 No coupling represents a poor design.

2. Loose coupling:
Loose coupling means that a DM system will use some facilities of a DB or DW system, such as:
 Fetching data from a data repository managed by these systems,
 Performing data mining,
 And then storing the mining results either in a file or in a designated place in a database or data warehouse.

 Loose coupling is better than no coupling because it can fetch any portion of the data stored in databases or data warehouses by using query processing, indexing, and other system facilities.

 Drawbacks: It is difficult for loose coupling to achieve high scalability and good performance with large data sets.


3. Semi-tight coupling

Semi-tight coupling means that besides linking a DM system to a DB/DW system, efficient implementations of a few essential data mining primitives can be provided in the DB/DW system.
 These primitives can include
 Sorting
 Indexing
 Aggregation
 Histogram analysis
 Multiway join
 Precomputation of some essential measures
 Sum
 Count
 Max
 Min
 Standard deviation
 Some frequently used intermediate mining results can be precomputed and stored in the DB/DW system.

4. Tight coupling
Tight coupling means that a DM system is smoothly integrated into the DB/DW system.
 The data mining subsystem is treated as one functional component of an information system.
 Data mining queries and functions are optimized based on mining query analysis, data structures, indexing schemes, and query processing methods of the DB or DW system.

1.1.7 Major issues in data mining


Major issues in data mining concern mining methodology, user interaction, performance, and diverse data types.
1. Mining methodology:
 Mining different kinds of knowledge in databases: Since different users can be interested in
different kinds of knowledge, data mining should cover a wide spectrum of data analysis
and knowledge discovery tasks, including data characterization, discrimination, association,
classification, clustering, trend and deviation analysis, and similarity analysis.
These tasks may use the same database in different ways and require the development
of numerous data mining techniques.

 Handling outlier or incomplete data:


The data stored in a database may reflect outliers: noise, exceptional cases, or incomplete data objects. These objects may confuse the analysis process, causing overfitting of the data to the knowledge model constructed. As a result, the accuracy of the discovered patterns can be poor. Data cleaning methods and data analysis methods that can handle outliers are required.

 Pattern evaluation: refers to interestingness of pattern:


A data mining system can uncover thousands of patterns. Many of the patterns
discovered may be uninteresting to the given user, representing common knowledge or
lacking novelty. Several challenges remain regarding the development of techniques to
assess the interestingness of discovered patterns.

2. User Interaction:

 Interactive mining of knowledge at multiple levels of abstraction:


Since it is difficult to know exactly what can be discovered within a database, the data
mining process should be interactive.
 Incorporation of background knowledge:
Background knowledge, or information regarding the domain under study, may be used to guide the discovery process. Domain knowledge related to databases, such as integrity constraints and deduction rules, can help focus and speed up a data mining process, or judge the interestingness of discovered patterns.
 Data mining query languages and ad-hoc data mining:
Knowledge of relational query languages (such as SQL) is required, since they allow users to pose ad-hoc queries for data retrieval.

 Presentation and visualization of data mining results:
Discovered knowledge should be expressed in high-level languages or visual representations, so that the knowledge can be easily understood and directly used by humans.

3. Performance issues (Efficiency and Scalability).


 Efficiency and scalability of data mining algorithms:
To effectively extract information from a huge amount of data in databases, data mining
algorithms must be efficient and scalable.
 Parallel, distributed, and incremental updating algorithms:
Such algorithms divide the data into partitions, which are processed in parallel. The
results from the partitions are then merged.
 Cloud computing and cluster computing, which use computers in a distributed and
collaborative way to tackle very large-scale computational tasks, are also active research
themes in parallel data mining.


4. Diversity of database types


 Handling of relational and complex types of data:
Since relational databases and data warehouses are widely used, the development of
efficient and effective data mining systems for such data is important.
 Mining information from heterogeneous databases and global information systems:
Local and wide-area computer networks (such as the Internet) connect many sources of
data, forming huge, distributed, and heterogeneous databases. The discovery of knowledge
from different sources of structured, semi-structured, or unstructured data with diverse data
semantics poses great challenges to data mining.

5. Data Mining and Society


 Social impacts of data mining: With data mining penetrating our everyday lives, it is
important to study the impact of data mining on society. How can we use data mining
technology to benefit society? How can we guard against its misuse? The improper
disclosure or use of data and the potential violation of individual privacy and data protection
rights are areas of concern that need to be addressed.
 Privacy-preserving data mining: Data mining will help scientific discovery, business
management, economy recovery, and security protection (e.g., the real-time discovery of
intruders and cyber attacks). However, it poses the risk of disclosing an individual’s
personal information. Studies on privacy-preserving data publishing and data mining are
ongoing. The philosophy is to observe data sensitivity and preserve people’s privacy while
performing successful data mining.
 Invisible data mining: We cannot expect everyone in society to learn and master data mining
techniques. More and more systems should have data mining functions built within so that
people can perform data mining or use data mining results simply by mouse clicking,
without any knowledge of data mining algorithms. Intelligent search engines and Internet-
based stores perform such invisible data mining by incorporating data mining into their
components to improve their functionality and performance. This is often done without the user's knowledge. For example, when purchasing items online, users may be unaware that the store is likely collecting data on the buying patterns of its customers, which may be used to recommend other items for purchase in the future.


1.2 DATA WAREHOUSE:

1.2.1. What is a data warehouse?

A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process.

“Data warehousing: The process of constructing and using data warehouses”.

Subject-oriented: A data warehouse is organized around major subjects, such as customer, supplier, product, and sales. Rather than concentrating on the day-to-day operations and transaction processing of an organization, a data warehouse focuses on the modeling and analysis of data for decision makers.

Integrated: A data warehouse is usually constructed by integrating multiple heterogeneous
sources, such as relational databases, flat files, and on-line transaction records. Data cleaning and
data integration techniques are applied to ensure consistency in naming conventions, encoding
structures, attribute measures, and so on.


Time-variant: Data are stored to provide information from a historical perspective (e.g., the
past 5–10 years). Every key structure in the data warehouse contains, either implicitly or
explicitly, an element of time.

Nonvolatile: A data warehouse is always a physically separate store of data transformed
from the application data found in the operational environment. Due to this separation, a
data warehouse does not require transaction processing, recovery, and concurrency control
mechanisms. It usually requires only two operations in data accessing: initial loading of data and
access of data.

1.2.2 Data Warehouse Architecture

Tier-1:
The bottom tier is a warehouse database server that is almost always a relational database
system. Back-end tools and utilities are used to feed data into the bottom tier from operational
databases or other external sources (such as customer profile information provided by external
consultants). These tools and utilities perform data extraction, cleaning, and transformation
(e.g., to merge similar data from different sources into a unified format), as well as load and
refresh functions to update the data warehouse. The data are extracted using application
program interfaces known as gateways. A gateway is supported by the underlying DBMS and
allows client programs to generate SQL code to be executed at a server.

Examples of gateways include ODBC (Open Database Connectivity) and OLE DB (Object
Linking and Embedding, Database) by Microsoft, and JDBC (Java Database Connectivity).
This tier also contains a metadata repository, which stores information about the data warehouse
and its contents.
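The extract-clean-load flow performed by the bottom-tier back-end tools can be sketched in plain Python. The two source lists, their field names, and the unified record layout below are all hypothetical, chosen only to show inconsistent naming conventions being reconciled:

```python
# Two operational sources with inconsistent naming conventions (hypothetical data).
source_a = [{"cust": "C01", "amount": 500.0}]
source_b = [{"customer_id": "C02", "amt": 300.0}]

def extract_and_clean():
    """Extract records from each source, unify field names, and load them."""
    warehouse = []
    for rec in source_a:  # extraction + transformation for source A
        warehouse.append({"customer_id": rec["cust"], "dollars_sold": rec["amount"]})
    for rec in source_b:  # extraction + transformation for source B
        warehouse.append({"customer_id": rec["customer_id"], "dollars_sold": rec["amt"]})
    return warehouse

warehouse = extract_and_clean()
```

In a real system this role is played by dedicated ETL tools and gateway APIs such as ODBC or JDBC; the sketch only illustrates the naming-consistency step mentioned above.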

Tier-2:
The middle tier is an OLAP server that is typically implemented using either a relational OLAP
(ROLAP) model or a multidimensional OLAP (MOLAP) model.
A ROLAP model is an extended relational DBMS that maps operations on multidimensional data
to standard relational operations.
A MOLAP model is a special-purpose server that directly implements multidimensional data and
operations.

Tier-3:
The top tier is a front-end client layer, which contains query and reporting tools, analysis tools,
and/or data mining tools (e.g., trend analysis, prediction, and so on).

1.2.3 Differences between Operational Database Systems and Data Warehouses:

Databases generally involve transactions. OLTP (online transaction processing) is a class of
software programs capable of supporting transaction-oriented applications on the Internet.
Typically, OLTP systems are used for order entry, financial transactions, customer relationship
management (CRM) and retail sales.

Data warehouses generally follow OLAP. OLAP (Online Analytical Processing) is the
technology behind many Business Intelligence (BI) applications. OLAP is a powerful
technology for data discovery, including capabilities for limitless report viewing, complex
analytical calculations, and predictive “what if” scenario (budget, forecast) planning.


1.2.4 MULTIDIMENSIONAL DATA MODEL


A data warehouse is based on a multidimensional data model, which views data in the
form of a data cube.
 A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions.
 Dimensions are the perspectives or entities with respect to which an organization wants to
keep records, such as time, item, branch, and location.
 A dimension table, such as item (item_name, brand, type) or time (day, week, month,
quarter, year), gives further descriptions about a dimension.
 A fact table contains measures (such as dollars_sold) and keys to each of the related
dimension tables.
 In data warehousing, an n-dimensional base cube is called a base cuboid. The topmost 0-
dimensional cuboid, which holds the highest level of summarization, is called the apex
cuboid. The lattice of cuboids forms a data cube.
 A 3-D view of a sales data warehouse can be given according to the dimensions time, item,
and location. The measure displayed is dollars_sold (in thousands).
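The base and apex cuboids can be illustrated with plain Python aggregation. The sales rows below are hypothetical; time, item, and location are the dimensions and dollars_sold is the measure:

```python
from collections import defaultdict

sales = [  # (time, item, location, dollars_sold) -- hypothetical fact rows
    ("Q1", "computer", "Toronto",   600),
    ("Q1", "phone",    "Vancouver", 400),
    ("Q2", "computer", "Toronto",   500),
]

# Base cuboid: the measure kept at full (time, item, location) granularity.
base = defaultdict(int)
for t, i, l, dollars in sales:
    base[(t, i, l)] += dollars

# Apex (0-D) cuboid: the highest level of summarization -- one grand total.
apex = sum(d for *_, d in sales)
```

Every other cuboid in the lattice (e.g., by time alone, or by time and item) would be obtained the same way, by grouping on a subset of the three dimensions.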

A 3-D data cube representation of the data in above Table, according to time, item, and
location.



1.2.5 Star, Snowflake, and Fact Constellation: Schemas for Multidimensional Data Models

 The entity-relationship data model is commonly used in the design of relational
databases, where a database schema consists of a set of entities and the relationships
between them. Such a data model is appropriate for on-line transaction processing.
 A data warehouse requires a concise, subject-oriented schema that facilitates on-line data
analysis.
 The most popular data model for a data warehouse is a multidimensional model, which
can exist in the form of a star schema, a snowflake schema, or a fact constellation
schema.

Star schema:
 The most common modeling paradigm is the star schema, in which the data warehouse
contains (1) a large central table (fact table) containing the bulk of the data, with no
redundancy, and (2) a set of smaller attendant tables (dimension tables), one for each
dimension.
 The schema graph resembles a starburst, with the dimension tables displayed in a radial
pattern around the central fact table.
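The star layout can be sketched as one fact table whose rows hold only measures plus foreign keys into the dimension tables. The table contents and surrogate key values below are hypothetical:

```python
# Dimension tables, keyed by surrogate keys (hypothetical contents).
dim_item = {1: {"item_name": "computer", "brand": "B1", "type": "electronics"}}
dim_time = {10: {"day": 1, "month": "Jan", "quarter": "Q1", "year": 2024}}
dim_location = {100: {"city": "Toronto", "country": "Canada"}}

# Fact table: a key into each dimension plus the measure dollars_sold.
fact_sales = [{"item_key": 1, "time_key": 10, "location_key": 100,
               "dollars_sold": 600}]

def describe(fact):
    """Join one fact row to its dimension rows (a star-join in miniature)."""
    return (dim_item[fact["item_key"]]["item_name"],
            dim_time[fact["time_key"]]["quarter"],
            dim_location[fact["location_key"]]["city"],
            fact["dollars_sold"])
```

Note how all descriptive attributes live in the small dimension tables, so the large fact table carries no redundancy, exactly the property the star schema is designed for.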


Snowflake schema:
 The snowflake schema is a variant of the star schema model, where some dimension tables
are normalized, thereby further splitting the data into additional tables.
 The resulting schema graph forms a shape similar to a snowflake.

 The major difference between the snowflake and star schema models is that the dimension
tables of the snowflake model may be kept in normalized form to reduce redundancies.

Fact constellation:
 Sophisticated applications may require multiple fact tables to share dimension tables.
 This kind of schema can be viewed as a collection of stars, and hence is called a galaxy
schema or a fact constellation.

Fact constellation schema of a sales and shipping data warehouse.


1.2.6 A Concept Hierarchy

A concept hierarchy defines a sequence of mappings from a set of low-level concepts to
higher-level, more general concepts.

Ex 1:

A concept hierarchy for location.


Ex 2:

Hierarchical and lattice structures of attributes in warehouse dimensions: (a) a hierarchy for
Location and (b) a lattice for time.

 Concept hierarchies may also be defined by discretizing or grouping values for a given
dimension or attribute, resulting in a set-grouping hierarchy.
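A concept hierarchy for location can be represented as a chain of mappings from a lower level to the next higher level. The street, city, and province values below are hypothetical:

```python
# Total order: street < city < province_or_state < country (hypothetical values).
city_of = {"Main St": "Toronto", "Oak Ave": "Vancouver"}
province_of = {"Toronto": "Ontario", "Vancouver": "British Columbia"}
country_of = {"Ontario": "Canada", "British Columbia": "Canada"}

def generalize(street):
    """Climb the location hierarchy from street level up to country level."""
    city = city_of[street]
    province = province_of[city]
    return country_of[province]
```

OLAP roll-up and drill-down operate by moving along exactly this kind of mapping chain.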


1.2.7 OLAP Operations in the Multidimensional Data Model

 In the multidimensional model, data are organized into multiple dimensions, and each
dimension contains multiple levels of abstraction defined by concept hierarchies.
 This organization provides users with the flexibility to view data from different
perspectives.
 A number of OLAP data cube operations exist to materialize these different views,
allowing interactive querying and analysis of the data at hand.
 Hence, OLAP provides a user-friendly environment for interactive data analysis.

Roll-up:
 The roll-up operation (also called the drill-up operation by some vendors) performs
aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by
dimension reduction.
 Example: The result of a roll-up operation performed on the central cube by climbing up the
concept hierarchy for location given in Figure.
 This hierarchy was defined as the total order “street < city < province or state < country.”
 The roll-up operation shown aggregates the data by ascending the location hierarchy from
the level of city to the level of country.
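Climbing the location hierarchy from city to country can be sketched as re-aggregating the measure at the coarser level. The cities and sales figures below are hypothetical:

```python
from collections import defaultdict

# Concept hierarchy step city -> country, and city-level sales (hypothetical).
country_of = {"Toronto": "Canada", "Vancouver": "Canada", "Chicago": "USA"}
sales_by_city = {"Toronto": 600, "Vancouver": 400, "Chicago": 300}

# Roll-up: aggregate dollars_sold by ascending from city to country.
sales_by_country = defaultdict(int)
for city, dollars in sales_by_city.items():
    sales_by_country[country_of[city]] += dollars
```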


Drill-down:
 Drill-down is the reverse of roll-up. It navigates from less detailed data to more detailed
data.
 Drill-down can be realized by either stepping down a concept hierarchy for a dimension or
introducing additional dimensions.
 Following Figure shows the result of a drill-down operation performed on the central cube
by stepping down a concept hierarchy for time defined as “day < month < quarter < year.”
Drill-down occurs by descending the time hierarchy from the level of quarter to the more
detailed level of month.
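Drill-down requires that the more detailed data exist; descending the time hierarchy from quarter to month can then be sketched as below (months and sales figures hypothetical):

```python
from collections import defaultdict

# Facts are stored at month granularity; quarter is the coarser level.
quarter_of = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1", "Apr": "Q2"}
sales_by_month = {"Jan": 200, "Feb": 150, "Mar": 250, "Apr": 300}

# The quarter-level view is a roll-up of the month-level facts ...
sales_by_quarter = defaultdict(int)
for month, dollars in sales_by_month.items():
    sales_by_quarter[quarter_of[month]] += dollars

# ... and drilling down from Q1 exposes its month-level detail again.
q1_detail = {m: d for m, d in sales_by_month.items() if quarter_of[m] == "Q1"}
```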

Slice:
 The slice operation performs a selection on one dimension of the given cube, resulting in
a subcube.
 Following Figure shows a slice operation where the sales data are selected from the
central cube for the dimension time using the criterion time = “Q1.”
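A slice is simply a selection on one dimension; with hypothetical fact rows it can be sketched as:

```python
sales = [  # (time, item, location, dollars_sold) -- hypothetical rows
    ("Q1", "computer", "Toronto",   600),
    ("Q2", "computer", "Toronto",   500),
    ("Q1", "phone",    "Vancouver", 400),
]

# Slice: select on the single dimension time with criterion time = "Q1".
q1_slice = [row for row in sales if row[0] == "Q1"]
```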


Dice:
 The dice operation defines a subcube by performing a selection on two or more
dimensions.
 Following Figure shows a dice operation on the central cube based on the following
selection criteria, which involve three dimensions: (location = “Toronto” or “Vancouver”)
and (time = “Q1” or “Q2”) and (item = “home entertainment” or “computer”).
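The three-dimension selection above can be sketched directly as membership tests on hypothetical fact rows:

```python
sales = [  # (time, item, location, dollars_sold) -- hypothetical rows
    ("Q1", "computer",           "Toronto",   600),
    ("Q2", "home entertainment", "Vancouver", 350),
    ("Q3", "computer",           "Toronto",   500),
    ("Q1", "phone",              "Toronto",   200),
]

# Dice: a selection involving three dimensions at once yields a subcube.
subcube = [row for row in sales
           if row[2] in ("Toronto", "Vancouver")      # location
           and row[0] in ("Q1", "Q2")                 # time
           and row[1] in ("home entertainment", "computer")]  # item
```

Only the rows satisfying all three dimension criteria survive; a slice is the special case with a criterion on a single dimension.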


Pivot (rotate):
 Pivot (also called rotate) is a visualization operation that rotates the data axes in view to
provide an alternative data presentation.
 Following Figure shows a pivot operation where the item and location axes in a 2-D slice
are rotated. Other examples include rotating the axes in a 3-D cube, or transforming a 3-D
cube into a series of 2-D planes.
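Rotating the item and location axes of a 2-D slice can be sketched by swapping the row and column keys of a nested table (values hypothetical):

```python
# A 2-D slice: rows are items, columns are locations (hypothetical values).
table = {
    "computer": {"Toronto": 600, "Vancouver": 500},
    "phone":    {"Toronto": 200, "Vancouver": 400},
}

# Pivot (rotate): swap the item and location axes.
rotated = {}
for item, by_location in table.items():
    for location, dollars in by_location.items():
        rotated.setdefault(location, {})[item] = dollars
```

No values change; only the presentation axes are exchanged, which is why pivot is described as a visualization operation.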

All of these operations are summarized in the following figure:


OLAP Operations


1.2.8 OLAP Server Architectures

1. Relational OLAP (ROLAP)


 Use relational or extended-relational DBMS to store and manage warehouse data
 And OLAP middleware to support missing pieces
 Include optimization of DBMS backend, implementation of aggregation navigation
logic, and additional tools and services
 Greater scalability
2. Multidimensional OLAP (MOLAP)
 Array-based multidimensional storage engine(sparse matrix techniques)
 Fast indexing to pre-computed summarized data
3. Hybrid OLAP (HOLAP)
 Offers user flexibility, e.g., low level: relational; high level: array
4. Specialized SQL servers
 Specialized support for SQL queries over star/snowflake schemas

1.2.9 From Online Analytical Processing to Multidimensional Data Mining:

Multidimensional data mining (also known as exploratory multidimensional data mining, online
analytical mining, or OLAM) integrates OLAP with data mining to uncover knowledge in
multidimensional databases.

Multidimensional data mining is particularly important for the following reasons:

a) High quality of data in data warehouses

b) Available information processing infrastructure surrounding data warehouses

c) OLAP-based exploration of multidimensional data

d) Online selection of data mining functions.

Architecture of Online Analytical Mining:

An OLAM server performs analytical mining on data cubes in a similar manner as an OLAP
server performs OLAP queries. An integrated OLAM and OLAP architecture is shown in the
following figure, where the OLAM and OLAP servers both accept user on-line queries via a
graphical user interface API and work with the data cube in data analysis via a cube API. A
metadata directory is used to guide access to the data cube. The data cube may be constructed
by accessing and/or integrating multiple databases via a database API that may support OLE
DB or ODBC connections.


Since an OLAM server may perform multiple data mining tasks, such as concept description,
association, classification, prediction, clustering, time-series analysis, and so on, it usually
consists of multiple integrated data mining modules and is more sophisticated than an OLAP
server.
