
DATA, DATABASE, DATA WAREHOUSE - OLTP & OLAP


Introduction

HISTORY

Database
1960 - the first database management system
1970 - the first relational model
1980 - distributed database systems and database machines
1990 - object-oriented databases
2000 - XML databases

Data Warehouse
It became a distinct type of
computer database during the
late 1980s and early 1990s

INTRODUCTION
Database
collection of related data
database management system (DBMS) is a collection of
programs that enables users to create and maintain a
database
used in many applications

Data warehouse

a record of an enterprise's past transactional and operational information

designed to favor efficient data analysis and reporting

data warehousing is not meant for current "live" data

CONTD...
Database
a structured collection of records or data
Data Warehouse
a logical collection of information, gathered from many different
operational databases, that supports business analysis activities and
decision-making tasks

Database model
the structure or format of a database, described in a formal language
supported by the database management system

THE RELATIONAL DATABASE MODEL

There are many types of databases


Databases are
Collections of information
Created with logical structures
With logical ties within the information
With built-in integrity constraints
Databases have many tables
Consider Solomon Enterprises that provides concrete to home and
commercial builders. Tables or files include:
Order
Customer
Concrete Type
Employee
Truck
The relational database model is the most popular
A relational database uses a series of logically related two-dimensional tables or files to store information in the form of a
database

DATABASE COLLECTION OF INFORMATION


DATABASES CONTD

In databases, the row number is irrelevant


In databases, column names are very important. Column names
are created in the data dictionary
Data dictionary contains the logical structure of the information
in a database
Logical ties must exist between the tables or files in a database
Logical ties are created with primary and foreign keys
Primary key: field (or group of fields in some cases) that uniquely
describes each record
Foreign key: primary key of one file that appears in another file
Foreign keys help you create logical ties within the information
in a database
Integrity constraints: rules that help ensure the quality of the
information
Examples
Primary keys must be unique
Foreign keys must be present
Sales price cannot be negative
Phone number must have area code
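
The first three integrity constraints listed above can be tried out with a small sketch. It uses Python's built-in sqlite3 module; the customer and orders tables and their columns are hypothetical stand-ins loosely based on the Solomon Enterprises example, not a schema from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,          -- primary keys must be unique
    name        TEXT NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL
                REFERENCES customer(customer_id),  -- foreign key must be present
    sales_price REAL NOT NULL CHECK (sales_price >= 0)  -- price cannot be negative
);
""")
conn.execute("INSERT INTO customer VALUES (1, 'Triple A Homes')")
conn.execute("INSERT INTO orders VALUES (100, 1, 250.0)")


def rejected(sql):
    """Return True if the DBMS refuses the statement on integrity grounds."""
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False


duplicate_key_rejected = rejected("INSERT INTO customer VALUES (1, 'Duplicate')")
missing_parent_rejected = rejected("INSERT INTO orders VALUES (101, 99, 10.0)")
negative_price_rejected = rejected("INSERT INTO orders VALUES (102, 1, -5.0)")

print(duplicate_key_rejected, missing_parent_rejected, negative_price_rejected)
```

Each rejected statement violates exactly one rule, so the DBMS, not the application, is what keeps the bad rows out.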

EXAMPLE: DATABASES WITH LOGICAL TIES WITHIN THE INFORMATION

DATABASE MANAGEMENT SYSTEM TOOLS

Database management system (DBMS) helps you to
specify the logical organization of a database
access and use the information within a database

5 software components:
1. DBMS engine
2. Data definition subsystem
3. Data manipulation subsystem
4. Application generation subsystem
5. Data administration subsystem

CONTD

DBMS

DBMS engine: accepts logical requests from the various other DBMS subsystems, converts them into their physical equivalent, and actually accesses the database and data dictionary as they exist on a storage device
DBMS engine separates the logical from the physical
Physical view: how information is physically arranged, stored, and accessed on some type of storage device
Logical view: how you as a knowledge worker need to arrange and access information
With a database, you only concern yourself with your logical view

Data definition subsystem: helps you create and maintain the data dictionary and define the structure of the files in a database
You must create a data dictionary before entering information into a database

Data manipulation subsystem: helps you add, change, and delete information
This is your primary DBMS interface as you work with a database
Views, report generators, QBE tools, SQL

CONTD

View: allows you to see the contents of a database file
Make whatever changes you want
Perform simple sorting
Query to find the location of information
Looks similar to a workbook with no row numbers

CONTD

Report generator: helps you quickly define formats of reports and what information you want to see in a report
You can save report formats and generate reports at any time with up-to-date information

CONTD

Query-by-example (QBE) tool: helps you graphically design the answer to a question
What driver most often delivers concrete to Triple A Homes?

CONTD

Structured query language (SQL): standardized fourth-generation language found in most DBMSs
Performs the same task as a QBE tool
But uses a sentence structure instead of a point-and-click interface
SQL is used mostly by IT people
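
As an illustration of the sentence-like structure of SQL, the QBE question "What driver most often delivers concrete to Triple A Homes?" might be phrased as below. This is only a sketch: the customer, employee, and orders tables, their columns, and the sample rows are assumptions made for the example, and the query runs through Python's built-in sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical tables; the slides name Order, Customer and Employee
-- but do not give their columns.
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE employee (employee_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customer(customer_id),
    driver_id   INTEGER REFERENCES employee(employee_id)
);
INSERT INTO customer VALUES (1, 'Triple A Homes'), (2, 'Other Builder');
INSERT INTO employee VALUES (10, 'Driver A'), (11, 'Driver B');
INSERT INTO orders VALUES (100, 1, 10), (101, 1, 11), (102, 1, 11), (103, 2, 10);
""")

# Count deliveries per driver for one customer and keep the most frequent.
query = """
SELECT e.name, COUNT(*) AS deliveries
FROM orders AS o
JOIN customer AS c ON c.customer_id = o.customer_id
JOIN employee AS e ON e.employee_id = o.driver_id
WHERE c.name = 'Triple A Homes'
GROUP BY e.employee_id, e.name
ORDER BY deliveries DESC
LIMIT 1
"""
top_driver = conn.execute(query).fetchone()
print(top_driver)
```

The foreign keys in orders are what let the query tie an order to both its customer and its driver.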
Application generation subsystem: contains facilities to help you develop transaction-intensive applications
Data entry screens (called forms)
Programming languages
Used mostly by IT specialists

WHY SEPARATE DATA WAREHOUSE?

High performance for both systems

DBMS tuned for OLTP: access methods, indexing, concurrency control, recovery
Warehouse tuned for OLAP: complex OLAP queries, multidimensional view, consolidation

Different functions and different data:
missing data: decision support (DS) requires historical data which operational DBs do not typically maintain
data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources
data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled

DATA WAREHOUSE

W. H. Inmon - Father of the data warehouse

Data Warehouse (Definition): A subject-oriented, integrated, time-variant, non-updatable collection of data used in support of management decision-making processes

Data Warehousing: process of constructing and using a data warehouse

Subject-Oriented

Organized around major subjects, such as customer, product, sales


Focusing on the modeling and analysis of data for decision makers, not
on daily operations or transaction processing
Provide a simple and concise view around particular subject issues by
excluding data that are not useful in the decision support process

Integrated
Constructed by integrating multiple, heterogeneous data sources
Data cleaning and data integration techniques are applied
Ensure consistency in naming conventions, encoding structures,
attribute measures, etc. among different data sources
E.g., Hotel price: currency, tax, breakfast covered, etc.
When data is moved to the warehouse, it is converted

CONTD

Time Variant
Every piece of data contained within the warehouse must be associated
with a particular point in time if any useful analysis is to be conducted
with it.
Another aspect of time variance in DW data is that, once recorded, data
within the warehouse cannot be updated or changed.
Non-volatility
Typical activities such as deletes, inserts, and changes that are
performed in an operational application environment are completely
nonexistent in a DW environment.
Only two data operations are ever performed in the DW: data loading
and data access.
Collection of tools - gathering data; cleansing, integrating, querying, reporting,
analysis, data mining, monitoring, administering warehouse

Data-mining tools: software tools used to query information in a data warehouse

Query-and-reporting tools
Intelligent agents
Multidimensional analysis tools
Statistical tools

COMPONENTS OF A DATA WAREHOUSE

Sources - Data Source Interaction


Data Transformation
Data Warehouse (Data Storage)
Reporting (Data Presentation)
Metadata


NEED FOR DATA WAREHOUSING


Integrated, company-wide view of high-quality
information (from disparate databases)
Separation of operational and informational systems
and data (for improved performance)

WHAT IS A DATA WAREHOUSE?


WHAT ARE DATA-MINING TOOLS?



CONTD
Query-and-reporting tools
similar to QBE tools, SQL, and report generators in the typical
database environment
Intelligent Agents
Use various artificial intelligence tools such as neural networks
and fuzzy logic to form the basis for information discovery and
building business intelligence

Help you find hidden patterns in information

Multidimensional analysis (MDA) tools
Slice-and-dice techniques that allow you to view multidimensional information from different perspectives
Bring new layers to the front
Reorganize rows and columns

Statistical Tools
Help you apply various mathematical models to the information
stored in a data warehouse to discover new information

Regression, Analysis of variance


DATA WAREHOUSE-ADVANTAGES & DISADVANTAGES


Advantages
complete control over the four main areas of data management
systems:
Clean data
Query processing: multiple options
Indexes: multiple types
Security: data and access

Disadvantages
Adding new data sources takes time and associated high cost
Data owners lose control over their data, raising ownership, security
and privacy issues
Long initial implementation time and associated high cost
Difficult to accommodate changes in data types and ranges, data source
schema, indexes and queries

DATA WAREHOUSE ARCHITECTURES


1. Generic Two-Level Architecture
2. Independent Data Mart
3. Dependent Data Mart and Operational Data Store
4. Logical Data Mart and Active Warehouse
5. Three-Layer Architecture
All involve some form of extraction, transformation and loading (ETL)

Generic two-level data warehousing architecture

One, companywide warehouse
Periodic extraction: data is not completely current in warehouse

Independent data mart data warehousing architecture


Data marts: mini-warehouses, limited in scope
Separate ETL for each independent data mart
Data access complexity due to multiple data marts

Dependent data mart with operational data store: a three-level architecture

ODS provides option for obtaining current data
Single ETL for enterprise data warehouse (EDW)
Dependent data marts loaded from EDW
Simpler data access

Logical data mart and real time warehouse architecture


ODS and data warehouse are one and the same
Near real-time ETL for data warehouse
Data marts are NOT separate databases, but logical views of the data warehouse
Easier to create new data marts
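
A logical data mart of this kind can be pictured as an ordinary SQL view over the warehouse. The sketch below uses Python's built-in sqlite3 module; the warehouse_sales table, the europe_sales_mart view, and the sample rows are all hypothetical names and data chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical warehouse table holding all regions.
CREATE TABLE warehouse_sales (region TEXT, product TEXT, amount REAL);
INSERT INTO warehouse_sales VALUES
    ('Europe', 'DVD', 100.0),
    ('Europe', 'PC',  250.0),
    ('Asia',   'DVD',  80.0);

-- The "data mart" is only a stored query, not a second copy of the data.
CREATE VIEW europe_sales_mart AS
    SELECT product, amount
    FROM warehouse_sales
    WHERE region = 'Europe';
""")

total_before = conn.execute("SELECT SUM(amount) FROM europe_sales_mart").fetchone()[0]

# New warehouse rows become visible in the mart without a separate ETL run.
conn.execute("INSERT INTO warehouse_sales VALUES ('Europe', 'VCR', 50.0)")
total_after = conn.execute("SELECT SUM(amount) FROM europe_sales_mart").fetchone()[0]

print(total_before, total_after)
```

Creating another mart is then just one more CREATE VIEW statement, which is why new data marts are easier to add in this architecture.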


DATA MINING AND VISUALIZATION


Knowledge discovery using a blend of statistical,
AI, and computer graphics techniques
Goals:

Explain observed events or conditions


Confirm hypotheses
Explore data for new or unexpected relationships

Techniques
Case-based reasoning
Rule discovery
Signal processing
Neural nets
Fractals

Data visualization: representing data in graphical/multimedia formats for analysis

DATA MARTS

Data warehouses can support all of an organization's information
Data marts have subsets of an organization-wide data warehouse
Data mart: subset of a data warehouse in which only a focused portion of the data warehouse information is kept

BUSINESS INTELLIGENCE

Organizations need business intelligence


Business intelligence (BI): knowledge about customers, competitors, business partners, competitive environment, and internal operations to make effective, important, and strategic business decisions
IT tools help process information to create business
intelligence according to:

OLTP
OLAP

OLTP VS. OLAP

Online transaction processing (OLTP): gathering of input information, processing that information, and updating existing information to reflect the gathered and processed information

Online analytical processing (OLAP): manipulation of information to support decision making


Databases can support some OLAP
Data warehouses only support OLAP, not OLTP
Data warehouses are special forms of databases that support
decision making


EXAMPLE

OLTP VS. OLAP


Online Transaction Processing (OLTP)
Describes processing at operational sites
Relational databases - groups data using common attributes found in the data set
Designed for real time business operations
Mostly updates
Many small transactions
MB - GB of data
Optimized for a common set of transactions, usually adding or retrieving a single row at a time per table
Optimized for validation of incoming data during transactions; uses validation data tables
Supports thousands of concurrent users

Online Analytical Processing (OLAP)
Describes processing at warehouse
Objectives are different
Designed for analysis of business measured by categories and attributes
Mostly reads
Queries are long and complex
GB - TB of data
Optimized for bulk loads and large, complex, unpredictable queries that access many rows per table
Loaded with consistent, valid data; requires no real time validation
Supports few concurrent users relative to OLTP

WHAT AND WHY OLAP?

OLAP is the dynamic synthesis, analysis, and consolidation of large volumes of multi-dimensional data.
OLAP uses a multi-dimensional view of aggregate data to provide quick access to strategic information for the purposes of advanced analysis.
OLAP enables users to gain a deeper understanding and knowledge about various aspects of their corporate data through fast, consistent, interactive access to a variety of possible views of data.
While OLAP systems can easily answer who? and what? questions, it is their ability to answer what if? and why? type questions that distinguishes them from general-purpose query tools.
The types of analysis available from OLAP range from basic navigation and browsing (referred to as slicing and dicing), to calculations, to more complex analysis such as time series and complex modeling.

OLAP KEY FEATURES

Multi-dimensional views of data.

Support for complex calculations.

Time Intelligence.


OLAP BENEFITS

Increased productivity of business end-users, IT developers, and consequently the entire organization.
Reduced backlog of applications development for IT staff by
making end-users self-sufficient enough to make their own
schema changes and build their own models.
Retention of organizational control over the integrity of
corporate data as OLAP applications are dependent on data
warehouses and OLTP systems to refresh their source-level
data.
Reduced query drag and network traffic on OLTP systems
or on the data warehouse.
Improved potential revenue and profitability by enabling
the organization to respond more quickly to market
demands.

REPRESENTATION OF MULTI-DIMENSIONAL DATA

OLAP database servers use multi-dimensional structures to store data and relationships between data.
Multi-dimensional structures are best visualized as cubes of data, and cubes within cubes of data. Each side of a cube is a dimension.

OLAP, by Dr. Khalil

CONTD

The cube can be expanded to include another dimension, for example, the
number of sales staff in each city.
The response time of a multi-dimensional query depends on how many
cells have to be added on-the-fly.

Multi-dimensional databases are a compact and easy-to-understand way of visualizing and manipulating data elements that have many interrelationships.
As the number of dimensions increases, the number of the cube's cells increases exponentially.
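
The growth in cell count is simple arithmetic: a cube has one cell per combination of dimension values, so the total is the product of the dimension sizes. A minimal sketch follows; the choice of 10 values per dimension is an arbitrary assumption for illustration.

```python
from math import prod


def cube_cells(cardinalities):
    """Number of cells in a cube with the given number of values per dimension."""
    return prod(cardinalities)


# With 10 values in every dimension, each added dimension multiplies the
# number of cells by 10.
sizes = [cube_cells([10] * dimensions) for dimensions in range(1, 5)]
print(sizes)
```

With k values in each of n dimensions the count is k to the power n, which is the exponential growth the slide refers to.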


CONTD

Multi-dimensional OLAP supports common analytical operations, such as:
Consolidation: involves the aggregation of data such as roll-ups or complex expressions involving interrelated data. For example, branch offices can be rolled up to cities and rolled up to countries.
Drill-Down: is the reverse of consolidation and involves displaying the detailed data that comprises the consolidated data.
Slicing and dicing: refers to the ability to look at the data from different viewpoints. Slicing and dicing is often performed along a time axis in order to analyze trends and find patterns.

OLAP TOOLS - CATEGORIES


OLAP tools are categorized according to the
architecture used to store and process multi-dimensional data.
There are four main categories of OLAP tools as
defined by Berson and Smith (1997) and Pendse and
Creeth (2001), including:
Multi-dimensional OLAP (MOLAP)
Relational OLAP (ROLAP)
Hybrid OLAP (HOLAP)
Desktop OLAP (DOLAP)

MULTI-DIMENSIONAL OLAP (MOLAP)

MOLAP tools use specialized data structures and multi-dimensional database management systems (MDDBMS) to organize, navigate, and analyze data.
To enhance query performance the data is typically aggregated and stored according to predicted usage.
MOLAP data structures use array technology and efficient storage techniques that minimize the disk space requirements through sparse data management.

The development issues associated with MOLAP:
Only a limited amount of data can be efficiently stored and analyzed.
Navigation and analysis of data are limited because the data is designed according to previously determined requirements.
MOLAP products require a different set of skills and tools to build and maintain the database.

RELATIONAL OLAP (ROLAP)

ROLAP is the fastest-growing type of OLAP tools.
ROLAP supports RDBMS products through the use of a metadata layer. This facilitates the creation of multiple multi-dimensional views of the two-dimensional relation.
To improve performance, some ROLAP products have enhanced SQL engines to support the complexity of multi-dimensional analysis, while others recommend, or require, the use of highly denormalized database designs such as the star schema.

The development issues associated with ROLAP technology:
Performance problems associated with the processing of complex queries that require multiple passes through the relational data.
Development of middleware to facilitate the development of multi-dimensional applications.
Development of an option to create persistent multi-dimensional structures, together with facilities to assist in the administration of these structures.

HYBRID OLAP (HOLAP)

HOLAP tools provide limited analysis capability, either directly against RDBMS products, or by using an intermediate MOLAP server.
HOLAP tools deliver selected data directly from the DBMS or via a MOLAP server to the desktop (or local server) in the form of a data cube, where it is stored, analyzed, and maintained locally.
The issues associated with HOLAP tools:
The architecture results in significant data redundancy and may cause problems for networks that support many users.
Ability of each user to build a custom data cube may cause a lack of data consistency among users.
Only a limited amount of data can be efficiently maintained.

DESKTOP OLAP (DOLAP)

DOLAP tools store the OLAP data in client-based files and support multi-dimensional processing using a client multi-dimensional engine.
DOLAP requires that relatively small extracts of data are held on client machines. This data may be distributed in advance or on demand (possibly through the Web).
The administration of a DOLAP database is typically performed by a central server or processing routine that prepares data cubes or sets of data for each user.
The development issues associated with DOLAP are as follows:
Provision of appropriate security controls to support all parts of the DOLAP environment.
Reduction in the effort involved in deploying and maintaining the DOLAP tools.
Current trends are towards thin client machines.

Slicing a data cube

Summary report

Drill-down: starting with summary data, users can obtain details for particular cells

Drill-down with color added

WAREHOUSE MODELS & OPERATORS

Data Models

relations
stars & snowflakes
cubes

Operators

slice & dice


roll-up, drill down
pivoting
other


CONCEPTUAL MODELING OF DATA WAREHOUSES

Modeling data warehouses: dimensions & measures
Star schema: A fact table in the middle connected to a set of dimension tables
Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake
Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation

EXAMPLE OF STAR SCHEMA


Sales Fact Table
keys: time_key, item_key, branch_key, location_key
measures: units_sold, dollars_sold, avg_sales

time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, province_or_street, country

EXAMPLE OF SNOWFLAKE SCHEMA


Sales Fact Table
keys: time_key, item_key, branch_key, location_key
measures: units_sold, dollars_sold, avg_sales

time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_key
supplier: supplier_key, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city_key
city: city_key, city, province_or_street, country

EXAMPLE OF FACT CONSTELLATION


Sales Fact Table
keys: time_key, item_key, branch_key, location_key
measures: units_sold, dollars_sold, avg_sales

Shipping Fact Table
keys: time_key, item_key, shipper_key, from_location, to_location
measures: dollars_cost, units_shipped

time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, province_or_street, country
shipper: shipper_key, shipper_name, location_key, shipper_type

STAR

product
prodId  name  price
p1      bolt  10
p2      nut   5

store
storeId  city
c1       nyc
c2       sfo
c3       la

customer
custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la

sale
orderId  date    custId  prodId  storeId  qty  amt
o100     1/7/97  53      p1      c1       1    12
o102     2/7/97  53      p2      c1       2    11
105      3/8/97  111     p1      c3       5    50

STAR SCHEMA

product
prodId
name
price

sale
orderId
date
custId
prodId
storeId
qty
amt

customer
custId
name
address
city

store
storeId
city


STAR SCHEMA EXAMPLE


SALES (fact table)
keys: time_key, product_key, location_key
measures: units_sold, amount

TIME: time_key, day, day_of_the_week, month, quarter, year
PRODUCT: product_key, product_name, category, brand, color, supplier_name
LOCATION: location_key, store, street_address, city, state, country, region

ADVANTAGES OF STAR SCHEMA

Facts and dimensions are clearly depicted
dimension tables are relatively static, data is loaded (append mostly) into fact table(s)
easy to comprehend (and write queries)

Find total sales per product-category in our stores in Europe
SELECT PRODUCT.category, SUM(SALES.amount)
FROM SALES, PRODUCT, LOCATION
WHERE SALES.product_key = PRODUCT.product_key
AND SALES.location_key = LOCATION.location_key
AND LOCATION.region = 'Europe'
GROUP BY PRODUCT.category
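
To see the query run, it can be executed against a tiny in-memory star schema with Python's built-in sqlite3 module. The sketch keeps only the columns the query touches, omits the TIME dimension, and uses invented sample rows; the region value is written as the quoted string literal 'Europe', which SQL requires, and an ORDER BY is added only to make the printed order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables (reduced to the columns used below).
CREATE TABLE PRODUCT  (product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
CREATE TABLE LOCATION (location_key INTEGER PRIMARY KEY, city TEXT, region TEXT);
-- Fact table: foreign keys to the dimensions plus the measures.
CREATE TABLE SALES (
    product_key  INTEGER REFERENCES PRODUCT(product_key),
    location_key INTEGER REFERENCES LOCATION(location_key),
    units_sold   INTEGER,
    amount       REAL
);
-- Invented sample data.
INSERT INTO PRODUCT  VALUES (1, 'DVD player', 'Electronics'), (2, 'Desk', 'Furniture');
INSERT INTO LOCATION VALUES (1, 'Paris', 'Europe'), (2, 'Berlin', 'Europe'), (3, 'Tokyo', 'Asia');
INSERT INTO SALES VALUES (1, 1, 2, 200.0), (1, 2, 1, 100.0), (2, 1, 1, 80.0), (1, 3, 5, 500.0);
""")

rows = conn.execute("""
SELECT PRODUCT.category, SUM(SALES.amount)
FROM SALES, PRODUCT, LOCATION
WHERE SALES.product_key = PRODUCT.product_key
  AND SALES.location_key = LOCATION.location_key
  AND LOCATION.region = 'Europe'
GROUP BY PRODUCT.category
ORDER BY PRODUCT.category
""").fetchall()
print(rows)
```

The Tokyo sale is filtered out by the region condition, so only the European amounts are summed per category.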


STAR SCHEMA QUERY PROCESSING


[Figure: the TIME, PRODUCT, LOCATION and SALES tables of the star schema example, with the selection region = Europe marked on LOCATION and the projection on category marked on PRODUCT]

JOIN-INDEX

Join index relates the values of the dimensions of a star schema to rows in the fact table.
a join index on region maintains for each distinct region a list of ROW-IDs of the tuples recording the sales in the region
Join indices can span multiple dimensions OR
can be implemented as bitmap indexes (per dimension)
use bit-op for multiple-joins

[Figure: bitmap join index from LOCATION to SALES, with one bitmap per value (region = Africa, region = America, region = Asia, region = Europe); bits set to 1 mark SALES rows such as R102, R117, R118, R124]
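
The per-dimension bitmap variant can be sketched in a few lines. Each bitmap is a Python integer with one bit per fact row, and combining two dimensions is a single bitwise AND. The dimension contents, the extra row IDs R103 and R120, and the assignment of rows to regions are invented for illustration; the slide's figure does not say which region its rows belong to.

```python
from collections import defaultdict

# Hypothetical dimension tables: key -> attribute value.
location_region = {1: "Europe", 2: "Asia", 3: "America"}
product_category = {10: "Electronics", 11: "Furniture"}

# Hypothetical fact rows: (row_id, location_key, product_key, amount).
sales = [
    ("R102", 1, 10, 100.0),
    ("R103", 2, 10, 40.0),
    ("R117", 1, 11, 70.0),
    ("R118", 1, 10, 30.0),
    ("R120", 3, 11, 90.0),
    ("R124", 1, 11, 10.0),
]


def build_bitmap_join_index(fact_rows, key_position, dimension):
    """Map each dimension value to a bitmap with bit i set for fact row i."""
    index = defaultdict(int)
    for position, row in enumerate(fact_rows):
        value = dimension[row[key_position]]  # the join, done once at build time
        index[value] |= 1 << position
    return dict(index)


def matching_row_ids(bitmap, fact_rows):
    """Decode a bitmap back into the fact row IDs it selects."""
    return [row[0] for position, row in enumerate(fact_rows)
            if (bitmap >> position) & 1]


region_index = build_bitmap_join_index(sales, 1, location_region)
category_index = build_bitmap_join_index(sales, 2, product_category)

europe_rows = matching_row_ids(region_index["Europe"], sales)
# Multiple joins reduce to one bit operation on the two bitmaps.
europe_furniture = matching_row_ids(
    region_index["Europe"] & category_index["Furniture"], sales)
print(europe_rows, europe_furniture)
```

At query time no join against LOCATION or PRODUCT is needed; the work was done when the indexes were built.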


DATA CUBE: MULTIDIMENSIONAL VIEW


[Figure: sales data cube with axes Quarter (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Region (America, Europe, Asia, sum) and product (DVD, PC, VCR, sum); the highlighted cell is the total annual sales of DVDs in America]

DATA CUBE COMPUTATION

Model dependencies among the aggregates:

product,store,quarter (most detailed view)
product,quarter    store,quarter    product,store
quarter    product    store
none

The view (product,store) can be computed from view (product,store,quarter) by summing up all quarterly sales
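
The dependency lattice can be enumerated mechanically: every subset of the dimensions is one aggregate view, and each view can be derived from any view that contains its dimensions. The sketch below is one way to do this in Python; the fact values reuse the amounts from the sale examples in this deck, with Q1 and Q2 as made-up quarter labels.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("product", "store", "quarter")

# Most detailed view: (product, store, quarter) -> sales amount (illustrative).
base = {
    ("p1", "s1", "Q1"): 12, ("p2", "s1", "Q1"): 11,
    ("p1", "s3", "Q1"): 50, ("p2", "s2", "Q1"): 8,
    ("p1", "s1", "Q2"): 44, ("p1", "s2", "Q2"): 4,
}


def aggregate(view, view_dims, keep_dims):
    """Sum a view down to the dimensions in keep_dims (a subset of view_dims)."""
    positions = [view_dims.index(d) for d in keep_dims]
    result = defaultdict(int)
    for key, amount in view.items():
        result[tuple(key[p] for p in positions)] += amount
    return dict(result)


# One entry per node of the lattice, from the most detailed view down to "none".
lattice = {}
for size in range(len(DIMENSIONS), -1, -1):
    for dims in combinations(DIMENSIONS, size):
        lattice[dims] = aggregate(base, DIMENSIONS, dims)

# A view can also be computed from a smaller parent instead of the base view.
product_from_parent = aggregate(
    lattice[("product", "store")], ("product", "store"), ("product",))
print(len(lattice), lattice[()], product_from_parent)
```

Computing a view from its smallest available parent rather than from the base data is the reason for modeling these dependencies in the first place.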


THE MOLAP CUBE

Fact table view:

sale
prodId  storeId  amt
p1      s1       12
p2      s1       11
p1      s3       50
p2      s2       8

Multi-dimensional cube:

      s1   s2   s3
p1    12        50
p2    11    8

dimensions = 2

3-D CUBE
Fact table view:

sale
prodId  storeId  date  amt
p1      s1       1     12
p2      s1       1     11
p1      s3       1     50
p2      s2       1     8
p1      s1       2     44
p1      s2       2     4

Multi-dimensional cube:

day 1
      s1   s2   s3
p1    12        50
p2    11    8

day 2
      s1   s2   s3
p1    44    4
p2

dimensions = 3

EXAMPLE
Dimensions:
Time, Product, Store
Attributes:
Product (upc, price, ...)
Store
Hierarchies:
Product -> Brand
Day -> Week -> Quarter
Store -> Region -> Country

[Figure: data cube with axes Store (NY, SF, LA), Product (Juice, Milk, Coke, Cream, Soap, Bread) and Time (M T W Th F S S); visible cells hold 10, 34, 56, 32, 12, 56; arrows mark roll-up to region, roll-up to brand and roll-up to week]
56 units of bread sold in LA on M

CUBE AGGREGATION: ROLL-UP


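Example: computing sums

day 1
      s1   s2   s3
p1    12        50
p2    11    8

day 2
      s1   s2   s3
p1    44    4
p2

Rolled up over date:
      s1   s2   s3
p1    56    4   50
p2    11    8

Rolled up over date and product:
      s1   s2   s3
sum   67   12   50

Rolled up over date and store:
      sum
p1    110
p2    19

Rolled up over all dimensions: 129

rollup moves from detail toward the sums; drill-down moves back toward the detail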
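
The sums in this example can be reproduced with a short sketch. The cube is held as a Python dictionary keyed by (product, store, day), and roll_up is a hypothetical helper name: it drops the dimensions that are not kept and adds up the cells that collapse together.

```python
from collections import defaultdict

# The 3-D cube from the example: (product, store, day) -> amt.
sale = {
    ("p1", "s1", 1): 12, ("p2", "s1", 1): 11,
    ("p1", "s3", 1): 50, ("p2", "s2", 1): 8,
    ("p1", "s1", 2): 44, ("p1", "s2", 2): 4,
}


def roll_up(cube, keep):
    """Aggregate a cube, keeping only the key positions listed in keep."""
    rolled = defaultdict(int)
    for key, amt in cube.items():
        rolled[tuple(key[i] for i in keep)] += amt
    return dict(rolled)


by_product_store = roll_up(sale, (0, 1))       # sum over days
by_store = roll_up(by_product_store, (1,))     # then sum over products
by_product = roll_up(by_product_store, (0,))   # or sum over stores
grand_total = roll_up(sale, ())                # sum over everything

print(by_product_store)
print(by_store, by_product, grand_total)
```

Drill-down is the reverse lookup: from a rolled-up cell such as (p1, s1) = 56 back to the detailed cells (p1, s1, day 1) = 12 and (p1, s1, day 2) = 44 that produced it.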


CUBE OPERATORS FOR ROLL-UP

day 1
      s1   s2   s3
p1    12        50
p2    11    8

day 2
      s1   s2   s3
p1    44    4
p2

Aggregates are addressed by writing * for a dimension that is summed over:
sale(s1,*,*) = 67
sale(s2,p2,*) = 8
sale(*,*,*) = 129
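
The star notation can be mimicked with a small lookup function. In this sketch the cube is keyed in the slide's argument order (store, product, date), and cell is a hypothetical helper that sums every stored cell matching the pattern, treating "*" as "any value".

```python
ALL = "*"

# Cube keyed in the slide's argument order: (store, product, date) -> amt.
sale = {
    ("s1", "p1", 1): 12, ("s1", "p2", 1): 11,
    ("s3", "p1", 1): 50, ("s2", "p2", 1): 8,
    ("s1", "p1", 2): 44, ("s2", "p1", 2): 4,
}


def cell(store=ALL, product=ALL, date=ALL):
    """Value of sale(store, product, date), where '*' sums over a dimension."""
    pattern = (store, product, date)
    return sum(
        amt for key, amt in sale.items()
        if all(p == ALL or p == k for p, k in zip(pattern, key))
    )


print(cell("s1"), cell("s2", "p2"), cell(), cell(product="p2"))
```

A MOLAP server would normally store such aggregates ahead of time instead of scanning the cells on every request; the scan here only keeps the example short.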

EXTENDED CUBE
Each dimension gains the extra value * (all):

day 1
      s1   s2   s3    *
p1    12        50   62
p2    11    8        19
*     23    8   50   81

day 2
      s1   s2   s3    *
p1    44    4        48
p2
*     44    4        48

* (all days)
      s1   s2   s3    *
p1    56    4   50  110
p2    11    8        19
*     67   12   50  129

sale(*,p2,*) = 19

AGGREGATION USING HIERARCHIES

day 1
      s1   s2   s3
p1    12        50
p2    11    8

day 2
      s1   s2   s3
p1    44    4
p2

Hierarchy: store -> region -> country
(store s1 in Region A; stores s2, s3 in Region B)

      region A   region B
p1    56         54
p2    11         8
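
Aggregating along a hierarchy only needs a mapping from each member to its parent. The sketch below rolls the store dimension up to regions using the assignment given in the example (s1 in Region A; s2 and s3 in Region B); the dictionary layout and variable names are illustrative choices.

```python
from collections import defaultdict

# One level of the store -> region -> country hierarchy.
store_region = {"s1": "region A", "s2": "region B", "s3": "region B"}

# The 3-D cube from the example: (product, store, day) -> amt.
sale = {
    ("p1", "s1", 1): 12, ("p2", "s1", 1): 11,
    ("p1", "s3", 1): 50, ("p2", "s2", 1): 8,
    ("p1", "s1", 2): 44, ("p1", "s2", 2): 4,
}

# Replace each store by its region and sum over days.
by_product_region = defaultdict(int)
for (product, store, day), amt in sale.items():
    by_product_region[(product, store_region[store])] += amt
by_product_region = dict(by_product_region)

print(by_product_region)
```

Rolling up further to country would repeat the same step with a region-to-country mapping.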

POINTS TO BE NOTICED ABOUT MOLAP


Pre-calculating or pre-consolidating transactional data improves speed.
BUT
When fully pre-consolidating incoming data, MDDs require an enormous amount of overhead both in processing time and in storage. An input file of 200MB can easily expand to 5GB
MDDs are great candidates for the <50GB department data marts.
Rolling up and drilling down through aggregate data.
With MDDs, application design is essentially the definition of dimensions and calculation rules, while the RDBMS requires that the database schema be a star or snowflake.

HYBRID OLAP (HOLAP)


HOLAP = Hybrid OLAP:
Best of both worlds
Storing detailed data in RDBMS
Storing aggregated data in MDBMS
User access via MOLAP tools

DATA FLOW IN HOLAP


[Figure: data flow in HOLAP. Components: RDBMS Server, MDBMS Server, Client (Multidimensional Viewer, Relational Viewer). Labelled flows: Multidimensional access, SQL-Read, SQL Reach-Through. Data: User data, Multidimensional data, Meta data, Derived data]
