Data-Warehousing
Typical Queries on DW
What was the total number of cell phones sold in India in 2013, grouped by company?
What was the total revenue from property sales for each type of property in Mangalore between 2006 and 2008?
What would be the effect on cell phone sales in Mangalore if a new college were opened?
Which type of cell phone sells the most in Mangalore?
Which was the most travelled train in India in 2013?
Data Warehouse
Defined in many different ways, but not rigorously.
A decision support database that is maintained separately from the organization's operational database.
Supports information processing by providing a solid platform of consolidated, historical data for analysis.
A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process. (W. H. Inmon)
Ralph Kimball
DW Subject Oriented
Organized around major subjects, such as customer, product, sales
Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing
Provides a simple and concise view of particular subject issues by excluding data that are not useful in the decision support process
DW - Integrated
Constructed by integrating multiple,
heterogeneous data sources
relational databases, flat files, on-line transaction
records
DW Time Variant
The time horizon for the data warehouse is
significantly longer than that of operational
systems
Operational database: current value data
Data warehouse data: provides information from a historical perspective (e.g., past 5-10 years)
DW Non-volatile
A physically separate store of data transformed
from the operational environment
Operational update of data does not occur in the
data warehouse environment
Does not require transaction processing, recovery,
and concurrency control mechanisms
Requires only two to three operations in data
accessing:
initial loading, incremental loading of data and
access of data
OLTP vs. OLAP
                    OLTP                                  OLAP
users               clerk, IT professional                knowledge worker
function            day-to-day operations                 decision support
DB design           application-oriented                  subject-oriented
data                current, up-to-date;                  historical;
                    detailed, flat relational;            summarized, multidimensional;
                    isolated                              integrated, consolidated
usage               repetitive                            ad-hoc/repetitive
access              read/write;                           lots of scans
                    index/hash on prim. key
unit of work        short, simple transaction             complex query
# records accessed  tens                                  millions
# users             thousands                             hundreds
DB size             100MB-GB                              100GB-TB
metric              transaction throughput                query throughput, response
A Typical Data Warehouse
Operational metadata
History of migrated data, currency of data (active, archived, or purged), algorithms used for queries
Business metadata
Business terms and definitions, ownership of data, policies (scope of DW, security)
Uses of Metadata
Some of the uses
Extraction and loading processes - metadata is used
to map data sources to a common view of information
within the warehouse
Warehouse management process - metadata is used
to automate the production of summary tables
Query management process - metadata is used to
direct a query to the most appropriate data source
Region
Reg_ID  Reg_Name
1       Europe
2       North America
3       Asia
Country (columns: Reg_ID, Cntry_ID, Cntry_Name)
Reg_ID  Cntry_Name
1       Germany
1       Spain
2       Canada
2       Mexico
3       India
3       China
City (columns: City_ID, Reg_ID, Cntry_ID, City_Name)
City_Name
Frankfurt
Vancouver
Toronto
Mexico City
Delhi
Beijing
Mumbai
Madrid
Location
Region         Country  City
Europe         Germany  Frankfurt
Europe         Spain    Madrid
North America  Canada   Vancouver
North America  Mexico   Mexico City
Asia           India    Delhi
Asia           China    Beijing
Asia           India    Mumbai
Concept Hierarchy
OLAP:
select * from location
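The Location table flattens the concept hierarchy city, country, region into one row per city, so rolling up along the hierarchy amounts to regrouping the same rows at a coarser level. The following is a minimal sketch of that idea. The rows come from the Location table, while the `roll_up` helper and the `LEVELS` mapping are hypothetical names introduced only for illustration.

```python
from collections import defaultdict

# Rows of the denormalized Location table: (region, country, city).
LOCATION = [
    ("Europe", "Germany", "Frankfurt"),
    ("Europe", "Spain", "Madrid"),
    ("North America", "Canada", "Vancouver"),
    ("North America", "Mexico", "Mexico City"),
    ("Asia", "India", "Delhi"),
    ("Asia", "China", "Beijing"),
    ("Asia", "India", "Mumbai"),
]

# Position of each hierarchy level inside a row (illustrative mapping).
LEVELS = {"region": 0, "country": 1, "city": 2}


def roll_up(rows, level):
    """Group city names under the chosen hierarchy level (hypothetical helper)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[LEVELS[level]]].append(row[2])
    return dict(groups)


cities_by_country = roll_up(LOCATION, "country")
cities_by_region = roll_up(LOCATION, "region")
```

Grouping by "country" and then by "region" uses the same rows; only the grouping column changes, which is what makes the hierarchy convenient for roll-up and drill-down.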
DW Query Complexity
Complexity increases just by adding a maximum price:
Query 1: A simple data cube query: Find the total sales in 2004, broken down by product, region, and month, with subtotals for each dimension.
Query 2: A complex data cube query: Grouping by all subsets of product, region, and month, find the maximum price in 2004 for each group and the total sales among all maximum-price tuples.
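To make Query 2 concrete, here is a rough sketch that groups by every subset of the three dimensions, finds the maximum price in each group, and sums the sales of only those tuples that carry the maximum price. The rows, the field names, and the `max_price_cube` helper are all made up for illustration; every row is assumed to belong to 2004.

```python
from itertools import combinations

DIMS = ("product", "region", "month")

# Hypothetical 2004 sales rows; the values are invented for this sketch.
ROWS = [
    {"product": "TV", "region": "Asia", "month": "Jan", "price": 500, "sales": 10},
    {"product": "TV", "region": "Asia", "month": "Feb", "price": 500, "sales": 20},
    {"product": "TV", "region": "Europe", "month": "Jan", "price": 450, "sales": 5},
    {"product": "PC", "region": "Asia", "month": "Jan", "price": 900, "sales": 7},
]


def max_price_cube(rows, dims=DIMS):
    """Return {(group-by subset, group key): (max price, sales among max-price rows)}."""
    result = {}
    for k in range(len(dims) + 1):
        for subset in combinations(dims, k):
            # Partition the rows by the current group-by subset.
            groups = {}
            for row in rows:
                key = tuple(row[d] for d in subset)
                groups.setdefault(key, []).append(row)
            # The second aggregate depends on the first one (the maximum),
            # which is what makes this query harder than a plain sum.
            for key, members in groups.items():
                top = max(r["price"] for r in members)
                total = sum(r["sales"] for r in members if r["price"] == top)
                result[(subset, key)] = (top, total)
    return result


cube = max_price_cube(ROWS)
```

For instance, `cube[(("product",), ("TV",))]` holds the maximum TV price together with the sales of the TV rows sold at that price.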
Star Schema
Snowflake Schema
FACT Constellation
Ordinary Index
Surrogate Key
In most cases it is not a natural key; it is kept in addition to the business key.
Represents an object in the database, but is not visible outside it
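As a small illustration of the surrogate key idea, the sketch below hands out warehouse-internal integer keys for incoming business keys. The class name `CustomerDimension`, the method `surrogate_for`, and the business keys such as "CUST-0042" are hypothetical and serve only as an example.

```python
import itertools


class CustomerDimension:
    """Assigns internal surrogate keys to business keys (illustrative only)."""

    def __init__(self):
        self._next_key = itertools.count(1)
        self._by_business_key = {}

    def surrogate_for(self, business_key):
        # Reuse the existing surrogate key, or allocate the next integer.
        if business_key not in self._by_business_key:
            self._by_business_key[business_key] = next(self._next_key)
        return self._by_business_key[business_key]


dim = CustomerDimension()
k1 = dim.surrogate_for("CUST-0042")
k2 = dim.surrogate_for("CUST-0099")
k1_again = dim.surrogate_for("CUST-0042")
```

The surrogate key carries no business meaning, so fact rows can reference it even if the business key later changes format.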
B Plus Tree
Bitmap Index
Index on a particular column
Each value in the column has a bit vector
The length of the bit vector: # of records in the
base table
Base table
Cust  Region
C1    Asia
C2    Europe
C3    Asia
C4    America
C5    Europe
Index on Region
Index on Type
DSS Systems
Generally suited to low-cardinality columns (but need not be limited to them)
Suited to systems whose data changes during non-peak business hours
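The bitmap index on Region can be sketched in a few lines: one bit vector per distinct value, each as long as the base table. The customer and region values are taken from the base table on the slide; the helper names `build_bitmap_index` and `bitmap_or` are assumptions made for this example.

```python
# Base table from the slide: (customer, region).
BASE = [
    ("C1", "Asia"),
    ("C2", "Europe"),
    ("C3", "Asia"),
    ("C4", "America"),
    ("C5", "Europe"),
]


def build_bitmap_index(rows, column):
    """Map each distinct column value to a bit vector with one bit per row."""
    index = {}
    for pos, row in enumerate(rows):
        vector = index.setdefault(row[column], [0] * len(rows))
        vector[pos] = 1
    return index


def bitmap_or(a, b):
    """Combine two bit vectors; answers 'value A or value B' without scanning rows."""
    return [x | y for x, y in zip(a, b)]


region_index = build_bitmap_index(BASE, 1)
asia_or_europe = bitmap_or(region_index["Asia"], region_index["Europe"])
matches = [BASE[i][0] for i, bit in enumerate(asia_or_europe) if bit]
```

Selections on several values or several indexed columns reduce to bitwise operations on these vectors, which is why the structure suits low-cardinality columns.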
JOIN Index
In data warehouses, a join index relates the values of the dimensions of a star schema to rows in the fact table
E.g., a Sales fact table and two dimensions, city and product
A join index on city maintains, for each distinct city, a list of R-IDs (primary keys) of the tuples recording the sales in that city
Join indices can span multiple dimensions
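A join index of this kind can be sketched as a dictionary from dimension values to fact-row identifiers. The Sales rows, the R-ID labels, and the `build_join_index` helper below are invented for illustration.

```python
from collections import defaultdict

# Hypothetical Sales fact rows: R-ID -> (city, product, amount).
SALES = {
    "R1": ("Delhi", "TV", 100),
    "R2": ("Mumbai", "PC", 50),
    "R3": ("Delhi", "PC", 70),
    "R4": ("Madrid", "TV", 250),
}


def build_join_index(facts, *positions):
    """Map each dimension value (or tuple of values) to the R-IDs of matching facts."""
    index = defaultdict(list)
    for rid, row in facts.items():
        key = tuple(row[p] for p in positions)
        index[key if len(key) > 1 else key[0]].append(rid)
    return dict(index)


# Join index on one dimension (city) and on two dimensions (city, product).
city_index = build_join_index(SALES, 0)
city_product_index = build_join_index(SALES, 0, 1)

# Total sales in Delhi, read through the index instead of scanning all facts.
delhi_total = sum(SALES[rid][2] for rid in city_index["Delhi"])
```

The two-dimension variant shows how a join index can span multiple dimensions: the key is simply the combination of dimension values.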
Join Index
4. Partitioning:
Physical partitioning
Eg: Partitioning Date/Time Dimension
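The effect of partitioning on the date dimension can be imitated in memory: facts are split by month, and a query for one month reads only that partition. This is only a sketch of the idea, not of how a database stores partitions physically; the fact rows and the helper names `partition_by_month` and `monthly_total` are assumptions.

```python
from collections import defaultdict
from datetime import date

# Hypothetical fact rows: (sale_date, product, amount).
FACTS = [
    (date(2013, 1, 1), "TV", 100),
    (date(2013, 1, 2), "PC", 70),
    (date(2013, 2, 3), "TV", 300),
    (date(2013, 3, 4), "PC", 60),
]


def partition_by_month(rows):
    """Split fact rows into one partition per (year, month)."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[(row[0].year, row[0].month)].append(row)
    return dict(partitions)


def monthly_total(partitions, year, month):
    # Only the matching partition is read; the other partitions are skipped.
    return sum(amount for _, _, amount in partitions.get((year, month), []))


partitions = partition_by_month(FACTS)
january_total = monthly_total(partitions, 2013, 1)
```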
Data Cube
The key operation of OLAP is the formation of a data cube
Pre-computed query result
A data cube allows data to be modeled and viewed in multiple dimensions. It is defined by dimensions and facts
A data cube is a multidimensional representation of data, together with all possible aggregates
Spreadsheet Data
Date      Location  Product  Sales
1-Jan-13  USA       TV       100
2-Jan-13  Canada    TV       250
3-Jan-13  Mexico    TV       300
4-Jan-13  Brazil    TV       200
1-Jan-13  USA       PC       50
2-Jan-13  Canada    PC       70
3-Jan-13  Mexico    PC       40
4-Jan-13  Brazil    PC       60
Represent in 3 Dimensions
Consider the previous sales of products at a number of locations on various dates
This data can be represented as a 3-dimensional array
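As a minimal sketch, the spreadsheet rows can be loaded into a 3-dimensional array indexed by date, location, and product. The rows are the ones from the spreadsheet slide; the variable names and the choice of nested lists (rather than an array library) are assumptions made to keep the example self-contained.

```python
# Spreadsheet rows from the slide: (date, location, product, sales).
ROWS = [
    ("1-Jan-13", "USA", "TV", 100),
    ("2-Jan-13", "Canada", "TV", 250),
    ("3-Jan-13", "Mexico", "TV", 300),
    ("4-Jan-13", "Brazil", "TV", 200),
    ("1-Jan-13", "USA", "PC", 50),
    ("2-Jan-13", "Canada", "PC", 70),
    ("3-Jan-13", "Mexico", "PC", 40),
    ("4-Jan-13", "Brazil", "PC", 60),
]

# Distinct values per dimension, in first-seen order.
dates = list(dict.fromkeys(r[0] for r in ROWS))
locations = list(dict.fromkeys(r[1] for r in ROWS))
products = list(dict.fromkeys(r[2] for r in ROWS))

# cube[d][l][p] holds the sales for one (date, location, product) cell.
cube = [[[0 for _ in products] for _ in locations] for _ in dates]
for day, loc, prod, sales in ROWS:
    cube[dates.index(day)][locations.index(loc)][products.index(prod)] = sales

# Aggregating over two dimensions gives the total per product.
tv = products.index("TV")
tv_total = sum(cube[d][l][tv] for d in range(len(dates)) for l in range(len(locations)))
grand_total = sum(cell for plane in cube for line in plane for cell in line)
```

Most cells stay at zero because each location reported sales on a single date, which shows how sparse such an array can be.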
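[Figure: a 3-D data cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, PC, VCR, sum), and Country (Asia Pacific, Europe, sum)]
[Figure: lattice of cuboids for this cube. 1-D cuboids: date; country. 2-D cuboids: product,date; product,country; date,country. 3-D (base) cuboid.]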
A Simple Representation
Base and aggregate cells.
Consider the data cube with the dimensions Date, Product, Country and the fact Quantity.
1-D cells: (Jan, *, *, 350)
1-D cells: (Feb, *, *, 50)
2-D cells: (Jan, *, Mexico, 70)
3-D cells: (Aug, TV, USA, 80)
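The "*" notation can be read as "aggregated over this dimension". The sketch below computes the value of an aggregate cell by summing the base cells it covers. The base cells are hypothetical; they were chosen so that the aggregates (Jan, *, *), (Feb, *, *), and (Jan, *, Mexico) come out as on the slide. The helper names are assumptions.

```python
# Hypothetical base cells: (date, product, country) -> quantity.
BASE_CELLS = {
    ("Jan", "TV", "USA"): 200,
    ("Jan", "PC", "USA"): 80,
    ("Jan", "TV", "Mexico"): 70,
    ("Feb", "PC", "USA"): 50,
}


def cell_value(pattern, base=BASE_CELLS):
    """Sum the base cells matched by a pattern; '*' matches any value."""
    return sum(
        quantity
        for key, quantity in base.items()
        if all(p == "*" or p == k for p, k in zip(pattern, key))
    )


def cell_dimensionality(pattern):
    """Number of dimensions that are not aggregated away (not '*')."""
    return sum(1 for p in pattern if p != "*")


jan_total = cell_value(("Jan", "*", "*"))
jan_mexico = cell_value(("Jan", "*", "Mexico"))
```

A cell with no "*" is a base cell; every other cell is an aggregate cell whose dimensionality is the count of fixed values.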
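[Figure: lattice of cuboids for the dimensions time, product, location, supplier]
1-D cuboids: time; product; location; supplier
2-D cuboids: time,location; time,product; product,location; time,supplier; location,supplier; product,supplier
3-D cuboids: time,location,supplier; time,product,location; time,product,supplier; product,location,supplier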
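The lattice can be enumerated mechanically: every subset of the dimensions is one cuboid, so n dimensions give 2^n cuboids when no concept hierarchies are involved. A minimal sketch, with `cuboid_lattice` as a hypothetical helper name:

```python
from itertools import combinations

DIMENSIONS = ("time", "product", "location", "supplier")


def cuboid_lattice(dims):
    """Return {k: list of k-D cuboids}, from the 0-D apex to the n-D base cuboid."""
    return {k: list(combinations(dims, k)) for k in range(len(dims) + 1)}


lattice = cuboid_lattice(DIMENSIONS)
total_cuboids = sum(len(cuboids) for cuboids in lattice.values())
```

Here `lattice[2]` lists the six 2-D cuboids and `lattice[3]` the four 3-D cuboids named above.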
[Figure: cuboid [product, city, year] with measure sum, and the 1-D cuboids (city), (product), (year)]
3. Class Characterization:
Initial relation (attributes: Name, Gender, Major, Birth-Place, Birth_date, Residence, Phone #, GPA)
Name            Major    Birth-Place            Birth_date  Phone #   GPA
Jim Woodman     CS       Vancouver, BC, Canada  8-12-76     687-4598  3.67
Scott Lachance  CS       Montreal, Que, Canada  28-7-75     253-9106  3.70
Laura Lee       Physics  Seattle, WA, USA       25-8-70     420-5232  3.83
Generalization applied to each attribute
Name: Removed
Gender: Retained
Major: Sci, Eng, Bus
Birth-Place: Country
Birth_date: Age range
Residence: City
Phone #: Removed
GPA: Excl, VG, ..
Generalized relation
Gender  Major    Birth_region  Age_range  Residence  GPA        Count
M       Science  Canada        20-25      Richmond   Very-good  16
F       Science  Foreign       25-30      Burnaby    Excellent  22
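The generalization step can be sketched as follows: identifying attributes are dropped, the remaining ones are replaced by higher-level concepts, and identical generalized tuples are merged with a count. Only the three students listed on the slide are used, so the counts differ from the slide's 16 and 22. The GPA cut-off of 3.75, the `MAJOR_HIERARCHY` mapping, and the `generalize` helper are assumptions for illustration.

```python
from collections import Counter

# The three students from the slide, restricted to attributes that are generalized here.
STUDENTS = [
    {"name": "Jim Woodman", "major": "CS", "birth_place": "Vancouver, BC, Canada", "gpa": 3.67},
    {"name": "Scott Lachance", "major": "CS", "birth_place": "Montreal, Que, Canada", "gpa": 3.70},
    {"name": "Laura Lee", "major": "Physics", "birth_place": "Seattle, WA, USA", "gpa": 3.83},
]

# Hypothetical concept hierarchy for Major.
MAJOR_HIERARCHY = {"CS": "Science", "Physics": "Science"}


def generalize(row):
    """Drop the name and lift each remaining attribute one level up."""
    country = row["birth_place"].split(",")[-1].strip()
    region = "Canada" if country == "Canada" else "Foreign"
    # Assumed threshold: 3.75 and above counts as Excellent.
    grade = "Excellent" if row["gpa"] >= 3.75 else "Very-good"
    return (MAJOR_HIERARCHY[row["major"]], region, grade)


# Merge identical generalized tuples and keep a count for each.
generalized = Counter(generalize(row) for row in STUDENTS)
```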
References
Data Mining: Concepts and Techniques
Jiawei Han, Micheline Kamber, Jian Pei