Unit - IV : Lesson 8
Organizational Information
Levels, formats, and granularities of organizational information
Atomic (Transaction)
Lightly Summarized
Highly Summarized
Scrubbing Data
Sophisticated transformation tools are used to clean the data and improve its quality.
Clean data is vital for the success of the warehouse.
Example:
Seshadri, Sheshadri, Sesadri, Seshadri S., Srinivasan Seshadri, etc. are the same person
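The name-variant example above can be sketched with Python's standard difflib module. This is only a minimal illustration of fuzzy matching, not the transformation tooling the slides refer to. The canonical spelling, the 0.8 similarity threshold, and the helper names are assumptions made for this example.

```python
from difflib import SequenceMatcher

CANONICAL = "Seshadri"  # assumed canonical surname for this sketch


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between two strings (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def matches_canonical(raw: str, threshold: float = 0.8) -> bool:
    """True if any token of the raw name is close enough to the canonical surname."""
    tokens = [t.strip(".,") for t in raw.split()]
    return any(similarity(t, CANONICAL) >= threshold for t in tokens if t)


variants = ["Seshadri", "Sheshadri", "Sesadri", "Seshadri S.", "Srinivasan Seshadri"]
flags = {v: matches_canonical(v) for v in variants}
```

Every variant from the slide is flagged as a match, while an unrelated surname such as "Sharma" falls below the threshold. A production scrubbing step would normally combine several signals (address, date of birth, keys) rather than rely on one string comparison.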
Capture = extract: obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse
Static extract = capturing a snapshot of the source data at a point in time
Incremental extract = capturing changes that have occurred since the last static extract
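The difference between a static and an incremental extract can be sketched as follows. The row layout, the `updated_at` column, and the dates are hypothetical and exist only to make the example runnable; real sources often track changes through logs or change-data-capture instead.

```python
from datetime import datetime

# Hypothetical source rows; "updated_at" is an assumed change-tracking column.
source = [
    {"id": 1, "amt": 12, "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "amt": 11, "updated_at": datetime(2024, 1, 5)},
    {"id": 3, "amt": 50, "updated_at": datetime(2024, 1, 9)},
]


def static_extract(rows):
    """Snapshot of the chosen source data at a point in time (copies every row)."""
    return [dict(r) for r in rows]


def incremental_extract(rows, last_extract_time):
    """Only the rows that changed after the previous extract."""
    return [dict(r) for r in rows if r["updated_at"] > last_extract_time]


snapshot = static_extract(source)
changes = incremental_extract(source, datetime(2024, 1, 4))
```

Here the snapshot holds all three rows, while the incremental extract holds only rows 2 and 3, which were modified after the assumed previous extract time.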
Scrub = cleanse: uses pattern recognition and AI techniques to upgrade data quality
Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies
Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data
Transform = convert data from the format of the operational system to the format of the data warehouse
Record-level:
Field-level:
single-field: from one field to one field
multi-field: from many fields to one, or one field to many
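The field-level cases above can be illustrated with three small functions. The specific conversions (temperature, name, and date fields) are assumptions chosen for the example and do not come from the slides.

```python
def single_field(temp_f: float) -> float:
    """One field to one field: convert Fahrenheit to Celsius."""
    return round((temp_f - 32) * 5 / 9, 1)


def many_to_one(first: str, last: str) -> str:
    """Many fields to one: combine two name fields into a single field."""
    return f"{last}, {first}"


def one_to_many(iso_date: str) -> dict:
    """One field to many: split a YYYY-MM-DD string into separate fields."""
    year, month, day = iso_date.split("-")
    return {"year": int(year), "month": int(month), "day": int(day)}
```

For instance, `many_to_one("Srinivasan", "Seshadri")` yields a single name field, and `one_to_many("1999-03-07")` yields separate year, month, and day fields.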
Load/Index = place transformed data into the warehouse and create indexes
Refresh mode: bulk rewriting of
Update mode: only changes in
non-volatile
sales
Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing
Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process
Data Warehouse: Integrated
Constructed by integrating multiple, heterogeneous data sources: relational databases, on-line transaction records
Data cleaning and data integration techniques are applied.
Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
When data is moved to the warehouse, it is converted. E.g.: male, female (0, 1)
But the key of operational data may or may not contain a time element
Data Warehouse: Nonvolatile
A physically separate store of data transformed from the operational environment
Operational update of data does not occur in the data warehouse environment
Stages: Analysis, Design, Import data, Install front-end tools, Test and deploy
Stage 1: Analysis
Identify:
Target questions
Data needs
Timeliness of data
Granularity
Stage 2: Design
Star schema
Data transformation
Aggregates
Pre-calculated values
HW/SW architecture
Dimensional Modeling
[Figure: data warehouse architecture. Tiers: Data Sources, Data Storage (Data Warehouse, Data Marts, Metadata), OLAP Engine, Front-End Tools]
Lattice of cuboids:
0-D (apex) cuboid
1-D cuboids: time; item; location; supplier
2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
4-D (base) cuboid: time, item, location, supplier
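The lattice of cuboids is simply every subset of the dimension set, grouped by size. A short sketch using the standard itertools module enumerates it; the function name `cuboids` is an assumption of this example.

```python
from itertools import combinations

dimensions = ["time", "item", "location", "supplier"]


def cuboids(dims):
    """Map k to the list of k-D cuboids, each given as a tuple of dimension names."""
    return {k: list(combinations(dims, k)) for k in range(len(dims) + 1)}


lattice = cuboids(dimensions)
for k, group in lattice.items():
    print(f"{k}-D cuboids: {len(group)}")

total = sum(len(group) for group in lattice.values())
```

For four dimensions this gives 1, 4, 6, 4, and 1 cuboids at levels 0-D through 4-D, or 2^4 = 16 in total. The count doubles with every added dimension, which is why warehouses usually materialize only a subset of the cuboids.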
Multi-dimensional Data
Dimensions: Product, Region, Time
Hierarchies:
Product: Industry > Category > Product
Region: Region > Country > City > Office
Time: Year > Quarter > Month or Week > Day
[Figure: sample data cube. Country axis: U.S.A, Canada, Mexico, sum. Product axis: TV, PC, VCR, sum. Quarter axis: 1Qtr, 2Qtr, sum]
Multidimensional Analysis
Cube: common term for the representation of multidimensional information
3-D Cube
Fact table view:
sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4

Multi-dimensional cube:
day 1
      c1   c2   c3
p1    12        50
p2    11   8
day 2
      c1   c2   c3
p1    44   4
p2

dimensions = 3
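The fact table and the cube hold the same information: each fact row becomes one cell addressed by (prodId, storeId, date). A minimal sketch, using the six rows of the sale table, builds the cube as a dictionary and rolls it up by summing over dropped dimensions. The `roll_up` helper is an assumption of this example, not a standard API.

```python
from collections import defaultdict

# Rows of the sale fact table: (prodId, storeId, date, amt)
sale = [
    ("p1", "c1", 1, 12),
    ("p2", "c1", 1, 11),
    ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),
    ("p1", "c1", 2, 44),
    ("p1", "c2", 2, 4),
]

# 3-D cube: one cell per (prodId, storeId, date) combination that has a fact.
cube = {(p, s, d): amt for p, s, d, amt in sale}


def roll_up(rows, keep):
    """Sum amt over all dimensions not listed in keep (indices 0=prodId, 1=storeId, 2=date)."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[i] for i in keep)
        totals[key] += row[3]
    return dict(totals)


by_product = roll_up(sale, keep=(0,))
by_product_store = roll_up(sale, keep=(0, 1))
grand_total = roll_up(sale, keep=())
```

Keeping only prodId gives the 1-D cuboid (p1: 110, p2: 19), and keeping no dimension gives the apex value 129. Empty cells, such as p2 on day 2, are simply absent from the dictionary.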
Star Schema
Creates non-normalized data structures
Easier for users to understand
Optimized for OLAP
Uses fact tables (facts or measures in the business) and dimension tables (which establish the context of the facts)
Star Schema
A single fact table and, for each dimension, one dimension table
Does not capture hierarchies directly
[Figure: star schema. A central fact table (date, custno, prodno, cityname, ...) linked to the dimension tables time, prod, cust, and city]
Star
product
prodId  name  price
p1      bolt  10
p2      nut   5

store
storeId
c1
c2
c3

sale
custId  prodId  storeId  qty  amt
53      p1      c1       1    12
53      p2      c1       2    11
111     p1      c3       5    50

customer
custId
53
81
111
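A typical query against a star schema joins the fact table to one or more dimension tables and aggregates the measures. The sketch below loads the product and sale rows shown above into an in-memory SQLite database; the column types and the choice of SQLite are assumptions made so that the example is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
CREATE TABLE sale (custId INTEGER,
                   prodId TEXT REFERENCES product(prodId),
                   storeId TEXT, qty INTEGER, amt INTEGER);
INSERT INTO product VALUES ('p1', 'bolt', 10), ('p2', 'nut', 5);
INSERT INTO sale VALUES (53, 'p1', 'c1', 1, 12),
                        (53, 'p2', 'c1', 2, 11),
                        (111, 'p1', 'c3', 5, 50);
""")

# Join the fact table to the product dimension and aggregate the measures.
rows = conn.execute("""
    SELECT p.name, SUM(s.qty) AS total_qty, SUM(s.amt) AS total_amt
    FROM sale AS s
    JOIN product AS p ON p.prodId = s.prodId
    GROUP BY p.name
    ORDER BY p.name
""").fetchall()
```

The result has one row per product name: bolt with quantity 6 and amount 62, and nut with quantity 2 and amount 11. Adding the store or customer dimension follows the same pattern, with one more join per dimension.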
Sales Fact Table: time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, province_or_state, country
Snowflake schema
Represent dimensional hierarchy directly by normalizing tables. Easy to maintain and saves storage
[Figure: snowflake schema. A central fact table (date, custno, prodno, cityname, ...) linked to the dimension tables time, prod, cust, and city, with city further normalized into region]
Sales Fact Table: time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales
item: item_key, item_name, brand, type, supplier_key
supplier: supplier_key, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city_key
city
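Location hierarchy (region > country > city > office):
region: Europe, ..., North_America
country: Germany, ..., Spain (Europe); Canada, ..., Mexico (North_America)
city: Frankfurt, ... (Germany); Vancouver, ..., Toronto (Canada)
office: L. Chan, ..., M. Wind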
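A concept hierarchy like the one above is what makes roll-up possible: each member is mapped to its parent at the next level. The sketch below encodes the city, country, and region links shown in the hierarchy as a child-to-parent dictionary. The `ancestor` helper and the level list are assumptions of this example.

```python
# Child -> parent links taken from the location hierarchy above.
PARENT = {
    "Frankfurt": "Germany",
    "Vancouver": "Canada",
    "Toronto": "Canada",
    "Germany": "Europe",
    "Spain": "Europe",
    "Canada": "North_America",
    "Mexico": "North_America",
}

LEVELS = ["city", "country", "region"]  # ordered from most to least detailed


def ancestor(member: str, from_level: str, to_level: str) -> str:
    """Climb the hierarchy from from_level up to to_level."""
    steps = LEVELS.index(to_level) - LEVELS.index(from_level)
    if steps < 0:
        raise ValueError("can only roll up to a coarser level")
    for _ in range(steps):
        member = PARENT[member]
    return member
```

Rolling a sales figure up from city to region then amounts to grouping the facts by `ancestor(city, "city", "region")` instead of by city.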
Sales Fact Table: time_key, item_key, branch_key
Second fact table: item_key, shipper_key, from_location
item: item_key, item_name, brand, type, supplier_type
location: location_key, street, city, province_or_state, country
branch: branch_key, branch_name, branch_type
What Is OLAP?
Online Analytical Processing: term coined by E. F. Codd in a 1993 paper contracted by Arbor Software
Generally synonymous with earlier terms such as Decision Support, Business Intelligence, Executive Information System
OLAP = Multidimensional Database
MOLAP: Multidimensional OLAP (Arbor Essbase, Oracle Express)
ROLAP: Relational OLAP (Informix MetaCube, MicroStrategy DSS Agent)
[Figure: sample cube. Product axis: Household, Telecomm, Video, Audio. Region axis: Europe, Far East, India. Sales Channel axis: Retail, Direct, Special. Remaining axis: Date]
Data source view exposes the information being captured, stored, and managed by operational systems
Data warehouse view consists of fact tables and dimension tables
ROLAP
Relational OLAP
Uses an RDBMS to implement an OLAP environment
Typically involves a star schema to provide the multidimensional capabilities
OLAP tool manipulates RDBMS star schema data
Called "SLOWLAP" by MOLAP vendors
MOLAP
Multidimensional OLAP
Uses an MDDBS (e.g., Essbase) to store and access data
Usually requires proprietary (non-SQL) data access tools
Provides exceptionally fast response times
Data Mart
A data mart stores data for a limited number of subject areas, such as marketing and sales data. It is used to support specific applications. An independent data mart is created directly from source systems. A dependent data mart is populated from a data warehouse.
Arrayed
Data Warehouse
More
Reporting Tools
Andyne Computing -- GQL
Brio -- BrioQuery
Business Objects -- Business Objects
Cognos -- Impromptu
Information Builders Inc. -- Focus for Windows
Oracle -- Discoverer2000
Platinum Technology -- SQL*Assist, ProReports
PowerSoft -- InfoMaker
SAS Institute -- SAS/Assist
Software AG -- Esperant
Sterling Software -- VISION:Data
Sybase
Adaptive Server 11.5
Sybase MPP
Sybase IQ
Teradata