ISQS 6339, Business Intelligence

Data Warehousing

Zhangxi Lin
Texas Tech University

Outline
So far students have learned:
  Basic concepts of business intelligence
  The definition and importance of a data warehouse
In this lecture, the following topics will be covered:
  SQL Server 2008 data mart case study
    How to access data in a network directory
    How to access SQL Server 2008 on the Citrix Server
    How to load data from an Excel file to a database
  Data warehouse overview
  Data warehouse architecture
  Data integration
ISQS 6347, Data & Text Mining

Data Warehousing Definitions and Concepts
Data warehouse
  Video: Overview of data warehouse (2'38")
  A physical repository where relational data are specially organized to provide enterprise-wide, cleansed data in a standardized format
  Benefits of data warehouse (3'18")

Data mart
Definition
  A localized data warehouse that stores only data relevant to a department or even an individual
Dependent data mart
  A subset that is created directly from a data warehouse
Independent data mart
  A small data warehouse designed for a strategic business unit or a department
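
The idea of a dependent data mart as a subset created directly from a warehouse can be sketched in a few lines. This is only an illustration using Python's standard sqlite3 module in place of SQL Server; the table names (warehouse_sales, marketing_mart), the columns, and the sample rows are all hypothetical.

```python
import sqlite3

# In-memory database standing in for the enterprise data warehouse
# (hypothetical schema and sample data, for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE warehouse_sales (dept TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO warehouse_sales VALUES (?, ?, ?)",
    [
        ("marketing", "TX", 120.0),
        ("marketing", "NM", 80.0),
        ("finance", "TX", 300.0),
    ],
)

# A dependent data mart: a subset created directly from the warehouse table,
# keeping only the rows relevant to one department.
conn.execute(
    "CREATE TABLE marketing_mart AS "
    "SELECT region, amount FROM warehouse_sales WHERE dept = 'marketing'"
)

mart_rows = conn.execute(
    "SELECT region, amount FROM marketing_mart ORDER BY region"
).fetchall()
print(mart_rows)
```

An independent data mart would instead be loaded straight from its own sources, with no warehouse table to select from.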

Data Mart - The IMW Case
IMW, which stands for Internet Media Works!, is an application service provider (ASP) for real estate information services. It is headquartered in Austin, Texas. The CEO is Gary Anderson.
Web page: http://www.inetworks.com

About IMW
Based in Austin, Texas, IMW (Internet Media Works!) is an ASP that specializes mainly in web-based application development, database integration, and web development and hosting for all kinds of businesses.
IMW has been more successful in selling its e-business services for commercial real estate. Its services include lead generation, real estate transaction management, property listing, realtor membership management, real estate indices, real estate auctions, etc., with COMMREX as a complete e-business solution.
IMW used to have up to 6 full-time employees and a few part-time employees.

IMW's Services
[Figure: IMW's Web-Based Application Services. Components shown: Public User Application Services; Public User Support; Website Hosting Services; Optional Website Hosting Services; Core Membership Database Services; Optional Membership Database Services; Core Property Listing Database Services; Optional Property Listing Database Services; Networking and System Operation Services; Internet Service Providers Services]

Why Do We Need a Data Mart?
A data mart complements centralized data warehousing based on the UDM (Unified Dimensional Model) in situations where the UDM cannot be used:
  Legacy databases
  Data come from non-database sources
  No physical connection to the centralized data warehouse
  Data are not clean

Data Mart Structures
Fact tables
  Measures
Dimension tables
  Dimensions and hierarchies
  Attributes (or columns)
Dimensional modeling: stars and snowflakes

Measures
A measure is a numeric quantity expressing some aspect of the organization's performance. The information represented by this quantity is used to support or evaluate the decision making and performance of the organization.
A measure is also called a fact.
The table holding measure information is called a fact table.
Dimensions vs. Measures (2'38")
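
The split between a fact table (holding measures) and a dimension table (holding descriptive attributes) can be illustrated with a small sketch. It uses Python's standard sqlite3 module rather than SQL Server, and the table names, column names, and the listing_fee measure are hypothetical choices made only for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One dimension table and one fact table (hypothetical names and columns).
conn.executescript("""
CREATE TABLE dim_property_type (
    prop_type INTEGER PRIMARY KEY,
    type_name TEXT
);
CREATE TABLE fact_listing (
    property_id INTEGER PRIMARY KEY,
    prop_type   INTEGER REFERENCES dim_property_type(prop_type),
    listing_fee REAL  -- the measure: a numeric quantity to aggregate
);
""")
conn.executemany(
    "INSERT INTO dim_property_type VALUES (?, ?)",
    [(1, "Office"), (2, "Retail")],
)
conn.executemany(
    "INSERT INTO fact_listing VALUES (?, ?, ?)",
    [(101, 1, 50.0), (102, 1, 75.0), (103, 2, 40.0)],
)

# Aggregate the measure, sliced by an attribute of the dimension.
fee_by_type = conn.execute("""
    SELECT d.type_name, COUNT(*) AS listings, SUM(f.listing_fee) AS total_fee
    FROM fact_listing AS f
    JOIN dim_property_type AS d ON d.prop_type = f.prop_type
    GROUP BY d.type_name
    ORDER BY d.type_name
""").fetchall()
print(fee_by_type)
```

The fact table stays narrow (keys plus numbers), while the text used for grouping and labeling lives in the dimension.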

Commrex Real Estate Operational Database
Users: property listors, webmaster, marketing manager of IMW
Objective: Encourage realtors to use the online ASP services with the best information services to increase IMW's revenue.
Value Chain
  Listors create their accounts
  Listors post their real estate properties to the web-based database services and pay listing fees
  Property buyers search the website-based database and buy properties from listors. This is the incentive for listors to use the ASP services
Business Processes
  Listor sign-up
  Listor account management
  Property data posting
  Property search
  Property database maintenance

IMW's Database ERD Model
[Figure: ERD covering the Property Listing Database and the Membership Database. Labels recoverable from the diagram: TransactionID, UserID, PropID, Property ID, Listor ID, Property Type, Type Name, Subtype 1, Subtype 2, ..., Subtype n, Address, City, UpdateDate, Feature, Listor Name, Company ID, Comp Name, Chapter, Functions, Specializations, Telephone #. Relationships are marked M:1 and M:M. Legend: Primary Key, Secondary Key, Link to a table.]

ISQS 6339, Data Mgmt & BI, Zhangxi Lin

Commrex Data Warehousing
Users: CEO of IMW, IMW business analyst, IMW marketing manager
Analytic themes
  Fast retrieval of business key performance indicators (KPIs)
  Decision making on business promotions
Applications
  Geographic distribution of property listings
  Scorecard for main performance indicators
  Dashboard
Questions
  How should the data warehouse be modeled?
  What is required in data transformation and preprocessing?
  Is any dimension missing for data warehousing?
  How should routine data warehouse updates be performed (frequency, timing, etc.)?

IMW's Data Warehouse Dimensional Model
[Figure: dimensional model showing a Property Listing Fact table, a Property Type Dimension, a Membership Dimension, and a Company Dimension. Labels recoverable from the diagram: Property ID, Listor ID, PropType, SubName, Address, City, UpdateDate, Features, Year, Quarter, Month, Date, Listor Name, Company ID, Comp Name, Chapter, Specializations, Functions, Telephone #. Legend: Primary Key, Secondary Key, Link to a table.]
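
The dimensional model lists Year, Quarter, Month, and Date among its labels. Assuming these form a time hierarchy derived from a date such as UpdateDate (an assumption, since the diagram itself is not recoverable), the levels could be computed as in the following Python sketch. The function name date_dimension_row is hypothetical.

```python
from datetime import date


def date_dimension_row(d: date) -> dict:
    """Derive the Year/Quarter/Month/Date hierarchy levels for one calendar date."""
    return {
        "Date": d.isoformat(),
        "Month": d.month,
        "Quarter": (d.month - 1) // 3 + 1,  # months 1-3 -> Q1, 4-6 -> Q2, ...
        "Year": d.year,
    }


# Example: one row of a time dimension for a hypothetical UpdateDate value.
row = date_dimension_row(date(2011, 8, 15))
print(row)
```

Precomputing these levels once per date lets fact rows be rolled up by month, quarter, or year without repeating date arithmetic in every query.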

Data Warehouse Overview

Data Warehousing Characteristics
Basic characteristics of data warehousing
  Subject oriented
  Integrated
  Time variant (time series)
  Nonvolatile (data are not changed once entered)
Others
  Web based
  Relational/multidimensional
  Client/server
  Real-time
  Include metadata

Data Warehousing Process Overview
Data in a DW are constantly accumulated.
  Organizations continuously collect data, information, and knowledge at an increasingly accelerated rate and store them in computerized systems
The number of users is constantly increasing.
  The number of users needing to access the information continues to increase as a result of improved reliability and availability of network access, especially the Internet
Organizations that use a data warehouse rely on it more and more

Data Warehousing More Concepts
Operational data store (ODS)
  A type of database often used as an interim area for a data warehouse, especially for customer information files
Enterprise data warehouse (EDW)
  A large-scale data warehouse used across the enterprise for decision support. It integrates different sources of information into a consolidated information system.
Metadata (Video, 1'41")
  Data about data. In a data warehouse, metadata describe the contents of a data warehouse and the manner of its use
  Syntactic metadata, structural metadata, and semantic metadata

Data Warehousing Process Overview

Data Warehousing Process Overview
The major components of a data warehousing process:
  Data sources
  Data extraction
  Data loading
  Comprehensive database
  Metadata
  Middleware tools

Data Warehouse Architectures

Three Parts of the Data Warehouse
The data warehouse that contains the data and associated software
Data acquisition (back-end) software that extracts data from legacy systems and external sources, consolidates and summarizes them, and loads them into the data warehouse
Client (front-end) software that allows users to access and analyze data from the warehouse

Three-Tier Data Warehouse

Alternative Data Warehouse Architectures (1)

Alternative Data Warehouse Architectures (2)

Alternative Data Warehouse Architectures (3)

Alternative Data Warehouse Architectures (4)

Alternative Data Warehouse Architectures (5)

Architectures Comparison

Teradata's EDW

Hadoop for BI in the Cloudera
Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment.
Hadoop makes it possible to run applications on systems with thousands of nodes involving thousands of terabytes.
Hadoop was inspired by Google's MapReduce, a software framework in which an application is broken down into numerous small parts. Doug Cutting, Hadoop's creator, named the framework after his child's stuffed toy elephant.

Apache Hadoop
The Apache Hadoop framework is composed of the following modules:
  Hadoop Common - contains libraries and utilities needed by other Hadoop modules
  Hadoop Distributed File System (HDFS) - a distributed file system that stores data across the machines of a cluster
  Hadoop YARN - a resource-management platform responsible for managing compute resources in clusters and using them for scheduling of users' applications
  Hadoop MapReduce - a programming model for large-scale data processing

MapReduce
MapReduce is a framework for processing parallelizable
problems across huge datasets using a large number of
computers (nodes), collectively referred to as a cluster
or a grid.
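
The map, shuffle, and reduce steps behind this framework can be imitated on a single machine. The sketch below is plain Python, not Hadoop code: it runs in one process, the function names (map_phase, shuffle, reduce_phase) are hypothetical, and it only illustrates how a problem is broken into independent parts whose results are then combined per key.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values collected for one key into a single result."""
    return key, sum(values)


# Each string stands in for an input split that a separate node could process.
splits = ["data warehouse data mart", "data mart"]
mapped = [pair for split in splits for pair in map_phase(split)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

Because each call to map_phase depends only on its own split, and each call to reduce_phase only on its own key, a cluster can run them on different nodes in parallel.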

How Hadoop Operates

Cloudera's Hadoop System

Hadoop 2: Big data's big leap forward
The new Hadoop is the Apache Software Foundation's attempt to create a whole new general framework for the way big data can be stored, mined, and processed.
The biggest constraint on scale has been Hadoop's job handling. In Hadoop 1, all jobs are run as batch processes through a single daemon called JobTracker, which creates a scalability and processing-speed bottleneck.
Hadoop 2 uses an entirely new job-processing framework built using two daemons: ResourceManager, which governs all jobs in the system, and NodeManager, which runs on each Hadoop node and keeps the ResourceManager informed about what's happening on that node.

MapReduce 2.0: YARN (Yet Another Resource Negotiator)

Teradata Big Data Platform

2013-12-02

Dell representation of the Hadoop ecosystem

Nokia's Big Data Architecture

Comparison between big data platform and traditional BI platform

Resolving the legacy problem: dual platform

Ten factors that potentially affect the architecture selection decision
1. Information interdependence between organizational units
2. Upper management's information needs
3. Urgency of need for a data warehouse
4. Nature of end-user tasks
5. Constraints on resources
6. Strategic view of the data warehouse prior to implementation
7. Compatibility with existing systems
8. Perceived ability of the in-house IT staff
9. Technical issues
10. Social/political factors

Data Integration

Data Integration
Data integration comprises three major processes:
  data access,
  data federation, and
  change capture.
When these three processes are correctly implemented, data can be accessed and made accessible to an array of ETL and analysis tools and data warehousing environments
ETL Tools (4'56")
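
Of the three processes, change capture is the easiest to show in miniature. The following Python sketch assumes one simple approach, comparing two snapshots of a source table keyed by primary key; real tools often read database logs or use triggers instead. The function name capture_changes and the sample rows are hypothetical.

```python
def capture_changes(previous, current):
    """Compare two snapshots keyed by primary key.

    Returns three dicts: rows inserted, rows updated (new values),
    and rows deleted (old values) since the previous snapshot.
    """
    inserted = {k: v for k, v in current.items() if k not in previous}
    deleted = {k: v for k, v in previous.items() if k not in current}
    updated = {
        k: v for k, v in current.items() if k in previous and previous[k] != v
    }
    return inserted, updated, deleted


# Hypothetical snapshots of a listing table: {property_id: (city, fee)}.
yesterday = {101: ("Austin", 50.0), 102: ("Lubbock", 75.0), 103: ("Dallas", 40.0)}
today = {101: ("Austin", 55.0), 102: ("Lubbock", 75.0), 104: ("Houston", 60.0)}

inserted, updated, deleted = capture_changes(yesterday, today)
print(inserted, updated, deleted)
```

Only the captured changes then need to be passed on to the warehouse, instead of reloading the whole source table.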

Data Integration
Enterprise application integration (EAI)
  A technology that provides a vehicle for pushing data from source systems into a data warehouse, including application functionality integration. Recently, service-oriented architecture (SOA) has been applied
Enterprise information integration (EII)
  An evolving tool space that promises real-time data integration from a variety of sources, such as relational databases, Web services, and multidimensional databases
Extraction, transformation, and load (ETL)
  A data warehousing process that consists of extraction (i.e., reading data from a database), transformation (i.e., converting the extracted data from its previous form into the form in which it needs to be so that it can be placed into a data warehouse or simply another database), and load (i.e., putting the data into the data warehouse)
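
The three ETL steps can be sketched end to end with the Python standard library. This is a toy illustration under stated assumptions, not how SSIS or any commercial ETL tool works: the source is an inline CSV string rather than a database, the target is an in-memory SQLite table, and the column names and cleaning rules are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical raw source data with untidy values.
RAW = """listor_id,city,listing_fee
1, austin ,50
2,Lubbock,75.5
3,,40
"""


def extract(text):
    """Extract: read rows from the source into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: clean values and convert them to the target types."""
    cleaned = []
    for row in rows:
        city = row["city"].strip().title()
        if not city:
            continue  # drop rows that fail a simple completeness check
        cleaned.append((int(row["listor_id"]), city, float(row["listing_fee"])))
    return cleaned


def load(conn, rows):
    """Load: put the transformed rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS listing "
        "(listor_id INTEGER, city TEXT, listing_fee REAL)"
    )
    conn.executemany("INSERT INTO listing VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW)))
loaded = conn.execute("SELECT * FROM listing ORDER BY listor_id").fetchall()
print(loaded)
```

Keeping the three steps as separate functions mirrors the definition above and makes each step testable on its own.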

Transformation Tools: To Purchase or to Build In-House
Issues that affect whether an organization will purchase data transformation tools or build the transformation process itself
  Data transformation tools are expensive
  Data transformation tools may have a long learning curve
  It is difficult to measure how the IT organization is doing until it has learned to use the data transformation tools
Important criteria in selecting an ETL tool
  Ability to read from and write to an unlimited number of data source architectures
  Automatic capturing and delivery of metadata
  A history of conforming to open standards
  An easy-to-use interface for the developer and the functional user

Open Source Software for Big Data
Oracle VM VirtualBox
Cloudera Hadoop - Get Started With Enterprise Hadoop
Hortonworks Data Platform - Hortonworks.com
Google Hadoop Solutions - google.com
Hadoop on Google Cloud Platform
Hadoop & NoSQL - MarkLogic.com

Structure and Components of Business Intelligence
[Figure: components shown are MS SQL Server 2008 (SSMS, SSIS, SSAS, SSRS, BIDS) and SAS (SAS EG, SAS EM)]

Exercise 1: Walk through the data warehousing process
Learning Objectives
  To gain a general impression of how to use SQL Server 2008 to implement a data mart
Tasks
  Create your database with SSMS, named ISQS6339_lastname
  Import data from Commrex_2011.xls
  Use SSMS to create an ERD
  Create an SSAS project using BIDS
  Define a data source, a data source view, and a cube
Deliverable:
  One-page printout of the screenshot of the cube diagram
