

PRESENTED BY:

1. DHANASHRI CHINNAPPA

2. SAGAR PATEL

3. JASON GONSALVES

4. ANIKET PARAB

5. KETKI RAJE

6. GANESH PATIL

INDEX

Sr. No.  CONTENTS

1. Introduction
2. Data Warehouse Quality Model
3. Data Warehouse Querying & Loading
4. Tools for Data Warehouse Quality
5. Commonly Found Data Quality Issues
6. Conclusion

INTRODUCTION

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates analysis workload from transaction workload and enables an organization to consolidate data from several sources.

In addition to a relational database, a data warehouse environment includes an extraction, transportation, transformation, and loading (ETL) solution, an online analytical processing (OLAP) engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users.

 DATA WAREHOUSE QUALITY MODEL:

Because data warehouses (DWs) play a central role in strategic decision making, data warehouse quality is crucial for organizations. Therefore, we should use methods, models, techniques, and tools that help us design and maintain high-quality DWs.

In recent years, there have been several approaches to designing DWs from the conceptual, logical, and physical perspectives. However, from our point of view, none of them provides a set of empirically validated metrics (objective indicators) to help the designer accomplish an outstanding model that guarantees the quality of the DW.

1. Involve users: Data quality is ultimately a business problem, so people in the business must be involved. People frequently enter the data being used, so they are the first line of defense. People are also the final consumers in most cases and provide the last line of defense.

2. Monitor processes: Bad data actually might have been accurate at one time but has since decayed. For example, prospect lists get outdated. The more outdated the information, the more time and money is wasted trying to sell goods or services to the wrong people. Business processes can ensure timely and accurate updates to data. Streamlining processes where possible can reduce the number of hands touching data, thereby reducing the chances of manual data corruption.

3. Use Oracle Warehouse Builder: In addition to offering database design and extract, transform, and load (ETL) features, it includes the ability to profile, cleanse, and audit data, based on data rules. This technology provides an umbrella over the data warehouse, using predefined rules to catch critical mistakes before they make their way into the decision-making process.
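The same kind of rule checking can be sketched outside any particular tool. The following is a minimal Transact-SQL sketch, not Oracle Warehouse Builder syntax, assuming a hypothetical staging table stg_Customer (with CustomerID, Email, BirthDate, and Country columns) and a hypothetical reference table ref_Country: rows that violate a predefined rule are set aside for review before they can reach the decision-making process.

-- Hypothetical data rules applied to a staging table before loading.
-- Violating rows are copied to a reject table instead of being
-- propagated into the warehouse.
SELECT  CustomerID,
        Email,
        BirthDate,
        Country,
        CASE
            WHEN Email NOT LIKE '%_@_%._%'           THEN 'Invalid email format'
            WHEN BirthDate > GETDATE()               THEN 'Birth date in the future'
            WHEN Country NOT IN (SELECT CountryCode
                                 FROM   ref_Country) THEN 'Unknown country code'
        END AS RuleViolation
INTO    stg_Customer_Rejected
FROM    stg_Customer
WHERE   Email NOT LIKE '%_@_%._%'
   OR   BirthDate > GETDATE()
   OR   Country NOT IN (SELECT CountryCode FROM ref_Country);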

 DATA WAREHOUSE QUERYING AND LOADING:

QUERYING IN A DATA WAREHOUSE:

The Query Model is used to graphically represent a single business query.

The diagram has four components:

• The topic or focus of the query,

• The dimensions of the query,

• The lines that link the dimensions to the topics,

• The level of the dimensions used in the query.

Query Model:

The diagram shows an example of a query model. The query model identifies the focus of the business query and the reference data needed to describe the query. The isolation of these components helps define potential facts and dimensions.

The "Monthly Sales by Market by Product" example identifies Sales as the subject area, and Customer, Product, and Time as the dimensions of the query. The query model helps identify the fact and dimension tables for a subject area.
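To make the mapping from query model to SQL concrete, here is a minimal sketch of the "Monthly Sales by Market by Product" query as a star query. The table and column names (FactSales, DimDate, DimCustomer, DimProduct, and their keys) are assumptions for illustration, not part of the original model.

-- Star query sketch: monthly sales by market (customer region) by product.
SELECT  d.CalendarYear,
        d.CalendarMonth,
        c.Region            AS Market,
        p.ProductName,
        SUM(f.SalesAmount)  AS TotalSales
FROM    FactSales   AS f
JOIN    DimDate     AS d ON f.DateKey     = d.DateKey
JOIN    DimCustomer AS c ON f.CustomerKey = c.CustomerKey
JOIN    DimProduct  AS p ON f.ProductKey  = p.ProductKey
GROUP BY d.CalendarYear, d.CalendarMonth, c.Region, p.ProductName
ORDER BY d.CalendarYear, d.CalendarMonth, c.Region, p.ProductName;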

The Fact Table:

The blueprint design aims to anticipate the typical star query shape
and builds indexes over the fact table. The clustered index of the fact
table uses several dimension surrogate key columns (the foreign key
columns) as index keys. The most frequently used columns should occur
in the list of index keys. You may want to take the time to verify that
this indeed provides a good access path for the most frequently
executed queries in your workload.
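A minimal Transact-SQL sketch of this blueprint follows, using the same assumed FactSales table: the clustered index is built over the dimension surrogate keys, with the most frequently used columns listed first. Column names and types are illustrative only.

-- Hypothetical fact table; the surrogate keys reference the dimension tables.
CREATE TABLE FactSales
(
    DateKey      INT           NOT NULL,  -- FK to DimDate
    CustomerKey  INT           NOT NULL,  -- FK to DimCustomer
    ProductKey   INT           NOT NULL,  -- FK to DimProduct
    SalesAmount  DECIMAL(18,2) NOT NULL,
    Quantity     INT           NOT NULL
);

-- Clustered index over the dimension surrogate keys used most often
-- in star queries; the column order should follow your workload.
CREATE CLUSTERED INDEX CIX_FactSales
    ON FactSales (DateKey, CustomerKey, ProductKey);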

Dimension Tables:

When you apply the blueprint design to dimension tables, you need to create indexes for each dimension table. These include a non-clustered primary key constraint index on the surrogate key column of the dimension and a clustered index over the columns of the business key of the dimension entity. For large dimension tables, you should also consider adding non-clustered indexes over columns that are frequently used in highly selective predicates.
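A matching sketch for a dimension table, again with assumed names and columns: a non-clustered primary key constraint on the surrogate key, a clustered index on the business key, and an optional non-clustered index on a column used in selective predicates.

-- Hypothetical customer dimension.
CREATE TABLE DimCustomer
(
    CustomerKey   INT           NOT NULL,  -- surrogate key
    CustomerCode  NVARCHAR(20)  NOT NULL,  -- business (natural) key
    CustomerName  NVARCHAR(100) NOT NULL,
    Region        NVARCHAR(50)  NOT NULL,
    PostalCode    NVARCHAR(20)  NULL,
    -- Non-clustered primary key constraint on the surrogate key column.
    CONSTRAINT PK_DimCustomer PRIMARY KEY NONCLUSTERED (CustomerKey)
);

-- Clustered index over the columns of the business key.
CREATE CLUSTERED INDEX CIX_DimCustomer_BusinessKey
    ON DimCustomer (CustomerCode);

-- Optional non-clustered index on a frequently filtered column.
CREATE NONCLUSTERED INDEX IX_DimCustomer_Region
    ON DimCustomer (Region);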

LOADING IN A DATA WAREHOUSE:

After the data has been cleansed and transformed into a structure consistent with the data warehouse requirements, the data is ready for loading into the data warehouse. You may make some final transformations during the loading operation, although you should complete any transformations that could identify inconsistencies before the final loading operation.

The initial load of the data warehouse consists of populating the tables in the data warehouse schema and then verifying that the data is ready for use. You can use various methods to load the data warehouse tables, such as:

• Transact-SQL
• DTS

• BCP utility
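As a sketch of the Transact-SQL option, assuming a hypothetical staging table stg_Sales whose rows already carry the resolved dimension surrogate keys, the initial load can be a simple set-based insert; DTS packages or the BCP utility would move the same data through packaged transformations or bulk copy instead.

-- Initial load of the fact table from a prepared staging table.
INSERT INTO FactSales (DateKey, CustomerKey, ProductKey, SalesAmount, Quantity)
SELECT  s.DateKey,
        s.CustomerKey,
        s.ProductKey,
        s.SalesAmount,
        s.Quantity
FROM    stg_Sales AS s;

-- Rough BCP equivalent from a command prompt (file and server names assumed):
-- bcp MyDW.dbo.FactSales in FactSales.dat -S <server> -T -c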

When you load data into the data warehouse, you are populating the
tables that will be used by the presentation applications that make the
data available to users. Loading data often involves the transfer of
large amounts of data from source operational systems, a data
preparation area database, or preparation area tables in the data
warehouse database. Such operations can impose significant processing
loads on the databases involved and should be accomplished during a
period of relatively low system use.

After the data has been loaded into the data warehouse database,
verify the referential integrity between dimension and fact tables to
ensure that all records relate to appropriate records in other tables.
You should verify that every record in a fact table relates to a record
in each dimension table that will be used with that fact table.
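A minimal verification query under the same assumed schema: it looks for fact rows whose customer surrogate key has no matching row in the dimension, and should return nothing after a correct load. The same pattern can be repeated for each dimension table.

-- Referential integrity check: fact rows without a matching customer row.
SELECT  f.DateKey, f.CustomerKey, f.ProductKey
FROM    FactSales AS f
LEFT JOIN DimCustomer AS c
       ON f.CustomerKey = c.CustomerKey
WHERE   c.CustomerKey IS NULL;   -- expected result: zero rows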

 TOOLS FOR DATA WAREHOUSE QUALITY:

Data quality tools generally fall into one of three categories: EXTRACT, LOAD & TRANSFER. The focus of this paper is on tools that clean and audit data, with a limited look at tools that extract and migrate data.

Extract:

Data auditing tools enhance the accuracy and correctness of the data
at the source. These tools generally compare the data in the source
database to a set of business rules.

When using a source external to the organization, business rules can be determined by using data mining techniques to uncover patterns in the data. Business rules that are internal to the organization should be entered in the early stages of evaluating data sources. Lexical analysis may be used to discover the business sense of words within the data. The data that does not adhere to the business rules could then be modified as necessary.

Loading: Data cleansing tools are used in the intermediate staging area. The tools in this category have been around for a number of years. A data cleansing tool cleans names, addresses, and other data that can be compared to an independent source. These tools are responsible for parsing, standardizing, and verifying data against known lists such as U.S. Postal Codes. The data cleansing tools contain features which perform the following functions:

• Data parsing (elementizing) - breaks a record into atomic units that can be used in subsequent steps. Parsing includes placing elements of a record into the correct fields.

• Data standardization - converts the data elements to forms that are standard throughout the data warehouse.

• Data correction and verification - matches data against known lists, such as U.S. Postal Codes, product lists, and internal customer lists.

• Record matching - determines whether two records represent data on the same subject.

• Data transformation - ensures consistent mapping between source systems and the data warehouse.

• Householding - combining individual records that have the same address (see the sketch after this list).

• Documenting - documenting the results of the data cleansing steps in the metadata.
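To make the standardization and householding functions concrete, here is a rough Transact-SQL sketch over a hypothetical Customer table with Address1, City, PostalCode, and CustomerID columns; a real cleansing tool would use far richer parsing rules and reference lists.

-- Standardization: map common street-suffix variants to one standard form.
UPDATE  Customer
SET     Address1 = REPLACE(REPLACE(Address1, ' Street', ' St'),
                           ' Avenue', ' Ave');

-- Householding: group individual records that share the same
-- standardized address so they can be treated as one household.
SELECT  Address1,
        City,
        PostalCode,
        COUNT(*)        AS PersonsInHousehold,
        MIN(CustomerID) AS HouseholdRepresentative
FROM    Customer
GROUP BY Address1, City, PostalCode
HAVING  COUNT(*) > 1;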

Transfer:

The third type of tool, the data migration tool, is used to extract data from a source database and migrate the data into an intermediate storage area. The migration tools also transfer data from the staging area into the data warehouse. The data migration tool is responsible for converting the data from one platform to another. A migration tool will map the data from the source to the data warehouse. It can also check for Y2K compliance and perform other simple cleansing activities. There can be a great deal of overlap in these tools, and many of the same features are found in tools of each category.

 COMMONLY FOUND DATA QUALITY ISSUES

A company’s database is its most important asset. It is a collection of information on customers, suppliers, partners, employees, products, inventory, locations, and more. This data is the foundation on which your business operations and decisions are made; it is used in everything from booking sales and analyzing summary reports to managing inventory, generating invoices, and forecasting. To be of greatest value, this data needs to be up-to-date, relevant, consistent, and accurate; only then can it be managed effectively and aggressively to create strategic advantage.

A customer of a hotel and casino makes a reservation to stay at the property using his full name, Johnathan Smith. So, as part of its customer loyalty-building initiative, the hotel’s marketing department sends him an email with a free night’s stay promotion, believing he is a new customer, unaware that the customer is already listed under the hotel’s casino/gaming department as a VIP client under a similar name, John Smith.

The hotel did not have a data quality process in place to standardize,
clean and merge duplicate records to provide a complete view of the
customer. As a result, the hotel was not able to leverage the true value
of its data in delivering relevant marketing to a high value customer.

STEPS TO TACKLE THE PROBLEM:

1. Profiling:

As the first line of defense for your data integration solution, profiling data helps you examine whether your existing data sources meet the quality standards of your solution. Properly profiling your data saves execution time because you identify issues that require immediate attention from the start and avoid the unnecessary processing of unacceptable data sources. Data profiling becomes even more critical when working with raw data sources that do not have referential integrity or quality controls.
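A minimal profiling sketch in Transact-SQL, assuming a hypothetical staging table stg_Customer with Email, Phone, CustomerCode, and BirthDate columns: row counts, null counts, distinct business keys, and obviously out-of-range dates give a quick picture of whether the source meets basic quality standards.

-- Simple column profile of a staging table before it is accepted for loading.
SELECT  COUNT(*)                                                AS TotalRows,
        SUM(CASE WHEN Email IS NULL THEN 1 ELSE 0 END)          AS NullEmails,
        SUM(CASE WHEN Phone IS NULL THEN 1 ELSE 0 END)          AS NullPhones,
        COUNT(DISTINCT CustomerCode)                            AS DistinctBusinessKeys,
        SUM(CASE WHEN BirthDate < '1900-01-01'
                   OR BirthDate > GETDATE() THEN 1 ELSE 0 END)  AS SuspiciousBirthDates
FROM    stg_Customer;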

2. Cleansing:

After a data set successfully meets profiling standards, it still requires data cleansing and de-duplication to ensure that all business rules are properly met. Successful data cleansing requires the use of flexible, efficient techniques capable of handling complex quality issues hidden in the depths of large data sets. Data cleansing corrects errors and standardizes information that can ultimately be leveraged for MDM applications.
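A rough de-duplication sketch in Transact-SQL under the same assumptions (a stg_Customer table with CustomerID, CustomerName, PostalCode, and LastUpdated columns): records are grouped on a normalized name plus postal code, and only the most recently updated record in each group is kept.

-- De-duplication: keep the newest record per normalized name + postal code.
WITH Ranked AS
(
    SELECT  CustomerID,
            ROW_NUMBER() OVER (
                PARTITION BY UPPER(LTRIM(RTRIM(CustomerName))), PostalCode
                ORDER BY     LastUpdated DESC) AS RowRank
    FROM    stg_Customer
)
DELETE  s
FROM    stg_Customer AS s
JOIN    Ranked       AS r ON s.CustomerID = r.CustomerID
WHERE   r.RowRank > 1;   -- older duplicates are removed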

3. Parsing and Standardization:

This technique parses and restructures data into a common format to help build more consistent data. For instance, the process can standardize addresses to a desired format, or to USPS® specifications, which are needed to enable CASS Certified™ processing. This phase is designed to identify, correct and standardize patterns of data across various data sets, including tables, columns, rows, etc.
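A very small illustration of parsing and standardization in Transact-SQL, with assumed AddressLine and StateName columns: a 'number street' address is split into atomic elements and the state name is standardized to a two-letter code. This is only a sketch of the idea; CASS Certified processing requires a dedicated tool.

-- Parse a simple 'number street' address and standardize the state name.
SELECT  AddressLine,
        LEFT(AddressLine, CHARINDEX(' ', AddressLine) - 1)  AS HouseNumber,
        SUBSTRING(AddressLine,
                  CHARINDEX(' ', AddressLine) + 1,
                  LEN(AddressLine))                          AS StreetName,
        CASE StateName
            WHEN 'California' THEN 'CA'
            WHEN 'New York'   THEN 'NY'
            ELSE StateName
        END                                                  AS StateCode
FROM    stg_Customer
WHERE   CHARINDEX(' ', AddressLine) > 0;   -- skip values that cannot be split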

4. Matching:

Data matching consolidates data records into identical groups and links/merges related records within or across data sets. This process locates matches in any combination of over 35 different components, from common ones like address, city, state, ZIP®, name, and phone.
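A minimal matching sketch, assuming two hypothetical sources named HotelReservation and CasinoLoyalty: records are linked on a combination of normalized last name, postal code, and phone number rather than on the full name alone, which is how 'Johnathan Smith' and 'John Smith' from the earlier example could be brought together.

-- Match records across two data sets on a combination of components.
SELECT  r.ReservationID,
        l.LoyaltyID,
        r.FullName AS ReservationName,
        l.FullName AS LoyaltyName
FROM    HotelReservation AS r
JOIN    CasinoLoyalty    AS l
  ON    UPPER(r.LastName)         = UPPER(l.LastName)
 AND    r.PostalCode              = l.PostalCode
 AND    REPLACE(r.Phone, '-', '') = REPLACE(l.Phone, '-', '');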

5. Enrichment:

Data enrichment enhances the value of customer data by attaching additional pieces of data from other sources, including geocoding, demographic data, full-name parsing and genderizing, phone number verification, and email validation. The process provides a better understanding of your customer data because it reveals buyer behavior and loyalty potential.
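An enrichment sketch under the same assumptions: customer records are extended with demographic attributes from a hypothetical external reference table, ref_Demographics, keyed by postal code.

-- Enrichment: attach demographic data from an external reference table.
SELECT  c.CustomerKey,
        c.CustomerName,
        c.PostalCode,
        d.MedianIncome,
        d.UrbanRuralFlag
FROM    DimCustomer        AS c
LEFT JOIN ref_Demographics AS d
       ON c.PostalCode = d.PostalCode;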

6. Monitoring:

This real-time monitoring phase puts automated processes into place to detect when data exceeds pre-set limits. Data monitoring is designed to help organizations immediately recognize and correct issues before the quality of data declines. This approach also empowers businesses to enforce data governance and compliance measures.
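A simple monitoring sketch in Transact-SQL, with an assumed pre-set limit: after each load the e-mail null rate in the staging table is compared against a threshold (5 percent here, purely illustrative) and an error is raised when it is exceeded, which an agent job or alert could then pick up.

-- Monitoring: raise an error when a quality measure exceeds a pre-set limit.
DECLARE @NullRate DECIMAL(5,2), @Msg NVARCHAR(200);

SELECT  @NullRate = 100.0 *
        SUM(CASE WHEN Email IS NULL THEN 1 ELSE 0 END) / COUNT(*)
FROM    stg_Customer;                -- assumes the table is not empty

IF @NullRate > 5.0
BEGIN
    SET @Msg = 'Data quality limit exceeded: '
             + CAST(@NullRate AS NVARCHAR(10))
             + '% of customer e-mails are NULL.';
    RAISERROR (@Msg, 16, 1);
END;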

CONCLUSION

Data warehousing is not a new phenomenon. All large organizations already have data warehouses, but they are just not managing them. Over the next few years, the growth of data warehousing is going to be enormous, with new products and technologies coming out frequently. In order to get the most out of this period, it is going to be important that data warehouse planners and developers have a clear idea of what they are looking for and then choose strategies and methods that will provide them with performance today and flexibility for tomorrow.

Information quality is not an esoteric notion; it directly affects the effectiveness and efficiency of business processes. Information quality also plays a major role in customer satisfaction. Poor data quality is costly. It lowers customer satisfaction, adds expense, and makes it more difficult to run a business and pursue tactical improvements such as data warehouses and re-engineering.

