

PRESENTED BY:

1. DHANASHRI CHINNAPPA

2. SAGAR PATEL

3. JASON GONSALVES

4. ANIKET PARAB

5. KETKI RAJE

6. GANESH PATIL

INDEX

Sr. No.  CONTENTS

1. Introduction
2. Data Warehouse Quality Model
3. Data Warehouse Querying & Loading
4. Tools for Data Warehouse Quality
5. Commonly Found Data Quality Issues
6. Conclusion

INTRODUCTION

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates analysis workload from transaction workload and enables an organization to consolidate data from several sources.

In addition to a relational database, a data warehouse environment includes an extraction, transportation, transformation, and loading (ETL) solution, an online analytical processing (OLAP) engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users.

 DATA WAREHOUSE QUALITY MODEL:

Because data warehouses (DWs) play a central role in strategic decision making, data warehouse quality is crucial for organizations. Therefore, we should use methods, models, techniques, and tools that help us design and maintain high-quality DWs.

In recent years, there have been several approaches to designing DWs from the conceptual, logical, and physical perspectives. However, from our point of view, none of them provides a set of empirically validated metrics (objective indicators) to help the designer accomplish an outstanding model that guarantees the quality of the DW.

1. Involve users: Data quality is ultimately a business problem, so people in the business must be involved. People frequently enter the data being used, so they are the first line of defense. People are also the final consumers in most cases and provide the last line of defense.

2. Monitor processes: Bad data actually might have been accurate at one time but has since decayed. For example, prospect lists get outdated. The more outdated the information, the more time and money is wasted trying to sell goods or services to the wrong people. Business processes can ensure timely and accurate updates to data. Streamlining processes where possible can reduce the number of hands touching data, thereby reducing the chances of manual data corruption.

3. Use Oracle Warehouse Builder: In addition to offering database design and extract, transform, and load (ETL) features, it includes the ability to profile, cleanse, and audit data, based on data rules. This technology provides an umbrella over the data warehouse, using predefined rules to catch critical mistakes before they make their way into the decision-making process.
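The same kind of rule checking can be sketched outside any particular tool. The following is a minimal Transact-SQL sketch, not Oracle Warehouse Builder syntax, assuming a hypothetical staging table stg_Customer (with CustomerID, Email, BirthDate, and Country columns) and a hypothetical reference table ref_Country: rows that violate a predefined rule are set aside for review before they can reach the decision-making process.

-- Hypothetical data rules applied to a staging table before loading.
-- Violating rows are copied to a reject table instead of being
-- propagated into the warehouse.
SELECT  CustomerID,
        Email,
        BirthDate,
        Country,
        CASE
            WHEN Email NOT LIKE '%_@_%._%'           THEN 'Invalid email format'
            WHEN BirthDate > GETDATE()               THEN 'Birth date in the future'
            WHEN Country NOT IN (SELECT CountryCode
                                 FROM   ref_Country) THEN 'Unknown country code'
        END AS RuleViolation
INTO    stg_Customer_Rejected
FROM    stg_Customer
WHERE   Email NOT LIKE '%_@_%._%'
   OR   BirthDate > GETDATE()
   OR   Country NOT IN (SELECT CountryCode FROM ref_Country);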

 DATA WAREHOUSE QUERYING AND LOADING:

QUERYING IN A DATA WAREHOUSE:

The Query Model is used to graphically represent a single business query.

The diagram has four components:

• The topic or focus of the query,

• The dimensions of the query,

• The lines that link the dimensions to the topics,

• The level of the dimensions used in the query.

Query Model:

The diagram shows an example of a query model. The query model identifies the focus of the business query and the reference data needed to describe the query. The isolation of these components helps define potential facts and dimensions.

The "Monthly Sales by Market by Product" example identifies Sales as the subject area, and Customer, Product, and Time as the dimensions of the query. The query model helps identify the fact and dimension tables for a subject area.
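To make the mapping from query model to SQL concrete, here is a minimal sketch of the "Monthly Sales by Market by Product" query as a star query. The table and column names (FactSales, DimDate, DimCustomer, DimProduct, and their keys) are assumptions for illustration, not part of the original model.

-- Star query sketch: monthly sales by market (customer region) by product.
SELECT  d.CalendarYear,
        d.CalendarMonth,
        c.Region            AS Market,
        p.ProductName,
        SUM(f.SalesAmount)  AS TotalSales
FROM    FactSales   AS f
JOIN    DimDate     AS d ON f.DateKey     = d.DateKey
JOIN    DimCustomer AS c ON f.CustomerKey = c.CustomerKey
JOIN    DimProduct  AS p ON f.ProductKey  = p.ProductKey
GROUP BY d.CalendarYear, d.CalendarMonth, c.Region, p.ProductName
ORDER BY d.CalendarYear, d.CalendarMonth, c.Region, p.ProductName;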

The Fact Table:

The blueprint design aims to anticipate the typical star query shape
and builds indexes over the fact table. The clustered index of the fact
table uses several dimension surrogate key columns (the foreign key
columns) as index keys. The most frequently used columns should occur
in the list of index keys. You may want to take the time to verify that
this indeed provides a good access path for the most frequently
executed queries in your workload.
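A minimal Transact-SQL sketch of this blueprint follows, using the same assumed FactSales table: the clustered index is built over the dimension surrogate keys, with the most frequently used columns listed first. Column names and types are illustrative only.

-- Hypothetical fact table; the surrogate keys reference the dimension tables.
CREATE TABLE FactSales
(
    DateKey      INT           NOT NULL,  -- FK to DimDate
    CustomerKey  INT           NOT NULL,  -- FK to DimCustomer
    ProductKey   INT           NOT NULL,  -- FK to DimProduct
    SalesAmount  DECIMAL(18,2) NOT NULL,
    Quantity     INT           NOT NULL
);

-- Clustered index over the dimension surrogate keys used most often
-- in star queries; the column order should follow your workload.
CREATE CLUSTERED INDEX CIX_FactSales
    ON FactSales (DateKey, CustomerKey, ProductKey);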

Dimension Tables:

When you apply the blueprint design to dimension tables, you need to create indexes for each dimension table. These include a non-clustered primary key constraint index on the surrogate key column of the dimension and a clustered index over the columns of the business key of the dimension entity. For large dimension tables, you should also consider adding non-clustered indexes over columns that are frequently used in highly selective predicates.
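A matching sketch for a dimension table, again with assumed names and columns: a non-clustered primary key constraint on the surrogate key, a clustered index on the business key, and an optional non-clustered index on a column used in selective predicates.

-- Hypothetical customer dimension.
CREATE TABLE DimCustomer
(
    CustomerKey   INT           NOT NULL,  -- surrogate key
    CustomerCode  NVARCHAR(20)  NOT NULL,  -- business (natural) key
    CustomerName  NVARCHAR(100) NOT NULL,
    Region        NVARCHAR(50)  NOT NULL,
    PostalCode    NVARCHAR(20)  NULL,
    -- Non-clustered primary key constraint on the surrogate key column.
    CONSTRAINT PK_DimCustomer PRIMARY KEY NONCLUSTERED (CustomerKey)
);

-- Clustered index over the columns of the business key.
CREATE CLUSTERED INDEX CIX_DimCustomer_BusinessKey
    ON DimCustomer (CustomerCode);

-- Optional non-clustered index on a frequently filtered column.
CREATE NONCLUSTERED INDEX IX_DimCustomer_Region
    ON DimCustomer (Region);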

LOADING IN A DATA WAREHOUSE:

After the data has been cleansed and transformed into a structure consistent with the data warehouse requirements, the data is ready for loading into the data warehouse. You may make some final transformations during the loading operation, although you should complete any transformations that could identify inconsistencies before the final loading operation.

The initial load of the data warehouse consists of populating the tables in the data warehouse schema and then verifying that the data is ready for use. You can use various methods to load the data warehouse tables, such as:

• Transact-SQL
• DTS

• BCP utility
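As a sketch of the Transact-SQL option, assuming a hypothetical staging table stg_Sales whose rows already carry the resolved dimension surrogate keys, the initial load can be a simple set-based insert; DTS packages or the BCP utility would move the same data through packaged transformations or bulk copy instead.

-- Initial load of the fact table from a prepared staging table.
INSERT INTO FactSales (DateKey, CustomerKey, ProductKey, SalesAmount, Quantity)
SELECT  s.DateKey,
        s.CustomerKey,
        s.ProductKey,
        s.SalesAmount,
        s.Quantity
FROM    stg_Sales AS s;

-- Rough BCP equivalent from a command prompt (file and server names assumed):
-- bcp MyDW.dbo.FactSales in FactSales.dat -S <server> -T -c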

When you load data into the data warehouse, you are populating the
tables that will be used by the presentation applications that make the
data available to users. Loading data often involves the transfer of
large amounts of data from source operational systems, a data
preparation area database, or preparation area tables in the data
warehouse database. Such operations can impose significant processing
loads on the databases involved and should be accomplished during a
period of relatively low system use.

After the data has been loaded into the data warehouse database,
verify the referential integrity between dimension and fact tables to
ensure that all records relate to appropriate records in other tables.
You should verify that every record in a fact table relates to a record
in each dimension table that will be used with that fact table.
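A minimal verification query under the same assumed schema: it looks for fact rows whose customer surrogate key has no matching row in the dimension, and should return nothing after a correct load. The same pattern can be repeated for each dimension table.

-- Referential integrity check: fact rows without a matching customer row.
SELECT  f.DateKey, f.CustomerKey, f.ProductKey
FROM    FactSales AS f
LEFT JOIN DimCustomer AS c
       ON f.CustomerKey = c.CustomerKey
WHERE   c.CustomerKey IS NULL;   -- expected result: zero rows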

 TOOLS FOR DATA WAREHOUSE QUALITY:

Data quality tools generally fall into one of three categories: EXTRACT, LOAD & TRANSFER. The focus of this paper is on tools that clean and audit data, with a limited look at tools that extract and migrate data.

Extract:

Data auditing tools enhance the accuracy and correctness of the data
at the source. These tools generally compare the data in the source
database to a set of business rules.

When using a source external to the organization, business rules can be determined by using data mining techniques to uncover patterns in the data. Business rules that are internal to the organization should be entered in the early stages of evaluating data sources. Lexical analysis may be used to discover the business sense of words within the data. The data that does not adhere to the business rules could then be modified as necessary.

Loading: Data cleansing tools are used in the intermediate staging area. The tools in this category have been around for a number of years. A data cleansing tool cleans names, addresses, and other data that can be compared to an independent source. These tools are responsible for parsing, standardizing, and verifying data against known lists such as U.S. Postal Codes. The data cleansing tools contain features which perform the following functions:

• Data parsing (elementizing) - breaks a record into atomic units that can be used in subsequent steps. Parsing includes placing elements of a record into the correct fields.

• Data standardization - converts the data elements to forms that are standard throughout the data warehouse.

• Data correction and verification - matches data against known lists, such as U.S. Postal Codes, product lists, and internal customer lists.

• Record matching - determines whether two records represent data on the same subject.

• Data transformation - ensures consistent mapping between source systems and the data warehouse.

• Householding - combining individual records that have the same address (see the sketch after this list).

• Documenting - documenting the results of the data cleansing steps in the metadata.
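To make the standardization and householding functions concrete, here is a rough Transact-SQL sketch over a hypothetical Customer table with Address1, City, PostalCode, and CustomerID columns; a real cleansing tool would use far richer parsing rules and reference lists.

-- Standardization: map common street-suffix variants to one standard form.
UPDATE  Customer
SET     Address1 = REPLACE(REPLACE(Address1, ' Street', ' St'),
                           ' Avenue', ' Ave');

-- Householding: group individual records that share the same
-- standardized address so they can be treated as one household.
SELECT  Address1,
        City,
        PostalCode,
        COUNT(*)        AS PersonsInHousehold,
        MIN(CustomerID) AS HouseholdRepresentative
FROM    Customer
GROUP BY Address1, City, PostalCode
HAVING  COUNT(*) > 1;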

Transfer:

The third type of tool, the data migration tool, is used to extract data from a source database and migrate the data into an intermediate storage area. The migration tools also transfer data from the staging area into the data warehouse. The data migration tool is responsible for converting the data from one platform to another. A migration tool will map the data from the source to the data warehouse. It can also check for Y2K compliance and perform other simple cleansing activities. There can be a great deal of overlap in these tools, and many of the same features are found in tools of each category.

 COMMONLY FOUND DATA QUALITY ISSUES

A company’s database is its most important asset. It is a collection of information on customers, suppliers, partners, employees, products, inventory, locations, and more. This data is the foundation on which your business operations and decisions are made; it is used in everything from booking sales and analyzing summary reports to managing inventory, generating invoices, and forecasting. To be of greatest value, this data needs to be up-to-date, relevant, consistent, and accurate; only then can it be managed effectively and aggressively to create strategic advantage.

A customer of a hotel and casino makes a reservation to stay at the property using his full name, Johnathan Smith. So, as part of its customer loyalty-building initiative, the hotel’s marketing department sends him an email with a free night’s stay promotion, believing he is a new customer, unaware that the customer is already listed under the hotel’s casino/gaming department as a VIP client under a similar name, John Smith.

The hotel did not have a data quality process in place to standardize,
clean and merge duplicate records to provide a complete view of the
customer. As a result, the hotel was not able to leverage the true value
of its data in delivering relevant marketing to a high value customer.

STEPS TO TACKLE THE PROBLEM:

1. Profiling:

As the first line of defense for your data integration solution, profiling data helps you examine whether your existing data sources meet the quality standards of your solution. Properly profiling your data saves execution time because you identify issues that require immediate attention from the start and avoid the unnecessary processing of unacceptable data sources. Data profiling becomes even more critical when working with raw data sources that do not have referential integrity or quality controls.
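A minimal profiling sketch in Transact-SQL, assuming a hypothetical staging table stg_Customer with Email, Phone, CustomerCode, and BirthDate columns: row counts, null counts, distinct business keys, and obviously out-of-range dates give a quick picture of whether the source meets basic quality standards.

-- Simple column profile of a staging table before it is accepted for loading.
SELECT  COUNT(*)                                                AS TotalRows,
        SUM(CASE WHEN Email IS NULL THEN 1 ELSE 0 END)          AS NullEmails,
        SUM(CASE WHEN Phone IS NULL THEN 1 ELSE 0 END)          AS NullPhones,
        COUNT(DISTINCT CustomerCode)                            AS DistinctBusinessKeys,
        SUM(CASE WHEN BirthDate < '1900-01-01'
                   OR BirthDate > GETDATE() THEN 1 ELSE 0 END)  AS SuspiciousBirthDates
FROM    stg_Customer;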

2. Cleansing:

After a data set successfully meets profiling standards, it still requires data cleansing and de-duplication to ensure that all business rules are properly met. Successful data cleansing requires the use of flexible, efficient techniques capable of handling complex quality issues hidden in the depths of large data sets. Data cleansing corrects errors and standardizes information that can ultimately be leveraged for MDM applications.
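A rough de-duplication sketch in Transact-SQL under the same assumptions (a stg_Customer table with CustomerID, CustomerName, PostalCode, and LastUpdated columns): records are grouped on a normalized name plus postal code, and only the most recently updated record in each group is kept.

-- De-duplication: keep the newest record per normalized name + postal code.
WITH Ranked AS
(
    SELECT  CustomerID,
            ROW_NUMBER() OVER (
                PARTITION BY UPPER(LTRIM(RTRIM(CustomerName))), PostalCode
                ORDER BY     LastUpdated DESC) AS RowRank
    FROM    stg_Customer
)
DELETE  s
FROM    stg_Customer AS s
JOIN    Ranked       AS r ON s.CustomerID = r.CustomerID
WHERE   r.RowRank > 1;   -- older duplicates are removed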

3. Parsing and Standardization:

This technique parses and restructures data into a common format to help build more consistent data. For instance, the process can standardize addresses to a desired format, or to USPS® specifications, which are needed to enable CASS Certified™ processing. This phase is designed to identify, correct and standardize patterns of data across various data sets, including tables, columns, rows, etc.
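A very small illustration of parsing and standardization in Transact-SQL, with assumed AddressLine and StateName columns: a 'number street' address is split into atomic elements and the state name is standardized to a two-letter code. This is only a sketch of the idea; CASS Certified processing requires a dedicated tool.

-- Parse a simple 'number street' address and standardize the state name.
SELECT  AddressLine,
        LEFT(AddressLine, CHARINDEX(' ', AddressLine) - 1)  AS HouseNumber,
        SUBSTRING(AddressLine,
                  CHARINDEX(' ', AddressLine) + 1,
                  LEN(AddressLine))                          AS StreetName,
        CASE StateName
            WHEN 'California' THEN 'CA'
            WHEN 'New York'   THEN 'NY'
            ELSE StateName
        END                                                  AS StateCode
FROM    stg_Customer
WHERE   CHARINDEX(' ', AddressLine) > 0;   -- skip values that cannot be split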

4. Matching:

Data matching consolidates data records into identical groups and links/merges related records within or across data sets. This process locates matches in any combination of over 35 different components, from common ones like address, city, state, ZIP®, name, and phone.
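A minimal matching sketch, assuming two hypothetical sources named HotelReservation and CasinoLoyalty: records are linked on a combination of normalized last name, postal code, and phone number rather than on the full name alone, which is how 'Johnathan Smith' and 'John Smith' from the earlier example could be brought together.

-- Match records across two data sets on a combination of components.
SELECT  r.ReservationID,
        l.LoyaltyID,
        r.FullName AS ReservationName,
        l.FullName AS LoyaltyName
FROM    HotelReservation AS r
JOIN    CasinoLoyalty    AS l
  ON    UPPER(r.LastName)         = UPPER(l.LastName)
 AND    r.PostalCode              = l.PostalCode
 AND    REPLACE(r.Phone, '-', '') = REPLACE(l.Phone, '-', '');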

5. Enrichment:

Data enrichment enhances the value of customer data by attaching additional pieces of data from other sources, including geocoding, demographic data, full-name parsing and genderizing, phone number verification, and email validation. The process provides a better understanding of your customer data because it reveals buyer behavior and loyalty potential.
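An enrichment sketch under the same assumptions: customer records are extended with demographic attributes from a hypothetical external reference table, ref_Demographics, keyed by postal code.

-- Enrichment: attach demographic data from an external reference table.
SELECT  c.CustomerKey,
        c.CustomerName,
        c.PostalCode,
        d.MedianIncome,
        d.UrbanRuralFlag
FROM    DimCustomer        AS c
LEFT JOIN ref_Demographics AS d
       ON c.PostalCode = d.PostalCode;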

6. Monitoring:

This real-time monitoring phase puts automated processes into place to detect when data exceeds pre-set limits. Data monitoring is designed to help organizations immediately recognize and correct issues before the quality of data declines. This approach also empowers businesses to enforce data governance and compliance measures.
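A simple monitoring sketch in Transact-SQL, with an assumed pre-set limit: after each load the e-mail null rate in the staging table is compared against a threshold (5 percent here, purely illustrative) and an error is raised when it is exceeded, which an agent job or alert could then pick up.

-- Monitoring: raise an error when a quality measure exceeds a pre-set limit.
DECLARE @NullRate DECIMAL(5,2), @Msg NVARCHAR(200);

SELECT  @NullRate = 100.0 *
        SUM(CASE WHEN Email IS NULL THEN 1 ELSE 0 END) / COUNT(*)
FROM    stg_Customer;                -- assumes the table is not empty

IF @NullRate > 5.0
BEGIN
    SET @Msg = 'Data quality limit exceeded: '
             + CAST(@NullRate AS NVARCHAR(10))
             + '% of customer e-mails are NULL.';
    RAISERROR (@Msg, 16, 1);
END;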

CONCLUSION

Data warehousing is not a new phenomenon. All large organizations already have data warehouses, but they are just not managing them. Over the next few years, the growth of data warehousing is going to be enormous, with new products and technologies coming out frequently. In order to get the most out of this period, it is going to be important that data warehouse planners and developers have a clear idea of what they are looking for and then choose strategies and methods that will provide them with performance today and flexibility for tomorrow.

Information quality is not an esoteric notion; it directly affects the effectiveness and efficiency of business processes. Information quality also plays a major role in customer satisfaction. Poor data quality is costly. It lowers customer satisfaction, adds expense, and makes it more difficult to run a business and pursue tactical improvements such as data warehouses and re-engineering.

