Businesses are increasingly focusing on the collection and organization of data for strategic decision-making. The ability to review historical trends and monitor near real-time operational data has become a key competitive advantage.

This article provides practical recommendations for testing extract, transform and load (ETL) applications, based on years of experience testing data warehouses in the financial services and consumer retailing areas. Every attempt has been made to keep this article tool-agnostic so that it applies to any organization attempting to build a new data warehouse or improve an existing one.
Testing Goals
There is an exponentially increasing cost associated with finding software defects later in the development lifecycle. In data warehousing, this is compounded by the additional business cost of using incorrect data to make critical business decisions. Given the importance of early detection of software defects, let's first review some general goals of testing an ETL application:
* Data completeness. Ensures that all expected data is loaded.
* Data transformation. Ensures that all data is transformed correctly according to business rules and/or design specifications.
* Data quality. Ensures that the ETL application correctly rejects, substitutes default values, corrects or ignores and reports invalid data.
* Performance and scalability. Ensures that data loads and queries perform within expected time frames and that the technical architecture is scalable.
* Integration testing. Ensures that the ETL process functions well with other upstream and downstream processes.
* User-acceptance testing. Ensures the solution meets users' current expectations and anticipates their future expectations.
* Regression testing. Ensures existing functionality remains intact each time a new release of code is completed.
Data Completeness
One of the most basic tests of data completeness is to verify that all expected data loads into the data warehouse. This includes validating that all records, all fields and the full contents of each field are loaded. Strategies to consider include:
* Comparing record counts between source data, data loaded to the warehouse and rejected records (see the sketch after this list).
* Comparing unique values of key fields between source data and data loaded to the warehouse. This is a valuable technique that points out a variety of possible data errors without doing a full validation on all fields.
* Utilizing a data profiling tool that shows the range and value distributions of fields in a data set. This can be used during testing and in production to compare source and target data sets and point out any data anomalies from source systems that may be missed even when the data movement is correct.
* Populating the full contents of each field to validate that no truncation occurs at any step in the process. For example, if the source data field is a string(30), make sure to test it with 30 characters.
* Testing the boundaries of each field to find any database limitations. For example, for a decimal(3) field include values of -99 and 999, and for date fields include the entire range of dates expected. Depending on the type of database and how it is indexed, it is possible that the range of values the database accepts is too small.
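The first two checks lend themselves to simple automation. Below is a minimal sketch in Python, assuming the staging area and the warehouse are both reachable through DB-API connections; the table names (stg_orders, dw_orders, dw_orders_reject) and the order_id key column are illustrative placeholders rather than anything prescribed by this article.

```python
# Minimal sketch of completeness checks: record counts and unique key values.
# Assumes two DB-API connections (source_conn, target_conn); table and column
# names are illustrative placeholders.

def fetch_count(conn, table):
    """Return the row count of a table."""
    cur = conn.cursor()
    cur.execute(f"SELECT COUNT(*) FROM {table}")
    return cur.fetchone()[0]

def fetch_key_set(conn, table, key_column):
    """Return the set of distinct key values in a table."""
    cur = conn.cursor()
    cur.execute(f"SELECT DISTINCT {key_column} FROM {table}")
    return {row[0] for row in cur.fetchall()}

def check_completeness(source_conn, target_conn):
    # Counts: source rows should equal loaded rows plus rejected rows.
    source_count = fetch_count(source_conn, "stg_orders")
    loaded_count = fetch_count(target_conn, "dw_orders")
    rejected_count = fetch_count(target_conn, "dw_orders_reject")
    assert source_count == loaded_count + rejected_count, (
        f"count mismatch: {source_count} source vs "
        f"{loaded_count} loaded + {rejected_count} rejected")

    # Keys: comparing distinct key values catches dropped or duplicated
    # records without comparing every field.
    source_keys = fetch_key_set(source_conn, "stg_orders", "order_id")
    target_keys = fetch_key_set(target_conn, "dw_orders", "order_id")
    missing = source_keys - target_keys
    unexpected = target_keys - source_keys
    assert not missing, f"keys missing from warehouse: {sorted(missing)[:10]}"
    assert not unexpected, f"keys not in source: {sorted(unexpected)[:10]}"
```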
Data Transformation
Validating that data is transformed correctly based on business rules can be the most complex part of testing an ETL application with significant transformation logic. One typical method is to pick some sample records and "stare and compare" to validate data transformations manually. This can be useful but requires manual testing steps and testers who understand the ETL logic. A combination of automated data profiling and automated data movement validations is a better long-term strategy. Here are some simple automated data movement techniques:
* Create a spreadsheet of scenarios of input data and expected results and validate these with the business customer. This is a good requirements elicitation exercise during design and can also be used during testing (a sketch of this approach follows the list).
* Create test data that includes all scenarios. Enlist the help of an ETL developer to automate the process of populating data sets from the scenario spreadsheet to allow for flexibility, because scenarios will change.
* Utilize data profiling results to compare the range and distribution of values in each field between source and target data.
* Validate correct processing of ETL-generated fields such as surrogate keys.
* Validate that data types in the warehouse are as specified in the design and/or the data model.
* Set up data scenarios that test referential integrity between tables. For example, what happens when the data contains foreign key values not in the parent table?
* Validate parent-to-child relationships in the data. Set up data scenarios that test how orphaned child records are handled.
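As an illustration of the scenario-spreadsheet approach, the sketch below assumes the spreadsheet has been exported to a CSV file (scenarios.csv) with one column holding the natural key of the input record and the remaining columns holding the expected transformed values. The dw_customer table, the customer_nk column and the %s parameter style are assumptions that would need to match the actual warehouse and database driver.

```python
# Minimal sketch of a scenario-driven transformation check. All table,
# column and file names are illustrative assumptions.
import csv

def load_scenarios(path="scenarios.csv"):
    """Read scenario rows: natural key plus expected output column values."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def check_transformations(target_conn, scenarios):
    cur = target_conn.cursor()
    failures = []
    for scenario in scenarios:
        key = scenario.pop("customer_nk")
        # Fetch the transformed row for this natural key from the warehouse.
        cur.execute("SELECT * FROM dw_customer WHERE customer_nk = %s", (key,))
        row = cur.fetchone()
        if row is None:
            failures.append(f"{key}: no row loaded")
            continue
        actual = dict(zip([d[0] for d in cur.description], row))
        # Compare every expected column value taken from the spreadsheet.
        for column, expected in scenario.items():
            if str(actual.get(column)) != expected:
                failures.append(
                    f"{key}.{column}: expected {expected!r}, "
                    f"got {actual.get(column)!r}")
    assert not failures, "transformation mismatches:\n" + "\n".join(failures)
```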
Data Quality
For the purposes of this discussion, data quality is defined as how the ETL system handles data rejection, substitution, correction and notification without modifying data. To ensure success in testing data quality, include as many data scenarios as possible. Typically, data quality rules are defined during design, for example:
* Reject the record if a certain decimal field has nonnumeric data.
* Substitute null if a certain decimal field has nonnumeric data.
* Validate and correct the state field if necessary based on the ZIP code.
* Compare product code to values in a lookup table, and if there is no match, load anyway but report to users.
Depending on the data quality rules of the application being tested, scenarios to test might include null key values, duplicate records in source data and invalid data types in fields (e.g., alphabetic characters in a decimal field). Review the detailed test scenarios with business users and technical designers to ensure that all are on the same page. Data quality rules applied to the data will usually be invisible to the users once the application is in production; users will only see what is loaded to the database. For this reason, it is important to ensure that what is done with invalid data is reported to the users. These data quality reports present valuable data that sometimes reveals systematic issues with source data. In some cases, it may be beneficial to populate the "before" data in the database for users to view.
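To make such scenario testing concrete, the sketch below checks the first rule listed above: records whose decimal field contains nonnumeric data must be rejected rather than loaded. The stg_orders, dw_orders and dw_orders_reject tables and the amount field are illustrative assumptions.

```python
# Minimal sketch of a data quality scenario test for a reject rule.
# Table and column names are illustrative assumptions.

def check_reject_rule(source_conn, target_conn):
    src = source_conn.cursor()
    tgt = target_conn.cursor()

    # Find source records whose decimal field does not parse as a number.
    src.execute("SELECT order_id, amount FROM stg_orders")
    bad_keys = set()
    for order_id, amount in src.fetchall():
        try:
            float(amount)
        except (TypeError, ValueError):
            bad_keys.add(order_id)

    # Every such record should appear in the reject table and none of them
    # should have been loaded to the warehouse.
    tgt.execute("SELECT order_id FROM dw_orders_reject")
    rejected = {row[0] for row in tgt.fetchall()}
    tgt.execute("SELECT order_id FROM dw_orders")
    loaded = {row[0] for row in tgt.fetchall()}

    assert bad_keys <= rejected, (
        f"not rejected: {sorted(bad_keys - rejected)[:10]}")
    assert not (bad_keys & loaded), (
        f"loaded despite rule: {sorted(bad_keys & loaded)[:10]}")
```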
Performance and Scalability
As the volume of data in a data warehouse grows, ETL load times can be expected to increase, and performance of queries can be expected to degrade. This can be mitigated by having a solid technical architecture and good ETL design. The aim of performance testing is to point out any potential weaknesses in the ETL design, such as reading a file multiple times or creating unnecessary intermediate files. The following strategies will help discover performance issues:
* Load the database with peak expected production volumes to ensure that this volume of data can be loaded by the ETL process within the agreed-upon window.
* Compare these ETL loading times to loads performed with a smaller amount of data to anticipate scalability issues. Compare the ETL processing times component by component to point out any areas of weakness (see the timing sketch after this list).
* Monitor the timing of the reject process and consider how large volumes of rejected data will be handled.
* Perform simple and multiple-join queries to validate query performance on large database volumes. Work with business users to develop sample queries and acceptable performance criteria for each query.
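Component-by-component timing is easy to automate if each ETL component can be invoked from the test harness. The sketch below shows one way to capture and compare those timings; the component list and the agreed-upon load window are assumptions, not values from this article.

```python
# Minimal sketch of per-component load timing. The load window and the way
# components are invoked are illustrative assumptions.
import time

LOAD_WINDOW_SECONDS = 4 * 60 * 60  # hypothetical agreed-upon load window

def time_components(components):
    """Run each ETL component in order and record its elapsed wall-clock time.

    `components` is an ordered list of (name, callable) pairs.
    """
    timings = {}
    for name, run in components:
        start = time.monotonic()
        run()
        timings[name] = time.monotonic() - start
    return timings

def check_load_window(timings):
    # Per-component timings at small and peak volumes can be compared side by
    # side to spot the components that scale worst.
    for name, elapsed in sorted(timings.items(), key=lambda kv: -kv[1]):
        print(f"{name:<30} {elapsed:10.1f} s")
    total = sum(timings.values())
    assert total <= LOAD_WINDOW_SECONDS, (
        f"load took {total:.0f} s, exceeding the {LOAD_WINDOW_SECONDS} s window")
```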
Integration Testing
Typically, system testing only includes testing within the ETL application. The endpoints for system testing are the input and output of the ETL code being tested. Integration testing shows how the application fits into the overall flow of all upstream and downstream applications. When creating integration test scenarios, consider how the overall process can break and focus on touchpoints between applications rather than within one application. Consider how process failures at each step would be handled and how data would be recovered or deleted if necessary.
Most issues found during integration testing are either data related or result from false assumptions about the design of another application. Therefore, it is important to integration test with production-like data. Real production data is ideal, but depending on the contents of the data, there could be privacy or security concerns that require certain fields to be randomized before using it in a test environment. As always, don't forget the importance of good communication between the testing and design teams of all systems involved. To help bridge this communication gap, gather team members from all systems together to formulate test scenarios and discuss what could go wrong in production. Run the overall process from end to end in the same order and with the same dependencies as in production. Integration testing should be a combined effort and not solely the responsibility of the team testing the ETL application.
User-Acceptance Testing
The main reason for building a data warehouse application is to make data available to business users. Users know the data best, and their participation in the testing effort is a key component of the success of a data warehouse implementation. User-acceptance testing (UAT) typically focuses on data loaded to the data warehouse and any views that have been created on top of the tables, not on the mechanics of how the ETL application works. Consider the following strategies:
* Use data that is either from production or as near to production data as possible. Users typically find issues once they see the "real" data, sometimes leading to design changes.
* Test database views, comparing view contents to what is expected. It is important that users sign off on and clearly understand how the views are created.
* Plan for the system test team to support users during UAT. The users will likely have questions about how the data is populated and need to understand details of how the ETL works.
* Consider how the users will require the data to be loaded during UAT and negotiate how often the data will be refreshed.
Regression Testing
Regression testing is revalidation of existing functionality with each new release of code. When building test cases, remember that they will likely be executed multiple times as new releases are created due to defect fixes, enhancements or upstream system changes. Building automation during system testing will make the process of regression testing much smoother. Test cases should be prioritized by risk in order to help determine which need to be rerun for each new release. A simple but effective and efficient strategy to retest basic functionality is to store source data sets and results from successful runs of the code and compare new test results with previous runs. When doing a regression test, it is much quicker to compare results to a previous execution than to do an entire data validation again.
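One way to implement this store-and-compare approach is to export the loaded tables to sorted flat files after a successful run and diff later runs against those baselines. The sketch below assumes the tables fit comfortably in such exports; the table names and file layout are illustrative.

```python
# Minimal sketch of baseline ("golden") comparison for regression runs.
# Baseline file layout and table names are illustrative assumptions.
import csv
from pathlib import Path

def export_table(conn, table, out_path):
    """Dump a table to CSV with rows sorted so runs compare deterministically."""
    cur = conn.cursor()
    cur.execute(f"SELECT * FROM {table}")
    rows = sorted(tuple(str(v) for v in row) for row in cur.fetchall())
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(d[0] for d in cur.description)  # header row
        writer.writerows(rows)

def compare_to_baseline(current_path, baseline_path):
    """Fail if the new run's export differs from the stored baseline."""
    current = Path(current_path).read_text().splitlines()
    baseline = Path(baseline_path).read_text().splitlines()
    assert current == baseline, (
        f"{current_path} differs from baseline {baseline_path} "
        f"({len(current)} vs {len(baseline)} lines)")
```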
Taking these considerations into account during the design and testing portions of building a data warehouse will ensure that a quality product is produced and prevent costly mistakes from being discovered in production.
Source: http://www.dmreview.com/article_sub.cfm?articleId=1086005
