ETL Tester Training
Assuring Data Content, Structure, and Quality
Wayne Yaddow
Datagaps.com
wyaddow@gmail.com
8/2/2017

About the presenter:
20+ years with IBM: z/OS development and test, and DWH ETL projects
Since 2006, focused on DW data quality testing as a consultant
Co-author of the book Testing the Data Warehouse and author of ETL testing articles for several industry publications
ETL coach/mentor for testers
The importance of project requirements, data models, and data mapping documents
The significance of an ETL QA process
Effective test scenarios
QA strategies, test plans, and test cases
Estimating process for QA resources
ETL tester skill requirements

Discuss whether your design & development teams / supplier provide adequate design / development documentation for test planning
Define project data quality and when it gets addressed in the SDLC
Develop a test strategy (a real one that you can follow)
Acquire the skilled resources early
Prepare carefully: end-to-end data reconciliation practices
How best to choose test scenarios for automation
Understand a pathway to test automation using ETL Validator
How to avoid test automation failures

(Diagram: data analysis, data flow, and ETL development with DW ETL QA across the SDLC, from requirements through design and development)
DW Common Architecture: Sources and DW Targets

Ultimate Goals for ETL Testing
There is an exponentially increasing cost to businesses associated with finding defects late in the development lifecycle. Considering the importance of early detection, we list our primary goals for testing the ETLs.

Data completeness: Ensuring that all expected data is loaded.
Data transformation: Ensuring that all data is transformed correctly according to business rules and/or design specifications.
Data quality: Ensuring that the ETL tool (DataStage) correctly rejects, substitutes default values, corrects or ignores, and reports invalid data.
Performance and scalability tests: Ensuring that data loads and queries perform within expected time frames and that the technical architecture is scalable.
Integration testing: Ensuring that the ETL process functions well with other upstream and downstream processes.
Regression tests: Assuring that existing functionality remains intact each time a new release of code is completed.
User-acceptance testing: Ensuring that the solution meets users' current expectations and anticipates their future expectations.
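The data completeness goal above is typically checked by comparing row counts and reconciling keys between source and target. A minimal sketch in Python with sqlite3 (the customer/customer_dim tables and the row_id column are illustrative, not from the deck):

```python
import sqlite3

def completeness_check(conn, source_table, target_table, key_col):
    """Compare row counts and detect source keys missing from the target."""
    cur = conn.cursor()
    src_count = cur.execute(f"SELECT COUNT(*) FROM {source_table}").fetchone()[0]
    tgt_count = cur.execute(f"SELECT COUNT(*) FROM {target_table}").fetchone()[0]
    # Keys present in the source but absent from the target
    missing = cur.execute(
        f"SELECT {key_col} FROM {source_table} "
        f"EXCEPT SELECT {key_col} FROM {target_table}"
    ).fetchall()
    return {"source_rows": src_count, "target_rows": tgt_count,
            "missing_keys": [row[0] for row in missing]}

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (row_id INTEGER, name TEXT);
    CREATE TABLE customer_dim (row_id INTEGER, name TEXT);
    INSERT INTO customer VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cho');
    INSERT INTO customer_dim VALUES (1, 'Ann'), (2, 'Bob');
""")
result = completeness_check(conn, "customer", "customer_dim", "row_id")
print(result)  # row_id 3 never reached the target
```

In practice the same query pattern runs against the real source and target connections; the in-memory tables here only stand in for them.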
What ETL Development Tools Can Do

Extract data:
  Flat files, txt, DBMS, XML, XLS
  MQ, web services
  Semi-structured data (email, web logs, wiki pages)
  Unstructured data (blogs, documents)
Clean:
  Using lookups, validations, filters, translations, inserting defaults
Transform and load data:
  Change data structures, aggregate, rollup, sort, partition, de-duplicate

Data Movement Projects Using ETLs

Data warehouse (DW or EDW): a database system used for reporting and data analysis. It's a central repository for data which is created by integrating data from one or more disparate sources, often over a long period of time (months, years).
Data integration: involves combining data from different sources and providing users with a unified view of that data.
Data migration: the process of transferring data between storage types, formats, or computer systems, often a one-time project. It is a key consideration for any new system implementation, upgrade, or consolidation.

Source: DataVersity, 2015
Where Data Integration/ETL Testing is Needed

ETL Development Tools and Process
Agenda
DW Concepts and Terms
Challenges of DW Testing
Planning for DW Tests
DW Test Scenarios
Test Data Planning
QA Risk Management
DW Test Tools and Automation
Recommended Tester Skills
DW QA Best Practices
Why ETL Data Errors Occur

Challenges of Finding Defects Late in Process
What Should be Known About Each DW Source?
And therefore, what we should prepare to test

What ETL Processes Can Do
Defects in Source & Target Data: Common ETL Defects and Causes

Issue: Missing Data
Description: Data that does not make it into the target database
Possible causes: Invalid or incorrect lookup table in the transformation logic; bad data from the source database (needs cleansing); invalid joins; invalid field lengths on the target database
Example: The lookup table should contain a field value of "High" which maps to "Critical". However, the source data field contains "Hig" (missing the "h") and fails the lookup, resulting in the target data field containing null. If this occurs on a key field, a possible join would be missed and the entire row could fall out.

Issue: Truncation of Data
Description: Data being lost by truncation of the data field
Possible causes: Transformation logic not taking into account field lengths from the source
Example: The source field value "New Mexico City" is being truncated to "New Mexico C" since the target data field did not have the correct length to capture the entire field.

Issue: Data Type Mismatch
Description: Data types not set up correctly on the target database
Possible causes: Source data field not configured correctly
Example: The source data field was required to be a date; however, when initially configured, it was set up as a VarChar.
Common ETL Defects and Causes (cont.)

Issue: Null Translation
Description: Null source values not being transformed to correct target values
Possible causes: Development team did not include the null translation in the transformation logic
Example: A source data field of null was supposed to be transformed to "None" in the target data field. However, the logic was not implemented, resulting in the target data field containing null values.

Issue: Wrong Translation
Description: Opposite of the Null Translation error. The field should be null but is populated with a non-null value, or the field should be populated but with the wrong value
Possible causes: Development team incorrectly translated the source field for certain values
Examples: 1) The target field should only be populated when the source field contains certain values, and otherwise should be set to null. 2) The target field should be "Odd" if the source value is an odd number, but the target field is "Even" (a very basic example).

Issue: Misplaced Data
Description: Source data fields not being transformed to the correct target data field
Possible causes: Development team inadvertently mapped the source data field to the wrong target data field
Example: A source data field was supposed to be transformed to target data field 'Last_Name'. However, the development team inadvertently mapped the source data field to 'First_Name'.

Issue: Extra Records
Description: Records which should not be in the ETL are included in the ETL
Possible causes: Development team did not include a filter in their code
Example: If a case has the deleted field populated, the case and any data related to the case should not be in any ETL.

Issue: Not Enough Records
Description: Records which should be in the ETL are not included in the ETL
Possible causes: Development team had a filter in their code which should not have been there
Example: If a case was in a certain state, it should be ETL'd over to the data warehouse but not the data mart.

Issue: Transformation Logic Errors/Holes
Description: Testing sometimes can lead to finding holes in the transformation logic, or realizing the logic is unclear
Possible causes: Development team did not take into account special cases; for example, international cities that contain special language-specific characters might need to be dealt with in the ETL code
Examples: 1) Most cases may fall into a certain branch of logic for a transformation, but a small subset of cases (sometimes with unusual data) may not fall into any branches. How the tester's code and the developer's code handle these cases could be different (and possibly both end up being wrong), and the logic is changed to accommodate the cases. 2) Tester and developer have different interpretations of the transformation logic, which results in different values. This will lead to the logic being re-written to become more clear.

Source: RTTS, QuerySurge
Common ETL Defects and Causes (cont.)

Issue: Simple/Small Errors
Description: Capitalization, spacing, and other small errors
Possible causes: Development team did not add an additional space after a comma when populating the target field
Example: Product names on a case should be separated by a comma and then a space, but the target field only has them separated by a comma.

Issue: Sequence Generator
Description: Ensuring that the sequence numbers of reports are in the correct order is very important when processing follow-up reports or answering to an audit
Possible causes: Development team did not configure the sequence generator correctly, resulting in records with a duplicate sequence number

Issue: Undocumented Requirements
Description: Find requirements that are understood but are not actually documented anywhere
Possible causes: Several members of the development team did not understand the "understood" undocumented requirements
Example: There was a restriction in the "where" clause that limited how certain reports were brought over. It was used in mappings that were understood to be necessary, but were not actually in the requirements. Occasionally it turns out that the understood requirements are not what the business wanted.

Issue: Duplicate Records
Description: Duplicate records are two or more records that contain the same data
Possible causes: Development team did not add the appropriate code to filter out duplicate records
Example: Duplicate records in the sales report were doubling up several sales transactions, which skewed the report significantly.

Issue: Numeric Field Precision
Description: Numbers that are not formatted to the correct decimal point or not rounded per specifications
Possible causes: Development team rounded the numbers to the wrong decimal point
Example: The sales data did not contain the correct precision, and all sales were being rounded to the whole dollar.

Issue: Rejected Rows
Description: Data rows that get rejected due to data issues
Possible causes: Development team did not take into account data conditions that could break the ETL for a particular row
Example: Missing data rows on the sales table caused major issues with the end-of-year sales report.

Source: RTTS, QuerySurge
Questions?

Agenda
DW Concepts and Terms
Challenges of DW Testing
Test Planning & Management
DW Test Scenarios
Test Data Planning
QA Risk Management
DW Test Tools and Automation
Recommended Tester Skills
DW QA Best Practices
Plan Your Test Strategy, Methods & Tools

DW Documentation for Test Planning
System testing
End-to-end testing
Regression testing
Load testing
Security testing

Source: Design Essential for Reconcilable Data Warehouse, 2011, Formation Data Pty Ltd

Source: Utopiainc.com
Run aggregating queries to total / summarize values
Format SQL query results
Verify stored procedures and views
Convert data types in query results (dates, times)

Compare aggregate values such as count, avg, max, min between the source and target tables (fact or dimension):

  /* from source */
  SELECT count(row_id), count(fst_name), count(lst_name), avg(revenue)
  FROM customer

  /* from target */
  SELECT count(row_id), count(first_name), count(last_name), avg(revenue)
  FROM customer_dim
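The aggregate comparison above can be automated. A minimal sketch in Python with sqlite3, following the slide's customer/customer_dim example (the renamed columns are paired explicitly, since source and target names may differ per the mapping):

```python
import sqlite3

# Source/target column pairs to compare; names differ per the mapping document.
PAIRS = [("fst_name", "first_name"), ("lst_name", "last_name")]

def compare_aggregates(conn):
    """Compare COUNT per column pair and AVG(revenue) between source and target."""
    cur = conn.cursor()
    mismatches = []
    for src_col, tgt_col in PAIRS:
        src = cur.execute(f"SELECT COUNT({src_col}) FROM customer").fetchone()[0]
        tgt = cur.execute(f"SELECT COUNT({tgt_col}) FROM customer_dim").fetchone()[0]
        if src != tgt:
            mismatches.append((src_col, tgt_col, src, tgt))
    src_avg = cur.execute("SELECT AVG(revenue) FROM customer").fetchone()[0]
    tgt_avg = cur.execute("SELECT AVG(revenue) FROM customer_dim").fetchone()[0]
    if src_avg != tgt_avg:
        mismatches.append(("revenue", "revenue", src_avg, tgt_avg))
    return mismatches

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (row_id INTEGER, fst_name TEXT, lst_name TEXT, revenue REAL);
    CREATE TABLE customer_dim (row_id INTEGER, first_name TEXT, last_name TEXT, revenue REAL);
    INSERT INTO customer VALUES (1, 'Ann', 'Lee', 10.0), (2, 'Bob', 'Ray', 20.0);
    INSERT INTO customer_dim VALUES (1, 'Ann', 'Lee', 10.0), (2, 'Bob', NULL, 20.0);
""")
mismatches = compare_aggregates(conn)
print(mismatches)  # last_name has a NULL, so its COUNT differs: 2 vs 1
```

Note that SQL COUNT(column) skips nulls, which is exactly why the comparison catches a value lost in translation.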
Verify source-to-target loads using data mapping specifications
Verify that all tables, records, and columns were loaded from source to staging
Verify that primary & foreign keys were properly generated using a sequence generator; no orphan foreign keys
Verify that not-null columns were populated
Assure that extraction scripts are granted security access to the source systems

Verify no data truncation in each column of each table
Verify data types and formats are as specified in design
Verify no duplicate records in target tables
Verify data transformations based on business rules
Check for string columns that are incorrectly left- or right-trimmed
Verify Transaction Audit Log recording is occurring
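The truncation and trimming checks above are easy to express as queries. A minimal sketch, assuming a hypothetical customer_dim table whose city column is declared at 12 characters (table, column, and length are illustrative):

```python
import sqlite3

def truncation_and_trim_check(conn, table, col, declared_len):
    """Flag values at the declared length limit (possible truncation)
    and values with leading/trailing whitespace (bad trimming)."""
    cur = conn.cursor()
    at_limit = cur.execute(
        f"SELECT {col} FROM {table} WHERE LENGTH({col}) >= ?", (declared_len,)
    ).fetchall()
    untrimmed = cur.execute(
        f"SELECT {col} FROM {table} WHERE {col} != TRIM({col})"
    ).fetchall()
    return [r[0] for r in at_limit], [r[0] for r in untrimmed]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_dim (city TEXT)")
conn.executemany("INSERT INTO customer_dim VALUES (?)",
                 [("New Mexico C",), ("Boston ",), ("Santa Fe",)])
suspect, untrimmed = truncation_and_trim_check(conn, "customer_dim", "city", 12)
print(suspect)    # values exactly at the 12-character limit: likely truncated
print(untrimmed)  # values carrying stray whitespace
```

Values at the limit are only *suspects*; each must still be compared against its source value to confirm truncation.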
Basic ETL Test Scenarios (cont.)

Verify numeric columns are populated with correct precision
Verify that ETL sessions completed with only planned exceptions
Verify all cleansing, transformation, error, and exception handling
Verify ETL calculations, aggregations, and data mapping correctness

Verify that null source values are translated to the correct target value
Verify correct lookup translation to the target
Verify that there are no extra records in the target (records which should not be in the ETL are not included in the ETL)
Check logs for data loading status, rejected records, and error messages after ETLs (extracts, transformations, loads)
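Null-translation and lookup checks like those above can be scripted directly against paired rows. A minimal sketch, assuming a hypothetical mapping in which a null source status should become 'None' and coded values like 'H' expand to words (the src/tgt tables and the LOOKUP rules are illustrative):

```python
import sqlite3

# Assumed lookup translation rules, for illustration only.
LOOKUP = {None: "None", "H": "High", "L": "Low"}

def lookup_translation_errors(conn):
    """Return (row id, expected target value, actual target value) for rows
    whose target status does not match the assumed translation."""
    cur = conn.cursor()
    rows = cur.execute("""
        SELECT s.id, s.status, t.status
        FROM src s JOIN tgt t ON t.id = s.id
    """).fetchall()
    return [(rid, LOOKUP.get(src), tgt)
            for rid, src, tgt in rows if LOOKUP.get(src) != tgt]

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src (id INTEGER, status TEXT);
    CREATE TABLE tgt (id INTEGER, status TEXT);
    INSERT INTO src VALUES (1, 'H'), (2, NULL), (3, 'L');
    INSERT INTO tgt VALUES (1, 'High'), (2, NULL), (3, 'Low');
""")
errors = lookup_translation_errors(conn)
print(errors)  # row 2: null was never translated to 'None'
```

This reproduces the Null Translation defect from the earlier table: the null passed through untranslated instead of becoming the mapped value.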
Basic ETL Test Scenarios (cont.)

Test Scenario: Structure Validation
Test cases:
Validate the source and the target table structure as per the mapping document.
Validate data types in the source and the target systems based on the mapping document.
Use the mapping document to verify the length and types of data in the source and target schema (they could be different).
Validate column names in the target system.

Test Scenario: Validating the Mapping Document
Test cases:
Validate the mapping document to ensure all the information has been provided. The mapping document should have a change log, data types, lengths, transformation rules, etc.

Test Scenario: Validate Constraints
Test cases:
Validate all specified constraints to assure they are applied to the target tables.

Test Scenario: Data Consistency Check
Test cases:
Verify cases where the length and data type of an attribute vary in different tables.
Verify there is no misuse of integrity constraints such as foreign keys.

Test Scenario: Data Completeness Validation
Test cases:
Verify that all the data is loaded (where it should be loaded) to the target system from the source system.
Verify by counting the number of records in the source and the target systems.
Verify that column boundary values (e.g., min/max) are correct.
Validate the unique values of primary keys.

Test Scenario: Data Correctness Validation
Test cases:
Validate the values of all data in the target system.
Search for misspelled or inaccurate data in target tables.
Basic ETL Test Scenarios (cont.)

Test Scenario: Duplicate Validation
Test cases:
Verify that duplicate values do not exist in the target system when data is coming from multiple columns in source systems.
Validate primary keys and other columns for duplicate values, as per the business requirement.

Test Scenario: Date Validation Checks
Test cases:
Validate date fields for various actions performed in the ETL process.
From_Date should not be greater than To_Date.
The format of date values should be as specified.
Date values should not have junk values or unexpected null values.

Test Scenario: Full Data Validation Using a Minus Query
Test cases:
Validate the full data set in the source and the target tables by using a minus query.
Perform both source-minus-target and target-minus-source.
When the minus query returns rows, they represent mismatching rows.
The count returned by Intersect should match the individual counts of the source and target tables.
If the minus query returns no rows and the intersect count is less than the source count or the target table count, then the table holds duplicate rows.

Tester query identifies errors after ETL
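The minus-query technique described above can be sketched with SQLite's EXCEPT and INTERSECT operators (the src/tgt tables are illustrative stand-ins for the real source and target):

```python
import sqlite3

def minus_query_validation(conn, source, target):
    """Run source-minus-target, target-minus-source, and an intersect count,
    per the full-data-validation scenario."""
    cur = conn.cursor()
    src_minus_tgt = cur.execute(
        f"SELECT * FROM {source} EXCEPT SELECT * FROM {target}").fetchall()
    tgt_minus_src = cur.execute(
        f"SELECT * FROM {target} EXCEPT SELECT * FROM {source}").fetchall()
    intersect_count = cur.execute(
        f"SELECT COUNT(*) FROM (SELECT * FROM {source} "
        f"INTERSECT SELECT * FROM {target})").fetchone()[0]
    return src_minus_tgt, tgt_minus_src, intersect_count

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src (id INTEGER, amount REAL);
    CREATE TABLE tgt (id INTEGER, amount REAL);
    INSERT INTO src VALUES (1, 10.0), (2, 20.0), (3, 30.0);
    INSERT INTO tgt VALUES (1, 10.0), (2, 25.0), (3, 30.0);
""")
s_minus_t, t_minus_s, common = minus_query_validation(conn, "src", "tgt")
print(s_minus_t, t_minus_s, common)  # id 2 mismatches in both directions
```

A mismatching row surfaces once in each direction (old value and new value), while the intersect count of 2 falling short of the 3-row table counts confirms exactly one row differs.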
Basic ETL Test Scenarios (cont.)

Physical duplicates
Logical duplicates
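A physical duplicate repeats an entire row; a logical duplicate repeats the business key with differing attribute values. Both can be found with grouping queries; a minimal sketch over a hypothetical customer_dim table (table and columns are illustrative):

```python
import sqlite3

def find_duplicates(conn):
    """Physical duplicates: the whole row is repeated.
    Logical duplicates: the same business key carries different attributes."""
    cur = conn.cursor()
    physical = cur.execute("""
        SELECT cust_id, name, COUNT(*) FROM customer_dim
        GROUP BY cust_id, name HAVING COUNT(*) > 1
    """).fetchall()
    logical = cur.execute("""
        SELECT cust_id FROM customer_dim
        GROUP BY cust_id HAVING COUNT(DISTINCT name) > 1
    """).fetchall()
    return physical, [r[0] for r in logical]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_dim (cust_id INTEGER, name TEXT)")
conn.executemany("INSERT INTO customer_dim VALUES (?, ?)",
                 [(1, 'Ann'), (1, 'Ann'), (2, 'Bob'), (2, 'Robert')])
physical, logical = find_duplicates(conn)
print(physical)  # (1, 'Ann') appears twice: a physical duplicate
print(logical)   # cust_id 2 has two different names: a logical duplicate
```

Logical duplicates usually need a business ruling (which version is correct?), whereas physical duplicates can simply be de-duplicated.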
Basic ETL Test Scenarios (cont.)

(Source vs. Target (DW) comparison examples)
Basic ETL Test Scenarios (cont.)

What is change data capture (CDC)?
Detects all changes: inserts, updates, deletes
Reads the log to find all changes
Applies all changes
Load Testing

The primary target of load testing is to check whether concurrently running transactions have a performance impact on the database.

Testers check:
The response time for executing the transactions for multiple remote users.
The time taken by the database to fetch specific records.
Recommended Tester Skills

Expert data profiling methods & tools skills
Skills with MS Excel / Access for data analysis
Skills to develop DW/BI test plans
Skills to understand and validate business data transformations
Ability to perform adequate testing with huge volumes of data; selection of data samples

Ex.: Dell's Toad, Oracle SQL Developer, Microsoft SQL Server Management Studio (SSMS)
Data checks for the ETL tester:
Check for inconsistent data formats
Verify correct lookups are used to replace source column values
Verify data from multiple sources is combined correctly
Verify duplicate values or records are removed
Verify normalized spellings
Verify aggregated data
Verify that missing required data values from the source are applied to the target. Ex.: a data field in a source is either optional or mandatory but not enforced, hence intermittent data; however, the field value is required in the target system.

A sample ETL tester job posting:
Strong in SQL queries: Oracle / SQL Server / DB2
Strong with SQL scripts based on ETL mapping documents to compare data
Strong in ETL data validation: Informatica / DataStage / SSIS
Extensive data warehouse testing background working with huge volumes of data
Exposure to end-to-end data validation for ETL & BI systems
Strong in BI report validation in Cognos / Business Objects / MicroStrategy / SSRS BI environments
Work with SMEs to resolve gaps/questions in requirements
Assist developers to recreate test failures, leading to problem resolutions in requirements, code, or test cases
Exposure to DB tools: Toad / PL/SQL Developer / SSMS
Nice to have: exposure to automating DW testing
Source: CareerBuilder website, July 20, 2016

(Table: Condition vs. approximate tester-to-programmer ratio)
Roles in DW testing:
Data and Business Analyst: requirements testing and acceptance
Database Architect: data model / mapping reviews
DBA: set up and verify schemas in the test environment
ETL Developer: unit test planning and execution
Business Sponsor: acceptance test strategy
End users: acceptance test scenario development and execution

Benefits of involving the whole team:
Helps assure that stakeholders are confident in QA
Allows stakeholders to participate in testing what they may know best
Teaches developers new testing ideas and methods
Pair testing with designers, developers, and testers
Learn how developers test code
DW QA Best Practices

1. Provide evidence from published analysts and industry research on the high failure rate due to a lack of best practices
2. Establish principles of DW testing from a QA handbook / guidebook
3. Perform an up-front DW impact assessment to identify hotspots
4. Focus on the impact of delayed DW loading on wider corporate strategy

22. Centralize a repository for all DW project templates, checklists, test artifacts, lessons learned, trackers, questionnaires, training materials
23. Constantly improve DW and BI test competencies / skills of all QA staff
24. Implement risk-based approaches to testing; methodically optimized test case definition is a basic requirement for high test coverage
25. Develop test cases so that they are easy to understand from the business perspective
26. Profile and audit all source data before writing documents
Thank You!
Questions, comments?
Wayne Yaddow
wayne@datagaps.com
1-(914) 466-4066