

Data Warehouse and ETL Verification


Test Objectives
A primary objective of testing the DW is to take a sample of real test data all the way through the developed ETL architecture. The test data should be representative of all possible source inputs, including "none or null," contradictory, redundant, duplicated, etc. During each step (extracting from each source, merging with that of other sources, translating and processing, loading into the warehouse, queries/retrievals/reporting from the warehouse), expected inputs and outputs should be verified.

Following are the main areas of testing that should be done for the ETL process:
- Assure that all the records in the source system that should be brought into the data warehouse actually are extracted into the data warehouse: no more, no less.
- Assure that all of the components of the ETL process complete successfully.
- Verify that all of the extracted source data is correctly transformed into dimension tables and fact tables.
- Verify that all of the extracted and transformed data is successfully loaded into the data warehouse.
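As a hypothetical sketch of the first bullet, a source-to-target record count check can be scripted against any pair of tables. The table names and the in-memory database below are invented stand-ins for the real source system and warehouse:

```python
import sqlite3

def count_check(conn, source_table, target_table):
    """Compare record counts between a source table and its warehouse target."""
    cur = conn.cursor()
    src = cur.execute(f"SELECT COUNT(*) FROM {source_table}").fetchone()[0]
    tgt = cur.execute(f"SELECT COUNT(*) FROM {target_table}").fetchone()[0]
    # "No more, no less": both a shortfall and a surplus are defects.
    return {"source": src, "target": tgt, "delta": tgt - src}

# Demo with an in-memory database standing in for the real systems.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_orders (id INTEGER);
    CREATE TABLE fact_orders (id INTEGER);
    INSERT INTO src_orders VALUES (1), (2), (3);
    INSERT INTO fact_orders VALUES (1), (2);
""")
result = count_check(conn, "src_orders", "fact_orders")
print(result)  # delta of -1: one record failed to load
```

A non-zero delta only says counts disagree; the error-log reconciliation described later determines whether the discrepancy was handled or silent.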

Methods
ETL Test Planning
Defining Scope: The first step in the testing process is planning. One of the most important and challenging pieces of planning is defining a scope that is scalable and obtainable. To do this, it is necessary to have a complete understanding of the business needs and project vision. During the requirement development process, the needs and vision must be defined and aligned. It is beneficial to have at least one resource on the validation team with in-depth experience in databases and data models. This resource should be involved early and often in the project lifecycle, not just at the end of the project during the testing phase. This will keep the validation team in the loop for requirement, design, and process changes. Staffing should directly reflect the size of the warehouse; fewer than two resources is too few.

Scripting: Creating a test scripting process that is right for a data warehouse starts with understanding the source systems, the ETL process, and the warehouse destination. Until the validation team has a solid understanding of how all of this works together, scripting cannot begin. There are at least two approaches that can be considered.

Approach I: Follow the data from the source to the target warehouse. This approach validates that the data in the source tables is also in the target tables.

Approach II: Follow the data from the source through the ETL process and into the target warehouse. Validate the data at each transformation: first the source tables, then the staging tables, then the loading tables, and finally the destination tables (the warehouse).

Available resources and established timelines tend to drive the approach that is used for the validation process. If time and resources are available, Approach II is the most comprehensive. Since this approach has logical validation points, the validation team can easily determine when and what data has been lost or incorrectly manipulated. Again, the trade-off with this approach is the time and resources necessary to script and execute this test strategy. Approach I will take less time to script and execute. However, since this approach does not offer logical validation points within the extensive ETL process, when issues are uncovered it will be more difficult and time-consuming to determine exactly where the error occurred in the ETL process.

ETL Data and Process Verification


DW Testing Levels: There are several levels of testing that can be performed during data warehouse testing. Just two examples: 1. constraint testing and source-to-target counts; 2. source-to-target data validation and error processing. The level of testing to be performed should be defined as part of the testing strategy.

Constraints: During constraint testing, the objective is to validate unique constraints, primary keys, foreign keys, indexes, and relationships in the ETL process. The test script should include these validation points. Some ETL processes can be developed to validate constraints during the loading of the warehouse. If the decision is made to add constraint validation to the ETL process, the ETL code must validate all business rules and relational data requirements. Depending solely on the automation of constraint testing is risky. When the setup is not done correctly or maintained throughout the ever-changing requirements process, the validation could become incorrect and nullify the tests.

Counts: The objective of the count test scripts is to determine whether the record counts in the source match the record counts in the target. Some ETL processes are capable of capturing record count information such as records read, records written, records in error, etc. If the ETL process used can capture that level of detail and create a list of the counts, allow it to do so. This will save time during the validation process.

Source-to-Target Verification: No ETL process is smart enough to perform source-to-target field-to-field validation on its own. This piece of the testing cycle is the most labor-intensive and requires the most thorough analysis of the data. There are a variety of tests that can be performed during source-to-target validation. Below is a list of tests that are best practices.

Threshold Testing - Expose any truncation that may be occurring during the transformation or loading of data. For example: Source: table1.field1 (VARCHAR 40); Stage: table2.field5 (VARCHAR 25); Target: table3.field2 (VARCHAR 40). In this example the source field has a threshold of 40, the stage field has a threshold of 25, and the target mapping has a threshold of 40. The last 15 characters will be truncated during the ETL load of the stage table: any data stored in positions 26-40 will be lost during the move from source to staging.

Field-to-Field Testing - Is a constant value being populated during the ETL process? It should not be, unless it is documented in the requirements and subsequently documented in the test scripts. Do the values in the source fields match the values in the respective target fields? Below are two additional field-to-field tests that should occur.

Initialization - During the ETL process, if the code does not re-initialize the cursor (or working storage) after each record, there is a chance that fields with null values may contain data from a previous record. For example: Record 125: Source field1 = red, Target field1 = red. Record 126: Source field1 = null, Target field1 = red (the value carried over from Record 125).
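The threshold test can be automated by comparing declared column widths across the pipeline. The mapping dictionaries below are hypothetical stand-ins for metadata a validator would pull from each system's catalog:

```python
def find_truncation(mappings):
    """Flag any source/stage/target mapping whose intermediate column is
    narrower than the source, which silently truncates data in flight."""
    defects = []
    for m in mappings:
        narrowest = min(m["stage_len"], m["target_len"])
        if narrowest < m["source_len"]:
            defects.append(
                f"{m['source']}: positions {narrowest + 1}-{m['source_len']} lost"
            )
    return defects

# The VARCHAR(40) -> VARCHAR(25) -> VARCHAR(40) case from the example above.
defects = find_truncation([
    {"source": "table1.field1", "source_len": 40,
     "stage_len": 25, "target_len": 40},
])
print(defects)  # flags the 15 truncated positions
```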

Validating Relationships Across Data Sets - Validate parent/child relationship(s). For example: Source parent: Purple; source children: Red and Blue. Target parent: Purple; target children: Red and Yellow (a mismatch the test should flag).

Error Processing Validation: Understanding that a script might fail during data validation may confirm the ETL process is working through process validation. During process validation the testing team will work to identify additional data cleansing needs, as well as identify consistent error patterns that could possibly be diverted by modifying the ETL code. Whether to take the time to modify the ETL process will need to be determined by the project manager, development lead, and the business integrator. It is the responsibility of the validation team to identify any and all records that seem suspect. Once a record has been both data and process validated and the script has passed, the ETL process is functioning correctly.
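A parent/child check like the Purple/Red/Blue example can be expressed as a per-parent set comparison. The relationship dictionaries below are illustrative inputs, not a real schema:

```python
def compare_children(source_rel, target_rel):
    """Validate parent/child relationships across data sets by comparing
    each parent's child set in the source against the target."""
    issues = {}
    for parent, src_children in source_rel.items():
        tgt_children = target_rel.get(parent, set())
        missing = src_children - tgt_children   # dropped by the ETL
        extra = tgt_children - src_children     # introduced by the ETL
        if missing or extra:
            issues[parent] = {"missing": missing, "extra": extra}
    return issues

# The example from the text: Blue was lost and Yellow appeared.
issues = compare_children(
    {"Purple": {"Red", "Blue"}},
    {"Purple": {"Red", "Yellow"}},
)
print(issues)  # Purple's children disagree in both directions
```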


Conversely, if suspect records have been identified and documented during data validation that are not supported through process validation, the ETL process is not functioning correctly, and the development team will need to become involved in finding the appropriate solution. For example, suppose that during the execution of the source-to-target count scripts, suspect counts are identified (there are fewer records in the target table than in the source table). The records that are "missing" should have been captured during the error process and should be found in the error log. If those records do not appear in the error log, the ETL process is not functioning correctly and the development team needs to become involved.
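The reconciliation just described reduces to a set difference: any source key missing from the target must appear in the error log, or the ETL process has dropped it silently. The key values below are invented for the sketch:

```python
def reconcile(source_keys, target_keys, error_log_keys):
    """Classify records that did not reach the warehouse.

    Keys missing from the target but present in the error log were
    rejected in a controlled way; keys missing from both are silent
    losses and indicate a defective ETL process.
    """
    missing = set(source_keys) - set(target_keys)
    explained = missing & set(error_log_keys)
    unexplained = missing - set(error_log_keys)
    return explained, unexplained

explained, unexplained = reconcile(
    source_keys=[101, 102, 103, 104],
    target_keys=[101, 102],
    error_log_keys=[103],
)
print(sorted(explained))    # rejected and logged: acceptable
print(sorted(unexplained))  # silently lost: escalate to development
```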

Inputs
Recovery of source data models and LLD, data dictionaries, and data attribute sources.

Tools/Environment/Setup
Tools: Data extraction software, business rule discovery software, and data analysis tools; Informatica Explorer, Informatica Compare Utility, SQL, MS Access, and MS Excel for data analysis.

Environment: Much of DW and ETL testing should be considered for the development environment, after unit testing has been completed for ETL phases such as extract, then staging, then aggregation and load. As each phase is completed by the construction team, QA can step in to verify the results. Waiting until formal testing to verify the quality of the data often makes defects harder to find, debug, and troubleshoot.

External Team Dependencies


The QA team is highly dependent on receiving a detailed data model, a description of the ETL process, and the LLD design, which provides detailed mappings from source to the LLD plus the data formats, calculations, and transformations performed on the data.

Artifacts Created
- Newly discovered attributes, undefined business rules, and data anomalies (such as fields used for different purposes)
- Defect reports
- Cleansed data; rejected or uncorrectable data
- Report of matched, consolidated, or related data that is suspect or in error
- List of duplicate data records or fields
- List of duplicate data suspects

Data Validation
Objectives
Quality-verified data can be defined as data that has been subjected to a structured quality process to ensure that it meets or exceeds the standards established by the business and its intended consumers. Such standards are typically documented via service level agreements (SLAs) and administered by an organized data governance structure; it is these standards we refer to here. Data quality evaluation and verification for CMS QA is the application of data analysis techniques to existing data stores for the purpose of verifying the actual content, structure, and quality of the data. Ensuring that quality is built in from the beginning of every application database project is critical to the development of high-quality application data. A process with well-defined procedures that are understood and agreed to by stakeholders, including managers, data experts, database developers, and reviewers, should be instituted.

Methods
Develop a plan for testing the new system before tests are started. Testing is critical for newly developed systems. Packaged systems which are already running at numerous sites still require testing to ensure they are properly installed and that system parameters are properly established. Technical testing must be conducted by Information Services and Technology staff. Ensure that the test plan is comprehensive and ensures that:
a. Identified system processes, including internal controls, will be tested.
b. The identified critical success factors will be tested.
c. All data entry screen fields will be tested for edits and for the data they will accept.
d. System tables will be tested.
e. The database will be tested.
f. Live data is used to test the system.
g. The system is run in parallel with the existing system for some period.
h. Volume testing is conducted which simulates peak and normal workloads.
i. The telecommunications component of the system will be tested.
j. System backup and recovery are tested.
k. The system meets the stated systems specifications (e.g. response time).
Determine how tests are going to be documented, who is going to conduct the tests, and how problems will be reported and resolved. Determine who is responsible for ensuring testing has been satisfactorily completed.
For program conversion projects: How will data from the current system be converted for the new system? Will the vendor be responsible for this? If not, then who will? Things to consider are:

a. Who is responsible for identifying the various types of data records in the current system that must be converted to the new system?
b. Who is going to test the conversion plan/programs?
c. Ensure controls are in place, such as editing and control totals, which ensure the converted data is complete and accurate.
d. Ensure a conversion back-out plan exists in case of failure.
e. Who is responsible for checking and approving the converted data?
Ensure plans are in place to provide user training and technical training:
a. Identify who will be conducting the training. Will the trainers be qualified instructors?
b. Will users trained in the system train other users (e.g. train the trainer)?
c. Will the training be timed to correspond with the "go live" date of the new system?
d. Who will receive the technical training on the system?
Determine who is responsible for deciding on the system parameter settings. Who will be responsible for setting the parameter values? Internal Audit should be consulted to ensure that the parameters set provide effective controls. Other things to consider are:
a. Does the vendor have a suggested parameter format?
b. How are the parameter settings reported to management?
c. Who has the ability to change the settings, and are changes logged?
Ensure that everything is prepared to go live on the new system, with assistance from Information Services and Technology:
a. Hardware and operating software are installed and properly configured for the University computing environment.
b. Communications hardware, cabling, and software are properly set up.
c. Software is properly installed.
d. System tables are populated accurately and completely.
e. System interfaces are built and function properly.
f. Special supplies such as forms are in stock.
g. Unrecoverable components of the old system are backed up in case of installation failure.
h. Users and management are satisfied with the performance of the new system.
i. Back-ups are made of the new system and the converted data.
j. Proper user and technical manuals have been received.
Ensure security procedures are defined for obtaining access to the system (i.e. user-ids) and access rights to data. Ensure plans are in place to back up the system and data on a regular basis, and that disaster recovery procedures are defined.
Testing can be divided into three categories:
Data Input - Tests the edits and controls for entering data, such as validations, cross references, and check digits.

Data Processing - Tests to ensure that the programs are working properly: that data tables are accurately updated and internal calculations are correct.
Data Output - Tests to ensure that the reports being generated are in the proper format and provide the proper information.
There are other areas where testing must be considered as well, such as data conversion, hardware, operating systems, and security.
Other Notes:
1. All tests should be documented.
2. All tests should be authorized by the test group leader.
3. The results of the testing should be presented to the Steering Committee and approved.
4. A number of tests may be required to test each field or process in the system. Testing should be done both to see that the system works as it should, and to see if the system rejects data that is inaccurate (e.g. enter both accurate data that should work and inaccurate data that should be rejected).

Data quality evaluation methods include:
Data Profiling - Applying data analysis techniques to existing data stores to determine the actual content, structure, and quality of the data.
Column Property Analysis - The process of assessing individual atomic values to determine (through use of rules to which values must conform) whether they are valid.
Structural Analysis - Identifying violations of rules; similar in purpose to Data Rule Analysis, an analysis to determine that conditions that must be true at all times are consistently so.
Accuracy/Precision - Accuracy refers to how closely the data value agrees with the correct or "true" value. Precision is the ability of a measurement or analytical result to be consistently reproduced, or the number of significant digits to which a value has been measured or calculated. One can simultaneously be extremely precise and totally inaccurate.
Completeness - Completeness measures the presence or absence of data.
Consistency - Data consistency refers to the common definition, understanding, interpretation, and calculation of a data element.
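Column property analysis and completeness can be profiled together by classifying each atomic value against a conformance rule. The 5-digit postal-code pattern below is an assumed rule for illustration:

```python
import re

def profile_column(values, pattern=r"^\d{5}$"):
    """Classify each atomic value in a column as valid, invalid, or
    missing against a conformance rule (an assumed 5-digit pattern)."""
    stats = {"total": len(values), "missing": 0, "valid": 0, "invalid": 0}
    for v in values:
        if v is None or str(v).strip() == "":
            stats["missing"] += 1          # completeness measure
        elif re.match(pattern, str(v)):
            stats["valid"] += 1            # column property conformance
        else:
            stats["invalid"] += 1
    return stats

stats = profile_column(["02139", "90210", "", None, "ABCDE"])
print(stats)  # two valid, two missing, one invalid
```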

Data Input Testing


This is a listing of some of the data input controls that may be tested for each data entry function. Note that not all edits will be involved for every data input screen, and that some screens may have their own unique data edits that are not listed below.
1) Character Checks - Will the field accept alphabetic characters only, numeric characters only, or both?
2) Numeric Value Checks - Does the system recognize a number as being negative, or does it only recognize absolute values? Does the system recognize the number zero? Will the system accept a negative number?
3) Check Digit - If a number has a check digit, does the system reject any entries where that number is not accurate?
4) Limit Tests - Is there a maximum or minimum dollar amount that can be entered into the system before a warning or error message is received? If there is a range, then the range should be tested. Also, are codes used in the system to determine what functions a person, process, or course can do?
5) Reasonableness Tests - Is there a range of reasonable amounts, where a warning message is received if they are exceeded, so that the operator can either investigate or accept the value?
6) Internal Compatibility - Is the data being entered cross-referenced to other data in the application, so that if an error is made, an error message is received?
7) Cross Checks with Data in Other Applications - Do these function properly?
8) Duplicate Transactions - Will the system reject a transaction entered twice in error, such as a supplier invoice?
9) Table Look-Ups - If a code is entered into the field, does the system access the proper table and return the correct information? (This may be a function of table testing.)
10) Existence of Required Data - Where required data (such as an account number) is needed, is an error message received if that data is not present?
11) Confirmation Screens - For on-line/real-time systems, once the data is entered, the system is updated. To ensure that the data is accurately updated, the system should display all data inputted once the enter key is hit and ask if it is correct.
12) Field Lengths and Overflow Checks - Are the field lengths long enough? If an entry is longer than the allowed field length, is an error or warning message received, or is the data truncated? For numeric fields, an error message should occur if the value is longer than the field length.
13) Edit Over-rides - Can the data edits be over-ridden? If so, ensure that the over-ride feature is properly working. (This may be a function of security.)
14) Arithmetic Accuracy/Tolerance Levels - If the numeric value of a field is recalculated and verified by the system, then this recalculation should be tested. Also, if a field is calculated and automatically updated by the system, then the calculation must be confirmed. If tolerance levels are set where the system will accept the number entered if it is out by a certain percent or a certain dollar amount, then these should be tested. An example of this is the amount of GST.

15) Date-Driven Edits - Where a date field is used for validation, test to ensure it is properly working; for example, dates for student withdrawals. Ensure date fields accommodate both the century and the year (e.g. 4 digits).
16) Suspense Files or Flags - Are suspense files or flags used to hold unfinished transactions? If they are, testing should be done to ensure that they function properly and that the data is not accepted in the system as completed. (This can also be considered Data Processing testing.)
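A check-digit edit (item 3 above) can be exercised with known-good and known-bad values. The Luhn algorithm below is assumed purely as an illustration; a real system may use a different check-digit scheme:

```python
def luhn_valid(number: str) -> bool:
    """Check-digit validation using the Luhn algorithm, the scheme
    behind most card and reference numbers."""
    digits = [int(d) for d in number if d.isdigit()]
    checksum = 0
    # Double every second digit from the right; subtract 9 if it exceeds 9.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d = d * 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0

print(luhn_valid("79927398713"))  # True: a well-known valid Luhn number
print(luhn_valid("79927398714"))  # False: wrong check digit, must be rejected
```

The test script should drive both paths: accurate entries that should pass, and entries with a corrupted check digit that the system must reject.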

Internal Testing of Data


1) Delete vs. Reverse - At what point must an entry be reversed vs. being able to delete it? Once the transaction has been updated, it should probably need to be reversed (a reversing entry must be entered, ensuring a proper audit trail) rather than deleted (with no audit trail). Does this work as it should?
2) Automatically Triggered Processing - If the system has automatically triggered processing, then testing must be done to ensure that the calculations or processing are done correctly. Testing should be done to ensure that the right parameters were used and that the output of the processing is accurate.
3) Updating - Testing should be done to ensure that the system is properly updated with the data entered: the right tables are updated with the right information (note that more than one table may be updated for each transaction).
4) Audit Trails - Similar to 3 above, testing should be done to ensure that the system logs and audit trails are properly updated.
5) Table Values - Testing should be done on the procedures for updating the system parameters and code tables. Do the update transactions work properly, including the data entry edits?
6) Initialization and Purge - The process for initializing the new year and purging old data records should be tested (technical).
7) Backup and Recovery - The process for backing up the system, and for recovering the system in the event of failure, should be tested (technical).
8) Arithmetic Calculations - Test to ensure that all arithmetic calculations are performed correctly (adding, subtracting, etc.). Ensure that report totals are correct, that cross adds are done properly, that taxes are properly calculated, etc.
9) Volume Testing - The system should be tested at normal load levels and at peak levels to ensure that there are no unacceptable bottlenecks. A stress test could be performed to determine what the upper processing limit is on the system (i.e. at what volume does system performance become unacceptable).
10) Live Transaction Testing - System processing testing should be conducted with live data (past data can be loaded into the system), as opposed to manufactured data, to ensure that all data processing variations are tested.
11) Database Management System Testing - The database structure should be tested to ensure it is properly designed.
12) Interface with Other Modules/Systems - Testing should be conducted to ensure that other modules are properly updated with data entered and processed in one module. This also applies to data being passed to or from another system. Testing should ensure that the data passed is accurate and complete.

13) Batch Processing - Test any batch processing being done to ensure that it performs properly. This would include batch totaling, transaction rejection, edit checking, and recovery and restart procedures for an abnormal ending in the batch programs.
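Items 8 (arithmetic calculations) and 13 (batch totaling) can be spot-checked with a small harness that recomputes control totals from the detail records. The record layout below is hypothetical:

```python
from decimal import Decimal

def verify_batch(records, reported_total, reported_count):
    """Batch-control check: recompute the control totals from the detail
    records and compare against the totals the batch header reported.
    Decimal avoids the float rounding that plagues monetary cross-adds."""
    actual_total = sum(Decimal(r["amount"]) for r in records)
    actual_count = len(records)
    return (actual_total == Decimal(reported_total)
            and actual_count == reported_count)

batch = [{"amount": "19.99"}, {"amount": "0.01"}, {"amount": "-5.00"}]
print(verify_batch(batch, "15.00", 3))  # True: detail agrees with the header
print(verify_batch(batch, "15.00", 4))  # False: a record was dropped or added
```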

DATA OUTPUT TESTING


1) Review the reports received to ensure that they are properly formatted and include the necessary information.
2) Ensure that the reports are arithmetically correct.
3) Use available reports to test the system processing described above.
4) If exception reporting is being done, then testing should be conducted to ensure that the data extracted is a complete set, based on the exception criteria. For any data queries, testing should be done to ensure that all applicable records were looked at, and that all applicable records were selected and reported.

Inputs for Data Testing


1) Database Models
2) Data Architect documents
3) Business Rules/Requirements
4) DBA Interviews
5) Data Architect Interviews
6) Developer Interviews

Tools for Data Testing


1) WinSQL
2) SQLPlus
3) TOAD
4) MS Access
5) MS Excel

External Team Dependencies


For some projects, it is important to meet with members of the Data Operations group, the Data CoE, and Data Engineering services. They have particular responsibilities for various aspects of data quality and data quality planning, and they may have tools that could be useful in addition to those listed above.

Artifacts Created
- Data quality verification plan
- Data quality test scripts and test cases
- Data quality defect reports

