ETL Testing
By - SrilakshmiSudhaker
A data warehouse database contains structured data for query and analysis and can be accessed by users. The data warehouse can be created or updated at any time with minimal disruption to operational systems; this is ensured by the strategy implemented in the ETL process.
The source for the data warehouse is data extracted from operational databases. The data is validated, cleansed, transformed and finally aggregated, at which point it is ready to be loaded into the data warehouse.
A data mart is generated from the data warehouse and contains data focused on a given subject and data that is frequently accessed or summarized.
A data warehouse provides a common data model for all data of interest, regardless of the data's source. This makes it easier to report on and analyze information than it would be if multiple data models were used to retrieve information such as sales invoices, order receipts, general ledger charges, etc.
Over their life, data warehouses can have high costs, and maintenance costs in particular are high.
ETL Concept:
ETL is the automated and auditable data acquisition process from source systems that involves one or more sub-processes of data extraction, data transportation, data transformation, data consolidation, data integration, data loading and data cleaning.
E - Extracting data from source operational or archive systems, which are the primary sources of data for the data warehouse.
T - Transforming the data, which may involve cleaning, filtering, validating and applying business rules.
L - Loading the data into the data warehouse or any other database or application that houses the data.
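A minimal sketch of these three steps in Python may make the flow concrete. The file name, table schema and the "drop rows with no customer id" rule are hypothetical, purely for illustration:

    import csv
    import sqlite3

    def extract(path):
        # E: read raw records from a source flat file (hypothetical sales.csv)
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

    def transform(records):
        # T: clean, filter and apply a business rule (assumed rule: drop rows
        # with no customer id, convert amounts to numbers)
        cleaned = []
        for row in records:
            if not row.get("customer_id"):
                continue  # filter out invalid rows
            cleaned.append({"customer_id": row["customer_id"],
                            "amount": float(row["amount"])})
        return cleaned

    def load(rows, db_path):
        # L: write the prepared rows into the target table (hypothetical schema)
        con = sqlite3.connect(db_path)
        con.execute("CREATE TABLE IF NOT EXISTS sales (customer_id TEXT, amount REAL)")
        con.executemany("INSERT INTO sales VALUES (:customer_id, :amount)", rows)
        con.commit()
        con.close()

    load(transform(extract("sales.csv")), "warehouse.db")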
ETL Process:
The ETL process consists of the extraction, transformation and loading steps described below.
Extraction:
The first part of an ETL process involves extracting the data from the source systems. Most data warehousing projects consolidate data from different source systems, and each separate system may use a different data format. Common data source formats are relational databases and flat files, but sources may also include non-relational database structures such as Information Management System (IMS), other data structures such as Virtual Storage Access Method (VSAM) or Indexed Sequential Access Method (ISAM), or even data fetched from outside sources through web spidering or screen-scraping. Extraction converts the data into a format suitable for transformation processing.
An intrinsic part of the extraction involves parsing the extracted data, which checks whether the data meets an expected pattern or structure. If not, the data may be rejected entirely or in part.
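As a sketch of this parsing step, the following routes each extracted row to an accepted or rejected set. The expected structure here, an order id of the form ORD-NNNN, is an assumption for illustration:

    import csv
    import re

    ORDER_ID_PATTERN = re.compile(r"^ORD-\d{4}$")  # assumed expected structure

    def extract_with_validation(path):
        accepted, rejected = [], []
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                # parse the extracted data: check it meets the expected pattern
                if ORDER_ID_PATTERN.match(row.get("order_id", "")):
                    accepted.append(row)
                else:
                    rejected.append(row)  # rejected entirely or routed for review
        return accepted, rejected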
Transformation:
Transformation is the series of tasks that prepares the data for loading into the warehouse. Once the data is secured, you have to worry about its format and structure, because it will not be in the format needed for the target: the grain level or the data types, for example, might be different. The data cannot be used as it is; rules and functions need to be applied to transform it.
ETL must support data integration for data coming from multiple sources and data coming at different times. This has to be a seamless operation, to avoid overwriting existing data, creating duplicate data or, even worse, simply being unable to load the data into the target.
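A sketch of a transformation step that changes both grain and data type while guarding against duplicates from re-sent feeds. The column names and the monthly grain are assumptions for illustration:

    from collections import defaultdict
    from datetime import datetime

    def transform(rows):
        # change grain: roll daily transactions up to one row per customer per month
        monthly = defaultdict(float)
        seen = set()
        for row in rows:
            key = (row["customer_id"], row["txn_id"])
            if key in seen:
                continue  # avoid creating duplicate data from re-sent feeds
            seen.add(key)
            month = datetime.strptime(row["txn_date"], "%Y-%m-%d").strftime("%Y-%m")
            monthly[(row["customer_id"], month)] += float(row["amount"])  # type change
        return [
            {"customer_id": c, "month": m, "total_amount": amt}
            for (c, m), amt in monthly.items()
        ]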
Loading:
The loading process is critical to integration and consolidation. It decides the modality of how data is added to the warehouse or simply rejected. Methods like addition, updating or deleting are executed at this step. What happens to the existing data? Should the old data be deleted because of new information? Or should the data be archived? Should the data be treated as additional data to the existing one?
So data has to be loaded into the data warehouse with utmost care, and only a data auditing process can establish the necessary level of confidence. This auditing process normally happens after the data is loaded.
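Continuing the sketch above, a loading step whose chosen modality is "update existing data rather than duplicate it" (an upsert against a hypothetical SQLite target; a real warehouse would use its own bulk-load mechanism), followed by a simple post-load audit count:

    import sqlite3

    def load(rows, db_path):
        con = sqlite3.connect(db_path)
        con.execute(
            "CREATE TABLE IF NOT EXISTS customer_monthly ("
            "customer_id TEXT, month TEXT, total_amount REAL, "
            "PRIMARY KEY (customer_id, month))"
        )
        # the modality decision: update existing rows instead of duplicating them
        con.executemany(
            "INSERT INTO customer_monthly (customer_id, month, total_amount) "
            "VALUES (:customer_id, :month, :total_amount) "
            "ON CONFLICT (customer_id, month) DO UPDATE SET "
            "total_amount = excluded.total_amount",
            rows,
        )
        con.commit()
        # a simple post-load audit, as described above: count what arrived
        loaded = con.execute("SELECT COUNT(*) FROM customer_monthly").fetchone()[0]
        con.close()
        return loaded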
Data Quality - This ensures that the ETL application correctly rejects invalid data, substitutes default values, and corrects and reports it (see the sketch after this list).
Data transformation - This ensures that all data is correctly transformed according to business rules and design specifications.
Performance and scalability - This ensures that data loads and queries perform within expected time frames and that the technical architecture is scalable.
Integration testing - This ensures that the ETL process functions well with other upstream and downstream applications.
Regression testing - This keeps the existing functionality intact each time a new release of code is completed.
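A sketch of a data quality check of this kind, verifying that invalid input is rejected and that a missing value receives a default. The rule that a missing country defaults to "UNKNOWN" is an assumed business rule for illustration:

    def apply_quality_rules(row):
        # assumed rules: reject rows with no id; default a missing country
        if not row.get("customer_id"):
            return None  # rejected and reported upstream
        row["country"] = row.get("country") or "UNKNOWN"  # default substitution
        return row

    def test_quality_rules():
        assert apply_quality_rules({"customer_id": ""}) is None
        fixed = apply_quality_rules({"customer_id": "C1", "country": ""})
        assert fixed["country"] == "UNKNOWN"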
Basically, data warehouse testing is divided into two categories: back-end testing and front-end testing. The former, which is ETL testing, compares the source systems' data to the end-result data in the loaded area. The latter refers to the user checking the data by comparing their MIS with the data displayed by the end-user tools.
Data Validation:
Data completeness is one of the basic forms of data validation. It verifies that all expected data is loaded into the data warehouse. This includes validating all records and fields and ensuring that the full contents of each field are loaded.
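A minimal completeness check might compare record counts between source and target and spot-check that a field survived the load intact. The table and column names here are assumptions:

    import sqlite3

    def check_completeness(source_db, target_db):
        src = sqlite3.connect(source_db)
        tgt = sqlite3.connect(target_db)
        # all expected records loaded: counts must match
        src_count = src.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
        tgt_count = tgt.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
        assert src_count == tgt_count, f"row count mismatch: {src_count} vs {tgt_count}"
        # full contents of each field loaded: no lost ids in the target
        nulls = tgt.execute(
            "SELECT COUNT(*) FROM sales WHERE customer_id IS NULL OR customer_id = ''"
        ).fetchone()[0]
        assert nulls == 0, f"{nulls} rows lost their customer_id during load"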
Data Transformation:
Validating that the data is transformed correctly based on business rules can be one of the most complex parts of testing an ETL application with significant transformation logic. One way of testing is to pick some sample records and compare them manually to validate the data transformation, but this method requires manual testing steps and testers who have a good amount of experience and understanding of the ETL logic.
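A sketch of the sample-record approach, partially automated: re-apply the business rule independently to a few source rows and compare with what the ETL produced. The 10% discount rule, the field names and target_by_id lookup are hypothetical:

    import random

    def expected_net_price(row):
        # the business rule, implemented independently of the ETL code
        # (assumed rule: 10% discount for orders over 100)
        price = float(row["price"])
        return round(price * 0.9, 2) if price > 100 else price

    def test_transformation_on_samples(source_rows, target_by_id, sample_size=20):
        for row in random.sample(source_rows, min(sample_size, len(source_rows))):
            loaded = target_by_id[row["order_id"]]
            assert loaded["net_price"] == expected_net_price(row), row["order_id"]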
Unit testing: Traditionally this has been the task of the developer. This is white-box testing to ensure the module or component is coded as per the agreed-upon design specifications. The developer should focus on the following:
a) All inbound and outbound directory structures are created properly, with appropriate permissions and sufficient disk space, and all tables used during the ETL are present with the necessary privileges.
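For example, a developer's unit check along these lines. The directory paths, table names and database target are assumptions for illustration:

    import os
    import sqlite3

    INBOUND = "/data/etl/inbound"    # hypothetical directory layout
    OUTBOUND = "/data/etl/outbound"

    def test_directories_exist_with_permissions():
        for path in (INBOUND, OUTBOUND):
            assert os.path.isdir(path), f"missing directory: {path}"
            assert os.access(path, os.R_OK | os.W_OK), f"bad permissions: {path}"

    def test_etl_tables_present():
        con = sqlite3.connect("warehouse.db")  # hypothetical target
        tables = {r[0] for r in con.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'")}
        assert {"sales", "customer_monthly"} <= tables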
System testing: Generally the QA team owns this responsibility. For them the design document is the bible, and the entire set of test cases is directly based upon it. Here we test for the functionality of the application, and it is mostly black-box. The major challenge here is the preparation of test data: an intelligently designed input dataset can bring out the flaws in the application more quickly. Wherever possible, use production-like data; you may also use data generation tools or customized tools of your own to create test data, as sketched below. We must test for all possible combinations of input and specifically check for errors and exceptions. An unbiased approach is required to ensure maximum efficiency. Knowledge of the business process is an added advantage, since we must be able to interpret the results functionally and not just code-wise.
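A sketch of a small customized tool for generating such an input dataset, deliberately mixing valid rows with boundary and error cases. The field set matches the hypothetical examples above:

    import csv
    import random

    def generate_test_data(path, n=100):
        cases = [
            lambda i: {"order_id": f"ORD-{i:04d}", "amount": "10.50"},   # valid
            lambda i: {"order_id": f"ORD-{i:04d}", "amount": ""},        # missing value
            lambda i: {"order_id": "BAD-ID", "amount": "10.50"},         # bad pattern
            lambda i: {"order_id": f"ORD-{i:04d}", "amount": "-1"},      # boundary case
        ]
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=["order_id", "amount"])
            writer.writeheader()
            for i in range(n):
                writer.writerow(random.choice(cases)(i))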
The QA team must test for:
Integration testing: This is done to ensure that the application developed works from an end-to-end perspective. Here we must consider the compatibility of the DW application with upstream and downstream flows, and we need to ensure data integrity across the flow. Our test strategy should include testing for:
Acceptance testing: This is the most critical part, because here the actual users validate your output datasets. They are the best judges of whether the application works as they expect. However, business users may not have proper ETL knowledge, so the development and test teams should be ready to answer questions about the ETL process as it relates to data population. The test team must have sufficient business knowledge to translate the results in business terms. The load windows, the refresh period for the DW and the views created should also be signed off by the users.
Summary: