Once you have identified your business requirements, analysed your source systems and developed a data model for your Data Warehouse, you can start to look at the Extract Transform Load (ETL) processes that are critical to the success of the Data Warehousing project.
What is ETL?
ETL processes can be summarised as:
The extraction
The transformation
The loading
08-06-2014 All Rights Reserved 2
ETL
Extract is the process of extracting data from the original source, such as a database.
Transform is the process of applying rules to the data to turn it into the format required by the target database.
And, as its name suggests, load is the process of loading the transformed (cleansed) data into the target database (data warehouse structures) in preparation for data analysis.
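The three steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the CSV text, the `dim_customer` table and the cleansing rule (trim and title-case the name) are hypothetical and not taken from any particular ETL tool.

```python
import csv
import io
import sqlite3

# Hypothetical source data standing in for an export from a source system.
SOURCE_CSV = """customer_id,name,signup_date
1, alice ,2014-06-08
2,BOB,2014-06-09
"""


def extract(source_text):
    """Extract: read the raw rows from the original source."""
    return list(csv.DictReader(io.StringIO(source_text)))


def transform(rows):
    """Transform: apply rules so the rows match the target format."""
    return [
        (int(row["customer_id"]), row["name"].strip().title(), row["signup_date"])
        for row in rows
    ]


def load(rows, connection):
    """Load: write the cleansed rows into the target table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS dim_customer "
        "(customer_id INTEGER PRIMARY KEY, name TEXT, signup_date TEXT)"
    )
    connection.executemany("INSERT INTO dim_customer VALUES (?, ?, ?)", rows)
    connection.commit()


def run_etl(source_text, connection):
    """Run the three steps in order: extract, transform, load."""
    load(transform(extract(source_text)), connection)


if __name__ == "__main__":
    warehouse = sqlite3.connect(":memory:")  # in-memory stand-in for the target
    run_etl(SOURCE_CSV, warehouse)
    print(warehouse.execute("SELECT * FROM dim_customer").fetchall())
```

Keeping each step in its own function makes it easier to change one step (for example, a new source format) without touching the others.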
ETL DESIGN PROCESS
ETL design is perhaps the most time-consuming stage of the Data Warehouse project.
It is often the case that over 50% of the time dedicated to the Data Warehousing project is spent on designing and developing the ETL processes.
ETL processes determine the quality of the data that ends up in your Data Warehouse, so it is vital that you get them right: if you put rubbish in, you will get rubbish out.
ETL processes will need maintaining and changing over time due to changes in the data sources or the Data Warehouse business requirements.
Badly designed ETL processes can lead to unnecessary time and expense spent on maintaining and updating them.
When designing our ETL processes, it is important to keep in mind that they should improve data quality and integrity. Although a transformation may alter the format or data type of the data, it should not change the meaning of the data.
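As a small illustration of a transformation that changes the format but not the meaning, the sketch below reformats a date and then checks that both representations denote the same calendar day. The source format (DD-MM-YYYY) and the function names are assumptions made for this example.

```python
from datetime import date, datetime

SOURCE_FORMAT = "%d-%m-%Y"  # assumed format of the source system


def to_iso_date(source_value):
    """Reformat a DD-MM-YYYY string as an ISO YYYY-MM-DD string."""
    return datetime.strptime(source_value, SOURCE_FORMAT).date().isoformat()


def same_meaning(source_value, target_value):
    """Return True if both strings denote the same calendar day."""
    source_day = datetime.strptime(source_value, SOURCE_FORMAT).date()
    target_day = date.fromisoformat(target_value)
    return source_day == target_day


if __name__ == "__main__":
    original = "08-06-2014"
    converted = to_iso_date(original)
    print(converted, same_meaning(original, converted))
```

A check like `same_meaning` can be run on a sample of rows after each transformation to catch rules that silently swap day and month.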
DIFFERENT TOOLS IN ETL
Tool - Vendor
IBM Information Server (DataStage) - IBM
Data Services - SAP BO
Informatica PowerCenter - Informatica
Ab Initio - Ab Initio Software Corp.
SQL Server Integration Services (SSIS) - Microsoft
Oracle Data Integrator (ODI) - Oracle
ARCHITECTURE
We have 3 types of architecture, as follows:
Client-Server Architecture
Client & Multi-Server Architecture
Service Oriented Architecture (SOA)
CLIENT SERVER ARCHITECTURE
In the client-server architecture we have only one server and multiple clients.
This architecture was implemented in the following ETL tools, with their respective versions:
Ab Initio - base version 1.0.10
Informatica - from version 4.7.2 and 5.1 (PowerMart)
DataStage - from version 5.0 to 6
[Figure: Client Server Model. The server processes the requests of the clients.]
Limitation of the Client-Server Model:
The server will accept only up to a maximum of 10 clients; if the number of clients increases beyond that, there will be performance issues.
This was the drawback of the Client-Server Architecture.
To overcome this issue, the next model was introduced: the Client & Multi-Server Architecture.
CLIENT & MULTI SERVER ARCHITECTURE
In generic terms it is also called Clustered Architecture or Grid Architecture.
The term Grid implies a group of servers (a minimum of 2 servers and a maximum of N).
This architecture was implemented in the following ETL tools, with their respective versions:
Ab Initio - base version 1.1.3.10
Informatica - from version 6.1, 6.2 and 7.x (PowerCenter)
DataStage - from version 7 and 7.5
In this architecture the servers are interconnected. They divide the jobs equally among themselves and execute them collectively.
For this reason, this architecture is called Performance Oriented Architecture (POA).
[Figure: Client & Multi-Server Model. All jobs are equally processed by the 3 servers.]
Here S1 is the Main Server, and S2 and S3 are the Subordinate Servers.
The request comes from the client C1 and is received by the Main Server S1, which distributes the process equally to S2 and S3.
Since the job is distributed equally, performance is high in this architecture, and hence it is called POA.
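The equal distribution described above can be sketched as follows. This is a simplified illustration, not how any specific ETL tool schedules work: the servers are simulated with threads, the round-robin split is an assumption, and `process_partition` is a hypothetical stand-in for the real job.

```python
from concurrent.futures import ThreadPoolExecutor


def split_equally(records, server_count):
    """Deal records out round-robin so each server gets a near-equal share."""
    return [records[i::server_count] for i in range(server_count)]


def process_partition(partition):
    """Hypothetical stand-in for the work one server does on its share."""
    return [record * 2 for record in partition]


def run_job(records, server_names=("S1", "S2", "S3")):
    """Split the records across the servers and process the shares in parallel."""
    partitions = split_equally(records, len(server_names))
    with ThreadPoolExecutor(max_workers=len(server_names)) as pool:
        results = list(pool.map(process_partition, partitions))
    return dict(zip(server_names, results))


if __name__ == "__main__":
    for server, output in run_job(list(range(9))).items():
        print(server, output)
```

Note that nothing in this sketch watches for a server that fails part-way through.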
Now let us discuss the threat to this architecture.
Suppose server S2 fails in the middle of process execution.
This affects the performance of the entire architecture.
When the job is distributed to the servers, that is, when the records to be processed are split up by the main server, all 3 servers start their own processing, and nothing monitors the servers for the status of job or process completion.
So the drawback of this architecture is the lack of a Fail Over Mechanism (FOM).
When server S2 is down, the performance of all the other servers is affected as well.
The Fail Over Mechanism (FOM) was not implemented in the ETL tool versions listed earlier for this architecture.
For this architecture to be successful all the servers should be up and running.
To overcome this problem, the Consortium of ETL brought a solution in the next level of architecture.
SERVICE ORIENTED ARCHITECTURE (SOA)
This architecture resolved the lack of a Fail Over Mechanism (FOM), which was the problem in the Clustered Architecture.
In this architecture we have a Monitoring Server to monitor the complete process that is to be done.
Rules for this architecture:
Domain - only 1
Nodes - min 1 and max n
Clients - min 1 and max n
[Figure: SOA Model. The Monitoring Server distributes the job equally to the servers.]
This architecture follows a set of naming conventions.
The Monitoring Server is called the DOMAIN. The servers are called NODES.
According to these naming conventions, this architecture is called DNC Architecture (DOMAIN NODE CLIENT).
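One way to picture the role of the domain is the sketch below, in which a domain distributes partitions to nodes and reassigns the work of a failed node to a healthy one. This is an assumption about how such monitoring could behave, written for illustration only; the `Node` class, the `NodeDown` exception and the reassignment policy are hypothetical and do not describe the internals of any ETL product.

```python
class NodeDown(Exception):
    """Raised by a node that fails while processing its partition."""


class Node:
    """Hypothetical node: a logical machine that processes a partition."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def process(self, partition):
        if not self.healthy:
            raise NodeDown(self.name)
        return [record * 2 for record in partition]


def domain_run(records, nodes):
    """Domain: distribute partitions, watch for failures, reassign work."""
    partitions = [records[i::len(nodes)] for i in range(len(nodes))]
    pending = list(zip(nodes, partitions))
    healthy = list(nodes)
    completed, log = [], []
    while pending:
        node, partition = pending.pop(0)
        try:
            completed.extend(node.process(partition))
        except NodeDown:
            if node in healthy:
                healthy.remove(node)
            if not healthy:
                raise RuntimeError("no healthy nodes left")
            # Fail over: hand the unfinished partition to a healthy node.
            log.append(f"{node.name} failed; partition reassigned to {healthy[0].name}")
            pending.append((healthy[0], partition))
    return sorted(completed), log


if __name__ == "__main__":
    cluster = [Node("N1"), Node("N2", healthy=False), Node("N3")]
    print(domain_run(list(range(6)), cluster))
```

The key difference from the clustered model is that the job still completes when one node is down, because the domain tracks which partitions are unfinished.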
[Figure: SOA Model followed by Naming Convention. The Monitoring Server distributes the job equally to the servers.]
Domain: the heart of the system, primarily for the complete administration of the system.
Node: a logical representation of a machine in the domain.
In this architecture there is a hierarchy to be followed.
There are 2 hierarchy methods, (i) MGN and (ii) So-N / HN, where:
MGN - Master Gateway Nodes (the domain in which the node resides)
So-N / HN - Subordinate Nodes / Helper Nodes
[Figure: Cases to be Discussed in SOA.]
This architecture was implemented in the following ETL tools, with their respective versions:
Ab Initio - base version 2.10.15
Informatica - from version 8.6.1 (PowerCenter)
DataStage - from version 8.0.1
INTRODUCTION TO DATASTAGE
DataStage is one of the leading ETL products on the BI market.
The tool allows integration of data across multiple systems and processing of high volumes of data.
DataStage has a user-friendly graphical front end for designing jobs that manage collecting, transforming, validating and loading data from multiple sources into data warehouse systems.
The multiple sources may be enterprise applications such as Oracle, SAP and PeopleSoft, as well as mainframes.
DataStage is capable of integrating metadata across the data environment to maintain consistent analytic interpretations.
DataStage provides data quality and reliability for accurate business analysis and reporting.
HISTORY OF DATASTAGE
DataStage was formerly known as Ardent DataStage, followed by Ascential DataStage in 2001.
In 2005 it was acquired by IBM and added to the WebSphere family.
From 2006 its official name was IBM WebSphere DataStage.
In 2008 it was renamed IBM InfoSphere DataStage.
COMPONENTS OF DATASTAGE
The two major components of DataStage are:
Server Components: Web Admin Console, Repository
Client Components: DataStage Administrator, DataStage Designer, DataStage Director, DataStage Manager
SERVER COMPONENTS
DataStage Server: Runs executable server jobs, under the control of the DataStage Director, that extract, transform and load data into a data warehouse (DWH).
Web Admin Console: Used for user-level administration, such as granting roles and privileges and authorising users.
Repository or project: A central store that contains all the information required to build a DWH or data mart.
CLIENT COMPONENTS
DataStage Administrator: It is used for creating and deleting projects, issuing permissions on a project and setting the environment variables. This is handled by the DataStage administrator.
DataStage Designer: It is used to design the jobs. All DataStage development activities are done here, so a DataStage developer should know this component very well. Jobs can also be executed here.
DataStage Director: It is used to run, validate and schedule the jobs. This is handled by the DataStage developer or operator. It also lets us view the logs for the entire repository.
DataStage Manager: It is used to import and export projects and to view and edit the contents of the repository. This is handled by the DataStage operator or administrator.
TYPES OF JOBS
There are five types of jobs in DataStage.
Server Job
Parallel Job
Mainframe Job
Shared Container
Job Sequence
SERVER JOB
Server jobs are both developed and compiled using DataStage client tools.
Compilation of a server job creates an executable that is scheduled and run from the DataStage Director.
They are recommended for processing low volumes of data.
They are used on non-parallel systems and on SMP (symmetric multiprocessing) systems with up to 64 processors.
Server jobs execute in sequential mode.
PARALLEL JOB
These are compiled and run on the DataStage server in a similar way to server jobs, but support parallel processing on SMP, MPP, and cluster systems.
The DataStage Parallel Extender (PX) brings the power of parallel processing to our applications for extraction and transformation.
Parallel jobs are scalable: the more hardware resources you add, the faster the job will run.
MAINFRAME JOBS
Jobs are developed using the same DataStage client tools as for server jobs, but compilation and execution occur on a mainframe computer.
The DataStage Designer generates a COBOL source file and supporting JCL script, then lets you upload them to the target mainframe computer.
The job is compiled and run on the mainframe computer under the control of native mainframe software.
JOB SEQUENCE AND SHARED CONTAINER
A job sequence allows you to specify a sequence of DataStage jobs to be executed, and actions to take depending on results.
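The idea of a job sequence, running jobs in order and choosing an action based on each result, can be sketched in plain Python. This does not use any DataStage API; `run_sequence`, the job callables and the `on_failure` handler are hypothetical names introduced for the example, and stopping at the first failure is just one possible policy.

```python
def run_sequence(jobs, on_failure):
    """Run (name, job) pairs in order; on a failure, notify and stop."""
    history = []
    for name, job in jobs:
        try:
            job()
            history.append((name, "finished"))
        except Exception as error:
            history.append((name, "failed"))
            on_failure(name, error)  # action taken depending on the result
            break  # later jobs are skipped once one job has failed
    return history


if __name__ == "__main__":
    def extract_job():
        print("extracting")

    def transform_job():
        raise ValueError("bad record")

    def load_job():
        print("loading")

    outcome = run_sequence(
        [("extract", extract_job), ("transform", transform_job), ("load", load_job)],
        on_failure=lambda name, error: print(f"{name} failed: {error}"),
    )
    print(outcome)
```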
Shared containers are reusable job elements.
They typically comprise a number of stages and links.
Copies of shared containers can be used in any number of server jobs and edited as required.