
INTRODUCTION TO ETL

Once you have identified your business requirements, analysed your source systems and developed a data model for your Data Warehouse, you can then start to look at the Extract, Transform, Load (ETL) processes that are critical to the success of the Data Warehousing project.

What is ETL?

ETL processes can be summarised as:

The extraction
The transformation
The loading
08-06-2014 All Rights Reserved 2
ETL
Extract is the process of extracting data from the original source,
such as a database.

Transform is the process of applying rules to the data to turn it
into the format required by the target database.

And, as its name suggests, load is the process of loading the
transformed (cleansed) data into the target database (data
warehouse structures) in preparation for data analysis.
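The three steps above can be sketched in Python as a toy pipeline (a minimal illustration, not DataStage code; the record layout and the transformation rules are hypothetical):

```python
# Toy ETL pipeline: extract raw rows, transform them into the target
# format, and load them into a "warehouse" structure.
# The field layout (name, date, amount) is a made-up example.

def extract(source_rows):
    """Extract: pull raw records from the source system."""
    return list(source_rows)

def transform(rows):
    """Transform: apply rules to match the target format
    (here: trim/uppercase names, normalise date separators,
    and convert amounts to numbers)."""
    cleansed = []
    for name, sale_date, amount in rows:
        cleansed.append((name.strip().upper(),
                         sale_date.replace("/", "-"),
                         float(amount)))
    return cleansed

def load(rows, target):
    """Load: append cleansed rows to the target table; return row count."""
    target.extend(rows)
    return len(rows)

warehouse = []
raw = [(" acme ", "2014/06/08", "100.50")]
load(transform(extract(raw)), warehouse)
print(warehouse)  # [('ACME', '2014-06-08', 100.5)]
```

Note that the transformation changes the format and data type of each field, but not its meaning, which is exactly the property a good ETL design should preserve.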

ETL DESIGN PROCESS
ETL design is perhaps the most time-consuming stage of the Data Warehouse project.

It is often the case that over 50% of the time dedicated to a Data Warehousing project is spent on designing and developing the ETL processes.

ETL processes determine the quality of the data that ends up in your Data Warehouse, so it is vital that you get them right: if you put rubbish in, you will get rubbish out.
CONT.,
ETL processes will need maintaining and changing over time, due to changes in the data sources or in the Data Warehouse business requirements.

Badly designed ETL processes can lead to lengthy, unnecessary time and expense spent maintaining and updating them.

It is important to keep in mind when designing ETL processes that they should improve data quality and integrity; although a transformation may alter the format, data type, etc. of the data, it should not change the meaning of the data.

DIFFERENT TOOLS IN ETL

IBM Information Server (DataStage) - IBM

Data Services - SAP BO

PowerCenter - Informatica

Ab Initio - Ab Initio Software Corp.

SQL Server Integration Services (SSIS) - Microsoft

Oracle Data Integrator (ODI) - Oracle
ARCHITECTURE
We have three types of architecture, as follows:

Client-Server Architecture
Client & Multi-Server Architecture
Service Oriented Architecture (SOA)

CLIENT SERVER
Architecture
In client-server architecture we have only one server and multiple clients.

This architecture was implemented in the following ETL tools, with their respective versions:

Ab-Initio - base version 1.0.10
Informatica - versions 4.7.2 and 5.1 (PowerMart)
DataStage - versions 5.0 - 6


Client Server Model

(Diagram: the server processes the requests of the clients.)
Cont.,
Limitation of the Client-Server Model:

The server can accept only a small number of clients (around 10); as the number of clients increases, there is a performance issue.

This was the drawback of the Client-Server Architecture.

To overcome this issue, the next model, the Client & Multi-Server Architecture, was introduced.


CLIENT & MULTI
SERVER Architecture
In generic terms it is also called the CLUSTERED ARCHITECTURE.

It is also called the Grid Architecture.

The term Grid implies a group of servers (minimum 2 servers, maximum N).

This architecture was implemented in the following ETL tools, with their respective versions:


Cont.,

Ab-Initio - base version 1.1.3.10
Informatica - versions 6.1, 6.2 & 7.x (PowerCenter)
DataStage - versions 7 & 7.5

In this architecture the servers are interconnected; they execute jobs by dividing them equally among themselves and collectively perform the work.

This architecture is therefore called the Performance Oriented Architecture (POA).
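The equal-division idea can be sketched as a simple round-robin split (an illustration only; the server names and the record set are hypothetical):

```python
# Sketch of the clustered ("grid") idea: a batch of records is divided
# as evenly as possible among the available servers, which then process
# their shares in parallel. Server names are made-up examples.

def distribute(records, servers):
    """Assign records to servers round-robin, giving each server
    an (almost) equal share of the work."""
    shares = {server: [] for server in servers}
    for i, record in enumerate(records):
        shares[servers[i % len(servers)]].append(record)
    return shares

records = list(range(9))          # nine records to process
servers = ["S1", "S2", "S3"]      # three interconnected servers
shares = distribute(records, servers)
for server, share in shares.items():
    print(server, share)          # each server receives three records
```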







Client & Multi-Server Model

(Diagram: all jobs are processed equally by the three servers.)

Cont.,

Here S1 is called the Main Server, and S2 and S3 are called the Subordinate Servers.

The request comes from client C1 and is received by the Main Server, which distributes the work equally between S2 and S3.

Since the jobs are distributed equally, performance in this architecture is high; hence it is called the POA.

Now let us discuss the threat to this architecture.
Cont.,
Suppose server S2 fails in the middle of process execution.

This affects the performance of the entire architecture.

When the job is distributed to the servers, i.e. when the records to be processed are divided up by the main server, all three servers start their own processing, and there is no one to monitor the servers for the status of job/process completion.

So this architecture lacks a Fail Over Mechanism.
Cont.,
If server S2 is down, the performance of all the other servers is also degraded.

This Fail Over Mechanism (FOM) was not implemented in the ETL tool versions specified earlier for this architecture.

For this architecture to be successful, all the servers must be up and running.

To overcome this problem, the ETL consortium brought a solution in the next level of architecture.
SERVICE ORIENTED
ARCHITECTURE ( SOA )
This architecture resolved the problem of the FOM (Fail Over Mechanism), which was the weakness of the Clustered Architecture.

In this architecture we have a Monitoring Server to monitor the complete process that is to be done.

Rules for this architecture:

Only 1 Domain
Nodes - min 1 & max n
Clients - min 1 & max n
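The role of the Monitoring Server can be sketched as follows (a toy illustration, not any tool's actual fail-over logic; the node names and the round-robin reassignment are assumptions):

```python
# Sketch of the fail-over idea behind SOA: the monitoring server (the
# domain) tracks which nodes are healthy, and records destined for a
# failed node are reassigned to the surviving nodes instead of being
# lost. Node names are made-up examples.

def run_with_failover(records, nodes, failed):
    """Assign records round-robin across all nodes; when the domain
    detects that a target node has failed, it reroutes that record
    to a healthy node."""
    healthy = [n for n in nodes if n not in failed]
    done = {n: [] for n in healthy}
    for i, record in enumerate(records):
        node = nodes[i % len(nodes)]
        if node in failed:                    # domain detects the failure
            node = healthy[i % len(healthy)]  # ...and reassigns the record
        done[node].append(record)
    return done

result = run_with_failover(list(range(6)), ["N1", "N2", "N3"], failed={"N2"})
print(result)  # all six records still processed, by N1 and N3 only
```

In the clustered architecture of the previous section there is no such monitor, so the records assigned to a failed server would simply never complete.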






SOA Model

(Diagram: the Monitoring Server distributes jobs equally to the servers.)
Cont.,
Now for the naming conventions to be followed in this architecture.

The Monitoring Server is called the DOMAIN.
The servers are called NODES.

According to these naming conventions, this architecture is called the DNC Architecture (Domain, Node, Client).
SOA Model with the Naming Convention

(Diagram: the Domain distributes jobs equally to the Nodes.)
Cont.,
Domain - the heart of the system, used primarily for the complete administration of the system.

Node - a logical representation of a machine in the domain.

In this architecture there is a certain hierarchy to be followed.

There are 2 hierarchy methods: (i) MGN and (ii) So-N / HN, where:

MGN - Master Gateway Node (the domain in which the node resides)
So-N - Sub-Ordinate Nodes / Helper Nodes
08-06-2014 All Rights Reserved 21
Cases to be Discussed in SOA
Cont.,
This architecture was implemented in the following ETL tools, with their respective versions:

Ab-Initio - base version 2.10.15
Informatica - from version 8.6.1 (PowerCenter)
DataStage - from version 8.0.1



INTRODUCTION TO
DATASTAGE

DataStage is one of the leading ETL products on the BI market.

The tool allows integration of data across multiple systems and processing of high volumes of data.

DataStage has a user-friendly graphical front end for designing jobs that manage collecting, transforming, validating and loading data from multiple sources into the data warehouse systems.
Cont.,
The multiple sources may be enterprise applications such as Oracle, SAP, PeopleSoft, and mainframes.

The application is capable of integrating metadata across the data environment to maintain consistent analytic interpretations.

DataStage provides data quality and reliability for accurate business analysis and reporting.

HISTORY OF DATASTAGE
DataStage was formerly known as Ardent DataStage, and then as Ascential DataStage from 2001.

In 2005 it was acquired by IBM and added to the WebSphere family.

From 2006 its official name was IBM WebSphere DataStage.

In 2008 it was renamed IBM InfoSphere DataStage.
COMPONENTS OF
DATASTAGE
The two major components of DataStage are:

Server Components
DataStage Server
Web Admin Console
Repository

Client Components
DataStage Administrator
DataStage Designer
DataStage Director
DataStage Manager
SERVER COMPONENTS
DataStage Server:
Runs executable server jobs, under the control of the DataStage Director, that extract, transform, and load data into a data warehouse.

Web Admin Console:
Used for user-level administration, such as granting roles and privileges and authorising users.

Repository (or project):
A central store that contains all the information required to build a data warehouse or data mart.

CLIENT COMPONENTS
DataStage Administrator:
Used for creating and deleting projects, issuing permissions on projects, and setting environment variables.
This is handled by the DataStage administrator.

DataStage Designer:
Used to design jobs.
All DataStage development activities are done here, so a DataStage developer should know this part very well.
Jobs can also be executed from here.

CLIENT COMPONENTS
DataStage Director:
Used to run, validate, and schedule jobs.
This is handled by the DataStage developer/operator.
Logs for the entire repository can be viewed here.

DataStage Manager:
Used to import and export projects and to view and edit the contents of the repository.
This is handled by the DataStage operator/administrator.

TYPES OF JOBS
There are five types of jobs in DataStage:

Server Job
Parallel Job
Mainframe Job
Shared Container
Job Sequence
SERVER JOB
Server jobs are developed and compiled using the DataStage client tools.

Compiling a server job creates an executable that is scheduled and run from the DataStage Director.

They are recommended for processing low volumes of data.

They are used on non-parallel systems and on SMP (Symmetric Multiprocessing) systems with up to 64 processors.

Server jobs execute in sequential mode.
PARALLEL JOB
These are compiled and run on the DataStage server in a similar way to server jobs, but support parallel processing on SMP, MPP, and cluster systems.

The DataStage Parallel Extender (PX) brings the power of parallel processing to our applications for extraction and transformation.

Parallel jobs are scalable: the more hardware resources you add, the faster the job runs.

MAINFRAME JOBS
Mainframe jobs are developed using the same DataStage client tools as server jobs, but compilation and execution occur on a mainframe computer.

The DataStage Designer generates a COBOL source file and supporting JCL script, then lets you upload them to the target mainframe computer.

The job is compiled and run on the mainframe under the control of native mainframe software.
JOB SEQUENCE AND
SHARED CONTAINER
A job sequence allows you to specify a sequence of DataStage jobs to be executed, and actions to take depending on the results.

Shared containers are reusable job elements.

They typically comprise a number of stages and links.

Copies of shared containers can be used in any number of server jobs and edited as required.
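The control-flow idea of a job sequence can be sketched as follows (a toy illustration; the job names and the failure handler are hypothetical, and real sequences offer richer triggers than this):

```python
# Sketch of a job sequence: run jobs in order, and take an action
# (here: call a failure handler and stop) depending on each result.
# Job names and the handler are made-up examples.

def run_sequence(jobs, on_failure):
    """Run each (name, job) pair in turn; if a job reports failure,
    invoke the handler with its name and stop the sequence."""
    for name, job in jobs:
        ok = job()
        if not ok:
            on_failure(name)
            return False
    return True

log = []
jobs = [("extract", lambda: True),
        ("transform", lambda: True),
        ("load", lambda: True)]
run_sequence(jobs, on_failure=lambda name: log.append(name))
print(log)  # [] -- all jobs succeeded, so the handler was never called
```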
