
DataStage Architecture

DataStage is an ETL tool. It is a client-server technology and an integrated toolset used for designing, running, monitoring, and administering data acquisition applications, which are known as jobs.
A job is a graphical representation of the data flow from source to target; it is designed with source definitions, target definitions, and transformation rules.
The DataStage software consists of client and server components.

[Architecture diagram: the client components (DataStage Designer, Director, Manager, and Administrator) connect over TCP/IP to the DataStage Server and the DataStage Repository.]
DataStage 8.0 (8.1 and 8.5) standalone version:

DataStage 8 was a standalone version in which the DataStage engine and services reside on the DataStage server, while the repository (metadata) database is installed on an Oracle/DB2 database server. The client is installed on the local PC and accesses the servers using the dsclient.
Metadata (Repository): This is created as one database with two schemas (xmeta and iauser). It can be set up as a RAC database (Active/Active across two servers), so that if one database fails the other takes over without the running DataStage jobs losing their connections.
1. xmeta: holds information about the projects and the DataStage software.
2. iauser: holds information about the DataStage users in IIS or the webconsole.
Note: We can install two or three DataStage instances on the same server (for example ds-8.0, ds-8.1, and ds-8.5) and bring up whichever version we want to work on. This reduces hardware cost, but only one instance can be up and running at a time.
DataStage 8 was also a standalone version, but here three distinct components were introduced:
1. Information Server (IIS) - isadmin
2. WebSphere server - wasadmin
3. DataStage server - dsadm
Client components & server components
Client components are:
1. DataStage Designer
2. DataStage Administrator
3. DataStage Director
4. DataStage Manager
5. Webconsole
6. IBM InfoSphere DataStage and QualityStage Multi-Client Manager

Note: The DataStage Manager is merged into the DataStage Designer from version 8 onwards.

DS Client components:
1) DataStage Designer: It is used to create the DataStage application, known as a job. The following activities can be performed in the Designer window:
a) Create the source definitions.
b) Create the target definitions.
c) Develop transformation rules.
d) Design jobs.
2) DataStage Administrator: This component is used to create or delete projects, clean up metadata stored in the repository, and install NLS.
3) DataStage Manager: It is used to perform the following tasks:
a) Create table definitions.
b) Perform metadata backup and recovery.
c) Create customized components.
4) DataStage Director: It is used to validate, schedule, run, and monitor DataStage jobs.

5) Webconsole: The Webconsole is used to create DataStage users and to perform administration. It is handled by the DataStage administrator.

6) Multi-Client Manager: It is used to install multiple clients (such as ds-7.5, ds-8.1, or ds-8.5) on the local PC and to swap to whichever version is required. It is used by DataStage developers, operators, and administrators.

DS Server components:
DataStage Repository: It is a server-side component that stores the information needed to build the data warehouse.
DataStage Server: It executes the jobs that we create with the DataStage client.

Different Types of DataStage jobs


A job is an ordered series of individual stages, linked together to describe the flow of data from source to target.

Parallel Jobs:

Executed under the control of the DataStage server runtime environment.
Built-in functionality for pipeline and partition parallelism.
Compiled into OSH (Orchestrate Scripting Language).
OSH executes operators, which are executable C++ class instances.
Runtime monitoring in DataStage Director.

Job Sequences (Batch Jobs, Controlling jobs):

Master server jobs that kick off jobs and other activities.
Can kick off server or parallel jobs.
Runtime monitoring in DataStage Director.

Server Jobs (Requires Server Edition License):

Executed by the DataStage Server Edition.
Compiled into BASIC (interpreted pseudo-code).
Runtime monitoring in DataStage Director.

Mainframe Jobs (Requires Mainframe Edition license):


Compiled into COBOL.
Executed on the mainframe, outside of DataStage.

Difference between server jobs and parallel jobs:


Server jobs:
a) Server jobs handle a smaller volume of data.
b) They have fewer components.
c) Data processing is slower.
d) They work purely on SMP (Symmetric Multi-Processing).
e) The transformer stage has a high performance impact.
Parallel jobs:
a) Parallel jobs handle a high volume of data.
b) They work on parallel processing concepts.
c) They apply parallelism techniques.
d) They follow MPP (Massively Parallel Processing).
e) They have more components compared to server jobs.
f) They work on the Orchestrate framework.

Stages / Various stages in DataStage:

A stage defines a database, a file, or a processing step.
There are two types of stages:
a) Built-in stages
b) Plug-in stages
Built-in stages: These stages define the extraction, transformation, and loading. They are further divided into two types:
I. Passive stages: Stages that define read and write access are known as passive stages.
Ex: all database stages in the Designer palette window.
II. Active stages: Stages that define data transformation and filtering are known as active stages.
Ex: all processing stages.

Parallelism techniques in DataStage

Parallelism is the process of performing ETL tasks in a parallel approach in order to build the data warehouse. Parallel jobs support hardware systems such as SMP and MPP to achieve parallelism.
There are two types of parallelism techniques:
A. Pipeline parallelism
B. Partition parallelism
Pipeline parallelism: The data flows continuously through the pipeline. All stages in the job operate simultaneously.
Data pipelining is the process of pulling records from the source system and moving them through the sequence of processing functions that are defined in the data flow (the job). Because records are flowing through the pipeline, they can be processed without writing the records to disk.

For example, suppose the source has 4 records: as soon as the first record starts processing, the remaining records are processed simultaneously.

Let us assume that the input file has only one column:
Customer_Name
Clark
Raj
James
Cameroon
After the 1st record (Clark) is extracted by the 1st stage (Sequential File stage) and moved to the second stage for processing, the 2nd record (Raj) is immediately extracted, even before the 1st record reaches the final stage (Peek stage). Thereby, by the time the 1st record reaches the Peek stage, the 3rd record (James) would already have been extracted by the Sequential File stage.
Data can be buffered in blocks so that each process is not slowed when other components are running.
This approach avoids deadlocks and speeds performance by allowing both upstream and downstream
processes to run concurrently.
Without data pipelining, the following issues arise:

Data must be written to disk between processes, degrading performance and increasing storage
requirements and the need for disk management.
The developer must manage the I/O processing between components.
The process becomes impractical for large data volumes.
The application will be slower, as disk use, management, and design complexities increase.
Each process must complete before downstream processes can begin, which limits performance
and full use of hardware resources.
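
As an illustration of the pipelining idea above (a minimal plain-Python sketch with made-up stage names, not DataStage code), the example below runs each stage in its own thread and streams records downstream through a small in-memory buffer, so nothing is landed to disk between stages:

# Minimal sketch of pipeline parallelism: each stage runs in its own thread and
# streams records downstream through a small in-memory buffer (no disk landing).
# Stage names and the uppercase "transform" are illustrative assumptions only.
import threading
import queue

END = object()  # sentinel marking end of data

def source(out_q):
    for rec in ["Clark", "Raj", "James", "Cameroon"]:
        out_q.put(rec)          # record is available downstream immediately
    out_q.put(END)

def transform(in_q, out_q):
    while (rec := in_q.get()) is not END:
        out_q.put(rec.upper())  # process each record as soon as it arrives
    out_q.put(END)

def peek(in_q):
    while (rec := in_q.get()) is not END:
        print("peek:", rec)

q1, q2 = queue.Queue(maxsize=2), queue.Queue(maxsize=2)  # small buffers between stages
stages = [threading.Thread(target=source, args=(q1,)),
          threading.Thread(target=transform, args=(q1, q2)),
          threading.Thread(target=peek, args=(q2,))]
for s in stages: s.start()
for s in stages: s.join()

All three stages run concurrently: while peek is printing "Clark", transform may already be working on "Raj", which mirrors how upstream and downstream stages overlap in a pipelined DataStage job.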

Partition parallelism: In this form of parallelism, the same job is effectively run simultaneously by several processors, and each processor handles a separate subset of the total records.
Data partitioning is an approach to parallelism that involves breaking the record set into partitions, or subsets of records. Data partitioning generally provides linear increases in application performance.
The figure shows data that is partitioned by customer surname before it flows into the Transformer stage.

Partition parallelism divides the incoming stream of data into subsets that will be processed separately by a separate node/processor. These subsets are called partitions, and each partition is processed by the same operation.
Let us understand this in layman's terms:
As we know, DataStage can be implemented on an SMP or MPP architecture. This provides us with additional processors for performing operations.
To leverage this processing capability, partition parallelism was introduced in the Information Server (DataStage).
Let us assume that in our current set-up there are 4 processors available for use by DataStage. The details of these processors are defined in the DataStage configuration file.
For example, if the source has 100 records and 4 partitions, the data will be partitioned equally across the 4 partitions, meaning each partition gets 25 records. When the first partition starts, the remaining three partitions start processing simultaneously and in parallel.
Sample configuration file:
{
    node "node1"
    {
        fastname "newton"
        pools ""
        resource disk "/user/development/datasets" {pools ""}
        resource scratchdisk "/user/development/scratch" {pools ""}
    }
    node "node2"
    {
        fastname "newton"
        pools ""
        resource disk "/user/development/datasets" {pools ""}
        resource scratchdisk "/user/development/scratch" {pools ""}
    }
    node "node3"
    {
        fastname "newton"
        pools ""
        resource disk "/user/development/datasets" {pools ""}
        resource scratchdisk "/user/development/scratch" {pools ""}
    }
    node "node4"
    {
        fastname "newton"
        pools ""
        resource disk "/user/development/datasets" {pools ""}
        resource scratchdisk "/user/development/scratch" {pools ""}
    }
}

Using the configuration file, DataStage can identify the 4 available processors and can utilize them to
perform operations simultaneously.
For a similar example, assume the input column is:
Customer_Name
James
Sonata
Yash
Carey
Suppose we have to add _Female to the end of the string for names that start with S, and _Male for names that don't. We will use the Transformer stage to perform the operation. By selecting the 4-node configuration file, we will be able to perform the required operation on the names four times faster than before.
The required operation is replicated on each processor (i.e. node), and each customer name is processed on a separate node simultaneously, thereby greatly increasing the performance of DataStage jobs.
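
As a rough illustration only (plain Python, not the actual Transformer derivation), the sketch below partitions the names across 4 worker processes and applies the same suffix rule in each partition; the round-robin split and the _Female/_Male rule follow the example above:

# Sketch of partition parallelism: the same operation runs in each of 4 partitions.
from multiprocessing import Pool

NAMES = ["James", "Sonata", "Yash", "Carey"]

def add_suffix(name):
    """The 'Transformer' logic: the suffix depends on the first letter."""
    return name + ("_Female" if name.startswith("S") else "_Male")

def process_partition(partition):
    """Each node/processor applies the same operation to its own subset."""
    return [add_suffix(name) for name in partition]

if __name__ == "__main__":
    # Round-robin the records into 4 partitions (one per node in the config file).
    partitions = [NAMES[i::4] for i in range(4)]
    with Pool(processes=4) as pool:
        results = pool.map(process_partition, partitions)
    print(results)  # [['James_Male'], ['Sonata_Female'], ['Yash_Male'], ['Carey_Male']]

Each worker process stands in for one of the nodes defined in the configuration file; the operation itself is identical everywhere, only the subset of records differs.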
A scalable architecture should support many types of data partitioning, including the following types (two of these are sketched below):
Hash key (data) values
Range
Round-robin
Random
Entire
Modulus
Database partitioning, etc.
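
As a minimal illustration (plain Python, not the DataStage partitioners themselves), the sketch below shows how round-robin and hash partitioning assign records to the 4 nodes defined in the configuration file:

# Sketch of two partitioning methods over 4 partitions (nodes).
# Record values are the customer names used in the earlier examples.
NUM_PARTITIONS = 4
RECORDS = ["Clark", "Raj", "James", "Cameroon", "Sonata", "Yash", "Carey"]

def round_robin_partition(records, n):
    """Deal records out to the partitions in turn, giving an even spread."""
    parts = [[] for _ in range(n)]
    for i, rec in enumerate(records):
        parts[i % n].append(rec)
    return parts

def hash_partition(records, n, key=lambda r: r):
    """Records with the same key value always land in the same partition."""
    parts = [[] for _ in range(n)]
    for rec in records:
        parts[hash(key(rec)) % n].append(rec)
    return parts

print(round_robin_partition(RECORDS, NUM_PARTITIONS))
print(hash_partition(RECORDS, NUM_PARTITIONS))

Round-robin gives the most even spread and is useful for load balancing, while hash partitioning guarantees that all records with the same key end up in the same partition, which matters for joins, aggregations, and remove-duplicate operations.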

IBM Information Server automatically partitions data based on the type of partition that the stage requires. Typical packaged tools lack this capability and require developers to manually create data partitions, which results in costly and time-consuming rewriting of applications or of the data partitions whenever the administrator wants to use more hardware capacity.
In a well-designed, scalable architecture, the developer does not need to be concerned about the number of partitions that will run, the ability to increase the number of partitions, or repartitioning of data.

Dynamic repartitioning:
In the examples shown in figure 1 and figure 2, data is partitioned based on customer surname, and then the data partitioning is maintained throughout the flow.
This type of partitioning is impractical for many uses, such as a transformation that requires data partitioned on surname but must then be loaded into the data warehouse using the customer account number.

Dynamic data repartitioning is a more efficient and accurate approach. With dynamic data
repartitioning, data is repartitioned while it moves between processes without writing the data to disk,
based on the downstream process that data partitioning feeds. The IBM Information Server parallel
engine manages the communication between processes for dynamic repartitioning.
Data is also pipelined to downstream processes as soon as it is available.
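
To make the repartitioning idea concrete (again a plain-Python sketch with made-up record fields, not the parallel engine itself), the example below takes partitions keyed on surname and re-deals the records, in memory, into partitions keyed on account number before the next stage consumes them:

# Sketch of dynamic repartitioning: records move from surname-based partitions
# to account-number-based partitions in memory, without landing on disk.
# The record fields (surname, account_no) are illustrative assumptions.
NUM_PARTITIONS = 2

surname_partitions = [
    [{"surname": "Clark", "account_no": 1007}, {"surname": "Carey", "account_no": 1010}],
    [{"surname": "Raj", "account_no": 1003}, {"surname": "Sonata", "account_no": 1006}],
]

def repartition(partitions, key, n):
    """Re-deal every record into a new partition chosen by the new key."""
    new_parts = [[] for _ in range(n)]
    for part in partitions:
        for rec in part:
            new_parts[hash(key(rec)) % n].append(rec)
    return new_parts

# The downstream load stage needs data partitioned on account number, not surname.
account_partitions = repartition(surname_partitions, lambda r: r["account_no"], NUM_PARTITIONS)
print(account_partitions)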
