Now you could argue that this is possible using a Pivot stage. But for the sake of this article, let's try
doing it using a Transformer!
Below is a screenshot of our input data.
We are going to read the above data from a sequential file and transform it to look like this:
Step 2:
In the adjacent image you can see a new box called Loop Condition. This is where we are going to control
the loop variables.
A loop condition works much like a while statement in programming, so we need a condition that
identifies how many times the loop is supposed to execute.
In our example we need to loop over the data 3 times to get the column data onto subsequent rows,
so let's have:
@ITERATION <= 3
The loop variable's derivation then picks a different input column on each pass, for example:
If @ITERATION = 1 Then DSLink2.Name1 Else If @ITERATION = 2 Then DSLink2.Name2 Else DSLink2.Name3
Now all we have to do is map this loop variable to our output Name column.
After running the job, we did a View Data on the output stage, and here is the data, as desired.
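The pivot performed by the Transformer loop can be sketched in plain code. Below is a minimal Python sketch; the column names Id, Name1, Name2, and Name3 are illustrative assumptions, not taken from the actual job:

```python
# Pivot columns onto rows, mimicking a Transformer loop with @ITERATION <= 3.
def pivot_rows(records):
    output = []
    for rec in records:
        # One output row per loop iteration, like the loop-variable derivation:
        # If @ITERATION = 1 Then Name1 Else If @ITERATION = 2 Then Name2 Else Name3
        for iteration in (1, 2, 3):
            name = rec["Name1"] if iteration == 1 else (
                rec["Name2"] if iteration == 2 else rec["Name3"])
            output.append({"Id": rec["Id"], "Name": name})
    return output

rows = [{"Id": 100, "Name1": "A", "Name2": "B", "Name3": "C"}]
print(pivot_rows(rows))
# [{'Id': 100, 'Name': 'A'}, {'Id': 100, 'Name': 'B'}, {'Id': 100, 'Name': 'C'}]
```

Each input row produces three output rows, one per loop iteration, which is exactly what the View Data screenshot shows.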
Making some tweaks to the above design, we can implement other variations of this pattern as well.
This blog post gives you complete details on how we can improve the
performance of DataStage parallel jobs using appropriate partitioning
methods.
Specify hash partitioning for stages that require processing of groups of related records.
Partitioning keys should include only those key columns that are necessary for proper grouping.
If the grouping is on a single integer key column, go for Modulus partitioning on the same key column.
If the data is highly skewed and the key column values and their distribution will not change
significantly over time, use the Range partitioning technique.
Use Round Robin partitioning to distribute data evenly across all partitions (if grouping is not
needed). This is strongly suggested when the input data arrives sequentially or is heavily skewed.
Same partitioning requires minimum resources and can be used to optimize a job and to
eliminate repartitioning of already partitioned data.
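The hash, modulus, and round-robin choices above can be illustrated with a small Python sketch. In DataStage the parallel framework does this assignment for you; the functions below are only a conceptual model, and the sample keys are made up:

```python
from itertools import count

def hash_partition(key, n):
    # Rows with the same key value always land in the same partition,
    # which is what grouping stages need.
    return hash(key) % n

def modulus_partition(int_key, n):
    # Cheaper than hashing when the grouping key is a single integer column.
    return int_key % n

def round_robin_partition(counter, n):
    # Even distribution regardless of key values; no grouping guarantee.
    return next(counter) % n

counter = count()
keys = [10, 11, 12, 13]
print([modulus_partition(k, 2) for k in keys])            # [0, 1, 0, 1]
print([round_robin_partition(counter, 2) for _ in keys])  # [0, 1, 0, 1]
```

Note that round robin balances perfectly but scatters related records, which is why it is suggested only when grouping is not needed.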
When the input data set is sorted in parallel, use the Sort Merge collector, which
produces a single sorted stream of rows. When the input data set is sorted in parallel
and range partitioned, the Ordered collector method is preferred for collection.
For a round robin partitioned input data set, use the Round Robin collector to reconstruct rows in
input order, as long as the data set has not been repartitioned or reduced.
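Conceptually, a Sort Merge collector merges already-sorted partitions into one sorted stream without re-sorting everything; Python's `heapq.merge` does the same thing. The partition contents below are made up for illustration:

```python
import heapq

# Each partition was sorted in parallel; the collector merges them into
# a single sorted stream in one pass, never re-sorting the whole data set.
partition_0 = [1, 4, 7]
partition_1 = [2, 5, 8]
merged = list(heapq.merge(partition_0, partition_1))
print(merged)  # [1, 2, 4, 5, 7, 8]
```

This only works because each input stream is already sorted, which mirrors the requirement that the data set be sorted in parallel before using the Sort Merge collector.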
In scenarios where the same data (a huge number of records) is to be shared among more than
one job in the same project, use the Data Set stage approach instead of re-reading the same
data again.
If the input file has a huge number of records and the business logic allows splitting up
the data, then run the job in parallel to get a significant improvement in
performance.
Use a parallel Transformer stage instead of Filter/Switch stages. (Filter/Switch stages
take more resources for execution; for example, in the case of the Filter stage the where clause
is evaluated at run time, creating the requirement for more resources and thereby
degrading job performance.)
Figure: Example of using a Transformer stage instead of using a filter stage. The filter condition is
given in the constraint section of the transformer stage properties.
Use a BuildOp stage only when the required logic cannot be implemented using the parallel
Transformer stage.
Avoid calling routines in derivations in the Transformer stage; implement the logic directly in the
derivation instead. This avoids the overhead of a procedure call.
Implement shared logic using stage variables and reference those stage variables in the derivations.
During processing, execution starts with the stage variables, then the constraints, and then the
individual columns. If there is a prerequisite formula that is used by both
constraints and individual columns, define it in a stage variable so that it
is evaluated once per row and reused in multiple places. If the
formula must differ for each and every column, however, it is advisable to place the code at the
column level rather than the stage-variable level.
Figure: Example of defining stage variables and using them in the derivations.
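The stage-variable advice can be pictured in plain code: evaluate a shared expression once per row, then reuse it, instead of recomputing it in every derivation. The formula and column names below are made up for illustration:

```python
def expensive_formula(amount):
    # Stands in for a prerequisite formula shared by a constraint
    # and several column derivations.
    return round(amount * 1.18, 2)

def process_row(row):
    # "Stage variable": computed once per row...
    total = expensive_formula(row["amount"])
    # ...then reused by the "constraint" and the "column derivations".
    if total <= 0:          # constraint: drop non-positive totals
        return None
    return {"total": total, "half": total / 2}

print(process_row({"amount": 100}))  # {'total': 118.0, 'half': 59.0}
```

Without the shared variable, `expensive_formula` would run once for the constraint and once per column, which is exactly the overhead stage variables eliminate.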
Figure: Sorting the input data on the grouping keys in an aggregator stage
The example shown in the figure is the properties window for an Aggregator stage that
finds the sum of a quantity column by grouping on the columns shown above. In such
scenarios we sort the input data on those same columns, so that records
with the same values for the grouping columns come together, thereby
increasing performance. Also note that if we are using more than one node, the
input data set should be properly partitioned so that similar records are available
on the same node.
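The sort-before-aggregate pattern corresponds to a streaming group-by: once the input is sorted on the grouping keys, equal keys are adjacent and each group can be summed in a single pass. A small Python sketch with made-up rows:

```python
from itertools import groupby
from operator import itemgetter

rows = [
    {"region": "East", "qty": 5},
    {"region": "West", "qty": 2},
    {"region": "East", "qty": 3},
]

# Sorting on the grouping key first is what makes groupby correct,
# just as the Aggregator benefits from sorted input.
rows.sort(key=itemgetter("region"))
sums = {k: sum(r["qty"] for r in g)
        for k, g in groupby(rows, key=itemgetter("region"))}
print(sums)  # {'East': 8, 'West': 2}
```

If the rows were not sorted first, `groupby` would emit the "East" group twice; the Aggregator's sorted-input mode relies on the same adjacency guarantee.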
Select only the required records, or remove unwanted rows as early as possible, so that the job
need not deal with unnecessary records that cause performance degradation.
Figure: Using the User-defined SQL option in ODBC stages to reduce the overhead on DataStage by
specifying the WHERE and ORDER BY clauses in the SQL used to get the data.
Details about the Conductor Node, Section Leaders, and Player processes in DataStage.
Refer to this link as well for more details: Job Run Time Architecture
Jobs developed with DataStage Enterprise Edition (EE) are independent of the actual
hardware and degree of parallelism used to run the job. The parallel Configuration File
provides a mapping at runtime between the job and the actual runtime infrastructure and
resources by defining logical processing nodes.
To facilitate scalability across the boundaries of a single server, and to
maintain platform independence, the parallel framework uses a multi-process
architecture that avoids platform-dependent threading calls. The actual runtime deployment
for a given job design is composed of a hierarchical relationship of operating system processes,
running on one or more physical servers.
Section Leaders (one per logical processing node): used to create and manage player
processes which perform the actual job execution. The Section Leaders also manage
communication between the individual player processes and the master Conductor Node.
Players: one or more logical groups of processes used to execute the data flow logic. All
players are created as groups on the same server as their managing Section Leader
process.
Conductor Node (one per job): the main process used to start up jobs, determine
resource assignments, and create Section Leader processes on one or more processing
nodes. It acts as the single coordinator for status and error messages, and manages orderly
shutdown when processing completes or in the event of a fatal error. The conductor process
runs on the primary server.
It is the main process that:
1. Starts up jobs
2. Determines resource assignments
3. Creates Section Leaders (which in turn create and manage the player
processes that perform the actual job execution)
4. Acts as the single coordinator for status and error messages
5. Manages orderly shutdown when processing completes or in the event of a fatal
error
When the job is initiated the primary process (called the conductor) reads the job
design, which is a generated Orchestrate shell (osh) script. The conductor also reads the
parallel execution configuration file specified by the current setting of the
APT_CONFIG_FILE environment variable.
Once the execution nodes are known (from the configuration file), the conductor starts a
coordinating process called a Section Leader on each node: by forking a child
process if the node is on the same machine as the conductor, or by remote shell execution
if the node is on a different machine from the conductor (things are a little more dynamic
in a grid configuration, but essentially this is what happens).
Communication between the conductor, section leaders and player processes in a
parallel job is effected via TCP.
Scenarios for calculating the processes:
Sample APT_CONFIG_FILE (the conductor node is the one assigned to the "conductor" pool):
{
  node "node1"
  {
    fastname "DevServer1"
    pools "conductor"
    resource disk "/datastage/Ascential/DataStage/Datasets/node1" {pools "conductor"}
    resource scratchdisk "/datastage/Ascential/DataStage/Scratch/node1" {pools ""}
  }
  node "node2"
  {
    fastname "DevServer1"
    pools ""
    resource disk "/datastage/Ascential/DataStage/Datasets/node2" {pools ""}
    resource scratchdisk "/datastage/Ascential/DataStage/Scratch/node2" {pools ""}
  }
}
Please find the different answers below:
For every job that starts there will be:
one (1) Conductor process (started on the conductor node),
one (1) Section Leader for each node in the configuration file, and
one (1) player process (may or may not be true) for each stage in your job, for each node.
So if you have a job that uses a two (2) node configuration file and has 3 stages, then your
job will have:
1 Conductor process
2 Section Leaders (2 nodes * 1 Section Leader per node)
6 player processes (3 stages * 2 nodes)
Your dump score may show that your job will run 9 processes on 2 nodes.
This kind of information is very helpful when determining the impact that a particular job
or process will have on the underlying operating system and system resources.
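The arithmetic above generalizes to a one-line formula. A tiny Python sketch, assuming the simplest case of exactly one player per stage per node (which, as noted above, may not always hold):

```python
def total_processes(nodes, stages):
    # 1 conductor + 1 section leader per node + 1 player per stage per node
    return 1 + nodes + stages * nodes

# The worked example from the text: 2-node configuration file, 3 stages.
print(total_processes(nodes=2, stages=3))  # 9
```

Comparing this number with the job's dump score is a quick sanity check on how a job will load the operating system.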
Posted by Devendra Kumar Yadav at 11:53 PM

Situations to choose Parallel or Server Datastage Jobs
5. When data volume is high, it is better to choose a parallel job than a server job.
A parallel job will be a lot faster than a server job even if it runs on a single node.
The obvious incentive for going parallel is data volume. Parallel jobs can remove
bottlenecks and run across multiple nodes in a cluster for almost unlimited
scalability. At this point parallel jobs become the faster and easier option. A
parallel Sort stage is a lot faster than the server equivalent, and a Transformer stage
in a parallel job is faster than the same transformations in a server job. Even on one
node, with a compiled Transformer stage, the parallel version was three times faster.
Even on a 1-node configuration that does not allow much parallel processing, we can
still get big performance improvements from an Enterprise Edition job. The improvements
can be multiplied tenfold or more if we work on 2-CPU machines with two
nodes in most stages.
6. Parallel jobs take advantage of both pipeline parallelism and partitioning
parallelism.
7. We can improve the performance of a server job by enabling inter-process row
buffering. This helps stages exchange data as soon as it is available on the link.
The IPC stage also helps a passive stage read data from another stage as soon as data is
available. In other words, stages do not have to wait for the entire set of records
to be read first and then transferred to the next stage. Link Partitioner and Link
Collector stages can be used to achieve a certain degree of partitioning
parallelism.
8. Lookup against a sequential file is possible in parallel jobs but not in server
jobs.
9. Datastage EE jobs are compiled into OSH (Orchestrate Shell script language).
OSH executes operators - instances of executable C++ classes, pre-built
components representing stages used in Datastage jobs.
Server Jobs are compiled into Basic which is an interpreted pseudo-code. This is
why parallel jobs run faster, even if processed on one CPU.
10. The major difference between InfoSphere DataStage Enterprise and Server editions
is that Enterprise Edition (EE) introduces parallel jobs. Parallel jobs support a
completely new set of stages, which implement the scalable and parallel data
processing mechanisms. In most cases parallel jobs and stages look similar to the
DataStage Server objects, but their capabilities are quite different. In rough
outline:
Parallel jobs are executable DataStage programs, managed and controlled by the
DataStage Server runtime environment.
Parallel jobs have built-in mechanisms for pipelining, partitioning, and parallelism.
In most cases no manual intervention is needed to implement these
techniques optimally.
Parallel jobs are a lot faster at ETL tasks like sorting, filtering, and aggregating.
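The inter-process row buffering described in point 7 can be sketched as a producer and consumer connected by a bounded queue: the downstream stage starts consuming rows as soon as they appear, without waiting for the full record set. This is a conceptual Python sketch, not DataStage code:

```python
import queue
import threading

link = queue.Queue(maxsize=4)   # bounded row buffer between two "stages"
SENTINEL = None
results = []

def producer():
    for i in range(10):
        link.put(i)             # rows flow downstream as soon as produced
    link.put(SENTINEL)          # signal end of data

def consumer():
    while True:
        row = link.get()
        if row is SENTINEL:
            break
        results.append(row * 2)  # downstream transformation

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
print(results)  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```

The small `maxsize` is the point: neither side ever holds the whole data set, which is the same effect row buffering and the IPC stage give a server job.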
Refer to this link to know more about parallel job stages: Parallel Jobs Stages

Posted by Devendra Kumar Yadav at 11:02 PM
Surrogate Key Generator Implementation
Surrogate Key Generator Implementation in Datastage 8.1, 8.5 & 9.1
The Surrogate Key Generator stage is a processing stage that generates surrogate key
columns and maintains the key source.
A surrogate key is a unique primary key that is not derived from the data that it
represents, therefore changes to the data will not change the primary key. In a star
schema database, surrogate keys are used to join a fact table to a dimension table.
The Surrogate Key Generator stage can be used to:
1. Create or delete the key source before other jobs run
2. Update a state file with a range of key values
3. Generate surrogate key columns and pass them to the next stage in the job
4. View the contents of the state file
Generated keys are 64-bit integers, and the key source can be a state file or a database
sequence.
If you want the SCD stage to generate new surrogate keys, use a key
source that you created with a Surrogate Key Generator stage, as
described in Surrogate Key Generator.
If you want to use your own method to handle surrogate keys, you should
derive the surrogate key column from a source column.
You can replace the dimension information in the source data stream with the surrogate key
value by mapping the surrogate key column to the output link.
Drag the Surrogate Key Generator stage from the palette onto the parallel job canvas, with no input
or output links.
Double-click the stage and click on the Properties tab.
Properties:
Key Source Action = Create
Source Type = Flat File or Database sequence (in this case we are using a flat file). When
you run the job it will create an empty file.
If you want to check the contents, change View State File = Yes and check the job
log for details: skey_genstage,0: State file /tmp/skeycutomerdim.stat is empty.
If you try to create the same file again, the job will abort with the following error:
skey_genstage,0: Unable to create state file /tmp/skeycutomerdim.stat: File exists.
Deleting the key source:
If the state file exists we can update it; otherwise we can create it and then update it.
We are using a SkeyValue parameter to update the state file via a Transformer stage.
Now that we have created the state file, we will generate keys using it.
Click on the Surrogate Key Generator stage, go to Properties, and type a name for the
surrogate key column in the Generated Output Column Name property.
In the Row Generator stage we are using 10 rows, and hence when we run the job we see 10
surrogate key values in the output. I have updated the state file with 100, and below is the output.
If you want to control where key generation starts, you can use the following properties in the
Surrogate Key Generator stage.
A. If the key source is a flat file, specify how keys are generated:
1. To generate keys in sequence from the highest value that was last used, set the
Generate Key from Last Highest Value property to Yes. Any gaps in the key range
are ignored.
2. To specify a value to initialize the key source, add the File Initial Value property to
the Options group, and specify the start value for key generation.
3. To control the block size for key ranges, add the File Block Size property to the
Options group, set this property to User specified, and specify a value for the block
size.
B. If there is no input link, add the Number of Records property to the Options group, and
specify how many records to generate.
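The flat-file key source behavior described above can be imitated in a few lines of Python: a state file remembers the last highest value, and each run hands out the next range of keys. The file name and plain-text format are illustrative only; the stage's real state file format is internal to DataStage:

```python
import os

def next_keys(state_path, count, start_value=1):
    # Read the last highest key from the state file, if it exists
    # (mimics "Generate Key from Last Highest Value = Yes").
    last = start_value - 1
    if os.path.exists(state_path):
        with open(state_path) as f:
            last = int(f.read().strip() or last)
    keys = list(range(last + 1, last + 1 + count))
    # Persist the new highest value so the next run continues the sequence.
    with open(state_path, "w") as f:
        f.write(str(keys[-1]))
    return keys
```

On a fresh state file, `next_keys(path, 3)` returns keys 1 through 3, and a second call continues from 4, which is the continuity guarantee the key source provides across job runs.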