
DataStage Best Practices

Introduction..................................................................................................................... 6
Objective.............................................................................................................. 6
Background.......................................................................................................... 6

Understanding a job's UNIX environment....................................................7

Best practices................................................................................................................. 9
Who has locked the job?......................................................................................9
How to use DS.TOOLS to release the lock..................................................10
Type conversion................................................................................................. 10
Decimal output................................................................................................... 11
How to copy or move data set file......................................................................11
NLS related options.....................................................................................12
COMMAND: copy | cp source-descriptor-file target-descriptor-file..............12
COMMAND: delete | del | rm [ -options... ] descriptor-files........................13
COMMAND: truncate [ -options... ] descriptor-files....................................13
COMMAND: dump [ -options... ] descriptor-files........................................14
COMMAND: describe | lp | lf | ls | ll [ -options... ] descriptor-files...............15
COMMAND: diskinfo [ -a | -np nodepool | -n node... ] diskpool...................16
COMMAND: check......................................................................................16

Job design..................................................................................................................... 17
Modular development........................................................................................17
Default and explicit type conversions.................................................................18
Source type to target type conversions.......................................................18
Sequential file stages (Import and Export).........................................................18
Improving sequential file performance.........................................................19
Partitioning sequential file reads..................................................................19
Sequential file (Export) buffering.................................................................19
Reading from and writing to fixed-length files..............................................19
Reading bounded-length VARCHAR columns.............................................20
Transformer usage guidelines............................................................................20
Choosing appropriate stages.......................................................................20
Transformer NULL handling and reject link..................................................21
Transformer derivation evaluation...............................................................22
Conditionally aborting jobs...........................................................................22
Transformer decimal arithmetic...................................................................22
Optimizing Transformer expressions and stage variables...........................23
Modify stage....................................................................................................... 26
Join stage........................................................................................................... 26

Reduce Re-partition.....................................................................................27
Sorting......................................................................................................... 27
Joining tables...............................................................................................27
Lookup stage vs. Join stage...............................................................................27
Capturing unmatched records from a Join.........................................................27
The Funnel stage............................................................................................... 28
The Aggregator stage........................................................................................28

Version Control of ETL Jobs........................................................................................29


Change control process during development.....................................................29
Change Log.................................................................................................29
Control area or working area.......................................................................29
Change control process during QA..............................................................29

Database Stages........................................................................................................... 30
Native Parallel vs. Plug-In Database stage usage.............................................30
Native Parallel Database stages..................................................................30
Appropriate use of SQL and DataStage stages.................................................31
Optimizing Select lists........................................................................................ 31
Designing for restart........................................................................................... 32
Database OPEN and CLOSE commands..........................................................32
Database Sparse Lookup vs. Join.....................................................................32

Performance tips for job design..................................................................................34

Performance management........................................................................................... 36

Performance monitoring and tuning...........................................................................37


The Job Monitor.................................................................................................37
Reading a score dump................................................................................37
iostat.................................................................................................................. 37
vmstat................................................................................................................ 38
Load average for all versions.............................................................................38
Platform-specific tuning......................................................................................38
Tru64-specific tuning...................................................................................38
HP-UX-specific tuning.................................................................................38
Selectively rewriting the flow..............................................................................38
Eliminating repartitions.......................................................................................39
Ensuring data is evenly partitioned....................................................................39
Buffering for all versions.....................................................................................40

Job performance analysis............................................................................................41


Why upsert is slow even though the database performance is good?.........45
How does the upsert work?.........................................................................45
Identifying performance bottlenecks..........................................................................48

Bottleneck resolution................................................................................................... 50
Combinable operators........................................................................................ 50
Disk I/O.............................................................................................................. 50
Buffering............................................................................................................. 51
Sequential file stages.........................................................................................52
Hashed File stages............................................................................................ 53
Creating hashed files...................................................................................54
Using hashed file caching............................................................................56
Preloading hashed files to memory.............................................................57
Very large hashed files................................................................................57
DB/2 table stages............................................................................................... 57
Reading from DB/2......................................................................................57
Writing to DB/2............................................................................................58
Reference lookups to DB/2..........................................................................59
Transformer stages............................................................................................ 59
Routines and transforms....................................................................................61
Shared containers..............................................................................................61
Sequences......................................................................................................... 62

Production support process........................................................................................ 63


SWAT................................................................................................................. 63
Production support............................................................................................. 63
Version control................................................................................................... 64
Work flow diagram.......................................................................................64
Introduction
This chapter provides performance tuning tips and best practices for the design and
development of real-world DataStage Enterprise Edition (EE) jobs in a UNIX
environment.
The primary audience for this document is DataStage developers who have been trained
in Enterprise Edition. Information in certain sections may also be relevant for Technical
Architects and System Administrators.

Objective
The PROJECT warehouse has nearly 400 jobs, developed by various groups of
developers from different parts of the world. Although the majority of the development
is done by XXXXi, other regions such as XXXX also contribute to enhancing the
PROJECT warehouse product. The collaborative approach of the PROJECT engagement
model encourages the involvement of experts across the globe, prompting the need for a
reference handbook that these various groups of developers can use.

Background
During development, we observe that developers are driven by certain factors:
Tight schedules and a demanding environment
ETL development limited to ensuring that output data matches the design specifications
Unexpected data volumes affecting performance, where performance was not a design factor
Usage of appropriate stages
Usage of SQL
Considering that the ETL and data warehouse environment is bound to grow and scale
in the long term, we need to make the DataStage jobs run as efficiently as possible.
Understanding a job's UNIX environment
DataStage EE provides a number of environment variables to control how the jobs
operate on a UNIX system. In addition to providing required information, environment
variables can be used to enable or disable various DataStage features, and to fine tune
performance settings. Although UNIX environment variables can be set in multiple
places, there is a defined order of precedence that is evaluated when a job's actual
environment is established at runtime, as given below:
1. The daemon for managing client connections to the DataStage server engine is
called dsrpcd. By default (in a root installation), dsrpcd is started when the server is
installed, and should start whenever you restart your machine.
You can also start and stop dsrpcd manually, using the command
$DSHOME/bin/uv -admin.

By default, DataStage jobs inherit the dsrpcd UNIX environment, which is set in
the /etc/profile and $DSHOME/dsenv scripts.

Note
Client connections DO NOT pick up per-user environment
settings from their $HOME/.profile script.
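For example, a project-wide default such as the parallel configuration file can be set in
$DSHOME/dsenv so that all jobs inherit it. This is a minimal sketch; the path shown is an
assumption, and dsrpcd typically needs to be restarted before new client connections pick
up the change:

# added to $DSHOME/dsenv (illustrative path)
APT_CONFIG_FILE=/apps/datastage/Configurations/default.apt
export APT_CONFIG_FILE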

2. Environment variable settings for particular projects can be set in the DataStage
Administrator. Any project-level settings for a specific environment variable will
override any settings inherited from dsrpcd.

Caution
When you are migrating projects between machines or
environments, it is important to note that project-level
environment variable settings are not exported when a project
is exported. Any project-level environment variables must be
set for new projects.

3. Within DataStage Designer, environment variables may be defined for a particular
job using the Job Properties dialog box.
Any job-level setting for a specific environment variable overrides any setting
inherited from dsrpcd or from project-level defaults.
To avoid hard-coding default values for job parameters, there are two special values that
can be used for environment variables within job parameters: $ENV and $PROJDEF.
Both of these cause special actions that set the values at the time the job is
started:

4. $ENV causes the value of the named environment variable to be retrieved from the
operating system of the job environment. Typically, this is used to pick up values set
in the operating system outside of DataStage.

Note
$ENV should not be used for specifying the default
$APT_CONFIG_FILE value because, during job development,
the DataStage Designer parses the corresponding parallel
configuration file to obtain a list of node maps and constraints
(advanced stage properties).

5. $PROJDEF causes the project default value for the environment variable (as shown
on the DataStage Administrator client) to be picked up and used to set the
environment variable and job parameter for the job.

Best practices

Who has locked the job?


A job may be locked due to any of the following reasons:
Open in connected DS Designer
Properties open in connected DS Manager
Monitor open in connected DS Director
Locked by defunct process
To discover who is holding the lock, the best solution is to use the following command:
list_readu
SAMPLE:

We can gather the following information from the sample given above:


JOB_NAME: test001
Holding PID: 6475
DS designer client machine host name: RYANPIG
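The following is a hedged sketch of how such a check might be run on the DataStage
server; the exact location of the list_readu utility and its output columns vary by release,
and the job name shown is the one from the sample:

cd `cat /.dshome`
. ./dsenv
bin/list_readu | grep test001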

How to use DS.TOOLS to release the lock
SAMPLE:
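The following is a hedged sketch of the typical sequence, run on the DataStage server
(menu wording and option numbers vary by release):

cd `cat /.dshome`
. ./dsenv
bin/uvsh
>DS.TOOLS

From the DS.TOOLS menu, choose the option to administer processes and locks, and then
the option to clear the locks held by the process identified earlier (PID 6475 in the sample
above).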

Type conversion
All columns in an external file have the VARCHAR data type, and you need to convert
them to the appropriate types according to your requirements; for example, convert
VARCHAR to Decimal, or convert Decimal to VARCHAR for input.
DataStage EE provides a lot of in-line type conversion functions such as the following:
DecimalToString
StringToDate
StringToDecimal
You can use the IsValid function to check whether a value is valid for a given data type.
For example:
IsValid("date", "2008-08-08") can be used to check whether the string "2008-08-08" is a valid date.
To convert a Decimal to a Date, first use DecimalToString to convert the Decimal to a
String, then use the IsValid function to check that the result is a valid date, and finally
use StringToDate to convert it to a Date.
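As a hedged illustration, a Transformer output derivation for such a Decimal-to-Date
conversion might look like the following (the column name in.dt_dec is illustrative, and a
format string may need to be passed to StringToDate if the value does not match the
default date format):

If IsValid("date", DecimalToString(in.dt_dec)) Then StringToDate(DecimalToString(in.dt_dec)) Else SetNull()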

Decimal output
When you use the Aggregator stage, the output type of a calculation column is normally
double (unless Default to Decimal Output has been set); setting this property overrides the
default and makes the output a decimal. The value of this property denotes the precision
and scale of that decimal and should be specified as shown below:

Decimal Output

How to copy or move data set file


To copy, move, or dump a data set file, you can use the orchadmin command with the
right $APT_CONFIG_FILE set.
COMMAND HELP:

NAME
orchadmin - delete, copy, describe and dump ORCHESTRATE files

SYNOPSIS
orchadmin command [ -options... ] descriptor-files...

The following table provides options that can be used with the orchadmin command.

Options Description
orchadmin [-help] Prints help message for all commands

orchadmin -f command-file Executes commands from specified file

orchadmin - Executes commands from standard input

DESCRIPTION

orchadmin executes commands which delete, copy, and describe ORCHESTRATE files.


These commands may be given on the command line or read from a file or the standard
input.
OPTIONS:

Options Description
Command delete, copy, describe, dump or check

-f command-file Path of a file containing orchadmin commands

- Read commands from the standard input as if it were a command file

-help | -h Write usage information to the standard output

The file may have multiple commands separated by semicolons. A command may be
spread over multiple lines. C and C++ style comments and csh style quotation marks are
allowed.

NLS related options


OPTIONS:

The following table provides the NLS related options along with their description.

NLS Options Description

-input_charset map-name Specifies the encoding of option values

-output_charset map-name Specifies the encoding of orchadmin output

-os_charset map-name Specifies the encoding of data passed to or received from
the operating system via char *

-escaped Allows command line characters to be presented in a two-byte
Unicode hexadecimal format

COMMAND: copy | cp source-descriptor-file target-descriptor-file


This command copies the schema, contents and preserve-partitioning flag of the
specified ORCHESTRATE file dataset.
If the preserve-partitioning flag is set, the copy has the same number of partitions
and record order as the original.

If the target file already exists, it is truncated first.
If the preserve-partitioning flag of the source file is set and the target file already
exists, it must have the same number of partitions as the source file.
The copy command has no options. A warning message is issued if the target does not
already exist. This is a bug, not a feature.
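EXAMPLE:

To copy source.ds to backup.ds with the appropriate configuration file in effect (the file
names and the path are illustrative), you can use the following commands:
export APT_CONFIG_FILE=/apps/datastage/Configurations/default.apt
orchadmin cp source.ds backup.ds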

COMMAND: delete | del | rm [ -options... ] descriptor-files...


This command deletes the specified descriptor files and all of their data files.
OPTIONS:

The following table provides options that can be used with the delete command.

Options Description
-f Force. Proceed even if some partitions of the dataset are on nodes
that are inaccessible from the current configuration file. This
leaves orphan data files on those nodes. They must be deleted by
some other means.

-x Use the system configuration file rather than the one stored in the
dataset.

EXAMPLE:

To delete all datasets in the current directory that end in .ds, you can use the following
command:
orchadmin rm *.ds

COMMAND: truncate [ -options... ] descriptor-files...


This command removes data from the specified datasets.
OPTIONS:

The following table provides options that can be used with the truncate command.

Options Description
-f Force. Proceed even if some partitions of the dataset are on nodes
that are inaccessible from the current configuration file. This
leaves orphan data files on those nodes. They must be truncated
by some other means.

-x Use the system configuration file rather than the one stored in the
dataset.

-n segment Leave this many segments. The default is 0.

EXAMPLE:

To truncate big.ds to 10 segments, you can use the following command:


orchadmin truncate -n 10 big.ds

To remove all data from small.ds, you can use the following command:
orchadmin truncate small.ds

COMMAND: dump [ -options... ] descriptor-files...


This command dumps the specified ORCHESTRATE parallel files as text to the standard
output. If no options are specified, all records are dumped in order from the first record
of the first partition to the last record of the last partition. Each field value is followed by
a space, and each record is followed by a newline. Specific top-level fields may be
dumped with the -field option.
OPTIONS:

The following table provides options that can be used with the dump command.

Options Description
-field name Dump the specified top-level field. The default is to dump all
fields. This option can occur multiple times. Each occurrence
adds to the list of fields.

-name Precede each value by its field name and a colon.

-n numrec Limit the number of records dumped per partition.


The default is not to limit.

-part N Dump only the specified partition.


The default is to dump all partitions.

-p period Dump every N'th record in a partition, starting with the first
record not skipped (see -skip). The period must be greater than 0.
The default is 1.

-skip N Skip the first N records in each partition.


The default is 0.

-x Use the system configuration file rather than the one stored in the
dataset.

If an option occurs multiple times, the last one takes effect. The -field option is an
exception, where each occurrence adds to the list of fields to be dumped.
EXAMPLE:

To dump all records of all partitions of a parallel file named small.ds, preceding each
value by its field name and a colon, you can use the following command:
orchadmin dump -name small.ds

To dump the value of the customer field of the first 99 records of partition 0 of
big.ds, you can use the following command:
orchadmin dump -part 0 -n 99 -field customer big.ds

COMMAND: describe | lp | lf | ls | ll [ -options... ] descriptor-files...


This command prints a report about each of the specified parallel files:
lp = describe -p
lf = describe -f
ls = describe -s
ll = describe -l
OPTIONS:

The following table provides options that can be used with the describe command.

Options Description
-p List partitioning information (except for datafile information)

-c Print the stored configuration file, if any

-f List the data files

-s Print the schema

-x Use the system configuration file rather than the one stored in the
dataset

-e Describe segments individually

-v Describe all segments, valid or otherwise

-d Print numbers exactly, not in pretty form

-l Means -p -f -s -e -v -c

EXAMPLE:

To list the partitioning information, data files and schema of file1 and file2, you can use
the following command:
orchadmin ll file1 file2

COMMAND: diskinfo [ -a | -np nodepool | -n node... ] diskpool
This command prints a report about the specified disk pool.
OPTIONS:

The following table provides options that can be used with the diskinfo command.

Options Description
-a Print information for all nodes

-np Print information for just the specified node pool

-n Print information for the specified nodes

-q Print summary of information only

If no options are supplied, the default node pool is used.


EXAMPLE:

To describe disk pool pool1 in node pool bignodes, you can use the following command:
orchadmin diskinfo -np bignodes pool1

COMMAND: check
This command checks the configuration file for any problems. The command, check, has
no options.
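EXAMPLE:

To check the configuration file currently in effect (typically the one referenced by
$APT_CONFIG_FILE), you can use the following command:
orchadmin check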

Job design
The ability to process large volumes of data in a short period of time depends on all
aspects of the flow and environment being optimized for maximum throughput and
performance. Performance tuning and optimization is an iterative process that begins at
job design and unit tests, proceeds through integration and volume testing, and continues
throughout an application's production lifecycle.
You need to note the following points regarding job design:
When writing intermediate results that will only be shared between DataStage EE
jobs, always write to persistent DataSets.
DataSets achieve end-to-end parallelism across jobs by writing data in
partitioned form, retaining sort order, in EE-native format without the overhead
of format conversion or serial I/O.
DataSets should be used to create restart points in the event that a job (or sequence)
needs to be re-run.

Caution
Because datasets are platform and configuration-specific, you
should not use them for long-term backup and recovery of
source data.

Depending on the available system resources, it may be possible to optimize overall


processing time by allowing smaller jobs to run concurrently. However, care must be
taken to plan for scenarios when source files arrive later than expected, or need to be
reprocessed in the event of a failure.

Modular development
You should use modular development techniques to maximize the re-use of DataStage
jobs and components. Two such techniques are the following:
Job Parameterization
Job parameterization allows a single job design to process similar logic instead
of creating multiple copies of the same job. In v7.x, the Multiple-Instance job
property allows multiple invocations of the same job to run simultaneously.
Parallel Shared Containers
Starting with DataStage V7, Parallel Shared Containers allow common logic to
be shared across multiple jobs.
For maximum component re-use, enable runtime column propagation at the
project level and for every stage within the parallel shared container. This allows
the container input and output links to contain only the columns relevant to the
container processing.

Using runtime column propagation, any additional columns can be passed
through the container at runtime without the need to separate and remerge.

Note
Parallel Shared Containers are inserted when a job is
compiled. If the shared container is changed, the Usage
Analysis and Multi-Job Compile tools can be used to
recompile jobs that use a shared container.

Default and explicit type conversions


DataStage Enterprise Edition provides a number of default conversions and conversion
functions when mapping from a source to a target data type. Default type conversions
take place across the output mappings of any EE stage.

Source type to target type conversions


The conversion of numeric data types may result in a loss of range and cause incorrect
results, depending on the source and result data types. In these instances, DataStage EE
displays a warning message in the job log.
When converting from variable-length to fixed-length strings, EE pads the remaining
length with NULL (ASCII zero) characters.
You can use the environment variable $APT_STRING_PADCHAR to change the
default pad character from an ASCII NULL (0x0) to another character; for example,
an ASCII space (0x20) or a Unicode space (U+0020).
As an alternate solution, you can use the PadString Transformer function to pad a
variable-length (VARCHAR) string to a specified length using a specified pad
character.

Note
PadString does not work with fixed-length (CHAR) string
types. You must first convert a Char string type to a
VARCHAR type before using PadString.

Some stages (for example, SequentialFile and DB2/UDB Enterprise targets) allow the
pad character to be specified in their stage or column definition properties.
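For example, a hedged sketch of changing the default pad character to an ASCII space at
the project or job level (the hexadecimal notation shown is the commonly used form;
verify it against your release):

APT_STRING_PADCHAR=0x20
export APT_STRING_PADCHAR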

Sequential file stages (Import and Export)


This section provides information on the sequential file stages during import and export
of data.

Improving sequential file performance
If the source file is fixed width, you can use the Readers per Node option to read a single
input file in parallel at evenly-spaced offsets. Note that in this manner, input row order is
not maintained.
If the input sequential file cannot be read in parallel, performance can still be improved
by separating the file I/O from the column parsing operation. To accomplish this, define
a single large string column for the non-parallel Sequential File read, and then pass this
to a Column Import stage to parse the file in parallel. The formatting and column
properties of the Column Import stage match those of the Sequential File stage.
On heavily-loaded file servers or some RAID/SAN array configurations, you can use the
environment variables $APT_IMPORT_BUFFER_SIZE and $APT_EXPORT_BUFFER_SIZE
to improve I/O performance. These settings specify the size of the read (import) and
write (export) buffer size in Kbytes, with a default of 128 (128K). Increasing this may
improve performance.
Finally, in some disk array configurations, setting the environment variable
$APT_CONSISTENT_BUFFERIO_SIZE to a value equal to the read/write size in bytes can
significantly improve performance of Sequential File operations.
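For example, a hedged sketch that raises both buffers to 1 MB (the values are illustrative
and should be validated against your disk subsystem):

export APT_IMPORT_BUFFER_SIZE=1024
export APT_EXPORT_BUFFER_SIZE=1024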

Partitioning sequential file reads


You must take care when choosing the appropriate partitioning method from a
Sequential File read:
Don't read from a Sequential File using SAME partitioning! Unless more than one
source file is specified, SAME will read the entire file into a single partition, making
the entire downstream flow run sequentially (unless it is later repartitioned).
When multiple files are read by a single Sequential File stage (using multiple files,
or by using a File Pattern), each file's data is read into a separate partition. It is
important to use ROUND-ROBIN partitioning (or other partitioning appropriate to
downstream components) to evenly distribute the data in the flow.

Sequential file (Export) buffering


By default, the Sequential File (export operator) stage buffers its writes to optimize
performance. When a job completes successfully, the buffers are always flushed to disk.
The environment variable $APT_EXPORT_FLUSH_COUNT allows the job developer to
specify how frequently (in number of rows) the Sequential File stage flushes its internal
buffer on writes. Setting this value to a low number (such as 1) is useful for real-time
applications, but there is a small performance penalty associated with the increased I/O.
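For example, a hedged sketch for a near-real-time feed where every row should reach the
target file as soon as possible, at the cost of extra I/O:

export APT_EXPORT_FLUSH_COUNT=1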

Reading from and writing to fixed-length files


You must take care, particularly when processing fixed-length fields using the
Sequential File stage:
If the incoming columns are variable-length data types (for example, Integer,
Decimal, and VARCHAR), the field width column property must be set to match the

fixed-width of the input column. Double-click on the column number in the grid
dialog to set this column property.
If a field is nullable, you must define the NULL field value and length in the Nullable
section of the column property. Double-click on the column number in the grid
dialog to set these properties.
When writing fixed-length files from variable-length fields (for example, Integer,
Decimal, VARCHAR), you must set the field width and pad string column properties
to match the fixed-width of the output column. Double-click on the column number
in the grid dialog to set this column property.
To display each field value, use the print_field import property.

Reading bounded-length VARCHAR columns


You must take care when reading delimited, bounded-length VARCHAR columns
(VARCHARs with the length option set).
By default, if the source file has fields with values longer than the maximum VARCHAR
length, these extra characters are silently truncated. Starting with DataStage v7.01,
environment variable $APT_IMPORT_REJECT_STRING_FIELD_OVERRUNS directs
DataStage to reject records with strings longer than their declared maximum column
length.
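For example, a hedged sketch of enabling this behaviour at the project or job level
(setting the variable to 1 is an assumption; consult the documentation for your release for
the accepted values):

export APT_IMPORT_REJECT_STRING_FIELD_OVERRUNS=1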

Transformer usage guidelines


This section provides the guidelines on the usage of the Transformer stage.

Choosing appropriate stages


The parallel Transformer stage always generates C code, which is then compiled to a
parallel component. For this reason, it is important to minimize the number of
transformers, and to use other stages (Copy, Filter, Switch, and so on) when derivations
are not needed.
The Copy stage should be used instead of a Transformer for simple operations
including:
Job Design placeholder between stages (unless the Force option = true, EE
optimizes this out at runtime)
Renaming Columns
Dropping Columns
Default Type Conversions

Note
You can also perform the rename, drop (if runtime column
propagation is disabled), and default type conversions using the
output mapping tab of any stage.

NEVER use the BASIC Transformer stage in large-volume job flows. Instead,
user-defined functions and routines can expand parallel Transformer capabilities.
Consider, if possible, implementing complex derivation expressions that follow
regular patterns using Lookup tables instead of a Transformer with nested derivations.
For example, the derivation expression:
If A=0, 1, 2, 3 Then B=X
If A=4, 5, 6, 7 Then B=C
could be implemented with a lookup table containing values for column A and the
corresponding values of column B.
Optimize the overall job flow design to combine derivations from multiple
Transformers into a single Transformer stage when possible.
In DataStage V7 and later, you can use the Filter and/or Switch stages to separate
rows into multiple output links based on SQL-like link constraint expressions.
In DataStage V7 and later, you can use the Modify stage for non-default type
conversions, NULL handling, and character string trimming.
Buildops should be used instead of Transformers in the handful of scenarios where
complex reusable logic is required, or where existing Transformer-based job flows
do not meet performance requirements.

Transformer NULL handling and reject link


When evaluating expressions for output derivations or link constraints, the Transformer
rejects (through the reject link indicated by a dashed line) any row that has a NULL
value used in the expression.
To create a Transformer reject link in DataStage Designer, right-click on an output
link and choose Convert to Reject.
The Transformer rejects NULL derivation results because the rules for arithmetic
and string handling of NULL values are by definition undefined. For this reason,
always test for null values before using a column in an expression, for example:
If IsNull(link.col) Then ... Else ...
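As a hedged illustration with a concrete default value (the link and column names are
illustrative):

If IsNull(in.amount) Then 0 Else in.amount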

Note
If an incoming column is only used in a pass-through
derivation, the Transformer allows this row to be output.
DataStage V7 enhances this behaviour by placing warnings
in the log file when the discards occur.

Transformer derivation evaluation
Output derivations are evaluated BEFORE any type conversions on the assignment.
For example,
The PadString function uses the length of the source type, and not the target.
Therefore, it is important to make sure the type conversion is done before a row reaches
the Transformer.
For example,
TrimLeadingTrailing(string) works only if string is a VARCHAR field. Thus, the
incoming column must be type VARCHAR before it is evaluated in the Transformer.

Conditionally aborting jobs


You can use the Transformer to conditionally abort a job when incoming data matches a
specific rule.
Create a new output link that is to handle rows that match the abort rule.
Within the link constraints dialog box, apply the abort rule to this output link, and
set the Abort After Rows count to the number of rows allowed before the job
should be aborted.
Since the Transformer aborts the entire job flow immediately, it is possible that valid
rows have not been flushed from Sequential File (export) buffers, or committed to
database tables. It is important to set the Sequential File buffer flush or database
commit parameters.

Transformer decimal arithmetic


Starting with DataStage v7.01, the parallel Transformer supports internal decimal
arithmetic. When the columns in a Transformer stage are being evaluated, there are times
when internal decimal variables need to be generated in order to perform the evaluation.
By default, these internal decimal variables have a precision and scale of 38 and 10.
If more precision is required, you can set the environment variables
$APT_DECIMAL_INTERM_PRECISION and $APT_DECIMAL_INTERM_SCALE to the desired
range, up to a maximum precision of 255 and scale of 125.
By default, internal decimal results are rounded to the nearest applicable value.
You can use the environment variable $APT_DECIMAL_INTERM_ROUND_MODE to
change the rounding behaviour using one of the following keywords given in the table:

Keywords Description
ceil Rounds towards positive infinity.
Examples: 1.4 -> 2, -1.6 -> -1

floor Rounds towards negative infinity.
Examples: 1.6 -> 1, -1.4 -> -2

round_inf Rounds or truncates towards the nearest representable value, breaking ties by
rounding positive values toward positive infinity and negative values toward
negative infinity.
Examples: 1.4 -> 1, 1.5 -> 2, -1.4 -> -1, -1.5 -> -2

trunc_zero Discard any fractional digits to the right of the rightmost fractional digit
supported, regardless of sign. For example, if $APT_DECIMAL_INTERM_SCALE is
smaller than the result of the internal calculation, round or truncate to the scale size.
Examples: 1.56 -> 1.5, -1.56 -> -1.5
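For example, a hedged sketch of widening the intermediate decimal range and selecting a
rounding mode at the project or job level (the values shown are illustrative):

export APT_DECIMAL_INTERM_PRECISION=38
export APT_DECIMAL_INTERM_SCALE=15
export APT_DECIMAL_INTERM_ROUND_MODE=round_inf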

Optimizing Transformer expressions and stage variables


In order to write efficient Transformer stage derivations, it is useful to understand what
items get evaluated and when. The evaluation sequence is as follows:
Evaluate each stage variable initial value
For each input row to process:
Evaluate each stage variable derivation value,
Unless the derivation is empty
For each output link:
Evaluate each column derivation value
Write the output record
Next output link
Next input row
The stage variables and the columns within a link are evaluated in the order in which
they are displayed in the Transformer editor. Similarly, the output links are also
evaluated in the order in which they are displayed.
From this sequence, you can see that there are certain constructs that prove inefficient to
include in output column derivations, as they get evaluated once for every output column
that uses them. Such constructs are:
Where the same part of an expression is used in multiple column derivations
For example,
suppose multiple columns in output links want to use the same substring of an
input column, then the following test may appear in a number of output column
derivations:

IF (DSLINK1.col[1,3] = "001") THEN ...

In this case, the substring DSLINK1.col[1,3] is evaluated separately for each column
that uses it.
This can be made more efficient by moving the substring calculation into a stage
variable. By doing this, the substring is evaluated just once for every input row.
In this case, the stage variable definition would be:
DSLINK1.col[1,3]

and each column derivation would start with:


IF (StageVar1 = "001") THEN ...

In fact, this example could be improved further by also moving the string
comparison into the stage variable. The stage variable would be:
IF (DSLink1.col[1,3] = "001") THEN 1 ELSE 0

and each column derivation would start with:


IF (StageVar1) THEN ...

This reduces both the number of substring functions evaluated and string
comparisons made in the Transformer.
Where an expression includes calculated constant values
For example,
a column definition may include a function call that returns a constant value,
such as:
Str(" ", 20)

This returns a string of 20 spaces. In this case, the function gets evaluated every
time the column derivation is evaluated. It is more efficient to calculate the
constant value just once for the whole Transformer.
This can be achieved using stage variables. This function could be moved into a
stage variable derivation; but in this case, the function still gets evaluated once
for every input row. The solution here is to move the function evaluation into the
initial value of a stage variable.
A stage variable can be assigned an initial value from the Stage Properties
dialog/Variables tab in the Transformer stage editor. In this case, the variable
would have its initial value set to:
Str(" ", 20)

You then leave the derivation of the stage variable on the main Transformer page
empty. Any expression that previously used this function would be changed to
use the stage variable instead.

The initial value of the stage variable is evaluated just once, before any input
rows are processed. However, because the derivation expression of the stage
variable is empty, it is not re-evaluated for each input row. Therefore, its value
for the whole Transformer processing is unchanged from the initial value.
In addition to a function value returning a constant value, another example
would be part of an expression such as:
"abc" : "def"

As with the function-call example, this concatenation is evaluated every time the
column derivation is evaluated. Since the subpart of the expression is actually
constant, this constant part of the expression could again be moved into a stage
variable, using the initial value setting to perform the concatenation just once.
Where an expression requiring a type conversion is used as a constant, or it is
used in multiple places
For example,
an expression may include something like this:
DSLink1.col1+"1"

In this case, the "1" is a string constant, and so, in order to be able to add it to
DSLink1.col1, it must be converted from a string to an integer each time the
expression is evaluated. The solution in this case is just to change the constant
from a string to an integer:
DSLink1.col1+1

In this example, if DSLINK1.col1 were a string field, then, again, a conversion
would be required every time the expression is evaluated.
If this appeared just once in one output column expression, this would be fine.
However, if an input column is used in more than one expression, where it
requires the same type conversion in each expression, then it would be more
efficient to use a stage variable to perform the conversion once. In this case, you
would, for example, create an integer stage variable, specify its derivation to be
DSLink1.col1, and then use the stage variable in place of DSLink1.col1 wherever
that conversion would have been required.

Note
When using stage variables to evaluate parts of expressions,
the data type of the stage variable should be set correctly for
that context. Otherwise, needless conversions are required
wherever that variable is used.

Use stage variables for cases where a derivation value is used multiple times.
If the derivation is evaluated once and applies to only one target column, there is
no need to derive it in a stage variable and then assign it; instead, apply the
derivation logic in the target column derivation itself.

Modify stage
The Modify stage is the most efficient stage available; any transformation that can be
implemented in Modify will run more efficiently there than in a Transformer.
Transformations that touch a single field, such as keep/drop, type conversions, some
string manipulations, and NULL handling, are the primary operations, which should be
implemented using Modify instead of Transformer derivations.
Releases beyond 7.0 may be able to automatically perform this optimization. In addition,
the Output Mapping tab of any stage generates an underlying modify.
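As a hedged sketch, a Modify stage Specification covering such single-field operations
might contain entries like the following (the column names are illustrative, and the exact
conversion and null-handling syntax should be verified against the modify operator
documentation for your release):

keep CUST_ID, CUST_NAME, CUST_BAL
CUST_BAL:int32 = int32_from_decimal(CUST_BAL)
CUST_NAME = handle_null(CUST_NAME, 'UNKNOWN')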
Starting with v7.01, the function string_trim has been added to Modify, with the
following syntax:
stringField=string_trim[character, direction, justify] (string)

You can use this function to remove the characters used to pad variable-length strings
when they are converted to fixed-length strings of greater length.
By default, these characters are retained when the fixed-length string is then converted
back to a variable-length string.
The character argument is the character to remove. By default, this is NULL. The value
of the direction and justify arguments can be either begin or end; direction defaults to
end, and justify defaults to begin. Justify has no effect when the target string has variable
length.
The following example removes all leading ASCII NULL characters from the beginning
of name and places the remaining characters in an output variable-length string with the
same name:
name:string = string_trim[NULL, begin](name)

The following example removes all trailing Z characters from color, and left-justifies
the resulting hue fixed-length string:
hue:string[10] = string_trim[Z, end, begin](color)

The Modify stage uses the syntax of the underlying modify operator.

Join stage
This section provides information on the Join stage.

Reduce Re-partition
Understanding the data is critical for selecting the right partition key to allow high
scalability and parallelism in the job design.
You should define partitioning so as to minimize re-partitioning. To allow parallel
processing in a Join, KEY-based partitioning is required. Therefore, it proves effective to
set the right partition key from the point of extraction, whether reading from files or
from an RDBMS.

Sorting
The Sort stage provides better flexibility, since you can do the following:
Skip keys that are already sorted.
Set the required buffer size.

Joining tables
To ensure parallel processing and good performance, the Join stage does the following by
default:
Performs KEY-based partitioning.
Inserts a Sort operator to sort the Join keys; this can be disabled with
$APT_SORT_INSERTION_CHECK_ONLY (a project- or job-level setting), assuming
proper sorting has already been done in a previous stage (see the sketch after this list).
All of the settings above require a proper understanding of the data and of the functions of
the stages involved, and should be decided on a case-by-case (or job-by-job) basis.
Typically, familiarity with the jobs and the data is required for fine and effective tuning of
jobs.
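A hedged sketch of that setting (defining the variable, for example to 1, makes
automatically inserted sorts verify the existing sort order rather than re-sort the data):

export APT_SORT_INSERTION_CHECK_ONLY=1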

Lookup stage vs. Join stage


The Lookup stage is most appropriate when the reference data for all lookup stages in a
job is small enough to fit into available physical memory. Each lookup reference
requires a contiguous block of physical memory. If the datasets are larger than available
resources, the Join or Merge stage should be used.
If the reference to a Lookup is directly from a DB2 or Oracle table, and the number of
input rows is significantly smaller than the number of reference rows, a Sparse Lookup
may be appropriate.

Capturing unmatched records from a Join


The Join stage does not provide reject handling for unmatched records (such as in an
Inner Join scenario). If unmatched rows must be captured or logged, you must perform
an Outer join operation.

In an Outer join scenario, all rows on an outer link (for example, Left Outer, Right
Outer, or both links in the case of Full Outer) are output regardless of match on key
values.
During an Outer Join, when a match does not occur, the Join stage inserts NULL values
into the unmatched columns. Take care to change the column properties to allow NULL
values before the Join; this is most easily done by inserting a Copy
stage and mapping a column from NON-NULLABLE to NULLABLE.
You can use a Filter stage to test for NULL values in unmatched columns.
In some cases, it is simpler to use a Column Generator to add an indicator column,
with a constant value, to each of the outer links and test that column for the constant
after you have performed the join. This is also handy with Lookups that have multiple
reference links.
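As a hedged illustration of the indicator-column technique: if a Column Generator adds a
constant column REF_IND with the value 1 to the reference (right) link before a Left
Outer Join, a downstream Filter stage can route rows to separate output links with Where
Clauses such as the following (the column name is illustrative):

REF_IND = 1 (matched rows)
REF_IND IS NULL (unmatched rows)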

The Funnel stage


The funnel requires all input links to have identical schemas (column names, types,
attributes including nullability). The single output link matches the input schema.
DataStage V7 introduces the continuous (non-blocking) Funnel, which allows
multiple inputs to be combined without performance impact from any individual
input link.
In DataStage V6.x, when combining a large number of inputs, the Funnel stage is
best used at the start of a job, combining multiple datasets created by earlier jobs.

The Aggregator stage


By default, the output data type of a parallel Aggregator stage calculation or
recalculation column is double.
Starting with v7.01 of DataStage EE, the new optional property
Aggregations/Default to Decimal Output specifies that all calculation or
recalculations result in decimal output of the specified precision and scale.
You can also specify that the result of an individual calculation or recalculation is
decimal by using the optional Decimal Output sub property.
You should not use fact (measure) fields in the GROUP BY; use only the dimension
fields.

Version Control of ETL Jobs

Change control process during development


This section provides information on the change control process followed during
development.

Change Log
It is always recommended to put a change log or update history in the ETL job annotation.
Details should include:
Updated by
Update Date
Details in brief, explaining what has been changed.
In addition, a separate document (for example, an Excel sheet) should be maintained,
listing job names and change details.

Control area or working area


It is always recommended to have a separate category, such as a Control area or Working
area, in the DataStage environment to hold working copies of ETL jobs.
Once the job is changed and peer reviewed, you can promote it to the main
deliverable category.
To make sure that somebody else is not working on the same job, developers should
make it a practice to always check the working area to find out whether an ETL job is
already being changed by somebody else.

Change control process during QA


Take a job from the QA environment and promote it into the Development environment;
mark the original job as being updated by a developer so that other developers are
notified that the ETL job is being updated by somebody.
While releasing any job to QA, always use DataStage Version Control with a proper
release number and batch naming conventions, including comments.
If the ETL changes are QA or System Integration Testing (SIT) fixes, you can add
the defect number in the details.
The same Control area or Working area practice also applies during the QA phase.

Database Stages

Native Parallel vs. Plug-In Database stage usage


Starting with release 7, DataStage EE offers database connectivity through native
parallel and plug-in stage types. For some databases (DB2, Informix, Oracle, and
Teradata), multiple stage types are available.
In general, for maximum parallel performance, scalability and features, it is best to use
the native parallel database stages in a job design. Native parallel database stages are:

Native Parallel Database stages


DB2/UDB Enterprise
Informix Enterprise
Oracle Enterprise
Teradata Enterprise
Plug-in stage types are intended to provide connectivity to database configurations that
are not offered by the native parallel stages. Because plug-in stage types cannot read in
parallel, and cannot span multiple servers in a clustered or Massively Parallel Processing
(MPP) configuration, you should only use them when it is not possible to use a native
parallel stage. Plug-in database stages include:
Dynamic RDBMS (V7.1 and later)
DB2/UDB API
DB2/UDB Load
Informix CLI
Informix Load
Informix XPS Load
Oracle OCI Load
RedBrick Load
Sybase IQ12 Load
Sybase OC
Teradata API
Teradata MultiLoad (MultiLoad)
Teradata MultiLoad (TPump) (v7.1 and later)

Due to the exceptions to this rule (especially with Teradata), specific guidelines on when
to use the various stage types are provided in the database-specific topics in this section.

Appropriate use of SQL and DataStage stages


When using relational database sources, there is often a functional overlap between SQL
and DataStage stages. Although it is possible to use either SQL or DataStage to solve a
given business problem, the optimal implementation involves leveraging the strengths of
each technology to provide maximum throughput and developer productivity.
While there are extreme scenarios when the appropriate technology choice is clearly
understood, there may be gray areas where you need to base your decision on factors
such as developer productivity, metadata capture and re-use, and ongoing application
maintenance costs. The following guidelines can assist with the appropriate use of SQL
and DataStage technologies in a given job flow:
When possible, use a SQL filter (WHERE clause) to limit the number of rows sent
to the DataStage job. This minimizes impact on network and memory resources, and
leverages the database capabilities.
Use a SQL JOIN to combine data from tables with a small number of rows in the
same database instance, especially when the joined columns are indexed.
When combining data from very large tables, or when the source includes a large
number of database tables, the efficiency of the DataStage EE Sort and Join stages
can be significantly faster than an equivalent SQL query. In this scenario, it can still
be beneficial to use database filters (WHERE clause) if appropriate.
Avoid the use of database stored procedures (for example, Oracle PL/SQL) on a per-
row basis within a high-volume data flow. For maximum scalability and parallel
performance, it is best to implement business rules natively using DataStage
components.
Wherever possible, use SQL to remove duplicate records instead of using Remove
Duplicates stages.
Avoid special functions in the WHERE clause (for example, HEX).
Avoid fact (measure) columns in the GROUP BY clause; use dimension columns instead.
If possible, reduce the number of DB2 stages used in a job.
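For example, the first two guidelines in this list might translate into a single query pushed
down to the source database (a hedged sketch in DB2-flavoured SQL; table and column
names are illustrative):

SELECT O.ORDER_ID, O.CUST_ID, C.CUST_NAME, O.ORDER_AMT
FROM ORDERS O
JOIN CUSTOMER C ON C.CUST_ID = O.CUST_ID
WHERE O.ORDER_DATE >= CURRENT DATE - 7 DAYS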

Optimizing Select lists


For best performance and optimal memory usage, it is best to explicitly specify column
names on all source database stages, instead of using an unqualified Table or SQL
SELECT * read.
For Table read method, always specify the Select List sub property.

For Auto-Generated SQL, the DataStage Designer will automatically populate the
select list based on the stage's output column definitions.
The only exception to this rule is when building dynamic database jobs that use
runtime column propagation to process all rows in a source table.
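For example, instead of an unqualified SELECT * read, explicitly list only the columns
the job needs (a hedged sketch; table and column names are illustrative):

SELECT CUST_ID, CUST_NAME, CUST_STATUS
FROM CUSTOMER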

Designing for restart


To enable restart of high-volume jobs, it is important to separate the transformation
process from the database write (Load or Upsert) operation. After transformation, write
the results to a parallel dataset. Subsequent job(s) should read this dataset
and populate the target table using the appropriate database stage and write method.
As a further optimization technique, you can either use a Lookup stage or Join stage,
depending on data volume, to identify existing rows before they are inserted into the
target table.

Database OPEN and CLOSE commands


The native parallel database stages provide options for specifying OPEN and CLOSE
commands. These options allow commands (including SQL) to be sent to the database
before (OPEN) or after (CLOSE) all rows are read/written/loaded to the database. OPEN
and CLOSE are not offered by plug-in database stages.
For example,
You can use the OPEN command to create a temporary table, and the CLOSE command
to select all rows from the temporary table and insert into a final target table.
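A hedged sketch of that pattern (the table names and DDL are illustrative and depend on
the target database):

OPEN command: CREATE TABLE CUST_WORK (CUST_ID INTEGER, CUST_NAME VARCHAR(50))
CLOSE command: INSERT INTO CUST SELECT * FROM CUST_WORK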
As another example,
You can use the OPEN command to create a target table, including database-specific
options (tablespace, logging, constraints, and so on) that are not possible with the
Create option.
In general, don't let EE generate target tables unless they are used for temporary storage.
There are few options for specifying Create table options, and doing so may violate data
management (DBA) policies.
It is important to understand the implications of specifying a user-defined OPEN and
CLOSE command.
For example,
When reading from DB2, a default OPEN statement places a shared lock on the source.
When specifying a user-defined OPEN command, this lock is not sent, and should be
specified explicitly if appropriate.

Database Sparse Lookup vs. Join


Data read by any database stage can serve as the reference input to a Lookup operation.
By default, this reference data is loaded into memory like any other reference link
(Normal Lookup).

When directly connected as the reference link to a Lookup stage, both DB2/UDB
Enterprise and Oracle Enterprise stages allow the lookup type to be changed to Sparse,
sending individual SQL statements to the reference database for each incoming Lookup
row. Sparse Lookup is only available when the database stage is directly connected to
the reference link, with no intermediate stages.

Caution
The individual SQL statements required by a Sparse Lookup
are an expensive operation from a performance perspective.
In most cases, it is faster to use a DataStage Join stage
between the input and DB2 reference data than it is to
perform a Sparse Lookup.

For scenarios where the number of input rows is significantly smaller (for example,
1:100 or more) than the number of reference rows in a DB2 or Oracle table, a Sparse
Lookup may be appropriate.

Performance tips for job design
You can follow the tips mentioned below to improve the performance of the job design:
Remove unnecessary columns as early as possible within the job flow: every
additional unused column requires additional buffer memory, which can impact
performance (it also makes each transfer of a record from one stage to the next more
expensive).
When reading from database sources, use a select list to read the needed columns
instead of the entire table (if possible).
To ensure that columns are actually removed using a stage's Output Mapping,
disable runtime column propagation for that column.
Always specify a maximum length for VARCHAR columns.
Unbounded strings (VARCHARs without a maximum length) can have a
significant negative performance impact on a job flow. There are limited
scenarios when the memory overhead of handling large VARCHAR columns
would dictate the use of unbounded strings.
For example:
VARCHAR columns of a large (such as 32K) maximum length that are rarely
populated.
VARCHAR columns of a large maximum length with highly varying data sizes.
Placing unbounded columns at the end of the schema definition may improve
performance.
In DataStage v7.0 and earlier, limit the use of variable-length records within a flow.
Depending on the number of variable-length columns, it may be beneficial to
convert incoming records to fixed-length types at the start of a job flow, and trim to
variable-length at the end of a flow before writing to a target database or flat file
(using fixed-length records can dramatically improve performance). DataStage
v7.0.1 and later implement internal performance optimizations for variable-length
columns that specify a maximum length.
Avoid type conversions if possible.
Be careful to use the proper data types from the source (especially Oracle) in EE job
design.

Enable $OSH_PRINT_SCHEMAS to verify that the runtime schema matches the job
design column definitions.
Verify that the data type of defined Transformer stage variables matches the
expected result type.
Minimize the number of Transformers. Where appropriate, use other stages (for
example, Copy, Filter, Switch, Modify) instead of the Transformer.
NEVER use the BASIC Transformer in large-volume data flows. Instead, user-
defined functions and routines can expand the capabilities of the parallel
Transformer.
You should use buildops instead of Transformers in the handful of scenarios where
complex reusable logic is required, or where existing Transformer-based job flows
do not meet performance requirements.
Minimize and combine use of Sorts where possible.
It is sometimes possible to re-arrange the order of business logic within a job
flow to leverage the same sort order, partitioning, and groupings.
If data has already been partitioned and sorted on a set of key columns,
specifying the "don't sort, previously sorted" option for those key columns in
the Sort stage reduces the cost of sorting and takes greater advantage of
pipeline parallelism.
When writing to parallel datasets, sort order and partitioning are preserved.
When reading from these datasets, try to maintain this sorting if possible by
using SAME partitioning.
The stable sort option is much more expensive than non-stable sorts, and should
only be used if there is a need to maintain row order except as needed to perform
the sort.
Performance of individual sorts can be improved by increasing the memory
usage per partition using the Restrict Memory Usage (MB) option of the
standalone Sort stage.
The default setting is 20MB per partition. Note that sort memory usage can
only be specified for standalone Sort stages; it cannot be changed for inline
(on a link) sorts.

Performance management
The primary factors that developers can change in their design process that affect
performance are:
Number of sequential I/Os
Number, type and timing of database (DB/2) I/Os
Number and type of hashed file I/Os
Amount of CPU used
Amount and distribution of memory usage
Minimum performance speeds that each job must meet are being introduced. No job is
accepted into production unless it meets minimum performance rules; there are only a
few exceptions and those are only under special circumstances, with approval.

Note
A standard job design approach yields acceptable
performance for most jobs. Only a small percentage of jobs
need to be analyzed and tuned. If a job only runs once a
month, processes few rows, or runs for less than 5
minutes, it is usually not worth spending additional
resources trying to increase its performance.

Jobs that run several times a day or have execution times over 5 minutes are worth
examining for potential inadvertent bottlenecks or easy means of increasing
performance.
Any job that runs more than 15 minutes per day needs to be looked at carefully and in
detail to see if performance enhancement measures can increase the throughput without
impacting other jobs.
It is critical for the developer to begin the job design phase knowing:
Expected volumes and sizes for all sources, targets, reference files and tables.
Whether there are parent or child dependencies on the job
If the output data is important or time critical outside of job dependencies
How often and when the job is scheduled to be executed

Performance monitoring and tuning
This section provides information on monitoring and tuning job performance.

The Job Monitor


The Job Monitor provides a useful snapshot of a job's performance at the moment of
execution, but does not provide thorough performance metrics. That is, you should
not use a Job Monitor snapshot in place of a full run of the job, or a run with a sample
set of data. Due to buffering and some job semantics, a snapshot image of the flow may
not be a representative sample of the performance over the course of the entire job.
The CPU summary information provided by the Job Monitor is useful as a first
approximation of where time is being spent in the flow. However, it does not include
operators that are inserted by EE. Such operators include sorts that were not explicitly
included and the sub operators of composite operators. The Job Monitor also does not
monitor sorts on links. For these components, the score dump can be of assistance.
A worst-case scenario occurs when a job flow reads from a dataset, and passes
immediately to a sort on a link. The job appears to hang, when, in fact, rows are being
read from the dataset and passed to the sort.

Reading a score dump


When attempting to understand an EE flow, the first task is to examine the score dump
which is generated when you set APT_DUMP_SCORE=1 in your environment.
A score dump includes a variety of information about a flow, including:
How composite operators and shared containers break down.
Where data is repartitioned and how it is repartitioned.
Which operators, if any, have been inserted by EE.
What degree of parallelism each operator runs with.
Exactly which nodes each operator runs on.
Also available is some information about where data may be buffered.
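
A minimal way to obtain a score dump, assuming the job inherits its environment from the shell or from a job-level environment variable, is simply:

# write the parallel score to the job log at job startup
export APT_DUMP_SCORE=1

The same variable can be added as a job parameter through the Administrator or Designer so that it can be switched on per run.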

iostat
The UNIX utility iostat is useful for examining the throughput of various disk resources.
If one or more disks have high throughput, understanding where that throughput is
coming from is vital. If there are spare CPU cycles, I/O is often the culprit. iostat can
also help a user determine if there is excessive I/O for a specific job.
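
For example, a reasonable starting point (exact flags vary by UNIX flavour) is to sample at a fixed interval while the job runs:

# 5-second samples, 120 iterations (roughly 10 minutes)
iostat 5 120

# on platforms that support it, extended per-device statistics:
# iostat -x 5 120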

vmstat
The UNIX vmstat utility is useful for examining system paging. Ideally, once an EE
flow begins running, it should never be paging to disk (si and so should be zero). Paging
suggests EE is consuming too much total memory.
$ vmstat 1
procs           memory                   swap    io     system    cpu
 r  b  w   swpd   free   buff   cache    si  so  bi bo  in   cs   us sy id
 0  0  0  10692  24648  51872  228836    0   0   0  1   2    2    1  1  0

vmstat then produces a line like the following every N seconds:

 0  0  0  10692  24648  51872  228836    0   0   0  0   328  41   1  0 99

Load average for all versions

Platform-specific tuning
Tru64-specific tuning
Some environments have experienced better memory management when the
vm_swap_eager kernel parameter is set to true. This swaps out idle processes more
quickly, allowing more physical memory for EE. A higher degree of parallelism may
be available as a result of this setting, but system interactivity may suffer as a result.
Some environments have experienced improved performance when the virtual
memory Eager setting, vm_aggressive_swap, is enabled. This aggressively swaps
processes out of memory to free up physical memory for the running processes.

HP-UX-specific tuning
HP-UX has a limitation when running in 32-bit mode, which limits memory mapped I/O
to 2GB per machine. This can be an issue when dealing with large lookups. The Memory
Windows option can provide a work around for this memory limitation. Ascential
Product Support can provide this document on request.
In an EE flow, certain operators may complete before the entire flow has finished, but
the job is not deemed successful until the slowest operator has finished all its processing.

Selectively rewriting the flow


One of the most useful mechanisms you can use to determine what is causing
bottlenecks in your flow is to isolate sections of the flow by rewriting portions of it to
exclude stages from the set of possible causes.
The goal of modifying the flow is to see whether the modified flow runs noticeably faster
than the original flow. If the flow is running at roughly an identical speed, change more
of the flow.

While editing a flow for testing, it is important to keep in mind that removing one
operator might have unexpected effects in the flow. Comparing the score dump between
runs is useful before concluding what has made the performance differ.
When modifying the flow, be aware of introducing any new performance problems.
For example, adding a persistent dataset to a flow introduces disk contention with any other
datasets being read. This is rarely a problem, but it might be significant in some cases.
Reading and writing data are two obvious places to be aware of potential performance
bottlenecks. Changing a job to write into a Copy stage with no outputs discards the data.
Keep the degree of parallelism the same, with a nodemap, if necessary. Similarly,
landing any read data to a dataset can be helpful if the point of origin of the data is a flat
file or RDBMS.
You should follow this pattern, removing any potentially suspicious stages while trying
to keep the rest of the flow intact. Removing any customer-created operators or sequence
operators should be at the top of the list. Much work has gone into the latest 7.0 release
to improve Transformer performance.

Eliminating repartitions
It is strongly recommended that you eliminate superfluous re-partitioning. Due to
operator or license limitations (import, export, RDBMS operators, SAS operators, and so
on) some operators run with a degree of parallelism that is different than the default
degree of parallelism. Some of this cannot be eliminated, but understanding where,
when, and why these repartitions occur is important for understanding the flow.
Repartitions are especially expensive when the data is being repartitioned on an MPP,
where significant network traffic is generated.
Sometimes a repartition can be moved further upstream in order to eliminate a previous,
implicit repartition. Imagine an Oracle read followed by some processing, with the result
then hash-partitioned and joined with another dataset. There might be a repartition after
the Oracle read stage and then another at the hash, when only one repartitioning is
ever necessary.
Similarly, a nodemap on a stage may prove useful for eliminating repartitions. In this
case, a transform between a DB2 read and a DB2 write might need to have a nodemap
placed on it to force it to run with the same degree of parallelism as the two DB2 stages
in order to avoid two repartitions.

Ensuring data is evenly partitioned


Due to the nature of EE, the entire flow runs as slow as its slowest component. If data is
not evenly partitioned, the slowest component is often a result of data skew. If one
partition has ten records, and another has ten million, EE can simply not make ideal use
of the resources.
$APT_RECORD_COUNTS=1 displays the number of records per partition for each
component. Ideally, counts across all partitions should be roughly equal. Differences in
data volumes between keys often skew this data slightly, but any significant (over 5 or
10%) differences in volume should be a warning sign that alternate keys or an alternate
partitioning strategy might be required.
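
As a loose illustration, the variable named above can simply be exported in the environment that starts the job (or added as a job parameter):

# per-partition record counts are then written to the job log
export APT_RECORD_COUNTS=1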

Buffering for all versions


Buffer operators are introduced in a flow anywhere that a directed cycle exists or
anywhere that the user or operator requests them using the C++ API or osh.
The default goal of the buffer operator, on a specific link, is to make the source stage
output rate match the consumption rate of the target stage. In any flow where the buffer
operator's behaviour does not suit the surrounding stages, performance degrades.
For example, the target stage has two inputs and waits until it has exhausted one of those
inputs before reading from the next. Identifying these spots in the flow requires an
understanding of how each stage involved reads its records, and is often only found by
empirical observation.
A buffer operator tuning issue is likely when a flow runs slowly as one massive flow, but
each component runs quickly when the flow is broken up.
For example, replacing an Oracle write with a Copy stage vastly improves performance, and
writing that same data to a dataset and then loading it with an Oracle write also goes
quickly; yet when the two are put together, performance grinds to a crawl.
The Buffering topic under Bottleneck resolution, later in this document, details specific
common buffer operator configurations in the context of resolving various bottlenecks.

Job performance analysis
You can refer to some of the job runtime information from the snapshots given below:

Job performance analysis: ETL jobs with longest elapsed time

After analyzing the job monitor and the job log, for almost all jobs you may see that the
most time-consuming part is the upsert to DB2:

From the following picture you can see that most of the upsert is really an update process
in the database:

DataStage Director Event Detail

DataStage Director Monitor

DataStage Director Monitor

DataStage Designer

Why is upsert slow even though the database performance is good?


This section provides information on how the upsert works and the solution to enhance
the database performance.

How does the upsert work?


For example:
If you use an Upsert with an array size of (say) 500 with INSERT set to try first, then DS
attempts to insert all 500 rows in one statement. If just ONE of them fails, then the entire
array fails, and DS has to pick it apart and run the inserts that would succeed followed
by updates.
The cost of the original INSERT and the ROLLBACK are wasted time. If you know
which rows already exist, you can improve performance by bypassing this failed
INSERT, but that makes the job complex.
The best solution for this is to use Load and Merge SQL in place of the DataStage Upsert
for better performance.

Although the Upsert function of the DB2 EE stage is easy to use in job design, it performs
poorly with large CDC data volumes.
For example:
Job: djpHewr2LOD_CUST_IP_INTERFACE_HEW2_LEO
CDC data volume: 50M
Job run time: Startup time, 00:04:59; production run time, 3:26:25.
The job runs slowly and its Upsert speed is below 100 rows/s.
Note the following points for the Load and Merge approach:
A sequential file is used to store the CDC data, so the job output must be modified accordingly.
Create a temporary table with the same structure as the final table that will be updated.
Use the load command to load the data from the sequential file into the temporary table.
Use Merge SQL to update the final table from the temporary table.
For the updated job, refer to the job
djpHewr2LOD_CUST_IP_INTERFACE_HEW2_LEO3

Load command sample (here ${tbnm} is the table name and ${tbsch} is the schema):


db2 " load client from /CLIENT/br2/ETL/tmp/data/${tbnm} of del modified by fastparse coldel, messages ${tbnm}_ll.msg replace into ${tbsch}.${tbnm} "

Use load with the replace option to refresh the table after each job run.


Merge SQL sample (here TH.HEW_IP_UPSRT is the temporary table that contains the CDC
data):

MERGE INTO TH.HEW_IP AS T USING TH.HEW_IP_UPSRT AS S
  ON S.IP_ID = T.IP_ID
WHEN MATCHED THEN UPDATE SET
  T.IP_TYPE_CDE    = S.IP_TYPE_CDE,
  T.SRCE_SYS_CDE   = S.SRCE_SYS_CDE,
  T.SUSPT_MNTR_IND = S.SUSPT_MNTR_IND,
  T.UPDT_DT_TM     = S.UPDT_DT_TM
WHEN NOT MATCHED THEN INSERT
  (IP_ID, IP_TYPE_CDE, SRCE_SYS_CDE, SUSPT_MNTR_IND, UPDT_DT_TM)
VALUES
  (S.IP_ID, S.IP_TYPE_CDE, S.SRCE_SYS_CDE, S.SUSPT_MNTR_IND, S.UPDT_DT_TM);

Total performance with Load and Merge SQL:


Write to the sequential file (job djpHewr2LOD_CUST_IP_INTERFACE_HEW2_LEO3):
Job runtime: Startup time, 00:02:18; production run time, 00:01:51.

Use a script to load the sequential file into the upsert (temporary) table, then insert into
CDC_HIS and do the merge.
Script runtime:
real 6m0.14s
user 0m0.09s
sys  0m0.06s

Total runtime: < 15 minutes

So the final performance is really improved dramatically!

Identifying performance bottlenecks
All jobs, no matter how efficiently they are written, have a bottleneck. Each job has
some aspect that, when changed, makes the job run faster. This bottleneck is often
different on development machines than it is on production ones.
A bottleneck is not something negative; it is an inherent attribute of a job. It is important
to determine which portion of a job is causing the bottleneck. Once armed with that
knowledge, the developer can then decide if the performance value limited by the
bottleneck is acceptable. There are no absolute values for this, but common sense and an
understanding of the priorities make the decision easier. Later in this document some
metrics are presented that also help in making the decision.
The sample DataStage job below can be broken into broad
categories that are used to locate bottlenecks:

Category      Job
Source        Data reads
Reference     Data lookups
Target        Data writes

Processing Speeds

A job performs at a maximum speed that is limited by its slowest component. Even with
buffering, once the buffer fills up the job reverts back to the speed of the slowest stage.
Usually a job functions so that one input row gets processed and, after that row is written
out to the one or more destinations, the next row is taken up. Jobs that perform aggregations
or sorting, or that have interim file stages, work a bit differently, but the overall speed is
limited in the same way.
Measuring component performance in the context of the job is not complicated. It is a
matter of compartmentalizing into groups and measuring those speeds separately. Some
jobs lend themselves to quite simplistic measurements while others need more work, but
with just a couple of minutes of design modification effort, the slowest components can
be easily identified.
Usually a combination of modifying constraints and adding temporary sequential files
is necessary. If the constraint of a Transformer stage is set to always evaluate to a false
value (1=2 or @FALSE), then the performance of all stages leading up to that
Transformer is measured. If the speed comparison of this job with the original shows that
the overall performance has not changed appreciably, then the bottleneck is before that
Transformer; otherwise the bottleneck lies after the Transformer stage.
The best method is to use temporary sequential files between stages or sections of the
job as shown here:

Sequential file I/O is very fast and is only rarely a bottleneck. The addition of Sequential
File stages as depicted makes sure that each group of steps between sequential stages is
executed completely: the last row of input is written to the sequential file before the
first row is read from it by the subsequent stage. Thus each section has a rows-per-second
speed that is independent of the others and can be compared with the other sections for
performance measurement.

Bottleneck resolution
This section provides information and tips to overcome the bottleneck using DataStage
best practices.

Combinable operators
Combined operators generally improve performance at least slightly, and in some cases
the performance improvement may be dramatic. However, there may be situations where
combining operators actually hurts performance. Identifying such operators can be
difficult without trial and error.
The most common situation arises when multiple operators, such as Sequential File
(import and export) and Sort, are combined and are performing disk I/O. In I/O-bound
situations, turning off combination for these specific operators may result in a
performance increase.
This is a new option in the advanced stage properties of DataStage Designer version 7.x.
Combinable operators often provide a dramatic performance increase when a large
number of variable length fields are used in a flow.
To experiment with this, try disabling the combination of any stages that perform I/O
and any sort stages.
$APT_DISABLE_COMBINATION=1 globally disables operator combining.
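
A quick way to test this globally, before selectively switching combination off per stage in Designer, is simply:

# disable operator combination for the whole job and compare elapsed times
export APT_DISABLE_COMBINATION=1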

Disk I/O
Total disk throughput is often a fixed quantity that EE has no control over. There are,
however, some settings and rules of thumb that are often beneficial:
If data is going to be read back in, in parallel, it should never be written as a
sequential file. A dataset or file set is a much more appropriate format.
When importing fixed-length data, the Number of Readers Per Node option on the
Sequential File stage can often provide a noticeable performance boost as compared
with a single process reading the data. However, if there is a need to assign a number
in source file row order, you cannot use the -readers option because it opens multiple
streams at evenly-spaced offsets in the source file. Also, you can use this option only
for fixed-length sequential files.
Some disk arrays have read-ahead caches that are only effective when data is read
repeatedly in like-sized chunks.
$APT_CONSISTENT_BUFFERIO_SIZE=n forces import to read data in chunks
which are size n or a multiple of n.
Memory mapped I/O is, in many cases, a big performance win; however, in certain
situations, such as a remote disk mounted via NFS, it may cause significant
performance problems. APT_IO_NOMAP=1 and APT_BUFFERIO_NOMAP=1 turn off
this feature and sometimes affect performance.

AIX and HP-UX default to NOMAP.
You can use APT_IO_MAP=1 and APT_BUFFERIO_MAP=1 to turn on the memory
mapped I/O for these platforms.
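
The settings mentioned above can be experimented with from the job's environment; the numeric value below is only an illustrative example and should be matched to your disk array:

# read imports in consistent chunk sizes (bytes) to help read-ahead caches
export APT_CONSISTENT_BUFFERIO_SIZE=1048576

# turn memory-mapped I/O off (for example, when data lives on an NFS mount)
export APT_IO_NOMAP=1
export APT_BUFFERIO_NOMAP=1

# or turn it on where the platform defaults to NOMAP (AIX, HP-UX)
# export APT_IO_MAP=1
# export APT_BUFFERIO_MAP=1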

Buffering
Buffer operators are intended to slow down their input to match the consumption rate of
the output. When the target stage reads very slowly, or not at all, for a length of time,
upstream stages begin to slow down. This can cause a noticeable performance loss if the
optimal behaviour of the buffer operator is something other than rate matching.
By default, the buffer operator has a 3MB in-memory buffer. Once that buffer reaches
two-thirds full, the stage begins to push back on the rate of the upstream stage. Once the
3MB buffer is filled, data is written to disk in 1MB chunks.
In the following discussions, settings in all capital letters are environment variables and
affect all buffer operators. Settings in all lowercase are buffer-operator options and can
be set per buffer operator.
In most cases, the easiest way to tune the buffer operator is to eliminate the push
back and allow it to buffer the data to disk as necessary.
$APT_BUFFER_FREE_RUN=n or bufferfreerun does this. The buffer operator reads N *
max_memory (3MB by default) bytes before beginning to push back on the
upstream stage. If there is enough disk space to buffer large amounts of data, this usually
fixes any egregious slow-down issues caused by the buffer operator.
If there is a significant amount of memory available on the machine, increasing the
maximum in-memory buffer size is likely to be very useful if the buffer operator is
causing any disk IO. $APT_BUFFER_MAXIMUM_MEMORY or
maximummemorybuffersize is used to do this. It defaults to roughly 3000000 (3MB).

For systems where small to medium bursts of I/O are not desirable, the 1MB write to
disk size chunk size may be too small. $APT_BUFFER_DISK_WRITE_INCREMENT or
diskwriteincrement controls this and defaults to roughly 1000000 (1MB). This setting
may not exceed max_memory * 2/3.
Finally, in a situation where a large, fixed buffer is needed within the flow,
queueupperbound (no environment variable exists) can be set equal to max_memory
to force a buffer of exactly max_memory bytes.
Such a buffer blocks an upstream stage (until data is read by the downstream
stage) once its buffer has been filled, so this setting should be used with extreme
caution. This setting is rarely necessary to achieve good performance, but is
useful when there is a large variability in the response time of the data source or
data target. No environment variable is available for this flag; it can only be set
at the osh level.
For releases 7.0.1 and beyond, per-link buffer settings are available in EE. They appear on
the Advanced tab of the Input & Output tabs. The settings saved on an Output tab are shared
with the Input tab of the next stage and vice versa, like Columns.
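
As a sketch only (the values below are illustrative, not recommendations), the global equivalents of the settings described above can be set in the job's environment:

# absorb more data before pushing back upstream (multiples of the in-memory buffer)
export APT_BUFFER_FREE_RUN=4

# raise the in-memory buffer from the ~3MB default (value in bytes)
export APT_BUFFER_MAXIMUM_MEMORY=12000000

# write to disk in larger increments (bytes); keep this below 2/3 of the maximum memory
export APT_BUFFER_DISK_WRITE_INCREMENT=4000000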

Sequential file stages
This is the fastest stage both for reading and writing in DataStage. There is little that can
be done to speed up I/O with sequential files. Since this stage is only seldom the
bottleneck, any increases in efficiency are often not visible with regards to single job
performance. Nevertheless there are some simple approaches that optimize Sequential
stage performance and reduce the overall system load; these small changes add up across
all jobs and speed up processing.
Often, as part of processing, temporary or staging sequential files are created in one job
and read in another. When this is the case, it is worthwhile looking at the data and
deciding whether or not quotes are necessary for string fields and if the data can be
written as fixed length without separators.
For example, if a staging file has three columns, ARRG_ID as an 18-digit numeric and
CURRENCY and STATUS as Char(3) and Char(1) respectively, two possible formats
for the resulting sequential file could be:

Quoted and Comma Separated (Default)    Fixed Width, no Separators, no Quotes
4104385636294522113,GBP,A               4104385636294522113GBPA
4104385636294522114,GBP,A               4104385636294522114GBPA
4104385636294522115,USD,A               4104385636294522115USDA
4104385636294522116,JPY,A               4104385636294522116JPYA

The first method uses 140 bytes per row while the second uses only 105, a 25%
reduction in the data stored. Since this data gets written and read at least once, a small
modification such as this can almost halve the I/O in this example. Even though the
difference in large sequential files is not always directly noticeable, it lets the job have a
smaller impact on the system; the cumulative effect over time and across many jobs
is significant. The more columns in a file, the greater the data size savings.
Fixed length sequential files have the secondary effect of being more efficient for the
read process to parse. As the start and end positions are the same for each line DataStage
needs only to access a substring on each row. With delimited files the process is more
involved, as the string needs to be searched character by character until a start or end
delimiter is encountered. This processing overhead is very noticeable with files
containing a hundred or more columns and no inter-process buffering.

Yet another advantage of using fixed-length sequential files is that they can be read
significantly faster by Parallel Extender (PX) jobs, should those be used in the future for
part of the processing. PX jobs start a number of concurrent processes; if a delimited
sequential file is read, it must be parsed line by line by just one process to determine
how the rows are organized. A fixed-length file can easily be read by several processes in
parallel, as the row start positions are easily determined.
Putting the data into a fixed-width format is often not a solution, however. It can have a very
detrimental effect, especially when dealing with the fixed-width CHAR columns
often found when using DB/2. These columns often contain many extraneous spaces;
leading and trailing blanks are trimmed in DataStage, so writing a CHAR(100)
column containing 'hello' puts only 5 bytes into a variable-length sequential
column, but if written to a fixed-width flat file it would, of necessity, take up 100 bytes
of space.
Not using quote characters always saves a lot of space, but care needs to be taken when
removing them. If any of these columns could contain the column delimiter character
then the resulting text file becomes unusable and causes jobs to abort.
The default column delimiter is the comma (',') character. One common method is to
choose an arbitrary non-displayable character as the delimiter, so using 0x04 (the EOT
code) works as a separator if it is certain that this value does not occur in any character field.

Hashed File stages


Hashed files are DataStage internal database files. They can be used for interim storage,
lookups and even as data repositories. All database systems, including DataStage hashed
files, have two core functions:
Quickly retrieving data using a key
Ensuring concurrency across processes.
Hashed files always have exactly one unique key to a record. Hashed files use one of
21 different mathematical approaches to reduce the record key to a number. This
number is between 0 and the number of buckets (also called groups) defined during file
creation. Each bucket contains the data for one or more records in the form of strings
and linked lists. This simple approach means that there are only a few steps that need to
be performed in retrieving record data given the key and it functions extremely quickly.
The number of buckets used is called the file's modulo and is set depending upon the
expected number of records in the file. Each group should contain just a few records,
optimally totalling fewer than 1600 bytes; when a group contains more than that it
becomes inefficient, as locating a record in that group means that the program needs to
traverse a linked list.
Thus a modulo of 11 in a file with 1 million records means that each group has about 91,000
records. Assuming a worst-case lookup, the system has to do 1 hash calculation and then
91,000 linked-list operations to get the data. If the modulo were instead sized at 1001, the
number of records in an average group would be 999, so a worst-case lookup would
only need to do about 1000 operations.
A simple example of hashing would be a file with a numeric key and a modulo of 4. If the
sequential numeric hashing algorithm (Type 2) is used, then each record gets placed into
bucket number MOD({KeyValue}, 4) + 1. So if all keys between 1 and 100 are used,
an even distribution of 25 records per group is achieved. But if the keys were 2 through
200 with only even-numbered keys being used, then the group counts end up being 50,
0, 50, and 0 respectively, resulting in a very uneven distribution; thus the Type 2 hashing
algorithm is inappropriate for this file.
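
A quick shell illustration of the Type 2 calculation described above (purely to show the arithmetic, not how the engine implements it):

# bucket = MOD(key, 4) + 1
for key in 1 2 3 4 5 6 7 8; do
  echo "key $key -> group $(( key % 4 + 1 ))"
done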
Since modulo has such a profound impact on performance it is important to make sure
that the value is sufficiently large for the file's data. DataStage has introduced a
maintenance-free type of hashed file called Dynamic (or type 30) and has made it the
default file type for new files. This hashed file type automatically increases or decreases
the actual modulo used depending on the file size.
By default a dynamic file starts off with one group, and the split threshold is set at 80% -
so when the file is 80% full it splits the groups and effectively increases modulo
dynamically. The reverse happens when the load goes down below 20% which triggers a
group merge.
This is a very efficient method of keeping no-maintenance high system performance for
typical database tables, which tend to grow or shrink slowly. In DataStage jobs hashed
files are often cleared and then filled (sometimes several times daily) so the overhead
associated with this splitting and merging can be very high.
Fortunately dynamic hashed files allow the developer to specify and override default
settings when they are created and by correctly setting initial values, performance is
increased.

Creating hashed files


You should use the default Type 30 file in almost all cases. Using static hashed files can
result in a 5-10% performance gain over dynamic hashed files but if these static files are
allowed to overflow badly then the performance loss compared to their dynamic
counterpart can be greater than 1000%. The only hashed files that you should consider
as static files are those that are used heavily, for example, for several million I/O
operations per day, or that are very static and wont grow {appreciably} even in the very
long term.

When hashed files are written to in DataStage jobs, the Hashed File stage exposes the file
creation options, which in turn show the following window:

Hashed File Stage

The only attribute that you need to change from the default value of 1 is the Minimum
modulus. There are only 2 hash algorithms possible for dynamic files; if the key of
the hashed file changes most in the rightmost bytes (regardless of whether it is numeric or
not), then using SEQ.NUM as the algorithm might increase speed by up to 5%.
In order to find the optimum starting modulus it is necessary to fill the hashed file with
data. Once the file contains data, use either the DataStage Administrator or the TCL shell
client to execute the appropriate commands.

There are 2 methods of creating and accessing hashed files in DataStage: either by
putting them into the local account or by specifying a path. If the file has been created using
a path, then a file pointer to this file needs to be created using the command SETFILE
{full UNIX path to file} {FileName}; if a local file was specified, this entry has already
been made. The command ANALYZE.FILE {FileName} returns the following window:

Command Output

You should use the value shown for the No. of Groups (modulus) as the minimum modulo
(or as the basis for multiplication if the whole hashed file wasn't loaded).
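
For example, at the TCL or Administrator command prompt (the path and file name below are hypothetical):

SETFILE /data/project/hash/LKP_CUSTOMER LKP_CUSTOMER
ANALYZE.FILE LKP_CUSTOMER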

Using hashed file caching


DataStage V7 has increased the functionality behind hashed files to include hashed file
caching. This feature allows some very time-saving functions related to hashed file reads
which include multiple processes using one hashed file that is loaded to memory or
dynamic modification of data contents for files loaded to memory. This functionality has
not been enabled at CLIENT and all files created should specify any caching features to
be disabled.

Preloading hashed files to memory
Any hashed file that is used for reference lookups and is less than approximately 128MB
in size should be pre-loaded into memory. There are only a few exceptions to this rule.
They are as follows:
When the hashed file is modified within the job itself or by another job at the same
time and the newly modified entries need to be visible in the lookup.
When only a small percentage of the records in the hashed file are going to be looked up,
or when the source data is so small that the time needed to load the hashed file into
memory takes longer than just reading the file directly.

Very large hashed files


DataStage hashed files use 32-bit internal pointers for their structure and are thus limited
to 2GB. Static hashed files are stored as one UNIX object, but dynamic hashed files are
stored as 2 files (DATA.30 and OVER.30), so the data limit tends to be a bit higher than
2GB. If the file system and the system/user file limit settings, which are visible through
the ulimit -f command, allow it, then it is possible to declare hashed files that use a 64-bit
pointer and can hold up to 8 exabytes of data (8,388,608 terabytes). 64-bit files are
indistinguishable from 32-bit files at an application level.
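
Before creating such a file it is worth confirming the file-size limit from the shell (the 64-bit option itself is specified when the hashed file is created or resized, and the exact syntax depends on your engine version):

# "unlimited", or a value comfortably above 2GB, is needed for 64-bit hashed files
ulimit -f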

DB/2 table stages


As the DB/2 instances are all on separate machines any I/O is going to be constrained
not only by the database subsystem throughput but also by the network activity.

Reading from DB/2


Accessing DB/2 table information is split into two phases:
1. Query and pre-processing
2. Transporting the data into DataStage.
Optimizing the query is outside of the scope of this document. You can increase the data
read speed in DataStage by ensuring that only required rows and columns are actually
selected. Although this sounds both banal and straightforward, analysis of existing jobs
shows quite a few results where this is not the case. In many jobs, large numbers of
columns are read from DB/2 but never used, and quite often records are read and then
discarded that could have been constrained in the original SELECT clause.
DB/2 uses buffers for I/O activities and they are set to 4Kb (4096 bytes). In order to
make optimal buffer usage, you can raise the read array size so that as many records as
possible are transferred at one time without going into a second buffer.
If part of the DataStage job involves aggregation or sorting, or some other reason exists
for the data to be sorted, then it is best to perform this within the database SELECT
statement using ORDER BY. The sorting is more efficient in the database: the
algorithms are efficient, it can make use of clustered indices, and the processing
load is shifted from the DataStage machine to the remote DB/2 server.
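
A hypothetical source query along these lines (the table name is invented; the columns reuse the earlier staging-file example) restricts both rows and columns and pushes the sort to DB/2:

# assumes a prior: db2 connect to <database>
db2 "SELECT ARRG_ID, CURRENCY, STATUS
     FROM DWH.ARRANGEMENT
     WHERE STATUS = 'A'
     ORDER BY ARRG_ID"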

Writing to DB/2
Overall this stage tends to be the speed-limiting factor in most jobs. Similar to reading
DB/2 data, it is necessary to reduce the amount of data written to just the necessary
columns and updated/inserted rows. The processing time used to decide whether or not
to write a row of data is always going to be less than the time spent updating a row that
hasn't changed.
The array size attribute is computed the same way as for reads, setting it to a number that
fills the 4Kb buffer with data without overflowing into a second buffer. There is a
transaction size setting when writing and you should set it to a multiple of the array size.
This parameter is equivalent to the database commit frequency and is to be set to as high
a value as practical; the more commits a job does, the more overhead is incurred in the
job.
From a performance point of view, the best commit frequency is 0, denoting that the whole
job is one transaction and a commit is only done at the end of processing. The drawback
to this is that uncommitted writes need to be stored on the DB/2 machine, and if too
much data is placed in the buffers they can overflow. At present the recommendation
from the DB/2 DBAs is to use a commit frequency no larger than 10,000.
Two of the write options to DB/2 are:
1. Update existing or insert new rows
2. Insert new or update existing rows
Both options do the same thing, but due to their ordering they can have a big impact on
performance. If a job writes to a table that is 90% updates and only 10% inserts, using
the 'insert new or update existing rows' option results in over 70% more database
operations. For 100 rows, the following table compares the two options:

Insert New or Update Existing Rows      Update Existing or Insert New Rows
10 new rows with INSERT                 10 failed UPDATE attempts
90 failed attempts to INSERT            10 new rows with INSERT
90 UPDATEd rows                         90 UPDATEd rows
190 DB operations                       110 DB operations

If the job design allows it, split the output to DB/2 stages into two distinct data streams,
one doing only INSERT operations and the other doing only UPDATE. This has several
advantages:

The INSERT stream is doing pure inserts with no wasted DB checking if the record
already exists.
The UPDATE stream is also only doing updates with no wasted cycles.
If the job settings have 'enable row buffer' and 'interprocess' enabled, then each of
the DB/2 stages is executed as a separate process and two DB/2 connections are opened
that run in parallel.
Should the bulk load functionality be incorporated in the future, you don't need to
modify the job apart from changing the stage type in order to get maximum benefit
from it.

Reference lookups to DB/2


You should avoid this type of lookup in DataStage server jobs. If the lookup table is in
the same DB/2 instance as the source data, perform the join in the SELECT statement
instead. The only valid use of direct DB/2 lookups is when the number of source records
is small and the lookup data volume is very large; for example, 100 rows of data that are
looked up against a table of 6 million records. In all other cases, load the reference table
into a hashed file, which is then used as the lookup source.
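
A hedged sketch of the join alternative, with made-up table and column names, pushes the lookup into the source SELECT:

# assumes a prior: db2 connect to <database>
db2 "SELECT s.ARRG_ID, s.CURRENCY, r.CUST_NAME
     FROM DWH.ARRANGEMENT s
     JOIN DWH.CUSTOMER r ON r.CUST_ID = s.CUST_ID
     WHERE s.STATUS = 'A'"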

Transformer stages
As the real workhorse in DataStage, the Transform stages usually consume most, if not
all, of a job's CPU time. Except for Transform stages that have lookup stages attached,
these are seldom the bottleneck stages in processes. Nevertheless, it is important to avoid
wasting CPU time on unnecessary computations, and you should always look at Transformer
derivations with performance considerations in mind.

Note
Remember that each computation done in a transform is
repeated for every row.

By enabling the inter-process row buffering option in the job settings, each transform stage
becomes its own UNIX process. If a transform stage does many complex or time-
consuming computations then it might cause a bottleneck as its processing is limited to
just one processor. By splitting a transformer stage into two or more distinct stages the
processing load can be distributed and the bottleneck removed. The CPU time used for a
Transformer (or other active) stage is displayed as part of the job log information.
The most common cause of CPU wastage in Transformer stages is when identical
computations or derivations are unnecessarily repeated. Transformer stage local
variables have been included so that a computation need only be done once per row and
then used as many times as possible. The following transform is typical of one that
performs extraneous operations.

Stage Variables

Adding a stage variable whose value is re-used makes this transform use a lot less CPU and
perform much more efficiently.
Stage variables are recomputed for each row of data. The exception to this rule is the
initial value, which is only computed once at the beginning of the job. If the variable
value is never changed, such as when the derivation line is left empty, then this value
can be used for each row processed in a job.

Stage Variables

Transformer Stage Properties

Routines and transforms


You can save much processing time by embedding repeated code into Routines and
Transforms. If a function can be written as a single line of code, it is much more
efficient to put it into a Transform function than into a Routine. Transforms are expanded in
the job itself at compile time, whereas Routines are called at runtime. Thus, changing a
Transform requires a recompilation of all jobs that use it, whereas a Routine need only be
recompiled and all jobs that use the Routine automatically use the new copy.

Shared containers
Shared containers are much like Transform calls in that their design information and
programming code are compiled into the job's object code, so any changes to a common
shared container are not reflected until the job is re-compiled.
The container should be a part of the normal data flow in a job; it should never be possible
to write the container as a separate job. When that is the case, it should be written as a
standalone job and connected to the original via a sequencer. Containers should also be
written so as not to be misleading. The following job design looks like it doesn't do
much:

Job design

But if the container's contents were to be as shown in the second job design below, it would
mean that each and every row needs to be processed through the ToCont link and into the
container before any data streams from the FromCont link to the link collector. The
container could also be updating table entries that are then updated again by the parent job.

Job design

Sequences
Job Sequences are, fortunately, rather straightforward. They almost never use any CPU
or I/O and therefore leave little to be tuned. The only performance consideration with
sequences is to allow as many jobs as the design permits to run in parallel within them.

Production support process
This section provides information on the procedure followed under production support.

SWAT
You can refer to the following procedure that is followed under SWAT:
XXXX sends issue details to SWH or XXXX directly. The issue is analyzed by XXXX if
it is related to DS jobs; if it is related to design or business logic, it is analyzed by SWH.
XXXX sends their analysis to SWH/XXXX; XXXX also sends a suggested solution
based on experience.
SWH instructs XXXX to amend the DS jobs and design specifications. XXXX amends the
DS jobs, reviews and tests them in the XXXX UAT environment, and sends them back to
SWH for review and confirmation.

Note
Unit testing in UAT is a must; DO NOT deliver code without
testing it in the UAT environment.

Capture the unit testing results (in QC) and send them along with the amended job to SWH.
Once the job change is approved by SWH, it is delivered to XXXX along with the unit test
results.
Follow the version control process (applicable only if changes need to be incorporated
in the v1.6/1.7 project).
XXXX further checks the jobs from their end before moving them to UAT (in case the issue
is raised by them and not by the end user).
Follow up and get confirmation from XXXX that they have tested the code from
their side in UAT and that the code is fine.

Production support
XXXX sends production issue details to SWH or XXXX directly.
The issue is analyzed by XXXX. If a logic change is required, confirmation is obtained
from SWH.
XXXX amends the DS jobs after getting confirmation, tests them in the XXXX UAT
environment, and delivers them to XXXX.

Note
Unit testing in UAT is a must; DO NOT deliver code without
testing it in the UAT environment.

Capture the unit testing results (in QC) and send them along with the amended job to XXXX.
XXXX further checks the jobs from their end before moving them to Production.
Follow up and get confirmation from XXXX that they have tested the code from
their side in UAT and that the code is fine.
Follow the version control process.
XXXX moves the code to Production.

Version control
Once the fix is applied and released, update the following spreadsheets with the change
details and check them in to MKS:
\\mks_hew_r2\Production Fix Change-Requests\Change Request Tracking Sheet.xls
\\mks_hew_r2\ETL_Source\ETL Version Control\Amendments during Production
Implementation.xls
Copy the .dsx of the updated jobs and the CR document (if applicable) to the project shared
folder \Share\XXXX\vol_ss\HEW 4.0\3 Development\GWM\Production Fixes\

Work flow diagram


The following diagram displays the work-flow in a production environment:

Figure: Work flow diagram in production environment

