
DataStage Enterprise Edition

Proposed Course Agenda


~

Day 1
Review of EE Concepts Sequential Access Best Practices DBMS as Source

Day 3
Combining Data Configuration Files Extending EE Meta Data in EE

Day 2
EE Architecture Transforming Data DBMS as Target Sorting Data

Day 4
Job Sequencing Testing and Debugging

Contents
1. Introduction 2. DataStage Administrator 3. DataStage Manager 4. DataStage Designer 5. DataStage Director 6. ODBC/Relational Stages 7. Intelligent Assistants 8. Hashed Files 9. Transforming Data 10. DataStage BASIC 11. Repository Objects 12. Performance

Enhancements
13. Job Sequencer


The Course Material

Course Manual

Exercise Files and Exercise Guide

Online Help

Using the Course Material


~

Suggestions for learning


Take notes Review previous material Practice Learn from errors

Intro Part 1

Introduction to DataStage EE

What is DataStage?
~

Design jobs for Extraction, Transformation, and Loading (ETL) Ideal tool for data integration projects such as data warehouses, data marts, and system migrations Import, export, create, and manage metadata for use within jobs Schedule, run, and monitor jobs all within DataStage Administer your DataStage development and execution environments

DataStage Server and Clients

DataStage Administrator

Client Logon

DataStage Manager

DataStage Designer

DataStage Director

Developing in DataStage
~

Define global and project properties in Administrator Import meta data into Manager Build job in Designer Compile in Designer Validate, run, and monitor in Director

~ ~ ~ ~

DataStage Projects

Quiz True or False


~

DataStage Designer is used to build and compile your ETL jobs Manager is used to execute your jobs after you build them Director is used to execute your jobs after you build them Administrator is used to set global and project properties

Intro Part 2

Configuring Projects

Module Objectives
~

After this module you will be able to:


Explain how to create and delete projects Set project properties in Administrator Set EE global properties in Administrator

Project Properties
~

Projects can be created and deleted in Administrator Project properties and defaults are set in Administrator

Setting Project Properties


~

To set project properties, log onto Administrator, select your project, and then click Properties

Licensing Tab

Projects General Tab

Environment Variables

Permissions Tab

Tracing Tab

Tunables Tab

Parallel Tab

Intro Part 3

Managing Meta Data

Module Objectives
~

After this module you will be able to:


Describe the DataStage Manager components and functionality Import and export DataStage objects Import metadata for a sequential file

What Is Metadata?

Data

Source
Meta Data

Transform

Target
Meta Data

Meta Data Repository

DataStage Manager

Manager Contents
~

Metadata describing sources and targets: Table definitions DataStage objects: jobs, routines, table definitions, etc.

Import and Export


~ ~ ~ ~ ~

Any object in Manager can be exported to a file Can export whole projects Use for backup Sometimes used for version control Can be used to move DataStage objects from one project to another Use to share DataStage jobs and projects with other developers

Export Procedure
~

In Manager, click Export>DataStage Components Select DataStage objects for export Specify type of export: DSX, XML Specify file path on client machine

~ ~ ~

Quiz: True or False?


~

You can export DataStage objects such as jobs, but you can't export metadata, such as field definitions of a sequential file.

Quiz: True or False?


~

The directory to which you export is on the DataStage client machine, not on the DataStage server machine.

Exporting DataStage Objects

Exporting DataStage Objects

Import Procedure
~

In Manager, click Import>DataStage Components Select DataStage objects for import

Importing DataStage Objects

Import Options

Exercise
~

Import DataStage Component (table definition)

Metadata Import
~

Import format and column definitions from sequential files Import relational table column definitions Imported as Table Definitions Table definitions can be loaded into job stages

~ ~ ~

Sequential File Import Procedure


~

In Manager, click Import>Table Definitions>Sequential File Definitions Select directory containing sequential file and then the file Select Manager category Examine format and column definitions and edit if necessary

~ ~

Manager Table Definition

Importing Sequential Metadata

Intro Part 4

Designing and Documenting Jobs

Module Objectives
~

After this module you will be able to:


Describe what a DataStage job is List the steps involved in creating a job Describe links and stages Identify the different types of stages Design a simple extraction and load job Compile your job Create parameters to make your job flexible Document your job

What Is a Job?
~ ~

Executable DataStage program Created in DataStage Designer, but can use components from Manager Built using a graphical user interface Compiles into Orchestrate shell language (OSH)

~ ~

Job Development Overview


~

In Manager, import metadata defining sources and targets In Designer, add stages defining data extractions and loads Add Transformers and other stages to define data transformations Add links defining the flow of data from sources to targets Compile the job In Director, validate, run, and monitor your job

~ ~

Designer Work Area

Designer Toolbar
Provides quick access to the main functions of Designer Show/hide metadata markers

Job properties

Compile

Tools Palette

Adding Stages and Links


~

Stages can be dragged from the tools palette or from the stage type branch of the repository view Links can be drawn from the tools palette or by right clicking and dragging from one stage to another

Sequential File Stage


~

Used to extract data from, or load data to, a sequential file Specify full path to the file Specify a file format: fixed width or delimited Specify column definitions Specify write action

~ ~ ~ ~

Job Creation Example Sequence


~ ~

Brief walkthrough of procedure Presumes meta data already loaded in repository

Designer - Create New Job

Drag Stages and Links Using Palette

Assign Meta Data

Editing a Sequential Source Stage

Editing a Sequential Target

Transformer Stage
~

Used to define constraints, derivations, and column mappings A column mapping maps an input column to an output column In this module we will just define column mappings (no derivations)

Transformer Stage Elements

Create Column Mappings

Creating Stage Variables

Result

Adding Job Parameters


~ ~

Makes the job more flexible Parameters can be:


Used in constraints and derivations Used in directory and file names

Parameter values are determined at run time

Adding Job Documentation


~

Job Properties
Short and long descriptions Shows in Manager

Annotation stage
Is a stage on the tool palette Shows on the job GUI (work area)

Job Properties Documentation

Annotation Stage on the Palette

Annotation Stage Properties

Final Job Work Area with Documentation

Compiling a Job

Errors or Successful Message

Intro Part 5

Running Jobs

Module Objectives
~

After this module you will be able to:


Validate your job Use DataStage Director to run your job Set run options Monitor your job's progress View job log messages

Prerequisite to Job Execution


Result from Designer compile

DataStage Director
~ ~

Can schedule, validate, and run jobs Can be invoked from DataStage Manager or Designer
Tools > Run Director

Running Your Job

Run Options Parameters and Limits

Director Log View

Message Details are Available

Other Director Functions


~ ~ ~

Schedule job to run on a particular date/time Clear job log Set Director options
Row limits Abort after x warnings

Module 1

DSEE DataStage EE Review

Ascential's Enterprise Data Integration Platform


Command & Control

ANY SOURCE CRM ERP SCM RDBMS Legacy Real-time Client-server Web services Data Warehouse Other apps.

DISCOVER Gather relevant information for target enterprise applications


Data Profiling

PREPARE Cleanse, correct and match input data

TRANSFORM Standardize and enrich data and load to targets

ANY TARGET CRM ERP SCM BI/Analytics RDBMS Real-time Client-server Web services Data Warehouse Other apps.

Data Quality

Extract, Transform, Load

Parallel Execution Meta Data Management

Course Objectives
~

You will learn to:


Build DataStage EE jobs using complex logic Utilize parallel processing techniques to increase job performance Build custom stages based on application needs

Course emphasis is:


Advanced usage of DataStage EE Application job development Best practices techniques

Course Agenda
~

Day 1
Review of EE Concepts Sequential Access Standards DBMS Access

Day 3
Combining Data Configuration Files

Day 2
EE Architecture Transforming Data Sorting Data

Day 4
Extending EE Meta Data Usage Job Control Testing

Module Objectives
~

Provide a background for completing work in the DSEE course Tasks


Review concepts covered in DSEE Essentials course

Skip this module if you recently completed the DataStage EE essentials modules

Review Topics
~ ~

DataStage architecture DataStage client review


Administrator Manager Designer Director

~ ~

Parallel processing paradigm DataStage Enterprise Edition (DSEE)

Client-Server Architecture
Command & Control

Microsoft Windows NT/2000/XP

ANY SOURCE

ANY TARGET CRM ERP SCM BI/Analytics RDBMS Real-Time Client-server Web services Data Warehouse Other apps.

Designer

Director

Administrator

Repository Manager

Discover Extract

Prepare Transform Transform Cleanse

Extend Integrate

Server

Repository

Microsoft Windows NT or UNIX


Parallel Execution Meta Data Management

Process Flow
~ ~ ~ ~

Administrator add/delete projects, set defaults Manager import meta data, backup projects Designer assemble jobs, compile, and execute Director execute jobs, examine job run logs

Administrator Licensing and Timeout

Administrator Project Creation/Removal

Functions specific to a project.

Administrator Project Properties

RCP for parallel jobs should be enabled

Variables for parallel processing

Administrator Environment Variables

Variables are category specific

OSH is what is run by the EE Framework

DataStage Manager

Export Objects to MetaStage

Push meta data to MetaStage

Designer Workspace

Can execute the job from Designer

DataStage Generated OSH

The EE Framework runs OSH

Director Executing Jobs

Messages from previous run in different color

Stages
Can now customize the Designer's palette Select desired stages and drag to favorites

Popular Developer Stages

Row generator

Peek

Row Generator
~

Can build test data

Edit row in column tab

Repeatable property

Peek
~

Displays field values


Will be displayed in job log or sent to a file Skip records option Can control number of records to be displayed

Can be used as stub stage for iterative development (more later)

Why EE is so Effective
~

Parallel processing paradigm


More hardware, faster processing Level of parallelization is determined by a configuration file read at runtime

Emphasis on memory
Data read into memory and lookups performed like hash table

Parallel Processing Systems


~

DataStage EE Enables parallel processing = executing your application on multiple CPUs simultaneously
If you add more resources (CPUs, RAM, and disks) you increase system performance
1 2

3 4

Example system containing 6 CPUs (or processing nodes) and disks

Scalable Systems: Examples


Three main types of scalable systems
~

Symmetric Multiprocessors (SMP): shared memory and disk Clusters: UNIX systems connected via networks MPP: Massively Parallel Processing

~ ~


SMP: Shared Everything


Multiple CPUs with a single operating system Programs communicate using shared memory All CPUs share system resources (OS, memory with single linear address space, disks, I/O) When used with Enterprise Edition: Data transport uses shared memory Simplified startup

Enterprise Edition treats NUMA (NonUniform Memory Access) as plain SMP

Traditional Batch Processing

Operational Data

Transform

Clean

Load

Archived Data Disk

Data Warehouse Disk Disk

Source
Traditional approach to batch processing:
Write to disk and read from disk before each processing operation
Sub-optimal utilization of resources: a 10 GB stream leads to 70 GB of I/O; processing resources can sit idle during I/O
Very complex to manage (lots and lots of small jobs)
Becomes impractical with big data volumes: disk I/O consumes the processing time; terabytes of disk required for temporary staging

Target

Pipeline Multiprocessing
Data Pipelining
Transform, clean and load processes are executing simultaneously on the same processor rows are moving forward through the flow Operational Data

Archived Data

Transform

Clean

Load

Data Warehouse

Source

Target
Start a downstream process while an upstream process is still running. This eliminates intermediate storing to disk, which is critical for big data. This also keeps the processors busy. Still has limits on scalability

Think of a conveyor belt moving the rows from process to process!

Partition Parallelism
Data Partitioning
Break up big data into partitions
Run one partition on each processor: 4X faster on 4 processors; with data big enough, 100X faster on 100 processors
This is exactly how parallel databases work!
Data partitioning requires the same transform on all partitions: Aaron Abbott and Zygmund Zorn undergo the same transform
(Diagram: source data split by name ranges A-F, G-M, N-T, and U-Z, each range processed by a Transform running on Node 1 through Node 4.)

Combining Parallelism Types


Putting It All Together: Parallel Dataflow
Pipelining

Source Data

Transform

Clean

Data Warehouse

Load

Source

Target

Repartitioning
Putting It All Together: Parallel Dataflow with Repartitioning on-the-fly
Pipelining
(Diagram: pipelined Transform, Clean, and Load stages with on-the-fly repartitioning between them (e.g. by customer last name, customer zip code, and credit card number), flowing from source to the data warehouse target without landing to disk.)

EE Program Elements

Dataset: uniform set of rows in the Framework's internal representation


- Three flavors: 1. file sets *.fs : stored on multiple Unix files as flat files 2. persistent: *.ds : stored on multiple Unix files in Framework format, read and written using the DataSet Stage 3. virtual: *.v : links, in Framework format, NOT stored on disk - The Framework processes only datasets, hence the possible need for Import - Different datasets typically have different schemas - Convention: "dataset" = Framework data set.

Partition: subset of rows in a dataset earmarked for processing by the same node (virtual CPU, declared in a configuration file).
- All the partitions of a dataset follow the same schema: that of the dataset

DataStage EE Architecture
DataStage:
Provides data integration platform

Orchestrate Framework:
Provides application scalability
Orchestrate Program
(sequential data flow)

Flat Files
Import

Clean 1 Merge Clean 2 Analyze

Relational Data

Configuration File Orchestrate Application Framework and Runtime System

Centralized Error Handling and Event Logging Performance Visualization

Parallel access to data in RDBMS Parallel pipelining


Clean 1 Import Merge Clean 2 Analyze

Inter-node communications

Parallel access to data in files

Parallelization of operations

DataStage Enterprise Edition:


Best-of-breed scalable data integration platform No limitations on data volumes or throughput

Introduction to DataStage EE
DSEE:
Automatically scales to fit the machine Handles data flow among multiple CPUs and disks
~

With DSEE you can:


Create applications for SMPs, clusters and MPPs Enterprise Edition is architecture-neutral Access relational databases in parallel Execute external applications in parallel Store data across multiple disks and nodes

Job Design VS. Execution


Developer assembles data flow using the Designer

and gets: parallel access, propagation, transformation, and load. The design is good for 1 node, 4 nodes, or N nodes. To change # nodes, just swap configuration file. No need to modify or recompile the design

Partitioners and Collectors


~

Partitioners distribute rows into partitions


implement data-partition parallelism

~ ~

Collectors = inverse partitioners
Both live on the input links of stages: partitioners on stages running in parallel, collectors on stages running sequentially

Use a choice of methods

Example Partitioning Icons


partitioner

Exercise
~

Complete exercises 1-1, 1-2, and 1-3

Module 2

DSEE Sequential Access

Module Objectives
~

You will learn to:


Import sequential files into the EE Framework Utilize parallel processing techniques to increase sequential file access Understand usage of the Sequential, DataSet, FileSet, and LookupFileSet stages Manage partitioned data stored by the Framework

Types of Sequential Data Stages


~

Sequential
Fixed or variable length

~ ~ ~

File Set Lookup File Set Data Set

Sequential Stage Introduction


~ ~

The EE Framework processes only datasets For files other than datasets, such as flat files, Enterprise Edition must perform import and export operations; these are performed by import and export OSH operators generated by Sequential or FileSet stages During import or export DataStage performs format translations into, or out of, the EE internal format Data is described to the Framework in a schema

How the Sequential Stage Works


~

Generates Import/Export operators, depending on whether stage is source or target Performs direct C++ file I/O streams

Using the Sequential File Stage


Both import and export of general files (text, binary) are performed by the SequentialFile Stage.

Importing/Exporting Data
Data import:
EE internal format

Data export

EE internal format

Working With Flat Files


~

Sequential File Stage


Normally will execute in sequential mode Can be parallel if reading multiple files (file pattern option) Can use multiple readers within a node DSEE needs to know
y How the file is divided into rows
y How each row is divided into columns

Processes Needed to Import Data


~

Recordization
Divides input stream into records Set on the format tab

Columnization
Divides the record into columns Default set on the format tab but can be overridden on the columns tab Can be incomplete if using a schema or not even specified in the stage if using RCP

File Format Example

Record delimiter = newline (nl); Field Delimiter = comma

Final Delimiter = end:     Field 1 , Field 2 , Field 3 , Last field  nl
Final Delimiter = comma:   Field 1 , Field 2 , Field 3 , Last field , nl

Sequential File Stage


~

To set the properties, use stage editor


Page (general, input/output) Tabs (format, columns)

Sequential stage link rules


One input link One output link (except for reject link definition) One reject link
y

Will reject any records not matching meta data in the column definitions

Job Design Using Sequential Stages

Stage categories

General Tab Sequential Source

Multiple output links

Show records

Properties Multiple Files

Click to add more files having the same meta data.

Properties - Multiple Readers

Multiple readers option allows you to set number of readers

Format Tab

File into records

Record into columns

Read Methods

Reject Link
~ ~

Reject mode = output Source


All records not matching the meta data (the column definitions)

Target
All records that are rejected for any reason

Meta data: one column, data type = raw

File Set Stage


~ ~ ~

Can read or write file sets Files suffixed by .fs File set consists of:
1. Descriptor file contains location of raw data files + meta data 2. Individual raw data files

Can be processed in parallel

File Set Stage Example

Descriptor file

File Set Usage


~

Why use a file set?


2G limit on some file systems Need to distribute data among nodes to prevent overruns If used in parallel, runs faster than a sequential file

Lookup File Set Stage


~ ~

Can create file sets Usually used in conjunction with Lookup stages

Lookup File Set > Properties

Key column specified

Key column dropped in descriptor file

Data Set
~ ~ ~ ~

Operating system (Framework) file Suffixed by .ds Referred to by a control file Managed by Data Set Management utility from GUI (Manager, Designer, Director) Represents persistent data Key to good performance in set of linked jobs

~ ~

Persistent Datasets

~ ~

Accessed from/to disk with DataSet Stage. Two parts:


Descriptor file:
y

input.ds
record ( partno: int32; description: string; )

contains metadata, data location, but NOT the data itself

Data file(s)
y

contain the data y multiple Unix files (one per node), accessible in parallel

node1:/local/disk1/ node2:/local/disk2/

Quiz!

True or False?
Everything that has been data-partitioned must be collected in same job

Data Set Stage

Is the data partitioned?

Engine Data Translation


~

Occurs on import
From sequential files or file sets From RDBMS

Occurs on export
From datasets to file sets or sequential files From datasets to RDBMS

Engine most efficient when processing internally formatted records (I.e. data contained in datasets)

Managing DataSets
~

GUI (Manager, Designer, Director) tools > data set management Alternative methods
Orchadmin
Unix command line utility y List records y Remove data sets (will remove all components)
y

Dsrecords
y

Lists number of records in a dataset

Data Set Management

Display data

Schema

Data Set Management From Unix


~

Alternative method of managing file sets and data sets


Dsrecords
y

Gives record count


Unix command-line utility: $ dsrecords ds_name, e.g. $ dsrecords myDS.ds
156999 records

Orchadmin
y

Manages EE persistent data sets


Unix command-line utility, e.g. $ orchadmin rm myDataSet.ds

Exercise
~

Complete exercises 2-1, 2-2, 2-3, and 2-4.

Module 3

Standards and Techniques

Objectives
~

Establish standard techniques for DSEE development Will cover:


Job documentation Naming conventions for jobs, links, and stages Iterative job design Useful stages for job development Using configuration files for development Using environmental variables Job parameters

Job Presentation

Document using the annotation stage

Job Properties Documentation

Organize jobs into categories

Description shows in DS Manager and MetaStage

Naming conventions
~

Stages named after the


Data they access Function they perform DO NOT leave defaulted stage names like Sequential_File_0

Links named for the data they carry


DO NOT leave defaulted link names like DSLink3

Stage and Link Names

Stages and links renamed to data they handle

Create Reusable Job Components


~

Use Enterprise Edition shared containers when feasible

Container

Use Iterative Job Design


~ ~

Use copy or peek stage as stub Test job in phases small first, then increasing in complexity Use Peek stage to examine records

Copy or Peek Stage Stub

Copy stage

Transformer Stage Techniques


~

Suggestions:
Always include a reject link.
Always test for null values before using a column in a function.
Try to use RCP and only map columns that have a derivation other than a copy (more on RCP later).
Be aware of column and stage variable data types.
y Often the user does not pay attention to the stage variable type.
y Try to maintain the data type as imported; avoid type conversions.

The Copy Stage


With 1 link in, 1 link out:

the Copy Stage is the ultimate "no-op" (place-holder):


Partitioning, Sort, and Remove Duplicates can be inserted on its input link (Partitioning tab)
Rename and Drop column can be done on its output link (Mapping page)
Can sometimes replace the Transformer for simple rename/drop operations

Developing Jobs
1.

Keep it simple
Jobs with many stages are hard to debug and maintain. Use view data, copy, and peek. Start from source and work out. Develop with a 1 node configuration file.

2.

Start small and Build to final Solution


3.

Solve the business problem before the performance problem.


Don't worry too much about partitioning until the sequential flow works as expected.

4.

If you have to write to Disk use a Persistent Data set.

Final Result

Good Things to Have in each Job


~ ~

Use job parameters Some helpful environment variables to add as job parameters:
$APT_DUMP_SCORE
y Reports the OSH score to the message log
$APT_CONFIG_FILE
y Establishes runtime parameters for the EE engine, e.g. degree of parallelization
Setting Job Parameters

Click to add environment variables

DUMP SCORE Output


Setting APT_DUMP_SCORE yields:
Double-click Partitioner and Collector

Mapping Node--> partition

Use Multiple Configuration Files


~ ~ ~

Make a set for 1X, 2X, ... Use different ones for test versus production Include the configuration file as a parameter in each job (see the sketch below)
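A minimal command-line sketch of switching configurations between runs; the directory and file names here are hypothetical examples, not shipped defaults:

$ export APT_CONFIG_FILE=/opt/dsee/configs/1node.apt   # sequential/development runs
$ export APT_CONFIG_FILE=/opt/dsee/configs/4node.apt   # parallel test or production runs

In practice the same effect is usually achieved by exposing $APT_CONFIG_FILE as a job parameter and supplying the appropriate file name at run time.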

Exercise
~

Complete exercise 3-1

Module 4

DBMS Access

Objectives
~

Understand how DSEE reads and writes records to an RDBMS Understand how to handle nulls on DBMS lookup Utilize this knowledge to:
Read and write database tables Use database tables to lookup data Use null handling options to clean data

~ ~

Parallel Database Connectivity

Traditional Client-Server:
Only the RDBMS runs in parallel
Each application has only one connection
Suitable only for small data volumes

Enterprise Edition:
Parallel server runs the applications
Application has parallel connections to the RDBMS
Suitable for large data volumes
Higher levels of integration possible

RDBMS Access
Supported Databases

Enterprise Edition provides high performance / scalable interfaces for:


~ ~ ~ ~

DB2 Informix Oracle Teradata

RDBMS Access
~

Automatically convert RDBMS table layouts to/from Enterprise Edition Table Definitions RDBMS nulls converted to/from nullable field values Support for standard SQL syntax for specifying:
field list for SELECT statement filter for WHERE clause

~ ~

~ ~

Can write an explicit SQL query to access RDBMS EE supplies additional information in the SQL query

RDBMS Stages
~ ~ ~ ~

DB2/UDB Enterprise Informix Enterprise Oracle Enterprise Teradata Enterprise

RDBMS Usage
~

As a source
Extract data from table (stream link)
Extract as table, generated SQL, or user-defined SQL User-defined can perform joins, access views

Lookup (reference link)


Normal lookup is memory-based (all table data read into memory) Can perform one lookup at a time in DBMS (sparse option) Continue/drop/fail options
~

As a target
Inserts Upserts (Inserts and updates) Loader

RDBMS Source Stream Link

Stream link

DBMS Source - User-defined SQL

Columns in SQL statement must match the meta data in columns tab

Exercise
~

User-defined SQL
Exercise 4-1

DBMS Source Reference Link

Reject link

Lookup Reject Link

Output option automatically creates the reject link

Null Handling
~

Must handle null condition if lookup record is not found and continue option is chosen Can be done in a transformer stage

Lookup Stage Mapping

Link name

Lookup Stage Properties


Reference link

Must have same column name in input and reference links. You will get the results of the lookup in the output column.

DBMS as a Target

DBMS As Target
~

Write Methods
Delete Load Upsert Write (DB2)

Write mode for load method


Truncate Create Replace Append

Target Properties
Generated code can be copied

Upsert mode determines options

Checking for Nulls


~

Use Transformer stage to test for fields with null values (use IsNull functions) In Transformer, can reject or load a default value (see the sketch below)
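A hedged sketch of such a derivation; the link and column names (lkpCust.Discount) are hypothetical, and the exact function names should be confirmed against the parallel Transformer function list:

If IsNull(lkpCust.Discount) Then 0 Else lkpCust.Discount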

Exercise
~

Complete exercise 4-2

Module 5

Platform Architecture

Objectives
~

Understand how Enterprise Edition Framework processes data You will be able to:
Read and understand OSH Perform troubleshooting

Concepts
~

The Enterprise Edition Platform


Script language - OSH (generated by DataStage Parallel Canvas, and run by DataStage Director) Communication - conductor, section leaders, players Configuration files (only one active at a time, describes H/W) Meta data - schemas/tables Schema propagation - RCP EE extensibility - Buildop, Wrapper Datasets (data in Framework's internal representation)

DS-EE Stage Elements


EE Stages Involve A Series Of Processing Steps
Input Data Set schema: prov_num:int16; member_num:int8; custid:int32; Output Data Set schema: prov_num:int16; member_num:int8; custid:int32;

Piece of Application Logic Running Against Individual Records Parallel or Sequential

EE Stage

Partitioner

Business Logic

DSEE Stage Execution


Dual Parallelism Eliminates Bottlenecks! EE Delivers Parallelism in Two Ways
Pipeline Partition
Producer

Block Buffering Between Components


Eliminates Need for Program Load Balancing Maintains Orderly Data Flow
Consumer

Pipeline

Partition

Stages Control Partition Parallelism


~

Execution mode (sequential/parallel) is controlled by the stage; default = parallel for most Ascential-supplied stages
A parallel stage inserts the default partitioner (Auto) on its input links
A sequential stage inserts the default collector (Auto) on its input links
Developer can override the defaults:
y execution mode (parallel/sequential) on the Stage > Advanced tab
y choice of partitioner/collector on the Input > Partitioning tab

How Parallel Is It?


~

Degree of parallelism is determined by the configuration file

Total number of logical nodes in the default pool, or a subset if using "constraints"
y Constraints are assigned to specific pools as defined in the configuration file and can be referenced in the stage

OSH
~

DataStage EE GUI generates OSH scripts


Ability to view OSH turned on in Administrator OSH can be viewed in Designer using job properties

~ ~

The Framework executes OSH What is OSH?


Orchestrate shell Has a UNIX command-line interface

OSH Script
~

An osh script is a quoted string which specifies:


The operators and connections of a single Orchestrate step In its simplest form, it is:
osh op < in.ds > out.ds

Where:
op is an Orchestrate operator in.ds is the input data set out.ds is the output data set

OSH Operators
~

OSH Operator is an instance of a C++ class inheriting from APT_Operator Developers can create new operators Examples of existing operators:
Import Export RemoveDups

~ ~

Enable Visible OSH in Administrator

Will be enabled for all projects

View OSH in Designer

Operator

Schema

OSH Practice
~

Exercise 5-1 Instructor demo (optional)

Elements of a Framework Program


Operators Datasets: set of rows processed by Framework
Orchestrate data sets: persistent (terminal) *.ds, and virtual (internal) *.v. Also: flat file sets *.fs

Schema: data description (metadata) for datasets and links.

Datasets
Consist of Partitioned Data and Schema Can be Persistent (*.ds) or Virtual (*.v, Link) Overcome 2 GB File Limit
What you program (GUI): Operator A
What gets generated (OSH): $ osh operator_A > x.ds
What gets processed: a copy of Operator A runs on each node (Node 1 ... Node 4), each writing its own data files of x.ds
Multiple files per partition; each file up to 2 GB (or larger)

Computing Architectures: Definition


Dedicated Disk Shared Disk Shared Nothing

(Diagram: CPU, memory, and disk layout for each architecture.)

Uniprocessor
PC Workstation Single processor server

SMP System
(Symmetric Multiprocessor)
IBM, Sun, HP, Compaq 2 to 64 processors Majority of installations

Clusters and MPP Systems


2 to hundreds of processors MPP: IBM and NCR Teradata each node is a uniprocessor or SMP

Job Execution: Orchestrate


Conductor Node

Conductor - initial DS/EE process


Step Composer Creates Section Leader processes (one/node) Consolidates messages, outputs them Manages orderly shutdown.

Processing Node
SL

Section Leader
Forks Players processes (one/Stage) Manages up/down communication.

Processing Node
SL

Players
The actual processes associated with Stages Combined players: one process only Send stderr to SL

Communication:

Establish connections to other players for data flow Clean up upon completion.

- SMP: Shared Memory - MPP: TCP

Working with Configuration Files


~

You can easily switch between config files:
y '1-node' file - for sequential execution, lighter reports; handy for testing
y 'MedN-nodes' file - aims at a mix of pipeline and data-partitioned parallelism
y 'BigN-nodes' file - aims at full data-partitioned parallelism
Only one file is active while a step is running
y The Framework queries (first) the environment variable $APT_CONFIG_FILE
The # of nodes declared in the config file need not match the # of CPUs
y The same configuration file can be used on development and target machines

Scheduling Nodes, Processes, and CPUs


~

DS/EE does not:


know how many CPUs are available schedule

Who knows what?

Nodes = # logical nodes declared in config. file Ops = # ops. (approx. # blue boxes in V.O.) Processes = # Unix processes CPUs = # available CPUs

Nodes User Orchestrate O/S


~

Ops
N Y

Processes
Nodes * Ops "

CPUs
N Y

Y Y

Who does what?


DS/EE creates (Nodes*Ops) Unix processes The O/S schedules these processes on the CPUs

Configuring DSEE Node Pools


{ node "n1" { fastname "s1" pool "" "n1" "s1" "app2" "sort" resource disk "/orch/n1/d1" {} resource disk "/orch/n1/d2" {} resource scratchdisk "/temp" {"sort"} } node "n2" { fastname "s2" pool "" "n2" "s2" "app1" resource disk "/orch/n2/d1" {} resource disk "/orch/n2/d2" {} resource scratchdisk "/temp" {} } node "n3" { fastname "s3" pool "" "n3" "s3" "app1" resource disk "/orch/n3/d1" {} resource scratchdisk "/temp" {} } node "n4" { fastname "s4" pool "" "n4" "s4" "app1" resource disk "/orch/n4/d1" {} resource scratchdisk "/temp" {} }


Configuring DSEE Disk Pools


{ node "n1" { fastname "s1" pool "" "n1" "s1" "app2" "sort" resource disk "/orch/n1/d1" {} resource disk "/orch/n1/d2" {"bigdata"} resource scratchdisk "/temp" {"sort"} } node "n2" { fastname "s2" pool "" "n2" "s2" "app1" resource disk "/orch/n2/d1" {} resource disk "/orch/n2/d2" {"bigdata"} resource scratchdisk "/temp" {} } node "n3" { fastname "s3" pool "" "n3" "s3" "app1" resource disk "/orch/n3/d1" {} resource scratchdisk "/temp" {} } node "n4" { fastname "s4" pool "" "n4" "s4" "app1" resource disk "/orch/n4/d1" {} resource scratchdisk "/temp" {} }


Re-Partitioning
Parallel to parallel flow may incur reshuffling: Records may jump between nodes
node 1 node 2

partitioner

Partitioning Methods
~ ~ ~ ~ ~

Auto Hash Entire Range Range Map
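As an OSH-level illustration (a rough sketch; the key column custid is hypothetical and the operator options should be checked against the OSH reference), a hash partitioner can be placed explicitly in the flow:

$ osh "hash -key custid < in.ds | peek"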

Collectors
Collectors combine partitions of a dataset into a single input stream to a sequential Stage ...
data partitions

collector

Collectors do NOT synchronize data

sequential Stage

Partitioning and Repartitioning Are Visible On Job Design

Partitioning and Collecting Icons

Partitioner

Collector

Setting a Node Constraint in the GUI

Reading Messages in Director


~ ~ ~ ~

Set APT_DUMP_SCORE to true Can be specified as job parameter Messages sent to Director log If set, parallel job will produce a report showing the operators, processes, and datasets in the running job
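A minimal sketch, assuming a Unix shell session (the variable can equally be added as a job parameter, as noted above):

$ export APT_DUMP_SCORE=true

The next job run then writes the score report (operators, processes, and datasets) to the Director log.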

Messages With APT_DUMP_SCORE = True

Exercise
~

Complete exercise 5-2

Module 6

Transforming Data

Module Objectives
~

Understand ways DataStage allows you to transform data Use this understanding to:
Create column derivations using user-defined code or system functions Filter records based on business criteria Control data flow based on data conditions

Transformed Data
~

Transformed data is:


Outgoing column is a derivation that may, or may not, include incoming fields or parts of incoming fields May be comprised of system variables

Frequently uses functions performed on something (ie. incoming columns)


Divided into categories I.e.
y y y y y

Date and time Mathematical Logical Null handling More
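For instance, a simple derivation might concatenate two incoming columns into one outgoing column (a hedged sketch; the link and column names are hypothetical):

inLink.FirstName : " " : inLink.LastName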

Stages Review
~

Stages that can transform data


Transformer
Parallel y Basic (from Parallel palette)
y

Aggregator (discussed in later module)

Sample stages that do not transform data


Sequential FileSet DataSet DBMS

Transformer Stage Functions


~ ~

Control data flow Create derivations

Flow Control
~

Separate records flow down links based on data condition specified in Transformer stage constraints Transformer stage can filter records Other stages can filter records but do not exhibit advanced flow control
Sequential can send bad records down reject link Lookup can reject records based on lookup failure Filter can select records based on data value

~ ~

Rejecting Data
~

Reject option on sequential stage


Data does not agree with meta data Output consists of one column with binary data type

Reject links (from Lookup stage) result from the drop option of the property If Not Found
Lookup failed All columns on reject link (no column mapping option)

Reject constraints are controlled from the constraint editor of the transformer
Can control column mapping Use the Other/Log checkbox

Rejecting Data Example

Constraint Other/log option Property Reject Mode = Output If Not Found property

Transformer Stage Properties

Transformer Stage Variables


~ ~

First of transformer stage entities to execute Execute in order from top to bottom
Can write a program by using one stage variable to point to the results of a previous stage variable

Multi-purpose
Counters Hold values from previous rows to make comparisons Hold derivations to be used in multiple field derivations Can be used to control execution of constraints (see the sketch below)
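A hedged sketch of the counter idea: two stage variables with hypothetical names, where svCount restarts whenever the key column changes:

svCount:  If inLink.CustID = svPrevID Then svCount + 1 Else 1
svPrevID: inLink.CustID

Because stage variables execute top to bottom, svCount must be listed above svPrevID so the comparison still sees the previous row's key.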

Stage Variables

Show/Hide button

Transforming Data
~

Derivations
Using expressions Using functions
y

Date/time

Transformer Stage Issues


Sometimes require sorting before the transformer stage, e.g. when using a stage variable as an accumulator and needing to break on a change of column value

Checking for nulls

Checking for Nulls


~

Nulls can get introduced into the dataflow because of failed lookups and the way in which you choose to handle this condition Can be handled in constraints, derivations, stage variables, or a combination of these

Transformer - Handling Rejects

Constraint Rejects
All expressions are false and reject row is checked

Transformer: Execution Order

Derivations in stage variables are executed first Constraints are executed before derivations Column derivations in earlier links are executed before later links Derivations in higher columns are executed before lower columns

Parallel Palette - Two Transformers


~ ~ ~

All > Processing > Transformer Is the non-Universe transformer Has a specific set of functions No DS routines available

~ ~ ~

Parallel > Processing Basic Transformer Makes server style transforms available on the parallel palette Can use DS routines

Program in Basic for both transformers

Transformer Functions From Derivation Editor


~ ~ ~ ~ ~ ~

Date & Time Logical Null Handling Number String Type Conversion

Exercise
~

Complete exercises 6-1, 6-2, and 6-3

Module 7

Sorting Data

Objectives
~ ~

Understand DataStage EE sorting options Use this understanding to create sorted list of data to enable functionality within a transformer stage

Sorting Data
~

Important because
Some stages require sorted input Some stages may run faster, e.g. Aggregator

Can be performed
Option within stages (use input > partitioning tab and set partitioning to anything other than auto) As a separate stage (more complex sorts)

Sorting Alternatives

Alternative representation of same flow:

Sort Option on Stage Link

Sort Stage

Sort Utility
~ ~

DataStage (the default) UNIX

Sort Stage - Outputs


~

Specifies how the output is derived

Sort Specification Options


~

Input Link Property


Limited functionality Max memory/partition is 20 MB, then spills to scratch

Sort Stage
Tunable to use more memory before spilling to scratch.

Note: Spread I/O by adding more scratch file systems to each node of the APT_CONFIG_FILE

Removing Duplicates
~

Can be done by Sort stage


Use unique option

OR

Remove Duplicates stage


Has more sophisticated ways to remove duplicates

Exercise
~

Complete exercise 7-1

Module 8

Combining Data

Objectives
~

Understand how DataStage can combine data using the Join, Lookup, Merge, and Aggregator stages Use this understanding to create jobs that will
Combine data from separate input streams Aggregate data to form summary totals

Combining Data
~

There are two ways to combine data:


Horizontally: Several input links; one output link (+ optional rejects) made of columns from different input links. E.g.,
Joins y Lookup y Merge
y

Vertically: One input link, one output link with column combining values from all input rows. E.g.,
y

Aggregator

Join, Lookup & Merge Stages


~

These "three Stages" combine two or more input links according to values of user-designated "key" column(s). They differ mainly in:
Memory usage Treatment of rows with unmatched key values Input requirements (sorted, de-duplicated)

Not all Links are Created Equal


Enterprise Edition distinguishes between:
- The Primary Input (Framework port 0) - Secondary - in some cases "Reference" (other ports)

Naming convention:
                Primary Input (port 0)    Secondary Input(s) (ports 1, ...)
Joins           Left                      Right
Lookup          Source                    LU Table(s)
Merge           Master                    Update(s)

Tip: Check "Input Ordering" tab to make sure intended Primary is listed first

Join Stage Editor

Link Order immaterial for Inner and Full Outer Joins (but VERY important for Left/Right Outer and Lookup and Merge)

One of four variants: Inner Left Outer Right Outer Full Outer

Several key columns allowed

1. The Join Stage


Four types: Inner Left Outer Right Outer Full Outer

2 sorted input links, 1 output link


"left outer" on primary input, "right outer" on secondary input Pre-sort make joins "lightweight": few rows need to be in RAM

2. The Lookup Stage


Combines:
one source link with one or more duplicate-free table links Source input One or more tables (LUTs)

No pre-sort necessary; allows multiple key LUTs; flexible exception handling for source input rows with no match

0 0

2 1

Lookup

Output

Reject

The Lookup Stage


~

Lookup Tables should be small enough to fit into physical memory (otherwise, performance hit due to paging) On an MPP you should partition the lookup tables using entire partitioning method, or partition them the same way you partition the source link On an SMP, no physical duplication of a Lookup Table occurs

The Lookup Stage


~

Lookup File Set


Like a persistent data set only it contains metadata about the key. Useful for staging lookup tables

RDBMS LOOKUP
NORMAL
y Loads the table into an in-memory hash table first
SPARSE
y Issues a select for each source row
y Might become a performance bottleneck

3. The Merge Stage


~

Combines
one sorted, duplicate-free master (primary) link with one or more sorted update (secondary) links. Pre-sort makes merge "lightweight": few rows need to be in RAM (as with joins, but opposite to lookup).

Follows the Master-Update model:


Master row and one or more updates row are merged if they have the same value in user-specified key column(s). A non-key column occurs in several inputs? The lowest input port number prevails (e.g., master over update; update values are ignored) Unmatched ("Bad") master rows can be either
y y

kept dropped

Unmatched ("Bad") update rows in input link can be captured in a "reject" link Matched update rows are consumed.

The Merge Stage


Allows composite keys
Master One or more updates

Multiple update links Matched update rows are consumed Unmatched updates can be captured

0 0
Merge

1 1

Lightweight
2

Space/time tradeoff: presorts vs. inRAM table


Rejects

Output

Synopsis: Joins, Lookup, & Merge

                                    Joins                        Lookup                              Merge
Model                               RDBMS-style relational       Source - in RAM LU Table            Master - Update(s)
Memory usage                        light                        heavy                               light
# and names of inputs               exactly 2: 1 left, 1 right   1 Source, N LU Tables               1 Master, N Update(s)
Mandatory input sort                both inputs                  no                                  all inputs
Duplicates in primary input         OK (x-product)               OK                                  Warning!
Duplicates in secondary input(s)    OK (x-product)               Warning!                            OK only when N = 1
Options on unmatched primary        NONE                         [fail] | continue | drop | reject   [keep] | drop
Options on unmatched secondary      NONE                         NONE                                capture in reject set(s)
On match, secondary entries are     reusable                     reusable                            consumed
# Outputs                           1                            1 out, (1 reject)                   1 out, (N rejects)
Captured in reject set(s)           Nothing (N/A)                unmatched primary entries           unmatched secondary entries

In this table, the comma is the separator between primary and secondary input links (and between out and reject links).

The Aggregator Stage


Purpose: Perform data aggregations Specify:
~

Zero or more key columns that define the aggregation units (or groups) Columns to be aggregated Aggregation functions:
count (nulls/non-nulls) sum max/min/range

~ ~

The grouping method (hash table or pre-sort) is a performance issue

Grouping Methods
~

Hash: results for each aggregation group are stored in a hash table, and the table is written out after all input has been processed
doesn't require sorted data; good when the number of unique groups is small. The running tally for each group's aggregate calculations needs to fit easily into memory; requires about 1 KB/group of RAM. Example: average family income by state requires about 0.05 MB of RAM (see the worked figure below).

Sort: results for only a single aggregation group are kept in memory; when new group is seen (key value changes), current group written out.
requires input sorted by grouping keys can handle unlimited numbers of groups Example: average daily balance by credit card
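Worked figure for the hash-method example above: 50 states (groups) x ~1 KB/group = ~50 KB, i.e. roughly 0.05 MB of RAM.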

Aggregator Functions
~ ~ ~ ~ ~ ~

Sum Min, max Mean Missing value count Non-missing value count Percent coefficient of variation

Aggregator Properties

Aggregation Types

Aggregation types

Containers
~

Two varieties
Local Shared

Local
Simplifies a large, complex diagram

Shared
Creates reusable object that many jobs can include

Creating a Container
~ ~ ~

Create a job Select (loop) portions to containerize Edit > Construct container > local or shared

Using a Container
~

Select as though it were a stage

Exercise
~

Complete exercise 8-1

Module 9

Configuration Files

Objectives
~

Understand how DataStage EE uses configuration files to determine parallel behavior Use this understanding to
Build a EE configuration file for a computer system Change node configurations to support adding resources to processes that need them Create a job that will change resource allocations at the stage level

Configuration File Concepts


~

Determine the processing nodes and disk space connected to each node When the system changes, need only change the configuration file; no need to recompile jobs When a DataStage job runs, the platform reads the configuration file
Platform automatically scales the application to fit the system

Processing Nodes Are


~

Locations on which the framework runs applications Logical rather than physical construct Do not necessarily correspond to the number of CPUs in your system
Typically one node for two CPUs

~ ~

Can define one processing node for multiple physical nodes or multiple processing nodes for one physical node

Optimizing Parallelism
~

Degree of parallelism determined by number of nodes defined Parallelism should be optimized, not maximized
Increasing parallelism distributes work load but also increases Framework overhead

Hardware influences degree of parallelism possible System hardware partially determines configuration

More Factors to Consider


~

Communication amongst operators


Should be optimized by your configuration Operators exchanging large amounts of data should be assigned to nodes communicating by shared memory or high-speed link

SMP leave some processors for operating system Desirable to equalize partitioning of data Use an experimental approach
Start with small data sets Try different parallelism while scaling up data set sizes

~ ~

Factors Affecting Optimal Degree of Parallelism


~

CPU intensive applications


Benefit from the greatest possible parallelism

Applications that are disk intensive


Number of logical nodes equals the number of disk spindles being accessed

Configuration File
~

Text file containing string data that is passed to the Framework


Sits on server side Can be displayed and edited

Name and location found in the environment variable APT_CONFIG_FILE Components


Node Fast name Pools Resource

Node Options
~

Node name name of a processing node used by EE


Typically the network name Use the command uname -n to obtain the network name

Fastname
Name of node as referred to by fastest network in the system Operators use physical node name to open connections NOTE: for SMP, all CPUs share single connection to network

Pools
Names of pools to which this node is assigned Used to logically group nodes Can also be used to group resources

Resource
Disk Scratchdisk

Sample Configuration File


{ node Node1" { fastname "BlackHole" pools "" "node1" resource disk "/usr/dsadm/Ascential/DataStage/Datasets" {pools "" } resource scratchdisk "/usr/dsadm/Ascential/DataStage/Scratch" {pools "" } } }

Disk Pools
~

Disk pools allocate storage By default, EE uses the default disk pool, specified by ""
Named pools (e.g. "bigdata") can also be defined and requested
~

Sorting Requirements
Resource pools can also be specified for sorting:
~

The Sort stage looks first for scratch disk resources in a sort pool, and then in the default disk pool
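A hedged configuration-file fragment showing how resources can be placed in named pools (paths are illustrative; the syntax follows the sample configuration files shown in this course):

resource disk "/orch/n1/d2" {pools "bigdata"}
resource scratchdisk "/temp" {pools "sort"}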

Another Configuration File Example


{ node "n1" { fastname s1" pool "" "n1" "s1" "sort" resource disk "/data/n1/d1" {} resource disk "/data/n1/d2" {} resource scratchdisk "/scratch" } node "n2" { fastname "s2" pool "" "n2" "s2" "app1" resource disk "/data/n2/d1" {} resource scratchdisk "/scratch" } node "n3" { fastname "s3" pool "" "n3" "s3" "app1" resource disk "/data/n3/d1" {} resource scratchdisk "/scratch" } node "n4" { fastname "s4" pool "" "n4" "s4" "app1" resource disk "/data/n4/d1" {} resource scratchdisk "/scratch" } ... }

{"sort"}

{}

4 2 1

5 3

{}

{}

Resource Types
~ ~ ~ ~ ~ ~ ~

Disk Scratchdisk DB2 Oracle Saswork Sortwork Can exist in a pool


Groups resources together

Using Different Configurations

Lookup stage where DBMS is using a sparse lookup type

Building a Configuration File


~

Scoping the hardware:


Is the hardware configuration SMP, Cluster, or MPP? Define each node structure (an SMP would be single node):
y y y y y

Number of CPUs CPU speed Available memory Available page/swap space Connectivity (network/back-panel speed)

Is the machine dedicated to EE? If not, what other applications are running on it? Get a breakdown of the resource usage (vmstat, mpstat, iostat) Are there other configuration restrictions? E.g. DB only runs on certain nodes and ETL cannot run on them?

Exercise
~

Complete exercise 9-1 and 9-2

Module 10

Extending DataStage EE

Objectives
~

Understand the methods by which you can add functionality to EE Use this understanding to:
Build a DataStage EE stage that handles special processing needs not supplied with the vanilla stages Build a DataStage EE job that uses the new stage

EE Extensibility Overview Sometimes it will be to your advantage to leverage EE's extensibility. This extensibility includes:
~ ~ ~

Wrappers Buildops Custom Stages

When To Leverage EE Extensibility


Types of situations: Complex business logic, not easily accomplished using standard EE stages Reuse of existing C, C++, Java, COBOL, etc

Wrappers vs. Buildop vs. Custom


~

Wrappers are good if you cannot or do not want to modify the application and performance is not critical. Buildops: good if you need custom coding but do not need dynamic (runtime-based) input and output interfaces. Custom (C++ coding using framework API): good if you need custom coding and need dynamic input and output interfaces.

Building Wrapped Stages


You can wrapper a legacy executable:
y Binary
y Unix command
y Shell script
and turn it into an Enterprise Edition stage capable, among other things, of parallel execution
As long as the legacy executable is:
y amenable to data-partition parallelism (no dependencies between rows)
y pipe-safe (can read rows sequentially; no random access to data)

Wrappers (Contd)

Wrappers are treated as a black box


~ ~

EE has no knowledge of contents EE has no means of managing anything that occurs inside the wrapper EE only knows how to export data to and import data from the wrapper User must know at design time the intended behavior of the wrapper and its schema interface If the wrappered application needs to see all records prior to processing, it cannot run in parallel.

LS Example

Can this command be wrappered?

Creating a Wrapper

To create the ls stage

Used in this job ---

Wrapper Starting Point

Creating Wrapped Stages


From Manager:
Right-Click on Stage Type > New Parallel Stage > Wrapped

We will "Wrapper an existing Unix executables the ls command

Wrapper - General Page

Name of stage

Unix command to be wrapped

The "Creator" Page

Conscientiously maintaining the Creator page for all your wrapped stages will eventually earn you the thanks of others.

Wrapper Properties Page


~

If your stage will have properties appear, complete the Properties page

This will be the name of the property as it appears in your stage

Wrapper - Wrapped Page

Interfaces: input and output columns; these should first be entered into the Table Definitions meta data (DS Manager); let's do that now.

Interface schemas
Layout interfaces describe what columns the stage:
Needs for its inputs (if any) Creates for its outputs (if any) Should be created as tables with columns in Manager

Column Definition for Wrapper Interface

How Does the Wrapping Work?

Define the schema for export and import y Schemas become interface schemas of the operator and allow for by-name column access

input schema -> export -> stdin or named pipe -> UNIX executable -> stdout or named pipe -> import -> output schema

QUIZ: Why does export precede import?

Update the Wrapper Interfaces


~

This wrapper will have no input interface i.e. no input link. The location will come as a job parameter that will be passed to the appropriate stage property. Therefore, only the Output tab entry is needed.

Resulting Job

Wrapped stage

Job Run
~

Show file from Designer palette

Wrapper Story: Cobol Application


~

Hardware Environment:
IBM SP2, 2 nodes with 4 CPUs per node.

Software:
DB2/EEE, COBOL, EE

Original COBOL Application:


Extracted source table, performed lookup against table in DB2, and Loaded results to target table. 4 hours 20 minutes sequential execution

Enterprise Edition Solution:


Used EE to perform Parallel DB2 Extracts and Loads Used EE to execute COBOL application in Parallel EE Framework handled data transfer between DB2/EEE and COBOL application 30 minutes 8-way parallel execution

Buildops
Buildop provides a simple means of extending beyond the functionality provided by EE, but does not use an existing executable (like the wrapper) Reasons to use Buildop include:
~ ~

Speed / Performance Complex business logic that cannot be easily represented using existing stages
Lookups across a range of values Surrogate key generation Rolling aggregates

Build once and reusable everywhere within project, no shared container necessary Can combine functionality from different stages into one

BuildOps
The DataStage programmer encapsulates the business logic The Enterprise Edition interface called buildop automatically performs the tedious, error-prone tasks: invoke needed header files, build the necessary plumbing for a correct and efficient parallel execution. Exploits extensibility of EE Framework

BuildOp Process Overview

From Manager (or Designer):


Repository pane: Right-Click on Stage Type > New Parallel Stage > {Custom | Build | Wrapped}

"Build" stages from within Enterprise Edition "Wrapping existing Unix


executables

General Page
Identical to Wrappers, except:

Under the Build Tab, your program!

Logic Tab for Business Logic

Enter business C/C++ logic and arithmetic in four pages under the Logic tab Main code section goes in the Per-Record page - it will be applied to all rows NOTE: code must be ANSI C/C++ compliant; if it does not compile outside of EE, it won't compile within EE either!

Code Sections under Logic Tab

Temporary variables declared [and initialized] here

Logic here is executed once BEFORE processing the FIRST row

Logic here is executed once AFTER processing the LAST row

I/O and Transfer


Under Interface tab: Input, Output & Transfer pages

First line: output 0

Optional renaming of output port from default "out0"

Write row

In-Repository Table Definition

Input page: 'Auto Read' Read next row

'False' setting, not to interfere with Transfer page

I/O and Transfer

First line: Transfer of index 0

Transfer all columns from input to output. If page left blank or Auto Transfer = "False" (and RCP = "False") Only columns in output Table Definition are written

BuildOp Simple Example


~

Example - sumNoTransfer
Add input columns "a" and "b"; ignore other columns that might be present in input Produce a new "sum" column Do not transfer input columns

input schema: a:int32; b:int32  ->  [sumNoTransfer]  ->  output schema: sum:int32
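As a minimal sketch (not necessarily the exact exercise solution), the Per-Record page for this stage could consist of a single line, relying on by-name access to the interface columns a, b, and sum:

    sum = a + b;   // compute the new output column; the input columns are not transferred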

No Transfer

From Peek:

NO TRANSFER:
RCP set to "False" in the stage definition, and the Transfer page left blank or Auto Transfer = "False"

Effects:
Input columns "a" and "b" are not transferred
Only the new column "sum" is transferred

Compare with transfer ON

Transfer

TRANSFER:
RCP set to "True" in the stage definition, or Auto Transfer set to "True"

Effects:
The new column "sum" is transferred, as well as input columns "a" and "b" and input column "ignored" (present in the input, but not mentioned in the stage)

Columns vs. Temporary C++ Variables


Columns:
DS-EE type
Defined in Table Definitions
Value refreshed from row to row

Temporary C++ variables:
C/C++ type
Need declaration (in the Definitions or Pre-Loop page)
Value persistent throughout the "loop" over rows, unless modified in code

Exercise
~

Complete exercise 10-1 and 10-2

Exercise
~

Complete exercises 10-3 and 10-4

Custom Stage
~

Reasons for a custom stage:


Add an EE operator that is not already in DataStage EE
Build your own operator and add it to DataStage EE

Use the EE API
Use a Custom stage to add the new operator to the EE canvas

Custom Stage
DataStage Manager > select Stage Types branch > right click

Custom Stage

Number of input and output links allowed

Name of Orchestrate operator to be used

Custom Stage Properties Tab

The Result

Module 11

Meta Data in DataStage EE

Objectives
~

Understand how EE uses meta data, particularly schemas and runtime column propagation
Use this understanding to:
Build schema definition files to be invoked in DataStage jobs
Use RCP to manage meta data usage in EE jobs

Establishing Meta Data


~

Data definitions
Recordization and columnization
Fields have properties that can be set at the individual field level
Data types in the GUI are translated to types used by EE

Described as properties on the Format/Columns tabs (Outputs or Inputs pages), OR using a schema file (can be full or partial)

Schemas
Can be imported into Manager
Can be pointed to by some job stages (e.g. Sequential)

Data Formatting Record Level


~ ~ ~

Format tab
Meta data is described on a record basis
Record-level properties

Data Formatting Column Level


~

Defaults for all columns

Column Overrides
~ ~

Edit Row from within the Columns tab
Set individual column properties

Extended Column Properties

Field and string settings

Extended Properties String Type


Note the ability to convert ASCII to EBCDIC

Editing Columns

Properties depend on the data type

Schema
~

An alternative way to specify column definitions for data used in EE jobs
Written in a plain text file
Can be written as a partial record definition
Can be imported into the DataStage repository

~ ~ ~

Creating a Schema
~

Using a text editor


Follow the correct syntax for the definitions (see the sketch after this list), OR

Import from an existing data set or file set

In DataStage Manager: Import > Table Definitions > Orchestrate Schema Definitions
Select the checkbox for a file with a .fs or .ds extension
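A minimal sketch of a schema definition file, assuming a comma-delimited source and hypothetical column names; the record-level properties in braces are examples only and depend on your actual file format:

    record
      {final_delim=end, delim=',', quote=double}
    (
      custId:int32;
      custName:string[max=30];
      orderDate:date;
      balance:nullable decimal[10,2];
    )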

Importing a Schema

The schema location can be on the server or on the local workstation

Data Types
Date
Decimal
Floating point
Integer
String
Time
Timestamp
Vector
Subrecord
Raw
Tagged

Runtime Column Propagation


~

DataStage EE is flexible about meta data: it can cope with the situation where meta data isn't fully defined. You can define part of your schema and specify that, if your job encounters extra columns that are not defined in the meta data when it actually runs, it will adopt these extra columns and propagate them through the rest of the job. This is known as runtime column propagation (RCP).
RCP is always on at runtime; the RCP settings control design-time and compile-time column-mapping enforcement.
RCP is off by default. Enable it first at the project level (Administrator, project properties), then at the job level (Job Properties, General tab), and at the stage level (output link Columns tab).

~ ~

Enabling RCP at Project Level

Enabling RCP at Job Level

Enabling RCP at Stage Level


~ ~

Go to the output link's Columns tab
For a Transformer, you can find the output link's Columns tab by first going to stage properties

Using RCP with Sequential Stages


~

To use runtime column propagation in the Sequential stage you must use the "use schema" option
Stages with this restriction:
Sequential
File Set
External Source
External Target

Runtime Column Propagation


~

When RCP is Disabled


DataStage Designer will enforce stage Input Column to Output Column mappings. At job compile time, Modify operators are inserted on the output links in the generated osh.

Runtime Column Propagation


~

When RCP is Enabled


DataStage Designer will not enforce mapping rules. No Modify operator is inserted at compile time. There is a danger of runtime errors if incoming column names do not match the column names on the outgoing link (column names are case sensitive).

Exercise
~

Complete exercises 11-1 and 11-2

Module 12

Job Control Using the Job Sequencer

Objectives
~

Understand how the DataStage Job Sequencer works
Use this understanding to build a control job to run a sequence of DataStage jobs

Job Control Options


~

Manually write job control


Code is generated in BASIC
Use the Job Control tab on the Job Properties page
Generates BASIC code which you can modify

Job Sequencer
Build a controlling job much the same way you build other jobs
Comprised of stages and links
No BASIC coding

Job Sequencer
~ ~ ~ ~

Built like a regular job, but of type "Job Sequence"
Has stages and links
The Job Activity stage represents a DataStage job
Links represent passing control
Stages

Example
Job Activity stage contains conditional triggers

Job Activity Properties

Job to be executed: select from the dropdown list

Job parameters to be passed

Job Activity Trigger

~ ~

Trigger appears as a link in the diagram
Custom options let you define the code

Options
~

Use the custom option for conditionals (see the example expression after this list)

Execute if the job run was successful or had warnings only

Can add a Wait For File activity before executing
Add an Execute Command stage to drop the real tables and rename the new tables to the current tables
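For illustration only, a custom trigger expression of the kind mentioned above (with a hypothetical Job Activity named Job_1) could test the activity's job status so that the link fires when the job finished cleanly or with warnings only; check the expression editor for the exact syntax supported in your release:

    Job_1.$JobStatus = DSJS.RUNOK Or Job_1.$JobStatus = DSJS.RUNWARN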

Job Activity With Multiple Links

Different links can have different triggers

Sequencer Stage
~

Build a job sequencer as the control job for the collections application

Can be set to All or Any

Notification Stage

Notification

Notification Activity

Sample DataStage log from Mail Notification


~

Sample DataStage log from Mail Notification

Notification Activity Message


~

E-Mail Message

Exercise
~

Complete exercise 12-1

Module 13

Testing and Debugging

Objectives
~

Understand the spectrum of tools available to perform testing and debugging
Use this understanding to troubleshoot a DataStage job

Environment Variables

Parallel Environment Variables

Environment Variables
Stage Specific

Environment Variables

Environment Variables
Compiler

The Director
Typical Job Log Messages:
Environment variables
Configuration file information
Framework info/warning/error messages
Output from the Peek stage
Additional info with "Reporting" environment variables
Tracing/debug output (must compile the job in trace mode; adds overhead)

Job Level Environmental Variables


Set in Job Properties, from the menu bar of Designer
Director will prompt you for their values before each run

Troubleshooting
If you get an error during compile, check the following:
~

Compilation problems:
If a Transformer is used, check the C++ compiler and LD_LIBRARY_PATH
If there are Buildop errors, try running buildop from the command line
Some stages may not support RCP, which can cause a column mismatch
Use the Show Error and More buttons
Examine the generated OSH
Check environment variable settings
There is very little integrity checking during compile; you should also run Validate from Director

Highlights source of error

Generating Test Data

Row Generator stage can be used


Column definitions (generation options are data type dependent)

Row Generator plus Lookup stages provide a good way to create robust test data from pattern files
