
Training course: DataStage (part 1)

V. BEYET
03/07/2006

Presentation
Who am I?
Who are you?

Summary

General presentation (DataStage: what is it?)
DataStage: how to use it?
The other components (part 2)

General presentation
DataStage: what is it?

An ETL tool: Extract - Transform - Load
A graphical environment
A tool integrated into a suite of BI tools
Developed by Ascential (now part of IBM)

General presentation
DataStage: why use it?

Large volumes of data
Multi-source and multi-target:
files,
databases (Oracle, SQL Server, Access, ...).

Data transformation:
select,
format,
combine,
aggregate,
sort.

General presentation
DataStage: how does it work?

Development is done:
in client-server mode,
with a graphical design of the flows,
with simple, basic elements,
with a simple language (BASIC).

Jobs are:
compiled and run by an engine,
stored in a Universe database.

General presentation
The different tools:

Designer
Manager
Server
Director
Administrator

General presentation
Server

The server contains programs and data.

The programs:
called jobs; first as source code, then as executable programs, stored in the Universe database.
But we cannot read this generated source code directly.

The data:
may be stored in the Universe database, but it is better to keep it in server directories.

General presentation
Server

What is a project for DataStage?

A server is organized into separate environments called projects.
A project is an isolated environment for jobs, table definitions and routines.
A project can be created at any time.
The number of projects is unlimited.
The number of jobs per project is unlimited.
But the number of simultaneous client connections is limited.

General presentation
Server

Universe database:
The Universe database is a relational database stored in files.
Its tables are called "hash files".
A hash file is an indexed file; it is the central element for exploiting all the possibilities of the DataStage engine.
A hash file with incorrectly defined keys may cause disastrous problems.

Summary

General presentation (DataStage: what is it?)
DataStage: how to use it?
The other components (part 2)

The designer

The Designer is used to design jobs (look at the icon).

Jobs are composed of:
active stages: actions,
passive stages: data storage,
links between the stages.

The designer
Passive stages:
A passive stage is a place for data storage (the data flow goes from the stage or to the stage).

Text file: a sequential file.

Hash file: can be handled only by DataStage (not by WordPad, ...), but simultaneous access to a hash file is possible.

UV stage: the file lives in the Universe core (the DataStage engine).

ODBC, OLEDB and ORAOCI stages: representations of a database; they give direct access to a database, for example through an ODBC link.

The designer
Active stages:
An active stage represents a transformation applied to the data flow:

Sort: sorts a file,
Aggregator: calculations,
Transformer: selection, transformation, propagation of columns.

The designer
Links:

between active and passive stages,
between passive stages,
between active stages.

The designer
A job in the designer (screenshot showing active and passive stages connected by links).

The designer

DataStage Designer:
Each job has:
- one or more sources of data,
- one or more transformations,
- one or more destinations for the data.
The toolbar contains the stage icons used to design the jobs.
Jobs have to be compiled to create executable programs.

The designer
(Screenshot of the Designer window: the buttons to compile and run the job, the repository, and the toolbar with the stage icons (palette).)

The designer
Let's now study the different stages:
Sequential files (text files)
Transformer
Hash files
Sort
Aggregator
Routines
UV stages

The designer

Sequential File stage:
can be read,
can be written,
can be read and written in the same job,
can be written with or without caching,
can be a DOS file or a Unix file,
can be read by two jobs at the same time,
cannot be written by two jobs at the same time.

The designer

Sequential File stage (screenshot): stage name, file type, stage description.

The designer

Sequential File stage (screenshot): the output link and the name of the file to be written.

The designer

Sequential File stage (screenshot): the data format of the output file; always use those values.

The designer
Sequential File stage (screenshot): the columns of the output file (type, length), the size to display (for View Data), and the button to test the connection and view the data in the file.

The designer

Sequential File stage:
To describe a file easily, use or create a table definition.
Group your table definitions by application.
Create or modify the table definitions (for files, databases, transformers, ...).

The designer
Sequential File stage:
A table definition can then be reused in different jobs (click Load to find the right definition).

The designer
Sequential File stage (screenshot): View Data.

The designer

Transformer stage:
multi-source and multi-target,
waits for the source data to be available,
performs lookups between two flows (reference),
transforms or propagates the data of each flow,
allows you to select, filter, and create a rejects file.

The designer

Transformer stage:
Can process the data with:
native BASIC functions or functions created in the Manager,
DataStage functions or DataStage macros,
routines (before/after type),
or simply propagate the columns (a sketch of a typical derivation follows below).

The designer
Transformer stage (screenshot): input data on the left, output data on the right; right-click to propagate all the columns.

The designer

Transformer stage (screenshot): output data and input data in the Transformer editor.

The designer

Exercise 1:
Objective: read a sequential file and create a new one (save the file).
The catalogue.in file has to be read and the catalogue_save.tmp file has to be written.
Source file: catalogue.in (in the \in directory)
Target file: catalogue_save.tmp (in the \tmp directory)
Steps:
1- Create a table definition (structure of the Catalogue table)
2- Design the job with 2 Sequential File stages and 1 Transformer
3- Create the links (data flow)
4- Save and compile the job
5- Run the job
6- Look at the performance statistics (right click)

The designer
Transformer stage:
Look at the performance of your job: right-click on the grid and select "Show performance statistics".

The designer

Create the parameters of the job:
menu Edit > Job Properties, Parameters tab.

The designer

Exercise 2:
Objective: use job parameters (environment variables).
- Create a job parameter: directory,
- use it in all the paths of the job from the first exercise (example: #directory#\tmp),
- compile,
- modify your input file (add your best film),
- run with a different path (other groups).

The designer
Hash File stage:

Necessary for a lookup.
A hash file is entirely written before it can be read (the FromTrans link must be finished before FromFilmTypeHF can start).
Allows multiple records with the same key to be grouped (suppresses duplicate keys).
Can be read by different jobs simultaneously.
Can be written by different links simultaneously (in the same job or in different jobs).

The designer
Hash File stage (screenshot): stage name, account name (DataStage project) or file path.

The designer
Hash File stage, for files to write (screenshot): the file name, and the write-cache check box:
"Select this check box to specify that all records should be cached, rather than written to the hashed file immediately. This is not recommended where your job writes and reads to the same hashed file in the same stream of execution."

The designer
Hash File stage:
A key must be defined (it can be a single-column or a multi-column key).

The designer

Transformer stage: lookup

The main flow can come from any type of stage.
The secondary flow must come from a hash file to build a lookup (so very often you will have to create a temporary hash file).
The lookup is done on the key of the secondary flow.
The number of records in the main flow cannot be higher after the lookup than before it.
The lookup is shown with a dotted line.
When a lookup is exclusive, the number of records after the lookup is smaller than the number of records before it.

The designer
Transformer stage: lookup (screenshot): the reference flow arrives vertically, the principal flow horizontally.

The designer

Exercise 3:
Objective: make a lookup between the Catalogue file and the FilmType file to put the film type in the output file.
Source file: catalogue.in (in the \in directory)
Target file: catalogue.out (in the \out directory)
Steps:
1- Create a table definition (structure of the FilmType table)
2- Modify your job to create a hash file from the FilmType.in file
3- Create the link to make the lookup (data flow)
4- Save and compile the job
5- Run the job
6- Look at the performance statistics (right click)

The designer

Exercise 4:
Objective: concatenate the director name and the film name, separated by a ">". If the film type is not found, put "unknown type" in the output file. What happens when the director name is empty? Find a solution.

The designer

Exercise 5:
Objective: if the film type is not found (use a constraint), put the film in a rejects file (first a sequential file, then a hash file).

The designer

Lookup with selection (exclusive lookup)

Don't forget: a lookup can also be designed with an ORAOCI stage or a UV stage, but it works much better with hash files (a sketch of the reject constraints follows below).
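A hedged sketch of the constraints used to split found and not-found rows (the link names lnkRef, output and rejects are hypothetical; NOTFOUND is the link variable presented later):

Constraint on the output link: Not(lnkRef.NOTFOUND)
Constraint on the rejects link: lnkRef.NOTFOUND

Rows whose key is found on the reference link go to the output; the others go to the rejects file.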


The designer

Exercise 6:
Objective: select only the films whose type is known (that means the lookup succeeds).

The designer

Exercise 7:
Objective: select all the clients who are female and put them in an output file.
The SEXE column contains M (male) or F (female).
Then create an annotation for this job (all jobs must have annotations).

The director
The Director is the job controller; it allows you to:

Run jobs
immediately or later, with more options than in the Designer.

Control job status
Status: Compiled, Running, Aborted, Validated, Failed validation ...

Monitor jobs
to follow the number of lines processed by each active stage of a job.

The director
Run jobs with the Director (screenshot): select the job, click the run button, then enter the parameters.

The director
To run a job later (screenshot): click the schedule button, then choose the date and time.

The director

To modify the run options of a job: Limits tab.
Row limit: the job stops after x rows (on each flow).
Warning limit: the job stops after x warnings.

The director

Check the status of jobs with the Director.
The possible statuses:
"Not compiled"
"Compiled"
"Failed validation"
"Validated ok"
"Aborted"
"Finished"
"Running"

The director

Example: the list of jobs (screenshot). The toolbar has buttons to view the log, run jobs, stop jobs, reset the job status, and schedule jobs to run later.

The director
Example of the Monitor (screenshot):
The Monitor lets you follow the different stages of a job. Note the importance of good names for the stages and the links!

For each step it shows:
the number of processed lines (input and output),
the start time,
the execution duration (elapsed time),
the status,
the performance (rows/sec).

Link types:
Pri: principal flow
Ref: reference flow (lookup)
Out: output flow

The director
Example of a log:
To look at the error messages, choose the job and click on the log button.

Green: OK, no problem
Yellow: warning
Red: blocking problem

Don't forget: clear the log from time to time (Job > Clear Log).

The manager

The Manager is the tool to export/import elements from one DataStage project to another.
File > Open Project to change project.

To import or export elements, click on the appropriate button.

All the elements:
jobs,
routines,
table definitions
are classified in categories, but each name must be unique within a project.

Drag and drop an element to change its category.

The manager
EXPORT

Choose what you want to export (this creates a .dsx file):
jobs,
table definitions,
routines (always check the Source Code box).
You can append to an existing file, and change the selection options:
- by category,
- by individual components.

The manager
IMPORT
This will create/modify elements in the DataStage project.
Choose what you want to import and make your choice.

The manager
With the Manager you can compile many jobs at once (multiple job compile):

Tools > Run multiple job compile,
select the type of jobs you want to compile, check "Show manual selection page" and click the Next button,
select the jobs and click the Next button,
click the Start compile button.

The designer
Sort stage:
The sort criteria are filled in on the Stage tab / Properties tab.
Modify those parameters (screenshot) if the file to sort has a lot of lines.

The designer

Exercise 8:
Objective: once you have selected all the women, sort the file in alphabetical order.

The designer

Aggregator stage:
- Aggregates the data into a smaller number of records,
- intermediate processing is executed in memory,
- can execute a before/after routine (before the stage processing, or after it once all the lines have been processed),
- performance is better if the data is sorted (Input tab),
- the Aggregator does not sort the records itself.

The designer

Aggregator stage, Input tab (screenshot): declare here whether the input data is sorted.

The designer

Aggregator stage, Output tab (screenshot): the group-by columns and the different aggregation functions.

The designer

Exercise 9:
Objective: create a job which reads location.in and calculates the hit parade of the most rented cassettes (ordered by number of rentals, descending). Also put the name of the film, not just the number of the cassette (lookup with catalogue.in).

The designer

Exercise 10:
Objective: create a job which reads location.in and calculates the average number of rentals per cassette (2 different methods can be used).

The designer

Exercise 9 (job to design - screenshot)

The designer

Exercise 10 (job to design - screenshot)

The designer

Hash File stage:

We have seen that a hash file is necessary for a lookup.
We have also seen that a hash file suppresses duplicate keys.
Let's now see how it is useful for grouping different flows.

The designer

Exercise 11:
Objective: starting from the job of exercise 10 (use the 2 methods in the same job), create a hash file gathering the different results.
Column 1: AVERAGE METHOD 1 or AVERAGE METHOD 2
Column 2: the result of each method
The hash file must contain 2 lines.

The designer

Exercise 11 (job to design - screenshot)

The designer

Stage variables:
Simple processing can be done easily with stage variables.
- A stage variable keeps its value for the whole duration of the stage, so you can find a max (if the data is sorted), calculate a sum, or count something.
- In the Transformer, right-click and select "Show Stage Variables". Example (screenshot, and the sketch below):
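A hedged sketch of two stage-variable derivations for a running count and sum (the names svCount, svSum and the input column lnkIn.Amount are hypothetical):

Derivation of svCount: svCount + 1
Derivation of svSum: svSum + lnkIn.Amount

Initialize both variables to 0 in the stage properties; an output derivation can then use svSum / svCount (see also the notes after exercise 13).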


The designer
Another example (screenshot).

The designer

Exercise 12:
Objective: try to calculate the average with stage variables.

Exercise 13:
Objective: create a job that creates a file with all the clients (key) and, in a second column, the list of their films (separated by a dot).

The designer

Exercise 13 (job to design - screenshot)

The designer

Exercise 13 (job to design):

The order of the stage variables matters: the derivations are executed in the order of the stage variables! (To change the order: right-click > Stage Properties > Link Ordering tab.)
The variables must be initialized (right-click > Stage Properties > Variables).
There must be a hash file after the stage.

The designer

DataStage variables:
Several system variables are defined by DataStage:
- @NULL
- @INROWNUM, @OUTROWNUM
- @DATE
- @TRUE, @FALSE
- @PATH

Link variables:
The most useful one is NOTFOUND (a short sketch follows below).
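A hedged sketch of typical uses in Transformer expressions (the link names lnkIn and lnkRef are hypothetical; IsNull is a standard BASIC function):

Constraint keeping only the first 100 rows: @INROWNUM <= 100
Derivation handling a null column: If IsNull(lnkIn.Director) Then "unknown" Else lnkIn.Director
Constraint catching a failed lookup: lnkRef.NOTFOUND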


The designer


Routines:
- Source code (written in the BASIC language),
- external to the jobs, so a routine can be reused many times at many levels,
- can be a transform function or a before/after subroutine:
a transform function is called for each line,
a before subroutine is called before the first line (example: empty a file),
an after subroutine is called once all the lines have been processed.
A sketch of a minimal transform function follows below.

The designer
Routines (1/3) (screenshot): the type of routine and the name of the routine; always fill in the short description.

The designer
Routines (2/3) (screenshot): the arguments to be filled in; they are used in the code.

The designer
Routines (3/3) (screenshot): the code (use the argument names), and the buttons to save, compile and test the routine.

The designer
Routines: access to a sequential file

* Open the file (OpenSeq works on a path, the "file header")
OpenSeq PathName To FileVar Then
   * the file exists and is now open
End Else
   * the file could not be opened (it may not exist yet)
End

* Read one line from the file
ReadSeq Line From FileVar Then
   * a line was read into Line
End Else
   * end of file (or read error)
End

* Write one line to the file
WriteSeq Line To FileVar Then
   * the line was written
End Else
   * write error
End

* Truncate the file (to empty it)
WeofSeq FileVar

* Close the file
CloseSeq FileVar

The designer
Routines: useful BASIC statements and functions

Logging:
Call DSLogInfo("Information", "RoutineName")
Call DSLogWarn("Warning", "RoutineName")
Call DSLogFatal("Abort", "RoutineName")

Control flow:
Loop Until ... Repeat
Loop While ... Repeat
For i = ... To ... / Next i
If ... Then ... End Else ... End
GoTo

Conversions:
Iconv("05/27/97", "D2/")   converts an external date to the internal format
Oconv(10740, "D2/")        converts an internal date back to the external format

Strings:
Upcase(...)                converts to upper case
Field(..., ",", 3, 1)      extracts the third comma-delimited field
Trim(..., " ", "T")        suppresses the trailing spaces
A = "Hello" ; B = "World" ; C = A : " " : B   concatenation: C = "Hello World"
A = "Hello": the substring A[1,3] is "Hel"

The designer
Routines: test
Double-click in the Result column to run a test (screenshot).

The designer


Exercise 14:
Step 1:
Objective: write a routine which calculates the number of days between two dates.
If the begin date is null, return 0.
If the end date is null, initialize it with today's date.
Save, compile and test the routine (a possible sketch follows below).
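A possible sketch of the routine body, assuming two arguments BeginDate and EndDate in external format (Iconv and @DATE are described on the previous slides):

If IsNull(BeginDate) Or Trim(BeginDate) = "" Then
   Ans = 0
End Else
   If IsNull(EndDate) Or Trim(EndDate) = "" Then
      EndInternal = @DATE ;* today, already in internal format
   End Else
      EndInternal = Iconv(EndDate, "D2/")
   End
   Ans = EndInternal - Iconv(BeginDate, "D2/")
End

Iconv converts an external date to an internal day number, so the subtraction directly gives a number of days.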


The designer

Exercise 14:
Step 2:
Objective: read location.in and generate a file with the rental duration (returned cassettes only).
Cassettes not returned after 10 days (end date null) will be written to a rejects file with the name and address of the client (to send them a letter).

The designer

Exercise 14 (job to design - screenshot)

The designer

Exercise 15:
Objective: with a routine (use Case), calculate the amount of the cassette rental (number of days * rental price * coefficient).
The coefficient follows this rule (a sketch of the Case construct follows below):
< 5 days: days * rental price
>= 5 and < 10 days: days * rental price * 1.20
>= 10 and < 30 days: days * rental price * 1.50
>= 30 days: days * rental price * 3
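A hedged sketch of the Case construct for this rule (the argument names Days and Price are hypothetical):

Begin Case
   Case Days < 5
      Ans = Days * Price
   Case Days < 10
      Ans = Days * Price * 1.20
   Case Days < 30
      Ans = Days * Price * 1.50
   Case @TRUE
      Ans = Days * Price * 3
End Case

The branches are evaluated in order, so the first matching Case wins.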


The designer

UV stage:
works with internal hash files (in the DataStage project),
can make a Cartesian product,
uses SQL queries (select ... from ... where ... order by ...).

The designer

Exercise 16: execute the Cartesian product of the Clients file and the Cassettes file.
Objective: propose to each client the cassettes he has never rented.
Step 1: create the job parameter "account",
Step 2: create a job that writes the clients hash file and the cassettes hash file into the DS project using the account parameter,
Step 3: in a new job, use those hash files to make the Cartesian product.
Look at your job's performance!!

The designer
Exercise 16: steps 1 and 2 (job to design - screenshot)

The designer
Step 3 (job to design - screenshot)

The designer
Result (screenshots): note the number of records.

The designer
Normalization:

A multi-valued file stores one record per key:
12 A|B|C|D|E
The corresponding normalized file stores one record per value:
12 A
12 B
12 C
12 D
12 E
Normalization turns the multi-valued form into the normalized form; un-normalization is the reverse.

The designer
Normalization:
A multi-valued file must have:
1- a key,
2- char(253) or @VM as the separator,
3- the "Normalize On" field of the Hash File stage checked,
4- the column(s) to normalize (a sketch of building such a column follows below).
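A hedged sketch of the stage-variable derivation that accumulates such a column (the names svList and lnkIn.Id_Cas are hypothetical; the input must be sorted by key, as in exercise 17):

Derivation of svList:
If svList = "" Then lnkIn.Id_Cas Else svList : @VM : lnkIn.Id_Cas

The ":" operator concatenates strings, and @VM inserts the value-mark separator char(253).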

The designer

Exercise 17: normalization/un-normalization

Step 1: create a job which reads the location.in file and writes a hash file (Id_Cli as the key and the list of all Id_Cas separated by @VM): use a Sort stage and stage variables!
=> View Data on the input link of the hash file.
Step 2: modify the job to add normalization of this file.
=> View Data on the output link of the hash file.
Step 3: compare the resulting sequential file with the location.in file.

The designer
Exercise 17: job to design and View Data (screenshot)

The designer
The ORAOCI stages:
The version of Oracle used is 9i, so use the ORAOCI9 stage.
You can:
either use a query generated by DataStage,
or use a user-defined query,
or a combination of the two.

The access parameters should be defined as job parameters.
The stage can access one table or several.
Different actions can be programmed: read, insert, update.
You can also use stored procedures.

The designer
The ORAOCI stages (screenshot): the access parameters defined as job parameters.

The designer
The ORAOCI stages, output link (screenshot): choose between a query generated by DataStage and a user-defined query.

The designer
Query generated by DataStage (screenshot): selection of the table(s), selection of the columns, group-by clause, sort parameters.

The designer
"Generate SELECT clause from column list; enter other clauses" (screenshot).

The designer

"Enter custom SQL statement": when you want to add something specific, to format a date for example (screenshot).

The designer
The ORAOCI stages, output link (screenshot): choose the table, the action, and the important parameters.

The designer
The ORAOCI stages, output link (screenshot): the number of lines between two commits.

The designer
The ORAOCI stages: checking the error code (1/3) (screenshot): the option to make the job abort when there is a SQL error.

The designer
The ORAOCI stages: checking the error code (2/3) (screenshot): a column to receive the SQL error code.

The designer
The ORAOCI stages: checking the error code (3/3) (screenshot): a constraint to select the errors, a column to receive the SQL error code, and processing the lines one by one.

The designer
The ORA Bulk stages:
- Used to insert into a table (like SQL*Loader),
- very fast (deactivate the indexes before the load and reactivate them after the load),
- but no warning if an index is left in the Unusable state after the load (e.g. when there are duplicate keys),
- few supported date and time formats (DD.MM.YYYY, YYYY-MM-DD, DD-MON-YYYY, MM/DD/YYYY; hh24:mi:ss, hh:mi:ss am).

The designer
The ORA Bulk stages (screenshot): DSN, user, password, table name (as oracle.tableName), date and time format, number of lines between two commits.

The designer
How to create a table definition from a table in the database?
In the repository, right-click on Table Definitions, then choose Import, then Plug-in Meta Data Definitions.

The designer
Then choose the table(s) and click Import.
The table definitions will be created in the ODBC category.

The designer
Exercise 18: read a database
Objective: create a job which reads the REF_CPTE table in the BIODS database.
Step 1: create the table definition from the database,
Step 2: create the job that reads the table.

The designer
Exercise 19: write to a database
Objective: create a job which writes into the TST_ALADIN_JGV table in the BIODS database (only the first 2 columns: the keys).

Mapping from location.in to TST_ALADIN_JGV:
Id_Cli ========>> CHAR1
Id_Cas ========>> CHAR2
In CHAR1, put a letter (different for each group) before the client number (Id_Cli).

Step 1: use the ORAOCI stage,
Step 2: same exercise with the ORABULK stage.

The designer
Exercise 20: update a database
Objective: create a job to update the BEGIN_DATE and END_DATE columns of the TST_ALADIN_JGV table in the BIODS database from the location.in file.
BEGIN_DATE and END_DATE are defined as timestamps!

The administrator

The Administrator lets you:
Create a DataStage project
Unlock a job
Sometimes, due to server problems, the Designer (or Manager) crashes and some elements stay locked (jobs, table definitions, routines, ...).
In that case, use the Administrator (with administrator security rights):

The administrator
Unlock a job (1/3) (screenshot of the Projects tab, which also holds the button to create a project): choose your project and click on the Command button.

The administrator

Unlock a job (2/3):

CHDIR C:\Ascential\DataStage\Engine
LIST.READU

In the LIST.READU output, search for the device number and the user number of the lock.

The administrator

Unlock a job (3/3):

Unlock your job with the device number, or with the user number:
UNLOCK USER UserNumber READULOCK
Or unlock everything:
UNLOCK ALL
A consolidated sketch follows below.
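A consolidated sketch of the whole sequence, using only the commands shown on these slides (the user number 1234 is a hypothetical value read from the LIST.READU output):

CHDIR C:\Ascential\DataStage\Engine
LIST.READU
UNLOCK USER 1234 READULOCK

UNLOCK ALL releases every lock in the project, so use it only when you are sure nobody else is connected.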

The administrator
Create a project (screenshot): the project name, and the location of the project on the server (jobs, routines, UV hash files, table definitions, ...). The project location must be different from the location of the data directories!