
DATA WAREHOUSE:

What is a Data Warehouse?

1. A data warehouse is a relational database that is designed for query and analysis rather
than for transactional processing.
2. It usually contains historical data derived from transactional data, but it can include
data from other sources.

In short, a data warehouse (DWH) is a collection of transactional and historical data that
is maintained for analysis purposes.
Three types of tools are used in any data warehousing project:
1. ETL Tools
2. OLAP Tools (or Reporting Tools)
3. Modeling Tools

ETL TOOLS:
ETL stands for Extraction, Transformation, and Loading. An ETL developer extracts data
from heterogeneous databases or flat files, transforms the data from source to target (DWH)
by applying transformation rules, and finally loads the data into the DWH.
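As a rough illustration only (an ETL tool does this graphically, not through SQL), the same extract-transform-load flow can be sketched in plain SQL. The table and column names (stg_sales, dwh.sales_fact, load_flag, etc.) are hypothetical:

-- Extract: pull new rows from a staging/source table
-- Transform: apply a simple rule (convert currency, derive profit)
-- Load: insert the result into the warehouse fact table
INSERT INTO dwh.sales_fact (sale_id, customer_id, product_id, sale_date, sales_amt, profit_amt)
SELECT s.sale_id,
       s.customer_id,
       s.product_id,
       s.sale_date,
       s.sales_amt * s.exchange_rate                 AS sales_amt,   -- transformation rule
       (s.sales_amt - s.cost_amt) * s.exchange_rate  AS profit_amt
FROM   stg_sales s
WHERE  s.load_flag = 'N';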
Several ETL tools are available in the market:
1. DataStage
2. Informatica PowerCenter
3. Ab Initio
4. Oracle Warehouse Builder
5. BODI (BusinessObjects Data Integrator)
6. SSIS (Microsoft SQL Server Integration Services)
7. Pentaho Kettle
8. Talend
9. Inaplex Inaport

OLAP:
OLAP stands for Online Analytical Processing; these tools are also called reporting tools.
An OLAP developer analyses the data warehouse and generates reports based on selection
criteria.
Several OLAP tools are available:
1. Business Objects
2. Cognos
3. ReportNet
4. SAS
5. MicroStrategy
6. Hyperion
7. MSAS (Microsoft Analysis Services)

MODELING TOOL:
Someone who works with a modeling tool such as ERwin is called a data modeler. A data
modeler designs the database of the DWH with the help of such tools.
An ETL developer extracts data from source databases or flat files (.txt, .csv, .xls, etc.)
and populates the DWH. While populating data into the DWH, one or more staging areas
may be maintained between source and target; these are called staging area 1 and staging
area 2.

STAGING AREA:
A staging area is a temporary place which is used for cleansing unnecessary, unwanted,
or inconsistent data before it is loaded into the target.
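As a simple sketch of the kind of cleansing done in a staging area (the table and column names stage1_customer and stage2_customer are hypothetical, for illustration only):

-- Move only clean rows from staging area 1 to staging area 2:
-- drop rows with missing keys and standardize inconsistent text values.
INSERT INTO stage2_customer (cust_id, cust_name, cust_phone)
SELECT cust_id,
       UPPER(TRIM(cust_name)),          -- standardize inconsistent values
       cust_phone
FROM   stage1_customer
WHERE  cust_id IS NOT NULL;             -- discard unwanted/incomplete records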

Note: A data modeler can design a DWH in two ways:


1. ER Modeling
2. Dimensional Modeling

ER Modeling:

ER modeling stands for entity-relationship modeling. In this model the tables are called
entities, and the design is usually in second normal form, third normal form, or somewhere
in between.

Dimensional Modeling:
In this model the tables are called dimension tables or fact tables. It can be subdivided
into three schemas:
1. Star Schema
2. Snowflake Schema
3. Multi-Star Schema (or Hybrid or Galaxy Schema)

Star Schema:
A fact table surrounded by dimension tables is called a star schema; the layout looks like a star.
In a star schema, if there is only one fact table, it is called a simple star schema.
In a star schema, if there is more than one fact table, it is called a complex star schema.

Sales Fact table:
Sale_id
Customer_id
Product_id
Account_id
Time_id
Promotion_id
Sales_per_day
Profit_per_day

Account Dimension:
Account_id
Account_type
Account_holder_name
Account_open_date
Account_nominee
Account_open_balance

Promotion:
Promotion_id
Promotion_type
Promotion_date
Promotion_designation
Promotion_area

Product:
Product_id
Product_name
Product_type
Product_desc
Product_version
Product_startdate
Product_expdate
Product_maxprice
Product_wholeprice

Customer:
Cust_id
Cust_name
Cust_type
Cust_address
Cust_phone
Cust_nationality
Cust_gender
Cust_father_name
Cust_middle_name

Time:
Time_id
Time_zone
Time_format
Month_day
Week_day
Year_day
Week_Year

DIMENSION TABLE:
A table that contains a primary key and provides detailed (master) information about an
entity is called a dimension table.

FACT TABLE:
A table that mostly contains foreign keys to the dimension tables and holds transactional,
summarized information (measures) is called a fact table.
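As a minimal sketch of the star schema above, using a subset of the columns listed earlier (the data types are assumptions, shown for illustration only):

-- Dimension table: primary key plus descriptive attributes
CREATE TABLE product_dim (
  product_id    NUMBER PRIMARY KEY,
  product_name  VARCHAR2(50),
  product_type  VARCHAR2(30)
);

-- Fact table: foreign keys to the dimensions plus the measures
CREATE TABLE sales_fact (
  sale_id        NUMBER PRIMARY KEY,
  product_id     NUMBER REFERENCES product_dim (product_id),
  customer_id    NUMBER,
  time_id        NUMBER,
  sales_per_day  NUMBER(12,2),
  profit_per_day NUMBER(12,2)
);

-- A typical star-schema query joins the fact table to its dimensions
SELECT d.product_name, SUM(f.sales_per_day) AS total_sales
FROM   sales_fact f, product_dim d
WHERE  f.product_id = d.product_id
GROUP  BY d.product_name;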

DIMENSION TYPES:
There are several types of dimensions.

CONFORMED DIMENSION:

If a dimension table is shared with more than one fact table (i.e., its key appears as a
foreign key in more than one fact table), that dimension table is called a conformed dimension.
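A brief sketch of a conformed dimension, assuming a time_dim table shared by two hypothetical fact tables (sales_fact and returns_fact):

-- The same time_dim table is referenced by both fact tables,
-- so reports from either fact table roll up on identical time attributes.
SELECT t.time_id, SUM(s.sales_per_day) AS sales
FROM   sales_fact s, time_dim t
WHERE  s.time_id = t.time_id
GROUP  BY t.time_id;

SELECT t.time_id, SUM(r.return_amount) AS returns
FROM   returns_fact r, time_dim t
WHERE  r.time_id = t.time_id
GROUP  BY t.time_id;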

DEGENERATED DIMENSION:
If a fact table acts as a dimension, i.e., it is shared with another fact table or its key is
maintained as a foreign key in another fact table, it is called a degenerated dimension.

JUNK DIMENSION:
A junk dimension contains miscellaneous text values, gender codes (male/female), and flag
values (true/false) that are not directly useful for generating reports. Such a dimension is
called a junk dimension.

DIRTY DIMENSION:
If a record occurs more than once in a table, differing only in a non-key attribute, such a
table is called a dirty dimension.

FACT TABLE TYPES:

There are 3 types of facts available in a fact table:
1. Additive facts
2. Semi-additive facts
3. Non-additive facts

ADDITIVE FACTS:
Facts that can be summed up across all of the dimensions in the fact table (for example, a
sales amount) are called additive facts.

SEMI-ADDITIVE FACTS:

Facts that can be summed up only across some of the dimensions, but not all (for example,
an account balance can be added across accounts but not across time), are called
semi-additive facts.

NON-ADDITIVE FACTS:

Facts that cannot be summed up across any dimension (for example, ratios or percentages)
are called non-additive facts.
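A small sketch of the difference, reusing the sales_fact columns above plus a hypothetical account_balance_fact table (all names here are assumptions for illustration):

-- Additive: sales_per_day can be summed across any dimension.
SELECT product_id, SUM(sales_per_day) FROM sales_fact GROUP BY product_id;

-- Semi-additive: a balance can be summed across accounts for one day...
SELECT time_id, SUM(balance) FROM account_balance_fact
WHERE  time_id = 20240101
GROUP  BY time_id;
-- ...but summing the same balance across time would double-count,
-- so an average (or the last value) is used instead.
SELECT account_id, AVG(balance) FROM account_balance_fact GROUP BY account_id;

-- Non-additive: a margin (profit/sales) cannot be summed at all;
-- it must be recomputed from its additive parts.
SELECT product_id, SUM(profit_per_day) / SUM(sales_per_day) AS margin
FROM   sales_fact GROUP BY product_id;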

SNOWFLAKE SCHEMA:

A snowflake schema maintains normalized data in the dimension tables. In this schema some
dimension tables do not have a direct relationship with the fact table; instead they are
related to the fact table through another dimension table.
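A minimal sketch of a snowflaked dimension, assuming the product dimension is normalized into product_dim and a separate product_type_dim (hypothetical names):

-- product_type_dim has no direct relationship with the fact table;
-- it is reached through product_dim, so the query needs an extra join.
SELECT pt.product_type_name, SUM(f.sales_per_day) AS total_sales
FROM   sales_fact f,
       product_dim p,
       product_type_dim pt
WHERE  f.product_id      = p.product_id
AND    p.product_type_id = pt.product_type_id
GROUP  BY pt.product_type_name;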

DIFFERENCE BETWEEN STAR SCHEMA AND SNOWFLAKE SCHEMA:

Star schema:
1. Maintains denormalized data in the dimension tables.
2. Performance is better when joining the fact table to the dimension tables, because fewer
joins are required compared with a snowflake schema.
3. All dimension tables maintain a direct relationship with the fact table.

Snowflake schema:
1. Maintains normalized data in the dimension tables.
2. Performance decreases when joining the fact table to the (shrunken) dimension tables,
because more joins are required compared with a star schema.
3. Some dimension tables do not have a direct relationship with the fact table.

INTRODUCTION TO ETL TOOLS:

What does an ETL tool do?

An ETL tool is a tool that:
- Extracts data from various data sources (usually legacy systems)
- Transforms data
  from -> being optimized for transactions to -> being optimized for reporting and analysis
- Synchronizes the data coming from different databases
- Cleanses the data to remove errors
- Loads data into a data warehouse
ETL stands for Extraction, Transformation, Load:
E -> Extraction from any source
T -> Transformation (rich set of transformation capabilities)
L -> Loading into any target

Why use an ETL tool?


ETL tools save time and money when developing a data warehouse by removing the
need for "hand-coding".
"Hand-coding" is still the most common way of integrating data today. It requires
many hours of development and expertise to create a business intelligence system.
It is very difficult for database administrators to connect different brands of
databases without using an external tool.

In the event that databases are altered or new databases need to be integrated, a lot of
“hand-coded” work needs to be completely redone.
Different ETL Tools:
Several ETL tools are available in the market:
1. DataStage
2. Informatica PowerCenter
3. Pentaho Kettle
4. Talend
5. Inaplex Inaport
6. Ab Initio
7. Oracle Warehouse Builder
8. BODI (BusinessObjects Data Integrator)
9. SSIS (Microsoft SQL Server Integration Services)

1. DataStage:
DataStage is a comprehensive ETL tool. In other words, DataStage is a data integration
and transformation tool which enables the collection and consolidation of data from several
sources, its transformation, and its delivery into one or multiple target systems.

History: the first version of DataStage was released in 1997 by VMark, a US-based company.
- Lee Scheffler is regarded as the father of DataStage.
- In those days DataStage was called Data Integrator.
- In 1997 Data Integrator was acquired by a company called Torrent.
- In 1999 Informix acquired Data Integrator from Torrent.
- In 2000 Ascential acquired Data Integrator, and after that it was released as Ascential
DataStage Server Edition.
- Versions 6.0 to 7.5.1 supported only UNIX environments, because the server could be
configured only on UNIX platforms.
- In 2004, version 7.5.x2 was released, which supported server configuration on Windows
platforms as well.
- By December 2004, version 7.5.x2 shipped with the Ascential suite components:
ProfileStage,
QualityStage,
AuditStage,
MetaStage,
DataStage PX,
DataStage TX,
DataStage MVS.
These were all individual tools.
- In February 2005, IBM acquired all of the Ascential suite components, and IBM released
IBM DS EE, i.e., Enterprise Edition.
- In 2006, IBM made some changes to IBM DS EE: the profiling and audit stages were
integrated into one, along with QualityStage, MetaStage, and DataStage PX, and the
product became IBM WebSphere DS & QS 8.0.
- In 2009, IBM released another version, IBM InfoSphere DS & QS 8.1.

Informatica PowerCenter:

- Informatica has a very good commercial data integration suite.
- It was founded in 1993.
- It is the market share leader in data integration (Gartner Dataquest).
- It has 2600 customers, including Fortune 100 companies, companies listed on the Dow
Jones, and government organizations.
- The company's sole focus is data integration.
- It has quite a big package for enterprises to integrate their systems and cleanse their
data, and it can connect to a vast number of current and legacy systems.

Pentaho Kettle:
Pentaho is a commercial open-source BI suite that has a product called Kettle for data
integration.
- It uses an innovative metadata-driven approach and has a strong and very easy-to-use GUI.
- The company started around 2001.
- It has a strong community of 13,500 registered users.
- It uses a stand-alone Java engine that processes the tasks for moving data between many
different databases and files.

Talend:

- Talend is an open-source data integration tool. It uses a code-generating approach and
has a GUI (implemented in Eclipse RCP).
- It started around October 2006.
- It has a much smaller community than Pentaho, but it is supported by two finance
companies.
- It generates Java or Perl code which can later be run on a server.

Inaplex:

- Inaplex is a small UK company.
- InaPlex is a producer of customer data integration products for mid-market CRM
solutions.
- Inaplex mainly focuses on providing simple solutions for its customers to integrate
their data into CRM and accounting software such as Sage and GoldMine.

Features of DataStage:
There are 5 important features of DataStage:
- Any to Any
- Platform Independent
- Node Configuration
- Partition Parallelism
- Pipeline Parallelism

Any to Any:
DataStage can extract data from any source and can load data into any target.
Platform Independent:
A job that can run on any processor/platform is called platform independent.
DataStage jobs can run on three types of processors:
1. Uniprocessor
2. Symmetric Multiprocessor (SMP)
3. Massively Parallel Processor (MPP)

Node Configuration:
A node is a logical CPU, i.e., an instance of a physical CPU.
The process of creating virtual CPUs is called node configuration.
Example:
An ETL job needs to process 1000 records.
On a uniprocessor it takes 10 minutes to process the 1000 records,
but on an SMP system the same job takes about 2.5 minutes.

Difference between server jobs and parallel jobs:

Parallel jobs:
1. DataStage parallel jobs can run in parallel on multiple nodes.
2. Parallel jobs support partition parallelism (round robin, hash, modulus, etc.).
3. The transformer in parallel jobs compiles in C++.
4. Parallel jobs run on more than one node.
5. Parallel jobs run on UNIX platforms.
6. The major difference is at the job architecture level: parallel jobs process data in
parallel and use the configuration file to know the number of CPUs defined for parallel
processing.

Server jobs:
1. DataStage server jobs do not run on multiple nodes.
2. DataStage server jobs do not support partition parallelism (round robin, hash, modulus, etc.).
3. The transformer in server jobs compiles in BASIC.
4. DataStage server jobs run on only one node.
5. DataStage server jobs run on UNIX platforms.
6. The major difference is at the job architecture level: server jobs process data in
sequence, one stage after another.

Configuration File:
What is the configuration file? What is its use in DataStage?

It is a normal text file. It holds the information about the processing and storage
resources that are available for use during parallel job execution.

The default configuration file contains entries such as:

Node: a logical processing unit which performs all ETL operations.

Pools: a collection of nodes.

Fast name: the server (host) name; the ETL jobs are executed using this name.

Resource disk: a permanent storage area where data set files are stored.

Resource scratch disk: a temporary storage area where staging operations (such as sorting
and buffering) are performed.

Configuration file:
Example:
{
  node "node1"
  {
    fastname "abc"
    pools ""
    resource disk "/opt/IBM/InformationServer/Server/Datasets" {pools ""}
    resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch" {pools ""}
  }
  node "node2"
  {
    fastname "abc"
    pools ""
    resource disk "/opt/IBM/InformationServer/Server/Datasets" {pools ""}
    resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch" {pools ""}
  }
}

Note:
In a configuration file no two nodes can have the same name.
The default node pool is "".
At least one node must belong to the default node pool, which has a name of "" (a
zero-length string).

Pipeline parallelism:
Pipe:
A pipe is a channel through which data moves from one stage to another stage.

Pipeline parallelism: a technique of processing extraction, transformation, and loading
simultaneously.

Partition Parallelism:
Partitioning:
Partitioning is a technique of dividing the data into chunks.
DataStage supports 8 types of partitioning.
Partitioning plays an important role in DataStage.
Every stage in DataStage is associated with a default partitioning technique;
the default partitioning technique is Auto.

Note:
Selection of a partitioning technique is based on:
1. Data (volume, type)
2. Stage
3. Number of key columns
4. Key column data type

Partitioning techniques are grouped into two categories:

1. Key-based
2. Keyless

Key-based partitioning techniques:

1. Hash
2. Modulus
3. Range
4. DB2

Keyless partitioning techniques:

1. Random
2. Round robin
3. Entire
4. Same
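As a rough analogy only (DataStage performs this distribution internally, not through SQL), the idea behind modulus partitioning on a numeric key can be sketched like this, assuming a 4-node configuration:

-- Each row is assigned to a partition based on the key value modulo the node count,
-- so rows with the same key always land on the same partition.
SELECT empno,
       MOD(empno, 4) AS partition_number
FROM   emp;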

DataStage Architecture:

DataStage is a client-server technology, so it has server components and client components.

Server components (UNIX):
1. PX Engine
2. DataStage Repository
3. Package Installer

Client components (Windows):
1. DataStage Administrator
2. DataStage Manager
3. DataStage Director
4. DataStage Designer

DataStage Server:

The server components are classified as follows.

DataStage server:

It is the heart of DataStage and contains the Orchestrate engine. This engine picks up
requests from the client components, performs the requested operation, and responds to
the client components; if required, it gets information from the DataStage repository.

DataStage Repository:

The repository contains jobs, table definitions, file definitions, routines, shared
containers, etc.
Package Installer:
It is used to install packaged software components and provides compatibility with other
software.

Client Components:
The client components are classified into:
DataStage Administrator
DataStage Manager
DataStage Director
DataStage Designer

DataStage Administrator:

The DataStage administrator can create and delete projects,
can give permissions to users,
and can define global parameters.

DataStage Manager:

DataStage Manager can import and export jobs,
can create routines,
and can configure the configuration file.

DataStage Director:

DataStage Director can validate jobs,
run jobs,
monitor jobs,
schedule jobs,
and view the job logs.

DataStage Designer:

Through DataStage Designer a developer can design jobs and compile and run jobs.

Differences between 7.5.x2 and 8.0.1:

7.5.x2:
1. Four client components (DS Designer, DS Director, DS Manager, DS Administrator)
2. Architecture components (server components, client components)
3. Two-tier architecture
4. OS dependent with respect to users
5. Capable of Phase 3 and Phase 4
6. No web-based administration
7. File-based repository

8.0.1:
1. Five client components (DS Designer, DS Director, Information Analyzer, DS
Administrator, Web Console)
2. Architecture components:
Common User Interface
Common Repository
Common Engine
Common Connectivity
Common Shared Services
3. N-tier architecture
4. OS independent with respect to users (but dependent once, at installation)
5. Capable of all phases
6. Web-based administration through the Web Console
7. Database-based repository

Data stage 7.5x2 Client Components:


In 7.5x2 we have 4 client components

DataStage Designer:

It is used to create jobs, compile them, run them, and perform multiple-job compiles.
It can handle four types of jobs:
1. Server jobs
2. Parallel jobs
3. Job sequences
4. Mainframe jobs

DataStage Director

Using DataStage Director you
can schedule and run jobs,
can monitor jobs, unlock jobs, and batch jobs,
can view jobs, their status, and their logs,
and can perform message handling.

DataStage Manager

Can import and export the repository components.
Node configuration.

DataStage Administrator

Can create projects,
delete projects,
and organize projects.

Server Components:
We have 3 server components:

1. PX Engine: it executes DataStage jobs and automatically selects the partitioning
technique.

2. Repository: it contains the repository components.

3. Package Installer:
The Package Installer has Packs and Plug-ins.

DataStage 8.0.1 Client Components:

In 8.0.1 we have 5 client components.
DataStage Designer:
It is used to create jobs, compile them, run them, and perform multiple-job compiles.
It can handle five types of jobs:
1. Server jobs
2. Parallel jobs
3. Job sequences
4. Mainframe jobs
5. Data quality jobs

Data stage Director


Using data stage Director
Can schedule the jobs,run the jobs
Can Monitor the jobs,Unlock the jobs,Batch jobs
Can View (job, Status, logs)
Message Handling

Data stage Administrator


Can create the projects
Can delete the projects
Organize the projects

Web Console:
Through this administration component one can perform the tasks below:
1. Security services
2. Scheduling services
3. Logging services
4. Reporting services
5. Domain management
6. Session management

Information Analyzer:
It is also a console of the IBM InfoSphere Information Server console.
It performs all activities of Phase 1:
1. Column analysis
2. Baseline analysis
3. Primary key analysis
4. Foreign key analysis
5. Cross-domain analysis

DataStage 8.0.1 Architecture:

1. Common User Interface
The unified user interface is called the common user interface:
1. Web Console
2. Information Analyzer
3. DataStage Designer
4. DataStage Director
5. DataStage Administrator

2. Common Repository

The common repository is divided into two types:

1. Global repository: DataStage job files are stored here.
2. Local repository: used for storing individual project files.

The common repository is also called the metadata server.

3. Common Engine:
It is responsible for the following:
Data profiling analysis
Data quality analysis
Data transformation analysis

4. Common Connectivity
It provides the common connections to the common repository.
Stage enhancements and newly introduced stages: comparison of 7.5x2 and 8.0.1:

Processing stages:
- SCD (Slowly Changing Dimension): not available in 7.5x2; available in 8.0.1.
- FTP (File Transfer Protocol): not available in 7.5x2; available in 8.0.1.
- WTX (WebSphere Transformation Extender): not available in 7.5x2; available in 8.0.1.
- Surrogate Key: available in 7.5x2; available in 8.0.1 (with enhancements).
- Lookup: available in 7.5x2 (normal lookup, sparse lookup); available in 8.0.1 (adds range
lookup and caseless lookup).

Database stages:
- iWay: not available in 7.5x2; available in 8.0.1.
- Classic Federation: not available in 7.5x2; available in 8.0.1.
- ODBC Connector: not available in 7.5x2; available in 8.0.1.
- Netezza: not available in 7.5x2; available in 8.0.1.
- SQL Builder: not available in 7.5x2; available in 8.0.1 (across the database stages).

Note: enhancements have been made to both the database stages and the processing stages.

DataStage Designer Window:

It has:
Title bar -> IBM InfoSphere DataStage and QualityStage Designer
Menu bar -> File, Edit, View, Repository, Diagram, Import, Export, Tools, Window, Help
Toolbar -> tool options such as Job Properties, Compile, Run
Repository -> contains the repository components
Palette -> has the list of categorized stages
Designer canvas -> here we design the jobs

File Stages:
----------------
Sequential file stage:
===============
The sequential file stage is a file stage which is used to read data sequentially or in
parallel.
If there is 1 file - it reads the data sequentially.
If there are N files - it reads the data in parallel.
The sequential file stage supports 1 input link | 1 output link | 1 reject link.
To read the data we have read methods:
a) Specific file(s)
b) File pattern
Specific file(s) is for particular files,
and File pattern is used with wildcards.
The reject (error) mode option has three values:
Continue,
Fail, and
Output.
If you select Continue - on any data type mismatch, the bad rows are discarded and the
rest of the data is sent to the target.
If you select Fail - the job aborts on any data type mismatch.
Output - the mismatched data is sent to a reject link.
The error data we get are:
data type mismatches,
format mismatches,
condition mismatches.
We also have the option
Missing file mode, which has three sub-options:
Depends,
Error,
OK.
(That means: how to handle the situation if any file is missing.)
Different options used in the sequential file stage:
-----------------------------------------------------

Read Method = Specific file(s) -> executes in sequential mode.
Read Method = File pattern -> executes using the file pattern.
Note: if we choose Read Method = Specific file(s) it asks for the input file path;
if we select Read Method = File pattern it asks for the pattern.

Example of a file pattern:

Emp1.txt
Emp2.txt
To read the data of the above two files, the file pattern should be Emp?.txt
? -> matches exactly one character
* -> matches zero or more characters

Example jobs for the lab handout:

1. Read Method = Specific file(s)
Reject Mode = Continue / Fail / Output

Note: if Reject Mode = Output then you must provide an output reject link for the rejected
records, otherwise the job gives an error.

2. Read Method = File pattern

Reject Mode = Continue / Fail / Output
Note: if Reject Mode = Output then you must provide an output reject link for the rejected
records, otherwise the job gives an error.

3. Read Method = Specific file(s)

Reject Mode = Continue / Fail / Output
File Name Column = InputRecordFilepath

Note 1: if Reject Mode = Output then you must provide an output reject link for the rejected
records, otherwise the job gives an error.
Note 2: File Name Column is an option; if we select it we have to create one extra column
(InputRecordFilepath) in the extended column properties, and the output file will then
contain this extra column holding the path of the file each record came from.

4. Read Method = Specific file(s)

Reject Mode = Continue / Fail / Output
Row Number Column = InputRowNumberColumn

Note 1: if Reject Mode = Output then you must provide an output reject link for the rejected
records, otherwise the job gives an error.
Note 2: Row Number Column is an option; if we select it we have to create one extra column
(InputRowNumberColumn) in the extended column properties, and the output file will then
contain this extra column holding the source row number of each record.

Sequential file options:

Filter
File Name Column
Row Number Column
Read First Rows
Null Field Value

1. Filter option
sed command:
--------------------

sed is a stream editor for filtering and transforming text from standard input to
standard output.

sed '5q'  -> displays the first 5 lines
sed '2p'  -> displays all lines, but the 2nd line is displayed twice
sed '1d'  -> displays all records except the first record
sed '1d;2d'  -> displays all lines except the first and second records
sed -n '2,4p'  -> prints only records 2 to 4
sed -n -e '2p' -e '3p'  -> displays only the 2nd and 3rd lines
sed '$d'  -> deletes the trailer (last) record

grep commands:
-----------------------

Syntax:
grep "string"       Ex: grep "bhaskar"  -> displays only the lines containing 'bhaskar'
1) grep -v "string"  Ex: grep -v "bhaskar"  -> displays every line except those containing 'bhaskar'
2) grep -i "string"  Ex: grep -i "bhaskar"  -> ignores case while matching

Example Job for Sequential File:
Requirement: Extracting EMP data from text file and loading into text file By using
Sequential file Stage

Here Read method= specific file(s)

Input File data:

Output File data:

Job:

Properties for sequential_File_0:

Target Sequential_File_1 Properties:

Importing a table definition from a sequential file:
Right-click on Sequential_File_0 or double-click on Sequential_File_0.

Click on the Columns tab on the left-hand side.

Click on the Load button at the bottom.
It will show a window like this.

Now click on Import and select Sequential File Definitions...

Now in the file list select the file emp1.txt.

Click on the Import tab:

Tick the check box "First line is column names" and click on Define.

Now click on OK and click on Close. The file emp1.txt will now show in the table
definition list.

Now select Emp1.txt in the table definition list and click on OK.

It will show a window like this.

Now click on OK and again click on OK. This is the procedure for importing a table
definition.

2)Example Job for Sequential File:
Requirement: Extracting EMP data from text file and loading into text file By using
Sequential file Stage
Here Read method=file pattern

Input sequential_File_0

Properties:

Emp1.txt

Emp2.txt

These two files are in this path: D:\dspractice\sanjeev\emp*.txt


Requirement:
Output target sequential_File_1:

Job:

Input Sequential_File_0 Properties:

Output Sequential_File_1 Properties:

3)Example Job for Sequential File:
Requirement: Extracting EMP data from text file and loading into text file By using
Sequential file Stage

Here Read method = specific file(s)

Reject Mode = Continue
Continue: simply discard any rejected rows;
Fail: stop the job if any row is rejected;
Output: send rejected rows down a reject link.

Here two records (empno = 300 and 400) do not have a properly ended field delimiter.

Input sequential_file_0 data:

Job:

Input sequential file_0 data Properties:

Output sequential_File_01 Properties:

Output data:

4)Example Job for Sequential File:


Requirement: Extracting EMP data from text file and loading into text file By using
Sequential file Stage

Here Read method= specific file(s)


Reject Mode=Output

Input file data:

Input sequential file data :

Output Sequential File Data:

Output Rejects Data:

Job:

Note: if you select Reject Mode = Output then you must have a reject link.

5)Example Job for Sequential File:


Requirement: Extracting EMP data from text file and loading into text file By using
Sequential file Stage

Here Read method= specific file(s)


Reject Mode=Output
Options:
File Name Column=InputRecFilepath
Note: here you should create the InputRecFilepath column in the extended column properties.
Job:

Input Seq file data:

Input Properties:

Columns:

Output Properties:

Outputdata:

6)Example Job for Sequential File:
Requirement: Extracting EMP data from text file and loading into text file By using
Sequential file Stage

Here Read method= specific file(s)


Reject Mode=Output
Options:
File Name Column=InputRecFilepath
RowNumberColumn=InputRowNumberColumn
Note: here you should create the InputRecFilepath and InputRowNumberColumn columns in the
extended column properties.

Job:

InputProperties:

Columns:

Input Data:

Output Properties:

OutputData:

7)Example Job for Sequential File:
Requirement: Extracting EMP data from text file and loading into text file By using
Sequential file Stage

Here Read method= specific file(s)


Reject Mode=Output
Options:
File Name Column=InputRecFilepath
RowNumberColumn=InputRowNumberColumn
Options
Filter = sed -n '3,5p'
Note: here you should create the InputRecFilepath and InputRowNumberColumn columns in the
extended column properties.

Input Source file data:

Input Sequential filedata:

Job:

Input properties:

Input Columns:

Output properties:

Output Data:

8)Example Job for Sequential File:


Requirement: Extracting EMP data from text file and loading into text file By using
Sequential file Stage

Here Read method= specific file(s)


Reject Mode=Continue
Options:
File Name Column=InputRecFilepath
RowNumberColumn=InputRowNumberColumn
Read First Rows=3
Input source File data:

Job:

Input sequential file properties:

Columns:

Output sequential file properties:

9)Example Job for Sequential File:
Requirement: Extracting EMP data from text file and loading into text file By using
Sequential file Stage

Here Read method= specific file(s)


Reject Mode=Output
Options:
File Name Column=InputRecFilepath
RowNumberColumn=InputRowNumberColumn
Read First Rows=5
Filter=grep “bhaskar”

grep options:
1) grep "string"      Ex: grep "bhaskar"
2) grep -v "string"   Ex: grep -v "bhaskar"
3) grep -i "string"   Ex: grep -i "bhaskar"

Source filed data:

Job:

Input sequential file Properties:

Columns:

Output sequential file properties:

Output Data:

10)Example Job for Sequential File:


Requirement: Extracting EMP data from text file and loading into text file By using
Sequential file Stage

Here Read method= specific file(s)


Reject Mode=Output
Options:
File Name Column=InputRecFilepath
RowNumberColumn=InputRowNumberColumn
Filter = grep -v "bhaskar" -> displays every record except the 'bhaskar' records
Filter = grep -i "bhaskar" -> displays only the 'bhaskar' records; "-i" means ignore case

Source filedata:

Job:

Input Sequential file stage properties:

Columns:

Output seqfile stage properties:

Usage of Parameters at Sequential file stage input file Path

Example :
-------------

My input file is located in the path: C:\Sourcefiles\Bhaskar\Emp1.txt

Here create one parameter for Path: C:\Sourcefiles\Bhaskar\

And another parameter for inputfile name: Emp1.txt

Job parameter creation process:

Step 1: go to the job properties, click on the Parameters tab, and provide the details below:

1. Parameter name: $APT_CLOBBER_OUTPUT, Prompt: overwrite, Type: List, Default value: FALSE,
   Help text: allows files or data sets to be automatically overwritten if they already exist.
2. Parameter name: Inputpath, Prompt: Inputfilepath, Type: Pathname,
   Default value: C:\Sourcefiles\Bhaskar\
3. Parameter name: Inputfilename, Prompt: InputfileName, Type: String,
   Default value: Empdata1.txt

File Stages:
-----------------
Data set:
------------
The data set is a parallel-processing file stage which is used for staging the data when we
design dependent jobs.
By default the data set is a parallel-processing stage.
A data set is stored in binary format.
If we use a data set in a job, the data is stored inside DataStage, i.e., inside the
repository area.
The data set overcomes the limitations of the sequential file.
The limitations of sequential files are:
1) Memory limitation (a sequential file can store only up to 2 GB of data per file)
2) Sequential (by default it is processed sequentially)
3) Conversion overhead (every time the job runs, the data has to be converted from one
format to another)
4) It stores the data outside DataStage (whereas a data set stores the data inside
DataStage)

There are 2 types of data sets:

1) Virtual data set
2) Persistent data set
A virtual data set is a temporary data set which is formed while data is passing along a link.
A persistent data set is a permanent data set which is formed when data is loaded into the
target.

Alias names of data sets:
a) Orchestrate files
b) Operating system files

A data set consists of multiple files:
1) Descriptor file
2) Data files
3) Control files
4) Header files

1) The descriptor file contains the schema details and the address of the data.
It is stored at a path such as C:/Data/file.ds.

2) The data files contain the data in binary format.
They are stored under a path such as C:/IBM/InformationServer/Server/Data/file.ds.
3) and 4) The control and header files reside at the operating system level.
The data set organization operations are View, Copy, and Delete.
The data set utilities for organizing are:
GUI - Dataset Management (in the Windows environment)
CMD - orchadmin (in the UNIX environment)
Data set GUI utility:

It is a utility for organizing data sets:

Tools -> Dataset Management -> file name Output.ds

Descriptor file:
It holds the information about the address of the data and about its structure.
Data files: hold the data in its native (binary) form.
Control and header files: these files operate at the OS level to control the descriptor
and data files.

Note: other names (aliases) for a data set:

1. OS files
2. Orchestrate files

EXAMPLE JOBS FOR DATA SET STAGE:
Example Job on DATA SET:
Source file data:

JOB:

Input sequential properties:

Output dataset Properties:

Note: the target dataset file extension is .ds

Output dataset data:

We can view the record schema of the data set.

Go to Tools -> Dataset Management; it will show a window like this.

It will show the list of files and data sets.

Now select empoutput1.ds and click OK

You can view the record schema by clicking on the table definition icon.

You can see the data by clicking on the data viewer option.

Note: using Dataset Management we can open a data set, show its schema window, show its
data window, copy a data set, and delete a data set.

DIFFERENCES BETWEEN THE SEQUENTIAL FILE AND DATA SET STAGES:

Sequential file stage:
1. Executes in sequential mode.
2. Cannot apply partitioning techniques.
3. Supports all flat-file formats such as .txt, .csv, .xls, etc.
4. Used to extract data from client flat files.

Data set stage:
1. Executes in parallel mode.
2. Can apply partitioning techniques.
3. Supports only the .ds format.
4. Never used to extract data from client flat files.

Development and Debugging Stages:
Row Generator stage:

1.The Row Generator stage is a Development/Debug stage


2. It has no input links, and a single output link.
3. The Row Generator stage produces a set of mock data fitting the specified meta data
4. This is useful where you want to test your job but have no real data available to process
Example job for Row Generator:
Requirement: Need to generate employee sample data For fields empno,ename,Hiredate

JOB:

Row generator properties:

Click on Columns.

Now double-click on row number 1; it will show the screen below.

For the empno field select Type = Cycle, Initial value = 1000, Increment = 1 and then click
Next.

Again it will show a screen for the Name field: set Algorithm = Cycle and give
value = RafelNadal, value = JamesBlake, value = Andderadick.
Similarly click Next; it will show the window for the hire date field: set Type = Random.

For Name field

for hire datafield:

Target Seq_EMPData Properties:

Click on View Data:
It will show this screen. In this job I defined 3 parameters: 1) the number of rows,
2) the input file path, and 3) the input file name.

Now click OK and again click OK; it will show the data.
2. Column Generator stage:
1.The Column Generator stage is a Development/Debug stage
2. It can have a single input link and a single output link.
3. The Column Generator stage adds columns to incoming data and generates mock data
for these columns for each data row processed.
The stage editor has three pages:
Stage Page. This is always present and is used to specify general information about
the stage.
Input Page. This is where you specify details about the input link.
Output Page. This is where you specify details about the generated data being output
from the stage.
Example Job For Column generator:
Input seqEmpData:

JOB:

seqEmpData properties:

Column Generator Properties:

Click on Input and click on Columns.

Click on Output and click on Columns:
Here you need to give the column name "salary" in the extended column properties:

Output columns:

Target Output_Sequential_data properties:

Output:

Development and Debugging Stages:

Head Stage:
1.The Head Stage is a Development/Debug stage
2. It can have a single input link and a single output link

3. The Head Stage selects the first N rows from each partition of an input data set and
copies the selected rows to an output data set. You determine which rows are copied by
setting properties which allow you to specify:
- The number of rows to copy
- The partition from which the rows are copied
- The location of the rows to copy
- The number of rows to skip before the copying operation begins
4. This stage is helpful in testing and debugging applications with large data sets. For
example, the Partition property lets you see data from a single partition to determine if
the data is being partitioned as you want it to be. The Skip property lets you access a
certain portion of a data set.
The stage editor has three pages:
Stage Page. This is always present and is used to specify general information about
the stage.
Input Page. This is where you specify the details about the single input set from
which you are selecting records.
Output Page. This is where you specify details about the processed data being output
from the stage.

Head stage properties:

1. Rows
- All Rows (After Skip) = True/False
True: copy all rows to the output, following any requested skip positioning.
False: the following sub-options apply:
Number of Rows (Per Partition) = 10
Period (Per Partition) = N
(copy every Nth row per partition, starting with the first)
Skip (Per Partition) = 1
(number of rows to skip at the start of every partition)
If we select False then only Number of Rows (Per Partition) = 10 appears by default.
2. Partitions
- All Partitions = True
When set to True, rows are copied from all partitions. When set to False, rows are copied
only from specific partition numbers, which must be specified.
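As a loose analogy only (the Head stage does this per partition inside the job, not through SQL), picking the first N rows of a table in Oracle SQL looks like this:

-- Return only the first 5 rows of EMP, similar in spirit to a Head stage run
-- with Number of Rows (Per Partition) = 5 on a single partition.
SELECT *
FROM   emp
WHERE  ROWNUM <= 5;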

Example Job for Head Stage:

Input SeqEmpData:

Output seqEmpdata:

JOB:

InputseqEmpData Properties:

Case-1:

Head stage Properties:


AllRows = False
Number of rows = 5
AllPartitions = True

Head Stage Output columns:

Target Output_Sequentialdata:

Example Job for Head Stage:
Input SeqEmpData:

OutputSequential data:

Job:

Input SeqEmpData:

Head stage Properties


Case-2:
Head stage Properties:
AllRows=True
Allpartitions=True

Head stage output columns:

Target OutputSequentialData:

Target Output_Sequentail_data:

Tail Stage:
1.The Tail Stage is a Development/Debug stage
2. It can have a single input link and a single output link
3. The Tail Stage selects the last N records from each partition of an input data set and
copies the selected records to an output data set. You determine which records are copied
by setting properties which allow you to specify:
 The number of records to copy
 The partition from which the records are copied

4.This stage is helpful in testing and debugging applications with large data sets. For
example, the Partition property lets you see data from a single partition to determine if
the data is being partitioned as you want it to be. The Skip property lets you access a
certain portion of a data set
The stage editor has three pages:
Stage Page. This is always present and is used to specify general information about
the stage.
Input Page. This is where you specify the details about the single input set from
which you are selecting records.
Output Page. This is where you specify details about the processed data being output
from the stage

Tail stage properties:


1.Rows
2.Partitions

Rows:
 No of rows[Per partition]=10(Default is 10 if we need more than 10 or less
than 10 we have to change the number)
Number of rows to copy from input to output per partition.

Partitions:
All Partition=True/False
When set to True copies rows from all partitions. When set to False, copies from
specific partition numbers, which must be specified.

Example Job for Tail Stage:
Input SeqEmpData:

JOB:

Input seqEmpdata Properties:

Tailstage Properties:

Output Columns:

Target Output_Sequentialdata:

Sample Stage:
1. The Sample stage is a Development/Debug stage.
2. It can have a single input link and any number of output links when operating in
percent mode,
3. and a single input and single output link when operating in period mode.
4.The Sample stage samples an input data set. It operates in two modes. In Percent mode,
it extracts rows, selecting them by means of a random number generator, and writes a
given percentage of these to each output data set. You specify the number of output data
sets, the percentage written to each, and a seed value to start the random number
generator. You can reproduce a given distribution by repeating the same number of
outputs, the percentage, and the seed value
5.In Period mode, it extracts every Nth row from each partition, where N is the period,
which you supply. In this case all rows will be output to a single data set, so the stage
used in this mode can only have a single output link
6.For both modes you can specify the maximum number of rows that you want to sample
from each partition.
The stage editor has three pages:

Stage Page. This is always present and is used to specify general information about
the stage.
Input Page. This is where you specify details about the data set being sampled.
Output Page. This is where you specify details about the sampled data being output
from the stage.

EXAMPLE JOB FOR SAMPLE STAGE:

Note: Sample stage we can Operate in Two Modes one is Period and Another one is
Percentage Mode
Input data:

JOB:

Input seqfile properties:

Sample stage properties:

Output Columns:

Target Seqfile properties:

Output data:

Note: here the output has only 3 records because we set the option Period (Per Partition) = 3,
so it takes every 3rd record from the input file data.

Example Job for Sample stage :


Operate in percentage mode:
Input seqfile data:

Job:

Input sequential file stage properties:

Sample stage properties:

Output columns for Output1:

Output Columns for Output2:

Output Coluns for output3:

Sample stage link ordering:

Output1 seqfile stage properties:

Output data for outdat1 seqfile:

Output2 Seqfile stage properties:

Output2 seqfile data:

Output3 Seqfile stage properties:

Output3 Seqfile data:

Peek Stage:

1.The Peek stage is a Development/Debug stage.


2. It can have a single input link and any number of output links.
3.The Peek stage lets you print record column values either to the job log or to a separate
output link as the stage copies records from its input data set to one or more output data
sets.
4.Like the Head stage and the Tail stage (Sample stage), the Peek stage can be helpful for
monitoring the progress of your application or to diagnose a bug in your application.

The stage editor has three pages:


Stage Page. This is always present and is used to specify general information about
the stage.
Input Page. This is where you specify the details about the single input set from
which you are selecting records.
Output Page. This is where you specify details about the processed data being output
from the stage.
Peek Stage Properties:
1. Rows
- All Records (After Skip) = True/False
Print all rows following any requested skip positioning.
If we select True, the Number of Records (Per Partition) = 10 option does not appear; only
if we set it to False does Number of Records (Per Partition) = 10 appear.

2. Columns
- Peek All Input Columns = True/False
When set to True, all column values are printed. When set to False, only specific column
values are printed, and these must be specified.
3. Partitions
- All Partitions = True/False
When set to True, rows from all partitions are printed. When set to False, rows are printed
only from specific partition numbers, which must be specified.

4. Options
- Peek Records Output Mode = Job log / Output
Job log = print the output to the job log; Output = write to a second output link of the stage.
- Show Column Names = True/False
Set to True to print the column name, followed by a colon, followed by the value; otherwise
only the value is printed, followed by a space.

Example Job for Peek Stage:
Inputdata:

JOB:

Input sequential file properties:

Peek stage properties:

Peek stage Output columns:

Output seqfile properties:

Output seq file data:

EXAMPLE JOB FOR


PEEKSTAGE:
Inputdata:

Option outputmode=Joblog:
Job:

Input seqfile properties:

Input seqfile data:

peek stage properties:

Here we set the option Peek Records Output Mode = Job log, so we can see the data in the
logs only.
Procedure for seeing the data in the logs:
Go to Tools -> Run Director, then click on View Log; it will show a screen like this.

In the above screen, click on the 8th row from the bottom; it will show the log details.

Example Job for Peek Stage

Inputdata:

JOB:

Input seqfile data:

Input seqfile properties:

Peek stage properties:

Peek output1 columns:

peek output1 mappings:

Peek output2 columns:

peekoutput2 mappings:

peekoutput3 columns:

Peekoutput3 mappings:

peekout1 properties:

peekoutput1 data:

Peekoutput2 properties:

Peekoutput2 data:

Peekoutput2 properties:

Peekoutput3 data:

Peekoutput3 properties:

Steps for using the truncate-table script:

Server:/home/bhaskar$
Example:
Server:/home/bhaskar$ cd /
Server:/$ cd /Data/Script
Server:/Data/Script$ ls
Deletetab.sh Trunctab.sh

It shows the list of script files in the Script directory.

Truncate script command for deleting the data in a table:
Server:/Data/Script$ sh Trunctab.sh "DBname" "Userid" "Password"
"Schemaname.tablename"

Create table syntax:

Create schema "Schemaname"
Create table "Schemaname.Tablename"
(
"Field1" datatype(size),
"Field2" datatype(size),
"Field3" datatype(size),
"Field4" datatype(size),
"Field5" datatype(size),
"Field6" datatype(size),
"Field7" datatype(size),
"Field8" datatype(size),
"Field9" datatype(size),
"Field10" datatype(size)
)
GRANT SELECT, INSERT, UPDATE, DELETE ON TABLE "Schemaname.Tablename"
TO GROUP "Groupname";

Using the script in the job properties as a before-job or after-job subroutine:

Before-job subroutine: ExecSH

Input value:
/Data/Script/Deletetab.sh #Servername# #UserId# #Password#
#schemaname.TableName#

How to create a new configuration file using the default configuration file (default.apt):

Step 1: go to Tools -> Configurations -> select "default" from the configuration drop-down
list.
Copy the content of the default configuration file, select "New", and paste the content of
the default configuration file. Now remove whichever nodes are not required, click Save,
and in "Save configuration as" provide the configuration file name "2Nodeconfigfile".

Step 2: select the environment variable APT_CONFIG_FILE in the job parameters.

Step 3: when we trigger the job, the prompt will display the default configuration file;
change the file name to "2Nodeconfigfile". Now the job will run with the 2-node
configuration file.

Differences between Filter, Switch, and External Filter:

Filter:
1. The condition can be put on multiple columns.
2. It has 1 input, N outputs, and 1 reject link.

Switch:
1. The condition can be put on a single column.
2. It has 1 input, up to 128 outputs, and 1 default reject link.

External Filter:
1. Here we can use any UNIX filter command.
2. It has 1 input, 1 output, and no reject link.

Differences between the Oracle Enterprise stage and the ODBC Enterprise stage:

1. Oracle Enterprise stage:
1. Version dependent
2. Good performance
3. Specific to Oracle
4. Uses plug-ins
5. No rejects at the source
2. ODBC Enterprise stage:
1. Version independent
2. Poorer performance
3. Can be used for multiple databases
4. Uses OS drivers
5. Rejects at source and target

Extraction of XLS data with the ODBC Enterprise stage:

The first step is to create an MS Excel file, which is called a "workbook". It can have 'n'
number of sheets in it.
- For example, a CUST workbook is created.

1. Read the Excel workbook with the ODBC Enterprise stage:

Read Method = Table
Table = "emp1$"  \\ when reading from Excel, the name must be in double quotes and end with
a $ symbol

Connections:
DSN = EXE
Password = OS system password
User = OS system user name

Columns:
- Load
- Import ODBC Table Definition
- DSN: here select the workbook
- User ID and Password

Operating system:
Add in ODBC:
- MS Excel drivers
- Name = EXE (DSN)

Example Job for Dynamic Rdbms:


JOB:

Dynamic Rdbms Properties:

Click On output
Properties for Outputname=Emp

Columns:

Selection:

SQL:

View Data:

Properties for Outputname=Emp

Columns:

SQL:

View Data:

Emp_Dataset Properties:

View data:

Dept_Data_set:Properties:

View data:

ENCODE STAGE:
1. It is a processing stage that encodes the records into a single encoded format with the
help of a command-line utility.
2. It supports 1 input and 1 output.

Properties:
- Stage
- Options: Command Line = compress / gzip
- Input
- Output
- Load the metadata of the source file

DECODE STAGE:
1. It is a processing stage that decodes the encoded data.
2. It supports 1 input and 1 output.
Properties:
- Stage
- Options: Command Line = uncompress / gunzip
- Output
- Load the metadata of the source file.

Filter Stage:
1.The Filter stage is a processing stage.
2.It can have a single input link and a any number of output links and, optionally, a single
reject link.
3.The Filter stage transfers, unmodified, the records of the input data set which satisfy the
specified requirements and filters out all other records.
4.You can specify different requirements to route rows down different output links. The
filtered out records can be routed to a reject link, if required.

The stage editor has three pages:


Stage Page. This is always present and is used to specify general information about
the stage.
Input Page. This is where you specify details about the input link carrying the data to
be filtered.
Output Page. This is where you specify details about the filtered data being output
from the stage down the various output links.

Filter stage properties tab options:

- Predicates
  - Where Clause
- Options
  - Output Rejects = True/False
Set to True to output rejected rows to the reject link.

  - Output Row Only Once = True/False

Set to True to output a row only to the first 'Where Clause' it matches; False means the
row will be output to all 'Where Clauses' that match.
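The Where Clause behaves much like a SQL WHERE predicate. As a rough analogy only (the Filter stage evaluates its predicates inside the job, not in the database), routing two subsets of EMP could be expressed as below, assuming EMP has deptno and sal columns:

-- Rows for output link 0: the deptno = 10 records
SELECT * FROM emp WHERE deptno = 10;

-- Rows for output link 1: the high-salary records
SELECT * FROM emp WHERE sal > 3000;
-- Rows matching neither predicate would go to the reject link (if one is defined).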

Example Job for Filter stage:
JOB:

Emp Properties:

Copy stage properties:

Copy1 output :
Outputname=Emp_Copy

Output name=Emp_Copy_all

Out_Emp_Copyall_Dataset Properties:

outputdata:

Filter_3properties:

Filter_3 Output Mappings:

OutputName=OutputSalg1andsall3:

Output_Deptno10 PROPERTIES:

OUTPUT DATA:

Data_SET5 properties:

OUTPUT DATA:

Filter_10 PROPERTIES:

Output Mappings:Dslink15

Output columns:

output name=dslink13:

Dataset_14 PROPERTIES:

output

Datbase_12 Properties:

Output:

Switch stage:
1.The Switch stage is a processing stage.
2.It can have a single input link, up to 128 output links and a single rejects link.
The Switch stage takes a single data set as input and assigns each input row to an output
data set based on the value of a selector field.
3.The Switch stage performs an operation analogous to a C switch statement, which
causes the flow of control in a C program to branch to one of several cases based on the
value of a selector variable.
4.Rows that satisfy none of the cases are output on the rejects link.

The stage editor has three pages:


Stage Page. This is always present and is used to specify general information about
the stage.
Input Page. This is where you specify the details about the single input set from
which you are selecting rows.
Output Page. This is where you specify details about the processed data being output
from the stage.

Switch stage properties options


1.Input
2.User defined Mapping
3.Options

1.Input
 Selector=Column Name

 Selector Mode=Auto,Hash,User Defined Mapping

1.Auto can be used when there is as many distinct selector values as output links.
2.Hash means that rows are hashed on the selector column modulo the number of output
links and assigned to an output link accordingly. In this case, the selector column must be
of a type that is convertible to Unsigned Integer and may not be nullable.
3.User-defined Mapping means that the onus is on the user to provide explicit mapping
for values to outputs

2.User Defined Mapping


- Case = ?
Specifies a user-defined mapping between actual values of the selector column and an
output link. A mapping is a string of the form <Selector Value>[=<Output Link Label
Number>]. The link label number is not needed if the value is intended for the same
output link as specified by the previous mapping that specified a number. You must
specify an individual mapping for each value of the selector column you want to direct to
one of the output links, thus this property will be repeated as many times as necessary to
specify the complete mapping.

3.Options
If Not Find =Fail,Drop,Output
Fail means that an invalid selector value causes the job to fail; Drop drops the offending
row; Output sends it to a reject link.
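As a loose analogy only (the Switch stage routes rows inside the job; it does not run SQL), the selector logic resembles a SQL CASE expression over the selector column, here assumed to be deptno:

-- Each row is tagged with the output link it would be routed to.
SELECT e.*,
       CASE e.deptno
            WHEN 10 THEN 'output link T1'
            WHEN 20 THEN 'output link T2'
            WHEN 30 THEN 'output link T3'
            ELSE 'reject link'          -- behaviour controlled by "If Not Found"
       END AS routed_to
FROM   emp e;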

EXAMPLE JOB FOR SWITCH STAGE:


JOB:

Oracle_Enterprise_0 properties:

Switch stage properties:

Outputmappings:
Outputname=T1;

Output_Dataset_2 properties::

Output

Outputname=T2;

Output Dataset_3 Properties:

output:

Outputname=T3;

Output Dataset_4 Properties:

Output:

External Filter stage:


1.The External Filter stage is a processing stage.
2. It can have a single input link and a single output link.
The External Filter stage allows you to specify a UNIX command that acts as a filter on
the data you are processing.
3.An example would be to use the stage to grep a data set for a certain string, or pattern,
and discard records which did not contain a match. This technique can be a quick and
efficient way of filtering data.
4.Whitespace is stripped from the start and end of the data before the command is
executed. To avoid this behavior, use an explicitly wrapped command that sets format
options on the schema.
The stage editor has three pages:
Stage Page. Use this page to specify general information about the stage.
Input Page. Use this page to specify details about the input link carrying the data to
be filtered.
Output Page. Use this page to specify details about the filtered data being output
from the stage.

External Filter properties options:
1. Options
- Arguments = ?
Type: String
Any command-line arguments required.
- Filter Command = ?
Type: Pathname
Program or command to execute, which must be configured to accept input from stdin
and write its results to stdout.

Example filter commands:

sed '1d;2d'
grep 'bhaskar'
sed -n -e '2p' -e '3p'
EXAMPLE JOB FOR EXTERNAL FILTER:

Input data:

Job:

Sequential_File_7 Properties:

Columns:

External Filter properties:

Output columns:
Here we need to give the column names manually at output columns

Target data set_9: Properties:

OUTPUT:

Example job for copy and External filter:


Inputdata:

JOB:

Sequential file stage properties:

Copy_1 STAGE properties:

Copy_1 stage output mappings:

External_filter3 stage properties:

External_Filter_3 Output columns:

Data_set_7 stage properties:

output:

External_FILT4 stage properties:

Output Columns:

Dataset_8 Properties:

OUTPUT:

JOIN Queries:
A join is a query which combines the data from multiple tables.
Types of joins:
1. Cartesian join
2. Equi join
3. Non-equi join
4. Self join / inner join
5. Outer join
   Left outer join
   Right outer join
   Full outer join

Employee Table Data:

SQL> select * from emp;

EMPNO ENAME     JOB      MGR DEPTNO
----- --------- -------- --- ------
  111 bhaskar   analyst  444     10
  222 prabhakar clerk    333     20
  333 pradeep   manager  111     10
  444 srujana   engineer 222     40
Department Table Data
SQL> select * from dept;

DEPTNO DNAME LOC


------ ---------- ----------
10 marketing hyderabad
20 finance banglore
30 hr bombay

Examples:
Cartesian join:
If we combine data from multiple tables without applying any condition, then each
record in the first table joins with all records in the second table.
SQL> select * from emp, dept;
SQL> select empno, ename, job, dname, loc from emp e, dept d;

Equi join:
If we combine data from multiple tables by applying equality conditions on the tables,
then each record in the first table joins with the matching row in the second table.
This kind of join is called an equi join.

SQL> select e.empno, e.ename, d.dname, d.loc from emp e, dept d where e.deptno = d.deptno;

Inner join:
This displays all the records that have a match.
Ex:
SQL> select empno,ename,job,dname,loc from emp inner join dept using(deptno);

Left outer join:
This displays all matching records plus the records in the left-hand side table that have
no match in the right-hand side table.
Ex:
SQL> select empno,ename,job,dname,loc from emp e left outer join dept d
on(e.deptno=d.deptno);
Or
SQL> select empno,ename,job,dname,loc from emp e,dept d where
e.deptno=d.deptno(+);
Right outer join:

This will display all matching records, plus the records in the right-hand side table that
have no match in the left-hand side table.
Ex:
SQL> select empno,ename,job,dname,loc from emp e right outer join dept d
on(e.deptno=d.deptno);
Or
SQL> select empno,ename,job,dname,loc from emp e,dept d where e.deptno(+) =
d.deptno;

Full outer join


This will display all matching records and the non-matching records from both tables.
Ex:
SQL> select empno,ename,job,dname,loc from emp e full outer join dept d
on(e.deptno=d.deptno);

Join Stage:
These topics describe Join stages, which are used to join data from two input tables and
produce one output table. You can use the Join stage to perform inner joins, outer joins, or
full joins.
1.An inner join returns only those rows that have matching column values in both
input tables. The unmatched rows are discarded.

143
2.An outer join returns all rows from the outer table even if there are no matches. You
define which of the two tables is the outer table.
3.A full join returns all rows that match the join condition, plus the unmatched rows
from both input tables.
Unmatched rows returned in outer joins or full joins have NULL values in the columns of
the other link
1.Join stages have two input links and one output link.
2. The two input links must come from source stages. The joined data can be output to
another processing stage or a passive stage
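Outside DataStage, the three join types described above can be illustrated with a small Python sketch; the rows and the padded column names below are invented sample data, and None stands for the NULL written into unmatched columns:

emp = [{"ENO": 111, "ENAME": "bhaskar", "DEPTNO": 10},
       {"ENO": 444, "ENAME": "srujana", "DEPTNO": 40}]
dept = [{"DEPTNO": 10, "DNAME": "marketing"},
        {"DEPTNO": 20, "DNAME": "finance"}]

def join(left, right, key, how="inner"):
    # Index the right input by the join key, as the Join stage conceptually does.
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    out = []
    for l in left:
        matches = index.get(l[key], [])
        if matches:
            for r in matches:
                out.append({**l, **r})          # matched rows from both links
        elif how in ("left", "full"):
            out.append({**l, "DNAME": None})    # unmatched left row padded with NULL
    if how == "full":
        left_keys = {l[key] for l in left}
        for r in right:
            if r[key] not in left_keys:         # unmatched right rows
                out.append({"ENO": None, "ENAME": None, **r})
    return out

print(join(emp, dept, "DEPTNO", how="inner"))  # only DEPTNO 10 survives
print(join(emp, dept, "DEPTNO", how="left"))   # DEPTNO 40 kept with DNAME=None
print(join(emp, dept, "DEPTNO", how="full"))   # plus DEPTNO 20 from dept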

Join stage Properties:


1.Join Keys
2.Options

1.Join Keys
 Key=?
Type: Input Column
Name of input column you want to join on. Columns with the same name must appear
in both input data sets and have compatible data types.

Case sensitive=True/False
Type: List
Whether this join column is case sensitive or not.

2.Options
 Join type= F,I,L,R
F->Full outer join
I->Inner join
L->Left outer join
R->Right outer join

Type: List
Type of join operation to perform.

Example Job for Join stage:

Emp table data:

144
Dept table Data:

JOB: Inner Join

Inner Join Output:

145
JOB2: Left Outer Join

Output:

LOOKUP JOB :

146
JOB3:Right Outer Join

Output:

JOB4;Full outer join

147
Output:

148
LOOKUP STAGE:
1.The Lookup stage is a processing stage.
2.It is used to perform lookup operations on a data set read into memory from any other
Parallel job stage that can output data
3. The most common use for a lookup is to map short codes in the input data set onto
expanded information from a lookup table which is then joined to the incoming data and
output. For example, you could have an input data set carrying names and addresses of
your U.S. customers. The data as presented identifies state as a two letter U. S. state
postal code, but you want the data to carry the full name of the state. You could define a
lookup table that carries a list of codes matched to states, defining the code as the key
column. As the Lookup stage reads each line, it uses the key to look up the state in the
lookup table. It adds the state to a new column defined for the output link, and so the full
state name is added to each address. If any state codes have been incorrectly entered in
the data set, the code will not be found in the lookup table, and so that record will be
rejected.
4. Lookups can also be used for validation of a row. If there is no corresponding entry in a
lookup table to the key's values, the row is rejected.
5.The Lookup stage is one of three stages that join tables based on the values of key
columns. The other two are:
Join stage - Join stage
Merge stage - Merge Stage
6.The three stages differ mainly in the memory they use, the treatment of rows with
unmatched keys, and their requirements for data being input
7. The Lookup stage can have a reference link, a single input link, a single output link,
and a single rejects link
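The reject handling described above (and exercised by the jobs that follow) can be sketched in Python; the column names and sample rows are invented, and the failure modes mirror the Lookup Failure settings Drop, Continue, Reject, and Fail:

def lookup(input_rows, reference, key, failure="drop"):
    # reference is a dict keyed on the lookup key, held in memory like the
    # Lookup stage's reference link.
    output, rejects = [], []
    for row in input_rows:
        ref = reference.get(row[key])
        if ref is not None:
            output.append({**row, **ref})           # matched: enrich the row
        elif failure == "continue":
            output.append({**row, "DNAME": None})   # left-outer behaviour
        elif failure == "reject":
            rejects.append(row)                     # goes to the reject link
        elif failure == "fail":
            raise RuntimeError("lookup failed for key %r" % row[key])
        # failure == "drop": unmatched row silently discarded (inner join)
    return output, rejects

emp = [{"ENO": 1, "DEPTNO": 10}, {"ENO": 2, "DEPTNO": 99}]
dept = {10: {"DNAME": "marketing"}, 20: {"DNAME": "finance"}}
print(lookup(emp, dept, "DEPTNO", failure="drop"))
print(lookup(emp, dept, "DEPTNO", failure="reject"))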

Input Data:

149
ReferenceData:

LOOK UPJOB:
Lookup Failure: Drop
If Lookup Failure = Drop then an inner join is performed

Output:

150
2.LOOK UP JOB
Lookup Failure=Continue

If Lookup Failure = Continue then a left outer join is performed

Output:

151
3.LOOKUP JOB:
Lookup Failure=Reject
If Lookup Failure = Reject then the records that do not match the reference data are sent
to the reject output link
JOB:

Output:

Input Rejected :

LOOK UPJOB:
Lookup Failure: Fail
If Lookup Failure = Fail then the job aborts if any input record is not found in the
reference file

152
MERGE STAGE:

Merge Stage Properties:

1.Merge keys
2.Options

1.Merge Keys
 Key=?
 Sort order=Ascending
Sort in Either ascending or descending order

2.Options:
 Unmatched Master mode=Drop/keep
 Warn on reject updates=True
 Warn on unmatched master=True

Masterdata:

UpdateData:

153
JOB1:
 Unmatched Master mode=Drop
 Warn on reject updates=True
 Warn on unmatched master=True
Type: List
Keep means that unmatched rows (those without any updates) from the master link are
output; Drop means that unmatched rows are dropped

Output:

Master_Rejects:

154
JOB2:
 Unmatched Master mode=Keep
 Warn on reject updates=True
 Warn on unmatched master=True

Output:

Master_Records:

Note : If the options "Warn on Reject Updates = True" and "Warn on Unmatched Masters
= True" then the log file shows the warnings on Reject Updates and Unmatched Data
from Masters

Note : If the options "Warn on Reject Updates = False" and "Warn on Unmatched
Masters = False" then the log file does not show the warnings on Reject Updates and
Unmatched Data from Masters.
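As a rough Python sketch of the master/update behaviour described in the notes above (sample rows and column names are invented, and the reject handling is simplified to unmatched update rows):

def merge(master, updates, key, unmatched_master="drop"):
    # Index the update rows by key; the Merge stage expects both inputs to be
    # sorted on the merge key, but a dict keeps this sketch simple.
    upd_index = {u[key]: u for u in updates}
    output = []
    for m in master:
        u = upd_index.pop(m[key], None)
        if u is not None:
            output.append({**m, **u})    # master row merged with its update
        elif unmatched_master == "keep":
            output.append(m)             # Keep: unmatched master still output
        # Drop: unmatched master row is discarded
    update_rejects = list(upd_index.values())  # update rows with no master
    return output, update_rejects

master = [{"ID": 1, "NAME": "bhaskar"}, {"ID": 2, "NAME": "pradeep"}]
updates = [{"ID": 1, "CITY": "hyderabad"}, {"ID": 3, "CITY": "pune"}]
print(merge(master, updates, "ID", unmatched_master="drop"))
print(merge(master, updates, "ID", unmatched_master="keep"))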

155
MODIFY STAGE JOBS:

Modify Stage Job1:


TypeConversion:Date_From_Timestamp

Inputfile Data:
CUSTID,CNAME,ADDRESS,CITY,STATE,ZIP,CUSTDOB
1000,BHASKAR,MOOSPET,HYDERABAD,AP,500071,1983-03-10 16:02:00
2000,SUMIT,,BANGALORE,KA,560070,1985-03-01 16:02:00
3000,SRIKAR,ERRAGADDA,HYDERABAD,AP,,1985-05-01 16:02:00
4000,SRUJANA,,HYDERABAD,AP,,1986-07-01 16:02:00

Job:

Sequential file Input columns:

Modify stage properties:


Specification=CUSTDOB=date_from_timestamp(CUSTDOB)

156
Modify stage input columns

Modify stage output columns

Output:

157
Modify Job2:
Null Handle:

Inputdata:

Job:

Modify stage properties

CUSTDOB=date_from_timestamp(CUSTDOB)
ZIP=Handle_Null('ZIP','999999')
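The effect of these two specifications on one record can be paraphrased in Python (illustration only; the default value '999999' and the column names come from the job above):

from datetime import datetime

def modify(record):
    # CUSTDOB = date_from_timestamp(CUSTDOB): keep only the date part.
    record["CUSTDOB"] = datetime.strptime(
        record["CUSTDOB"], "%Y-%m-%d %H:%M:%S").date().isoformat()
    # ZIP = Handle_Null('ZIP','999999'): replace a NULL/empty ZIP with a default.
    if not record["ZIP"]:
        record["ZIP"] = "999999"
    return record

row = {"CUSTID": "3000", "ZIP": "", "CUSTDOB": "1985-05-01 16:02:00"}
print(modify(row))   # {'CUSTID': '3000', 'ZIP': '999999', 'CUSTDOB': '1985-05-01'}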

158
Output:

3.Modify Job
Drop Columns

Job:

Sequential file Input columns:

Modify stage properties:


Specification=CUSTDOB=date_from_timestamp(CUSTDOB)

159
ZIP=Handle_Null('ZIP','999999')
Specification=Drop CUSTID
Modify stage input columns

Output:

160
3.Modify Job
Keep Columns

Inputdata:

Job:

Sequential file Input columns:

161
Modify stage properties:

KEEP CUSTID
Modify stage input columns

162
Modify stage output columns

Output:

Copy Stage:
1.The Copy stage is a processing stage.
2.It can have a single input link and any number of output links.
3. The Copy stage copies a single input data set to a number of output data sets.
4. Each record of the input data set is copied to every output data set. Records can be
copied without modification or you can drop or change the order of columns (to copy
with more modification - for example changing column data types - use the Modify
stage).
5. Where you are using a Copy stage with a single input and a single output, you should
ensure that you set the Force property in the stage editor to TRUE. This prevents
InfoSphere™ DataStage® from deciding that the Copy operation is superfluous and
optimizing it out of the job.

The stage editor has three pages:


Stage Page. This is always present and is used to specify general information about
the stage.

163
Input Page. This is where you specify details about the input link carrying the data to
be copied.
Output Page. This is where you specify details about the copied data being output
from the stage

Copy stage Properties tab Options:


1.Force=True/False
True to specify that DataStage should not try to optimize the job by removing the Copy
operation.

Input:

Output:

Job:

164
Copy stage general tab

Copy stage Output Mapping

165
Copy stage Results set Output:

166
Aggregator Stage:
1.The Aggregator stage is a processing stage.
2.It classifies data rows from a single input link into groups and computes totals or other
aggregate functions for each group. The summed totals for each group are output from
the stage via an output link.
The stage editor has three pages:
Stage Page. This is always present and is used to specify general information about
the stage.
Input Page. This is where you specify details about the data being grouped or
aggregated.
Output Page. This is where you specify details about the groups being output from
the stage.
Aggregator stage general tab Options:

1.Grouping Keys:
2.Aggregations
3.Options

1. Grouping Keys:
Group=Specifies an input column you are using as a grouping key.

Grouping Keys
 Group
 CaseSensitive=True/False

2. Aggregations:

Aggregation type
Whether to perform calculation(s) on column(s), re-calculate on previously created
summary columns, or count rows.

 Calculation
 Count of Rows
 Re-Calculation

Aggregation type=Calculation
Column for calculation=Column name

If you specify the column name, then the stage asks for the following options:

167
1.Corrected sum of squares output column
Name of column to hold the corrected sum of squares of data in the aggregate column.

->Decimal output=?
1.Maximum value output column
Name of column to hold the maximum value encountered in the aggregate column.
->Decimal output=?
2.Mean Value output column

->Decimal output=?
Name of column to hold the mean value of data in the aggregate column.

3.Minimum Value output column


->Decimal output=?
Name of column to hold the minimum value encountered in the aggregate column.

4. missing values
Specifies what constitutes a 'missing' value, for example -1 or NULL. Enter the value as a
floating point number.

5.missing values count output column


->Decimal output=?
Name of column to hold the count of the number of aggregate column fields with values
in them.

6. Percentage Coefficient of variation output column


->Decimal output=?
Name of column to hold the percent coefficient of variation of data in the aggregate
column..

7.Preserve type=True/False
True means that the datatype of the output column is derived from the input column when
calculating minimum value, maximum value.

8.Range output Column:


->Decimal output=?
Name of column to hold the range of values in the aggregate column (maximum -
minimum).

9. Standard deviation output column:


->Decimal output=?
Name of column to hold the standard deviation of data in the aggregate column.

10. Standard Error output column:


->Decimal output=?

168
Name of column to hold the standard error of data in the aggregate column.

11.sum of weights output column


->Decimal output=?
Name of column to hold the sum Of weights of data in the aggregate column. (See
Weighting Column.)

12.sum output column


->Decimal output=?
Name of column to hold the sum of data in the aggregate column.

13.Summary output column


->Decimal output=?
Name of sub record column to which to write the results of the reduce or rereduce
operation.

14. Uncorrected sum of squares output column


->Decimal output=?
Name of column to hold the uncorrected sum of squares for data in the aggregate column.

15.Variance output column


->Decimal output=?
->Variance Divisor=?
Name of column to hold the variance of data in the aggregate column.

16.weighting column
Increment the count for the group by the contents of the weight field for each record in
the group, instead of by 1. (Applies to: Percent Coefficient of Variation, Mean Value,
Sum, Sum of Weights, Uncorrected Sum of Squares.)

2.Options:
 Allow null output=True/False
True means that NULL is a valid output value when calculating minimum value,
maximum value, mean value, standard deviation, standard error, sum, sum of weights,
and variance. False means 0 is output when all input values for calculation column
are NULL.
 Method=Hash/Sort
Use hash mode for a relatively small number of groups; generally, fewer than about
1000 groups per megabyte of memory. Sort mode requires the input data set to have
been partition sorted with all of the grouping keys specified as hashing and sorting
keys.
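Conceptually the stage performs a group-by over the grouping keys and emits one summary row per group. A minimal Python sketch of a per-group sum (invented column names and values) is:

def aggregate(rows, group_key, agg_column):
    # Hash-mode style grouping: build an in-memory total per group key.
    totals = {}
    for row in rows:
        totals[row[group_key]] = totals.get(row[group_key], 0) + row[agg_column]
    # Emit one summary row per group, as the Aggregator's output link does.
    return [{group_key: k, "sum_" + agg_column: v} for k, v in totals.items()]

sales = [{"DEPTNO": 10, "SAL": 1000},
         {"DEPTNO": 10, "SAL": 1500},
         {"DEPTNO": 20, "SAL": 2000}]
print(aggregate(sales, "DEPTNO", "SAL"))
# [{'DEPTNO': 10, 'sum_SAL': 2500}, {'DEPTNO': 20, 'sum_SAL': 2000}]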

Example Job for Aggregator:

169
Input data:

Requirement:
Output file11 data:

Output field12 data:


JOB:

database properties;

170
Copy stage properties:

171
output tab:

Aggregator2 properties:

172
output tab:

Target seqfile 11 properties:

173
Column generator properties:

output tab:

174
Aggregator 4 properties:

Output tab:

175
Target seqfile 12 properties:

Aggregator job example2:


Example Job for Aggregator:
Input data:

176
Requirement:
Output file11 data:

Output field12 data:

JOB:

database properties;

177
Copy stage properties:

Aggregator 2 properties:

178
Output tab:

target seqfile11 properties:

179
Column generator properties:

Column genenator out put tab:

180
Aggregator4 properties:

output tab:

181
Target se file12 Properties:

Sorting:
Sorting can be done in different ways:
1. If the source is a database then we can use the ORDER BY clause, so that we can sort
the data based on column names.
2. If the source is a database then we can use a query like this:
Select Distinct column(s)
From tabname
Order by Column(s)

The same task can be performed using a link-level sort.


Step1: open the target sequential file, select Partitioning, and select the check boxes
Perform sort
Stable
Unique
Here the data display order is Ascending, case sensitive, and Nulls first.

182
Link level sort Example:
Input Data:

Target Partitioning tab

Here in the above screenshot, if we observe carefully, 3 check boxes have to be selected.

183
Output:

Example Job for Dynamic Rdbms:


JOB:

Dynamic Rdbms Properties:

184
Click On output
Properties for Outputname=Emp

185
Columns:

Selection:

186
SQL:

View Data:

187
Properties for Outputname=Dept

Columns:

188
SQL:

189
View Data:

Emp_Dataset Properties:

190
View data:

Dept_Data_set:Properties:

191
View data:

ENCODE STAGE:
1. It is a processing stage that encodes the records into a single format with the support of
a command line.
2. It supports 1 input and 1 output.

Properties:
 Stage
 Options Command Line=Compress/gZip
 Input
 Output
 Load the meta data of source files

DECODE STAGE:
1. It is a processing stage that decodes the encoded data.
2. It supports 1 input and 1 output.
Properties:
 Stage
 Options: command line = (uncompress/gunzip)
 Output
 Load the metadata of the source file.

192
Parameter Set :
Procedure to create Parameter Set:
1. Choose File > New to open the New dialog box.
2. Open the Other folder, select the Parameter Set icon, and click OK.
3. The Parameter Set dialog box appears.
4. Enter the required details on each of the pages as detailed in the following
sections.
Parameter Set General tab
Use the General page to specify a name for your parameter set and to provide
descriptions of it.
Parameter Set  Parameters tab:
Use this page to specify the actual parameters that are being stored in the parameter
Set
Parameter Set Value tab:
Use this page to optionally specify sets of values to be used for the parameters in this
parameter set when a job using it is designed or run.

1.Parameter Set General tab


Use the General page to specify a name for your parameter set and to provide
descriptions of it.
Parameter Sets
General Parameters Values
Parameter set Name:
Ps_StagingDB
Short Description
Parameter Set created for connecting to StagingDB
Long Description

2.Parameter Set  Parameters tab:


Use this page to specify the actual parameters that are being stored in the parameter
Set
Parameter Sets
General Parameters Values

Parameter name Prompt Type Default Value Help Text


1 HostServer Server String
2 UserName UserId String
3 Password Password Encrypted

193
3.Parameter Set Value tab:
Use this page to optionally specify sets of values to be used for the parameters in this
parameter set when a job using it is designed or run.

Parameter Sets
General Parameters Values

Value File name HostServer UserName Password


1 DevdDB DevdDB abreddy ******
2 ProdDB ProdDB abreddy ******
3 TestDB TestDB abreddy ******

Run Time Prompt Display Window:


When you run the job, the above parameters are prompted immediately and the run
window asks for the details listed above.

Parameters Limits General


Name Value
Ps_StagingDB DevdDB
Ps_StagingDB ProdDB
Ps_StagingDB TestDB
Server ABC
UserId abreddy
Password ******

Transformer Job Example4:


------------------------------------
InputData1:

InputData2:

Output:

194
Job:

Here Join Type choose=Full outer join


Transformer Logic:
Here create Stage variable
STATUS= If ((DSLink9.leftRec_ENO = DSLink9.rightRec_ENO) And
(DSLink9.ENAME = DSLink9.ENAME1) And (DSLink9.SAL = DSLink9.SAL1) )Then
"SAME" Else If
(((DSLink9.leftRec_ENO = DSLink9.rightRec_ENO)) And
(DSLink9.ENAME = DSLink9.ENAME1) And
(DSLink9.SAL <> DSLink9.SAL1) ) Or
((DSLink9.leftRec_ENO = DSLink9.rightRec_ENO) And
(DSLink9.ENAME <> DSLink9.ENAME1) And (DSLink9.SAL = DSLink9.SAL1)) Or
((DSLink9.leftRec_ENO = DSLink9.rightRec_ENO) And (DSLink9.ENAME <>
DSLink9.ENAME1) And (DSLink9.SAL <> DSLink9.SAL1))Then "UPDATE" Else
"NEW"

Output:

195
Pivot Stage:
1. The Pivot stage is an active stage.
2. The Pivot stage is a processing stage.
3. It maps sets of columns in an input table to a single column in an output table. This type
of mapping is called pivoting.
4. The Pivot Stage converts columns into rows.

Scenario:
Eg., Marks1, Marks2, and Marks3 are three columns.

Task: Convert all the columns into one column, i.e. Marks.

Using Methodology: In the Derivation field of the output column, list the input
columns that are folded into the single output column.

Eg., Column Name – "Marks".

Derivation: Marks1, Marks2, Marks3

Note: Column "Marks" is derived from the input columns Marks1, Marks2, and
Marks3, as the sketch below illustrates.
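A Python sketch of that mapping, with the columns Marks1..Marks3 folded into a single Marks column (the sample values are invented):

def pivot(rows, source_columns, target_column):
    # Each input row produces one output row per pivoted column,
    # i.e. columns are converted into rows.
    out = []
    for row in rows:
        fixed = {k: v for k, v in row.items() if k not in source_columns}
        for col in source_columns:
            out.append({**fixed, target_column: row[col]})
    return out

students = [{"SNO": 1, "Marks1": 60, "Marks2": 70, "Marks3": 80}]
print(pivot(students, ["Marks1", "Marks2", "Marks3"], "Marks"))
# [{'SNO': 1, 'Marks': 60}, {'SNO': 1, 'Marks': 70}, {'SNO': 1, 'Marks': 80}]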

Example Job for Pivot stage:


Input Data:

OutputData:

196
Job:
-----

Pivot stage input columns:

Pivot stage output columns:

197
OutputData:

2.Surrogate Key Stage:


1.The Surrogate Key Generator stage is a processing stage that generates surrogate key
columns and maintains the key source.
2.A surrogate key is a unique primary key that is not derived from the data that it
represents, therefore changes to the data will not change the primary key. In a star schema
database, surrogate keys are used to join a fact table to a dimension table.
3.The Surrogate Key Generator stage can have a single input link, a single output link,
both an input link and an output link, or no links. Job design depends on the purpose of
the stage.

4.You can use a Surrogate Key Generator stage to perform the following tasks:
 Create or delete the key source before other jobs run

198
 Update a state file with a range of key values
 Generate surrogate key columns and pass them to the next stage in the job
View the contents of the state file
5.Generated keys are unsigned 64-bit integers. The key source can be a state file or a
database sequence. If you are using a database sequence, the sequence must be created by
the Surrogate Key stage. You cannot use a sequence previously created outside of
DataStage.
6.You can use the Surrogate Key Generator stage to update a state file, but not a database
sequence. Sequences must be modified with database tools.
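The "generate key from last highest value" behaviour with a flat-file key source can be sketched in Python as follows (the file name and the batch size are placeholders, not the stage's real implementation):

import os

def next_surrogate_keys(state_file, how_many):
    # Read the last highest key from the state file (0 if the file is missing
    # or empty), hand out the next consecutive values, and write the new
    # highest value back so the next run continues from there.
    last = 0
    if os.path.exists(state_file):
        text = open(state_file).read().strip()
        last = int(text) if text else 0
    keys = list(range(last + 1, last + 1 + how_many))
    with open(state_file, "w") as f:
        f.write(str(keys[-1]))
    return keys

print(next_surrogate_keys("empty.txt", 3))   # e.g. [1, 2, 3] on the first run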

InputData:

Output:

Job:

Surrogate key stage properties:

199
Surrogate key properties:
1.Key source:
 Generate output column name= Surrogatekey1(This column we have to generate
at output)
 Source Name=C:/data/bhaskar/empty.txt(we have to create the empty .txt file in
the given path)
 Source Type=Flat File
2.Options:
 Generate key from Last Highest value= Yes/No
Output:

TRANSFORMER STAGE:
The Transformer stage plays a major role in DataStage. It is used to modify the data and
apply functions while populating data from source to target.
It takes one input link and gives one or more output links.
It has 3 components:
1. Stage variables
2. Constraints
3. Derivations (or) Expressions
1. The Transformer stage can work as a copy stage and as a filter stage.
2. The Transformer stage requires a C++ compiler; the stage's generated code is
compiled into machine code.

200
Double click on transformer stage drag and drop of required target columnsClick Ok
The order of execution of transformer stage is
1. Stage variable
2. Constraints
3. Derivations
Example:

How to use the Transformer as a filter stage (or how to apply constraints in the
Transformer stage):

Double click on the transformer stage  double click on Constraint  again double click
on the particular link  click on this window  it provides all the information
automatically and you can view the constraints  for a reject link, tick Otherwise.

Example Derivation:

If Sale_Id < 300 Then Amount_Sold = Amount_Sold + 300
Else If Sale_Id > 300 And Sale_Id < 600 Then Amount_Sold = Amount_Sold + 600
Else If Sale_Id > 600 And Sale_Id < 1000 Then Amount_Sold = Amount_Sold + 1000
Else Amount_Sold = Amount_Sold + 100
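The same derivation expressed in Python; note that, as written, Sale_Id values of exactly 300, 600, or 1000 and above fall through to the final Else branch:

def derive_amount_sold(sale_id, amount_sold):
    # Direct translation of the If/Else derivation above.
    if sale_id < 300:
        return amount_sold + 300
    elif sale_id > 300 and sale_id < 600:
        return amount_sold + 600
    elif sale_id > 600 and sale_id < 1000:
        return amount_sold + 1000
    else:
        return amount_sold + 100   # covers 300, 600, and 1000 and above

print(derive_amount_sold(250, 5000))   # 5300
print(derive_amount_sold(700, 5000))   # 6000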

The Transformer stage provides some functions and other items. Those are:


1. Ds Macro
2. Ds routine
3. Job parameter
4. Input column
5. Stage Variables
6. System Variables
7. String
8. Function
9. Parenthesis
10. If Then Else

1. Ds Macro:

201
Ds Macro provides some built in Functions like
1. DsProjectName()
2. DsJobName()
3. DsHostName()
4. DsJobStartDate()
5. DsJobStartTime()
6. DsJobStartTimeStamp()

2. Ds routine:
It is nothing but a set of user-defined functions (routines).
3. Job parameters:
Job parameters are nothing but variables. These are used to reduce the redundancy of
work.
4. Input columns:
It provides all input column names
5. Stage variables:
Stage variables are used to increase the performance and to reduce the redundancy of
work
How to define stage variable properties:
Click on stage variable right click on stage variable select stage variable
propertiesdefine stage variables

6. System variables:
It contain some built in functions like
1. @INROWNUM
2. @OUTROWNUM
3. @NUMPARTITIONS

@INROWNUM and @OUTROWNUM tell how many records are loaded into the
transformer stage and how many records are output from the transformer stage;
@NUMPARTITIONS tells how many partitions (nodes) are being handled.

7. String:
It provides a hard-coded value within double quotation marks.

8. Functions:
There are several built-in functions in DataStage:
1. Date&Time
2. Logical
3. Mathematical
4. Null Handling
5. Number
6. Raw
7. String
8. Type Conversion

202
9. Utility

EXAMPLE JOBS OF TRANSFORMER STAGE:


1)EXAMPLE JOB FOR TRANSFORMER
JOB:1

Inputfile:

Output requirement

JOB:

203
Sequential file:

Transformer Stage properties:

204
Results stage variable Derivation:
---------------------------------------

Results = If (Input.MARKS1>=35 And Input.MARKS2>=35 And Input.MARKS3>=35)
Then 'PASS' Else 'FAIL'

INPUTCOLUMNS:

OUTPUTLINK:

TARGET FILE:

205
2) EXAMPLE JOB FOR TRANSFORMER:
JOB2:

Input file:

Output requirement:

206
JOB:

Input file properties:

Transformer stage properties:

207
Stage variable derivation:
Field(INPUT.HDATE,"/",3):"-": Field(INPUT.HDATE,"/",2):"-":
Field(INPUT.HDATE,"/",1)

3. EXAMPLE JOB FOR TRANSFORMER:


Inputfile data:

Output Requirements:


Output1:

Output2:

JOB:

INPUT:

208
Transformer1:

Stage variable derivation:

209
Left(INPUT.REC,1)
Transformer2:

Constraints logic:

210
OUTPUTINVC:

OUTPUTPRODID:

211
212
4. EXAMPLE JOB FOR TRANSFORMER STAGE:
Input file:seqfile1:

Input file:seqfile0:

Output Requirement:

Job:

213
Input file file1 properties:

Join properties:

Transformer stage properties:

214
Stage variable Status derivation:

If ((DSLink9.leftRec_ENO = DSLink9.rightRec_ENO) And (DSLink9.ENAME =


DSLink9.ENAME1) And (DSLink9.SAL = DSLink9.SAL1) )Then "SAME" Else If
(((DSLink9.leftRec_ENO = DSLink9.rightRec_ENO)) And (DSLink9.ENAME =
DSLink9.ENAME1) And (DSLink9.SAL <> DSLink9.SAL1) ) Or
((DSLink9.leftRec_ENO = DSLink9.rightRec_ENO) And (DSLink9.ENAME <>
DSLink9.ENAME1) And (DSLink9.SAL = DSLink9.SAL1)) Or
((DSLink9.leftRec_ENO = DSLink9.rightRec_ENO) And (DSLink9.ENAME <>
DSLink9.ENAME1) And (DSLink9.SAL <> DSLink9.SAL1))Then "UPDATE" Else
"NEW"

Target file1 properties:

215
2. REMOVE DUPLICATES:

Inputdata:

Output:

216
JOB:

Remove duplicates properties:

Xml Output Stage:


Input Text file data:
EMPID,NAME,GENDER,COMPANY,CITY
1,BHASKAR,MALE,IBM,HYDERABAD
2,PRADEEP,MALE,WIPRO,BANGLORE
3,SRUJANA,FEMALE,INFOSYS,HYDERABAD
4,KRISHNAVENI,FEMALE,TCS,PUNE
5,SRIKARAN,MALE,COGNIZANT,CHENNAI

Output file XML data:


<!--
Generated by Ascential Software Corporation, DataStage - XMLOutput stage -
Mon Mar 18 03:30:11 2013

217
-->
- <EMPINFO>
- <EMPDETAILS>
<EMPID>1</EMPID>
<NAME>BHASKAR</NAME>
<GENDER>MALE</GENDER>
<COMPANY>IBM</COMPANY>
<CITY>HYDERABAD</CITY>
</EMPDETAILS>
- <EMPDETAILS>
<EMPID>2</EMPID>
<NAME>PRADEEP</NAME>
<GENDER>MALE</GENDER>
<COMPANY>WIPRO</COMPANY>
<CITY>BANGLORE</CITY>
</EMPDETAILS>
- <EMPDETAILS>
<EMPID>3</EMPID>
<NAME>SRUJANA</NAME>
<GENDER>FEMALE</GENDER>
<COMPANY>INFOSYS</COMPANY>
<CITY>HYDERABAD</CITY>
</EMPDETAILS>
- <EMPDETAILS>
<EMPID>4</EMPID>
<NAME>KRISHNAVENI</NAME>
<GENDER>FEMALE</GENDER>
<COMPANY>TCS</COMPANY>
<CITY>PUNE</CITY>
</EMPDETAILS>
- <EMPDETAILS>
<EMPID>5</EMPID>
<NAME>SRIKARAN</NAME>
<GENDER>MALE</GENDER>
<COMPANY>COGNIZANT</COMPANY>
<CITY>CHENNAI</CITY>
</EMPDETAILS>
</EMPINFO>

218
JOB:

XML Output stage properties:


1. Validation settings

219
1. Validation settings

3. Transformation settings

220
3. Options

Options->Input->Columns

221
XML Input Stage:
Xml input file data:
<EMPLOYEE>
<EMP>
<NAME>BHASKAR</NAME>
<DEPT>FINANCE</DEPT>
<SAL>10000</SAL>
</EMP>
<EMP>
<NAME>SRUJANA</NAME>
<DEPT>OTC</DEPT>
<SAL>20000</SAL>
</EMP>
<EMP>
<NAME>PRADEEP</NAME>
<DEPT>CUSTOMER</DEPT>
<SAL>30000</SAL>
</EMP>
</EMPLOYEE>

222
Job:

Input Sequential file data at data browser window

Input Sequential file properties:

223
Sequential file columns:
Here we have to read the entire xml file as a single record

Xml Input Stage Stage tab properties:

224
Xml Input Stage Input tab column properties:

225
Xml Input Stage Output tab general properties:

Xml Input Stage Output tab Transformation settings properties:

226
Xml Input Stage Output tab advanced properties:

Xml Input Stage Output tab columns properties:

227
Xml Input Stage Output tab Last Advanced properties:

228
Target sequential file properties:

Output sequential file data at data browser:

FTP STAGE:
File Transfer from one data stage file server to other file server:

Job:

Sequential file stage Input properties:

229
FTP stage general tab properties:

230
FTP stage properties tab properties:

FTP input properties tab :

231
File Transfer from Local Windows machine to UNIX server:

C:\Documents and Settings\Administrator\Data>ls


BeforeData.txt AfterData.txt Sample.txt

C:\Documents and Settings\Administrator\Data>ftp “server name or IP Address”

Connected to “servername”
 >User <servername:<none>>:Userid
 Password Required for UserId
 Password:<Enter Password>
FTP>cd “Path”
Example:
FTP>cd temp\Data
250 command Successful
FTP>PWD
257 “temp\Data” is a current directory
FTP>put Sample.txt
200 PORT command successful
150 Opening data connection for Sample.txt

232
Containers:
Containers are used to minimize the complexity of a job and for better understanding and
reusability purposes.

There are two types of containers available in DataStage.

1. Local container
2. Shared container

Local containers: used only to minimize the complexity of a job for better understanding
purposes. They are never used for reusability, and their scope is limited to a single job.

Shared container:

Shared containers are used for both purposes: to minimize the complexity of a job and for
reusability. Their scope is the whole project.

Differences between local container and shared containers

Local Container:

1. It is used to minimize the complexity of a job only
2. Its limit is within a job
3. It occupies no memory
4. It can be deconstructed directly

Shared container:
1. It is used for both purposes: to minimize the complexity of a job and for reusability
2. Its limit is within a project
3. It occupies some memory
4. It cannot be deconstructed directly; it first needs to be converted into a local
container and then deconstructed

How to construct container:

Go to the DataStage Designer --> open a specific job --> select the required stages in a job -->


click on Edit --> click on Construct Container, then choose Local or Shared container.
Note: if we want to deconstruct, then right click on the container  click Deconstruct.

233
Usage of shared containers in another job:

Create a new job --> drag and drop the shared container into the new job --> double click on the
shared container --> go to Output --> assign the old output link (shared container link) to the new
output link --> go to Columns --> click on Load --> select Reconcile from the container link -->
click on Validate --> do the same for the remaining links --> click OK.

JOB SEQUENCE:
It is used to run all jobs in a sequence (or) in an order by considering their dependencies. It
has many activities.
How to go to job sequence:
Select Job Sequence  drag and drop the required jobs from Jobs in the Repository  give
connections  save it  compile it. Now these jobs will run sequentially.

These jobs are called job activity.


Double click on a job activity  we can find General/Job/Triggers  go to Job  make
Execution action = Reset if required, then run  go to Triggers  in Expression Type choose
Unconditional / Conditional OK / Conditional Fail.
Unconditional = if job1 finishes successfully (or) aborts, then job2 will run.
Conditional OK = if job1 finishes successfully, then job2 will run.
Conditional Fail = if job1 fails, then job2 will run.

NOTIFICATION ACTIVITY:
It is used to send a mail to required persons automatically.

Double click on the notification activity  go to Notification  SMTP mail server: company
name (www.xyz.com), sender email address: abreddy2@xyz.com, recipients email

234
address: abreddy2@xyz.com, Email subject: Aggregator job has
been aborted, give some information in the body  click OK.

TERMINATOR ACTIVITY:
It is used to send a stop request to all running jobs.

WAIT FOR FILE ACTIVITY:


It is used to wait for a file up to some extent of time

Double click on the Wait For File activity  go to File  Filename: select the file and set the
timeout (24-hour time format only).

235
SEQUENCER:
It is used to connect one activity to another activity. It takes multiple input links and gives
one output link.

Double click on the sequencer  go to Sequencer  choose Mode = All/Any.


All means it needs to receive requests from all input links.
Any means a request from any one input link is enough.

236
ROUTINE ACTIVITY:
It is used to execute a routine between two jobs
Double click on the routine activity  choose the routine name  if the routine requires
parameters, supply them.

EXECUTE COMMAND ACTIVITY:


It is used to execute a UNIX command between two jobs.
Double click on the execute command activity  enter the UNIX command  click OK.

START LOOP / END LOOP ACTIVITY:


These are used to execute some jobs more than one time in a sequence.

237
SLOWLY CHANGING DIMENSIONS:
There are 3 types of SCDs available in DWH.
Type1: It maintains only the current data; updates overwrite the old values, so no history is kept.
Type2: It maintains current data and full historical data.
Type3: It maintains current data and partial historical data.

EXERCISE-1:
Source table:
no name sal
100 Bhaskar 1500
101 Mohan 2000
103 Sanjeev 2000

Target table (before load):
no name sal
100 Bhaskar 1000
101 Mohan 1500
102 Srikanth 2000

After implementing SCD Type1

no name sal
100 Bhaskar 1500
101 Mohan 2000
102 Srikanth 2000
103 Sanjeev 2000

Type-I:

238
In SCD Type-I, if a record exists in the source table and does not exist in the target table, then
simply insert the record into the target table (record 103). If a record exists in both the source
and target tables, then simply update the target table with the source record (100, 101).

Type-II:
While implementing SCD Type-II, two extra columns are maintained in the target,
called Effective Start Date and Effective End Date. The Effective Start Date is also part of
the primary key.
If a record exists in the source and does not exist in the target table, then simply insert the
record into the target table; while inserting, set Effective Start Date equal to the current date
and set Effective End Date to null.
If a record exists in both the source and target tables, we still insert the source record into
the target table, but before inserting it we update the existing record in the target table with
Effective End Date = Current Date - 1.
Then insert the source record into the target table with Effective Start Date = Current Date and
Effective End Date = Null (a sketch of this logic follows the example table below).

no name sal Effective_Start_Date Effective_End_Date


100 Bhaskar 1000 2011-01-31 2012-04-05
101 Mohan 1500 2011-01-31 2012-04-05
102 Srikanth 2000 2011-01-31 2012-04-05
100 Bhaskar 1500 2011-02-01 Null
101 Mohan 2000 2011-02-01 Null
103 Sanjeev 2000 2011-02-01 Null
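A simplified Python sketch of that Type-2 load (the column names and sample rows follow the example above; a real job would typically also assign surrogate keys):

from datetime import date, timedelta

def scd_type2_load(target, source, today):
    # target and source are lists of dicts keyed on 'no'; rows with
    # end_date = None are the current versions.
    yesterday = (today - timedelta(days=1)).isoformat()
    current = {r["no"]: r for r in target if r["end_date"] is None}
    for src in source:
        tgt = current.get(src["no"])
        if tgt is not None and tgt["sal"] == src["sal"] and tgt["name"] == src["name"]:
            continue                          # unchanged record: nothing to do
        if tgt is not None:
            tgt["end_date"] = yesterday       # close the existing version
        target.append({**src,                 # insert the new current version
                       "start_date": today.isoformat(),
                       "end_date": None})
    return target

target = [{"no": 100, "name": "Bhaskar", "sal": 1000,
           "start_date": "2011-01-31", "end_date": None}]
source = [{"no": 100, "name": "Bhaskar", "sal": 1500},
          {"no": 103, "name": "Sanjeev", "sal": 2000}]
print(scd_type2_load(target, source, date(2011, 2, 1)))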

Type-III:
If a record exists in the source and does not exist in the target table, then simply insert the record
into the target table. While inserting, set Effective Start Date equal to the current date and
Effective End Date to null.
If a record exists in both the source and target tables, then check the target table count grouped by
the primary key. If the count = 1, then update Effective End Date = Current Date - 1 and simply
insert the source record into the target table.
If the count is greater than one, then delete the record from the target table, grouped by the primary
key, where Effective End Date is not null. Now update the remaining target record with Effective
End Date = Current Date - 1, then simply insert the source record into the target table.

DIMENSION TABLE:
If a table contains primary keys and provides detailed information about an entity
(the master information), then it is called a dimension table.

FACT TABLE:
If a table contains many foreign keys, holds the transactions, and provides
summarized information, such a table is called a fact table.

DIMENSION TYPES:
There are several dimension types available:

CONFORMED DIMENSION:

243
If a dimension table is shared with more than one fact table (or its key appears as a foreign key in
more than one fact table), then that dimension table is called a conformed dimension.

DEGENERATED DIMENSION:
If a fact table acts as a dimension and is shared with another fact table (or maintains a
foreign key in another fact table), such a table is called a degenerated dimension.

JUNK DIMENSION:
A junk dimension contains text values, genders (male/female), and flag values (true/false),
which are not useful for generating reports. Such a dimension is called a junk dimension.

DIRTY DIMENSION:
If a record occurs more than once in a table, differing only in a non-key attribute,
such a table is called a dirty dimension.

FACT TABLE TYPES:


There are 3 types of facts available in a fact table:
1. Additive facts
2. Semi additive facts
3. Non additive facts

ADDITIVE FACTS:
Facts that can be summed up across all the dimensions in the fact table (for example,
sales amount) are called additive facts.

SEMI ADDITIVE FACT:

Facts that can be summed up only across some of the dimensions (for example, an account
balance can be added across accounts but not across time) are called semi-additive facts.

NON ADDITIVE FACT:


Facts that cannot be summed up across any dimension (for example, ratios and
percentages) are called non-additive facts.

SNOW FLAKE SCHEMA:


A snowflake schema maintains normalized data in the dimension tables. In this schema some
dimension tables do not maintain a direct relationship with the fact table; instead they
maintain a relationship with another dimension table.

DIFFERENCE BETWEEN STAR SCHEMA AND SNOW FLAKE SCHEMA:

244
Star schema:
1. It maintains denormalized data in the dimension tables.
2. Performance increases when joining the fact table to the dimension tables, because
fewer joins are required compared with a snowflake schema.
3. All dimension tables maintain a direct relationship with the fact table.

Snow flake schema:
1. It maintains normalized data in the dimension tables.
2. Performance decreases when joining the fact table to the (shrunken) dimension tables,
because more joins are required compared with a star schema.
3. Some dimension tables are not directly related to the fact table.

Data Profiling
Data Profiling:
1. Data profiling is the process of examining the data available in an existing data source.
2. A data source is usually a database or a file.
3. By doing data profiling we can collect statistics and information about the data.

Data Profiling Tools:


1) Informatica Data Explorer 8x

2) Informatica PowerCenter 8x (Profiling option in Source Analyzer)

3) Oracle Warehouse Builder 10g (Data Profiling node in the Project Explorer)

4) SQL Server Integration Service (Data Profiling Task)

5) IBM InfoSphere (Information Analyzer)

Why we need Data statistics


1. Find out whether existing data can easily be used for other purposes
2. whether the data conforms to particular standards or patterns
3. Assess whether metadata accurately describes the actual values in the source
database
4. Understanding data challenges early in any data intensive project, so that late
project surprises are avoided. Finding data problems late in the project can lead to
delays and cost overruns
Data governance:
A quality control discipline for assessing, managing, using, improving, monitoring,
maintaining, and protecting organizational information.

245
Overview about Data Profiling:
1. Data profiling helps you create a data model in third normal form (3NF), based solely
on data available in the source system.

2. In order to create a data model in third normal form we need the following
information:

1) Domain - Column Data type and Length.

2) Dependency – Primary Key.

3) Relationship - Foreign Key .

Data profiling is divided into three steps

1) Single Column Profiling:


to get column domain information

2) Table Structural Profiling


to get dependency information

3) Cross Table Profiling


to get relationship information.

What is a Domain
A simple example of a Domain is the list of United States state abbreviations. The
Domain could be implemented as a CHAR(2) and would contain the following
valid value set: AL, AK, AZ, AR, CA, CO, CT, DE, DC, FL, GA, HI, ID, IL, IN, IA,
KS, KY, LA, ME, MD, MA, MI, MN, MS, MO, MT, NE, NV, NH, NJ, NM, NY,
NC, ND, OH, OK, OR, PA, RI, SC, SD, TN, TX, UT, VT, VA, WA, WV, WI, WY.

Many Columns can share the same Domain. Columns which share the same
Domain may be Synonym candidates.

A Domain is defined as the set of all valid values for a Column or set of Columns.
Domains contain target data type information, a user-defined list of valid values,
and a list of valid patterns. Each Schema has its own set of Domains.

Normalization:
Normalization is the process of decomposing a relation into smaller, well-structured
relations without anomalies.

246
The rules used on relations are called normal forms.

1.Single Column Profiling:


Single column profiling gives you the column domain information, which is used to
determine the correct data type and length of a column.
Example:
If the values in a column all have six digits and look like 040500, then the data type
could be either INTEGER or DATE in 'mmddyy' or 'ddmmyy' format.

Column profiling produces a list of inferred data types which fit the column data.
Below are some of the reports generated by the SSIS Data Profiling task.

1.Column Length distribution profile:


Reports all distinct lengths of string values in the selected column and percentage of
rows in a table that each length represents
2. Column Statistics Profile: Reports statistics such as Minimum, Maximum,
Average and standard deviation for numeric columns and minimum and maximum
values for date and time columns

3. Column Null Ratio Profile:


Reports the percentage of Null values in a selected column
4. Profile Time
How much time it took to profile the sample data
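The single-column profiles listed above amount to simple statistics computed over one column. A rough Python sketch (with invented sample values) is:

def profile_column(values):
    # Length distribution, null ratio, min/max, and cardinality - the kinds of
    # figures a column profiling task reports for one column.
    non_null = [v for v in values if v is not None]
    lengths = {}
    for v in non_null:
        lengths[len(str(v))] = lengths.get(len(str(v)), 0) + 1
    return {
        "null_ratio": (len(values) - len(non_null)) / len(values),
        "length_distribution": lengths,
        "min": min(non_null),
        "max": max(non_null),
        "cardinality": len(set(non_null)),
    }

print(profile_column(["AP", "KA", None, "AP", "TN"]))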

2.Table Structural Profiling:


Table structural profiling discovers functional dependencies. It asks the question: "If I
know a value in one column (or the values in a set of columns), can I positively determine
the value in another column?"

If you ran a Dependency profile for this table you would find the following
dependencies, among others

247
The list in the previous slide represents true dependencies. Now let's take a look at
the dependencies below.

The first one is not a true dependency because First Name does not positively
determine Last Name, in that “BHASKAR” could be “REDDY” or “RAO”.
Similarly FirstName + LastName doesn’t uniquely determine PAN.

If you add the first list to the Dependency Model, you would get two keys:
EmployeeID
PAN
However, only one of them can be a primary key, the other key is called an alternate
key
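Whether one column "positively determines" another can be checked mechanically. A small Python sketch over rows represented as dicts (sample data invented) is:

def determines(rows, lhs, rhs):
    # lhs -> rhs holds only if no lhs value maps to two different rhs values.
    seen = {}
    for row in rows:
        if seen.setdefault(row[lhs], row[rhs]) != row[rhs]:
            return False
    return True

emps = [{"EmployeeID": 1, "FirstName": "BHASKAR", "LastName": "REDDY"},
        {"EmployeeID": 2, "FirstName": "BHASKAR", "LastName": "RAO"}]
print(determines(emps, "EmployeeID", "LastName"))  # True  (candidate-key behaviour)
print(determines(emps, "FirstName", "LastName"))   # False (not a true dependency)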
3. Cross Table Profiling:
Cross table profiling compares columns in a schema and determines which ones contain
similar values. This profile can determine whether a column or set of columns is
appropriate to serve as a foreign key between the selected tables.

Cross Table profiling can find the following types of redundancies:


 Redundant data to eliminate by creating Synonyms.
 Redundant data that is intentionally redundant to improve database
performance. You may still want to synonym these Column pairs to allow the
normalizer to create a true third normal form (3NF).
 Data that looks redundant but actually represents different business facts
(homonyms).
 For example, the Columns Employee_Age and Quantity_On_Hand may
appear as a pair of redundant Columns if both contain integer values under
100. Although the values in these Columns are similar, the Columns actually
have very different business meanings.
Synonyms:
Two or more Columns that have the same business meaning are called Synonyms

248
Suppose a Schema contains the following two Tables:

Both relations contain employee data, but they are defined separately to segregate public
and private information. The EmpID and EmployeeID Columns have the same business
meaning and can be meaningfully combined into a single Column. In contrast, look at
how the MgrID column is used in the Employee Table. Even though MgrID uses similar
values to EmployeeID, it represents a different role in the database. Therefore you would
not define MgrID and EmployeeID as Synonyms.

Normalization has the following impacts on Synonyms:


 If two or more “Columns made Synonyms” represent the identical construct, they
will collapse into one Column in the normalized model.

 If two “Columns made Synonyms” represent a parent-child relationship, they will


result in two Columns in two Tables, with one Column participating in a primary
Key and the other in the corresponding Foreign Key

Introduction About Infosphere Information Analyzer:


1. Infosphere Information Analyzer is a Data Profiling tool
2. Information Analyzer is a critical component of IBM InfoSphere Information Server
that profiles and analyzes data so that you can deliver trusted information to your users

Information Analyzer Capabilities:


Information Analyzer automates the task of source data analysis by expediting
comprehensive data profiling and minimizing overall costs and resources for critical data
integration projects.
End-to-end data profiling and content analysis:
Provides standard data profiling features and quality controls. The metadata repository
holds the data analysis results and project metadata such as project-level and role-level
security and function administration.
Business-oriented approach:
With its task-based user interface, aids business users in reviewing data for anomalies and
changes over time, and provides key functional and design information to developers.

249
Adaptable, flexible, and scalable architecture :
Handles high data volumes with common parallel processing technology, and utilizes
common services such as connectivity to access a wide range of data sources and targets

Information Analyzer workspace Navigator Menus:

1. Home
2. Overview
3. Investigate
4. Develop
5. Operate

1. Home: Contains system administration, security, configuration, and metadata


tasks
 My home
 Reports
 Metadata Management
 Data stores
 Data schemas
 Tables or Files
 Data Fields
 Data Classes
 User Classes
 Contacts
 Policies
 Global Logical Variable

250
2. Overview: Contains project properties tasks and the project dashboard.
 Dash board
 Project properties
3. Investigate: Contains information discovery and investigation tasks.
 Column Analysis
 Key and Cross Domain Analysis
 Base Line Analysis
 Publish Analysis Results
 Table Management
4. Develop: Contains data transformation and information services enablement tasks.
 Data Quality

5. Operate: Contains scheduling and log view tasks.


 Log Views
 Scheduling Views

Information Analyzer Project roles:


Information Analyzer, administrators can further define user authority by assigning suite
component and project roles to InfoSphere Information Analyzer users.
You can assign suite component roles in the IBM InfoSphere Information Server console
or the IBM InfoSphere Information Server Web console. Project roles can be assigned
only in the Project Properties workspace of the console.
Suite Component roles:
Information Analyzer Data Administrator
Can import metadata, modify analysis settings, and add and modify system sources.
Information Analyzer Project Administrator
Can administer projects by creating, deleting, and modifying information analysis
projects.
Information Analyzer User
Can log on to InfoSphere Information Analyzer, view the dashboard, and open a project.
Information Analyzer Project roles:
Information Analyzer Business Analyst
Reviews analysis results. With this role, users can set baselines and checkpoints for
baseline analysis, publish analysis results, delete analysis results, and view the results of
analysis jobs.
Information Analyzer Data Operator
Manages data analyses and logs. With this role, users can run or schedule all analysis
jobs.

251
Information Analyzer Data Steward
Provides read-only views of analysis results. With this role, users can also view the
results of all analysis jobs.
Information Analyzer DrillDown User
Provides the ability to drill down into source data if drill down security is enabled.

Information Analyzer Real Time Environment :

Data Profiling Process from End to end delivery:


1. Prior to starting data profiling, the data load needs to be completed using a middleware
technology such as DataStage.
2. Source-to-staging-area DataStage jobs need to be created to load the data for data
analysis or for column examination purposes.
3. Once the data load completed now have to start work on IA environment

252
4. Create the IA project to create IA project have to be login with IA admin privileges
5. Import metadata from staging tables to IA environment
6. To Import Meta data have to be login with IA admin privileges
7. Add a data sources to created project
8. Adding the necessary users to the created project
9 .And also can Add the Groups to the created project
10. Assigning a project roles to the user or groups
11. The following are the four different project roles we have in IA
 Information Analyzer Business Analyst
 Information Analyzer Data Operator
 Information Analyzer Data Steward
 Information Analyzer DrillDown User
12.Running CA job for single or multiple columns
13. View Analysis results
14. Capture the analysis results wherever data validation rules are given in the data
profiling requirement template
15. Fill the data profiling requirement sheet for all the columns wherever data
validation rules are given
16.Generate the project required reports
17.Deliver the analysis results template and reports to the Focals

Steps to run Column Analysis job:


1. First login with IBM Infosphere Information server console

2. Enter User Name, Password, and Host Name of the services tier

3. After login to Information Analyzer main home screen click on file menu

4. Select open project and open project will display the list of created projects

5. Select the project whose table you want to analyze with Column Analysis and click on
Open
6. Now in the Information Analyzer workspace navigator menu select Investigate Tab

7. Select Column Analysis then it will display the below column analysis window

253
8. Now select the table and go to the Tasks pane; under the tasks you will find the Run Column
Analysis option. Click on Run Column Analysis; it will take several minutes to
complete the column analysis. Once the column analysis completes, it will give the
status as Analyzed with the analysis run date.

9. Now find the below screen shot the CA status was now Analyzed

2.Process to Review the Column Analysis Results:

1. Now select the EMPID Column from the above list and now go to task and under task
select Open column analysis

254
And next click on view details then it will open the Column Analysis results window

2. Now we will find the column analysis results encountered in different tabs:

1. Overview:

2.Frequency Distribution:

3.Data Class:

255
4. Properties Analysis:
Properties has the fallowing Information
1.Data type
2.Data Length
3.Precision
4. Scale
5. Nullability
6.Cardinality Type

5.Domain and Completeness Analysis:

6.Format tab Analysis

Note: If the format is invalid then we have to change the status to Violation, and then
change the domain value status to "Mark as invalid"; the values associated with that invalid
format then automatically change to Invalid in the Domain and Completeness tab.
Base Line Analysis:
You use baseline analysis to determine whether the structure or content of your data has
changed between two versions of the same data. After a baseline analysis job completes,

256
you can create a report that summarizes the results of that job and then compare the
results with the results from your baseline data.

Baseline analysis reports:


You use baseline analysis to determine whether the structure or content of your data has
changed between two versions of the same data. After a baseline analysis job completes,
you can create a report that summarizes the results of that job and then compare the
results with the results from your baseline data.
There are two types of baseline analysis reports:
Baseline structure - current to prior variances report:
Shows a summary of the structural differences between the baseline version of your data
and another version of the same data. The structure of your data consists of the elements
of the data such as data values and properties. Structure also depends on how the data is
organized.

Baseline content - current to prior variances report:


Shows a summary of the differences in content between the baseline version and another
version of the same data source. Content is the actual data, such as names, addresses, or
dates.

Baseline analysis:

When you want to know if your data has changed, you can use baseline analysis to
compare the column analysis results for two versions of the same data source. The
content and structure of data changes over time when it is accessed by multiple users.
When the structure of data changes, the system processes that use the data are affected.

To compare your data, you choose the analysis results that you want to set as the baseline
version. You use the baseline version to compare all subsequent analysis results of the
same data source. For example, if you ran a column analysis job on data source A on
Tuesday, you could then set the column analysis results of source A as the baseline and
save the baseline in the repository. On Wednesday, when you run a column analysis job
on data source A again, you can then compare the current analysis results of data source A
with the baseline results of data source A.

To identify changes in your data, a baseline analysis job evaluates the content and
structure of the data for differences between the baseline results and the current results.
The content and structure of your data consists of elements such as data classes, data
properties, primary keys, and data values. If the content of your data has changed, there
will be differences between the elements of each version of the data. If you are

257
monitoring changes in the structure and content of your data on a regular basis, you might
want to specify a checkpoint at regular intervals to compare to the baseline. You set a
checkpoint to save the analysis results of the table for comparison. You can then choose
to compare the baseline to the checkpoint or to the most recent analysis results.

If you know that your data has changed and that the changes are acceptable, you can
create a new baseline at any time

Comparing analysis results

To identify changes in table structure, column structure, or column content from the
baseline version to the current version, you can compare analysis results.

Before you begin

You must have InfoSphere™ Information Analyzer Business Analyst privileges and have
completed the following task.

 Setting an analysis baseline


Over time, the data in your table might change. You can import metadata for the table
again, analyze that table, and then compare the analysis results to a prior version to
identify changes in structure and content. You can use baseline analysis to compare the
current analysis results to a previously set baseline.

Procedure

To compare analysis results:

1. On the Investigate navigator menu, select Baseline Analysis.


2. Select the table that you want to compare to the baseline analysis. To locate
changes in data, you must have re-imported metadata for the table and run a
column analysis job on the table.
3. On the Tasks pane, click View Baseline Analysis.
4. In the Pick an Analysis Summary window, select which analysis result you want
to compare to the baseline.
o Select Checkpoint to compare the baseline to the last checkpoint analysis.
o Select Current Analysis to compare the baseline to the last run analysis
job.
What to do next
The View Baseline Analysis pane details the changes in data.

Setting an analysis baseline


To establish which version of analysis results will be used as the baseline for comparison,
you must set an analysis baseline.

Before you begin

You must have InfoSphere™ Information Analyzer Business Analyst privileges and an
Information Analyzer Data Operator must have completed the following task.

 Running a column analysis job


After you run a column analysis job and verify the results of that analysis, you can set the
analysis results as an analysis baseline. You set an analysis baseline to create a basis for
comparison. You can then compare all of the subsequent analyses of this table to the
baseline analysis to identify changes in the content and structure of the data.

Procedure

To set an analysis baseline:

1. On the Investigate navigator menu in the console, select Baseline Analysis.


2. Select a table in which at least one column has been analyzed.
3. On the Tasks pane, click Set Baseline.
4. On the Set Baseline window, click Close.
What to do next
You can now compare the analysis baseline to a subsequent analysis result of the table.

Setting the checkpoint

You can set a checkpoint to save a subsequent point of the selected analysis results for a
table to use in comparing to the baseline. The checkpoint is saved as a point of
comparison even if subsequent analysis jobs are run on the table.

About this task

If you are monitoring changes in the structure and content of your data on a regular basis,
you might want to specify a checkpoint at regular intervals to compare to the baseline.
You set a checkpoint to save the analysis results of the table for comparison. You can then
choose to compare the baseline to the checkpoint or to the most recent analysis results.

A checkpoint can also save results at a point in time for analysis publication.

Procedure

To set the checkpoint:

1. On the Investigate navigator menu in the console, select Baseline Analysis.


2. Select the table that you want to set as the checkpoint.
3. On the Tasks pane, click Set Checkpoint.
What to do next
You can now compare the checkpoint to the baseline.

Identifying changes in your data over time

To determine whether the content and structure of your data has changed over time, you
can use baseline analysis to compare a saved analysis summary of your table to a current
analysis result of the same table.

About this task

You can use baseline analysis to identify an analysis result that you want to set as the
baseline for all comparisons. Over time, or as your data changes, you can import
metadata for the table into the metadata repository again, run a column analysis job on
that table, and then compare the analysis results from that job to the baseline analysis.
You can continue to review and compare changes to the initial baseline as often as needed
or change the baseline if necessary.

To compare analysis results, you complete the following tasks:

1.Setting an analysis baseline


To establish which version of analysis results will be used as the baseline for
comparison, you must set an analysis baseline.
2.Comparing analysis results
To identify changes in table structure, column structure, or column content from the
baseline version to the current version, you can compare analysis results.

Data Rules Creation Process:

1. Go to Develop in the Workspace Navigator menu.

2. Select Data Quality, and on the Tasks pane find New Data Rule Definition.

3. Click New Data Rule Definition; the definition window pops up.

4. Click Overview and provide the data rule name in the Name text field; the short
description and long description are optional.

5. Go to Rule Logic, where we have to write the logic.

Write this logic: source_data exists and len(trim(source_data))<>0

1. Condition:
 IF
 THEN
 ELSE
 AND
 OR
 NOT

2. Opening parentheses:
 (
 ((
 (((

Example:

Once we have written the logic, we have to validate whether it is syntactically
correct. If the logic is valid, click Save and Exit.

3. SourceData
Here source data is a field name
4. Condition
Not
5. Type of check
 =
 >
 <
 >=
 <=
 <>
 Contains --> string containment
 Exists --> null value test
 Matches_Format --> if country = India then zip code format = '999999'
 Matches_Regex
 occurs
 occurs>
 occurs>=
 occurs<
 occurs<=
 In_Reference_column
 In_reference_List
 Unique
 Is_numeric
 Is_Date
6. Reference Data:

Here in reference data we have to give the reference column name


7. Closing parentheses:
 )
 ))
 )))

Rule Builder:

1. Business Problem:
Identify whether a column contains data.
Type of Check: exists

Here, exists checks for non-null values,
and len(trim(source_data)) <> 0 checks for non-empty values.

2. Business Problem:
Type of Check: in_reference_list

Identify whether the GENDER column contains the data value MALE or FEMALE.
In the source we may have either uppercase or lowercase values, so those values have to
be compared with the reference data.

3. Business Problem:
Type of Check: matches_regex

Identify whether the EMPID column contains a numeric value, whether the length
of EMPID is 4, and whether the format of EMPID is '9999'.

4. Business Problem:
Type of Check: matches_format

If country code = 'India', then check whether the zip code format is '999999' or
not. A rough sketch of these checks follows below.
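
The following is a minimal Python sketch (not Information Analyzer rule syntax) of what these checks do; the column names (GENDER, EMPID, COUNTRY, ZIP), the reference list, and the sample record are assumptions for illustration.

import re

# Hypothetical record; the field names and reference list are illustrative assumptions.
record = {"GENDER": "Male", "EMPID": "1042", "COUNTRY": "India", "ZIP": "500081"}
gender_reference_list = {"MALE", "FEMALE"}

def exists_and_not_empty(value):
    """exists check plus len(trim(source_data)) <> 0: not null and not blank."""
    return value is not None and len(value.strip()) != 0

def in_reference_list(value, reference):
    """in_reference_list check, comparing case-insensitively as described above."""
    return value is not None and value.strip().upper() in reference

def matches_regex(value, pattern):
    """matches_regex check: the whole value must match the pattern."""
    return value is not None and re.fullmatch(pattern, value) is not None

def matches_format(value, fmt):
    """matches_format check: '9' in the format string means a digit in the value."""
    if value is None or len(value) != len(fmt):
        return False
    return all(v.isdigit() if f == "9" else v == f for v, f in zip(value, fmt))

print(exists_and_not_empty(record["GENDER"]))                       # True
print(in_reference_list(record["GENDER"], gender_reference_list))   # True
print(matches_regex(record["EMPID"], r"\d{4}"))                     # True
if record["COUNTRY"].strip().lower() == "india":
    print(matches_format(record["ZIP"], "999999"))                  # True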

Quality Stage:
1. Why investigate:

Discover trends and potential anomalies in the data
Identify invalid and default values in the data
Verify the reliability of the data in the fields to be used as matching criteria
Gain a complete understanding of the data in context

Investigate:
Verify the domain:
Review each field and verify that the data matches the metadata
Identify the data formats and the missing and default values
Identify the data anomalies:
Format
Structure
Content

Features of investigate:
Analyze free-form and single-domain columns
Provide a frequency distribution of distinct values and patterns

Investigate methods:
Character discrete
Character concatenate
Word investigate

INVESTIGATE STAGE:
1. Character Discrete Investigate, C mask:
Job:

Add the columns whichever you want to investigate.

Click on Change Mask and select the C mask for all the fields.

At the output it gives 5 columns:

1. qsInvColumnName
2. qsInvPattern
3. qsInvSample
4. qsInvCount
5. qsInvPercent

The output looks like the screenshot below.

The screenshot above is only for one field; the output is similar for the other fields whichever you
selected.

2. Character Discrete Investigate, T mask:

Job:

Add the columns whichever you want to investigate.

Click on Change Mask and select the T mask for all the fields.

At the output it gives 5 columns:

1. qsInvColumnName
2. qsInvPattern
3. qsInvSample
4. qsInvCount
5. qsInvPercent

The output looks like the screenshot below.

3. Character Discrete Investigate, X mask:

Add the columns whichever you want to investigate.

Here I selected the C, T, and X masks alternately for the Policy Number column.

The output looks like the screenshot below; a rough sketch of how the masks build patterns follows.
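
To get an intuition for how the masks drive the frequency distribution, here is a minimal Python sketch (not QualityStage itself) of how a per-character mask could build the pattern report. The type codes assumed here are 'a' for alphabetic, 'n' for numeric, and 'b' for blank; verify the exact codes and the X-mask behaviour against your own Investigate output.

from collections import Counter

def mask_value(value, masks):
    """Build an investigate-style pattern from a value and a per-character mask string.
    C keeps the character itself, T emits an assumed type code (a=alpha, n=numeric,
    b=blank, other characters as-is), and X drops the character from the pattern."""
    pattern = []
    for ch, mask in zip(value, masks):
        if mask == "C":
            pattern.append(ch)
        elif mask == "T":
            if ch.isalpha():
                pattern.append("a")
            elif ch.isdigit():
                pattern.append("n")
            elif ch == " ":
                pattern.append("b")
            else:
                pattern.append(ch)
        # mask == "X": skip the character entirely
    return "".join(pattern)

# Hypothetical policy numbers; the column and values are illustrative assumptions.
policy_numbers = ["AB-1234", "AB-5678", "CD 9012", "AB-13X4"]

masks = "TTTTTTT"  # apply the T mask to every character position
frequency = Counter(mask_value(v, masks) for v in policy_numbers)

for pattern, count in frequency.most_common():
    print(pattern, count, f"{100.0 * count / len(policy_numbers):.1f}%")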

Character Concatenate Investigate, T mask:

Job:

Add the columns whichever you want to concatenate and investigate.

The output looks like the screenshot below.

Character Concatenate Investigate, C mask:

Job:

Add the columns whichever you want to concatenate and investigate.

The output looks like the screenshot below.

Character Concatenate Investigate, X mask:

Job:

Add the columns whichever you want to concatenate and investigate.

The output looks like the screenshot below.

Word Investigate: adding a rule set to the NAME field


Job:

Select the name rule set USNAME.

Select the Full Name field from the available data columns.

The output looks like the screenshot below.

Word Investigate: adding a rule set to the NAME field


Token report and pattern report
Job:

Select the name rule set USNAME
and add the Full Name field from the available data columns.
If you want both reports, the token report and the pattern report, tick both the Token Report box
and the Pattern Report box.

The pattern report output looks like the screenshot below.
The word investigate pattern report gives 5 columns at the output:
1. qsInvColumnName
2. qsInvPattern
3. qsInvSample
4. qsInvCount
5. qsInvPercent

Pattern report output:

The token report output looks like the screenshot below.
The word investigate token report gives 3 columns at the output:
1. qsInvCount
2. qsInvWord
3. qsInvClassCode

Token report output

Classification Code table:
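
For intuition, the following is a small, hypothetical Python sketch of what word investigation does: tokenize a free-form name field, classify each token against a classification table, and count the resulting words and patterns. The classification table contents and the codes used here (F for first name, I for initial, + for an unclassified alphabetic word) are assumptions chosen to match the USNAME examples later in this section, not the actual rule set files.

from collections import Counter

# Hypothetical classification table; real USNAME classifications come from the
# rule set's classification file and are far larger.
classifications = {"WILLIAM": "F", "MARJORIE": "F", "CAROLYNNE": "F"}

def classify_token(token):
    token = token.strip(",.").upper()
    if token in classifications:
        return classifications[token]
    if len(token) == 1 and token.isalpha():
        return "I"     # single letter treated as an initial
    return "+"         # unclassified alphabetic word

def investigate_names(values):
    """Produce a token-report-style word count and a pattern-report-style pattern count."""
    token_counts, pattern_counts = Counter(), Counter()
    for value in values:
        tokens = value.split()
        token_counts.update(t.strip(",.").upper() for t in tokens)
        pattern_counts[" ".join(classify_token(t) for t in tokens)] += 1
    return token_counts, pattern_counts

names = ["DAMORA WILLIAM H", "HARRIS MARJORIE M", "HOCHREITER, CAROLYNNE"]
tokens, patterns = investigate_names(names)
print(patterns)   # e.g. Counter({'+ F I': 2, '+ F': 1})
print(tokens)     # word frequencies; class codes come from classify_token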

2.STANDARDIZE STAGE:
Example job:

Open the Standardize stage, select the rule set text field, open the Standardize Rules folder,
within that folder open the OTHER folder, open the COUNTRY folder, and select
COUNTRY.
Next, in the literal text field, enter ZQUSZQ.

Add the columns you want to select from the available data columns.
Select the column name <literal> and the AddressLine1, AddressLine2, City, State, and Zip
columns in the selected columns list.
You will get the screen below after entering everything.

Next click OK;
you will get the screen below.

Next click OK.

Output:

At the output it gives the additional columns ISOCOUNTRYCODE and
identifierFlag_COUNTRY.

Quality Stage:
Investigate: 3 methods:
1. Character Discrete --> C, T, X masks
2. Character Concatenate --> C, T, X masks
3. Word Investigate

Investigate default column names for the pattern report:

1. qsInvColumnName
2. qsInvPattern
3. qsInvSample
4. qsInvCount
5. qsInvPercent

Investigate default column names for the token report:
1. qsInvCount
2. qsInvWord
3. qsInvClassCode

Lab:
Character Discrete, C mask (select one or many columns)
Character Concatenate, C mask (select two or more columns to concatenate)
Word Investigate: FullName
Token report
Pattern report
Word Investigate: Address (pass AddressLine1, AddressLine2)
Token report
Pattern report
Word Investigate: Area (City, State, Zip)
Token report
Pattern report

2.Standardize stage:
1. Country identifier:
--> select the COUNTRY rule set from the OTHER folder
--> pass the literal ZQUSZQ and add the columns AddressLine1, AddressLine2, City,
State, Zip
--> filter the records wherever we have the flag 'Y'; those are US records
--> split US and non-US records into separate targets (a rough sketch follows below)
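
A minimal Python sketch of the US / non-US split described above, assuming the Standardize output flag column identifierFlag_COUNTRY = 'Y' marks US records; in the real job this split would be done with a Filter or Transformer stage, and the sample records are invented for illustration.

# Hypothetical standardized records; only the flag column name follows the Standardize
# output mentioned above, everything else is assumed for illustration.
records = [
    {"AddressLine1": "12 MAIN ST", "City": "AUSTIN", "State": "TX",
     "Zip": "73301", "ISOCOUNTRYCODE": "US", "identifierFlag_COUNTRY": "Y"},
    {"AddressLine1": "221B BAKER ST", "City": "LONDON", "State": "",
     "Zip": "NW1 6XE", "ISOCOUNTRYCODE": "GB", "identifierFlag_COUNTRY": "N"},
]

# Split into two targets, as the lab step describes.
us_records = [r for r in records if r["identifierFlag_COUNTRY"] == "Y"]
non_us_records = [r for r in records if r["identifierFlag_COUNTRY"] != "Y"]

print(len(us_records), "US records,", len(non_us_records), "non-US records")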

2. Apply the USPREP rule set to separate the name, address, and area components of the
input fields into their respective domains.

 ->Select USPREP rule set from standardize rules


 ->pass ZQNAMEZQ and add the column “Fullname”
 ->pass ZQADDRZQ and add the column “addressline1”
 ->pass ZQADDRZQ and add the column “addressline2”

 ->pass ZQAREAZQ and add the column “City”
 ->pass ZQAREAZQ and add the column “State”
 ->pass ZQAREAZQ and add the column “Zip”
Standardize USNAME USADDR USAREA
1. Select the USNAME rule set from Standardize Rules and add the column
NameDomain_USPREP
2. select new process and select the USADDR rule set and add the column
AddressDomain_USPREP
3. select new process and select the USAREA rule set and add the column
AreaDomain_USPREP

Rules Columns
USNAME.SET NameDomain_USPREP
USADDR.SET AddressDomain_USPREP
USAREA.SET AreaDomain_USPREP
Investigate unhandled name patterns
Take the above job as input and use 3 Investigate stages:
1. for InvUnhandledName
2. for InvUnhandledAddr
3. for InvUnhandledArea

Inv Unhandled Name:


select the method character concatenate for Name
select the columns
UnhandledPattern_USNAME, --- >set C mask
UnhandledData_USNAME--- >set X mask
InputPattern_USNAME--- >set X mask
NameDomain_USPREP--- >set X mask

InvUnhandledAddr:
select the method character concatenate for Address
select the columns
UnhandledPattern_USADDR, --- >set C mask
UnhandledData_USADDR--- >set X mask
InputPattern_USADDR--- >set X mask
AddressDomain_USPREP--- >set X mask

InvUnhandledArea:
select the method character concatenate for Area
select the columns
UnhandledPattern_USAREA, --- >set C mask
UnhandledData_USAREA--- >set X mask
InputPattern_USAREA--- >set X mask
AreaDomain_USPREP--- >set X mask

Rule Set override
1. Input Pattern +FI

UnhandledPattern   UnhandledData        InputPattern   InputNameText
+FI                DAMORA WILLIAM H     +FI            DAMORA WILLIAM H

Process to Apply an Input Pattern Override

 > Select the name rule set USNAME
 > Click the Test button to test the string DAMORA WILLIAM H
 > Select Overrides
 > Select Input Pattern
 > Enter the input pattern +FI
 > From the Current Pattern List select the first entry, +

Current Pattern List
Token   Override Code
+       PrimaryName1
F       AdditionalName1
I       AdditionalName1

 > Tick the check boxes Move Current, Original Value, No Leading Space
 > Do the same for the F and I tokens
 > Now test the string DAMORA WILLIAM H

A rough sketch of what such a pattern override does is shown below.
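
To illustrate the effect of such an override (not how QualityStage implements it), here is a small hypothetical Python sketch in which the tokens of the pattern +FI are routed to the dictionary fields listed in the table above; the simplified token classification is an assumption for the example.

# Hypothetical sketch of an input-pattern override: tokens of the pattern "+FI"
# are routed to dictionary fields, mirroring the override table above.
pattern_override = {
    "+FI": ["PrimaryName1", "AdditionalName1", "AdditionalName1"],
}

def classify(token):
    # Simplified classification assumed for the example: known first names -> F,
    # single letters -> I, anything else -> + (unclassified alpha word).
    first_names = {"WILLIAM", "MARJORIE"}
    if token in first_names:
        return "F"
    if len(token) == 1:
        return "I"
    return "+"

def apply_override(name):
    tokens = name.split()
    pattern = "".join(classify(t) for t in tokens)
    fields = pattern_override.get(pattern)
    if fields is None:
        return {"UnhandledPattern": pattern, "UnhandledData": name}
    parsed = {}
    for field, token in zip(fields, tokens):
        parsed[field] = (parsed.get(field, "") + " " + token).strip()
    return parsed

print(apply_override("DAMORA WILLIAM H"))
# {'PrimaryName1': 'DAMORA', 'AdditionalName1': 'WILLIAM H'}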

2. Unhandled Pattern +,+

UnhandledPattern   UnhandledData            InputPattern   InputNameText
+,+                HOCHREITER, CAROLYNNE    +,+            HOCHREITER, CAROLYNNE

Process to Apply a Classification Override

 > Select the name rule set USNAME
 > Click the Test button to test the string HOCHREITER, CAROLYNNE
 > Select Overrides
 > Select Classification
 > Enter the input token CAROLYNNE
 > From the classification drop-down menu choose F-FirstName and click the Add button
 > Use the test process to validate the results
 > Now test the string HOCHREITER, CAROLYNNE

3. Unhandled Pattern FFI
The pattern FFI represents a last name that was recognized as a first name, followed by a
first name and an initial.

UnhandledPattern   UnhandledData        InputPattern   InputNameText
FFI                HARRIS MARJORIE M    +FI.           HARRIS MARJORIE M.

Process to Apply an Unhandled Pattern Override

 > Select the name rule set USNAME
 > Click the Test button to test the string HARRIS MARJORIE M
 > Select Overrides
 > Select the Unhandled Pattern tab and enter the unhandled pattern FFI
 > From the Current Pattern List select the first entry, F
 > From the User Override Options choose
a. Dictionary Fields: PrimaryName
b. Move Current
c. Original Value
d. No Leading Space

 > Repeat the process for the remaining tokens, F and I
 > Now test the string HARRIS MARJORIE M

ETL PROJECT LIFE CYCLE:

Data warehousing projects are categorized into 4 types.

1) Development Projects.
2) Enhancement Projects
3) Migration Projects
4) Production support Projects.

-> The following are the different phases involved in an ETL project development life
cycle.

1) Business Requirement Collection ( BRD )


2) System Requirement Collection ( SRD )
3) Design Phase
a) High Level Design Document ( HLD )
b) Low level Design Document ( LLD )
c) Mapping Design
4) Code Review
5) Peer Review
6) Testing
a) Unit Testing
b) System Integration Testing.
c) User Acceptance Testing (UAT)

7) Pre - Production
8) Production (Go-Live)

Business Requirement Collection: -


---------------------------------------------

-> The business requirement gathering is started by the Business Analyst, the onsite technical
lead, and the client business users.

-> In this phase, a Business Analyst prepares Business Requirement Document


(BRD) (or) Business Requirement Specifications (BRS)

-> BR collection takes place at client location.

-> The o/p from BR Analysis are

-> BRS: - The Business Analyst will gather the business requirements and document them in the
BRS.

-> SRS: - Senior technical people (or) the ETL architect will prepare the SRS, which
contains the s/w and h/w requirements.

The SRS includes

a) O/S to be used ( Windows or UNIX )
b) RDBMS required to build the database ( Oracle, Teradata etc )
c) ETL tools required ( Informatica, DataStage )
d) OLAP tools required ( Cognos, BO )

The SRS is also called the Technical Requirement Specification ( TRS )

Designing and Planning the solutions: -


------------------------------------------------

-> The o/p from design and planning phase is


a) HLD ( High Level Design ) Document
b)LLD ( Low Level Design ) Document

HLD ( High Level Design ) Document : -

An ETL Architect and DWH Architect participate in designing a solution to build a


DWH.
An HLD document is prepared based on Business Requirement.

LLD ( Low Level Design ) Document : -

Based on the HLD, a senior ETL developer prepares the Low Level Design Document.
The LLD contains more technical details of an ETL system.
An LLD contains a data flow diagram (DFD) and details of the source and targets of each
mapping.
An LLD also contains information about full and incremental loads.
After the LLD, the development phase starts.

Development Phase (Coding):-
--------------------------------------------------

-> Based on the LLD, the ETL team will create the mappings (ETL code).
-> After designing the mappings, the code ( mappings ) will be reviewed by
developers.

Code Review:-

-> Code review will be done by a developer.

-> In code review, the developer will review the code and the logic but not the data.
-> The following activities take place in code review:
-> Check the naming standards of the transformations, mappings, etc.
-> Check the source and target mapping ( whether the correct logic is placed in the mapping or not ).

Peer Review:-

-> The code will be reviewed by your team member (a third-party developer).

Testing:-
--------------------------------

The following types of testing are carried out in the testing environment.


1) Unit Testing
2) Development Integration Testing
3) System Integration Testing
4) User Acceptance Testing

Unit Testing:-

-> A unit test for the DWH is white-box testing; it should check the ETL procedures
and mappings.
-> The following are the test cases that can be executed by an ETL developer (a rough
sketch of a record-count check follows below).
1) Verify data loss
2) No. of records in the source and target
3) Data load / Insert
4) Data load / Update
5) Incremental load
6) Data accuracy
7) Verify naming standards
8) Verify column mapping

-> The unit test will be carried out by the ETL developer in the development phase.
-> The ETL developer has to do the data validations also in this phase.
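
As one illustration of test case 2 (number of records in the source and target), here is a minimal self-contained Python sketch using an in-memory SQLite database; the table names and sample rows are assumptions for illustration, not part of any particular project's test harness.

import sqlite3

# Self-contained demo: build a throwaway source and target table in memory.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src_sales (sale_id INTEGER, amount REAL)")
conn.execute("CREATE TABLE tgt_sales_fact (sale_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO src_sales VALUES (?, ?)", [(1, 10.0), (2, 20.0), (3, 30.0)])
conn.executemany("INSERT INTO tgt_sales_fact VALUES (?, ?)", [(1, 10.0), (2, 20.0), (3, 30.0)])

def count_rows(conn, table):
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

src_count = count_rows(conn, "src_sales")
tgt_count = count_rows(conn, "tgt_sales_fact")

# The unit test passes only if no records were lost or duplicated during the load.
assert src_count == tgt_count, f"record count mismatch: source={src_count} target={tgt_count}"
print("record counts reconcile:", src_count)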

Development Integration Testing -

-> Run all the mappings in the sequence order.
-> First run the source to stage mappings.
-> Then run the mappings related to dimensions and facts.

System Integration Testing:-

-> After the development phase, we have to move our code to the QA environment.

-> In this environment, we give read-only permission to the testing people.
-> They will test all the workflows.
-> And they will test our code according to their standards.

User Acceptance Testing (UAT):-

-> This test is carried out in the presence of client side technical users to verify the
data migration from source to destination.

Production Environment:-
---------------------------------

-> Migrate the code into the Go-Live environment from test environment ( QA
Environment ).

EXPLANATION:

We have to start with the following: our projects are mainly onsite and offshore model
projects. In this project we have one staging area in between the source and target
databases. In some projects they won't use staging areas. A staging area simplifies the
process.

Architecture

Analysis and Requirement Gathering --> Design --> Development --> Testing --> Production


Analysis and Requirement Gathering: Output: Analysis doc, subject areas
100% onsite; Business Analyst, Project Manager.
Gather the useful information for the DSS, identify the subject areas, identify
the schema objects, and so on.

Design: Output: Technical design docs, HLD, UTP; ETL Lead, BA and Data Architect

80% onsite (schema design in Erwin, implementation in the database, and preparation of the
technical design document for ETL).

20% offshore: HLD & UTP

Based on the technical specs, developers have to create the HLD (high level design);
it will have the Informatica flow chart and the transformations required for each
mapping.
In some companies they won't have an HLD. Directly from the technical specs they will
create mappings. The HLD will cover only about 75% of the requirement.

UTP: Unit Test Plan. Write the test cases based on the requirement, both positive and
negative test cases.

Development: Output: bug-free code, UTR, Integration Test Plan

ETL team and offshore BA
100% offshore

Based on the HLD, you have to create the mappings. After that, a code review and a code
standards review will be done by another team member. Based on the review
comments you have to update the mappings. Unit testing is based on the UTP: you have to
fill in the UTP, enter the expected values, and name it the UTR (Unit Test Results). Two
code reviews and two rounds of unit testing will be conducted in this phase. Then the code is migrated
to the testing repository. The integration test plan has to be prepared by the senior people.

Testing: Output: ITR, UAT, deployment doc and user guide

Testing team, Business Analyst and client.

80% offshore
Based on the integration test plan, the testing team tests the application and gives the bug list to
the developer. Developers fix the bugs in the development repository and the code is again
migrated to the testing repository. Testing starts again and repeats until the code is bug free.

20% onsite
UAT: User Acceptance Testing. The client will do the UAT; this is the last phase of the ETL
project. If the client is satisfied with the product, the next step is deployment in the production
environment.

Production: 50% offshore, 50% onsite
Work will be distributed between offshore and onsite based on the run time of the
application. Mapping bugs need to be fixed by the development team. The development team will
provide support for a warranty period of 90 days or for the number of days in the agreement.

In ETL projects there are three repositories. For each repository the access permissions and
location will be different.
Development: E1

Testing: E2

Production: E3

1.Project Explanation:

I'm giving a generic explanation of the project. Any project, whether banking, sales, or
insurance, can use this explanation.

First you have to start with:

1) You have to first explain the objective of the project
and the client's expectations.
2) You have to describe where your involvement is, the responsibilities of your job, and the limitations of
your job.
Add some points from the Project Architecture post, like the offshore and onsite model
and the team structure, etc.

The main objective of this project is that we are providing a system with all the information
regarding sales / transactions (sales for a sales domain, transactions for a banking or
insurance domain) of the entire organization all over the country, US / UK (based on the
client location US/UK/…..). We get the daily transaction data from all branches at
the end of the day. We have to validate the transactions and implement the business
logic based on the transaction type or transaction code. We have to load all the
historical data into the DWH, and once the historical load is finished, we have to load delta
loads. A delta load means the last 24 hours of transactions captured from the source system; in
other words you can call it Change Data Capture (CDC). These delta loads are
scheduled on a daily basis. Pick some points from the "What is Target Staging Area" post
(source-to-staging mappings, staging-to-warehouse mappings) based on your comfort level.
Each transaction contains a transaction code; based on the transaction code you can
identify whether that transaction belongs to sales, purchase / car insurance, health
insurance / deposit, loan, payment (you have to change the words based on the
project), etc. Based on that code the business logic will change; we validate and
calculate the measures and load them to the database.
One mapping explanation:
In the Informatica mapping, we first look up all the transaction codes against the code master
table to identify the transaction type, implement the correct logic, and filter the
unnecessary transactions, because in an organization there are a lot of transactions
but you have to consider only the transactions required for your project. Only the
transactions whose code exists in the code master table have to be considered; other
transactions are loaded into one table called the Wrap table, and invalid
records (transaction code missing, null, or spaces) go to the error table. For each dimension
table we create a surrogate key and load it into the DWH tables. A rough sketch of this
routing logic follows below.
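
A minimal Python sketch of the lookup-and-route logic just described; the code master contents, the record layout, and the table names (Wrap table, error table) are assumptions for illustration, not the project's actual Informatica mapping.

# Hypothetical code master and daily transactions; values are illustrative only.
code_master = {"SAL": "sales", "PUR": "purchase", "DEP": "deposit"}

transactions = [
    {"txn_id": 1, "txn_code": "SAL", "amount": 120.0},
    {"txn_id": 2, "txn_code": "XYZ", "amount": 50.0},   # unknown code -> wrap table
    {"txn_id": 3, "txn_code": None,  "amount": 75.0},   # missing code  -> error table
    {"txn_id": 4, "txn_code": "   ", "amount": 80.0},   # spaces only   -> error table
]

target, wrap_table, error_table = [], [], []

for txn in transactions:
    code = txn["txn_code"]
    if code is None or code.strip() == "":
        error_table.append(txn)                 # invalid: missing, null, or spaces
    elif code not in code_master:
        wrap_table.append(txn)                  # valid-looking but not a required code
    else:
        txn["txn_type"] = code_master[code]     # lookup result drives the business logic
        target.append(txn)

print(len(target), "to DWH,", len(wrap_table), "to wrap table,", len(error_table), "to error table")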

SCD2 Mapping:
We are implementing an SCD2 mapping for the customer dimension or account dimension
to keep the history of the accounts or customers. We are using the SCD2 date
method. Before telling this, you should know this SCD2 method clearly; be careful about
it. A rough sketch of the date-based SCD2 idea follows below.
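
A minimal Python sketch of the date-based SCD2 idea (expire the current dimension row and insert a new version when a tracked attribute changes); the column names and in-memory structures are assumptions for illustration, not the actual mapping.

from datetime import date

HIGH_DATE = date(9999, 12, 31)  # assumed open-ended end date for the current row

# Existing customer dimension rows (surrogate key, natural key, attribute, validity dates).
customer_dim = [
    {"cust_sk": 1, "cust_id": "C100", "city": "Hyderabad",
     "eff_date": date(2020, 1, 1), "end_date": HIGH_DATE},
]
next_sk = 2

def apply_scd2(dim, incoming, load_date):
    """Expire the current row when a tracked attribute changes and insert a new version."""
    global next_sk
    current = next((r for r in dim if r["cust_id"] == incoming["cust_id"]
                    and r["end_date"] == HIGH_DATE), None)
    if current is None or current["city"] != incoming["city"]:
        if current is not None:
            current["end_date"] = load_date        # close out the old version
        dim.append({"cust_sk": next_sk, "cust_id": incoming["cust_id"],
                    "city": incoming["city"], "eff_date": load_date,
                    "end_date": HIGH_DATE})        # new current version
        next_sk += 1

apply_scd2(customer_dim, {"cust_id": "C100", "city": "Bangalore"}, date(2024, 6, 1))
for row in customer_dim:
    print(row)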

Responsibilities: pick from the Project Architecture post and tell them according to your comfort
level. We are responsible only for development and testing; for scheduling we are
using third-party tools (Control-M, AutoSys, Job Tracker, Tivoli, etc.). We simply
give the dependencies between each mapping and the run time. Based on that
information the scheduling tool team will schedule the mappings. We won't schedule in
Informatica. That's it, finished.

Please let me know if you require more explanation
regarding any point.

Tell me about yourself and explain your current project process:

I did my B.Sc. in Computers from Osmania University, AP, in 2007. After that I
had an opportunity to work for Wipro Technologies from Oct 2006 to Aug 2008,
where I started off my career as an ETL developer. I was with Wipro almost 2
years, then I shifted to Ness Technologies in Aug 2008. Presently I am working with
Ness.
In total I have 3.5 years of experience in DWH using the DataStage tool in development
and enhancement projects. Primarily I worked in the healthcare and manufacturing
domains.
In my current project my roles & responsibilities are basically:
 I am working with the onsite-offshore model, so we get the tasks from my
onsite team.
 As a developer, first I need to understand the physical data model, i.e.
dimensions and facts and their relationships, and also the functional specifications that describe
the business requirement designed by the Business Analyst.
 I am involved in the preparation of the source-to-target mapping sheet (tech specs),
which tells us what the source and target are, which column we need to map
to which target column, and what the business logic would be. This document
gives a clear picture for the development.
 Creating DataStage jobs using different transformations to implement the business
logic.
 Preparation of unit test cases as per the business requirement is also one of my
responsibilities.
 I am also involved in unit testing for the mappings developed by myself.
 I do source code reviews for the DataStage jobs developed by my team
members.
 I am also involved in the preparation of the deployment plan, which contains the list
of DataStage jobs that need to be migrated; based on this the deployment team can
migrate the code from one environment to another environment.
 Once the code is rolled out to production, we also work with the production support
team for 2 weeks, where in parallel we give the KT. So we also prepare the KT
document as well for the production team.

Coming to My Current Project:


Currently I am working on the XXX project for the YYY client. Generally YYY does not have a
manufacturing unit. What the business (BIZ) does here is, before the quarter ends, it
calls for quotations from the primary supply channels; this process is called an
RFQ (Request for Quotation). Once BIZ creates an RFQ, a notification automatically
goes to the supply channels. These supply channels send back their respective quoted
values, which we call the response from the supply channel. After that, BIZ starts
negotiations with the supply channels and then approves the RFQs.
All these activities (creating the RFQ, the supplier response, approving the RFQ, etc.) are performed
in Oracle Apps; this is the source front-end application. This data gets
stored in the OLTP system. So the OLTP contains all the RFQ, supplier response, and approval
status data.
We have some Oracle jobs running between the OLTP and the ODS which replicate the OLTP
data to the ODS. It is designed in such a way that any transaction entering the OLTP
is immediately reflected in the ODS.
We have a staging area where we load the entire ODS data into staging tables.
For this we have created some DataStage ETL jobs; these jobs truncate and
reload the staging tables for each session run. Before loading the staging tables we
drop the indexes, and after loading the bulk data we recreate the indexes using
stored procedures. A rough sketch of this load pattern follows below.
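
A minimal self-contained Python/SQLite sketch of the truncate, drop-index, bulk-load, recreate-index pattern just described; the table, index, and column names are illustrative assumptions, and the real jobs run against the project database and its stored procedures instead.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stg_rfq (rfq_id INTEGER, supplier TEXT, quoted_value REAL)")
conn.execute("CREATE INDEX idx_stg_rfq_id ON stg_rfq (rfq_id)")

ods_rows = [(1, "SUPPLIER_A", 1000.0), (2, "SUPPLIER_B", 1200.0)]  # pretend ODS extract

# 1. Truncate the staging table for this session run (SQLite has no TRUNCATE keyword).
conn.execute("DELETE FROM stg_rfq")

# 2. Drop the index before the bulk load so inserts are not slowed by index maintenance.
conn.execute("DROP INDEX idx_stg_rfq_id")

# 3. Bulk load the ODS data into the staging table.
conn.executemany("INSERT INTO stg_rfq VALUES (?, ?, ?)", ods_rows)

# 4. Recreate the index (the real project does this through stored procedures).
conn.execute("CREATE INDEX idx_stg_rfq_id ON stg_rfq (rfq_id)")
conn.commit()

print(conn.execute("SELECT COUNT(*) FROM stg_rfq").fetchone()[0], "rows staged")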
Then we extract all this data from the staging area and load it into the dimensions and facts. On
top of the dims and facts we have created some materialized views as per the report
requirements. Finally, the reports pull the data directly from the MVs. The performance of these
reports/dashboards is always good because we are not doing any calculation at the
reporting level. These dashboards/reports can be used for analysis purposes, for example:
How many RFQs were created, how many RFQs were approved, and how many RFQs got responses
from the supply channels?
What is the budget?
What budget was approved?
Who is the approval manager, with whom is it pending, and what is the feedback of the supply
channels from the past, etc.?
They don't have a BI design, so they are using a manual process to achieve the above by
exporting Excel sheets; with these dashboards we can do drill-up and drill-down and get all the
detailed reports with charts.

Prepared By
A.Bhaskar Reddy
Email:abreddy2003@gmail.com
91-9948047694
