A data warehouse is a collection of transactional data and historical data, maintained in the DWH for analysis purposes.
There are 3 types of tools that should be maintained in any data warehousing project:
1. ETL tools
2. OLAP tools (or reporting tools)
3. Modeling tools
ETL TOOLS:
ETL stands for Extraction, Transformation, and Loading. An ETL developer (an expert in DWH) extracts data from heterogeneous databases or flat files, transforms the data from source to target (the DWH) by applying transformation rules, and finally loads the data into the DWH.
There are several ETL tools available in the market. They are:
1. DataStage
2. Informatica PowerCenter
3. Ab Initio
4. Oracle Warehouse Builder
5. BODI (Business Objects Data Integrator)
6. SSIS (Microsoft SQL Server Integration Services)
7. Pentaho Kettle
8. Talend
9. Inaplex Inaport
OLAP:
OLAP stands for Online Analytical Processing; these tools are also called reporting tools.
An OLAP developer analyzes the data warehouse and generates reports based on selection criteria.
There are several OLAP tools available:
1. Business Objects
2. Cognos
3. ReportNet
4. SAS
5. MicroStrategy
6. Hyperion
7. SSAS (Microsoft SQL Server Analysis Services)
MODELING TOOL:
Those who work with the ERwin tool are called data modelers. A data modeler designs the database of the DWH with the help of such tools.
An ETL developer extracts data from source databases or flat files (.txt, .csv, .xls, etc.) and populates it into the DWH. While populating data into the DWH, staging areas may be maintained between source and target; these are called staging area 1 and staging area 2.
STAGING AREA:
A staging area is a temporary place which is used for cleansing unnecessary, unwanted, or inconsistent data.
ER Modeling:
ER modeling stands for entity-relationship modeling. In this model we always call the tables entities, and they may be in second normal form, third normal form, or between 2NF and 3NF.
Dimensional Modeling:
In this model tables are called dimensions or fact tables. It can be subdivided into three schemas:
1. Star schema
2. Snowflake schema
3. Multi-star schema (or hybrid or galaxy schema)
Star Schema:
A fact table surrounded by dimensions is called a star schema; it looks like a star.
In a star schema, if there is only one fact table, it is called a simple star schema.
In a star schema, if there is more than one fact table, it is called a complex star schema.
Sales Fact table:
Sale_id
Customer_id
Product_id
Account_id
Time_id
Promotion_id
Sales_per_day
Profit_per_day
Account Dimension:
Account_id
Account_type
Account_holder_name
Account_open_date
Account_nominee
Account_open_balance
Promotion:
Promotion_id
Promotion_type
Promotion_date
Promotion_designation
Promotion_Area
Product:
Product_id
Product_name
Product_type
Product_desc
Product_version
Product_startdate
Product_expdate
Product_maxprice
Product_wholeprice
Customer:
Cust_id
Cust_name
Cust_type
Cust_address
Cust_phone
Cust_nationality
Cust_gender
Cust_father_name
Cust_middle_name
Time:
Time_id
Time_zone
Time_format
Month_day
Week_day
Year_day
Week_year
DIMENSION TABLE:
If a table contains a primary key and provides detailed information (master information) about an entity, it is called a dimension table.
FACT TABLE:
If a table contains multiple foreign keys, holds transactions, and provides summarized information, such a table is called a fact table.
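To make the fact/dimension distinction concrete, here is a minimal sketch using SQLite. The table and column names are simplified from the sales schema above (only two columns per table are kept), so treat it as an illustration, not real warehouse DDL.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Dimension table: a primary key plus descriptive (master) attributes.
cur.execute("CREATE TABLE product (product_id INTEGER PRIMARY KEY, product_name TEXT)")
# Fact table: foreign keys to dimensions plus summarizable measures.
cur.execute("""CREATE TABLE sales_fact (
    sale_id INTEGER PRIMARY KEY,
    product_id INTEGER REFERENCES product(product_id),
    sales_per_day REAL)""")

cur.executemany("INSERT INTO product VALUES (?, ?)",
                [(1, "Soap"), (2, "Shampoo")])
cur.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)",
                [(10, 1, 100.0), (11, 1, 50.0), (12, 2, 75.0)])

# A typical star-schema query: join the fact to a dimension and summarize.
rows = cur.execute("""SELECT p.product_name, SUM(f.sales_per_day)
                      FROM sales_fact f JOIN product p USING (product_id)
                      GROUP BY p.product_name
                      ORDER BY p.product_name""").fetchall()
print(rows)  # [('Shampoo', 75.0), ('Soap', 150.0)]
```

The report never stores the product name in the fact table; it reaches it through the foreign key, which is exactly the star-schema division of labor.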
DIMENSION TYPES:
There are several dimension types available.
CONFORMED DIMENSION:
If a dimension table is shared with more than one fact table (i.e., it has a foreign key in more than one fact table), then that dimension table is called a conformed dimension.
DEGENERATED DIMENSION:
If a fact table acts as a dimension and is shared with another fact table (i.e., maintains a foreign key in another fact table), such a table is called a degenerated dimension.
JUNK DIMENSION:
A junk dimension contains text values, genders (male/female), and flag values (true/false) which are not useful for generating reports. Such a dimension is called a junk dimension.
DIRTY DIMENSION:
If a record occurs more than once in a table, differing only in non-key attributes, such a table is called a dirty dimension.
ADDITIVE FACTS:
If it is possible to add (summarize) a fact in the fact table across all dimensions, we call that fact an additive fact.
If it is possible to add a fact only up to some extent (across only some dimensions), we call it a semi-additive fact.
DIFFERENCE BETWEEN STAR SCHEMA AND SNOW FLAKE SCHEMA:
In the event that databases are altered or new databases need to be integrated, a lot of
“hand-coded” work needs to be completely redone.
Different ETL Tools:
There are several ETL tools available in the market. They are:
1. DataStage
2. Informatica PowerCenter
3. Pentaho Kettle
4. Talend
5. Inaplex Inaport
6. Ab Initio
7. Oracle Warehouse Builder
8. BODI (Business Objects Data Integrator)
9. SSIS (Microsoft SQL Server Integration Services)
1. Data stage:
DataStage is a comprehensive ETL tool; in other words, DataStage is a data integration and transformation tool which enables the collection and consolidation of data from several sources, its transformation, and its delivery into one or multiple target systems.
Its history begins in 1997, when the first version of DataStage was released by VMark, a US-based company.
Lee Scheffler is regarded as the father of DataStage.
In those days DataStage was called Data Integrator.
In 1997 Data Integrator was acquired by a company called Torrent.
In 1999 the Informix company acquired Data Integrator from Torrent.
In 2000 the Ascential company acquired Data Integrator, and after that Ascential released the DataStage Server Edition.
Versions 6.0 through 7.5.1 supported only UNIX-flavor environments, because the server could be configured only on UNIX platforms.
In 2004, version 7.5.x2 was released, which supported server configuration on the Windows platform as well.
In December 2004, version 7.5.x2 shipped with the Ascential suite components:
Profile Stage,
Quality Stage,
Audit Stage,
Meta Stage,
DataStage PX,
DataStage TX,
DataStage MVS.
These are all individual tools.
In February 2005, IBM acquired all the Ascential suite components, and IBM released IBM DS EE, i.e., Enterprise Edition.
In 2006, IBM made some changes to IBM DS EE: the Profile Stage and Audit Stage were integrated into one, along with Quality Stage, Meta Stage, and DataStage PX, as IBM WebSphere DS & QS 8.0.
In 2009, IBM released another version, IBM InfoSphere DS & QS 8.1.
Pentaho Kettle:
Pentaho is a commercial open-source BI suite that has a product called Kettle for data
integration.
It uses an innovative metadata-driven approach and has a strong and very easy-to-use GUI.
The company started around 2001.
It has a strong community of 13,500 registered users.
It uses a stand-alone Java engine that processes the tasks for moving data between many different databases and files.
Talend:
Inaplex:
Features of DataStage:
There are 5 important features of DataStage. They are:
- Any to any,
- Platform independent,
- Node configuration,
- Partition parallelism, and
- Pipeline parallelism.
Any to Any:
DataStage can extract data from any source and can load data into any target.
Platform Independent:
A job that can run on any processor is called platform independent.
DataStage jobs can run on 3 types of processors:
1. Uniprocessor
2. Symmetric Multiprocessing (SMP), and
3. Massively Parallel Processing (MPP).
Node Configuration:
A node is a logical CPU, i.e., an instance of a physical CPU.
The process of creating virtual CPUs is called node configuration.
Example:
An ETL job needs to process 1,000 records.
On a uniprocessor it takes 10 minutes to process the 1,000 records,
but on an SMP system the same job takes 2.5 minutes.
Server jobs:
1. DataStage server jobs do not run on multiple nodes.
2. DataStage server jobs do not support parallelism (round robin, hash, modulus, etc.).
3. The transformer in server jobs compiles in the BASIC language.
4. DataStage server jobs run on only one node.
5. DataStage server jobs run on the UNIX platform.
6. A major difference is at the job architecture level: server jobs process in sequence, one stage after another.
Configuration File:
What is a configuration file, and what is its use in DataStage?
It is a normal text file. It holds information about the processing and storage resources that are available for use during parallel job execution.
Fastname: the server name; ETL jobs are executed using this name.
Resource disk: a permanent storage area which stores all repository components.
Resource scratchdisk: a temporary storage area where staging operations are performed.
Configuration file:
Example:
{
	node "node1"
	{
		fastname "abc"
		pools ""
		resource disk "/opt/IBM/InformationServer/Server/Datasets" {pools ""}
		resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch" {pools ""}
	}
	node "node2"
	{
		fastname "abc"
		pools ""
		resource disk "/opt/IBM/InformationServer/Server/Datasets" {pools ""}
		resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch" {pools ""}
	}
}
Note:
In a configuration file, no two nodes may have the same name.
The default node pool is "".
At least one node must belong to the default node pool, which has a name of "" (a zero-length string).
Pipeline parallelism:
A pipe is a channel through which data moves from one stage to another. With pipeline parallelism, a downstream stage can start processing rows as soon as an upstream stage produces them, instead of waiting for the whole data set.
Partition Parallelism:
Partitioning is a technique of dividing the data into chunks.
DataStage supports 8 types of partitioning.
Partitioning plays an important role in DataStage: every stage in DataStage is associated with a default partitioning technique, and the default partitioning technique is Auto.
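The idea behind key-based partitioning can be sketched in a few lines of Python. This is only an illustration of the hash/modulus principle (equal key values always land on the same partition), not DataStage's actual partitioner; the row data is made up.

```python
# Distribute rows across nodes by hashing a key column, the way a
# hash/modulus partitioner would.
def partition(rows, key, num_nodes):
    parts = [[] for _ in range(num_nodes)]
    for row in rows:
        # Same key value -> same hash -> same partition, every time.
        parts[hash(row[key]) % num_nodes].append(row)
    return parts

rows = [{"deptno": d, "empno": e} for e, d in
        [(111, 10), (222, 20), (333, 10), (444, 40)]]
parts = partition(rows, "deptno", 3)

# Every row is assigned to exactly one partition...
assert sum(len(p) for p in parts) == len(rows)
# ...and rows sharing a key (deptno 10) share a partition.
p10 = [i for i, p in enumerate(parts) for r in p if r["deptno"] == 10]
assert p10[0] == p10[1]
```

Because rows with the same key are guaranteed to be on the same node, key-dependent stages (joins, aggregations, remove-duplicates) can then run on each partition independently.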
Note:
Selection of a partitioning technique is based on:
1. Data (volume, type)
2. Stage
3. Number of key columns
4. Key column data type
5. Same
Client Components:
The client components are classified into:
DataStage Administrator
DataStage Manager
DataStage Director
DataStage Designer
The DataStage Director can validate jobs,
run jobs,
monitor jobs,
schedule jobs,
and view job logs.
8.0:
1. Five client components (DS Designer, DS Director, Information Analyzer, DS Administrator, Web Console)
2. Architecture components:
Common User Interface
Common Repository
Common Engine
Common Connectivity
Common Shared Services
3. N-tier architecture
4. OS independent with respect to users, but one-time dependent
5. Capable of all phases
6. Web-based administration through the Web Console
7. Database-based repository
It is used to create jobs, compile them, run them, and do multiple-job compiles.
It can handle four types of jobs:
1. Server jobs
2. Parallel jobs
3. Job sequences
4. Mainframe jobs
Server Components:
We have 3 server components.
3. Package Installer:
The Package Installer has packs and plug-ins.
It can handle five types of jobs:
1. Server jobs
2. Parallel jobs
3. Job sequences
4. Mainframe jobs
5. Data quality jobs
Web Console:
Through the Administrator component the following tasks can be performed:
1. Security services
2. Scheduling services
3. Logging services
4. Reporting services
5. Domain management
6. Session management
Information Analyzer:
It is also a console for the IBM InfoSphere Information Server.
It performs all activities of Phase 1:
1. Column analysis
2. Baseline analysis
3. Primary key analysis
4. Foreign key analysis
5. Cross-domain analysis
3. DataStage Designer
4. DataStage Director
5. DataStage Administrator
2. Common Repository
3. Common Engine:
It is responsible for the following:
Data profiling analysis
Data quality analysis
Data transformation analysis
4. Common Connectivity:
It provides the common connections to the Common Repository.
Stage enhancements and newly introduced stages, compared between versions 7.5x2 and 8.0.1:

Stage Category   | Stage Name                       | In 7.5x2                                 | In 8.0.1
Processing stage | SCD (Slowly Changing Dimension)  | Not available                            | Available
Processing stage | FTP (File Transfer Protocol)     | Not available                            | Available
Processing stage | WTX (WebSphere TX)               | Not available                            | Available
Processing stage | Surrogate Key                    | Available                                | Available (enhancements done)
Processing stage | Lookup                           | Available (normal lookup, sparse lookup) | Available (range lookup, case lookup)
Database stage   | iWay                             | Not available                            | Available
Database stage   | Classic Federation               | Not available                            | Available
Database stage   | ODBC Connector                   | Not available                            | Available
Database stage   | Netezza                          | Not available                            | Available
Database stage   | SQL Builder                      | Not available                            | Available (all stages' techniques with respect to the Builder)

Note: enhancements have been made to the database stages and the processing stages.
Its title bar reads: IBM InfoSphere DataStage and QualityStage Designer.
Menu bar: File, Edit, View, Repository, Diagram, Import, Export, Tools, Window, Help
File Stages:
----------------
Sequential file stage:
===============
The sequential file stage is a file stage which is used to read data sequentially or in parallel.
If there is 1 file, it reads the data sequentially.
If there are N files, it reads the data in parallel.
The sequential file stage supports 1 input link, 1 output link, and 1 reject link.
To read the data, we have read methods. The read methods are:
a) Specific file(s)
b) File pattern
Specific file is for a particular file, and file pattern is used with wildcards.
In the error (reject) mode we have:
Continue,
Fail, and
Output.
If you select Continue: on any data type mismatch, the rest of the data is still sent to the target.
If you select Fail: the job aborts on any data type mismatch.
Output: the mismatched data is sent to a rejected-data file.
The error data we get are:
data type mismatches,
format mismatches, and
condition mismatches.
We also have the option
Missing File Mode. In this option
we have three sub-options:
Depends
Error
OK
(that is, how to handle the case where a file is missing)
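The three reject-mode behaviours (Continue, Fail, Output) can be mimicked in plain Python. This sketch is only an analogy for the option semantics, with a made-up "name,age" record layout, not the stage's real implementation.

```python
def read_rows(lines, reject_mode="continue"):
    """Parse 'name,age' lines; route rows whose age is not an integer
    according to the reject mode, like the sequential file stage."""
    good, rejects = [], []
    for line in lines:
        name, age = line.split(",")
        try:
            good.append((name, int(age)))   # data type matches
        except ValueError:                  # data type mismatch
            if reject_mode == "fail":
                raise                       # Fail: abort the whole job
            if reject_mode == "output":
                rejects.append(line)        # Output: send to reject link
            # Continue: drop the bad row, keep loading the rest
    return good, rejects

data = ["ram,30", "ravi,abc", "sita,25"]
print(read_rows(data, "continue"))  # ([('ram', 30), ('sita', 25)], [])
print(read_rows(data, "output"))    # ([('ram', 30), ('sita', 25)], ['ravi,abc'])
```

With `reject_mode="fail"` the first bad row raises, which is the analogue of the job aborting.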
Different Options usage in Sequential file:
-----------------------------------------------------
If Read Method = File Pattern, the job executes in file-pattern mode.
Note: if we choose Read Method = Specific File, the stage asks for an input file path.
If we select Read Method = File Pattern, it asks for a file pattern.
Note 1: if Reject Mode = Output, then you must provide an output reject link for the rejected records; otherwise the stage gives an error.
Note 2: File Name Column is an option; if we select it, we have to create one extra column in the extended column properties, and the output file then gets an extra column containing each input record's file path.
Note 3: Row Number Column is an option; if we select it, we likewise have to create one extra column in the extended column properties, and the output file then gets an extra column containing each input record's row number.
Sequential file options:
Filter
File Name Column
Row Number Column
Read First Rows
Null Field Value
1. Filter option
Sed command:
--------------------
sed is a stream editor for filtering and transforming text from standard input to standard output.
Grep commands:
-----------------------
Syntax:
grep "string"
Ex: grep "bhaskar"
1) grep -v "string"   Ex: grep -v "bhaskar"   (displays every line except those matching 'bhaskar')
2) grep -i "string"   Ex: grep -i "bhaskar"   (ignores case)
Example Job for Sequential File:
Requirement: extract EMP data from a text file and load it into a text file by using the Sequential File stage.
Job:
Importing a table definition from a sequential file:
Right-click on sequential_file_0, or double-click on sequential_file_0.
Click on the Load tab at the bottom.
It will show a window like this.
Now in the file list select the file emp1.txt.
Tick the check box "First line is column names" and click on Define.
Now click OK and then Close; the file emp1.txt will now appear in the table definition list.
Now select emp1.txt in the table definition list and click OK.
Click OK, and again click OK. This is the procedure for importing a table definition.
2) Example Job for Sequential File:
Requirement: extract EMP data from text files and load it into a text file by using the Sequential File stage.
Here Read Method = File Pattern.
Input Sequential_File_0
Properties:
Emp1.txt
Emp2.txt
Job:
Output Sequential_File_1 properties:
3) Example Job for Sequential File:
Requirement: extract EMP data from a text file and load it into a text file by using the Sequential File stage.
Job:
Output sequential_File_01 Properties:
Output data:
Input file data:
Job:
Input Properties:
Columns:
Output Properties:
Outputdata:
6) Example Job for Sequential File:
Requirement: extract EMP data from a text file and load it into a text file by using the Sequential File stage.
Job:
InputProperties:
Columns:
Input Data:
Output Properties:
OutputData:
7) Example Job for Sequential File:
Requirement: extract EMP data from a text file and load it into a text file by using the Sequential File stage.
Job:
Input properties:
Input Columns:
Output properties:
Output Data:
Job:
Input sequential file properties:
Columns:
Output sequential file properties:
9) Example Job for Sequential File:
Requirement: extract EMP data from a text file and load it into a text file by using the Sequential File stage.
Grep options:
1) grep "string"      Ex: grep "bhaskar"
2) grep -v "string"   Ex: grep -v "bhaskar"
3) grep -i "string"   Ex: grep -i "bhaskar"
Source file data:
Job:
Columns:
Output Data:
Source file data:
Job:
Columns:
Output seqfile stage properties:
Usage of parameters for the sequential file stage input file path
Example:
-------------
File Stages:
-----------------
Data set:
------------
The dataset is a parallel-processing file stage which is used for staging the data when we design dependent jobs.
By default the dataset is a parallel processing stage.
Datasets are stored in binary format.
If we use datasets in jobs, the data is stored inside DataStage, that is, inside the repository.
Datasets overcome the limitations of the sequential file.
The limitations of sequential files are:
1) Memory limitation (a sequential file can store only up to 2 GB)
2) Sequential (by default it is sequential)
3) Conversion overhead (every time we run the job, the data has to be converted from one format to another)
4) It stores the data outside DataStage (whereas a dataset stores the data inside DataStage)
The alias names of datasets are:
a) orchestrate files
b) operating system files
1) The descriptor file contains the schema details and the address of the data.
It stores the data at a path such as C:/Data/file.ds.
Descriptor file:
It holds the information about the address and about the structure.
Data file: holds the data in its native form.
Control and header files: these files operate at the OS level to control the descriptor and data files.
EXAMPLE JOBS FOR DATA SET STAGE:
Example Job on DATA SET:
Source file data:
JOB:
Note: the target dataset file extension is .ds
It will show the list of files and datasets.
You can view the record schema by clicking on the table definition icon,
and you can see the data by clicking on the data viewer option:
Note: by using dataset management we can open a dataset, show the schema window, show the data window, copy a dataset, and delete a dataset.
It is used to extract data from flat files; it is never used to extract data from client flat files.
Development and Debugging Stages:
Row Generator stage:
JOB:
Click on Columns,
then double-click on row no. 1; it will show the screen below.
For the empno field select Type = Cycle, Initial value = 1000, Increment = 1, and then click Next.
Again it will show the screen below; for the Name field set Algorithm = Cycle and give the values RafelNadal, JamesBlake, Andderadick.
Similarly click Next; it will show the window for the hiredate field; set Type = Random.
For the Name field:
For the hiredate field:
Click on View Data:
It will show the screen here. In this job I defined 3 parameters: 1) the number of rows, 2) the input file path, and 3) the input file name.
2. Column Generator stage:
1.The Column Generator stage is a Development/Debug stage
2. It can have a single input link and a single output link.
3. The Column Generator stage adds columns to incoming data and generates mock data
for these columns for each data row processed.
The stage editor has three pages:
Stage Page. This is always present and is used to specify general information about
the stage.
Input Page. This is where you specify details about the input link.
Output Page. This is where you specify details about the generated data being output
from the stage.
Example Job For Column generator:
Input seqEmpData:
JOB:
seqEmpData properties:
Column Generator Properties:
Click on Output and click on Columns:
here you need to give the column name 'salary' in the extended column properties:
Output columns:
Output:
Head Stage:
1.The Head Stage is a Development/Debug stage
2. It can have a single input link and a single output link
3. The Head Stage selects the first N rows from each partition of an input data set and
copies the selected rows to an output data set. You determine which rows are copied by
setting properties which allow you to specify:
The number of rows to copy
The partition from which the rows are copied
The location of the rows to copy
The number of rows to skip before the copying operation begins
4.This stage is helpful in testing and debugging applications with large data sets. For
example, the Partition property lets you see data from a single partition to determine if
the data is being partitioned as you want it to be. The Skip property lets you access a
certain portion of a data set.
The stage editor has three pages:
Stage Page. This is always present and is used to specify general information about
the stage.
Input Page. This is where you specify the details about the single input set from
which you are selecting records.
Output Page. This is where you specify details about the processed data being output
from the stage.
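What the Head stage does per partition is easy to picture in Python. This sketch assumes the data has already been partitioned (here, two made-up partitions of integers) and simply applies the Rows and Skip properties to each partition.

```python
def head(partitions, n=10, skip=0):
    # Select the first n rows from each partition, optionally skipping
    # some rows first, like the Head stage's Rows/Skip properties.
    return [part[skip:skip + n] for part in partitions]

partitions = [list(range(0, 100)), list(range(100, 200))]
print(head(partitions, n=3))          # [[0, 1, 2], [100, 101, 102]]
print(head(partitions, n=2, skip=5))  # [[5, 6], [105, 106]]
```

Note that the result is per partition: with 2 partitions and n = 3 you get 6 rows in total, which is exactly why the stage is useful for checking how data was partitioned.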
Example Job for Head Stage:
Input SeqEmpData:
Output seqEmpdata:
JOB:
InputseqEmpData Properties:
Case-1:
Head Stage Output columns:
Target Output_Sequentialdata:
Example Job for Head Stage:
Input SeqEmpData:
OutputSequential data:
Job:
Input SeqEmpData:
Head stage output columns:
Target OutputSequentialData:
Target Output_Sequential_data:
Tail Stage:
1.The Tail Stage is a Development/Debug stage
2. It can have a single input link and a single output link
3. The Tail Stage selects the last N records from each partition of an input data set and
copies the selected records to an output data set. You determine which records are copied
by setting properties which allow you to specify:
The number of records to copy
The partition from which the records are copied
4.This stage is helpful in testing and debugging applications with large data sets. For
example, the Partition property lets you see data from a single partition to determine if
the data is being partitioned as you want it to be. The Skip property lets you access a
certain portion of a data set
The stage editor has three pages:
Stage Page. This is always present and is used to specify general information about
the stage.
Input Page. This is where you specify the details about the single input set from
which you are selecting records.
Output Page. This is where you specify details about the processed data being output
from the stage
Rows:
Number of Rows (per partition) = 10 (the default is 10; if we need more or fewer than 10, we have to change the number).
This is the number of rows to copy from input to output per partition.
Partitions:
All Partitions = True/False
When set to True, it copies rows from all partitions. When set to False, it copies from specific partition numbers, which must be specified.
Example Job for Tail Stage:
Input SeqEmpData:
JOB:
Tailstage Properties:
Output Columns:
Target Output_Sequentialdata:
Sample Stage:
1. The Sample stage is a Development/Debug stage.
2. It can have a single input link and any number of output links when operating in Percent mode,
3. and a single input link and a single output link when operating in Period mode.
4.The Sample stage samples an input data set. It operates in two modes. In Percent mode,
it extracts rows, selecting them by means of a random number generator, and writes a
given percentage of these to each output data set. You specify the number of output data
sets, the percentage written to each, and a seed value to start the random number
generator. You can reproduce a given distribution by repeating the same number of
outputs, the percentage, and the seed value
5.In Period mode, it extracts every Nth row from each partition, where N is the period,
which you supply. In this case all rows will be output to a single data set, so the stage
used in this mode can only have a single output link
6.For both modes you can specify the maximum number of rows that you want to sample
from each partition.
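Both sampling modes can be sketched in a few lines of Python. This is an illustration of the Period and Percent semantics only (DataStage's actual random number generator is its own); the row values are made up.

```python
import random

def sample_period(rows, n):
    # Period mode: take every Nth row (rows n, 2n, 3n, ... are selected).
    return rows[n - 1::n]

def sample_percent(rows, percent, seed):
    # Percent mode: a seeded random number generator decides which rows
    # are written, so the same seed reproduces the same distribution.
    rng = random.Random(seed)
    return [r for r in rows if rng.random() * 100 < percent]

rows = list(range(1, 10))        # 9 input rows
print(sample_period(rows, 3))    # [3, 6, 9]
a = sample_percent(rows, 40, seed=7)
b = sample_percent(rows, 40, seed=7)
assert a == b                    # same seed -> same sample
```

The assertion mirrors the point made above: repeating the same percentage and seed value reproduces the same distribution.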
The stage editor has three pages:
Stage Page. This is always present and is used to specify general information about
the stage.
Input Page. This is where you specify details about the data set being Sampled.
Output Page. This is where you specify details about the Sampled data being output
from the stage.
Note: the Sample stage can operate in two modes: Period mode and Percent mode.
Input data:
JOB:
Sample stage properties:
Output Columns:
Target Seqfile properties:
Output data:
Note: here the output has only 3 records because we set Period (per partition) = 3, so the stage takes every 3rd record from the input file data.
Job:
Input sequential file stage properties:
Output columns for Output1:
Sample stage link ordering:
Output data for outdat1 seqfile:
Output2 seqfile data:
Peek Stage:
1. Rows:
If this option is set to True, the Number of Records (per partition) = 10 option does not appear; only when it is set to False does the Number of Records (per partition) = 10 option appear.
2.Columns
Peek all input columns=True/False
When set to True prints all column values. When set to False, prints specific column
values, which must be specified.
3.Partitions
All Partitions=True/False
When set to True prints rows from all partitions. When set to False, prints from specific
partition numbers, which must be specified.
4.Options
Peek Records output mode=Joblog/output
Job log = print output to log file; Output = write to second output link of stage
Show column names=True/False
Set True to print the column name, followed by a colon, followed by the value; otherwise
prints only the value, followed by a space.
Example Job for Peek Stage:
Inputdata:
JOB:
Peek stage properties:
Output seqfile properties:
Option outputmode=Joblog:
Job:
Input seqfile properties:
peek stage properties:
Here we set the option Peek Output Mode = Job Log, so we can see the data only in the logs.
Procedure to see the data in the logs:
Go to Tools and run the Director; now click on View Log and it will show a screen like the one below.
In that screen, click the 8th row from the bottom; it will show the log details.
Example Job for Peek Stage
Inputdata:
JOB:
Input seqfile data:
Peek stage properties:
peek output1 mappings:
peekoutput2 mappings:
peekoutput3 columns:
Peekoutput3 mappings:
peekout1 properties:
peekoutput1 data:
Peekoutput2 properties:
Peekoutput2 data:
Peekoutput2 properties:
Peekoutput3 data:
Peekoutput3 properties:
(
"Field1" datatype(size),
"Field2" datatype(size),
"Field3" datatype(size),
"Field4" datatype(size),
"Field5" datatype(size),
"Field6" datatype(size),
"Field7" datatype(size),
"Field8" datatype(size),
"Field9" datatype(size),
"Field10" datatype(size)
)
GRANT SELECT, INSERT, UPDATE, DELETE ON TABLE "Schemaname.Tablename"
TO Group "Groupname";
Before subroutine: ExecSh
Input value:
/Data/Script/Deletetab.sh #Servername# #UserId# #Password# #schemaname.TableName#
Step 3: When we trigger the job, a prompt displays the default configuration file; change the file name to "2Nodeconfigfile" and the job will now run on the 2-node configuration file.
Differences between Filter, Switch, and External Filter
Filter:
1. The condition can be put on multiple columns.
2. It has 1 input link, N output links, and 1 reject link.
Switch:
1. The condition can be put on a single column.
2. It has 1 input link, up to 128 output links, and 1 default/reject link.
External Filter:
1. Here we can use all UNIX filter commands.
2. It has 1 input link, 1 output link, and no reject link.
Connections
DSN=EXE
Password=OS system password
User=OS system user name
Columns
Load
Import ODBC Table Definition
DSN: here select the workbook
User ID and Password
Operating System
Add in ODBC
MS EXCEL Drivers
Name=EXE (DSN)
Click On output
Properties for Outputname=Emp
Columns:
Selection:
SQL:
View Data:
Columns:
SQL:
View Data:
Emp_Dataset Properties:
View data:
Dept_Data_set:Properties:
View data:
ENCODE STAGE:
1. It is a processing stage that encodes the records into a single format with the support of a command line.
2. It supports 1 input and 1 output.
Properties:
Stage
Options: Command Line = compress/gzip
Input
Output
Load the metadata of the source files.
DECODE STAGE:
1. It is a processing stage that decodes the encoded data.
2. It supports 1 input and 1 output.
Properties:
Stage
Options: Command Line = uncompress/gunzip
Output
Load the metadata of the source file.
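The Encode/Decode pair behaves like a compress-then-uncompress round trip. As an analogy only, here is the same idea using Python's gzip module (the real stages invoke whatever command line you configure, such as compress or gzip); the record bytes are made up.

```python
import gzip

records = b"111,bhaskar,analyst\n222,prabhakar,clerk\n"

# Encode stage: pack the records into a single compressed format.
encoded = gzip.compress(records)

# Decode stage: restore the original records from the encoded data.
decoded = gzip.decompress(encoded)

assert decoded == records          # the round trip is lossless
print(len(records), "->", len(encoded), "bytes")
```

The key property the assertion checks is that decoding recovers the records exactly, which is why the two stages are always used as a pair.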
Filter Stage:
1.The Filter stage is a processing stage.
2.It can have a single input link, any number of output links and, optionally, a single
reject link.
3.The Filter stage transfers, unmodified, the records of the input data set which satisfy the
specified requirements and filters out all other records.
4.You can specify different requirements to route rows down different output links. The
filtered out records can be routed to a reject link, if required.
Predicates
Where clause
Options
Output Rejects = False/True
Set to True to output rejected rows to the reject link.
Example Job for Filter stage:
JOB:
Emp Properties:
Copy1 output :
Outputname=Emp_Copy
Output name=Emp_Copy_all
Out_Emp_Copyall_Dataset Properties:
outputdata:
Filter_3properties:
Filter_3 Output Mappings:
OutputName=OutputSalg1andsall3:
Output_Deptno10 PROPERTIES:
OUTPUT DATA:
Data_SET5 properties:
OUTPUT DATA:
Filter_10 PROPERTIES:
Output Mappings:Dslink15
Output columns:
output name=dslink13:
Dataset_14 PROPERTIES:
output
Datbase_12 Properties:
Output:
Switch stage:
1.The Switch stage is a processing stage.
2.It can have a single input link, up to 128 output links and a single rejects link.
The Switch stage takes a single data set as input and assigns each input row to an output
data set based on the value of a selector field.
3.The Switch stage performs an operation analogous to a C switch statement, which
causes the flow of control in a C program to branch to one of several cases based on the
value of a selector variable.
4.Rows that satisfy none of the cases are output on the rejects link.
1.Input
Selector=Column Name
1.Auto can be used when there is as many distinct selector values as output links.
2.Hash means that rows are hashed on the selector column modulo the number of output
links and assigned to an output link accordingly. In this case, the selector column must be
of a type that is convertible to Unsigned Integer and may not be nullable.
3.User-defined Mapping means that the onus is on the user to provide explicit mapping
for values to outputs
Number>]. The link label number is not needed if the value is intended for the same output link as specified by the previous mapping. You must specify an individual mapping for each value of the selector column that you want to direct to one of the output links; thus this property is repeated as many times as necessary to specify the complete mapping.
3.Options
If Not Found = Fail/Drop/Output
Fail means that an invalid selector value causes the job to fail; Drop drops the offending
row; Output sends it to a reject link.
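The routing described above can be sketched in Python. This is a simplified in-memory model of the Switch stage, not DataStage code; the selector column, link names, and sample rows are invented for illustration.

```python
# Simplified model of the Switch stage: each row is routed to an output
# link based on the value of a selector column. The "If Not Found"
# behaviour mirrors the Fail/Drop/Output options described above.

def switch_stage(rows, selector, mapping, if_not_found="Output"):
    outputs = {link: [] for link in mapping.values()}
    rejects = []
    for row in rows:
        link = mapping.get(row[selector])
        if link is not None:
            outputs[link].append(row)
        elif if_not_found == "Fail":
            raise ValueError(f"invalid selector value: {row[selector]!r}")
        elif if_not_found == "Drop":
            continue                      # silently drop the row
        else:                             # "Output": send to the reject link
            rejects.append(row)
    return outputs, rejects

rows = [{"deptno": 10}, {"deptno": 20}, {"deptno": 99}]
outs, rej = switch_stage(rows, "deptno", {10: "link0", 20: "link1"})
# deptno 10 and 20 are routed to their links; 99 goes to the reject link
```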
Oracle_Enterprise_0 properties:
Outputmappings:
Outputname=T1;
Output_Dataset_2 properties::
Output
Outputname=T2;
output:
Outputname=T3;
Output:
External Filter Properties options
1.Options
Arguments=?
Type: String
Any command-line arguments required.
Filter Command=?
Type: Pathname
Program or command to execute, which must be configured to accept input from stdin
and write its results to stdout.
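The Filter Command simply has to behave like a UNIX filter: read rows on stdin and write the surviving rows to stdout, so a standard tool such as `grep` works. As a sketch, the same pipeline can be tested outside DataStage (the sample rows and pattern are invented):

```python
# The External Filter stage pipes each row through a command that reads
# stdin and writes stdout. Here we test the equivalent of
# Filter Command = grep, Arguments = ^10,  using subprocess.
import subprocess

rows = "10,SALES\n20,HR\n10,ACCOUNTS\n"
result = subprocess.run(["grep", "^10,"], input=rows,
                        capture_output=True, text=True)
print(result.stdout)      # only the rows starting with "10," survive
```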
Input data:
Job:
Sequential_File_7 Properties:
Columns:
External Filter properties:
Output columns:
Here we need to give the column names manually at output columns
Target data set_9: Properties:
OUTPUT:
JOB:
Copy_1 STAGE properties:
External_Filter_3 Output columns:
output:
Output Columns:
Dataset_8 Properties:
OUTPUT:
JOIN Queries:
Join is a query which combines the data from multiple tables
Types of joins:
1. Cartesian join
2. Equi join
3. Non equi join
4.Self join or inner join
5.Outer join
Left outer join
Right outer join
EMPNO ENAME JOB MGR DEPTNO
---------- ---------- ---------- ---------- ----------
111 bhaskar analyst 444 10
222 prabhakar clerk 333 20
333 pradeep manager 111 10
444 srujana engineer 222 40
Department Table Data
SQL> select * from dept;
Examples:
Cartesian join:
If we combine data from multiple tables without applying any join condition, then each record in the first table joins with every record in the second table.
SQL> select * from emp,dept;
SQL> select empno,ename,job,dname,loc from emp e,dept d;
Equi join:
If we combine data from multiple tables by applying an equality condition on the join columns, then each record in the first table joins with the matching row in the second table.
This kind of join is called an equi join.
Inner join:
This will display all the records that have matched.
Ex:
SQL> select empno,ename,job,dname,loc from emp inner join dept using(deptno);
Left outer join:
This will display all matching records plus the records in the left-hand side table that have no match in the right-hand side table.
Ex:
SQL> select empno,ename,job,dname,loc from emp e left outer join dept d
on(e.deptno=d.deptno);
Or
SQL> select empno,ename,job,dname,loc from emp e,dept d where
e.deptno=d.deptno(+);
Right outer join:
This will display all matching records plus the records in the right-hand side table that have no match in the left-hand side table.
Ex:
SQL> select empno,ename,job,dname,loc from emp e right outer join dept d
on(e.deptno=d.deptno);
Or
SQL> select empno,ename,job,dname,loc from emp e,dept d where e.deptno(+) =
d.deptno;
Join Stage:
These topics describe Join stages, which are used to join data from two input tables and
produce one output table. You can use the Join stage to perform inner joins, outer joins, or
full joins.
1.An inner join returns only those rows that have matching column values in both
input tables. The unmatched rows are discarded.
2.An outer join returns all rows from the outer table even if there are no matches. You
define which of the two tables is the outer table.
3.A full join returns all rows that match the join condition, plus the unmatched rows
from both input tables.
Unmatched rows returned in outer joins or full joins have NULL values in the columns of
the other link
1.Join stages have two input links and one output link.
2. The two input links must come from source stages. The joined data can be output to
another processing stage or a passive stage
1.Join Keys
Key=?
Type: Input Column
Name of input column you want to join on. Columns with the same name must appear
in both input data sets and have compatible data types.
Case sensitive=True/False
Type: List
Whether this join column is case sensitive or not.
2.Options
Join type= F,I,L,R
F->Full outer join
I->Inner join
L->Left outer join
R->Right outer join
Type: List
Type of join operation to perform.
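The four join types (F/I/L/R) described above can be sketched in Python. This is a simplified in-memory model of the stage's behaviour, not DataStage code; the emp/dept rows are invented, and unmatched rows get None where DataStage would produce NULL:

```python
# Simplified model of the Join stage: two inputs joined on a key column.
# how = "F" (full outer), "I" (inner), "L" (left outer), "R" (right outer).

def join(left, right, key, how="I"):
    rcols = {c for row in right for c in row} - {key}
    lcols = {c for row in left for c in row} - {key}
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    out, matched = [], set()
    for lrow in left:
        for rrow in index.get(lrow[key], []):
            out.append({**lrow, **rrow})      # matching rows: always output
            matched.add(id(rrow))
        if lrow[key] not in index and how in ("L", "F"):
            out.append({**lrow, **{c: None for c in rcols}})
    if how in ("R", "F"):                     # keep unmatched right rows
        for rrow in right:
            if id(rrow) not in matched:
                out.append({**{c: None for c in lcols}, **rrow})
    return out

emp = [{"empno": 111, "deptno": 10}, {"empno": 444, "deptno": 40}]
dept = [{"deptno": 10, "dname": "HR"}, {"deptno": 30, "dname": "IT"}]
print(join(emp, dept, "deptno", how="F"))
```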
Dept table Data:
JOB2: Left Outer Join
Output:
LOOKUP JOB :
JOB3:Right Outer Join
Output:
Output:
LOOKUP STAGE:
1.The Lookup stage is a processing stage.
2.It is used to perform lookup operations on a data set read into memory from any other
Parallel job stage that can output data
3. The most common use for a lookup is to map short codes in the input data set onto
expanded information from a lookup table which is then joined to the incoming data and
output. For example, you could have an input data set carrying names and addresses of
your U.S. customers. The data as presented identifies state as a two letter U. S. state
postal code, but you want the data to carry the full name of the state. You could define a
lookup table that carries a list of codes matched to states, defining the code as the key
column. As the Lookup stage reads each line, it uses the key to look up the state in the
lookup table. It adds the state to a new column defined for the output link, and so the full
state name is added to each address. If any state codes have been incorrectly entered in
the data set, the code will not be found in the lookup table, and so that record will be
rejected.
4. Lookups can also be used for validation of a row. If there is no corresponding entry in the lookup table for the key's values, the row is rejected.
5.The Lookup stage is one of three stages that join tables based on the values of key
columns. The other two are:
Join stage - Join stage
Merge stage - Merge Stage
6.The three stages differ mainly in the memory they use, the treatment of rows with
unmatched keys, and their requirements for data being input
7. The Lookup stage can have a reference link, a single input link, a single output link,
and a single rejects link
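The behaviour described above, including the four Lookup Failure options discussed in the jobs that follow, can be sketched in Python. This is a simplified in-memory model, not DataStage code; the state-code reference data echoes the example in point 3:

```python
# Simplified model of the Lookup stage: the reference link is loaded into
# memory as a dict, then each input row is looked up by key.

def lookup(rows, reference, key, on_failure="Continue"):
    ref = {r[key]: r for r in reference}      # reference data in memory
    output, rejects = [], []
    for row in rows:
        match = ref.get(row[key])
        if match is not None:
            output.append({**row, **match})   # add the reference columns
        elif on_failure == "Drop":
            continue                          # behaves like an inner join
        elif on_failure == "Reject":
            rejects.append(row)               # send to the reject link
        elif on_failure == "Fail":
            raise RuntimeError(f"no reference entry for {row[key]!r}")
        else:                                 # "Continue": pass through
            output.append(row)
    return output, rejects

states = [{"code": "NY", "state": "New York"}]
out, rej = lookup([{"code": "NY"}, {"code": "ZZ"}], states, "code",
                  on_failure="Reject")
# the matched row gains the full state name; "ZZ" goes to the reject link
```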
Input Data:
ReferenceData:
LOOK UPJOB:
Lookup Failure = Drop
If Lookup Failure = Drop, unmatched input rows are dropped, so the stage behaves like an inner join.
Output:
2.LOOK UP JOB
Lookup Failure=Continue
Output:
3.LOOKUP JOB:
Lookup Failure=Reject
If Lookup Failure = Reject, the input records that do not match the reference data are sent to the reject output link.
JOB:
Output:
Input Rejected :
LOOK UPJOB:
Lookup Failure = Fail
If Lookup Failure = Fail and any input record is not found in the reference file, then the job fails.
MERGE STAGE:
1.Merge keys
2.Options
1.Merge Keys
Key=?
Sort order=Ascending/Descending
Sort in either ascending or descending order.
2.Options:
Unmatched Master mode=Drop/keep
Warn on reject updates=True
Warn on unmatched master=True
Masterdata:
UpdateData:
JOB1:
Unmatched Master mode=Drop
Warn on reject updates=True
Warn on unmatched master=True
Type: List
Keep means that unmatched rows (those without any updates) from the master link are
output; Drop means that unmatched rows are dropped
Output:
Master_Rejects:
JOB2:
Unmatched Master mode=Keep
Warn on reject updates=True
Warn on unmatched master=True
Output:
Master_Records:
Note : If the options "Warn on Reject Updates = True" and "Warn on Unmatched Masters
= True" then the log file shows the warnings on Reject Updates and Unmatched Data
from Masters
Note : If the options "Warn on Reject Updates = False" and "Warn on Unmatched Masters = False" then the log file does not show the warnings on Reject Updates and Unmatched Data from Masters.
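The Unmatched Master Mode and reject behaviour described in the jobs above can be sketched in Python. This is a simplified in-memory model, not DataStage code; the column names are invented:

```python
# Simplified model of the Merge stage: master rows are merged with update
# rows on a key. Unmatched Master Mode = Keep/Drop controls whether masters
# without updates are output; updates without a master go to a reject link.

def merge(master, updates, key, unmatched_master="Keep"):
    upd = {u[key]: u for u in updates}
    master_keys = {m[key] for m in master}
    output, update_rejects = [], []
    for m in master:
        if m[key] in upd:
            output.append({**m, **upd[m[key]]})   # master merged with update
        elif unmatched_master == "Keep":
            output.append(m)                      # Keep: unmatched master survives
    for u in updates:
        if u[key] not in master_keys:
            update_rejects.append(u)              # update without a master
    return output, update_rejects

master = [{"cid": 1, "name": "A"}, {"cid": 2, "name": "B"}]
updates = [{"cid": 1, "addr": "HYD"}, {"cid": 3, "addr": "PUNE"}]
out, rej = merge(master, updates, "cid", unmatched_master="Drop")
# Drop: only master 1 (which has an update) is output; update 3 is rejected
```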
MODIFY STAGE JOBS:
Inputfile Data:
CUSTID,CNAME,ADDRESS,CITY,STATE,ZIP,CUSTDOB
1000,BHASKAR,MOOSPET,HYDERABAD,AP,500071,1983-03-10 16:02:00
2000,SUMIT,,BANGALORE,KA,560070,1985-03-01 16:02:00
3000,SRIKAR,ERRAGADDA,HYDERABAD,AP,,1985-05-01 16:02:00
4000,SRUJANA,,HYDERABAD,AP,,1986-07-01 16:02:00
Job:
Modify stage input columns
Output:
Modify Job2:
Null Handle:
Inputdata:
Job:
CUSTDOB=date_from_timestamp(CUSTDOB)
ZIP=Handle_Null('ZIP','999999')
Output:
3.Modify Job
Drop Columns
Job:
ZIP=Handle_Null('ZIP','999999')
Specification=Drop CUSTID
Modify stage Input columns
Output:
4.Modify Job
Keep Columns
Inputdata:
Job:
Modify stage properties:
KEEP CUSTID
Modify stage input columns
Modify stage output columns
Output:
Copy Stage:
1.The Copy stage is a processing stage.
2.It can have a single input link and any number of output links.
3. The Copy stage copies a single input data set to a number of output data sets.
4. Each record of the input data set is copied to every output data set. Records can be copied without modification, or you can drop columns or change the order of columns (to copy with more modification - for example changing column data types - use the Modify stage).
5. Where you are using a Copy stage with a single input and a single output, you should ensure that you set the Force property in the stage editor to True. This prevents InfoSphere™ DataStage® from deciding that the Copy operation is superfluous and optimizing it out of the job.
Input Page. This is where you specify details about the input link carrying the data to
be copied.
Output Page. This is where you specify details about the copied data being output
from the stage
Input:
Output:
Job:
Copy stage general tab
Copy stage Results set Output:
Aggregator Stage:
1.The Aggregator stage is a processing stage.
2.It classifies data rows from a single input link into groups and computes totals or other
aggregate functions for each group. The summed totals for each group are output from
the stage via an output link.
The stage editor has three pages:
Stage Page. This is always present and is used to specify general information about
the stage.
Input Page. This is where you specify details about the data being grouped or
aggregated.
Output Page. This is where you specify details about the groups being output from
the stage.
Aggregator stage general tab Options:
1.Grouping Keys:
2.Aggregations
3.Options
1.Grouping Keys:
Group=Specifies an input column you are using as a grouping key.
Grouping Keys
Group
CaseSensitive=True/False
2. Aggregations:
Aggregation type
Whether to perform calculation(s) on column(s), re-calculate on previously created
summary columns, or count rows.
Calculation
Count of Rows
Re-Calculation
Aggregation type=Calculation
Column for calculation=Column name
1.Corrected sum of squares output column
Name of column to hold the corrected sum of squares of data in the aggregate column.
->Decimal output=?
1.Maximum value output column
Name of column to hold the maximum value encountered in the aggregate column.
->Decimal output=?
2.Mean Value output column
->Decimal output=?
Name of column to hold the mean value of data in the aggregate column.
4. missing values
Specifies what constitutes a 'missing' value, for example -1 or NULL. Enter the value as a
floating point number.
7.Preserve type=True/False
True means that the datatype of the output column is derived from the input column when
calculating minimum value, maximum value.
Name of column to hold the standard error of data in the aggregate column.
16.weighting column
Increment the count for the group by the contents of the weight field for each record in
the group, instead of by 1. (Applies to: Percent Coefficient of Variation, Mean Value,
Sum, Sum of Weights, Uncorrected Sum of Squares.)
2.Options:
Allow null output=True/False
True means that NULL is a valid output value when calculating minimum value,
maximum value, mean value, standard deviation, standard error, sum, sum of weights,
and variance. False means 0 is output when all input values for calculation column
are NULL.
Method=Hash/Sort
Use hash mode for a relatively small number of groups; generally, fewer than about
1000 groups per megabyte of memory. Sort mode requires the input data set to have
been partition sorted with all of the grouping keys specified as hashing and sorting
keys.
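The hash-mode grouping described above can be sketched in Python. This is an illustrative in-memory model, not the stage's actual implementation; the emp rows are invented:

```python
# Simplified model of the Aggregator stage in hash mode: rows are grouped
# in an in-memory dict keyed by the grouping column, then calculations
# (count of rows, sum, max, mean) are produced per group.
from collections import defaultdict

def aggregate(rows, group_key, calc_column):
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[calc_column])
    return {k: {"count": len(v), "sum": sum(v), "max": max(v),
                "mean": sum(v) / len(v)}
            for k, v in groups.items()}

emp = [{"deptno": 10, "sal": 1000}, {"deptno": 10, "sal": 3000},
       {"deptno": 20, "sal": 2000}]
summary = aggregate(emp, "deptno", "sal")
print(summary)
```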
Input data:
Requirement:
Output file11 data:
database properties;
Copy stage properties:
output tab:
Aggregator2 properties:
output tab:
Column generator properties:
output tab:
Aggregator 4 properties:
Output tab:
Target seqfile 12 properties:
Requirement:
Output file11 data:
JOB:
database properties;
Copy stage properties:
Aggregator 2 properties:
Output tab:
Column generator properties:
Aggregator4 properties:
output tab:
Target seqfile12 Properties:
Sorting:
Sorting can be done in different ways:
1. If the source is a database, then we can use an ORDER BY clause to sort the data based on column names.
2. If the source is a database, then we can use a query like this:
Select Distinct column(s)
From tabname
Order by Column(s)
Link level sort Example:
Input Data:
Here in the above screen shot, if we observe carefully, three check boxes have to be selected.
Output:
Click On output
Properties for Outputname=Emp
Columns:
Selection:
SQL:
View Data:
Properties for Outputname=Emp
Columns:
SQL:
View Data:
Emp_Dataset Properties:
View data:
Dept_Data_set:Properties:
View data:
ENCODE STAGE:
1.It is a processing stage that encodes the records into a single raw stream with the support of a command-line compression utility.
2.It supports 1 input and 1 output.
Properties:
Stage
Options: Command Line = compress/gzip
Input
Output
Load the meta data of the source files.
DECODE STAGE:
1.It is a processing stage that decodes previously encoded data.
2.It supports 1 input and 1 output.
Properties:
Stage
Options: Command Line = uncompress/gunzip
Output
Load the meta data of the source file.
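The round trip performed by the Encode/Decode pair can be demonstrated with Python's gzip module standing in for the command-line compressor (the sample records are invented):

```python
# Encode compresses the data set into a single raw stream; Decode reverses
# it. Python's gzip module performs the same round trip as gzip/gunzip.
import gzip

records = b"100,BHASKAR\n200,SRUJANA\n"
encoded = gzip.compress(records)          # Encode stage (command = gzip)
decoded = gzip.decompress(encoded)        # Decode stage (command = gunzip)
assert decoded == records                 # lossless round trip
```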
Parameter Set :
Procedure to create Parameter Set:
1. Choose File > New to open the New dialog box.
2. Open the Other folder, select the Parameter Set icon, and click OK.
3. The Parameter Set dialog box appears.
4. Enter the required details on each of the pages as detailed in the following
sections.
Parameter Set General tab
Use the General page to specify a name for your parameter set and to provide
descriptions of it.
Parameter Set Parameters tab:
Use this page to specify the actual parameters that are being stored in the parameter
Set
Parameter Set Value tab:
Use this page to optionally specify sets of values to be used for the parameters in this
parameter set when a job using it is designed or run.
Parameter Sets
General Parameters Values
InputData2:
Output:
Job:
Output:
Pivot Stage:
1.The Pivot stage is an active stage.
2.The Pivot stage is a processing stage.
3.It maps sets of columns in an input table to a single column in an output table. This type of mapping is called pivoting.
4.The Pivot stage converts columns into rows.
Scenario:
Eg., Marks1, Marks2 and Marks3 are three columns.
Methodology: in the derivation field of the output column, list the input columns to be mapped into one column.
Note: Column "Marks" is derived from the input columns Marks1, Marks2 and Marks3.
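The horizontal pivot described in the note can be sketched in Python (a simplified model; the column names follow the Marks example, and the sno key column is invented):

```python
# Simplified model of the Pivot stage: the Marks1/Marks2/Marks3 columns
# are mapped onto a single "Marks" output column, producing one output
# row per pivoted input column (columns become rows).

def pivot(rows, pivot_cols, out_col, keep):
    out = []
    for row in rows:
        for col in pivot_cols:
            out.append({**{k: row[k] for k in keep}, out_col: row[col]})
    return out

marks = [{"sno": 1, "Marks1": 60, "Marks2": 70, "Marks3": 80}]
pivoted = pivot(marks, ["Marks1", "Marks2", "Marks3"], "Marks", ["sno"])
print(pivoted)
```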
OutputData:
Job:
-----
OutputData:
4.You can use a Surrogate Key Generator stage to perform the following tasks:
Create or delete the key source before other jobs run
Update a state file with a range of key values
Generate surrogate key columns and pass them to the next stage in the job
View the contents of the state file
5.Generated keys are unsigned 64-bit integers. The key source can be a state file or a
database sequence. If you are using a database sequence, the sequence must be created by
the Surrogate Key stage. You cannot use a sequence previously created outside of
DataStage.
6.You can use the Surrogate Key Generator stage to update a state file, but not a database
sequence. Sequences must be modified with database tools.
InputData:
Output:
Job:
Surrogate key properties:
1.Key source:
Generate output column name= Surrogatekey1(This column we have to generate
at output)
Source Name=C:/data/bhaskar/empty.txt(we have to create the empty .txt file in
the given path)
Source Type=Flat File
2.Options:
Generate key from Last Highest value= Yes/No
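The flat-file state source described above can be sketched in Python. This is a simplified model, not DataStage code; `generate_keys` is an invented helper, the state-file path is illustrative, and the output column name follows the Surrogatekey1 example:

```python
# Sketch of a Surrogate Key Generator with a flat-file state source:
# the last highest key is read from the state file, new keys continue
# from it, and the state file is updated afterwards.
import os
import tempfile

def generate_keys(rows, state_file):
    last = 0
    if os.path.exists(state_file):
        text = open(state_file).read().strip()
        if text:
            last = int(text)                 # last highest value so far
    keyed = []
    for row in rows:
        last += 1
        keyed.append({**row, "Surrogatekey1": last})
    with open(state_file, "w") as f:         # update the state file
        f.write(str(last))
    return keyed

state = os.path.join(tempfile.gettempdir(), "sk_state.txt")  # illustrative path
with open(state, "w") as f:
    f.write("100")                           # pretend 100 was the last key
rows = generate_keys([{"name": "A"}, {"name": "B"}], state)
# -> keys 101 and 102; the state file now holds "102"
```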
Output:
TRANSFORMER STAGE:
The Transformer stage plays a major role in DataStage. It is used to modify the data and apply functions while populating data from source to target.
It takes one input link and gives one (or) more than one output links.
It has 3 components:
1. Stage variables
2. Constraints
3. Derivations (or) Expressions
1. The Transformer stage can work as a copy stage and as a filter stage.
2. The Transformer stage requires a C++ compiler; the generated code is compiled into machine code.
Double click on transformer stage-->drag and drop the required target columns-->click OK.
The order of execution of transformer stage is
1. Stage variable
2. Constraints
3. Derivations
Example:
How to work transformer as filter stage (or) how to apply constraints in the
transformer stage:
Double click on transformer stage-->double click on Constraint-->again double click on the particular link-->this window provides all information automatically; for a reject link, tick Otherwise.
Example Derivation:
1. Ds Macro:
Ds Macro provides some built in Functions like
1. DsProjectName()
2. DsJobName()
3. DsHostName()
4. DsJobStartDate()
5. DsJobStartTime()
6. DsJobStartTimeStamp()
2. Ds routine:
It is nothing but a set of functions.
3. Job parameters:
Job parameters are nothing but variables. They are used to reduce the redundancy of work.
4. Input columns:
It provides all input column names
5. Stage variables:
Stage variables are used to increase performance and to reduce the redundancy of work.
How to define stage variable properties:
Click on the stage variables area-->right click-->select stage variable properties-->define the stage variables.
6. System variables:
It contains some built-in variables like
1. @INROWNUM
2. @OUTROWNUM
3. @NUMPARTITIONS
@INROWNUM and @OUTROWNUM provide how many records are loaded into the transformer stage and how many records are extracted from it; @NUMPARTITIONS tells how many nodes are used.
7. String:
It provides a hard-coded value within double quotation marks.
8. Functions:
There are several built-in functions in DataStage:
1. Date&Time
2. Logical
3. Mathematical
4. Null Handling
5. Number
6. Raw
7. String
8. Type Conversion
9. Utility
Inputfile:
Output requirement
JOB:
Sequential file:
Results stage variable Derivation:
---------------------------------------
INPUTCOLUMNS:
OUTPUTLINK:
TARGET FILE:
2) EXAMPLE JOB FOR TRANSFORMER:
JOB2:
Input file:
Output requirement:
JOB:
Stage variable derivation:
Field(INPUT.HDATE,"/",3):"-": Field(INPUT.HDATE,"/",2):"-":
Field(INPUT.HDATE,"/",1)
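The Field() derivation above splits HDATE on "/" and rebuilds it with the fields reversed. Assuming HDATE arrives as dd/mm/yyyy, the equivalent logic in Python:

```python
# Equivalent of the derivation Field(INPUT.HDATE,"/",3):"-":
# Field(INPUT.HDATE,"/",2):"-":Field(INPUT.HDATE,"/",1)
# i.e. convert dd/mm/yyyy into yyyy-mm-dd.
def reformat_date(hdate):
    day, month, year = hdate.split("/")
    return f"{year}-{month}-{day}"

print(reformat_date("10/03/1983"))   # -> 1983-03-10
```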
Output2:
JOB:
INPUT:
Transformer1:
Left(INPUT.REC,1)
Transformer2:
Constrains logic:
OUTPUTINVC:
OUTPUTPRODID:
4. EXAMPLE JOB FOR TRANSFORMER STAGE:
Input file:seqfile1:
Input file:seqfile0:
Job:
Input file1 properties:
Join properties:
Stage variable Status derivation:
2. REMOVE DUPLICATES:
Inputdata:
Output:
JOB:
<EMPINFO>
<EMPDETAILS>
<EMPID>1</EMPID>
<NAME>BHASKAR</NAME>
<GENDER>MALE</GENDER>
<COMPANY>IBM</COMPANY>
<CITY>HYDERABAD</CITY>
</EMPDETAILS>
<EMPDETAILS>
<EMPID>2</EMPID>
<NAME>PRADEEP</NAME>
<GENDER>MALE</GENDER>
<COMPANY>WIPRO</COMPANY>
<CITY>BANGLORE</CITY>
</EMPDETAILS>
<EMPDETAILS>
<EMPID>3</EMPID>
<NAME>SRUJANA</NAME>
<GENDER>FEMALE</GENDER>
<COMPANY>INFOSYS</COMPANY>
<CITY>HYDERABAD</CITY>
</EMPDETAILS>
<EMPDETAILS>
<EMPID>4</EMPID>
<NAME>KRISHNAVENI</NAME>
<GENDER>FEMALE</GENDER>
<COMPANY>TCS</COMPANY>
<CITY>PUNE</CITY>
</EMPDETAILS>
<EMPDETAILS>
<EMPID>5</EMPID>
<NAME>SRIKARAN</NAME>
<GENDER>MALE</GENDER>
<COMPANY>COGNIZANT</COMPANY>
<CITY>CHENNAI</CITY>
</EMPDETAILS>
</EMPINFO>
JOB:
1. Validation settings
3. Transformation settings
3. Options
Options->Input->Columns
XML Input Stage:
Xml input file data:
<EMPLOYEE>
<EMP>
<NAME>BHASKAR</NAME>
<DEPT>FINANCE</DEPT>
<SAL>10000</SAL>
</EMP>
<EMP>
<NAME>SRUJANA</NAME>
<DEPT>OTC</DEPT>
<SAL>20000</SAL>
</EMP>
<EMP>
<NAME>PRADEEP</NAME>
<DEPT>CUSTOMER</DEPT>
<SAL>30000</SAL>
</EMP>
</EMPLOYEE>
Job:
sequential file columns:
Here we have to read the entire xml file as a single record.
Xml Input Stage Input tab column properties:
Xml Input Stage Output tab general properties:
Xml Input Stage Output tab advanced properties:
Xml Input Stage Output tab Last Advanced properties:
Target sequential file properties:
FTP STAGE:
File Transfer from one data stage file server to other file server:
Job:
FTP stage general tab properties:
FTP stage properties tab properties:
File Transfer from Local Windows machine to UNIX server:
Connected to “servername”
>User <servername:<none>>:Userid
Password Required for UserId
Password:<Enter Password>
FTP>cd “Path”
Example:
FTP>cd temp\Data
250 command Successful
FTP>PWD
257 “temp\Data” is a current directory
FTP>put Sample.txt
200 PORT command successful
150 Opening data connection for Sample.txt
Containers:
Containers are used to minimize the complexity of a job and for better understanding and reusability purposes.
1. Local container
2. Shared container
Local container: it is used to minimize the complexity of a job for better understanding purposes only.
It is never used for reusability, and its scope is within a job.
Shared container:
Shared containers are used for both purposes: to minimize the complexity of a job and for reusability. Their scope is within a project.
Local Container:
Shared container:
1. It is used for both purposes: to minimize the complexity of a job and for reusability.
2. Its scope is within a project.
3. It occupies some memory.
4. It cannot be deconstructed directly; it must first be converted into a local container and then deconstructed.
Usage of shared containers in another job:
Create a new job-->drag and drop the shared container into the new job-->double click on the shared container-->go to output-->assign the old output link (shared container link) to the new output link-->go to columns-->click on load-->select reconcile from the container link-->click on validate-->do the same for the remaining links-->click OK.
JOB SEQUENCE:
It is used to run all jobs in a sequence (or) in an order by considering their dependencies. It has many activities.
How to build a job sequence:
Select job sequence-->drag and drop the required jobs from the repository-->give connections-->save it-->compile it; now these jobs will run sequentially.
NOTIFICATION ACTIVITY:
It is used to send a mail to the required persons automatically.
Double click on notification activity-->go to notification-->SMTP mail server: company name (www.xyz.com), sender email address: abreddy2@xyz.com, recipients email address: abreddy2@xyz.com, email subject: Aggregator job has been aborted-->give some information in the body-->click OK.
TERMINATOR ACTIVITY:
It is used to send a stop request to all running jobs.
WAIT FOR FILE ACTIVITY:
Double click on wait for file activity-->go to file-->filename: select the file and set the timing (24-hour time only).
SEQUENCER:
It is used to connect one activity to another activity. It takes multiple input links and gives one output link.
ROUTINE ACTIVITY:
It is used to execute a routine between two jobs.
Double click on routine activity-->choose the routine name-->if the routine requires parameters, give the parameters.
SLOWLY CHANGING DIMENSIONS:
There are 3 types of SCDs available in DWH.
Type1: It maintains only current data; updates overwrite the old values.
Type2: It maintains current data and full historical data.
Type3: It maintains current data and partial historical data.
EXERICE-1:
no name sal
100 Bhaskar 1500
101 Mohan 2000
103 Sanjeev 2000
no name sal
100 Bhaskar 1000
101 Mohan 1500
102 Srikanth 2000
no name sal
100 Bhaskar 1500
101 Mohan 2000
102 Srikanth 2000
103 Sanjeev 2000
Type-I:
In SCD Type-I, if a record exists in the source table but not in the target table, then simply insert the record into the target table (record 103). If a record exists in both the source and target tables, then simply update the target with the source record (100, 101).
Type-II:
While implementing SCD Type-II, two extra columns are maintained in the target, called Effective Start Date and Effective End Date. The Effective Start Date is also part of the primary key.
If a record exists in the source but not in the target table, then simply insert the record into the target table; while inserting, set Effective Start Date to the current date and Effective End Date to null.
If a record exists in both the source and target tables, we still insert the source record into the target table, but before inserting it we update the existing target record with Effective End Date = Current Date - 1.
Then insert the source record into the target table with Effective Start Date = Current Date and Effective End Date = Null.
Type-III:
If a record exists in the source but not in the target table, then simply insert the record into the target table; while inserting, set Effective Start Date to the current date and Effective End Date to null.
If a record exists in both the source and target tables, then check the target table count grouped by the primary key. If the count = 1, then update Effective End Date = Current Date - 1 and simply insert the source record into the target.
If the count is greater than one, then delete the record in the target table grouped by the primary key where Effective End Date is not null, update the target record with Effective End Date = Current Date - 1, and then simply insert the source record into the target.
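The SCD Type-II logic described above can be sketched in Python. This is a simplified in-memory model, not DataStage code; the eff_start/eff_end column names and sample rows are illustrative:

```python
# Sketch of SCD Type-2: new keys are inserted with eff_start = today and
# eff_end = NULL; changed keys close the current target row
# (eff_end = today - 1 day) and insert the new version.
from datetime import date, timedelta

def scd_type2(source, target, key, today):
    # index the currently-open target rows (eff_end is NULL)
    current = {t[key]: t for t in target if t["eff_end"] is None}
    for s in source:
        if s[key] in current:
            old = current[s[key]]
            if any(old[c] != v for c, v in s.items()):      # values changed
                old["eff_end"] = today - timedelta(days=1)  # close old row
                target.append({**s, "eff_start": today, "eff_end": None})
        else:                                               # brand-new key
            target.append({**s, "eff_start": today, "eff_end": None})
    return target

today = date(2024, 1, 10)
target = [{"no": 100, "sal": 1000,
           "eff_start": date(2020, 1, 1), "eff_end": None}]
source = [{"no": 100, "sal": 1500}, {"no": 103, "sal": 2000}]
result = scd_type2(source, target, "no", today)
# the old 100 row is closed; new versions of 100 and 103 are open
```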
Dimensional Modeling:
In this model tables called as dimensions (or) fact tables. It can be subdivided into three
schemas.
4. Star Schema
5. Snow Flake Schema
6. Multi Star Schema (or) Hybrid (or) Galaxy
Star Schema:
A fact table surrounded by dimensions is called a star schema; it looks like a star.
In a star schema, if there is only one fact table then it is called a simple star schema.
In a star schema, if there is more than one fact table then it is called a complex star schema.
Sales Fact table:
Sale_id
Customer_id
Product_id
Account_id
Time_id
Promotion_id
Sales_per_day
Profit_per_day
Account Dimension:
Account_id
Account_type
Account_holder_name
Account_open_date
Account_nominee
Account_open_balance
Promotion:
Promotion_id
Promotion_type
Promotion_date
Promotion_designation
Promotion_Area
Product:
Product_id
Product_name
Product_type
Product_desc
Product_version
Product_startdate
Product_expdate
Product_maxprice
Product_wholeprice
Customer:
Cust_id
Cust_name
Cust_type
Cust_address
Cust_phone
Cust_nationality
Cust_gender
Cust_father_name
Cust_middle_name
Time:
Time_id
Time_zone
Time_format
Month_day
Week_day
Year_day
Week_Year
DIMENSION TABLE:
If a table contains primary keys and provides detailed information about the table (or) master information of the table, it is called a dimension table.
FACT TABLE:
If a table contains more foreign keys, holds transactions, and provides summarized information, such a table is called a fact table.
DIMENSION TYPES:
There are several dimension types available:
CONFORMED DIMENSION:
If a dimension table is shared with more than one fact table (or) has its foreign key in more than one fact table, then that dimension table is called a conformed dimension.
DEGENERATED DIMENSION:
If a fact table acts as a dimension and is shared with another fact table (or) maintains a foreign key in another fact table, such a table is called a degenerated dimension.
JUNK DIMENSION:
A junk dimension contains text values, genders (male/female) and flag values (true/false), which are not useful for generating reports. Such a dimension is called a junk dimension.
DIRTY DIMENSION:
If a record occurs more than one time in a table, differing only by a non-key attribute, such a table is called a dirty dimension.
ADDITIVE FACTS:
If there is a possibility to add some value to the existing fact in the fact table, that fact is called an additive fact.
If there is a possibility to add some value to the existing fact only up to some extent, that fact is called a semi-additive fact.
Star schema vs Snow flake schema:
Star schema:
1. It maintains denormalized data in the dimension tables.
2. Performance increases when joining the fact table to dimension tables, because fewer joins are required compared with snow flake.
3. All dimension tables maintain a direct relationship with the fact table.
Snow flake schema:
1. It maintains normalized data in the dimension tables.
2. Performance decreases when joining the fact table to dimension tables to shrunken dimension tables, because it requires more inner joins compared with star.
3. Some dimension tables do not maintain a direct relationship with the fact table.
Data Profiling
Data Profiling:
1.Data profiling is the process of examining the data available in an existing data source.
2.A data source is usually a database or a file.
3.By doing data profiling we can collect statistics and information about the data.
Overview about Data Profiling:
1. Data profiling helps you create a data model in 3rd normal form, based solely on data available in the source system.
2. In order to create a data model in 3rd normal form we need the following information.
What is a Domain
A simple example of a Domain is the list of United States state abbreviations. The
Domain could be implemented as a CHAR(2) and would contain the following
valid value set: AL, AK, AR, CA, CO, CT, DE, DC, FL, GA, HI, ID, IL, IN, IA,
KS, KY, LA, ME, MD, MA, MI, MN, MS, MO, MT, NE, NV, NH, NJ, NM, NY,
NC, ND, OH, OK, OR, PA, RI, SC, SD, TN, TX, UT, VT, VA, WA, WV, WI, WY.
Many Columns can share the same Domain. Columns which share the same
Domain may be Synonym candidates.
A Domain is defined as the set of all valid values for a Column or set of Columns.
Domains contain target data type information, a user-defined list of valid values,
and a list of valid patterns. Each Schema has its own set of Domains.
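A minimal sketch of domain-based profiling, using an abbreviated version of the state-code domain above (`profile_column` is an invented helper, and the value set is truncated for illustration):

```python
# A Domain is the set of all valid values for a column. A profiler can
# flag the values that fall outside the domain.
STATE_DOMAIN = {"AL", "AK", "CA", "NY", "TX", "WA"}   # abbreviated set

def profile_column(values, domain):
    """Return the values that fall outside the domain."""
    return [v for v in values if v not in domain]

violations = profile_column(["CA", "XX", "NY"], STATE_DOMAIN)
print(violations)   # -> ['XX']
```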
Normalization:
Normalization is the process of decomposing a relation into smaller, well-structured relations without anomalies.
The rules used in this process are called normal forms.
Column profiling produces a list of inferred data types which fit the column data. Below are some of the reports generated by the SSIS Data Profiling task.
If you ran a Dependency profile for this table you would find the following dependencies, among others:
The list above represents true dependencies. Now let's take a look at the
dependencies below.
The first one is not a true dependency because First Name does not positively
determine Last Name, in that “BHASKAR” could be “REDDY” or “RAO”.
Similarly, FirstName + LastName doesn't uniquely determine PAN.
If you add the first list to the Dependency Model, you would get two keys:
EmployeeID
PAN
However, only one of them can be a primary key, the other key is called an alternate
key
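The dependency test described above can be sketched as a small function: column A determines column B only if no value of A maps to two different values of B. The function name and sample rows here are hypothetical, for illustration only:

```python
# Sketch: does column `a` functionally determine column `b` in these rows?
def determines(rows, a, b):
    seen = {}
    for row in rows:
        if row[a] in seen and seen[row[a]] != row[b]:
            return False          # same A value maps to two different B values
        seen[row[a]] = row[b]
    return True

employees = [
    {"EmployeeID": 1, "FirstName": "BHASKAR", "LastName": "REDDY"},
    {"EmployeeID": 2, "FirstName": "BHASKAR", "LastName": "RAO"},
]
print(determines(employees, "EmployeeID", "FirstName"))  # True
print(determines(employees, "FirstName", "LastName"))    # False
```

This matches the example in the text: FirstName does not determine LastName, so it is not a true dependency.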
3. Cross-Table Profiling:
Cross-table profiling compares columns across tables in a schema and determines which
ones contain similar values. This profile can determine whether a column or set of
columns is appropriate to serve as a foreign key between the selected tables.
Suppose a Schema contains the following two Tables:
Both relations contain employee data, but they are defined separately to segregate public
and private information. The EmpID and EmployeeID Columns have the same business
meaning and can be meaningfully combined into a single Column. In contrast, look at
how the MgrID column is used in the Employee Table. Even though MgrID uses similar
values to EmployeeID, it represents a different role in the database. Therefore you would
not define MgrID and EmployeeID as Synonyms.
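One simple signal cross-table profiling uses is value overlap: if nearly every distinct value of one column also appears in another, the pair is a synonym or foreign-key candidate. A minimal sketch, with hypothetical sample data standing in for the EmpID/EmployeeID example:

```python
def value_overlap(values_a, values_b):
    """Fraction of distinct values in A that also appear in B."""
    a, b = set(values_a), set(values_b)
    return len(a & b) / len(a) if a else 0.0

employee_ids = [101, 102, 103, 104]   # Employee.EmployeeID
payroll_ids  = [101, 102, 103]        # Payroll.EmpID
print(value_overlap(payroll_ids, employee_ids))  # 1.0 -> synonym/FK candidate
```

Note that overlap alone is not enough: as the MgrID example shows, a column can share values with EmployeeID yet play a different role, so a human still has to confirm the business meaning.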
Adaptable, flexible, and scalable architecture:
Handles high data volumes with common parallel processing technology, and uses
common services such as connectivity to access a wide range of data sources and targets.
1. Home
2. Overview
3. Investigate
4. Develop
5. Operate
2. Overview: Contains project properties, tasks, and the project dashboard.
Dashboard
Project properties
3. Investigate: Contains information discovery and investigation tasks.
Column Analysis
Key and Cross Domain Analysis
Base Line Analysis
Publish Analysis Results
Table Management
4. Develop: Contains data transformation and information services enablement tasks.
Data Quality
Information Analyzer Data Steward
Provides read-only views of analysis results. With this role, users can also view the
results of all analysis jobs.
Information Analyzer DrillDown User
Provides the ability to drill down into source data if drill down security is enabled.
4. Create the IA project. To create an IA project you have to log in with IA admin privileges.
5. Import metadata from the staging tables into the IA environment.
6. To import metadata you have to be logged in with IA admin privileges.
7. Add data sources to the created project.
8. Add the necessary users to the created project.
9. You can also add groups to the created project.
10. Assign project roles to the users or groups.
11. The following are the four different project roles we have in IA:
Information Analyzer Business Analyst
Information Analyzer Data Operator
Information Analyzer Data Steward
Information Analyzer Drilldown User
12. Run a CA (Column Analysis) job for single or multiple columns.
13. View the analysis results.
14. Capture the analysis results wherever data validation rules are given in the Data
profiling requirement template.
15. Fill in the Data profiling requirement sheet for all the columns for which data
validation rules are given.
16. Generate the reports the project requires.
17. Deliver the analysis results template and reports to the focals.
2. Enter User Name, Password, and Host Name of the services tier
3. After login to Information Analyzer main home screen click on file menu
4. Select open project and open project will display the list of created projects
5. Select the project whose tables you want to run Column Analysis on, and click
Open.
6. Now in the Information Analyzer workspace navigator menu select Investigate Tab
7. Select Column Analysis then it will display the below column analysis window
8. Now select the table, go to Tasks, and under Tasks find the Run Column
Analysis option. Click Run Column Analysis; it will take several minutes to
complete the column analysis. Once the column analysis completes it will show the
status as Analyzed, with the analysis run date.
9. In the screen shot below, the CA status is now Analyzed.
1. Now select the EMPID column from the above list, go to Tasks, and under Tasks
select Open Column Analysis.
Next click View Details; it will open the Column Analysis results window.
2. The column analysis results are presented in different tabs:
1. Overview:
2. Frequency Distribution:
3. Data Class:
4. Properties Analysis:
Properties has the following information:
1. Data type
2. Data length
3. Precision
4. Scale
5. Nullability
6. Cardinality type
Note: If the format is invalid, change the status to Violate and then
change the domain value status to Mark as Invalid; the values associated with the invalid
format then automatically change to Invalid in the Domain and Completeness tabs.
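To make the property list concrete, here is a rough sketch of how a few of these properties could be inferred from a column's values. This is purely illustrative: the `profile_column` function and its simplified type rules are hypothetical, not how Information Analyzer actually computes them:

```python
def profile_column(values):
    """Infer a few basic column properties from sample values."""
    non_null = [v for v in values if v is not None]
    return {
        # crude type inference: all-int -> INTEGER, otherwise VARCHAR
        "data_type":   "INTEGER" if all(isinstance(v, int) for v in non_null) else "VARCHAR",
        "data_length": max((len(str(v)) for v in non_null), default=0),
        "nullability": len(non_null) < len(values),   # any NULLs present?
        "cardinality": len(set(non_null)),            # distinct non-null values
    }

print(profile_column([1001, 1002, 1002, None]))
# {'data_type': 'INTEGER', 'data_length': 4, 'nullability': True, 'cardinality': 2}
```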
Base Line Analysis:
You use baseline analysis to determine whether the structure or content of your data has
changed between two versions of the same data. After a baseline analysis job completes,
you can create a report that summarizes the results of that job and then compare the
results with the results from your baseline data.
Baseline analysis:
When you want to know if your data has changed, you can use baseline analysis to
compare the column analysis results for two versions of the same data source. The
content and structure of data changes over time when it is accessed by multiple users.
When the structure of data changes, the system processes that use the data are affected.
To compare your data, you choose the analysis results that you want to set as the baseline
version. You use the baseline version to compare all subsequent analysis results of the
same data source. For example, if you ran a column analysis job on data source A on
Tuesday, you could then set the column analysis results of source A as the baseline and
save the baseline in the repository. On Wednesday, when you run a column analysis job
on data source A again, you can then compare the current analysis results of data source A
with the baseline results of data source A.
To identify changes in your data, a baseline analysis job evaluates the content and
structure of the data for differences between the baseline results and the current results.
The content and structure of your data consists of elements such as data classes, data
properties, primary keys, and data values. If the content of your data has changed, there
will be differences between the elements of each version of the data. If you are
monitoring changes in the structure and content of your data on a regular basis, you might
want to specify a checkpoint at regular intervals to compare to the baseline. You set a
checkpoint to save the analysis results of the table for comparison. You can then choose
to compare the baseline to the checkpoint or to the most recent analysis results.
If you know that your data has changed and that the changes are acceptable, you can
create a new baseline at any time
To identify changes in table structure, column structure, or column content from the
baseline version to the current version, you can compare analysis results.
You must have InfoSphere™ Information Analyzer Business Analyst privileges and have
completed the following task.
Procedure
You must have InfoSphere™ Information Analyzer Business Analyst privileges and an
Information Analyzer Data Operator must have completed the following task.
Procedure
You can set a checkpoint to save a subsequent point of the selected analysis results for a
table to use in comparing to the baseline. The checkpoint is saved as a point of
comparison even if subsequent analysis jobs are run on the table.
If you are monitoring changes in the structure and content of your data on a regular basis,
you might want to specify a checkpoint at regular intervals to compare to the baseline.
You set a checkpoint to save the analysis results of the table for comparison. You can then
choose to compare the baseline to the checkpoint or to the most recent analysis results.
A checkpoint can also save results at a point in time for analysis publication.
Procedure
To determine whether the content and structure of your data has changed over time, you
can use baseline analysis to compare a saved analysis summary of your table to a current
analysis result of the same table.
You can use baseline analysis to identify an analysis result that you want to set as the
baseline for all comparisons. Over time, or as your data changes, you can import
metadata for the table into the metadata repository again, run a column analysis job on
that table, and then compare the analysis results from that job to the baseline analysis.
You can continue to review and compare changes to the initial baseline as often as needed
or change the baseline if necessary.
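The Tuesday/Wednesday example above amounts to diffing two snapshots of analysis results. A minimal sketch of that comparison, with hypothetical property names standing in for a real column-analysis summary:

```python
def compare_to_baseline(baseline, current):
    """Report properties that changed between two profiling snapshots."""
    changes = {}
    for prop in baseline:
        if current.get(prop) != baseline[prop]:
            changes[prop] = (baseline[prop], current.get(prop))
    return changes

# Column analysis results for source A on two days (illustrative values)
tuesday   = {"data_type": "CHAR(2)", "cardinality": 51, "nullability": False}
wednesday = {"data_type": "CHAR(2)", "cardinality": 53, "nullability": True}
print(compare_to_baseline(tuesday, wednesday))
# {'cardinality': (51, 53), 'nullability': (False, True)}
```

If the reported changes are acceptable, the Wednesday snapshot could then be promoted to become the new baseline, just as the text describes.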
2. Select Data Quality, go to Tasks, and under Tasks find New Data Rule Definition.
3. Click New Data Rule Definition; the window below will pop up.
4. Click Overview and provide the data rule name in the Name text field; the short
description and long description are optional.
1. Condition:
IF
THEN
ELSE
AND
OR
NOT
2. Parentheses:
(
((
(((
Example:
Once we have written the logic, we have to validate whether it is syntactically
correct; if the logic is correct, click Save and Exit.
3. Source Data
Here source data is a field name.
4. Condition
Not
5. Type of check
=
>
<
>=
<=
<>
Contains --> string containment
Exists --> null value test
Matches_Format --> e.g. if country = India then zip code format = '999999'
Matches_Regex
occurs
occurs>
occurs>=
occurs<
occurs<=
In_Reference_column
In_reference_List
Unique
Is_numeric
Is_Date
6. Reference Data:
Rule Builder:
1. Business Problem:
Type of Check: exists
Identifies when a column contains data.
2. Business Problem:
Type of Check: in_reference_list
Identify whether the column GENDER contains the data value MALE or FEMALE.
In the source the values may be either upper or lower case, so they have to be
compared with reference data.
3. Business Problem:
Type of Check: matches_regex
Identify whether the column EMPID contains a numeric value, the length
of EMPID is 4, and the format of EMPID is '9999'.
4. Business Problem:
Type of Check: matches_format
If country code = "India", check whether the zip code format is '999999' or
not.
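The rule-builder checks above (reference-list, matches_regex, matches_format) can be sketched in plain Python. The function names and sample values are hypothetical; the real checks run inside Information Analyzer, not in code:

```python
import re

def gender_valid(value):
    """GENDER must be MALE or FEMALE, regardless of case (reference-list check)."""
    return value.upper() in {"MALE", "FEMALE"}

def empid_valid(value):
    """EMPID must be exactly 4 digits -- the '9999' format (regex check)."""
    return re.fullmatch(r"\d{4}", value) is not None

def zip_valid(country, zip_code):
    """If country is India, the zip code must match the '999999' format."""
    if country != "India":
        return True                      # the rule only applies to India
    return re.fullmatch(r"\d{6}", zip_code) is not None

print(gender_valid("female"), empid_valid("12345"), zip_valid("India", "500081"))
# True False True
```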
Quality Stage:
1. Why investigate:
Content
Features of investigate:
Analyze free-form and single-domain columns
Provide frequency distribution of distinct values and patterns
Investigate methods:
Character discrete
Character concatenate
Word investigate
INVESTIGATE STAGE:
1. Character Discrete Investigate, C mask:
Job:
Click Change Mask and select the C mask for all the fields.
The screen shot above is only for one field; it looks the same for the other fields
you selected.
Job:
Click Change Mask and select the C mask for all the fields.
At output it gives like below
At output it gives like below
Job:
At output it gives like below
At output it gives like below
At output it gives like below
Output you will get like below
The pattern report output you will get like below:
The word investigate pattern report gives 5 columns at output:
1. qsInvColumnName
2. qsInvPattern
3. qsInvSample
4. qsInvCount
5. qsInvPercent
The word investigate token report gives 3 columns at output:
1. qsInvCount
2. qsInvWord
3. qsInvClassCode
2.STANDARDIZE STAGE:
Example job:
Open the Standardize stage and select the Rule Set text field; open the standardize rules
folder, within it open the OTHER folder, then the COUNTRY folder, and select
COUNTRY.
Next, in the Literal text field, enter ZQUSZQ.
Add the columns you want from the available data columns.
Select the column name <literal> and the AddressLine1, AddressLine2, City, State, Zip
columns in the selected columns list.
You will get the screen below after entering everything.
Next click OK.
You will get the screen below.
Next click OK.
Output:
Quality stage:
Investigate: 3 methods:
1. Character discrete --> C, T, X masks
2. Character concatenate --> C, T, X masks
3. Word investigate --> token report, pattern report
Lab:
Character discrete, C mask (select one or many columns)
Character concatenate, C mask (select two or more columns to concatenate)
Word investigate: FullName:
Token report
Pattern report
Word investigate: Address (pass AddressLine1, AddressLine2)
Token report
Pattern report
Word investigate: Area (City, State, Zip)
Token report
Pattern report
2. Standardize stage:
1. Country identifier:
--> Select the COUNTRY rule set from Others.
--> Pass the literal ZQUSZQ and add the columns AddressLine1, AddressLine2, City,
State, Zip.
--> Filter the records that have the flag 'Y'; those are US records.
--> Split US and non-US records into separate targets.
2. Apply the USPREP rule set to filter name components from address fields, and
area components from address fields.
->pass ZQAREAZQ and add the column “City”
->pass ZQAREAZQ and add the column “State”
->pass ZQAREAZQ and add the column “Zip”
Standardize USNAME, USADDR, USAREA:
1. Select the USNAME rule set from the standardize rules and add the column
NameDomain_USPREP.
2. Select New Process, select the USADDR rule set, and add the column
AddressDomain_USPREP.
3. Select New Process, select the USAREA rule set, and add the column
AreaDomain_USPREP.
Rule set        Column
USNAME.SET      NameDomain_USPREP
USADDR.SET      AddressDomain_USPREP
USAREA.SET      AreaDomain_USPREP
Investigate unhandled name patterns:
Take the above job as input and use 3 investigate stages:
1. for InvUnhandledName
2. for InvUnhandledAddr
3. for InvUnhandledArea
InvUnhandledAddr:
Select the character concatenate method for Address.
Select the columns:
UnhandledPattern_USADDR --> set C mask
UnhandledData_USADDR --> set X mask
InputPattern_USADDR --> set X mask
AddressDomain_USPREP --> set X mask
InvUnhandledArea:
Select the character concatenate method for Area.
Select the columns:
UnhandledPattern_USAREA --> set C mask
UnhandledData_USAREA --> set X mask
InputPattern_USAREA --> set X mask
AreaDomain_USPREP --> set X mask
Rule Set override
1.Input Pattern +FI
3. Unhandled Pattern FFI
The pattern FFI represents a last name that was recognized as a first name.
UnhandledPattern   UnhandledData        InputPattern   InputNameText
FFI                HARRIS MARJORIE M    +FI.           HARRIS MARJORIE M.
Repeat the process for the remaining tokens, F and I, as well.
Now test the string HARRIS MARJORIE M.
1) Development Projects.
2) Enhancement Projects
3) Migration Projects
4) Production support Projects.
-> The following are the different phases involved in an ETL project development life
cycle.
7) Pre - Production
8) Production (Go-Live)
-> The business requirement gathering is started by the business analyst, onsite technical
lead, and client business users.
-> BRS: The business analyst will gather the business requirements and document them
in the BRS.
-> SRS: Senior technical people (or) the ETL architect will prepare the SRS, which
contains the s/w and h/w requirements.
Based on the HLD, a senior ETL developer prepares the Low Level Design document.
The LLD contains more technical details of an ETL system.
An LLD contains data flow diagram (DFD), details of source and targets of each
mapping.
An LLD also contains information about full and incremental load.
After the LLD, the development phase will start.
Development Phase (Coding):-
--------------------------------------------------
-> Based on the LLD, the ETL team will create mappings (ETL code).
-> After designing the mappings, the code (mappings) will be reviewed by
developers.
Code Review:-
Peer Review:-
-> The code will be reviewed by your team member (third-party developer).
Testing:-
--------------------------------
Unit Testing:-
-> A unit test for the DWH is white box testing; it should check the ETL procedures
and mappings.
-> The following are the test cases that can be executed by an ETL developer:
1) Verify data loss
2) No.of records in the source and target
3) Dataload/Insert
4) Dataload/Update
5) Incremental load
6) Data accuracy
7) Verify naming standards.
8) Verify column Mapping
-> The unit test will be carried out by the ETL developer in the development phase.
-> The ETL developer has to do the data validations also in this phase.
-> Run all the mappings in the sequence order.
-> First run the source to stage mappings.
-> Then run the mappings related to dimensions and facts.
-> This test is carried out in the presence of client side technical users to verify the
data migration from source to destination.
Production Environment:-
---------------------------------
-> Migrate the code into the Go-Live environment from test environment ( QA
Environment ).
EXPLANATION:
We have to start with this: our projects are mainly onsite-offshore model
projects. In this project we have one staging area between the source and target
databases. In some projects they won't use staging areas. The staging area simplifies the
process.
Architecture
Design: Outputs: Technical design docs, HLD, UTP. ETL lead, BA, and data architect,
80% onsite. (Schema design in Erwin, implementation in the database, and preparation of
the technical design document for ETL.)
UTP: Unit Test Plan. Write the test cases based on the requirement, both positive and
negative test cases.
Based on the HLD you have to create the mappings. After that, a code review and code
standards review will be done by another team member. Based on the review
comments you have to update the mappings. Unit testing is based on the UTP: you have to
fill in the UTP, enter the expected values, and name it the UTR (Unit Test Results). Two
rounds of code review and two rounds of unit testing will be conducted in this phase, then
the code is migrated to the testing repository. The integration test plan has to be prepared
by the senior people.
80% offshore
Based on the integration test plan, testers test the application and give the bug list to
the developers. Developers fix the bugs in the development repository and again
migrate to the testing repository. Testing repeats until the code is bug-free.
20% Onsite
UAT: User Acceptance Testing. The client will do the UAT; this is the last phase of the ETL
project. If the client is satisfied with the product, next is deployment in the production
environment.
Production: 50% offshore 50% onsite
Work will be distributed between offshore and onsite based on the run time of the
application. Mapping bugs need to be fixed by the development team. The development team
will provide support for a warranty period of 90 days, or as per the agreement.
In ETL projects there are three repositories. For each repository the access permissions
and location will be different:
Development: E1
Testing: E2
Production: E3
1.Project Explanation:
I'm giving a generic explanation of the project. Any project, whether banking, sales, or
insurance, can use this explanation.
SCD2 Mapping:
We are implementing an SCD2 mapping for the customer dimension or account dimension
to keep the history of the accounts or customers. We are using the SCD2 date
method. Before explaining this you should know the SCD2 method clearly; be careful
about it.
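The SCD2 date method mentioned above can be sketched as follows: when a tracked attribute changes, the current dimension row is closed by setting its end date, and a new row is inserted with an open-ended end date. The function, column names, and sample rows here are all hypothetical, a sketch of the technique rather than any tool's implementation:

```python
import datetime as dt

HIGH_DATE = dt.date(9999, 12, 31)   # conventional "open-ended" end date

def apply_scd2(dimension, incoming, today):
    """Date-method SCD Type 2: expire the current row, insert a new version."""
    for row in dimension:
        if row["cust_id"] == incoming["cust_id"] and row["end_date"] == HIGH_DATE:
            if row["city"] == incoming["city"]:
                return                      # no change, nothing to do
            row["end_date"] = today         # close the old version
    dimension.append({**incoming, "start_date": today, "end_date": HIGH_DATE})

dim = [{"cust_id": 7, "city": "Hyderabad",
        "start_date": dt.date(2007, 1, 1), "end_date": HIGH_DATE}]
apply_scd2(dim, {"cust_id": 7, "city": "Chennai"}, dt.date(2008, 6, 1))
print(len(dim))  # 2 rows: the old version is expired, history is preserved
```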
Responsibilities: pick from the project architecture post and answer according to your
comfort level. We are responsible for only development and testing; for scheduling we are
using third-party tools (Control-M, AutoSys, Job Tracker, Tivoli, etc.). We simply
give the dependencies between each mapping and the run time. Based on that
information the scheduling tool team will schedule the mappings. We won't schedule in
Informatica. That's it, finished.
I did my B.Sc. in Computers from Osmania University in 2007. After that I
had an opportunity to work for Wipro Technologies from Oct 2006 to Aug 2008,
where I started off my career as an ETL developer. I was with Wipro almost 2
years, then I moved to Ness Technologies in Aug 2008. Presently I am working with
Ness.
In total I have 3.5 years of experience in DWH using the DataStage tool, in development
and enhancement projects. Primarily I have worked in the healthcare and manufacturing
domains.
In my current project my roles & responsibilities are basically:
I am working in the onsite-offshore model, so we get our tasks from the
onsite team.
As a developer, first I need to understand the physical data model, i.e. the
dimensions and facts and their relationships, and also the functional specifications that
describe the business requirements designed by the business analyst.
I was involved in the preparation of the source-to-target mapping sheet (tech specs),
which tells us what the source and target are, which column we need to map
to which target column, and what the business logic should be. This document
gives a clear picture for the development.
Creating DataStage jobs using different transformations to implement business
logic.
Preparation of unit test cases is also one of my responsibilities, as per the business
requirements.
I was also involved in unit testing of the mappings I developed myself.
I used to do source code reviews for the DataStage jobs developed by my team
members.
I was also involved in the preparation of the deployment plan, which contains the list
of DataStage jobs that need to be migrated; based on this the deployment team can
migrate the code from one environment to another.
Once the code rolls out to production we also work with the production support
team for 2 weeks, during which we give the KT in parallel. So we also prepare the KT
document for the production team.
reporting level. These dashboards/reports can be used for analysis purposes, such as:
how many RFQs were created, how many RFQs were approved, and how many RFQs got
responses from the supply channels?
What is the budget?
What budget was approved?
Which approval manager is it pending with, and what is the past feedback on the supply
channels? etc.
They don't have a BI design, so they are using a manual process to achieve the
above by exporting Excel sheets; with BI we can do drill up and drill down and get all the
detailed reports with charts.
Prepared By
A.Bhaskar Reddy
Email:abreddy2003@gmail.com
91-9948047694