Let us assume that the source system is a Relational Database and the source table contains duplicate rows. To eliminate the duplicate records, we can check the Distinct option of the Source Qualifier of the source table and load the target accordingly.
Source Qualifier Transformation DISTINCT clause
If the source is a flat file instead, we use a Sorter Transformation and check its Distinct option. When we select the Distinct option, all the columns are selected as keys, in ascending order by default.
Source table:

Student Name | Maths | Life Science | Physical Science
Sam          | 100   | 70           | 80
John         | 75    | 100          | 85
Tom          | 80    | 100          | 85
Target table:

Student Name | Subject Name     | Marks
Sam          | Maths            | 100
Sam          | Life Science     | 70
Sam          | Physical Science | 80
John         | Maths            | 75
John         | Life Science     | 100
John         | Physical Science | 85
Tom          | Maths            | 80
Tom          | Life Science     | 100
Tom          | Physical Science | 85
Source table (unsorted):

Student Name | Subject Name     | Marks
Sam          | Maths            | 100
Tom          | Maths            | 80
Sam          | Physical Science | 80
John         | Maths            | 75
Sam          | Life Science     | 70
John         | Life Science     | 100
John         | Physical Science | 85
Tom          | Life Science     | 100
Tom          | Physical Science | 85
Target table:

Student Name | Maths | Life Science | Physical Science
Sam          | 100   | 70           | 80
John         | 75    | 100          | 85
Tom          | 80    | 100          | 85
We will sort the source data based on STUDENT_NAME ascending followed by SUBJECT ascending.
Sorter Transformation
Now, grouping by STUDENT_NAME in the Aggregator, the output subject columns are populated as:
MATHS: MAX(MARKS, SUBJECT='Maths')
LIFE_SC: MAX(MARKS, SUBJECT='Life Science')
PHY_SC: MAX(MARKS, SUBJECT='Physical Science')
Aggregator Transformation
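For comparison, a minimal SQL sketch of the same pivot (STUDENT_MARKS is an assumed table name holding the normalized STUDENT_NAME, SUBJECT, MARKS rows):

SELECT STUDENT_NAME,
       MAX(CASE WHEN SUBJECT = 'Maths'            THEN MARKS END) AS MATHS,
       MAX(CASE WHEN SUBJECT = 'Life Science'     THEN MARKS END) AS LIFE_SC,
       MAX(CASE WHEN SUBJECT = 'Physical Science' THEN MARKS END) AS PHY_SC
FROM   STUDENT_MARKS
GROUP  BY STUDENT_NAME;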
Checking the Select Distinct option adds a SELECT DISTINCT clause to the default SQL query, which in turn affects the number of rows returned by the database to the Integration Service; hence the Source Qualifier is an Active transformation.
Q10. What happens to a mapping if we alter the datatypes between Source and its corresponding
Source Qualifier?
Ans.
The Source Qualifier transformation displays the transformation datatypes. The transformation
datatypes determine how the source database binds data when the Integration Service reads it.
Now if we alter the datatypes in the Source Qualifier transformation so that the datatypes in the source definition and the Source Qualifier transformation no longer match, the Designer marks the mapping as invalid when we save it.
Q11. Suppose we have used the Select Distinct and the Number Of Sorted Ports property in the SQ
and then we add Custom SQL Query. Explain what will happen.
Ans.
Whenever we add a Custom SQL or SQL override query, it overrides the User-Defined Join, Source Filter, Number of Sorted Ports, and Select Distinct settings in the Source Qualifier transformation. Hence only the user-defined SQL query will be fired against the database and all the other options will be ignored.
Q12. Describe the situations where we will use the Source Filter, Select Distinct and Number Of Sorted
Ports properties of Source Qualifier transformation.
Ans.
The Source Filter option is used to reduce the number of rows the Integration Service queries, so as to improve performance.
The Select Distinct option is used when we want the Integration Service to select unique values from a source, filtering out unnecessary data earlier in the data flow, which might improve performance.
The Number Of Sorted Ports option is used when we want the source data to arrive sorted, so that downstream transformations such as Aggregator or Joiner, when configured for sorted input, perform better.
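For illustration, here is a sketch of the default query the Integration Service might generate when all three properties are set (CUSTOMERS and its columns are assumed names; the filter value follows the Q14 example below):

SELECT DISTINCT CUSTOMERS.CUSTOMER_ID, CUSTOMERS.CUSTOMER_NAME
FROM   CUSTOMERS
WHERE  CUSTOMERS.CUSTOMER_ID > 1000
ORDER  BY CUSTOMERS.CUSTOMER_ID

Note that the Source Filter is entered without the WHERE keyword, and Number Of Sorted Ports = 1 produces the ORDER BY on the first port.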
Q13. What will happen if the SELECT list COLUMNS in the Custom override SQL Query and the
OUTPUT PORTS order in SQ transformation do not match?
Ans.
A mismatch, or changing the order of the list of selected columns relative to the connected transformation output ports, may result in session failure.
Q14. What happens if in the Source Filter property of SQ transformation we include keyword WHERE
say, WHERE CUSTOMERS.CUSTOMER_ID > 1000.
Ans.
We use the source filter to reduce the number of source records. If we include the keyword WHERE in the source filter, the Integration Service fails the session.
Q15. Describe the scenarios where we go for Joiner transformation instead of Source Qualifier
transformation.
Ans.
We use the Joiner transformation to join source data from heterogeneous sources, as well as to join flat files.
Use the Joiner transformation when we need to join the following types of sources:
Join data from different Relational Databases.
Join data from different Flat Files.
Join relational sources and flat files.
Q16. What is the maximum number we can use in Number Of Sorted Ports for a Sybase source system?
Ans.
Sybase supports a maximum of 16 columns in an ORDER BY clause. So if the source is Sybase, do
not sort more than 16 columns.
Q17. Suppose we have two Source Qualifier transformations SQ1 and SQ2 connected to Target tables
TGT1 and TGT2 respectively. How do you ensure TGT2 is loaded after TGT1?
Ans.
If we have multiple Source Qualifier transformations connected to multiple targets, we can designate
the order in which the Integration Service loads data into the targets.
In the Mapping Designer, we need to configure the Target Load Plan based on the Source Qualifier transformations in the mapping to specify the required loading order.
Image: Target Load Plan
SQ Source Filter vs. Filter Transformation:
- SQ Source Filter: the Source Qualifier transformation filters rows as they are read from a source, and it can only filter rows from Relational Sources.
- Filter Transformation: filters rows from within the mapping, so it works with any type of source.
Filtering rows as early as possible in the data flow improves performance.
To improve performance for an Unsorted Joiner transformation, use the source with fewer rows as
the master source. The fewer unique rows in the master, the fewer iterations of the join comparison
occur, which speeds the join process.
When the Integration Service processes an unsorted Joiner transformation, it reads all master rows before it reads the detail rows. The Integration Service blocks the detail source while it caches rows from the master source. Once the Integration Service reads and caches all master rows, it unblocks the detail source and reads the detail rows.
To improve performance for a Sorted Joiner transformation, use the source with fewer duplicate key
values as the master source.
When the Integration Service processes a sorted Joiner transformation, it blocks data based on the mapping configuration and it stores fewer rows in the cache, increasing performance. Blocking logic is possible if master and detail input to the Joiner transformation originate from different sources. Otherwise, it does not use blocking logic. Instead, it stores more rows in the cache.
Q24. What are the different types of Joins available in Joiner Transformation?
Ans.
In SQL, a join is a relational operator that combines data from multiple tables into a single result set.
The Joiner transformation is similar to an SQL join except that data can originate from different types of
sources.
The Joiner transformation supports the following types of joins:
Normal
Master Outer
Detail Outer
Full Outer
Note: A normal or master outer join performs faster than a full outer or detail outer join.
Q25. Define the various Join Types of Joiner Transformation.
Ans.
In a normal join, the Integration Service discards all rows of data from the master and detail source that do not match, based on the join condition.
A master outer join keeps all rows of data from the detail source and the matching rows from the
master source. It discards the unmatched rows from the master source.
A detail outer join keeps all rows of data from the master source and the matching rows from the
detail source. It discards the unmatched rows from the detail source.
A full outer join keeps all rows of data from both the master and detail sources.
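In SQL terms, the four join types correspond roughly to the following (a sketch only; MASTER and DETAIL are assumed table names):

-- Normal join: only matching rows
SELECT * FROM DETAIL d INNER JOIN MASTER m ON d.KEY1 = m.KEY1;
-- Master outer join: all detail rows plus matching master rows
SELECT * FROM DETAIL d LEFT OUTER JOIN MASTER m ON d.KEY1 = m.KEY1;
-- Detail outer join: all master rows plus matching detail rows
SELECT * FROM DETAIL d RIGHT OUTER JOIN MASTER m ON d.KEY1 = m.KEY1;
-- Full outer join: all rows from both sides
SELECT * FROM DETAIL d FULL OUTER JOIN MASTER m ON d.KEY1 = m.KEY1;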
Q26. Describe the impact of the number of join conditions and the join order in a Joiner transformation.
Ans.
We can define one or more conditions based on equality between the specified master and detail sources.
Both ports in a condition must have the same datatype. If we need to use two ports in the join condition with non-matching datatypes, we must convert the datatypes so that they match. The Designer validates datatypes in a join condition.
Additional ports in the join condition increase the time necessary to join two sources.
The order of the ports in the join condition can impact the performance of the Joiner transformation. If we use multiple ports in the join condition, the Integration Service compares the ports in the order we specify.
NOTE: Only the equality operator is available in the Joiner join condition.
Q27. How does the Joiner transformation treat NULL value matching?
Ans.
The Joiner transformation does not match null values.
For example, if both EMP_ID1 and EMP_ID2 contain a row with a null value, the Integration Service
does not consider them a match and does not join the two rows.
To join rows with null values, replace null input with default values in the Ports tab of the joiner, and
then join on the default values.
Note: If a result set includes fields that do not contain data in either of the sources, the Joiner
transformation populates the empty fields with null values. If we know that a field will return a NULL
and we do not want to insert NULLs in the target, set a default value on the Ports tab for the
corresponding port.
Q28. Suppose we configure Sorter transformations in the master and detail pipelines with the following
sorted ports in order: ITEM_NO, ITEM_NAME, PRICE.
When we configure the join condition, what are the guidelines we need to follow to maintain the sort
order?
Ans.
If we have sorted both the master and detail pipelines in the order of the ports, say ITEM_NO, ITEM_NAME and PRICE, we must ensure that:
We use ITEM_NO in the first join condition.
If we add a second join condition, we must use ITEM_NAME.
If we want to use PRICE as a join condition apart from ITEM_NO, we must also use ITEM_NAME in the second join condition.
If we skip ITEM_NAME and join on ITEM_NO and PRICE, we will lose the input sort order and the Integration Service fails the session.
Q29. What are the transformations that cannot be placed between the sort origin and the Joiner transformation, so that we do not lose the input sort order?
Ans.
The best option is to place the Joiner transformation directly after the sort origin to maintain sorted
data.
However, do not place any of the following transformations between the sort origin and the Joiner transformation:
Custom
Unsorted Aggregator
Normalizer
Rank
Union transformation
XML Parser transformation
Next we place a Sorted Aggregator Transformation. Here we will find out the AVERAGE SALARY for each DEPTNO (GROUP BY DEPTNO).
When we perform this aggregation, we lose the data for individual employees. To maintain employee data, we must pass a branch of the pipeline to the Aggregator Transformation and pass a branch with the same sorted source data to the Joiner transformation to maintain the original data. When we join both branches of the pipeline, we join the aggregated data with the original data.
So next we need a Sorted Joiner Transformation to join the sorted aggregated data with the original data, based on DEPTNO.
Here we take the aggregated pipeline as the Master and the original data flow as the Detail pipeline.
After that we need a Filter Transformation to filter out the employees having salary less than the average salary for their department.
Filter Condition: SAL>=AVG_SAL
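The same logic in plain SQL, as a sketch (EMP with DEPTNO and SAL columns is an assumed schema):

SELECT e.*
FROM   EMP e
JOIN   (SELECT DEPTNO, AVG(SAL) AS AVG_SAL
        FROM   EMP
        GROUP  BY DEPTNO) a
  ON   e.DEPTNO = a.DEPTNO
WHERE  e.SAL >= a.AVG_SAL;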
Sequence Generator Properties:

Property                | Description
Start Value             | Start value of the generated sequence; when Cycle is enabled, the sequence wraps back to this value.
End Value               | Maximum value the Sequence Generator generates.
Cycle                   | If enabled, the sequence wraps around to the Start Value after reaching the End Value.
Number of Cached Values | Number of sequence values the Integration Service caches at a time.
Reset                   | For a non-reusable Sequence Generator, restarts the sequence from the original current value after each session run.
Q33. Suppose we have a source table populating two target tables. We connect the NEXTVAL port of
the Sequence Generator to the surrogate keys of both the target tables.
Will the surrogate keys in both the target tables be the same? If not, how can we flow the same sequence values in both of them?
Ans.
When we connect the NEXTVAL output port of the Sequence Generator directly to the surrogate key columns of the target tables, the sequence numbers will not be the same.
A block of sequence numbers is sent to one target table's surrogate key column. The second target receives a block of sequence numbers from the Sequence Generator transformation only after the first target table receives its block of sequence numbers.
Suppose we have 5 rows coming from the source; then the targets will have the sequence values as TGT1 (1,2,3,4,5) and TGT2 (6,7,8,9,10), taking into consideration Start Value 0, Current Value 1 and Increment By 1.
Now suppose the requirement is that we need the same surrogate keys in both targets.
The easiest way to handle this situation is to put an Expression Transformation between the Sequence Generator and the target tables. The Sequence Generator passes unique values to the Expression transformation, and the rows are then routed from the Expression transformation to the targets.
Sequence Generator
Q34. Suppose we have 100 records coming from the source, and for a target column population we use a Sequence Generator.
Suppose the Current Value is 0 and the End Value of the Sequence Generator is set to 80. What will happen?
Ans.
End Value is the maximum value the Sequence Generator will generate. After it reaches the End Value, the session fails with the following error message:
TT_11009 Sequence Generator Transformation: Overflow error.
The session failure can be avoided if the Sequence Generator is configured to Cycle through the sequence, i.e. whenever the Integration Service reaches the configured End Value for the sequence, it wraps around and starts the cycle again, beginning with the configured Start Value.
Q35. What are the changes we observe when we promote a non-reusable Sequence Generator to a reusable one?
And what happens if we set the Number of Cached Values to 0 for a reusable transformation?
Ans.
When we convert a non-reusable Sequence Generator to a reusable one, we observe that the Number of Cached Values is set to 1000 by default, and the Reset property is disabled.
When we try to set the Number of Cached Values property of a reusable Sequence Generator to 0 in the Transformation Developer, we encounter the following error message:
The number of cached values must be greater than zero for reusable sequence transformation.
Test Preparation
We will perform the same test with 4 different data points (data volumes) and log the results. We will start with 1 million rows in the detail table and 0.1 million in the master table. Subsequently we will test with 2 million, 4 million and 6 million detail table rows and 0.2 million, 0.4 million and 0.6 million master table rows. Here are the details of the setup we will use:
1. Oracle 10g database as relational source and target
2. Informatica PowerCenter 8.5 as ETL tool
3. Database and Informatica setup on different physical servers using HP UNIX
4. Source database table has no constraint, no index, no database statistics and no partition
5. Source database table is not available in Oracle shared pool before the same is read
6. There is no session-level partition in Informatica PowerCenter
7. There is no parallel hint provided in extraction SQL query
8. Informatica JOINER has enough cache size
We have used two sets of Informatica PowerCenter mappings created in the Informatica PowerCenter Designer. The first mapping, m_db_side_join, will use an INNER JOIN clause in the Source Qualifier to join data at the database level. The second mapping, m_Infa_side_join, will use an Informatica Joiner to join data at the Informatica level. We have executed these mappings with the different data points and logged the results.
Further to the above test, we will execute the m_db_side_join mapping once again, this time with proper database-side indexes and statistics, and log the results.
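For reference, the database-side join in m_db_side_join would be an SQL override of roughly this shape (table and key names are assumed, not taken from the actual test):

SELECT m.*, d.*
FROM   DETAIL_TABLE d
INNER  JOIN MASTER_TABLE m
  ON   d.JOIN_KEY = m.JOIN_KEY;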
Result
The following graph shows the performance of Informatica and the database in terms of the time taken by each system to join data. The average time is plotted along the vertical axis and the data points along the horizontal axis.
Verdict
In our test environment, Oracle 10g performs the JOIN operation 24% faster than the Informatica Joiner transformation without a database index, and 42% faster with database indexes.
Assumption
1. Average server load remains the same during all the experiments
2. Average network speed remains the same during all the experiments
Note
1. This data can only be used for performance comparison but cannot be used for performance
benchmarking.
2. This data is only indicative and may vary in different testing conditions.
Test Preparation
We will perform the same test with different data points (data volumes) and log the results. We will start with 1 million records and double the volume for each subsequent data point. Here are the details of the setup we will use:
1. Oracle 10g database as relational source and target
2. Informatica PowerCenter 8.5 as ETL tool
3. Database and Informatica setup on different physical servers using HP UNIX
4. Source database table has no constraint, no index, no database statistics and no partition
5. Source database table is not available in Oracle shared pool before the same is read
6. There is no session-level partition in Informatica PowerCenter
7. There is no parallel hint provided in extraction SQL query
8. The source table has 10 columns and first 8 columns will be used for sorting
9. Informatica sorter has enough cache size
We have used two sets of Informatica PowerCenter mappings created in the Informatica PowerCenter Designer. The first mapping, m_db_side_sort, will use an ORDER BY clause in the Source Qualifier to sort data at the database level. The second mapping, m_Infa_side_sort, will use an Informatica Sorter to sort data at the Informatica level. We have executed these mappings with the different data points and logged the results.
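For reference, the database-side sort in m_db_side_sort would be an SQL override of roughly this shape (table and column names are assumed; the setup above only states that the first 8 of 10 columns are sorted):

SELECT COL1, COL2, COL3, COL4, COL5, COL6, COL7, COL8, COL9, COL10
FROM   SRC_TABLE
ORDER  BY COL1, COL2, COL3, COL4, COL5, COL6, COL7, COL8;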
Result
The following graph shows the performance of Informatica and the database in terms of the time taken by each system to sort data. The time is plotted along the vertical axis and the data volume along the horizontal axis.
Verdict
The above experiment demonstrates that the Oracle database is faster in the SORT operation than Informatica by an average factor of 14%.
Assumption
1. Average server load remains the same during all the experiments
2. Average network speed remains the same during all the experiments
Note
This data can only be used for performance comparison but cannot be used for performance
benchmarking.
When we run a session, the Integration Service may create a reject file for each target instance in the mapping to store the target reject records. With the help of the Session Log and the Reject File we can identify the cause of data rejection in the session. Eliminating the cause of rejection will lead to rejection-free loads in subsequent session runs. If the Informatica Writer or the Target Database rejects data for any valid reason, the Integration Service logs the rejected records into the reject file. Every time we run the session, the Integration Service appends the rejected records to the reject file.
Row Indicator | Indicator Significance | Rejected By
0             | Insert                 | Writer or target
1             | Update                 | Writer or target
2             | Delete                 | Writer or target
3             | Reject                 | Writer
4             | Rolled-back insert     | Writer
5             | Rolled-back update     | Writer
6             | Rolled-back delete     | Writer
7             | Committed insert       | Writer
8             | Committed update       | Writer
9             | Committed delete       | Writer
Next come the column data values, each followed by its Column Indicator, which describes the data quality of the corresponding column.

Column Indicator | Type of data            | Writer Treats As
D                | Valid data              | Good data
O                | Overflowed numeric data | Bad data
N                | Null value              | Bad data
T                | Truncated string data   | Bad data
Note also that the second field contains the column indicator flag value 'D', which signifies that the Row Indicator itself is valid.
Now let us see how data in a bad file looks:
0,D,7,D,John,D,5000.375,O,,N,BrickLand Road Singapore,T
Here 0 followed by D is a valid insert row indicator; 7 and John are valid column values (D); 5000.375 overflowed its numeric precision (O); the next column is null (N); and the address value was truncated (T).
Using incremental aggregation, we apply the captured changes in the source data (the CDC part) to aggregate calculations in a session. If the source changes incrementally and we can capture the changes, we can configure the session to process only those changes. This allows the Integration Service to update the target incrementally, rather than forcing it to delete previously loaded data, process the entire source and recalculate the same aggregations every time the session runs.
Incremental Aggregation
When the session runs with incremental aggregation enabled for the first time, say in the 1st week of Jan, we will use the entire source. This allows the Integration Service to read and store the necessary aggregate data information. In the 2nd week of Jan, when we run the session again, we will filter out only the CDC records from the source, i.e. the records loaded after the initial load. The Integration Service then processes the new data and updates the target accordingly.
Use incremental aggregation when the changes do not significantly change the target. If processing the incrementally changed source alters more than half the existing target, the session may not benefit from using incremental aggregation. In this case, drop the target table, recreate it with the entire source data and recalculate the same aggregations.
Incremental aggregation can be helpful, for example, when we need to load data into monthly facts on a weekly basis.
Let us see a sample mapping to implement incremental aggregation:
Image: Incremental Aggregation Sample Mapping
Look at the Source Qualifier query to fetch the CDC part using a BATCH_LOAD_CONTROL
table that saves the last successful load date for the particular mapping.
Image: Incremental Aggregation Source Qualifier
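A sketch of such a CDC override (INVOICE_SRC and the control-table columns are assumed names; only the BATCH_LOAD_CONTROL table itself comes from the text above):

SELECT s.CUSTOMER_KEY, s.INVOICE_KEY, s.AMOUNT, s.LOAD_DATE
FROM   INVOICE_SRC s
WHERE  s.LOAD_DATE > (SELECT b.LAST_LOAD_DATE
                      FROM   BATCH_LOAD_CONTROL b
                      WHERE  b.MAPPING_NAME = 'm_incremental_agg');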
Now for the most important session properties configuration to implement incremental aggregation: if we want to reinitialize the aggregate cache, say during the first week of every month, we will configure another session identical to the previous one, the only change being that the Reinitialize Aggregate Cache property is checked.
CUSTOMER_KEY | INVOICE_KEY | AMOUNT | LOAD_DATE
1111         | 5001        | 100    | 01/01/2010
2222         | 5002        | 250    | 01/01/2010
3333         | 5003        | 300    | 01/01/2010
1111         | 6007        | 200    | 07/01/2010
1111         | 6008        | 150    | 07/01/2010
2222         | 6009        | 250    | 07/01/2010
4444         | 1234        | 350    | 07/01/2010
5555         | 6157        | 500    | 07/01/2010
After the first Load on 1st week of Jan 2010, the data in the target is as follows:
CUSTOMER_KEY | INVOICE_KEY | MON_KEY | AMOUNT
1111         | 5001        | 201001  | 100
2222         | 5002        | 201001  | 250
3333         | 5003        | 201001  | 300
Now during the 2nd week's load, the session will process only the incremental data in the source, i.e. those records having a load date greater than the last session run date. After the 2nd week's load, the incremental aggregation of the incremental source data with the aggregate cache file data will update the target table with the following dataset:
CUSTOMER_KEY | INVOICE_KEY | MON_KEY | AMOUNT | Remarks/Operation
1111         | 6008        | 201001  | 450    | Updated (100 + 200 + 150)
2222         | 6009        | 201001  | 500    | Updated (250 + 250)
3333         | 5003        | 201001  | 300    | Unchanged
4444         | 1234        | 201001  | 350    | Inserted
5555         | 6157        | 201001  | 500    | Inserted
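The net effect of this incremental run on the target is analogous to the following SQL MERGE (a sketch only; MONTHLY_FACT, INVOICE_SRC and :LAST_RUN_DATE are assumed names that follow the example above):

MERGE INTO MONTHLY_FACT t
USING (SELECT CUSTOMER_KEY,
              MAX(INVOICE_KEY)             AS INVOICE_KEY,
              TO_CHAR(LOAD_DATE, 'YYYYMM') AS MON_KEY,
              SUM(AMOUNT)                  AS AMOUNT
       FROM   INVOICE_SRC
       WHERE  LOAD_DATE > :LAST_RUN_DATE
       GROUP  BY CUSTOMER_KEY, TO_CHAR(LOAD_DATE, 'YYYYMM')) s
ON    (t.CUSTOMER_KEY = s.CUSTOMER_KEY AND t.MON_KEY = s.MON_KEY)
WHEN MATCHED THEN
  UPDATE SET t.AMOUNT = t.AMOUNT + s.AMOUNT, t.INVOICE_KEY = s.INVOICE_KEY
WHEN NOT MATCHED THEN
  INSERT (CUSTOMER_KEY, INVOICE_KEY, MON_KEY, AMOUNT)
  VALUES (s.CUSTOMER_KEY, s.INVOICE_KEY, s.MON_KEY, s.AMOUNT);

In Informatica, of course, the aggregate cache files play the role of the existing target rows in this sketch.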
The first time we run an incremental aggregation session, the Integration Service processes the entire source. At the end of the session, the Integration Service stores the aggregate data for that session run in two files, the index file and the data file. The Integration Service creates the files in the cache directory specified in the Aggregator transformation properties. Each subsequent time we run the session with incremental aggregation, we use only the incremental source changes in the session. For each input record, the Integration Service checks the historical information in the index file for a corresponding group. If it finds a corresponding group, the Integration Service performs the aggregate operation incrementally, using the aggregate data for that group, and saves the incremental change. If it does not find a corresponding group, the Integration Service creates a new group and saves the record data.
When writing to the target, the Integration Service applies the changes to the existing target. It saves modified aggregate data in the index and data files to be used as historical data the next time you run the session.
Each subsequent time we run a session with incremental aggregation, the Integration Service creates
a backup of the incremental aggregation files. The cache directory for the Aggregator transformation
must contain enough disk space for two sets of the files.
The Integration Service creates new aggregate data, instead of using historical data, when we configure the session to reinitialize the aggregate cache, delete the cache files, etc.
When the Integration Service rebuilds incremental aggregation files, the data in the previous files is
lost.
Note: To protect the incremental aggregation files from file corruption or disk failure,
periodically back up the files.
Store  | Quarter1 | Quarter2 | Quarter3 | Quarter4
Store1 | 100      | 300      | 500      | 700
Store2 | 250      | 450      | 650      | 850
The Normalizer returns a row for each store and sales combination. It also returns an index (GCID) that identifies the quarter number:
Target Table

Store   | Sales | Quarter
Store 1 | 100   | 1
Store 1 | 300   | 2
Store 1 | 500   | 3
Store 1 | 700   | 4
Store 2 | 250   | 1
Store 2 | 450   | 2
Store 2 | 650   | 3
Store 2 | 850   | 4
Name | Month | Transportation | House Rent | Food
Sam  | Jan   | 200            | 1500       | 500
John | Jan   | 300            | 1200       | 300
Tom  | Jan   | 300            | 1350       | 350
Sam  | Feb   | 300            | 1550       | 450
John | Feb   | 350            | 1200       | 290
Tom  | Feb   | 350            | 1400       | 350
and we need to transform the source data and populate this as below in the target table:
Name | Month | Expense Type | Expense
Sam  | Jan   | Transport    | 200
Sam  | Jan   | House Rent   | 1500
Sam  | Jan   | Food         | 500
John | Jan   | Transport    | 300
John | Jan   | House Rent   | 1200
John | Jan   | Food         | 300
Tom  | Jan   | Transport    | 300
Tom  | Jan   | House Rent   | 1350
Tom  | Jan   | Food         | 350
... and so on.
Now below is the screenshot of a complete mapping which shows how to achieve this result using Informatica PowerCenter Designer.
Image: Normalization Mapping Example 1
I will explain the mapping further below.
In the Ports tab of the Normalizer the ports will be created automatically as configured in the
Normalizer tab.
Interestingly, we will observe two new columns, namely:
GK_EXPENSEHEAD
GCID_EXPENSEHEAD
The GK field generates a sequence number starting from the value defined in the Sequence field, while GCID holds the value of the occurrence field, i.e. the column number of the input expense head.
The GCID thus tells us which expense corresponds to which field while converting columns to rows.
Below is the screenshot of the expression to handle this GCID efficiently:
Image: Expression to handle GCID
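That expression typically maps the GCID back to a readable expense type. A sketch in Informatica expression syntax (the output port name and literals are assumed, not taken from the screenshot):

EXPENSE_TYPE:
DECODE(GCID_EXPENSEHEAD,
       1, 'Transport',
       2, 'House Rent',
       3, 'Food',
       'Unknown')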
A Lookup cache does not change once built. But what if the underlying lookup table changes after the lookup cache is created? Is there a way so that the cache always remains up to date even if the underlying table changes?
Let's think about this scenario. You are loading your target table through a mapping. Inside the
mapping you have a Lookup and in the Lookup, you are actually looking up the same target
table you are loading. You may ask me, "So? What's the big deal? We all do it quite often...".
And yes you are right. There is no "big deal" because Informatica (generally) caches the lookup table
in the very beginning of the mapping, so whatever record getting inserted to the target table through
the mapping, will have no effect on the Lookup cache. The lookup will still hold the previously cached
data, even if the underlying target table is changing.
But what if you want your Lookup cache to get updated as and when the target table is changing?
What if you want your lookup cache to always show the exact snapshot of the data in your target table
at that point in time? Clearly this requirement will not be fulfilled if you use a static cache. You will need a dynamic cache to handle this.
Updating a master customer table with both new and updated customer information coming together, as shown above.
Loading data into a slowly changing dimension table and a fact table at the same time. Remember, you typically look up the dimension while loading the fact, so you load the dimension table before the fact table. But using a dynamic lookup, you can load both simultaneously.
Loading data from a file with many duplicate records, and eliminating duplicate records in the target by updating the duplicate row, i.e. keeping the most recent row or the initial row.
Loading the same data from multiple sources using a single mapping. Just consider the previous retail business example. If you have more than one shop and Linda has visited two of your shops for the first time, the customer record for Linda will come twice during the same load.
Inserts the row into the cache: if the incoming row is not in the cache, the Integration Service inserts the row into the cache based on the input ports or a generated Sequence-ID. The Integration Service flags the row as insert.
Updates the row in the cache: if the row exists in the cache, the Integration Service updates the row in the cache based on the input ports. The Integration Service flags the row as update.
Makes no change to the cache: this happens when the row exists in the cache and the lookup is configured to insert new rows only; or the row is not in the cache and the lookup is configured to update existing rows only; or the row is in the cache but, based on the lookup condition, nothing changes. The Integration Service flags the row as unchanged.
Notice that the Integration Service actually flags the rows based on the above three conditions. And that is a great thing, because if you know the flag you can reroute the row to achieve different logic. This flag port is called NewLookupRow.
Using the value of this port, the rows can be routed for insert, update, or no action. You just need to use a Router or Filter transformation followed by an Update Strategy.
The actual values that you can expect in the NewLookupRow port are:
0 = Integration Service does not update or insert the row in the cache.
1 = Integration Service inserts the row into the cache.
2 = Integration Service updates the row in the cache.
When the Integration Service reads a row, it changes the lookup cache depending on the results of the
lookup query and the Lookup transformation properties you define. It assigns the value 0, 1, or 2 to the
NewLookupRow port to indicate if it inserts or updates the row in the cache, or makes no change.
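Downstream, the flag is typically consumed by a Router followed by Update Strategy transformations. A sketch of an Update Strategy expression driven by the flag, after a Router has already dropped the NewLookupRow = 0 group (Informatica expression syntax; assumes NewLookupRow is passed through from the Lookup):

IIF(NewLookupRow = 1, DD_INSERT, DD_UPDATE)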
1. Can 2 fact tables share the same dimension tables? How many dimension tables are associated with one fact table in your project?
Ans: Yes.
2. What is ROLAP, MOLAP, and DOLAP?
Ans: ROLAP (Relational OLAP), MOLAP (Multidimensional OLAP), and DOLAP (Desktop OLAP). In these three OLAP architectures, the interface to the analytic layer is typically the same; what is quite different is how the data is physically stored.
In MOLAP, the premise is that online analytical processing is best implemented by storing the data multidimensionally; that is, data must be stored multidimensionally in order to be viewed in a multidimensional manner.
In ROLAP, architects believe the data should be stored in the relational model; OLAP capabilities are provided against the relational database.
DOLAP is a variation that exists to provide portability for the OLAP user. It creates multidimensional datasets that can be transferred from server to desktop, requiring only the DOLAP software to exist on the target system. This provides significant advantages to portable computer users, such as salespeople who are frequently on the road and do not have direct access to their office server.
3. What is an MDDB, and what is the difference between MDDBs and RDBMSs?
Ans: MDDB stands for Multidimensional Database. There are two primary technologies used for storing the data in OLAP applications: multidimensional databases (MDDB) and relational databases (RDBMS). The major difference between MDDBs and RDBMSs is in how they store data. Relational databases store their data in a series of tables and columns. Multidimensional databases, on the other hand, store their data in large multidimensional arrays.
For example, in an MDDB world, you might refer to a sales figure as Sales with Date, Product, and Location coordinates of 12-1-2001, Car, and South, respectively.
Advantages of MDDB:
Retrieval is very fast because:
- The data corresponding to any combination of dimension members can be retrieved with a single I/O.
- Data is clustered compactly in a multidimensional array.
- Values are calculated ahead of time.
- The index is small and can therefore usually reside completely in memory.
Storage is very efficient because:
- The blocks contain only data.
- A single index locates the block corresponding to a combination of sparse dimension numbers.
4. What is MDB modeling and RDB modeling?
Ans: MDB (multidimensional database) modeling designs the data as cubes of measures addressed by dimension coordinates, whereas RDB (relational database) modeling designs the data as relational tables of rows and columns related through keys.
5. What is a mapplet and how do you create a mapplet?
Ans: A mapplet is a reusable object that represents a set of transformations. It allows you to reuse transformation logic and can contain as many transformations as you need.
Create a mapplet when you want to use a standardized set of transformation logic in several mappings. For example, if you have several fact tables that require a series of dimension keys, you can create a mapplet containing a series of Lookup transformations to find each dimension key. You can then use the mapplet in each fact table mapping, rather than recreate the same lookup logic in each mapping.
To create a new mapplet:
1. In the Mapplet Designer, choose Mapplets-Create Mapplet.
2. Enter a descriptive mapplet name.
The recommended naming convention for mapplets is mpltMappletName.
3. Click OK.
The Mapping Designer creates a new mapplet in the Mapplet Designer.
4. Choose Repository-Save.
6. What are transformations used for?
Ans: Transformation is the manipulation of data from how it appears in the source system(s) into another form in the data warehouse or mart, in a way that enhances or simplifies its meaning. In short, you transform data into information.
This includes data merging, cleansing, and aggregation:
Data merging: the process of standardizing data types and fields. Suppose one source system calls integer-type data smallint whereas another calls similar data decimal. The data from the two source systems needs to be rationalized when moved into the Oracle data format called number.
Cleansing: this involves identifying and correcting inconsistencies or inaccuracies:
- Eliminating inconsistencies in the data from multiple sources.
- Converting data from different systems into a single consistent data set suitable for analysis.
- Meeting a standard for establishing data elements, codes, domains, formats and naming conventions.
- Correcting data errors and filling in missing data values.
Aggregation: the process whereby multiple detailed values are combined into a single summary value, typically summation numbers representing dollars spent or units sold. Generates summarized data for use in aggregate fact and dimension tables.
10. What are the modules/tools in Business Objects? Explain their purpose briefly.
Ans: BO Designer, BusinessQuery for Excel, BO Reporter, InfoView, Explorer, WebIntelligence (WEBI), BO Publisher, Broadcast Agent, and BO ZABO.
InfoView: IT portal entry into WebIntelligence and Business Objects. Base module required for all options to view and refresh reports.
Reporter: upgrade to create/modify reports on LAN or Web.
Explorer: upgrade to perform OLAP processing on LAN or Web.
Designer: creates the semantic layer between user and database.
Supervisor: administers and controls access for groups of users.
WebIntelligence: integrated query, reporting, and OLAP analysis over the Web.
Broadcast Agent: used to schedule, run, publish, push, and broadcast prebuilt reports and spreadsheets, including event notification and response capabilities, event filtering, and calendar-based notification, over the LAN, email, pager, fax, Personal Digital Assistant (PDA), Short Messaging Service (SMS), etc.
Set Analyzer: applies set-based analysis to perform functions such as exclusions, intersections, unions, and overlaps visually.
Developer Suite: builds packaged, analytical, or customized applications.
11. What are ad hoc queries and canned queries/reports? How do you create them?
Ans: The data warehouse will contain two types of query. There will be fixed queries that are clearly defined and well understood, such as regular reports, canned queries (standard reports) and common aggregations. There will also be ad hoc queries that are unpredictable, both in quantity and frequency.
Ad hoc queries: ad hoc queries are the starting point for any analysis into a database. Any business analyst wants to know what is inside the database, and proceeds by calculating totals, averages, maximum and minimum values for most attributes within the database. These are the unpredictable element of a data warehouse. It is exactly that ability to run any query when desired, and expect a reasonable response, that makes the data warehouse worthwhile and makes the design such a significant challenge.
The end-user access tools are capable of automatically generating the database query that answers any question posed by the user. The user will typically pose questions in terms that they are familiar with (for example, sales by store last week); this is converted into the database query by the access tool, which is aware of the structure of information within the data warehouse.
Canned queries: canned queries are predefined queries. In most instances, canned queries contain prompts that allow you to customize the query for your specific needs. For example, a prompt may ask you for a school, department, term, or section ID. In this instance you would enter the name of the school, department or term, and the query will retrieve the specified data from the warehouse. You can measure the resource requirements of these queries, and the results can be used for capacity planning and for database design.
The main reason for using a canned query or report rather than creating your own is that your chances of misinterpreting data or getting the wrong answer are reduced. You are assured of getting the right data and the right answer.
12. How many fact tables and how many dimension tables did you use? Which table precedes what?
Ans: http://www.ciobriefings.com/whitepapers/StarSchema.asp
13. What is the difference between STAR SCHEMA and SNOWFLAKE SCHEMA?
Ans: http://www.ciobriefings.com/whitepapers/StarSchema.asp
14. Why did you choose STAR SCHEMA only? What are the benefits of STAR SCHEMA?
Ans: Because of its denormalized structure, i.e., the dimension tables are denormalized. The first (and often only) reason to denormalize is speed. An OLTP structure is designed for data inserts, updates, and deletes, but not data retrieval. Therefore, we can often squeeze some speed out of it by denormalizing some of the tables and having queries go against fewer tables. These queries are faster because they perform fewer joins to retrieve the same recordset. Joins are also confusing to many end users. By denormalizing, we can present the user with a view of the data that is far easier for them to understand.
16. (i) What is FTP? (ii) How do you connect to a remote machine? (iii) Is there another way to use FTP without a special utility?
Ans: (i) The FTP (File Transfer Protocol) utility program is commonly used for copying files to and from other computers. These computers may be at the same site or at different sites thousands of miles apart. FTP is a general protocol that works on UNIX systems as well as other non-UNIX systems.
(ii) Remote connect commands:
ftp machinename
e.g. ftp 129.82.45.181 or ftp iesg
If the remote machine has been reached successfully, FTP responds by asking for a login name and password. When you enter your own login name and password for the remote machine, it returns the prompt
ftp>
and permits you access to your own home directory on the remote machine. You should be able to move around in your own directory and to copy files to and from your local machine using the FTP interface commands.
Note: you can set the mode of file transfer to ASCII (the default, which transmits seven bits per character).
Use the ASCII mode with any of the following:
- Raw data (e.g. *.dat or *.txt, codebooks, or other plain text documents)
- SPSS portable files
- HTML files
Set the mode of file transfer to Binary (the binary mode transmits all eight bits per byte, thus provides less chance of a transmission error, and must be used to transmit files other than ASCII files) for the following types of files:
- SPSS system files
- SAS datasets
- Graphic files (e.g. *.gif, *.jpg, *.bmp, etc.)
- Microsoft Office documents (*.doc, *.xls, etc.)
(iii) Yes. If you are using Windows, you can access a text-based FTP utility from a DOS prompt.
To do this, perform the following steps:
1. From the Start menu, choose Programs, then MS-DOS Prompt.
2. Enter ftp ftp.geocities.com and a prompt will appear; or enter ftp to get the ftp prompt, then enter ftp> open hostname, e.g. ftp> open ftp.geocities.com (this connects to the specified host).
3. Enter your Yahoo! GeoCities member name.
4. Enter your Yahoo! GeoCities password.
You can now use standard FTP commands to manage the files in your Yahoo! GeoCities directory.
17. What command is used to transfer multiple files at a time using FTP?
Ans: mget ==> copies multiple files from the remote machine to the local machine. You will be prompted for a y/n answer before transferring each file; mget * copies all files in the current remote directory to your current local directory, using the same file names.
mput ==> copies multiple files from the local machine to the remote machine.
18. What is a Filter transformation? What options do you have in a Filter transformation?
Ans: The Filter transformation provides the means for filtering records in a mapping. You pass all the rows from a source transformation through the Filter transformation, then enter a filter condition for the transformation. All ports in a Filter transformation are input/output, and only records that meet the condition pass through the Filter transformation.
Note: discarded rows do not appear in the session log or reject files.
To maximize session performance, include the Filter transformation as close to the sources in the mapping as possible. Rather than passing records you plan to discard through the mapping, you then filter out unwanted data early in the flow of data from sources to targets.
You cannot concatenate ports from more than one transformation into the Filter transformation; the input ports for the filter must come from a single transformation. Filter transformations exist within the flow of the mapping and cannot be unconnected. The Filter transformation does not allow setting output default values.
Active transformations that might change the record count include the
following:
Advanced External Procedure
Aggregator
Filter
Joiner
Normalizer
Rank
Source Qualifier
Note: If you use PowerConnect to access ERP sources, the ERP Source
Qualifier is also an active transformation.
You can connect only one of these active transformations to the same transformation or target, since the Informatica Server cannot determine how to concatenate data from different sets of records with different numbers of rows.
Passive transformations that never change the record count include
the following:
Lookup
Expression
External Procedure
Sequence Generator
Stored Procedure
Update Strategy
You can connect any number of these passive transformations, or connect
one active transformation with any number of
passive transformations, to the same transformation or target.
22. What is staging Area and Work Area?
Ans: Staging Area : - Holding Tables on DW Server.
- Loaded from Extract Process
- Input for Integration/Transformation
- May function as Work Areas
- Output to a work area or Fact Table
Work Area: - Temporary Tables
- Memory
23. What is Metadata? (Refer to Data Warehousing in the Real World, page 125.)
Ans: Definition: data about data.
Metadata contains descriptive data for end users. In a data warehouse the
term metadata is used in a number of different
situations.
Metadata is used for:
Data transformation and load
Data management
Query management
Data transformation and load:
Metadata may be used during data transformation and load to describe the source data and any changes that need to be made. The advantage of storing metadata about the data being transformed is that as source data changes, the changes can be captured in the metadata, and transformation programs automatically regenerated. For each source data field the following information is required:
Source Field:
- Unique identifier (to avoid any confusion between 2 fields of the same name from different sources).
- Name (local field name).
- Type (storage type of data, such as character, integer, floating point and so on).
- Location:
  - system (the system it comes from, e.g. the accounting system).
  - object (the object that contains it, e.g. the Account table).
The destination field needs to be described in a similar way to the source:
Destination:
Unique identifier
Name
Type (database data type, such as Char, Varchar, Number and so on).
Table name (name of the table the field will be part of).
The other information that needs to be stored is the transformation or
transformations that need to be applied to turn the source data into the destination
data:
Transformation:
Transformation(s):
- Name
- Language (the name of the language the transformation is written in).
- Module name
- Syntax
The Name is the unique identifier that differentiates this from any other similar transformations.
The Language attribute contains the name of the language that the transformation is written in.
The other attributes are module name and syntax. Generally these will be mutually exclusive, with only one being defined. For simple transformations, such as simple SQL functions, the syntax will be stored. For complex transformations, the name of the module that contains the code is stored instead.
Data management:
Metadata is reqd to describe the data as it resides in the data warehouse.This is
needed by the warhouse manager to allow it to track and control all data
movements. Every object in the database needs to be described.
Query management:
Metadata is also required to describe the queries that run against the data warehouse, for example:
- group by criteria
- sort criteria
- syntax
- execution plan
- resources
25. What are the tasks that are done by Informatica Server?
Ans:The Informatica Server performs the following tasks:
Manages the scheduling and execution of sessions and batches
Executes sessions and batches
Verifies permissions and privileges
Interacts with the Server Manager and pmcmd.
The Informatica Server moves data from sources to targets based on metadata
stored in a repository. For instructions on how to move and transform data, the
Informatica Server reads a mapping (a type of metadata that includes
transformations and source and target definitions). Each mapping uses a session to
define additional information and to optionally override mapping-level options. You
can group multiple sessions to run as a single unit, known as a batch.
26. What are the two programs that communicate with the Informatica Server?
Ans: Informatica provides Server Manager and pmcmd programs to communicate
with the Informatica Server:
Server Manager. A client application used to create and manage sessions and
batches, and to monitor and stop the Informatica Server. You can use information
provided through the Server Manager to troubleshoot sessions and improve session
performance.
pmcmd. A command-line program that allows you to start and stop sessions and
batches, stop the Informatica Server, and verify if the Informatica Server is running.
27. When do you reinitialize the Aggregate Cache?
Ans: Reinitializing the aggregate cache overwrites historical aggregate data with new aggregate data. When you reinitialize the aggregate cache, instead of using the captured changes in the source tables, you typically need to use the entire source table.
For example, you can reinitialize the aggregate cache if the source for a
session changes incrementally every day and
completely changes once a month. When you receive the new monthly source,
you might configure the session to reinitialize
the aggregate cache, truncate the existing target, and use the new source
table during the session.
28. (ii) What are the minimum conditions that you need so as to use the Target Load Order option in the Designer?
Ans: You need to have multiple Source Qualifier transformations.
To specify the order in which the Informatica Server sends data to targets,
create one Source Qualifier or Normalizer
transformation for each target within a mapping. To set the target load order,
you then determine the order in which each
Source Qualifier sends data to connected targets in the mapping.
When a mapping includes a Joiner transformation, the Informatica Server
sends all records to targets connected to that
Joiner at the same time, regardless of the target load order.
28. (iii) How do you set the target load order?
Ans: To set the target load order:
1. Create a mapping that contains multiple Source Qualifier transformations.
2. After you complete the mapping, choose Mappings-Target Load Plan.
A dialog box lists all Source Qualifier transformations in the mapping, as
well as the targets that receive data from each
Source Qualifier.
3. Select a Source Qualifier from the list.
4. Click the Up and Down buttons to move the Source Qualifier within the load order.
5. Repeat steps 3 and 4 for any other Source Qualifiers you wish to reorder.
6. Click OK and choose Repository-Save.
31. (i) What is the difference between a database, a data warehouse and a data mart?
Ans: -- A database is an organized collection of information.
-- A data warehouse is a very large database with special sets of tools to
extract and cleanse data from operational systems
and to analyze data.
-- A data mart is a focused subset of a data warehouse that deals with a
single area of data and is organized for quick
analysis.
32. What is a Data Mart, a Data Warehouse and a Decision Support System? Explain briefly.
Ans: Data Mart:
A data mart is a repository of data gathered from operational data and other sources that is designed to serve a particular community of knowledge workers. In scope, the data may derive from an enterprise-wide database or data warehouse, or be more specialized. The emphasis of a data mart is on meeting the specific demands of a particular group of knowledge users in terms of analysis, content, presentation, and ease of use. Users of a data mart can expect to have data presented in terms that are familiar.
In practice, the terms data mart and data warehouse each tend to imply the
presence of the other in some form. However, most writers using the term seem to
agree that the design of a data mart tends to start from an analysis of user
needs and that a data warehouse tends to start from an analysis of what
data already exists and how it can be collected in such a way that the data
can later be used. A data warehouse is a central aggregation of data (which can be
distributed physically); a data mart is a data repository that may derive from a data
warehouse or not and that emphasizes ease of access and usability for a particular
designed purpose. In general, a data warehouse tends to be a strategic but
somewhat unfinished concept; a data mart tends to be tactical and aimed at meeting
an immediate need.
Data Warehouse:
A data warehouse is a central repository for all or significant parts of the data that
an enterprise's various business systems collect. The term was coined by W. H.
Inmon. IBM sometimes uses the term "information warehouse."
Typically, a data warehouse is housed on an enterprise mainframe server. Data
from various online transaction processing (OLTP) applications and other sources is
selectively extracted and organized on the data warehouse database for use by
analytical applications and user queries. Data warehousing emphasizes the
capture of data from diverse sources for useful analysis and access, but does not
generally start from the point-of-view of the end user or knowledge worker who may
need access to specialized, sometimes local databases. The latter idea is known as
the data mart.
Data mining, Web mining, and decision support systems (DSS) are three kinds of applications that can make use of a data warehouse.
Decision Support System:
A decision support system (DSS) is a computer program application that analyzes
business data and presents it so that users can make business decisions more easily.
41. (ii) What are the differences between Connected lookups and Unconnected lookups?
Ans:
Although both types of lookups perform the same basic task, there are some important differences:

Connected Lookup: receives input values directly from the pipeline; can use a dynamic or static cache; can return multiple columns from the same row; if there is no match, returns the default values of the output ports.
Unconnected Lookup: receives input values from the result of a :LKP expression in another transformation; uses a static cache only; returns one column (the return port) to the calling transformation; if there is no match, returns NULL.
Copying Mapping:
To copy the mapping, open a workbook.
In the Navigator, click and drag the mapping slightly to the right, not
dragging it to the workbook.
When asked if you want to make a copy, click Yes, then enter a new
name and click OK.
Choose Repository-Save.
Copying Sessions:
In the Server Manager, you can copy stand-alone sessions within a folder, or copy
sessions in and out of batches.
To copy a session, you must have one of the following:
Create Sessions and Batches privilege with read and write permission
Super User privilege
To copy a session:
1. In the Server Manager, select the session you wish to copy.
2. Click the Copy Session button or choose Operations-Copy Session.
The Server Manager makes a copy of the session. The Informatica Server names the
copy after the original session, appending a number, such as session_name1.
47. What are shortcuts, and what is their advantage?
Ans: Shortcuts allow you to use metadata across folders without making copies,
ensuring uniform metadata. A shortcut inherits all
properties of the object to which it points. Once you create a shortcut, you can
configure the shortcut name and description.
When the object the shortcut references changes, the shortcut inherits those
changes. By using a shortcut instead of a copy,
you ensure each use of the shortcut exactly matches the original object. For
example, if you have a shortcut to a target
definition, and you add a column to the definition, the shortcut automatically
inherits the additional column.
Shortcuts allow you to reuse an object without creating multiple objects in the
repository. For example, you use a source
definition in ten mappings in ten different folders. Instead of creating 10 copies
of the same source definition, one in each
folder, you can create 10 shortcuts to the original source definition.
You can create shortcuts to objects in shared folders. If you try to create a
shortcut to a non-shared folder, the Designer
creates a copy of the object instead.
(Refer to the Help topics on using shell commands, post-session commands and email.)
Ans: The Informatica Server can perform one or more shell commands before or
after the session runs. Shell commands are
operating system commands. You can use pre- or post- session shell
commands, for example, to delete a reject file or
session log, or to archive target files before the session begins.
The status of the shell command, whether it completed successfully or
failed, appears in the session log file.
To call a pre- or post-session shell command you must:
1. Use any valid UNIX command or shell script for UNIX servers, or any valid DOS or batch file for Windows NT servers.
2. Configure the session to execute the pre- or post-session shell commands.
You can configure a session to stop if the Informatica Server encounters an error
while executing pre-session shell commands.
For example, you might use a shell command to copy a file from one directory to another. For a Windows NT server you would use the following shell command to copy the SALES_ADJ file from the target directory, L, to the source, H:
copy L:\sales\sales_adj H:\marketing\
For a UNIX server, you would use the following command line to perform a similar
operation:
cp sales/sales_adj marketing/
Tip: Each shell command runs in the same environment (UNIX or Windows NT) as
the Informatica Server. Environment settings in one shell command script do not
carry over to other scripts. To run all shell commands in the same environment, call
a single shell script that in turn invokes other scripts.
49. What are Folder Versions?
Ans: In the Repository Manager, you can create different versions within a folder to
help you archive work in development. You can copy versions to other folders as
well. When you save a version, you save all metadata at a particular point in
development. Later versions contain new or modified metadata, reflecting work that
you have completed since the last version.
Maintaining different versions lets you revert to earlier work when needed. By
archiving the contents of a folder into a version each time you reach a development
landmark, you can access those versions if later edits prove unsuccessful.
For example, you might create a folder version after completing a version of a difficult mapping, then
continue working on the mapping. If you are unhappy with the results of subsequent
work, you can revert to the previous version, then create a new version to continue
development. Thus you keep the landmark version intact, but available for
regression.
Note: You can only work within one version of a folder at a time.
50. How do you automate/schedule sessions/batches? Did you use any tool for automating sessions/batches?
51. MQ Series?
52. What procedures do you need to follow before moving mappings/sessions from Testing/Development to Production?
Ans:
53. How many values does it (the Informatica Server) return when it passes through a Connected Lookup and an Unconnected Lookup?
Ans: A Connected Lookup can return multiple values, whereas an Unconnected Lookup returns only one value, that is, the Return Value.
54. What is the difference between PowerMart and PowerCenter in 4.7.2?
Ans: If You Are Using PowerCenter
PowerCenter allows you to register and run multiple Informatica Servers against the
same repository. Because you can run
these servers at the same time, you can distribute the repository session load
across available servers to improve overall
performance.
With PowerCenter, you receive all product functionality, including distributed
metadata, the ability to organize repositories into
a data mart domain and share metadata across repositories.
A PowerCenter license lets you create a single repository that you can
configure as a global repository, the core component
of a data warehouse.
If You Are Using PowerMart
This version of PowerMart includes all features except distributed metadata and
multiple registered servers. Also, the various
options available with PowerCenter (such as PowerCenter Integration Server for BW,
PowerConnect for IBM DB2,
PowerConnect for SAP R/3, and PowerConnect for PeopleSoft) are not available with
PowerMart.
55. Which transformation do you use for each of the following tasks?
Ans:
Task | Transformation
---------------------------------------
Calculate a value | Expression
Perform aggregate calculations | Aggregator
Modify text | Expression
Filter records | Filter, Source Qualifier
Order records queried by the Informatica Server | Source Qualifier
Call a stored procedure | Stored Procedure
Call a procedure in a shared library or in the COM layer of Windows NT | External Procedure
Generate primary keys | Sequence Generator
Limit records to a top or bottom range | Rank
Normalize records, including those read from COBOL sources | Normalizer
Look up values | Lookup
Determine whether to insert, delete, update, or reject records | Update Strategy
Join records from different databases or flat file systems | Joiner
56. Expressions in Transformations: explain briefly how you use them.
Ans: Expressions in Transformations
To transform data passing through a transformation, you can write an
expression. The most obvious examples of these are the
Expression and Aggregator transformations, which perform calculations
on either single values or an entire range of values
within a port. Transformations that use expressions include the following:
-------------------------------------------------------------
Transformation | How It Uses Expressions
-------------------------------------------------------------
Expression | Calculates the result of an expression for each row passing through the transformation, using values from one or more ports.
Aggregator | Calculates the result of an aggregate expression, such as a sum or average, based on all data passing through a port or on groups within that data.
Filter | Filters records based on a condition you enter using an expression.
Rank | Filters the top or bottom range of records, based on a condition you enter using an expression.
Update Strategy | Assigns a numeric code to each record based on an expression, indicating whether the Informatica Server should use the information in the record to insert, delete, or update the target.
In each transformation, you use the Expression Editor to enter the expression. The
Expression Editor supports the transformation language for building expressions. The
transformation language uses SQL-like functions, operators, and other components
to build the expression. For example, as in SQL, the transformation language
includes the functions COUNT and SUM. However, the PowerMart/PowerCenter
transformation language includes additional functions not found in SQL.
When you enter the expression, you can use values available through ports.
For example, if the transformation has two input ports representing a price and
sales tax rate, you can calculate the final sales tax using these two values. The ports
used in the expression can appear in the same transformation, or you can use output
ports in other transformations.
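As a rough illustration of the price and tax example above, the same logic written as plain SQL (PRICE, TAX_RATE and the ORDER_ITEMS table are hypothetical names, not from the original text):
-- Compute the final price from two input values, as the port expression would.
SELECT price,
       tax_rate,
       price + (price * tax_rate) AS final_price
FROM   order_items;
The Expression Editor version would reference the input ports directly, e.g. PRICE + (PRICE * TAX_RATE).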
57. In case a flat file (which comes through FTP as source) has not arrived, what happens? Where do you set this option?
Ans: You get a fatal error, which causes the server to fail/stop the session.
You can set the Event-Based Scheduling option in Session Properties under the General tab --> Advanced options.
----------------------------------------------------
Event-Based | Required/Optional | Description
----------------------------------------------------
Indicator File to Wait For | Optional | Required to use event-based scheduling. Enter the indicator file (or directory and file) whose arrival schedules the session. If you do not enter a directory, the Informatica Server assumes the file appears in the server variable directory $PMRootDir.
58. What is the Test Load Option and when do you use it in the Server Manager?
Ans: When testing sessions in development, you may not need to process the entire source. If this is true, use the Test Load Option (Session Properties, General tab, Target Options: choose Target Load options as Normal (option button), with Test Load checked (check box) and No. of rows to test, e.g. 2000 (text box with scrolls)). You can also click the Start button.
59. SCD Type 2 and SGT difference?
60. Differences between 4.7 and 5.1?
61. Tuning Informatica Server for improving performance? Performance Issues?
Ans: See /* C:\pkar\Informatica\Performance Issues.doc */
62. What is Override Option? Which is better?
63. What will happen if you increase the buffer size?
64. What will happen if you increase commit intervals? And also decrease commit intervals?
65. What kind of complex mappings did you build? And what sort of problems did you face?
66. If you have 10 mappings designed and you need to implement some changes (maybe in an existing mapping, or a new mapping needs to be designed), then how much time does it take, from easier to complex?
67. Can you refresh the Repository in 4.7 and 5.1? And can you also refresh pieces (partially) of the repository in 4.7 and 5.1?
68. What is BI?
Ans: http://www.visionnet.com/bi/index.shtml
69. Benefits of BI?
Ans: http://www.visionnet.com/bi/bi-benefits.shtml
70. BI Faq
Ans: http://www.visionnet.com/bi/bi-faq.shtml
71. What is difference between data scrubbing and data cleansing?
Ans: Scrubbing data is the process of cleaning up the junk in legacy data and
making it accurate and useful for the next generations
of automated systems. This is perhaps the most difficult of all conversion
activities. Very often, this is made more difficult when
the customer wants to make good data out of bad data. This is the dog work.
It is also the most important, and it cannot be done without the active participation of the user.
DATA CLEANSING - a two-step process including DETECTION and then CORRECTION of errors in a data set.
72. What is Metadata and Repository?
Ans:
Metadata. Data about data.
It contains descriptive data for end users.
Contains data that controls the ETL processing.
Contains data about the current state of the data warehouse.
ETL updates metadata, to provide the most current state.
Repository. The place where you store the metadata is called a repository. The
more sophisticated your repository, the more
complex and detailed metadata you can store in it. PowerMart and
PowerCenter use a relational database as the
repository.
77. How do you select duplicate rows using Informatica, i.e., how do you use Max(Rowid)/Min(Rowid) in Informatica?
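The answer is not given above, but a common approach is an Oracle-style SQL override in the Source Qualifier; here is a hedged sketch assuming a hypothetical EMP table whose logical key is EMPNO:
-- Rows other than the MAX(ROWID) row for each EMPNO are the duplicates.
SELECT *
FROM   emp e1
WHERE  e1.rowid <> (SELECT MAX(e2.rowid)
                    FROM   emp e2
                    WHERE  e2.empno = e1.empno);
Replacing <> with = (or MAX with MIN) flips the query between the rows to discard and the rows to keep.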
Dimensional Data Model :
Dimensional data model is most often used in data warehousing systems. This is different from
the 3rd normal form, commonly used for transactional (OLTP) type systems. As you can
imagine, the same data would then be stored differently in a dimensional model than in a 3rd
normal form model.
To understand dimensional data modeling, let's define some of the terms commonly used in this
type of modeling:
Dimension: A category of information. For example, the time dimension.
Attribute: A unique level within a dimension. For example, Month is an attribute in the Time
Dimension.
Hierarchy: The specification of levels that represents relationship between different attributes
within a dimension. For example, one possible hierarchy in the Time dimension is Year -->
Quarter --> Month --> Day.
Fact Table: A fact table is a table that contains the measures of interest. For example, sales
amount would be such a measure. This measure is stored in the fact table with the appropriate
granularity. For example, it can be sales amount by store by day. In this case, the fact table
would contain three columns: A date column, a store column, and a sales amount column.
Lookup Table: The lookup table provides the detailed information about the attributes. For
example, the lookup table for the Quarter attribute would include a list of all of the quarters
available in the data warehouse. Each row (each quarter) may have several fields, one for the
unique ID that identifies the quarter, and one or more additional fields that specifies how that
particular quarter is represented on a report (for example, first quarter of 2001 may be
represented as "Q1 2001" or "2001 Q1").
A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more
lookup tables, but fact tables do not have direct relationships to one another. Dimensions and
hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup
tables.
In designing data models for data warehouses / data marts, the most commonly used schema types are Star Schema and Snowflake Schema.
Star Schema: In the star schema design, a single object (the fact table) sits in the middle and
is radially connected to other surrounding objects (dimension lookup tables) like a star. A star
schema can be simple or complex. A simple star consists of one fact table; a complex star can
have more than one fact table.
Snowflake Schema: The snowflake schema is an extension of the star schema, where each
point of the star explodes into more points. The main advantage of the snowflake schema is
the improvement in query performance due to minimized disk storage requirements and joining
smaller lookup tables. The main disadvantage of the snowflake schema is the additional maintenance effort needed due to the increased number of lookup tables.
Whether one uses a star or a snowflake largely depends on personal preference and business
needs. Personally, I am partial to snowflakes, when there is a business case to analyze the
information at that particular level.
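As a rough sketch (all table and column names here are hypothetical, not from the text above), the sales-by-store-by-day star described earlier could be declared as:
-- Dimension (lookup) tables surround a single central fact table.
CREATE TABLE date_dim  (date_key  INTEGER PRIMARY KEY, calendar_date DATE);
CREATE TABLE store_dim (store_key INTEGER PRIMARY KEY, store_name VARCHAR(50));
-- The fact table stores the measure at day/store granularity.
CREATE TABLE sales_fact (
    date_key     INTEGER REFERENCES date_dim (date_key),
    store_key    INTEGER REFERENCES store_dim (store_key),
    sales_amount DECIMAL(12,2)
);
A snowflake version would further normalize store_dim into, say, separate store and region tables.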
Slowly Changing Dimensions:
The "Slowly Changing Dimension" problem is a common one particular to data warehousing. In a
nutshell, this applies to cases where the attribute for a record varies over time. We give an
example below:
Christina is a customer with ABC Inc. She first lived in Chicago, Illinois. So, the original entry in
the customer lookup table has the following record:
Customer Key | Name | State
1001 | Christina | Illinois
At a later date, she moved to Los Angeles, California in January 2003. How should ABC Inc.
now modify its customer table to reflect this change? This is the "Slowly Changing Dimension"
problem.
There are in general three ways to solve this type of problem, and they are categorized as
follows:
Type 1: The new record replaces the original record. No trace of the old record exists.
Type 2: A new record is added into the customer dimension table. Therefore, the customer is
treated essentially as two people.
Type 3: The original record is modified to reflect the change.
We next take a look at each of the scenarios and what the data model and the data look like for each of them. Finally, we compare and contrast the three alternatives.
Type 1 Slowly Changing Dimension:
In Type 1 Slowly Changing Dimension, the new information simply overwrites the original
information. In other words, no history is kept.
In our example, recall we originally have the following table:
Customer Key | Name | State
1001 | Christina | Illinois
After Christina moved from Illinois to California, the new information replaces the original record,
and we have the following table:
Customer Key | Name | State
1001 | Christina | California
Advantages:
- This is the easiest way to handle the Slowly Changing Dimension problem, since there is no
need to keep track of the old information.
Disadvantages:
- All history is lost. By applying this methodology, it is not possible to trace back in history. For
example, in this case, the company would not be able to know that Christina lived
in Illinois before.
Usage:
About 50% of the time.
When to use Type 1:
Type 1 slowly changing dimension should be used when it is not necessary for the data
warehouse to keep track of historical changes.
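A minimal SQL sketch of the Type 1 change above (CUSTOMER_DIM is a hypothetical name for the customer lookup table):
-- Type 1: overwrite in place; no history is kept.
UPDATE customer_dim
SET    state = 'California'
WHERE  customer_key = 1001;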
Type 2 Slowly Changing Dimension:
In Type 2 Slowly Changing Dimension, a new record is added to the table to represent the new
information. Therefore, both the original and the new record will be present. The new record gets its own primary key.
In our example, recall we originally have the following table:
Customer Key | Name | State
1001 | Christina | Illinois
After Christina moved from Illinois to California, we add the new information as a new row into
the table:
Customer Key | Name | State
1001 | Christina | Illinois
1005 | Christina | California
Advantages:
- This allows us to accurately keep all historical information.
Disadvantages:
- This will cause the size of the table to grow fast. In cases where the number of rows for the
table is very high to start with, storage and performance can become a concern.
- This necessarily complicates the ETL process.
Usage:
About 50% of the time.
When to use Type 2:
Type 2 slowly changing dimension should be used when it is necessary for the data warehouse
to track historical changes.
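A minimal SQL sketch of the Type 2 change above, using the same hypothetical CUSTOMER_DIM table; the new surrogate key 1005 matches the example:
-- Type 2: insert a new row under a new surrogate key; the old row is untouched.
INSERT INTO customer_dim (customer_key, name, state)
VALUES (1005, 'Christina', 'California');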
Type 3 Slowly Changing Dimension :
In Type 3 Slowly Changing Dimension, there will be two columns to indicate the particular
attribute of interest, one indicating the original value, and one indicating the current value.
There will also be a column that indicates when the current value becomes active.
In our example, recall we originally have the following table:
Customer Key | Name | State
1001 | Christina | Illinois
To accommodate Type 3 Slowly Changing Dimension, we will now have the following columns:
Customer Key | Name | Original State | Current State | Effective Date
After Christina moved from Illinois to California, the original information gets updated, and we
have the following table (assuming the effective date of change is January 15, 2003):
Customer Key | Name | Original State | Current State | Effective Date
1001 | Christina | Illinois | California | 15-JAN-2003
Advantages:
- This does not increase the size of the table, since new information is updated.
- This allows us to keep some part of history.
Disadvantages:
- Type 3 will not be able to keep all history where an attribute is changed more than once. For
example, if Christina later moves to Texas on December 15, 2003, the California information
will be lost.
Usage:
Type 3 is rarely used in actual practice.
When to use Type 3:
Type 3 slowly changing dimension should only be used when it is necessary for the data warehouse to track historical changes, and when such changes will only occur a finite number of times.
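A minimal SQL sketch of the Type 3 change above, again with the hypothetical CUSTOMER_DIM table:
-- Type 3: keep the prior value in a dedicated column instead of adding a row.
-- original_state stays 'Illinois'; only current_state and the date change.
UPDATE customer_dim
SET    current_state  = 'California',
       effective_date = DATE '2003-01-15'
WHERE  customer_key = 1001;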
Surrogate key :
A surrogate key is frequently a sequential number but doesn't have to be. Having the key
independent of all other columns insulates the database relationships from changes in data
values or database design and guarantees uniqueness.
Some database designers use surrogate keys religiously regardless of the suitability of other
candidate keys. However, if a good key already exists, the addition of a surrogate key will
merely slow down access, particularly if it is indexed.
The concept of a surrogate key is important in data warehousing; surrogate means deputy or substitute. A surrogate key is a small integer (say 4 bytes) that can uniquely identify a record in the dimension table; however, it has no business meaning. Data warehouse experts suggest that the production keys used in the source databases should not be used as primary keys in the dimension tables; in their place, surrogate keys, which are generated automatically, should be used.
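As a hedged illustration, one common way to generate such keys in Oracle-style SQL (sequence and table names are hypothetical):
-- A sequence supplies meaningless, automatically generated surrogate keys.
CREATE SEQUENCE customer_dim_seq START WITH 1 INCREMENT BY 1;
INSERT INTO customer_dim (customer_key, name, state)
VALUES (customer_dim_seq.NEXTVAL, 'Christina', 'Illinois');
In Informatica itself, the Sequence Generator transformation plays the same role.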
Logical Data Model :
Normalization occurs at this level.
At this level, the data modeler attempts to describe the data in as much detail as possible,
without regard to how they will be physically implemented in the database.
In data warehousing, it is common for the conceptual data model and the logical data model to
be combined into a single step (deliverable).
OLAP stands for On-Line Analytical Processing. The first attempt to provide a definition to OLAP
was by Dr. Codd, who proposed 12 rules for OLAP. Later, it was discovered that this particular
white paper was sponsored by one of the OLAP tool vendors, thus causing it to lose objectivity.
The OLAP Report has proposed the FASMI test: Fast Analysis of Shared Multidimensional Information. For a more detailed description of both Dr. Codd's rules
and the FASMI test, please visit The OLAP Report.
For people on the business side, the key feature out of the above list is "Multidimensional." In
other words, the ability to analyze metrics in different dimensions such as time, geography,
gender, product, etc. For example, sales for the company is up. What region is most
responsible for this increase? Which store in this region is most responsible for the increase?
What particular product category or categories contributed the most to the increase? Answering
these types of questions in order means that you are performing an OLAP analysis.
Depending on the underlying technology used, OLAP can be broadly divided into two different
camps: MOLAP and ROLAP. A discussion of the different OLAP types can be found in the MOLAP,
ROLAP, and HOLAP section.
In the OLAP world, there are mainly two different types: Multidimensional OLAP (MOLAP) and
Relational OLAP (ROLAP). Hybrid OLAP (HOLAP) refers to technologies that combine MOLAP and
ROLAP.
MOLAP
This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a
multidimensional cube. The storage is not in the relational database, but in proprietary
formats.
Advantages:
Excellent performance: MOLAP cubes are built for fast data retrieval and are optimal for slicing and dicing operations.
Disadvantages:
Limited in the amount of data it can handle: Because all calculations are performed when the cube is built, it is not possible to include a large amount of data in the cube itself. This is not to say that the data in the cube cannot be derived from a large amount of data. Indeed, this is possible. But in this case, only summary-level information will be included in the cube itself.
ROLAP
Advantages:
Can handle large amounts of data: The data size limitation of ROLAP technology is the limitation on data size of the underlying relational database. In other words, ROLAP itself places no limitation on data amount.
Disadvantages:
Performance can be slow: Because each ROLAP report is essentially a SQL query (or multiple SQL queries) in the relational database, the query time can be long if the underlying data size is large.
Q. What are Target Types on the Server?
A. Target Types are File, Relational and ERP.
Q. How do you identify existing rows of data in the target table using
lookup transformation?
A. There are two ways to look up the target table to verify whether a row exists or not:
1. Use a connected dynamic cache Lookup, then check the value of the NewLookupRow output port to decide whether the incoming record already exists in the table/cache or not.
2. Use an unconnected Lookup, call it from an Expression transformation, and check the lookup condition port value (Null / Not Null) to decide whether the incoming record already exists in the table or not.
Q. What are Aggregate transformations?
A. The Aggregator transform is much like the GROUP BY clause in traditional SQL.
This particular transform is a connected/active transform which can take the incoming data from the mapping pipeline and group it based on the group-by ports specified, and can calculate aggregate functions (avg, sum, count, stddev, etc.) for each of those groups.
From a performance perspective, if your mapping has an Aggregator transform, use filters and sorters very early in the pipeline if there is any need for them.
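As a rough SQL analogy (EMP, DEPTNO and SAL are hypothetical names), an Aggregator grouping on a DEPTNO port behaves like:
-- Equivalent GROUP BY logic for an Aggregator with group-by port deptno.
SELECT   deptno,
         SUM(sal) AS total_sal,
         AVG(sal) AS avg_sal,
         COUNT(*) AS row_count
FROM     emp
GROUP BY deptno;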
Q. What are various types of Aggregation?
A. Various types of aggregation are SUM, AVG, COUNT, MAX, MIN, FIRST, LAST,
MEDIAN, PERCENTILE, STDDEV, and VARIANCE.
Q. What are Dimensions and various types of Dimension?
A. Dimensions are classified into 3 types.
1. SCD TYPE 1(Slowly Changing Dimension): this contains current data.
2. SCD TYPE 2(Slowly Changing Dimension): this contains current data + complete
historical data.
3. SCD TYPE 3 (Slowly Changing Dimension): this contains current data + partially historical data.
Q. What are 2 modes of data movement in Informatica Server?
A. The data movement mode depends on whether Informatica Server should process
single byte or multi-byte character data. This mode selection can affect the
enforcement of code page relationships and code page validation in the Informatica
Client and Server.
a) Unicode - the IS allows 2 bytes for each character and uses an additional byte for each non-ASCII character (such as Japanese characters)
b) ASCII - the IS holds all data in a single byte
The IS data movement mode can be changed in the Informatica Server configuration
parameters. This comes into effect once you restart the Informatica Server.
Writer Thread - one thread for each partition; if a target exists in the source pipeline, it writes to the target.
Transformation Thread - one or more transformation threads for each partition.
Q. Where should you place the flat file to import the flat file definition to
the designer?
A. Place it in a local folder.
Q. Which transformation do you need while using COBOL sources as source definitions?
A. The Normalizer transformation, which is used to normalize the data, since COBOL sources often consist of denormalized data.
Q. How can you create or import flat file definition in to the warehouse
designer?
A. You can create a flat file definition in the Warehouse Designer. In the Warehouse Designer, you can create a new target: select the type as flat file. Save it, and you can enter various columns for that created target by editing its properties. Once the target is created, save it. You can import it from the mapping designer.
Q. What is a mapplet?
A. A mapplet should have a mapplet Input transformation, which receives input values, and an Output transformation, which passes the final modified data back to the mapping. A mapplet is a set of transformations whose logic can be reused; when the mapplet is displayed within the mapping, only the input and output ports are displayed, so that the internal logic is hidden from the end-user's point of view.
Q. What is a transformation?
A. It is a repository object that generates, modifies or passes data.
Q. What are the designer tools for creating transformations?
A. Mapping designer
Transformation developer
Mapplet designer
Q. What are connected and unconnected transformations?
A. Connected transformation: a transformation which participates in the mapping data flow. A connected transformation can receive multiple inputs and provide multiple outputs.
Unconnected: an unconnected transformation does not participate in the mapping data flow. It can receive multiple inputs and provides a single output.
Static cache: caches the lookup table and lookup values in the cache for each row that comes into the transformation. When the lookup condition is true, the Informatica server does not update the cache while it processes the Lookup transformation.
Dynamic cache: If you want to cache the target table and insert new
rows into cache and the target, you can create a look up
transformation to use dynamic cache. The Informatica server
dynamically inserts data to the target table.
Shared cache: You can share the lookup cache between multiple transformations. You can share an unnamed cache between transformations in the same mapping.
Q: What do you know about Informatica and ETL?
A: Informatica is a GUI-based ETL (extract, transform, load) tool.
Q: FULL and DELTA files. Historical and Ongoing load.
A: FULL file contains complete data as of today including history data, DELTA file contains
only the changes since last extract.
Q: Power Center/ Power Mart which products have you worked with?
A: Power Center will have Global and Local repository, whereas Power Mart will have only
Local repository.
Q: Explain what are the tools you have used in Power Center and/or Power
Mart?
A: Designer, Server Manager, and Repository Manager.
Q: What is a Mapping?
A: Mapping Represent the data flow between source and target
Q: What are the components must contain in Mapping?
A: Source definition, Transformation, Target Definition and Connectors
Q: What is Transformation?
A: Transformation is a repository object that generates, modifies, or passes data. A transformation performs a specific function. There are two types of transformations:
Active: affects or can change the number of rows that pass through it. E.g.: Aggregator, Filter, Joiner, Normalizer, Rank, Router, Source Qualifier, Update Strategy, ERP Source Qualifier, Advanced External Procedure.
Passive: does not change the number of rows that pass through it. E.g.: Expression, External Procedure, Input, Lookup, Stored Procedure, Output, Sequence Generator, XML Source Qualifier.
Q: What transformations are available in Informatica?
A:
Source Qualifier (XML, ERP, MQ)
Joiner
Expression
Lookup
Filter
Router
Sequence Generator
Aggregator
Update Strategy
Stored Proc
External Proc
Advanced External Proc
Rank
Normalizer
Q: What are active/passive transformations?
A: Passive transformations do not change the number of rows passing through them, whereas active transformations change the number of rows passing through them.
Active: Filter, Aggregator, Rank, Joiner, Source Qualifier
Passive: Expression, Lookup, Stored Proc, Seq. Generator
Q: What are connected/unconnected transformations?
A:
Connected transformations are part of the mapping pipeline. The input and output ports
are connected to other transformations.
Unconnected transformations are not part of the mapping pipeline. They are not linked in
the map with any input or output ports. Eg. In Unconnected Lookup you can pass multiple
values to unconnected transformation but only one column of data will be returned from
the transformation. Unconnected: Lookup, Stored Proc.
Q: In target load ordering, what do you order - Targets or Source Qualifiers?
A: Source Qualifiers. If there are multiple targets in the mapping, which are populated
from multiple sources, then we can use Target Load ordering.
Q: Have you used constraint-based load ordering? Where do you set this?
A: Constraint based loading can be used when you have multiple targets in the mapping
and the target tables have a PK-FK relationship in the database. It can be set in the
session properties. You have to set the Source Treat Rows as: INSERT and check the box
Constraint based load ordering in Advanced Tab.
Q: If you have a FULL file that you have to match and load into a corresponding
table, how will you go about it? Will you use Joiner transformation?
A: Use Joiner and join the file and Source Qualifier.
Q: If you have 2 files to join, which file will you use as the master file?
A: Use the file with lesser nos. of records as master file.
Q: If a sequence generator (with increment of 1) is connected to (say) 3 targets
and each target uses the NEXTVAL port, what value will each target get?
A: Each target will get values in multiples of 3, i.e., the generated sequence is distributed across the targets (e.g., target 1 gets 1, 4, 7, ...; target 2 gets 2, 5, 8, ...; target 3 gets 3, 6, 9, ...).
Q: Have you used the Abort, Decode functions?
A: Abort can be used to abort/stop the session on an error condition.
If the primary key column contains NULL and you need to stop the session from continuing, then you may use the ABORT function in the default value for the port. It can be used with the IIF and DECODE functions to abort the session.
Q: Have you used SQL Override?
A: It is used to override the default SQL generated in the Source Qualifier / Lookup
transformation.
Q: If you make a local transformation reusable by mistake, can you undo the
reusable action?
A: No
Q: What is the difference between filter and router transformations?
A: The Filter can filter records based on ONE condition only, whereas the Router can be used to filter records on multiple conditions.
Q: Lookup transformations: Cached/un-cached
A: When the Lookup Transformation is cached the Informatica Server caches the data and
index. This is done at the beginning of the session before reading the first record from the
source. If the Lookup is uncached then the Informatica reads the data from the database
for every record coming from the Source Qualifier.
Q: Connected/unconnected if there is no match for the lookup, what is
returned?
A: Unconnected Lookup returns NULL if there is no matching record found in the Lookup
transformation.
Q: What is persistent cache?
A: When the Lookup is configured to be a persistent cache Informatica server does not
delete the cache files after completion of the session. In the next run Informatica server
uses the cache file from the previous session.
Q: What is dynamic lookup strategy?
A: The Informatica server compares the data in the lookup table and the cache, if there is
no matching record found in the cache file then it modifies the cache files by inserting the
record. You may use only (=) equality in the lookup condition.
If multiple matches are found in the lookup then Informatica fails the session. By default
the Informatica server creates a static cache.
Q: Mapplets: What are the 2 transformations used only in mapplets?
A: Mapplet Input / Source Qualifier, Mapplet Output
Q: Have you used Shortcuts?
A: Shortcuts may be used to refer to another mapping. Informatica refers to the original
mapping. If any changes are made to the mapping / mapplet, it is immediately reflected
in the mapping where it is used.
Q: If you used a database when importing sources/targets that was dropped
later on, will your mappings still be valid?
A: No
Q: In expression transformation, how can you store a value from the previous
row?
A: By creating a variable in the transformation.
Q: How does Informatica do variable initialization? Number/String/Date
A: Number 0, String blank, Date 1/1/1753
Q: Have you used the Informatica debugger?
A: Debugger is used to test the mapping during development. You can give breakpoints in
the mappings and analyze the data.
Q: What do you know about the Informatica server architecture? Load Manager,
DTM, Reader, Writer, Transformer.
A:
Load Manager is the first process started when the session runs. It checks for validity of
mappings, locks sessions and other objects.
DTM process is started once the Load Manager has completed its job. It starts a thread for
each pipeline.
Reader scans data from the specified sources.
Writer manages the target/output data.
Transformer performs the task specified in the mapping.
Q: Have you used pmcmd command? What can you do using this command?
A: pmcmd is a command line program. Using this command
You can start sessions
Stop sessions
Recover session
Q: What are the two default repository user groups
A: Administrators and Public
Q: What are the Privileges of Default Repository and Extended Repository user?
A:
Default Repository Privileges
Use Designer
Browse Repository
Create Session and Batches
Extended Repository Privileges
Session Operator
Administer Repository
Administer Server
Super User
Q: How many different locks are available for repository objects
A: There are five kinds of locks available on repository objects:
Read lock. Created when you open a repository object in a folder for which you do not have
write permission. Also created when you open an object with an existing write lock.
Write lock. Created when you create or edit a repository object in a folder for which you
have write permission.
Execute lock. Created when you start a session or batch, or when the Informatica Server
starts a scheduled session or batch.
Fetch lock. Created when the repository reads information about repository objects from
the database.
Save lock. Created when you save information to the repository.
Q: What is Session Process?
A: The Load Manager process. Starts the session, creates the DTM process, and sends
post-session email when the session completes.
Q: What is DTM process?
A: The DTM process creates threads to initialize the session, read, write, transform data,
and handle pre and post-session operations.
Q: When the Informatica Server runs a session, what are the tasks handled?
A:
Load Manager (LM):
LM locks the session and reads session properties.
LM reads the parameter file.
LM expands the server and session variables and parameters.
LM verifies permissions and privileges.
LM validates source and target code pages.
LM creates the session log file.
LM creates the DTM (Data Transformation Manager) process.
Unshared: If the lookup table is used in more than one transformation in the same mapping, the cache built for the first lookup can be used for the others. It cannot be used across mappings.
Shared:
If the lookup table is used in more than one transformation/mapping then the cache built
for the first lookup can be used for the others. It can be used across mappings.
Persistent :
If the cache generated for a Lookup needs to be preserved for subsequent use then
persistent cache is used. It will not delete the index and data files. It is useful only if the
lookup table remains constant.
Incremental aggregation?
In the Session property tab there is an option for performing incremental aggregation.
When the Informatica server performs incremental aggregation , it passes new source
data through the mapping and uses historical cache (index and data cache) data to
perform new aggregation calculations incrementally.
What are the three areas where the rows can be flagged for particular
treatment?
In the mapping (Update Strategy), in the session's Treat Source Rows As setting, and in the session Target Options.
What is the use of Forward/Reject rows in Mapping?
2. Sources
Set a filter transformation after each SQ and verify that no records pass through. If the time taken is the same, then there is a problem.
You can also identify the source problem by a Read Test Session, where we copy the mapping with the sources and SQ, remove all transformations, and connect to a file target. If the performance is the same, then there is a source bottleneck.
Using a database query: copy the read query directly from the log and execute it against the source database with a query tool. If the time it takes to execute the query and the time to fetch the first row are significantly different, then the query can be modified using optimizer hints.
Solutions:
Optimize Queries using hints.
Use indexes wherever possible.
3. Mapping
If both source and target are OK, then the problem could be in the mapping.
Add a filter transformation before the target; if the time is the same, then there is a problem.
(OR) Look at the performance monitor in the Session property sheet and view the counters.
Solutions:
High error rows and rows in lookup cache indicate a mapping bottleneck.
Optimize Single Pass Reading.
Optimize Lookup transformation:
1. Caching the lookup table: When caching is enabled, the Informatica server caches the lookup table and queries the cache during the session. When this option is not enabled, the server queries the lookup table on a row-by-row basis.
Static, Dynamic, Shared, Un-shared and Persistent cache.
2. Optimizing the lookup condition: Whenever multiple conditions are placed, the condition with the equality sign should take precedence.
3. Indexing the lookup table: The cached lookup table should be indexed on the ORDER BY columns; the session log contains the ORDER BY statement. For the un-cached lookup, since the server issues a SELECT statement for each row passing into the lookup transformation, it is better to index the lookup table on the columns in the condition (see the sketch below).
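A hedged example of such an index (index, table and column names are hypothetical):
-- Index the lookup table on the columns from the session log's ORDER BY clause.
CREATE INDEX idx_cust_lkp ON customer_lkp (customer_id, effective_date);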
Optimize Filter transformation:
You can improve efficiency by filtering early in the data flow. Instead of using a filter transformation halfway through the mapping to remove a sizable amount of data, use a Source Qualifier filter to remove those same rows at the source. If it is not possible to move the filter into the SQ, move the filter transformation as close to the Source Qualifier as possible to remove unnecessary data early in the data flow.
Optimize Aggregate transformation:
1. Group by simpler columns, preferably numeric columns.
2. Use sorted input. Sorted input decreases the use of aggregate caches; the server assumes all input data are sorted and performs aggregate calculations as it reads.
3. Use incremental aggregation in the session property sheet.
Optimize Seq. Generator transformation:
1. Try creating a reusable Seq. Generator transformation and use it in multiple mappings.
2. The Number of Cached Values property determines the number of values the Informatica server caches at one time.
Optimize Expression transformation:
1. Factor out common logic.
2. Minimize aggregate function calls.
3. Replace common sub-expressions with local variables.
4. Use operators instead of functions.
4. Sessions
If you do not have a source, target, or mapping bottleneck, you may have a session bottleneck.
You can identify a session bottleneck by using the performance details. The Informatica server creates performance details when you enable Collect Performance Data on the General tab of the session properties.
Performance details display information about each Source Qualifier, target definition, and individual transformation. All transformations have some basic counters that indicate the number of input rows, output rows, and error rows.
Any value other than zero in the readfromdisk and writetodisk counters for Aggregator, Joiner, or Rank transformations indicates a session bottleneck.
Low BufferInput_efficiency and BufferOutput_efficiency counters also indicate a session bottleneck.
Small cache size, low buffer memory, and small commit intervals can cause session bottlenecks.
5. System (Networks)
-----------------
A correlated subquery runs once for each row selected by the outer query. It contains a reference to a value from the row selected by the outer query.
A nested subquery runs only once for the entire nesting (outer) query. It does not contain any reference to the outer query row.
For example:
Correlated Subquery:
select e1.empname, e1.basicsal, e1.deptno from emp e1 where e1.basicsal = (select max(basicsal) from emp e2 where e2.deptno = e1.deptno)
Nested Subquery:
select empname, basicsal, deptno from emp where (deptno, basicsal) in (select deptno, max(basicsal) from emp group by deptno)
The Integration Service processes all input groups in parallel. The Integration Service concurrently reads sources
connected to the Union transformation and pushes blocks of data into the input groups of
the transformation. The Union transformation processes the blocks of data based on the order it receives the
blocks from the Integration Service.
You can connect heterogeneous sources to a Union transformation. The Union transformation merges sources
with matching ports and outputs the data from one output group with the same ports as the input groups.
In a star schema a dimension table will not have any parent table.
Whereas in a snow flake schema a dimension table will have one or more parent tables.
Hierarchies for the dimensions are stored in the dimensional table itself in star schema.
Whereas hierarchies are broken into separate tables in a snowflake schema. These hierarchies help to drill down the data from the topmost level to the lowermost level.
ASCII mode stores a character value in one byte. Unicode mode takes 2 bytes to store a character.
If a session joins multiple source tables in one Source Qualifier, optimizing the query may improve performance. Also, single-table SELECT statements with an ORDER BY or GROUP BY clause may benefit from optimization such as adding indexes.
We can improve the session performance by configuring the network packet size, which determines how much data can cross the network at one time. To do this, go to the Server Manager and choose Server Configure Database Connections.
If your target contains key constraints and indexes, they slow the loading of data. To improve the session performance in this case, drop the constraints and indexes before you run the session and rebuild them after completion of the session.
Running parallel sessions by using concurrent batches will also reduce the time of loading the data, so concurrent batches may also increase the session performance.
Partitioning the session improves the session performance by creating multiple connections to sources and targets and loading data in parallel pipelines.
In some cases, if a session contains an Aggregator transformation, you can use incremental aggregation to improve session performance.
Avoid transformation errors to improve the session performance.
If the session contains a Lookup transformation, you can improve the session performance by enabling the lookup cache.
If your session contains a Filter transformation, create that filter transformation nearer to the sources, or you can use a filter condition in the Source Qualifier.
Aggregator, Rank and Joiner transformations may often decrease the session performance, because they must group data before processing it. To improve session performance in this case, use the Sorted Ports option.
You can also perform the following tasks to optimize the mapping:
Configure single-pass reading.
Optimize datatype conversions.
Eliminate transformation errors.
Optimize transformations.
Optimize expressions.
RE: Why did you use stored procedures in your ETL application?
Usage of stored procedures has the following advantages:
1. Checks the status of the target database.
2. Drops and recreates indexes.
3. Determines if enough space exists in the database.
4. Performs a specialized calculation.
=======================================
A stored procedure in Informatica is useful to impose complex business rules.
=======================================
Static cache:
1. A static cache remains the same during the session run.
2. A static cache can be used for relational and flat file lookup types.
3. A static cache can be used in both unconnected and connected lookups.
Which performs better, IIF or DECODE?
DECODE performs better than IIF; a single DECODE can be used instead of multiple nested IIF cases.
The DECODE function is available in SQL, but the IIF function is not. DECODE also gives clearer readability, so others can understand the logic.
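As a hedged illustration in Oracle-style SQL (table and column names are hypothetical), one DECODE call replaces a chain of conditional branches:
-- DECODE compares deptno against each value in turn and returns the match.
SELECT empname,
       DECODE(deptno, 10, 'Accounting',
                      20, 'Research',
                      30, 'Sales',
                          'Unknown') AS dept_name
FROM   emp;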
The Source Qualifier transformation represents the records that the Informatica server reads when it runs a session. When we add a relational or a flat file source definition to a mapping, we need to connect it to a Source Qualifier transformation.
How many dimension tables did you have in your project, and can you name some dimensions (columns)?
Product Dimension : Product Key, Product id, Product Type, Product name, Batch Number.
Distributor Dimension: Distributor key, Distributor Id, Distributor Location,
Customer Dimension : Customer Key, Customer Id, CName, Age, status, Address, Contact
Account Dimension : Account Key, Acct id, acct type, Location, Balance,
Local Repository : Local repository is within a domain and its not a global repository.
Local repository can connect to a global repository using global shortcuts and can use objects in
its shared folders.
Versioned Repository : This can either be local or global repository but it allows version
control for the repository. A versioned repository can store multiple copies, or versions of an object. This
features allows to efficiently develop, test and deploy metadata in the production environment.
Q. What is a code page?
A. A code page contains encoding to specify characters in a set of one or more languages. The code page is
selected based on source of the data. For example if source contains Japanese text then the code page
should be selected to support Japanese text.
When a code page is chosen, the program or application for which the code page is set, refers to a specific
set of data that describes the characters the application recognizes. This influences the way that application
stores, receives, and sends character data.
Q. Which all databases PowerCenter Server on Windows can connect to?
A. PowerCenter Server on Windows can connect to following databases:
IBM DB2
Informix
Microsoft Access
Microsoft Excel
Microsoft SQL Server
Oracle
Sybase
Teradata
Q. Which all databases PowerCenter Server on UNIX can connect to?
A. PowerCenter Server on UNIX can connect to following databases:
IBM DB2
Informix
Oracle
Sybase
Teradata
Transformations available in the Informatica Mapping Designer:
Aggregator
Application Source Qualifier
Custom
Expression
External Procedure
Filter
Input
Joiner
Lookup
Normalizer
Output
Rank
Router
Sequence Generator
Sorter
Source Qualifier
Stored Procedure
Transaction Control
Union
Update Strategy
XML Generator
XML Parser
XML Source Qualifier
Q. What is a source qualifier? What is meant by Query Override?
A. Source Qualifier represents the rows that the PowerCenter Server reads from a relational or flat file
source when it runs a session. When a relational or a flat file source definition is added to a mapping, it is
connected to a Source Qualifier transformation.
PowerCenter Server generates a query for each Source Qualifier transformation whenever it runs the session. The default query is a SELECT statement containing all the source columns. The Source Qualifier has the capability to override this default query by changing the default settings of the transformation properties. The list of selected ports, or the order in which they appear in the default query, should not be changed in the overridden query.
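A hedged sketch of that rule (the CUSTOMERS table and its columns are hypothetical): the override keeps the same ports in the same order and only adds a filter.
-- Default generated query:
SELECT customers.customer_id, customers.name, customers.state
FROM   customers
-- Overridden query: same select list, same order, plus a WHERE clause.
SELECT customers.customer_id, customers.name, customers.state
FROM   customers
WHERE  customers.state = 'CA'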
Q. What is aggregator transformation?
A. The Aggregator transformation allows performing aggregate calculations, such as averages and sums.
Unlike Expression Transformation, the Aggregator transformation can only be used to perform calculations
on groups. The Expression transformation permits calculations on a row-by-row basis only.
Aggregator Transformation contains group by ports that indicate how to group the data. While grouping the
data, the aggregator transformation outputs the last row of each group unless otherwise specified in the
transformation properties.
Various group by functions available in Informatica are : AVG, COUNT, FIRST, LAST, MAX, MEDIAN,
MIN, PERCENTILE, STDDEV, SUM, VARIANCE.
Q. What is Incremental Aggregation?
A. Whenever a session is created for a mapping with an Aggregator transformation, the session option for
Incremental Aggregation can be enabled. When PowerCenter performs incremental aggregation, it passes
new source data through the mapping and uses historical cache data to perform new aggregation
calculations incrementally.
Q. How Union Transformation is used?
A. The union transformation is a multiple input group transformation that can be used to merge data from
various sources (or pipelines). This transformation works just like UNION ALL statement in SQL, that is
used to combine result set of two SELECT statements.
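A minimal SQL analogy (table names are hypothetical):
-- UNION ALL merges the result sets without removing duplicates, just as the
-- Union transformation merges pipelines with matching ports.
SELECT customer_id, name FROM customers_east
UNION ALL
SELECT customer_id, name FROM customers_west;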
Q. Can two flat files be joined with Joiner Transformation?
A. Yes, joiner transformation can be used to join data from two flat file sources.
Q. What is a look up transformation?
A. This transformation is used to lookup data in a flat file or a relational table, view or synonym. It
compares lookup transformation ports (input ports) to the source column values based on the lookup
condition. Later returned values can be passed to other transformations.
Q. Can a lookup be done on Flat Files?
A. Yes.
Q. What is the difference between a connected look up and unconnected look up?
A. A connected lookup takes input values directly from other transformations in the pipeline.
An unconnected lookup doesn't take inputs directly from any other transformation, but it can be used in any transformation (like Expression) and can be invoked as a function using the :LKP expression. So, an unconnected lookup can be called multiple times in a mapping.
Q. What is a mapplet?
A. A mapplet is a reusable object that is created using mapplet designer. The mapplet contains set of
transformations and it allows us to reuse that transformation logic in multiple mappings.
Q. What does reusable transformation mean?
A. Reusable transformations can be used multiple times in a mapping. The reusable transformation is stored
as a metadata separate from any other mapping that uses the transformation. Whenever any changes to a
reusable transformation are made, all the mappings where the transformation is used will be invalidated.
Q. What is update strategy and what are the options for update strategy?
A. Informatica processes the source data row-by-row. By default every row is marked to be inserted in the
target table. If the row has to be updated/inserted based on some logic Update Strategy transformation is
used. The condition can be specified in Update Strategy to mark the processed row for update or insert.
Following options are available for update strategy :
DD_INSERT : If this is used the Update Strategy flags the row for insertion. Equivalent
numeric value of DD_INSERT is 0.
DD_UPDATE : If this is used the Update Strategy flags the row for update. Equivalent
numeric value of DD_UPDATE is 1.
DD_DELETE : If this is used the Update Strategy flags the row for deletion. Equivalent
numeric value of DD_DELETE is 2.
DD_REJECT : If this is used the Update Strategy flags the row for rejection. Equivalent
numeric value of DD_REJECT is 3.
This is going to be a very interesting topic for ETL & Data modelers who design processes/tables to load
fact or transactional data which keeps on changing between dates.
(Sample source table omitted: time-variant attributes such as rates, ratings, etc.)
The table above shows an entity in the source system that contains time-variant values, but they don't change daily. The values are valid over a period of time; then they change.
Maybe Ralph Kimball or Bill Inmon could come up with a better data model!
Design A
There is a one-to-one relationship between the source row and the target row.
There is a CURRENT_FLAG attribute; that means every time the ETL process gets a new value, it has to add a new row with the current flag and go back and retire the previous row. This is a very costly ETL step and will slow down the ETL process.
From the report writer's perspective, this model is a major challenge to use: what if the report wants a rate which is not current? Imagine the complex query.
Design B
In this design, a snapshot of the source table is taken every day.
The ETL is very easy. But can you imagine the size of the fact table when the source table has more than 1 million rows? (1 million x 365 days = 365 million rows per year.) And what if the changes in values are in hours or minutes?
But you have a very happy user who can write SQL reports very easily.
Design C
Can there be a compromise? How about using from-date (time) to to-date (time)! The report writer can simply provide a date (time), and a straight SQL query can return the value/row that was valid at that moment.
However, the ETL is indeed as complex as in model A: while the current row runs from the current date to infinity, the previous row has to be retired from its from-date to today's date - 1.
This kind of ETL coding also creates lots of testing issues, as you want to make sure that for any given date and time only one instance of the row exists (for the primary key). A sketch of the retrieval query follows.
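A hedged sketch of the Design C retrieval, assuming a hypothetical RATE_HISTORY table with FROM_DATE and TO_DATE columns:
-- Return the one row that was valid at the supplied point in time.
SELECT rate_key, rate_value
FROM   rate_history
WHERE  DATE '2003-06-15' BETWEEN from_date AND to_date;
The unit test cases below check exactly the invariants this query relies on.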
Which design is better? I have used all of them, depending on the situation.
3. What should be the unit test plan?
There are various cases that the ETL can miss; when planning test cases, your plan should be to test precisely those. Here are some examples of test plans:
a. There should be only one value for a given date/datetime.
b. During the initial load, when data is available for multiple days, the process should go sequentially and create snapshots/ranges correctly.
c. At any given time there should be only one current row.
d. etc.