Best Practices: Table of Contents

Best Practices BP-1


Configuration Management BP-1
Migration Procedures BP-1
Development Techniques BP-16
Development FAQs BP-16
Data Cleansing BP-24
Data Connectivity Using PowerConnect for BW Integration Server BP-29
Data Connectivity using PowerConnect for Mainframe BP-33
Data Connectivity using PowerConnect for MQSeries BP-36
Data Connectivity using PowerConnect for PeopleSoft BP-40
Data Connectivity using PowerConnect for SAP BP-46
Incremental Loads BP-52
Mapping Design BP-57
Metadata Reporting and Sharing BP-62
Naming Conventions BP-67
Session and Data Partitioning BP-72
Using Parameters, Variables and Parameter Files BP-75
Error Handling BP-87
A Mapping Approach to Trapping Data Errors BP-87
Design Error Handling Infrastructure BP-91
Documenting Mappings Using Repository Reports BP-94
Error Handling Strategies BP-96
Using Shortcut Keys in PowerCenter Designer BP-107
Object Management BP-109
Creating Inventories of Reusable Objects & Mappings BP-109
Operations BP-113

INFORMATICA CONFIDENTIAL BEST PRACTICES PAGE BP-i


Updating Repository Statistics BP-113
Daily Operations BP-117
Load Validation BP-119
Third Party Scheduler BP-122
Event Based Scheduling BP-125
Repository Administration BP-126
High Availability BP-129
Performance Tuning BP-131
Recommended Performance Tuning Procedures BP-131
Performance Tuning Databases BP-133
Performance Tuning UNIX Systems BP-151
Performance Tuning Windows NT/2000 Systems BP-157
Tuning Mappings for Better Performance BP-161
Tuning Sessions for Better Performance BP-170
Determining Bottlenecks BP-177
Platform Configuration BP-182
Advanced Client Configuration Options BP-182
Advanced Server Configuration Options BP-184
Platform Sizing BP-189
Recovery BP-193
Running Sessions in Recovery Mode BP-193
Project Management BP-199
Developing the Business Case BP-199
Assessing the Business Case BP-201
Defining and Prioritizing Requirements BP-203
Developing a WBS BP-205
Developing and Maintaining the Project Plan BP-206
Managing the Project Lifecycle BP-208
Security BP-210
Configuring Security BP-210

Migration Procedures

Challenge

To develop a migration strategy that ensures clean migration between development,
test, QA, and production, thereby protecting the integrity of each of these
environments as the system evolves.

Description

In every application deployment, a migration strategy must be formulated to ensure
a clean migration between development, test, quality assurance, and production. The
migration strategy is largely influenced by the technologies that are deployed to
support the development and production environments. These technologies include
the databases, the operating systems, and the available hardware.

Informatica offers flexible migration techniques that can be adapted to fit the
existing technology and architecture of various sites, rather than proposing a single
fixed migration strategy. The means to migrate work from development to
production depends largely on the repository environment, which is either:

• Standalone PowerCenter, or
• Distributed PowerCenter

This Best Practice describes several migration strategies, outlining the advantages
and disadvantages of each. It also discusses an XML method provided in
PowerCenter 5.1 to support migration in either a Standalone or a Distributed
environment.

Standalone PowerMart/PowerCenter

In a standalone environment, all work is performed in a single Informatica repository
that serves as the shared metadata store. In this standalone environment,
segregating the workspaces ensures that the migration from development to
production is seamless.

Workspace segregation can be achieved by creating separate folders for each work
area. For instance, we might build a single data mart for the finance division within a
corporation. In this example, we would create a minimum of four folders to manage
our metadata. The folders might look something like the following:

• FINANCE_DEV
• FINANCE_TEST
• FINANCE_QA
• FINANCE_PROD

In this scenario, mappings are developed in the FINANCE_DEV folder. As
development is completed on particular mappings, they will be copied one at a time
to the FINANCE_TEST folder. New sessions will be created or copied for each
mapping in the FINANCE_TEST folder.

When unit testing has been completed successfully, the mappings are copied into the
FINANCE_QA folder. This process continues until the mappings are integrated into
the production schedule. At that point, new sessions will be created in the
FINANCE_PROD folder, with the database connections adjusted to point to the
production environment.

Introducing shortcuts in a single standalone environment complicates the migration
process, but offers an efficient method for centrally managing sources and targets.

A common folder can be used for sharing reusable objects such as shared sources,
target definitions, and reusable transformations. If a common folder is used, there
should be one common folder for each environment (i.e., SHARED_DEV,
SHARED_TEST, SHARED_QA, SHARED_PROD).
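
The one-folder-per-stage naming scheme can be sketched in a few lines of Python. This is only an illustration of the convention; the helper names stage_folders and next_folder are hypothetical and are not part of any Informatica tool.

```python
# Hypothetical helpers illustrating the one-folder-per-stage naming scheme.
STAGES = ["DEV", "TEST", "QA", "PROD"]


def stage_folders(prefix, stages=STAGES):
    """Return one folder name per migration stage, e.g. FINANCE_DEV."""
    return [f"{prefix}_{stage}" for stage in stages]


def next_folder(folder, stages=STAGES):
    """Return the folder an object is promoted to next, or None after the last stage."""
    prefix, _, stage = folder.rpartition("_")
    position = stages.index(stage)  # raises ValueError for an unknown stage
    if position == len(stages) - 1:
        return None
    return f"{prefix}_{stages[position + 1]}"


work_folders = stage_folders("FINANCE")
shared_folders = stage_folders("SHARED")
```

Keeping the stage suffix identical for work folders and shared folders makes it easy to tell which common folder a given work folder should reference.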

Migration Example Process

Copying the mappings into the next stage enables the user to promote the desired
mapping to test, QA, or production at the lowest level of granularity. If the folder
where the mapping is to be copied does not contain the referenced source/target
tables or transformations, then these objects will automatically be copied along with
the mapping. The advantage of this promotion strategy is that individual mappings
can be promoted as soon as they are ready for production. However, because only
one mapping at a time can be copied, promoting a large number of mappings into
production would be very time consuming. Additional time is required to re-create or
copy all sessions from scratch, especially if pre- or post-session scripts are used.
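
The rule that missing referenced objects travel with the mapping amounts to a set difference. The sketch below assumes a simplified model in which a mapping's dependencies and a folder's contents are plain sets of object names; the object names are hypothetical, and the function does not call any Informatica API.

```python
def objects_copied_with_mapping(mapping_dependencies, target_folder_objects):
    """Return the referenced objects that are missing from the target folder.

    These are the sources, targets, and transformations that would be
    copied along with the mapping.
    """
    return sorted(set(mapping_dependencies) - set(target_folder_objects))


# Hypothetical object names, for illustration only.
dependencies = {"SRC_ORDERS", "TGT_ORDERS_FACT", "EXP_CLEAN_NAMES"}
already_in_test = {"SRC_ORDERS"}
copied = objects_copied_with_mapping(dependencies, already_in_test)
```

Here copied contains EXP_CLEAN_NAMES and TGT_ORDERS_FACT, because only SRC_ORDERS already exists in the target folder.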

On the initial move to production, if all mappings are completed, the entire
FINANCE_QA folder could be copied and renamed to FINANCE_PROD. With this
approach, it is not necessary to promote all mappings and sessions individually. After
the initial migration, however, mappings will be promoted on a “case-by-case” basis.

Follow these steps to copy a mapping from Development to Test:

1. If using shortcuts, first follow these substeps; if not using shortcuts, skip to step 2:

• Create four common folders, one for each migration stage (COMMON_DEV,
COMMON_TEST, COMMON_QA, COMMON_PROD).
• Copy the shortcut objects into the COMMON_TEST folder.

2. Copy the mapping from Development into Test.

• In the PowerCenter Designer, open the appropriate test folder, and drag and
drop the mapping from the development folder into the test folder.

3. If using shortcuts, follow these substeps; if not using shortcuts, skip to step 4:

• Open the mapping that uses shortcuts.
• Using the newly copied mapping, open it in the Designer and bring in the
newly copied shortcut.
• Using the old shortcut as a model, link all of the input ports to the new
shortcut.
• Using the old shortcut as a model, link all of the output ports to the new
shortcut.

However, if any of the objects are active, first delete the old shortcut before linking
the output ports.

4. Create or copy a session in the Server Manager to run the mapping (make sure
the mapping exists in the current repository first).

• If copying the session, follow the Copy Session Wizard.
• If creating the session, enter all the appropriate information in the Session
Wizard.

5. Implement appropriate security, such as:

• In Development, the owner of the folders should be a user in the
Development group.
• In Test and Quality Assurance, change the owner of the Test/QA folders to a
user in the Test/QA group.
• In Production, change the owner of the folders to a user in the Production
group.
• Revoke all rights to Public other than Read for the Production folders.
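
The security rules above can be written down as a small checklist function. This is a sketch under stated assumptions: the group names, the right name "Read", and the shape of the inputs are hypothetical, and the function only reports problems; it does not change anything in a repository.

```python
# Hypothetical model: group names and right names are assumptions for illustration.
EXPECTED_OWNER_GROUP = {
    "DEV": "Development",
    "TEST": "Test/QA",
    "QA": "Test/QA",
    "PROD": "Production",
}


def security_violations(stage, owner_group, public_rights):
    """Return a list of human-readable problems for one folder."""
    problems = []
    expected = EXPECTED_OWNER_GROUP[stage]
    if owner_group != expected:
        problems.append(f"owner should be in the {expected} group, not {owner_group}")
    extra = sorted(set(public_rights) - {"Read"})
    if stage == "PROD" and extra:
        problems.append("Public should have Read only; revoke: " + ", ".join(extra))
    return problems
```

For example, a Production folder still owned by a Development user with Public write access yields two findings, while a correctly configured folder yields an empty list.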

Performance Implications in the Single Environment

A disadvantage of the single environment approach is that even though the
Development, Test, QA, and Production “environments” are stored in separate
folders, they all reside on the same server. This can have negative performance
implications. If Development or Test loads are running simultaneously with
Production loads, the server machine may reach 100 percent utilization and
Production performance will suffer.

Often, Production loads run late at night, and most Development and Test loads run
during the day so this does not pose a problem. However, situations do arise where
performance benchmarking with large volumes or other unusual circumstances can
cause test loads to run overnight, contending with the pre-scheduled Production
runs.
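
Contention of this kind can be spotted ahead of time by comparing planned run windows. The following is a minimal sketch, assuming each load is described by a start and end time; the schedule values are hypothetical.

```python
from datetime import datetime


def windows_overlap(start_a, end_a, start_b, end_b):
    """True when two half-open time windows [start, end) share any time."""
    return start_a < end_b and start_b < end_a


# Hypothetical schedule: nightly Production load from 23:00 to 04:00.
PROD_START = datetime(1999, 9, 14, 23, 0)
PROD_END = datetime(1999, 9, 15, 4, 0)

daytime_test = (datetime(1999, 9, 14, 9, 0), datetime(1999, 9, 14, 17, 0))
overnight_benchmark = (datetime(1999, 9, 14, 20, 0), datetime(1999, 9, 15, 1, 0))
```

With these values the daytime test load does not overlap the Production window, while the overnight benchmark does and should be rescheduled or run elsewhere.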

Distributed PowerCenter

In a distributed environment, there are separate, independent environments (i.e.,
hardware and software) for Development, Test, QA, and Production. This is the
preferred method for handling Development to Production migrations. Because each
environment is segregated from the others, work performed in Development cannot
impact Test, QA, or Production.

With a fully distributed approach, separate repositories provide the same function as
the separate folders in the standalone environment described previously. Each
repository is named like the corresponding folder in the standalone environment. For
instance, in our Finance example we would have four repositories, FINANCE_DEV,
FINANCE_TEST, FINANCE_QA, and FINANCE_PROD.

The mappings are created in the Development repository, moved into the Test
repository, and then eventually into the Production environment. There are three
main techniques to migrate from Development to Production, each involving some
advantages and disadvantages:

• Repository Copy
• Folder Copy
• Object Copy

Repository Copy

The main advantage to this approach is the ability to copy everything at once from
one environment to another, including source and target tables, transformations,
mappings, and sessions. Another advantage is the ability to automate this process
without having users perform this process. The final advantage is that everything
can be moved without breaking/corrupting any of the objects.

There are, however, three distinct disadvantages to the repository copy method. The
first is that everything is moved at once (which is also an advantage). The trouble is
that everything is moved, ready or not. For example, there may be 50 mappings in
QA but only 40 of them are production-ready. The 10 unready mappings are moved
into production along with the 40 production-ready mappings, which leads to the
second disadvantage: maintenance is required to remove any unwanted or
excess objects. The third disadvantage is the need to adjust server variables,
sequences, parameters/variables, database connections, etc. Everything must
be set up correctly on the new server that will now host the repository.

There are three ways to accomplish the Repository Copy method:

• Copying the Repository
• Repository Backup and Restore
• PMREP

Copying the Repository

The repository copy command is probably the easiest method of migration. To
perform this, go to the File menu of the Repository Manager and select
Copy Repository. From there the user is prompted to choose the location to which
the repository will be copied. The following screen shot shows the dialog box used to
input the new location information:

To successfully perform the copy, the user must delete the current repository in the
new location. For example, if a user was copying a repository from DEV to TEST,
then the TEST repository must first be deleted using the Delete option in the
Repository Manager to create room for the new repository. Then the Copy Repository
routine must be run.

Repository Backup and Restore

Backup and Restore Repository is another simple method of copying an entire
repository. To perform this function, go to the File menu in the Repository Manager
and select Backup Repository. This creates a .REP file containing all repository
information. To restore the repository, open the Repository Manager on the
destination server and select Restore Repository from the File menu. Select the
created .REP file to automatically restore the repository on the destination server. To
ensure success, be sure to first delete any matching destination repositories, since
the Restore Repository option does not delete the current repository.

PMREP

Using the PMREP commands is essentially the same as the Backup and Restore
Repository method except that it is run from the command line. The PMREP utilities
can be utilized both from the Informatica Server and from any client machines
connected to the server.

The following table documents the available PMREP commands:

The following is a sample of the command syntax used within a batch file to connect
to and backup a repository. Using the code example below as a model, scripts can be
written to be run on a daily basis to perform functions such as connect, backup,
restore, etc:
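
The scheduling wrapper around such commands can be sketched in Python. The sketch deliberately does not show PMREP syntax: each step's command line is supplied by the administrator as an argument list. The function names, the date-stamped file naming, and the stop-on-first-failure behavior are all assumptions made for illustration.

```python
import subprocess
from datetime import date


def backup_file_name(repository, day):
    """Build a date-stamped backup file name, e.g. FINANCE_DEV_19990914.rep."""
    return f"{repository}_{day:%Y%m%d}.rep"


def run_steps(steps, runner=subprocess.run):
    """Run (name, argv) steps in order and stop at the first non-zero exit code.

    Each argv is a command line supplied by the administrator (not shown here).
    Returns the names of the steps that completed successfully.
    """
    completed = []
    for name, argv in steps:
        result = runner(argv)
        if result.returncode != 0:
            break  # do not attempt a backup if, for example, the connect step failed
        completed.append(name)
    return completed
```

A daily job would build the list of steps (connect, then backup to the date-stamped file) and compare the returned names with the expected ones to decide whether to raise an alert.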

After following one of the above procedures to migrate into Production, follow these
steps to convert the repository to Production:

1. Disable sessions that schedule mappings that are not ready for Production or
simply delete the mappings and sessions.

• Disable the sessions in the Server Manager by opening the session properties,
and then clearing the Enable checkbox under the General tab.
• Delete the sessions in the Server Manager and the mappings in the Designer.

2. Modify the database connection strings to point to the Production sources and
targets.

• In the Server Manager, select Database Connections from the Server
Configuration menu.
• Edit each database connection by changing the connect string to point to the
production sources and targets.
• If using lookup transformations in the mappings and the connect string is
anything other than $SOURCE or $TARGET, then the connect string will need
to be modified appropriately.
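
The connection review in this step can be sketched as follows, assuming the connection and lookup settings have been listed by hand into plain dictionaries. The connection names, lookup names, and connect strings are hypothetical; the functions do not read from or write to a repository.

```python
PASS_THROUGH = {"$SOURCE", "$TARGET"}


def lookups_needing_update(lookup_connections):
    """Return lookup names whose connect string is not $SOURCE or $TARGET."""
    return sorted(
        name
        for name, connect_string in lookup_connections.items()
        if connect_string not in PASS_THROUGH
    )


def repoint(connections, production_strings):
    """Return a copy of connections with production connect strings applied."""
    return {
        name: production_strings.get(name, current)
        for name, current in connections.items()
    }


# Hypothetical values, for illustration only.
lookups = {"LKP_CUSTOMER": "$SOURCE", "LKP_REGION": "FIN_QA_DB", "LKP_PRODUCT": "$TARGET"}
to_review = lookups_needing_update(lookups)
```

Only LKP_REGION is flagged, because it names a specific database instead of inheriting the session's source or target connection.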

3. Modify the pre- and post-session commands as necessary.

• In the Server Manager, open the session properties, and from the General tab
make the required changes to the pre- and post-session scripts.

4. Implement appropriate security, such as:

• In Development, ensure that the owner of the folders is a user in the
Development group.
• In Test and Quality Assurance, change the owner of the Test/QA folders to a
user in the Test/QA group.
• In Production, change the owner of the folders to a user in the Production
group.
• Revoke all rights to Public other than Read for the Production folders.

Folder Copy

Copying an entire folder allows you to quickly promote all of the objects in the
Development folder to Test, and so forth. All source and target tables, reusable
transformations, mappings, and sessions are promoted at once. Therefore,
everything in the folder must be ready to migrate forward. If certain mappings are
not ready, then after the folder is copied, developers (or the Repository
Administrator) must manually delete these mappings from the new folder.

The advantages of Folder Copy are:

• Easy to move the entire folder and all objects in it
• Detailed Wizard guides the user through the entire process
• There’s no need to update or alter any Database Connections, sequences or
server variables.

The disadvantages of Folder Copy are:

• User needs to be logged into multiple environments simultaneously.
• The repository is locked while Folder Copy is being performed.

If copying a folder, for example, from QA to Production, follow these steps:

1. If using shortcuts, follow these substeps; otherwise skip to step 2:

• In each of the dedicated repositories, create a common folder using exactly
the same name and case as in the “source” repository.
• Copy the shortcut objects into the common folder in Production and make
sure the shortcut has exactly the same name.
• Open and connect to either the Repository Manager or Designer.

2. Drag and drop the folder onto the production repository icon within the
Navigator tree structure. (To copy the entire folder, drag and drop the folder icon
just under the repository level.)

3. Follow the Copy Folder Wizard steps. If a folder with that name already exists,
it must be renamed.

4. Point the folder to the correct shared folder if one is being used:

After performing the Folder Copy method, be sure to remember the following steps:

1. Modify the pre- and post-session commands as necessary:

• In the Server Manager, open the session properties, and from the General tab
make the required changes to the pre- and post-session scripts.

2. Implement appropriate security:

• In Development, ensure the owner of the folders is a user in the Development
group.
• In Test and Quality Assurance, change the owner of the Test/QA folders to a
user in the Test/QA group.
• In Production, change the owner of the folders to a user in the Production
group.
• Revoke all rights to Public other than Read for the Production folders.

Object Copy

Copying mappings into the next stage within a distributed environment has many of
the same advantages and disadvantages as in the standalone environment, but the
process of handling shortcuts is simplified in the distributed environment. For
additional information, see the previous description of Object Copy for the
standalone environment.

Additional advantages and disadvantages of Object Copy in a distributed
environment include:

Advantages:

• More granular control over objects

Disadvantages:

• Much more work to deploy an entire group of objects
• Shortcuts must exist prior to importing/copying mappings

To copy a mapping from QA to Production, follow these steps:

1. If using shortcuts, follow these substeps; otherwise skip to step 2:

• In each of the dedicated repositories, create a common folder with the exact
same name and case.
• Copy the shortcuts into the common folder in Production making sure the
shortcut has the exact same name.

2. Copy the mapping from quality assurance (QA) into production.

• In the Designer, connect to both the QA and Production repositories and open
the appropriate folders in each.
• Drag and drop the mapping from QA into Production.

3. Create or copy a session in the Server Manager to run the mapping (make
sure the mapping exists in the current repository first).

• If copying the session, follow the Copy Session Wizard.
• If creating the session, enter all the appropriate information in the Session
Wizard.

4. Implement appropriate security.

• In Development, ensure the owner of the folders is a user in the Development
group.
• In Test and Quality Assurance, change the owner of the Test/QA folders to a
user in the Test/QA group.
• In Production, change the owner of the folders to a user in the Production
group.
• Revoke all rights to Public other than Read for the Production folders.

Recommendations

Informatica recommends using the following process when running in a three-tiered
environment with Development, Test/QA, and Production servers:

For migrating from Development into Test, Informatica recommends using the
Object Copy method. This method gives you total granular control over the objects
that are being moved. It ensures that the latest development maps can be moved
over manually as they are completed. For recommendations on performing this copy
procedure correctly, see the steps outlined in the Object Copy section.

When migrating from Test to Production, Informatica recommends using the
Repository Copy method. Before performing this migration, all code in the Test
server should be frozen and tested. After the Test code is cleared for production, use
one of the repository copy methods. (Refer to the steps outlined in the Repository
Copy section for recommendations to ensure that this process is successful.) If
similar server and database naming conventions are utilized, there will be minimal or
no changes required to sessions that are created or copied to the production server.

XML Object Copy Process

Another method of copying objects in a distributed (or centralized) environment is to
copy objects by using the PowerMart/PowerCenter XML functionality. This method is more useful in the
distributed environment because it allows for backup into an XML file to be moved
across the network.

The XML Object Copy Process works in a manner very similar to the Repository Copy
backup and restore method, as it allows you to copy sources, targets, reusable
transformations, mappings, and sessions. Once the XML file has been created, that
XML file can be changed with a text editor to allow more flexibility. For example, if
you need to copy one session many times, export that session to an XML
file. Then edit that file to find everything within the <Session> tag, copy
that text, and paste it within the XML file. Change the name
of the session you just pasted so that it is unique. When you import that XML file back
into your folder, two sessions are created. The following demonstrates the
import/export functionality:

1. Objects are exported into an XML file:

2. Objects are imported into a repository from the corresponding XML file:

3. Sessions can be exported and imported into the Server Manager in the same
way (the corresponding mappings must exist for this to work).
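
The session duplication described earlier (copy the <Session> text, paste it, rename it) can also be scripted instead of done in a text editor. The sketch below uses a deliberately simplified, hypothetical XML layout: the Folder element and the Name and Mapping attributes are assumptions for illustration and do not reproduce the actual export file format.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical, simplified export file; the real export format is not shown here.
SAMPLE = """<Folder Name="FINANCE_DEV">
  <Session Name="s_load_orders" Mapping="m_load_orders"/>
</Folder>"""


def duplicate_session(xml_text, old_name, new_name):
    """Return XML text in which the named session appears a second time under new_name."""
    root = ET.fromstring(xml_text)
    for parent in root.iter():
        for child in list(parent):
            if child.tag == "Session" and child.get("Name") == old_name:
                clone = copy.deepcopy(child)
                clone.set("Name", new_name)  # the pasted copy must have a unique name
                parent.append(clone)
                return ET.tostring(root, encoding="unicode")
    raise ValueError(f"session {old_name!r} not found")


duplicated = duplicate_session(SAMPLE, "s_load_orders", "s_load_orders_copy")
```

Importing a file produced this way would create both sessions, provided the referenced mapping already exists in the target folder.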

Development FAQs

Challenge

Using the PowerCenter product suite effectively to develop, name, and
document components of the analytic solution. While the most effective use of
PowerCenter depends on the specific situation, this Best Practice addresses some
questions that are commonly raised by project teams. It provides answers in a
number of areas, including Scheduling, Backup Strategies, Server Administration,
and Metadata. Refer to the product guides supplied with PowerCenter for additional
information.

Description

The following pages summarize some of the questions that typically arise during
development and suggest potential resolutions.

Q: How does source format affect performance? (e.g., is it more efficient to source
from a flat file rather than a database?)

In general, a flat file that is located on the server machine loads faster than a
database located on the server machine. Fixed-width files are faster than
delimited files because delimited files require extra parsing. However, if there
is an intent to perform intricate transformations before loading to target, it
may be advisable to first load the flat-file into a relational database, which
allows the PowerCenter mappings to access the data in an optimized fashion
by using filters and custom SQL SELECTs where appropriate.

Q: What are some considerations when designing the mapping? (e.g., what is the
impact of having multiple targets populated by a single map?)

With PowerCenter, it is possible to design a mapping with multiple targets.
You can then load the targets in a specific order using Target Load Ordering.
The recommendation is to limit the amount of complex logic in a mapping.
Not only is it easier to debug a mapping with a limited number of objects, but
such mappings can also be run concurrently and make use of more system resources.
When using multiple output files (targets), consider writing to multiple disks
or file systems simultaneously. This minimizes disk seeks and applies to a
session writing to multiple targets, and to multiple sessions running
simultaneously.

Q: What are some considerations for determining how many objects and
transformations to include in a single mapping?

There are several items to consider when building a mapping. The business
requirement is always the first consideration, regardless of the number of
objects it takes to fulfill the requirement. The most expensive use of the DTM
is passing unnecessary data through the mapping. It is best to use filters as
early as possible in the mapping to remove rows of data that are not needed.
This is the SQL equivalent of the WHERE clause. Using the filter condition in
the Source Qualifier to filter out the rows at the database level is a good way
to increase the performance of the mapping.
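
The benefit of filtering early can be illustrated outside PowerCenter with a small Python sketch. It models a mapping as a filter plus one transformation and counts how many rows the transformation has to process; the row layout and function names are hypothetical.

```python
def expensive_transform(row, counter):
    """Stand-in for transformation logic; counts how often it runs."""
    counter["calls"] += 1
    return {**row, "total": row["qty"] * row["price"]}


def filter_early(rows, keep):
    """Drop unwanted rows first, then transform only what remains."""
    counter = {"calls": 0}
    out = [expensive_transform(row, counter) for row in rows if keep(row)]
    return out, counter["calls"]


def filter_late(rows, keep):
    """Transform every row, then drop the unwanted ones."""
    counter = {"calls": 0}
    transformed = [expensive_transform(row, counter) for row in rows]
    return [row for row in transformed if keep(row)], counter["calls"]
```

Both functions return the same rows, but filter_early runs the transformation only on rows that pass the filter, which is the effect a Source Qualifier filter condition has at the database level.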

Log File Organization

Q: Where is the best place to maintain Session Logs?

One often-recommended location is the default /SessLogs/ folder in the
Informatica directory, keeping all log files in the same directory.

Q: What documentation is available for the error codes that appear within the error
log files?

Log file errors and descriptions appear in Appendix C of the PowerCenter
User Guide. Error information also appears in the PowerCenter Help File
within the PowerCenter client applications. For other database-specific errors,
consult your Database User Guide.

Scheduling Techniques

Q: What are the benefits of using batches rather than sessions?

Using a batch to group logical sessions minimizes the number of objects that
must be managed to successfully load the warehouse. For example, a
hundred individual sessions can be logically grouped into twenty batches. The
Operations group can then work with twenty batches to load the warehouse,
which simplifies the operations tasks associated with loading the targets.

There are two types of batches: sequential and concurrent.

o A sequential batch simply runs sessions one at a time, in a linear
sequence. Sequential batches help ensure that dependencies are met
as needed. For example, a sequential batch ensures that session1 runs
before session2 when session2 is dependent on the load of session1,
and so on. It's also possible to set up conditions to run the next
session only if the previous session was successful, or to stop on
errors, etc.

o A concurrent batch groups logical sessions together, like a sequential
batch, but runs all the sessions at one time. This can reduce the load
times into the warehouse, taking advantage of hardware platforms'
Symmetric Multi-Processing (SMP) architecture. A new batch is
sequential by default; to make it concurrent, explicitly select the
Concurrent check box.
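
The difference between the two batch types can be sketched in Python. This is an analogy, not the server's implementation: sessions are modeled as callables that return True on success, the stop-on-failure behavior is the optional condition mentioned above, and threads stand in for concurrently running sessions.

```python
from concurrent.futures import ThreadPoolExecutor


def run_sequential(sessions):
    """Run (name, callable) sessions in order; stop after the first failure."""
    results = {}
    for name, run in sessions:
        results[name] = run()
        if not results[name]:
            break  # later sessions depend on this one, so they are not started
    return results


def run_concurrent(sessions):
    """Start all sessions at once and collect every result."""
    with ThreadPoolExecutor(max_workers=max(1, len(sessions))) as pool:
        futures = {name: pool.submit(run) for name, run in sessions}
        return {name: future.result() for name, future in futures.items()}
```

In the sequential case a failed session prevents its dependents from running; in the concurrent case every session runs regardless of the others, so failures have to be handled afterwards.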

Other batch options, such as nesting batches within batches, can further
reduce the complexity of loading the warehouse. This capability also
allows for the creation of very complex and flexible batch streams without the
use of a third-party scheduler.

Q: Assuming a batch failure, does PowerCenter allow restart from the point of
failure?

Yes. When a session or sessions in a batch fail, you can perform recovery to
complete the batch. The steps to take vary depending on the type of batch:

If the batch is sequential, you can recover data from the session that failed
and run the remaining sessions in the batch. If a session within a concurrent
batch fails, but the rest of the sessions complete successfully, you can
recover data from the failed session targets to complete the batch. However,
if all sessions in a concurrent batch fail, you might want to truncate all targets
and run the batch again.
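
The recovery choices described above can be summarized as a small decision function. This is a sketch of the decision logic only; the status values and the wording of the returned actions are hypothetical, and truncating all targets is a suggestion in the text, not a requirement.

```python
def recovery_plan(batch_type, statuses):
    """Suggest recovery actions from a mapping of session name to status.

    Status values are "succeeded", "failed", or "not_run".
    """
    failed = [name for name, status in statuses.items() if status == "failed"]
    if not failed:
        return []
    if batch_type == "sequential":
        pending = [name for name, status in statuses.items() if status == "not_run"]
        return [f"recover {name}" for name in failed] + [f"run {name}" for name in pending]
    if batch_type == "concurrent":
        if len(failed) == len(statuses):
            return ["truncate all targets", "rerun batch"]
        return [f"recover {name}" for name in failed]
    raise ValueError(f"unknown batch type: {batch_type}")
```

For a sequential batch in which s2 failed and s3 never started, the plan is to recover s2 and then run s3.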

Q: What guidelines exist regarding the execution of multiple concurrent sessions /
batches within or across applications?

Session/Batch Execution needs to be planned around two main constraints:

• Available system resources
• Memory and processors

The number of sessions that can run at one time depends on the number of
processors available on the server. The load manager is always running as a
process. As a general rule, a session will be compute-bound, meaning its
throughput is limited by the availability of CPU cycles. Most sessions are
transformation intensive, so the DTM always runs. Also, some sessions
require more I/O, so they use less processor time. Generally, a session needs
about 120 percent of a processor for the DTM, reader, and writer in total.

For concurrent sessions:

• One session per processor is about right; you can run more, but all
sessions will slow slightly.
• Remember that other processes may also run on the PowerCenter
server machine; overloading a production machine will slow overall
performance.
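
The processor rule of thumb can be turned into a quick estimate. The sketch below uses only the two figures given above (about 120 percent of a processor per session, and about one session per processor); the function names are hypothetical and the result is a rough planning aid, not a measurement.

```python
CPU_PER_SESSION = 1.2  # "about 120 percent of a processor" per session


def cpu_load_ratio(concurrent_sessions, processors):
    """Estimated CPU demand of the sessions divided by the available processors."""
    return concurrent_sessions * CPU_PER_SESSION / processors


def assess(concurrent_sessions, processors):
    """Apply the one-session-per-processor rule of thumb."""
    if concurrent_sessions <= processors:
        return "about right"
    return "more than one session per processor: expect every session to slow"
```

On a four-processor server, four concurrent sessions give an estimated load ratio of 1.2, and eight sessions give 2.4, before counting any other processes on the machine.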

Even after available processors are determined, it is necessary to look at
overall system resource usage. Determining memory usage is more difficult
than the processors calculation; it tends to vary according to system load and
number of Informatica sessions running. The first step is to estimate memory
usage, accounting for:

• Operating system kernel and miscellaneous processes
• Database engine
• Informatica Load Manager

Each session creates three processes: the Reader, Writer, and DTM.

• If multiple sessions run concurrently, each has three processes
• More memory is allocated for lookups, aggregates, ranks, and
heterogeneous joins in addition to the shared memory segment.

At this point, you should have a good idea of what is left for concurrent
sessions. It is important to arrange the production run to maximize use of this
memory. Remember to account for sessions with large memory
requirements; you may be able to run only one large session, or several small
sessions concurrently.
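
The memory estimate can be worked through as simple arithmetic. The figures below are hypothetical and chosen only to show the calculation; actual values have to be measured on the server in question.

```python
def memory_for_sessions(total_mb, os_mb, database_mb, load_manager_mb):
    """Memory left for sessions after the fixed consumers are subtracted."""
    return total_mb - (os_mb + database_mb + load_manager_mb)


def fits(session_requirements_mb, available_mb):
    """True when the listed sessions can run concurrently within available memory."""
    return sum(session_requirements_mb) <= available_mb


# Hypothetical server: 4096 MB total, 512 MB OS, 1024 MB database, 64 MB Load Manager.
available = memory_for_sessions(4096, 512, 1024, 64)
```

With 2496 MB available, one 2048 MB session fits on its own, four 512 MB sessions fit together, but the large session plus one small session does not.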

Load Order Dependencies are also an important consideration because they
often create additional constraints. For example, load the dimensions first,
then facts. Also, some sources may only be available at specific times, some
network links may become saturated if overloaded, and some target tables
may need to be available to end users earlier than others.
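
A load order that respects such dependencies can be derived with a topological sort. The sketch below uses Python's standard graphlib module (Python 3.9 or later); the table names are hypothetical, and the sketch ignores the time-of-day and network constraints mentioned above.

```python
from graphlib import TopologicalSorter


def load_order(dependencies):
    """Return a load order in which every table follows the tables it depends on.

    dependencies maps each table to the set of tables that must load first.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical warehouse: the fact table depends on both dimensions.
order = load_order({
    "fact_sales": {"dim_customer", "dim_product"},
    "dim_customer": set(),
    "dim_product": set(),
})
```

Both dimensions appear before fact_sales in the result; the relative order of the two dimensions is not significant, so they are candidates for a concurrent batch.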

Q: Is it possible to perform two "levels" of event notification? One at the application
level, and another at the PowerCenter server level to notify the Server
Administrator?

The application level of event notification can be accomplished through
post-session e-mail. Post-session e-mail allows you to create two different
messages, one to be sent upon successful completion of the session, the
other to be sent if the session fails. Messages can be a simple notification of
session completion or failure, or a more complex notification containing
specifics about the session. You can use the following variables in the text of
your post-session e-mail:

E-mail Variable   Description

%s                Session name
%l                Total records loaded
%r                Total records rejected
%e                Session status
%t                Table details, including read throughput in bytes/second and write
                  throughput in rows/second
%b                Session start time
%c                Session completion time
%i                Session elapsed time (session completion time - session start time)
%g                Attaches the session log to the message
%a<filename>      Attaches the named file. The file must be local to the Informatica
                  Server. The following are valid filenames: %a<c:\data\sales.txt>
                  or %a</users/john/data/sales.txt>

                  On Windows NT, you can attach a file of any type.

                  On UNIX, you can only attach text files. If you attach a non-text
                  file, the send might fail.

                  Note: The filename cannot include the Greater Than character
                  (>) or a line break.
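
How such variables expand into a message can be illustrated with a small Python sketch. It is only an analogy for the substitution the server performs: it handles the single-value variables (%s, %l, %r, %e, %b, %c, %i) and ignores %t, %g, and %a, and the function name is hypothetical.

```python
import re


def render_email(template, values):
    """Replace %s, %l, %r, %e, %b, %c and %i with the matching entries in values."""
    def substitute(match):
        return str(values[match.group(1)])  # raises KeyError if a value is missing
    return re.sub(r"%([slrebci])", substitute, template)


message = render_email(
    "Session %s finished with status %e: %l rows loaded, %r rejected.",
    {"s": "sInstrTest", "e": "Completed", "l": 1, "r": 0},
)
```

The resulting message reads "Session sInstrTest finished with status Completed: 1 rows loaded, 0 rejected."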

The PowerCenter Server on UNIX uses rmail to send post-session e-mail. The
repository user who starts the PowerCenter server must have the rmail tool
installed in the path in order to send e-mail.

To verify the rmail tool is accessible:

1. Log in to the UNIX system as the PowerCenter user who starts the
PowerCenter Server.

2. Type rmail <fully qualified email address> at the prompt and press Enter.

3. Type . to indicate the end of the message and press Enter.

4. You should receive a blank e-mail from the PowerCenter user's e-mail
account. If not, locate the directory where rmail resides and add that
directory to the path.

5. When you have verified that rmail is installed correctly, you are ready to
send post-session e-mail.

The output should look like the following:

Session complete.
Session name: sInstrTest
Total Rows Loaded = 1
Total Rows Rejected = 0
Completed

Rows Loaded  Rows Rejected  Read Throughput (bytes/sec)  Write Throughput (rows/sec)  Table Name
Status
1            0              30                           1                            t_Q3_sales

No errors encountered.
Start Time: Tue Sep 14 12:26:31 1999
Completion Time: Tue Sep 14 12:26:41 1999
Elapsed time: 0:00:10 (h:m:s)

This information, or a subset, can also be sent to any text pager that accepts
e-mail.
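
The rmail path check described in the verification steps above can also be scripted. The following is a minimal sketch in Python, assuming Python is available on the server host. The function name find_mail_tool is hypothetical, and the script only reports whether rmail can be found on the search path of the current user; it does not send a message.

```python
import os
import shutil


def find_mail_tool(tool="rmail", path=None):
    """Return the full path of `tool` if it is on the search path, else None."""
    # shutil.which searches `path` when given, otherwise the PATH environment variable.
    return shutil.which(tool, path=path)


if __name__ == "__main__":
    location = find_mail_tool()
    if location:
        print("rmail found at", location)
    else:
        print("rmail not found; add its directory to PATH. Current PATH:")
        print(os.environ.get("PATH", ""))
```

Run the script as the same user who starts the PowerCenter Server, since the search path can differ between accounts.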

Backup Strategy Recommendation

Q: Can individual objects within a repository be restored from the back-up or from a
prior version?

At the present time, individual objects cannot be restored from a back-up
using the PowerCenter Server Manager (i.e., you can only restore the entire
repository). However, it is possible to restore the back-up repository into a
different database and then manually copy the individual objects back into
the main repository.

Refer to Migration Procedures for details on promoting new or changed
objects between development, test, QA, and production environments.

Server Administration

Q: What built-in functions does PowerCenter provide to notify someone in the event
that the server goes down, or some other significant event occurs?

There are no built-in functions in the server to send notification if the
server goes down. However, it is possible to implement a shell script
that will sense whether the server is running or not. For example, the
command "pmcmd pingserver" will give a return code or status which
will tell you if the server is up and running. Using the results of this
command as a basis, a complex notification script could be built.
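
A notification script of this kind can be sketched as follows. This is an illustration in Python rather than shell, and it rests on two assumptions that are not stated in this document: that a return code of 0 means the server responded, and that any connection arguments pmcmd requires are added to the command list. The function names and the notify placeholder are hypothetical.

```python
import subprocess
import sys


def server_is_up(ping_command, timeout_seconds=30):
    """Run the ping command and return True only when it exits with code 0."""
    try:
        result = subprocess.run(
            ping_command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_seconds,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Command not found or no answer in time: treat the server as down.
        return False
    # Assumption: exit code 0 signals that the server answered the ping.
    return result.returncode == 0


def notify(message):
    # Placeholder: replace with an e-mail or pager call suited to your site.
    print(message, file=sys.stderr)


if __name__ == "__main__":
    # Assumption: connection arguments are omitted here; add those your installation needs.
    command = ["pmcmd", "pingserver"]
    if not server_is_up(command):
        notify("PowerCenter Server did not respond to pingserver")
```

Scheduling such a script at a regular interval (for example with cron) gives a basic server-level notification to complement post-session e-mail.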

Q: What system resources should be monitored? What should be considered normal
or acceptable server performance levels?

The pmprocs utility, which is available for UNIX systems only, shows
the currently executing PowerCenter processes.

Pmprocs is a script that combines the ps and ipcs commands. It is
available through Informatica Technical Support. The utility provides
the following information:
- CPID - Creator PID (process ID)
- LPID - Last PID that accessed the resource
- Semaphores - used to sync the reader and writer
- 0 or 1 - shows slot in LM shared memory
(See Chapter 16 in the PowerCenter Administrator's Guide for
additional details.)

Q: What cleanup (if any) should be performed after a UNIX server crash? Or after an
Oracle instance crash?

If the UNIX server crashes, you should first check to see if the
Repository Database is able to come back up successfully. If this is the
case, then you should try to start the PowerCenter server. Use the
pmserver.err log to check if the server has started correctly. You can
also use ps -ef | grep pmserver to see if the server process (the Load
Manager) is running.
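
The process check above can be automated. The sketch below, in Python, filters `ps -ef` output for the pmserver process and ignores any line that mentions grep, so that a grep command is not mistaken for the server. The function names are hypothetical, and the simple substring match is an assumption that may need tightening on hosts where other commands contain the same text.

```python
import subprocess


def process_listed(ps_output, process_name="pmserver"):
    """Return the ps output lines that mention process_name, excluding grep lines."""
    matches = []
    for line in ps_output.splitlines():
        # Substring match on the whole line; lines containing "grep" are skipped.
        if process_name in line and "grep" not in line:
            matches.append(line)
    return matches


def pmserver_running():
    """Run `ps -ef` and report whether a pmserver process appears in the listing."""
    output = subprocess.run(
        ["ps", "-ef"], capture_output=True, text=True, check=True
    ).stdout
    return bool(process_listed(output))
```

Calling pmserver_running() after a restart gives a quick yes/no answer that can be combined with a review of the pmserver.err log.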

Metadata

Q: What recommendations or considerations exist as to naming standards or
repository administration for metadata that might be extracted from the
PowerCenter repository and used in others?

With PowerCenter, you can enter description information for all repository
objects, sources, targets, transformations, etc., but the amount of metadata
that you enter should be determined by the business requirements. You can
also drill down to the column level and give descriptions of the columns in a
table if necessary. All information about column size and scale, datatypes,
and primary keys is stored in the repository.

The decision on how much metadata to create is often driven by project
timelines. While it may be beneficial for a developer to enter detailed
descriptions of each column, expression, variable, etc., it is also very
time-consuming to do so. Therefore, this decision should be made on the basis of
how much metadata will be required by the systems that use the metadata.

Q: What procedures exist for extracting metadata from the repository?

Informatica offers an extremely rich suite of metadata-driven tools for data
warehousing applications. All of these tools store, retrieve, and manage their
metadata in Informatica's central repository. The motivation behind the
original Metadata Exchange (MX) architecture was to provide an effective and
easy-to-use interface to the repository.

Today, Informatica and several key Business Intelligence (BI) vendors,
including Brio, Business Objects, Cognos, and MicroStrategy, are effectively
using the MX views to report and query the Informatica metadata.

Informatica does not recommend accessing the repository directly, even for
SELECT access. Rather, views have been created to provide access to the
metadata stored in the repository.

Data Cleansing

Challenge

Accuracy is one of the biggest obstacles blocking the success of many data
warehousing projects. If users discover data inconsistencies, the user community
may lose faith in the entire warehouse’s data. However, it is not unusual to discover
that as many as half the records in a database contain some type of information that
is incomplete, inconsistent, or incorrect. The challenge is therefore to cleanse data
online, at the point of entry into the data warehouse or operational data store (ODS),
to ensure that the warehouse provides consistent and accurate data for business
decision making.

Description

Informatica has several partners in the data cleansing arena. The partners
and respective tools include the following:

DataMentors - Provides tools that are run before the data extraction and
load process to clean source data. Available tools are:

• DMDataFuse™ - a data cleansing and householding system with the
power to accurately standardize and match data.
• DMValiData™ - an effective data analysis system that profiles and
identifies inconsistencies between data and metadata.
• DMUtils - a powerful non-compiled scripting language that operates on
flat ASCII or delimited files. It is primarily used as a query and
reporting tool. It also provides a way to reformat and summarize files.

FirstLogic – FirstLogic offers direct interfaces to PowerCenter during the
extract and load process as well as providing pre-data extraction data
cleansing tools like DataRight and Merge/Purge. The online interface (ACE
Library) integrates the TrueName Library and Merge/Purge Library of
FirstLogic, as Transformation Components, using the Informatica External
Procedures protocol. Thus, these components can be invoked for parsing,
standardization, cleansing, enhancement, and matching of the name and
address information during the PowerCenter ETL stage of building a data mart
or data warehouse.

Paladyne – The flagship product, Datagration, is an open, flexible data quality
system that can repair any type of data (in addition to name and address data)
by incorporating custom business rules and logic. Datagration's Data
Discovery Message Gateway feature assesses data cleansing requirements
using automated data discovery tools that identify data patterns. Data
Discovery enables Datagration to search through a field of free form data and
re-arrange the tokens (i.e., words, data elements) into a logical order.
Datagration supports relational database systems and flat files as data
sources and any application that runs in batch mode.

Vality – Provides a product called Integrity, which identifies business
relationships (such as households) and duplications, reveals undocumented
business practices, and discovers metadata/field content discrepancies. It
offers data analysis and investigation, conditioning, and unique probabilistic
and fuzzy matching capabilities.

Vality is in the process of developing a "TX Integration" to PowerCenter.
Delivery of this bridge was originally scheduled for May 2001, but no further
information is available at this time.

Trillium – Trillium’s eQuality customer information components (a
web-enabled tool) are integrated with Informatica’s Transformation Exchange
modules and reside on the same server as Informatica’s transformation
engine. As a result, Informatica users can invoke Trillium’s four data quality
components through an easy-to-use graphical desktop object. The four
components are:

• Converter: data analysis and investigation module for discovering
word patterns and phrases within free form text
• Parser: processing engine for data cleansing, elementizing and
standardizing customer data
• Geocoder: an internationally certified postal and census module for
address verification and standardization
• Matcher: a module designed for relationship matching and record
linking.

Integration Examples

The following sections describe how to integrate two of the tools with
PowerCenter.

FirstLogic – ACE

The following graphic illustrates a high level flow diagram of the data
cleansing process.

Use the Informatica Advanced External Transformation process to interface
with the FirstLogic module by creating a “Matching Link” transformation. That
process uses the Informatica Transformation Developer to create a new
Advanced External Transformation, which incorporates the properties of the
FirstLogic Matching Link files. Once a Matching Link transformation has been
created in the Transformation Developer, users can incorporate that
transformation into any of their project mappings: it's reusable from the
repository.

When an Informatica session starts, the transformation is initialized. The
initialization sets up the address processing options, allocates memory, and
opens the files for processing. This operation is only performed once. As each
record is passed into the transformation it is parsed and standardized. Any
output components are created and passed to the next transformation. When
the session ends, the transformation is terminated. The memory is once again
available and the directory files are closed.

The available functions / processes are as follows.

ACE Processing

There are four ACE transformations available to choose from. They parse,
standardize, and append address components using FirstLogic’s ACE Library.
The choice among the three base transformations depends on the input record
layout. A fourth transformation can provide optional components; it must be
attached to one of the three base transformations.

The four transforms are:

1. ACE_discrete - where the input address data is presented in discrete
fields.
2. ACE_multiline - where the input address data is presented in multiple
lines (1-6).
3. ACE_mixed - where the input data is presented with discrete
city/state/zip and multiple address lines (1-6).
4. Optional transform – which is attached to one of the three base
transforms and outputs the additional components of ACE for
enhancement.

All records input into the ACE transformation are returned as output. ACE
returns Error/Status Code information during the processing of each address.
This allows the end user to invoke additional rules before the final load is
completed.

TrueName Process

TrueName mirrors the ACE transformation options with discrete, multi-line
and mixed transformations. A fourth and optional transformation available in
this process can be attached to one of the three transformations to provide
genderization and match standards enhancements. TrueName will generate
error and status codes. Similar to ACE, all records entered as input into the
TrueName transformation can be used as output.

Matching Process

The matching process works through one transformation within the
Informatica architecture. The input data is read into the Informatica data
flow similar to a batch file. All records are read, the break groups are created
and, in the last step, matches are identified. Users set up their own matching
transformation through the PowerCenter Designer by creating an advanced
external procedure transformation. Users are able to select which records are
output from the matching transformations by editing the initialization
properties of the transformation.

All matching routines are predefined and, if necessary, the configuration files
can be accessed for additional tuning. The five predefined matching scenarios
are: individual, family, household (the only difference between household
and family is that household doesn't match on last name), firm individual, and
firm. Keep in mind that matching does not do any data parsing; this must
be accomplished prior to using this transformation. As with ACE and
TrueName, error and status codes are reported.

Trillium

Integration to Trillium’s data cleansing software is achieved through the
Informatica Trillium Advanced External Procedures (AEP) interface.

The AEP modules incorporate the following Trillium functional components.

• Trillium Converter – The Trillium Converter facilitates data
conversion such as EBCDIC to ASCII, integer to character, character
length modification, literal constant and increasing values. It may also
be used to create unique record identifiers, omit unwanted
punctuation, or translate strings based on actual data or mask values.
A user-customizable parameter file drives the conversion process. The
Trillium Converter is a separate transformation that can be used
standalone or in conjunction with the Trillium Parser module.
• Trillium Parser – The Trillium Parser identifies and/or verifies the
components of free-floating or fixed field name and address data. The
primary function of the Parser is to partition the input address records
into manageable components in preparation for postal and census
geocoding. The parsing process is highly table-driven to allow for
customization of name and address identification to specific
requirements.
• Trillium Postal Geocoder – The Trillium Postal Geocoder matches an
address database to the ZIP+4 database of the U.S. Postal Service
(USPS).
• Trillium Census Geocoder – The Trillium Census Geocoder matches
the address database to U.S. Census Bureau information.

Each record that passes through the Trillium Parser external module is first
parsed and then, optionally, postal geocoded and census geocoded. The level
of geocoding performed is determined by a user-definable initialization
property.

• Trillium Window Matcher – The Trillium Window Matcher allows the
PowerCenter Server to invoke Trillium’s deduplication and householding
functionality. The Window Matcher is a flexible tool designed to
compare records to determine the level of likeness between them. The
result of the comparisons is considered a passed, a suspect, or a failed
match depending upon the likeness of data elements in each record,
as well as a scoring of their exceptions.

Input to the Trillium Window Matcher transformation is typically the sorted
output of the Trillium Parser transformation. The options for sorting include:

• Use the Informatica Aggregator transformation as a sort engine.
• Separate the mappings whenever a sort is required. The sort can be
run as a pre- or post-session command between mappings. Pre- and
post-session commands are configured in the Server Manager.
• Build a custom AEP transformation to include in the mapping.
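
To make the idea of window matching on sorted input more concrete, the following Python sketch compares each record with its nearest neighbors in sorted order and labels each pair as passed, suspect, or failed. It is only an illustration of the general technique, not Trillium's algorithm: the similarity measure (difflib's ratio), the thresholds, the window size, and all function names are assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def classify(score, pass_at=0.90, suspect_at=0.75):
    """Map a similarity score to a match category using hypothetical thresholds."""
    if score >= pass_at:
        return "passed"
    if score >= suspect_at:
        return "suspect"
    return "failed"


def window_match(records, window=3):
    """Compare each record with the next (window - 1) records in sorted order."""
    ordered = sorted(records)  # sorting places likely duplicates next to each other
    results = []
    for i, left in enumerate(ordered):
        for right in ordered[i + 1 : i + window]:
            score = SequenceMatcher(None, left, right).ratio()
            results.append((left, right, classify(score)))
    return results


if __name__ == "__main__":
    sample = ["JOHN SMITH 12 MAIN ST", "JON SMITH 12 MAIN ST", "MARY JONES 4 OAK AVE"]
    for left, right, verdict in window_match(sample):
        print(verdict, "|", left, "|", right)
```

The sketch shows why sorted input matters: only neighboring records are compared, so similar records that are not adjacent after sorting are never evaluated against each other.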

Data Connectivity Using PowerConnect for BW Integration Server

Challenge

Understanding how to use PowerCenter Integration Server for BW (PCISBW) to load
data into SAP BW.

Description

PowerCenter supports SAP Business Information Warehouse (BW) as a warehouse
target only. PowerCenter Integration Server for BW enables you to include SAP
Business Information Warehouse targets in your data mart or data warehouse.
PowerCenter uses SAP’s Business Application Program Interface (BAPI), SAP’s
strategic technology for linking components into the Business Framework, to
exchange metadata with BW.

Key Differences of Using PowerCenter to Populate BW Instead of an RDBMS

• BW uses the pull model. BW must request data from an external source
system (in this case, PowerCenter) before the source system can send data to
BW. PowerCenter uses PCISBW to register with BW first, using SAP’s Remote
Function Call (RFC) protocol.
• External source systems provide transfer structures to BW. Data is moved
and transformed within BW from one or more transfer structures to a
communication structure according to transfer rules. Both transfer structures
and transfer rules must be defined in BW prior to use. Normally this is done
from the BW side. An InfoCube is updated by one communication structure as
defined by the update rules.
• The Staging BAPIs (an API published and supported by SAP) are the native
interface for communicating with BW. Three PowerCenter components use this
API. PowerCenter Designer uses the Staging BAPIs to import metadata for the
target transfer structures. PCISBW uses the Staging BAPIs to register with
BW and receive requests to run sessions. PowerCenter Server uses the
Staging BAPIs to perform metadata verification and load data into BW.
• Programs communicating with BW use the SAP standard saprfc.ini file to
communicate with BW. The saprfc.ini file is similar to the tnsnames file in
Oracle or the interface file in Sybase. The PowerCenter Designer reads
metadata from BW and the PowerCenter Server writes data to BW.

• BW requires that all metadata extensions be defined in the BW Administrator
Workbench. The definition must be imported to Designer. An active structure
is the target for PowerCenter mappings loading BW.
• Due to its use of the pull model, BW must control all scheduling. BW invokes
the PowerCenter session when the InfoPackage is scheduled to run in BW.
• BW only supports insertion of data. There is no concept of updates or
deletes through the Staging BAPIs.
• BW supports two different methods for loading data: IDOC and TRFC
(Transactional Remote Function Call). The method must be chosen in BW.
• When using IDOC, all of the processing required to move data from a transfer
structure to an InfoCube (transfer structure to transfer rules to
communication structure to update rules to InfoCubes) is done synchronously
with the InfoPackage.
• When using the TRFC method, you have four options for the data target when you
execute the InfoPackage: 1) InfoCubes only, 2) ODS only, 3) InfoCubes then
ODS, and 4) InfoCubes and ODS in parallel.
• Loading into the ODS is the fastest option because less processing is
performed on the data as it is being loaded into BW. Many customers choose
this option. You can update the InfoCubes later.

Key Steps To Load Data Into BW

1. Install and Configure PowerCenter and PCISBW Components

The PCISBW server must be installed in the same directory as the PowerCenter
Server. On NT you can have only one PCISBW. Informatica recommends installing
PCISBW client tools in the same directory as the PowerCenter Client. For more
details on installation and configuration refer to the Installation Guide.

2. Build the BW Components

Step 1: Create an External Source System

Step 2: Create an InfoSource

Step 3: Assign an External Source System

Step 4: Activate the InfoSources

Hint: You do not normally need to create an external source system or an
InfoSource. The BW administrator or project manager should tell you the name of
the external source system and the InfoSource targets.

3. Configure the saprfc.ini file

The saprfc.ini file is required for PowerCenter and PCISBW to connect to BW.
You need the same saprfc.ini on both the PowerCenter Server and the PowerCenter
Client.

4. Start the PCISBW server

Start the PCISBW server only after you start the PowerCenter Server and before
you create the InfoPackage in BW.

pmbwserver [DEST_Entry_for_R_type] [repo_user] [repo_passwd] [port_for_PowerCenter_Server]

Note: Appending the & sign to the start command does not work when you start
PCISBW in a Telnet session.
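
The argument order in the start syntax above can be captured in a small helper. The following Python sketch only assembles the argument list; it does not start anything. The function name, the lowercase executable name, and every sample value are assumptions for illustration.

```python
def build_pmbwserver_command(dest_r_entry, repo_user, repo_password, server_port,
                             executable="pmbwserver"):
    """Return the argument list in the order given by the start syntax."""
    # Order: type R DEST entry, repository user, repository password, server port.
    return [executable, dest_r_entry, repo_user, repo_password, str(server_port)]


if __name__ == "__main__":
    # Hypothetical values; substitute the DEST name from your saprfc.ini and real credentials.
    command = build_pmbwserver_command("PCISBW_R", "repo_admin", "secret", 4001)
    print(" ".join(command))
```

Keeping the arguments as a list avoids quoting problems if the list is later handed to a process launcher such as subprocess.run.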

5. Build mappings

Import the InfoSource into PowerCenter Warehouse Designer and build a mapping
using the InfoSource as a target. Use the DEST_for_A_type as connect string.

6. Create a Database connection

Use the DEST entry for type A in the saprfc.ini file as the connect string in
the PowerCenter Server Manager.

7. Load data

Create a session in PowerCenter and an InfoPackage in BW. You can only start a
session from BW (Scheduler in the Administrator Workbench of BW). Before you can
start a session, you have to enter the session name into BW. To do this, open the
Scheduler dialog box, go to the “Selection 3rd Party” tab, and click the “Selection
Refresh” button (the symbol is a recycling sign), which then prompts you for the
session name. To start the session, go to the last tab.

Parameter and Connection information file - Saprfc.ini

PowerCenter uses two types of entries to connect to BW through the saprfc.ini file:

• Type A. Used by PowerCenter Client and PowerCenter Server. Specifies the
BW application server. The Client uses Type A to import the transfer
structure (table definition) from BW into the Designer. The Server uses Type
A to verify the tables and write to BW.
• Type R. Used by the PowerCenter Integration Server for BW. Registers
PCISBW as an RFC server at the SAP gateway so that it acts as a listener. It
can then receive requests from BW to run a session on the PowerCenter Server.

Do not use Notepad to edit this file. Notepad can corrupt the saprfc.ini file.

Set the RFC_INI environment variable on all Windows NT, Windows 2000, and Windows
95/98 machines to the location of the saprfc.ini file. RFC_INI is used to locate
saprfc.ini.
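
As an illustration of the two entry types, the following Python sketch parses saprfc.ini-style text and checks that each entry carries the keys its type usually needs. The key names (DEST, TYPE, ASHOST, SYSNR, PROGID, GWHOST, GWSERV) follow SAP's commonly documented saprfc.ini layout and should be confirmed against the SAP documentation for your release. All host names, DEST names, and program IDs in the sample are hypothetical.

```python
SAMPLE = """\
DEST=BW_TYPE_A
TYPE=A
ASHOST=bwhost.example.com
SYSNR=00

DEST=BW_TYPE_R
TYPE=R
PROGID=PCISBW_PROGRAM
GWHOST=bwhost.example.com
GWSERV=sapgw00
"""

# Assumed required keys per entry type, in addition to DEST and TYPE.
REQUIRED = {"A": {"ASHOST", "SYSNR"}, "R": {"PROGID", "GWHOST", "GWSERV"}}


def parse_entries(text):
    """Split saprfc.ini-style text into one dict per DEST entry."""
    entries, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or "=" not in line:
            continue  # skip blank lines and anything that is not KEY=VALUE
        key, value = (part.strip() for part in line.split("=", 1))
        if key.upper() == "DEST":
            current = {}  # a DEST line starts a new entry
            entries.append(current)
        if current is not None:
            current[key.upper()] = value
    return entries


def missing_keys(entry):
    """Return the assumed-required keys that an entry lacks for its TYPE."""
    required = REQUIRED.get(entry.get("TYPE", ""), set())
    return sorted(required - set(entry))


if __name__ == "__main__":
    for entry in parse_entries(SAMPLE):
        print(entry["DEST"], "type", entry.get("TYPE"), "missing:", missing_keys(entry))
```

A check like this can be run against the file that RFC_INI points to before starting PCISBW, to catch an incomplete type A or type R entry early.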

Restrictions on Mappings with BW InfoSource Targets

• You cannot use BW as a lookup table.
• You can use only one transfer structure for each mapping.
• You cannot execute a stored procedure in a BW target.
• You cannot partition pipelines with a BW target.
• You cannot copy fields that are prefaced with /BIC/ from the InfoSource
definition into other transformations.
• You cannot build an update strategy into a mapping. BW supports only inserts;
it does not support updates or deletes. You can use an Update Strategy
transformation in a mapping, but the PCISBW Server attempts to insert all
records, even those marked for update or delete.

Error Messages

PCISBW writes error messages to the screen. In some cases PCISBW will generate a
file with extension *.trc in the PowerCenter Server directory. Look for error
messages there.

Data Connectivity using PowerConnect for Mainframe

Challenge

Accessing important, but difficult to deal with, legacy data sources residing on
mainframes and AS/400 systems, without having to write complex extract programs.

Description

When integrated with PowerCenter, PowerConnect for Mainframe and AS400
provides fast and seamless SQL access to non-relational sources, such as
VSAM, flat files, ADABAS, IMS and IDMS, as well as to relational sources,
such as DB2. It is an agent-based piece of software infrastructure that must
be installed on OS/390 or AS/400 as either a regular batch job or started
task. In addition, the PowerConnect client agent must be installed on the
same machine as the PowerCenter client or server.

The PowerConnect client agent and PowerCenter communicate via a thin
ODBC layer, so that as far as PowerCenter is concerned, the mainframe or
AS400 data is just a regular ODBC data source. The ODBC layer works for
both Windows and UNIX. The PowerConnect client agent and listener work in
tandem and, using TCP/IP, move the data at high speed between the two
platforms in either direction. The data can also be compressed and encrypted
as it is being moved.

PowerConnect for Mainframe/AS400 has a Windows design tool, called
Navigator, which can directly import the following information, via
“datamaps”, without using FTP:

• COBOL and PL/1 copybooks
• Database definitions (DBDs) for IMS
• Subschemas for IDMS
• FDTs, DDMs, PREDICT data and ADA-CMP data for ADABAS
• Physical file definitions (DDS’s) for AS/400

After the above information has been imported and saved in the datamaps,
PowerCenter uses SQL to access the data – which it sees as relational tables -
at runtime.

Some of the key capabilities of PowerConnect for Mainframe/AS400 include:

• Full EBCDIC-ASCII conversion
• Multiple concurrent data movements
• Support of all binary mainframe datatypes (e.g. packed decimal)
• Ability to handle complex data structures, such as COBOL OCCURS,
OCCURS DEPENDING ON, ADABAS MU and PE
• Support for REDEFINES
• Date/time field masking
• Multiple views from single data source
• Bad data checking
• Data filtering
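
Two of the capabilities listed above, EBCDIC-ASCII conversion and support for packed decimal fields, can be illustrated outside the product. The Python sketch below is a generic illustration, not PowerConnect's implementation. It assumes the cp037 (US/Canada) EBCDIC code page, treats sign nibbles 0xB and 0xD as negative, and performs no validation of digit nibbles. The function names are hypothetical.

```python
from decimal import Decimal


def ebcdic_to_text(raw, codec="cp037"):
    """Decode EBCDIC bytes to text; cp037 is an assumed code page."""
    return raw.decode(codec)


def unpack_comp3(raw, scale=0):
    """Decode a packed decimal (COMP-3) field.

    Each byte holds two decimal digits, except the last byte, which holds
    one digit in its high nibble and the sign in its low nibble.
    """
    digits = []
    for byte in raw[:-1]:
        digits.append(byte >> 4)
        digits.append(byte & 0x0F)
    digits.append(raw[-1] >> 4)
    sign_nibble = raw[-1] & 0x0F
    value = int("".join(str(d) for d in digits))
    if sign_nibble in (0x0B, 0x0D):  # negative sign nibbles; others treated as non-negative
        value = -value
    # Shift the decimal point left by `scale` implied decimal places.
    return Decimal(value).scaleb(-scale)


if __name__ == "__main__":
    print(ebcdic_to_text(bytes.fromhex("C8C5D3D3D6")))
    print(unpack_comp3(bytes.fromhex("12345C"), scale=2))
```

The scale argument plays the role of the implied decimal point in a COBOL picture clause such as PIC S9(3)V99 COMP-3, which is why the copybook definition is needed to interpret the raw bytes.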

Steps for Using the Navigator

If your objective is to import a COBOL copybook from OS/390, the process is
as follows:

1. Create the datamap (give it a name).

2. Specify the copybook name to be imported. This is the physical view.

3. Run the import process.

A relational table is created. This is the logical view.

4. Review and edit (if necessary) the default table created.

5. Perform a “row test” to source the data directly from OS/390.

The datamap is stored on the mainframe.

Installing PowerConnect for Mainframe/AS400

Note: Be sure to complete the Pre-Install Checklist (included at the end of
this document) prior to performing the install.

1. Perform the Windows install. This includes entering the Windows license
key, updating the configuration file (dbmover.cfg) to add a node entry for
communication between the client and the mainframe or AS/400, adding the
PowerConnect ODBC driver and setting up a client ODBC DSN.

2. Perform the mainframe or AS/400 install. This includes entering the
mainframe or AS/400 license key and updating the configuration file
(dbmover.cfg) to change various default settings.

3. Start the Listener on the mainframe or the AS/400 system.

4. Ping the mainframe or AS/400 from Windows to ensure connectivity.

5. Access sample data in Navigator as a test.

6. Perform the UNIX or NT install. This includes entering the UNIX or NT
license key, updating the configuration file (dbmover.cfg) to change various
default settings, adding the PowerConnect ODBC driver and setting up the
server ODBC DSN.

Guidelines for Integrating PowerConnect for Mainframe/AS400 with PowerCenter

• In Server Manager, a database connection is required to allow the
server to communicate with PowerConnect. This should be of type
ODBC. The DSN name and connect string should be the same as
PowerConnect’s ODBC DSN, which was created when PowerConnect
was installed.
• Since the Informatica server communicates with PowerConnect via
ODBC, an ODBC license key is required.
• The “import from database” option in Designer is needed to pull in
sources from PowerConnect, along with the PowerConnect ODBC
DSN that was created when PowerConnect was installed.
• In Designer, before importing a source from PowerConnect for the first
time, edit the powermrt.ini file by adding this entry at the end of the
ODBCDLL section: DETAIL=EXTODBC.DLL
• When creating sessions in the Server Manager, modify the Tablename
prefix in the Source Options to include the PowerConnect
high-level qualifier (schema name).
• If entering a custom SQL override in the Source Qualifier to filter
PowerConnect data, the statement must be qualified with the
PowerConnect high-level qualifier (schema name).
• To handle large data sources, increase the default TIMEOUT setting in
the PowerConnect configuration files (dbmover.cfg) to
(15,1800,1800).
• To ensure smooth integration, apply the PowerCenter-PowerConnect
for Mainframe/AS400 ODBC EBF.

Data Connectivity using PowerConnect for MQSeries

Challenge

Understanding how to use MQSeries Applications in PowerCenter mappings.

Description

MQSeries Applications communicate by sending each other messages rather than
calling each other directly. Applications can also request data using a ‘request
message’ on a message queue. Because no open connections are needed between
systems, they can run independently of one another. MQSeries enforces no structure
on the content or format of the message; this is defined by the application.

Not Available to PowerCenter when using MQSeries

• No Lookup on MQSeries sources.
• No Debug ‘Sessions’. You must use an actual Server Manager session to debug a
queue mapping.
• Certain considerations are also necessary when using Aggregator, Joiner, and
Rank transformations because they will only be performed on one queue, as
opposed to a full data set.

MQSeries Architecture

MQSeries architecture has three parts: (1) Queue Manager, (2) Message Queue and
(3) MQSeries Message.

Queue Manager

• Informatica connects to Queue Manager to send and receive messages.
• Every message queue belongs to a Queue Manager.
• Queue Manager administers queues, creates queues, and controls queue
operation.

Message Queue is a destination to which messages can be sent.

MQSeries Message has two components:

• A header, which contains data about the queue.
• A data component, which contains the application data or the ‘message body.’

Extraction from a Queue

In order for PowerCenter to extract from a queue, the message data must be in
COBOL, XML, flat file, or binary format. When extracting from a queue, you need to
use either of two Source Qualifiers: MQ Source Qualifier (MQ SQ) or Associated
Source Qualifier (SQ).

MQ SQ – Must be used to read data from an MQ source. MQ SQ is predefined and
comes with 29 message header fields. MSGID is the primary key. You cannot use an
MQ SQ to join two MQ sources.

MQ SQ can perform the following tasks:

• Select Associated Source Qualifier - this is necessary if the file is not binary.
• Set Tracing Level - verbose, normal, etc.
• Set Message Data Size – default 64,000; used for binary.
• Filter Data – set filter conditions to filter messages using message header
ports, control end of file, control incremental extraction, and control syncpoint
queue clean-up.
• Use mapping parameters and variables

Associated SQ – either an Associated SQ (XML, Flat File) or Normalizer (COBOL) is
required if the data is not in binary. If an Associated SQ is used, design the mapping
as if it were not using MQSeries, then add the MQ Source and Source Qualifier after
the mapping logic has been tested. Once the code is working correctly, test by
actually pulling data from the queue.

Loading to a Queue

There are two types of MQ Targets that can be used in a mapping: Static MQ Targets
and Dynamic MQ Targets. Only one type of MQ Target can be used in a single
mapping.

• Static MQ Targets – Do not load data to the message header fields.
Use the target definition specific to the
format of the message data (i.e., flat file, XML, COBOL). Design the mapping
as if it were not using MQSeries, then make all adjustments in the session
when using MQSeries.
• Dynamic MQ Targets – Used for binary targets only and when loading data to a
message header. Note that certain message headers in an MQSeries message require a
predefined set of values assigned by IBM.

Creating and Configuring MQSeries Sessions

After you create mappings in the Designer, you can create and configure sessions in
the Server Manager. You can create a session with an MQSeries mapping using the
Session Wizard in the Server Manager.



Configuring MQSeries Sources

MQSeries mappings cannot be partitioned if an associated source qualifier is used.

For MQ Series sources, the Source Type is set to the following:

• Heterogeneous when there is an associated source definition in the mapping.
This indicates that the source data is coming from an MQ source, and the
message data is in flat file, COBOL or XML format.
• Message Queue when there is no associated source definition in the mapping.

Note that there are two pages on the Source Options dialog: XML and MQSeries. You
can alternate between the two pages to set configurations for each.

Configuring MQSeries Targets

For Static MQSeries Targets, select File Target type from the list. When the target is
an XML file or XML message data for a target message queue, the target type is
automatically set to XML.

• If you load data to a dynamic MQ target, the target type is automatically
set to Message Queue.
• On the MQSeries page, select the MQ connection to use for the source
message queue, and click OK.
• Be sure to select the MQ checkbox in Target Options for the Associated file
type. Once this is done, click Edit Object Properties and enter:
    • The connection name of the target message queue.
    • The format of the message data in the target queue (e.g., MQSTR).
    • The number of rows per message (only applies to flat file MQ targets).

Appendix Information

PowerCenter uses the following datatypes in MQSeries mappings:

• IBM MQSeries datatypes. IBM MQSeries datatypes appear in the MQSeries
source and target definitions in a mapping.
• Native datatypes. Flat file, XML, or COBOL datatypes associated with
MQSeries message data. Native datatypes appear in flat file, XML and COBOL
source definitions. Native datatypes also appear in flat file and XML target
definitions in the mapping.
• Transformation datatypes. Transformation datatypes are generic datatypes
that PowerCenter uses during the transformation process. They appear in all
the transformations in the mapping.



IBM MQSeries Datatypes

MQSeries Datatypes    Transformation Datatypes
MQBYTE                BINARY
MQCHAR                STRING
MQLONG                INTEGER



Data Connectivity using PowerConnect for PeopleSoft

Challenge

To maintain data integrity by sourcing/targeting transactional PeopleSoft systems.
Also, to maintain consistent, reusable metadata across various systems and to
understand the process for extracting data and metadata from PeopleSoft sources
without having to write and sustain complex SQR extract programs.

Description

PowerConnect for PeopleSoft supports extraction from PeopleSoft systems.
PeopleSoft saves metadata in tables that provide a description and logical
view of the data stored in the underlying physical database tables.
PowerConnect for PeopleSoft uses SQL to communicate with the database server.

PowerConnect for PeopleSoft:

• Imports PeopleSoft source definition metadata via PowerCenter
Designer using ODBC to connect to PeopleSoft tables.
• Extracts data during a session by running directly against the physical
database tables using the PowerCenter Server.
• Extracts data from PeopleSoft systems without compromising existing
PeopleSoft security features.

Installing PowerConnect for PeopleSoft

Installation of PowerConnect for PeopleSoft is a multi-step process. To begin,
both the PowerCenter Client and Server have to be set up and configured.
Certain drivers that enable PowerCenter to extract source data from
PeopleSoft systems also need to be installed. The overall process involves:

Installing PowerConnect for PeopleSoft for the PowerCenter Server:

• As with other Informatica products, installation is straightforward. Log
onto the Server machine on Windows NT/2000 or UNIX and run the setup
program to select and install the PowerConnect for PeopleSoft Server.
• On UNIX, make sure to set up the PATH environment variable to
include the current directory.



Installing PowerConnect for PeopleSoft for the PowerCenter Client:

• Run the setup program and select PowerConnect for PeopleSoft client
from the setup list.
• By default, the client installation wizard points to the PowerCenter Client
directory for the driver installation, with the option to change the
location.

Importing Sources

PowerConnect for PeopleSoft aids data integrity by sourcing/targeting transactional
PeopleSoft systems and by maintaining reusable consistent metadata across various
systems. While importing the PeopleSoft objects, PowerConnect for PeopleSoft also
imports the metadata attached to those PeopleSoft structures.

PowerConnect for PeopleSoft extracts source data from two types of PeopleSoft
objects:

• Records
• Trees

PeopleSoft Records

A PeopleSoft record is a table-like structure that contains columns with defined
datatypes, precision, scale and keys. PowerConnect for PeopleSoft imports
the following types of PeopleSoft records:

• SQL table. Has a one-to-one relationship with an underlying physical table.
• SQL view. Provides an alternative view of information in one or more
database tables. Key columns contain duplicate values.

PeopleSoft names the underlying database tables after the records, in the form
PS_Record_Name. For example, data for the PeopleSoft record AE_REQUEST is
saved in the PS_AE_REQUEST database table.

When you import a PeopleSoft record, the Designer imports both the PeopleSoft
source name and the underlying database table name. The Designer uses the PS
source name as the name of the source definition. The PowerCenter Server uses the
underlying database table name to extract source data.
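
The naming rule above can be sketched in a few lines of Python. This is only an illustration, not part of PowerConnect: the function name and the `unprefixed` parameter are hypothetical, and the record PSTREENODE is used purely as an example of a table assumed to be stored without the PS_ prefix.

```python
def physical_table_name(record_name, unprefixed=frozenset()):
    """Return the database table name assumed for a PeopleSoft record.

    By default the table is named PS_<record name>. `unprefixed` is a
    hypothetical set of records known to be stored without the prefix.
    """
    if record_name in unprefixed:
        return record_name
    return "PS_" + record_name


# Example from the text: record AE_REQUEST is stored in PS_AE_REQUEST.
print(physical_table_name("AE_REQUEST"))
# Illustrative exception: a record listed as stored without the prefix.
print(physical_table_name("PSTREENODE", unprefixed={"PSTREENODE"}))
```

Keeping the exceptions in one explicit set makes it easier to see which sources will need a join override in the mapping.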

PeopleSoft Trees

A PeopleSoft tree is an object that defines the groupings and hierarchical
relationships between the values of a database field. A tree defines the
summarization rules for a database field. It specifies how the values of a database
field are grouped together for purposes of reporting or for security access. For
example, the values of the DEPTID field identify individual departments in your
organization. You can use the Tree Manager to define the organizational hierarchy
that specifies how each department relates to the other departments. For example,
departments 10700 and 10800 report to the same manager, department 20200 is
part of a different division, and so on. In other words, you build a tree that mirrors
the hierarchy.

Types of Trees

The Tree Manager enables you to create many kinds of trees for a variety of
purposes, but all trees fall into these major types:

• Detail trees, in which database field values appear as detail values.
• Summary trees, which provide an alternative way to group nodes from an
existing detail tree, without duplicating the entire tree structure.
• Node-oriented trees, in which database field values appear as tree nodes.
• Query access trees, which organize record definitions for PeopleSoft Query
security.

PowerConnect for PeopleSoft extracts data from the following PeopleSoft tree
structure types:

Detail Trees: In the most basic type of tree, the "lowest" level is the level farthest
to the right in the Tree Manager window, and holds detail values. The next level is
made up of tree nodes that group together the detail values, and each subsequent
level defines a higher level grouping of the tree nodes. This kind of tree is called a
detail tree. PowerConnect for PeopleSoft extracts data from loose-level and strict-
level detail trees with static detail ranges.

Winter Trees: PowerConnect for PeopleSoft extracts data from loose-level and
strict-level node-oriented trees. Winter trees contain no detail ranges.

Summary Trees: In a summary tree, the detail values aren't values from a
database field, but tree nodes from an existing detail tree. The tree groups the nodes
from a specific level in the detail tree differently from the higher levels in the detail
tree itself. PowerConnect for PeopleSoft extracts data from loose-level and
strict-level summary trees.

Node-Oriented Trees: In a node-oriented tree, the tree nodes represent the data
values from the database field. The Departmental Security tree in PeopleSoft HRMS
is a good example of a node-oriented tree.

Query Access Trees: Query access trees are used to maintain security within the
PeopleSoft implementation. PeopleSoft records are grouped into logical groups, which
are represented as nodes on the tree. This way, a query written by a logged-in user
can only access rows from the records assigned to the groups that the user has
access to. There are no branches in query trees, but child nodes can exist.

Flattening trees

When you extract data from a PeopleSoft tree, the PowerCenter Server
denormalizes the tree structure. It uses either of the following methods to
denormalize trees.



• Horizontal flattening: The PowerCenter Server creates a single row
for each final branch node or detail range in the tree. You can only use
horizontal flattening with strict-level trees.
• Vertical flattening: The PowerCenter Server creates a row for each
node or detail range represented in the tree. You can use vertical
flattening with both strict-level and loose-level trees.
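
The difference between the two methods can be illustrated with a small Python sketch. It is a conceptual model only: the tree, the node names, and the row layouts are assumptions made for illustration and do not reproduce the columns that the PowerCenter Server actually generates.

```python
# Hypothetical strict-level tree: every final node sits at the same depth.
TREE = {
    "ALL_DEPTS": ["DIV_EAST", "DIV_WEST"],
    "DIV_EAST": ["10700", "10800"],
    "DIV_WEST": ["20200"],
}
ROOT = "ALL_DEPTS"


def flatten_horizontal(tree, root):
    """One row per final branch: the path from the root, one column per level."""
    rows = []

    def walk(node, path):
        children = tree.get(node, [])
        if not children:
            rows.append(tuple(path))
            return
        for child in children:
            walk(child, path + [child])

    walk(root, [root])
    return rows


def flatten_vertical(tree, root):
    """One row per node: (node, parent, level)."""
    rows = []

    def walk(node, parent, level):
        rows.append((node, parent, level))
        for child in tree.get(node, []):
            walk(child, node, level + 1)

    walk(root, None, 1)
    return rows


for row in flatten_horizontal(TREE, ROOT):
    print(row)
for row in flatten_vertical(TREE, ROOT):
    print(row)
```

In this sketch the horizontal rows line up into a fixed set of columns only because every path has the same length, which is consistent with the restriction of horizontal flattening to strict-level trees. The vertical rows have the same shape regardless of depth.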

Tree Levels        Flattening Method  Tree Structure                    Metadata Extraction Method
Strict-level tree  Horizontal         Detail, Winter and Summary Trees  Import Source definition
Strict-level tree  Vertical           Detail, Winter and Summary Trees  Create Source definition
Loose-level tree   Vertical only      Detail, Winter and Summary Trees  Create Source definition

Extracting Data from PeopleSoft

PowerConnect for PeopleSoft extracts data from PeopleSoft systems without
compromising existing PeopleSoft security.

To access PeopleSoft metadata and data, PowerCenter Client and Server require a
database username and password. You can either create separate users for
metadata and source extraction or alternatively use one for both. Extracting data
from PeopleSoft is a three-step process:

1. Import or create source definition
2. Create mapping
3. Create and run a session

1. Importing or Creating Source Definitions

Before extracting data from a source, you need to import its source definition. You
need a user with read access to the PeopleSoft system to access the PeopleSoft
physical and metadata tables via an ODBC connection. To import a PeopleSoft source
definition, create an ODBC data source for each PeopleSoft system you want to
access.

When creating an ODBC data source, configure the data source to connect to the
underlying database for the PeopleSoft system. For example, if the PeopleSoft system
resides on an Oracle database, configure an ODBC data source to connect to the
Oracle database.

Use the Sources-Import command in PowerCenter Designer’s Source Analyzer tool
to import PeopleSoft records and strict-level trees. You can use the database system
names for ODBC names.

Note: If PeopleSoft already establishes database connection names, use the
PeopleSoft database connection names.



After you import or create a PeopleSoft record or tree, the Navigator displays and
organizes sources by the PeopleSoft record or tree name by default.

PeopleTools-based applications are table-based systems. A database for a
PeopleTools application contains three major sets of tables:

• System Catalog Tables store physical attributes of tables and views, which
your database management system uses to optimize performance.
• PeopleTools Tables contain information that you define using PeopleTools.
• Application Data Tables house the actual data your users will enter and
access through PeopleSoft application windows and panels.

Importing Records

You can import records from two tabs in the Import from PeopleSoft dialog box:

• Records tab.
• Panels tab.

Note: PowerConnect for PeopleSoft works with all versions of PeopleSoft systems. In
PeopleSoft 8, Panels are referred to as Pages. PowerConnect for PeopleSoft uses the
Panels tab to import PeopleSoft 8 Pages.

2. Create a Mapping

After you import or create the source definition, you connect it to an ERP Source
Qualifier to represent the records the PowerCenter Server queries from a PeopleSoft
source. An ERP Source Qualifier is used for all ERP sources, such as SAP and
PeopleSoft. Like the standard Source Qualifier, an ERP Source Qualifier allows you
to use user-defined joins and filters.

When you use the default join option between two PeopleSoft tables, the generated
query automatically adds a PS_ prefix to the PeopleSoft table names. However,
certain tables are stored in the database without that prefix, so an override and
a user-defined join are needed to correct this.

Take care when using user-defined primary-foreign key relationships with trees,
since changes made within Tree Manager may alter such relationships. Because such
changes also alter the denormalization of the tables that make up the tree, simply
altering the primary-foreign key relationship within the Source Analyzer can be
dangerous; it is advisable to re-import the whole tree.

3. Creating and Running a Session

You need a valid mapping, a registered PowerCenter Server, and a Server Manager
database connection to create a session. When you configure the session, select
PeopleSoft as the source database type and then select a PeopleSoft database
connection as the source database. If the database user is not the owner of the
source tables, enter the table owner name in the session as a source table prefix.



Note: If the mapping contains a Source or ERP Qualifier with a SQL Override, the
PowerCenter Server ignores the table name prefix setting for all connected sources.

PowerCenter uses SQL to extract data directly from the physical database tables,
performing code page translations when necessary. If you need to extract a large
amount of source data, you can partition the sources to improve session
performance.

Note: You cannot partition an ERP Source Qualifier for PeopleSoft when it is
connected to or associated with a PeopleSoft tree.



Data Connectivity using PowerConnect for SAP

Challenge

Understanding how to install PowerConnect for SAP R/3, extract data from SAP R/3,
build mappings, and run sessions to load SAP R/3 data into a data warehouse.

Description

SAP R/3 is a software system that integrates multiple business applications,
such as Financial Accounting, Materials Management, Sales and Distribution,
and Human Resources. The R/3 system is programmed in Advanced Business
Application Programming-Fourth Generation (ABAP/4, or ABAP), a language
proprietary to SAP.

PowerConnect for SAP R/3 provides the ability to integrate SAP R/3 data into
data warehouses, analytic applications, and other applications. All of this is
accomplished without writing ABAP code. PowerConnect extracts data from
transparent tables, pool tables, cluster tables, hierarchies (uniform and
non-uniform), SAP IDOCs, and ABAP function modules.

The database server stores the physical tables in the R/3 system, while the
application server stores the logical tables. A transparent table definition on
the application server is represented by a single physical table on the
database server. Pool and cluster tables are logical definitions on the
application server that do not have a one-to-one relationship with a physical
table on the database server.

Communication Interfaces

TCP/IP is the native communication interface between PowerCenter and SAP
R/3. Other interfaces between the two include:

• Common Program Interface-Communications (CPI-C). The CPI-C
communication protocol enables online data exchange and data
conversion between the R/3 system and PowerCenter. To initialize CPI-C
communication with PowerCenter, SAP R/3 requires information such
as the host name of the application server and SAP gateway. This
information is stored on the PowerCenter Server in a configuration file
named sideinfo. The PowerCenter Server uses parameters in the
sideinfo file to connect to the R/3 system when running stream mode
sessions.
• Remote Function Call (RFC). RFC is the remote communication
protocol used by SAP and is based on RPC (Remote Procedure Call). To
execute remote calls from PowerCenter, SAP R/3 requires information
such as the connection type, and the service name and gateway on
the application server. This information is stored on the PowerCenter
Client and PowerCenter Server in a configuration file named saprfc.ini.
PowerCenter makes remote function calls when importing source
definitions, installing ABAP programs, and running file mode sessions.

Transport system. The transport system in SAP is a mechanism to transfer
objects developed on one system to another system. There are two situations
in which the transport system is needed:

• PowerConnect for SAP R/3 installation.
• Transport of ABAP programs from development to production.

Note: If the ABAP programs are installed in the $TMP development class, they
cannot be transported from development to production.

Extraction Process

R/3 source definitions can be imported from the logical tables using RFC
protocol. Extracting data from R/3 is a four-step process:

1. Import source definitions.

Designer connects to the R/3 application server using RFC. The Designer calls
a function in the R/3 system to import source definitions.

2. Create a mapping.

When creating a mapping using an R/3 source definition, you must use an
ERP Source Qualifier. In the ERP Source Qualifier, you can customize
properties of the ABAP program that the R/3 server uses to extract source
data. You can also use joins, filters, ABAP program variables, ABAP code
blocks, and SAP functions to customize the ABAP program.

3. Generate and install ABAP program.

Two ABAP programs can be installed for each mapping:

• File mode. Extract data to file. The PowerCenter Server accesses the
file through FTP or NFS mount.
• Stream Mode. Extract data to buffers. The PowerCenter server
accesses the buffers through CPI-C, the SAP protocol for program-to-
program communication.



4. Create and Run Session. (File or Stream mode)

• Stream Mode. In stream mode, the installed ABAP program creates
buffers on the application server. The program extracts source data
and loads it into the buffers. When a buffer fills, the program streams
the data to the PowerCenter Server using CPI-C. With this method, the
PowerCenter Server can process data as it is received.
• File Mode. When running a session in file mode, the session must be
configured to access the file through NFS mount or FTP. When the
session runs, the installed ABAP program creates a file on the
application server. The program extracts source data and loads it into
the file. When the file is complete, the PowerCenter Server accesses
the file through FTP or NFS mount and continues processing the
session.
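
The practical difference between the two modes is when the consumer can begin work. The following Python sketch is only an analogy under stated assumptions: `stream_mode` and `file_mode` are hypothetical names, a generator stands in for the CPI-C buffers, and a temporary local file stands in for the staging file on the application server.

```python
import os
import tempfile


def stream_mode(rows, buffer_size):
    """Yield each buffer as soon as it fills, so the reader can start early."""
    buffer = []
    for row in rows:
        buffer.append(row)
        if len(buffer) == buffer_size:
            yield buffer
            buffer = []
    if buffer:
        yield buffer  # final, partially filled buffer


def file_mode(rows):
    """Write every row to a staging file first; read it only when complete."""
    fd, path = tempfile.mkstemp(suffix=".dat")
    with os.fdopen(fd, "w") as staging:
        for row in rows:
            staging.write(row + "\n")
    with open(path) as staging:
        data = [line.rstrip("\n") for line in staging]
    os.remove(path)
    return data


source = ["row1", "row2", "row3"]
for chunk in stream_mode(source, buffer_size=2):
    print("received buffer:", chunk)
print("received file:", file_mode(source))
```

In the sketch, the stream consumer sees its first buffer after two rows, while the file consumer sees nothing until all rows have been written.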

Installation and Configuration Steps

For SAP R/3

The R/3 system needs development objects and user profiles established to
communicate with PowerCenter. Preparing R/3 for integration involves the
following tasks:

• Transport the development objects on the PowerCenter CD to R/3.
PowerCenter calls these objects each time it makes a request to the
R/3 system.
• Run the transport program that generates unique IDs.
• Establish profiles in the R/3 system for PowerCenter users.
• Create a development class for the ABAP programs that PowerCenter
installs on the SAP R/3 system.

For PowerCenter

The PowerCenter Server and Client need drivers and connection files to
communicate with SAP R/3. Preparing PowerCenter for integration involves
the following tasks:

• Run installation programs on PowerCenter Server and Client machines.
• Configure the connection files:
    • The sideinfo file on the PowerCenter Server allows PowerCenter to
    initiate CPI-C with the R/3 system.
    • The saprfc.ini file on the PowerCenter Client and Server allows
    PowerCenter to connect to the R/3 system as an RFC client.

Required Parameters for sideinfo

• DEST – logical name of the R/3 system.
• LU – host name of the SAP application server machine.
• TP – set to sapdp<system number>.
• GWHOST – host name of the SAP gateway machine.
• GWSERV – set to sapgw<system number>.
• PROTOCOL – set to “I” for TCP/IP connection.

Required Parameters for saprfc.ini

• DEST – logical name of the R/3 system.
• TYPE – set to “A” to indicate connection to a specific R/3 system.
• ASHOST – host name of the SAP R/3 application server.
• SYSNR – system number of the SAP R/3 application server.
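
As a rough illustration of how these parameters fit together, the Python sketch below renders one entry for each file. It assumes a simple one-parameter-per-line KEY=value layout, and the function names, the destination name, and the host names are hypothetical placeholders; check the files shipped with your installation for the exact layout.

```python
def render_sideinfo(dest, app_host, gw_host, sysnr):
    """Render a sideinfo entry; sysnr is the two-digit R/3 system number."""
    nn = f"{sysnr:02d}"
    lines = [
        f"DEST={dest}",
        f"LU={app_host}",
        f"TP=sapdp{nn}",
        f"GWHOST={gw_host}",
        f"GWSERV=sapgw{nn}",
        "PROTOCOL=I",  # "I" selects a TCP/IP connection
    ]
    return "\n".join(lines) + "\n"


def render_saprfc_ini(dest, app_host, sysnr):
    """Render a saprfc.ini entry for a connection to a specific R/3 system."""
    lines = [
        f"DEST={dest}",
        "TYPE=A",  # "A" indicates a specific application server
        f"ASHOST={app_host}",
        f"SYSNR={sysnr:02d}",
    ]
    return "\n".join(lines) + "\n"


# Placeholder values for illustration only.
print(render_sideinfo("sapr3", "sapapp01", "sapgw01", 0))
print(render_saprfc_ini("sapr3", "sapapp01", 0))
```

Generating both entries from the same inputs keeps DEST and the system number consistent between the two files.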

Configuring the Services File

On NT, the services file is located in \winnt\system32\drivers\etc

On UNIX, it is located in /etc

Add an entry for each service:

• sapdp<system number> <port# of dispatcher service>/TCP
• sapgw<system number> <port# of gateway service>/TCP

The system number and port numbers are provided by the BASIS
administrator.
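
A short Python sketch of the two entries follows. The function name is hypothetical, and the port numbers in the example are placeholders only; use the values supplied by the BASIS administrator. The protocol is written in lowercase here, as is conventional in services files.

```python
def services_entries(sysnr, dispatcher_port, gateway_port):
    """Return the two services-file lines for one R/3 system number."""
    nn = f"{sysnr:02d}"
    return [
        f"sapdp{nn}\t{dispatcher_port}/tcp",
        f"sapgw{nn}\t{gateway_port}/tcp",
    ]


# Placeholder ports; the real values come from the BASIS administrator.
for line in services_entries(0, 3200, 3300):
    print(line)
```

The service names produced here must match the TP and GWSERV values used in the sideinfo file.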

Configure Connections to run Sessions

Configure database connections in the Server Manager to access the SAP R/3
system when running a session.

Configure an FTP connection to access the staging file through FTP.



Steps to Configure PowerConnect on PowerCenter

1. Install PowerConnect for SAP R/3 on PowerCenter.
2. Configure the sideinfo file.
3. Configure the saprfc.ini file.
4. Set the RFC_INI environment variable.
5. Configure the database connection to run the session.
6. Configure the FTP connection to access staging files through FTP.

Key Capabilities of PowerConnect for SAP R/3

Some key capabilities of PowerConnect for SAP R/3 include:

• Import SAP functions in the Source Analyzer.
• Import IDOCs.
• Insert ABAP Code Block to add more functionality to the ABAP program
flow.
• Use of outer joins when two or more sources are joined in the ERP
Source Qualifier.
• Use of static filters to reduce the rows returned (for example,
MARA = MARA-MATNR = ‘189’).
• Customization of the ABAP program flow with joins, filters, SAP
functions and code blocks. For example: qualifying table =
table1-field1 = table2-field2, where the qualifying table is the “last”
table in the condition based on the join order.
• Creation of ABAP program variables to represent SAP R/3 structures,
structure fields or values in the ABAP program.
• Removal of ABAP program information from SAP R/3 and the
repository when a folder is deleted.

Be sure to note the following considerations regarding SAP R/3:

• You must have proper authorization on the R/3 system to perform
integration tasks. The R/3 administrator needs to create authorizations,
profiles and user IDs for PowerCenter users.
• If your mapping has hierarchy definitions only, you cannot install the
ABAP program.
• The R/3 system administrator must use the transport control program,
tp import, to transport these object files to the R/3 system. The
transport process creates a development class called ZERP. The
installation CD includes the devinit, dev3x, dev4x, and production program
files. To avoid problems extracting metadata, installing programs, and
running sessions, do not install the dev3x transport on a 4.x system,
or the dev4x transport on a 3.x system.
• Do not use Notepad to edit the saprfc.ini file. Use a text editor, such as
WordPad.
• R/3 does not always maintain referential integrity in primary key-foreign
key relationships. If you use an R/3 source to create target
definitions in the Warehouse Designer, you may encounter key
constraint errors when you load the data warehouse. To avoid these
errors, edit the keys in the target definition before you build the
physical targets.



• Do not use the Select Distinct option for LCHR when the length is
greater than 2000 and the underlying database is Oracle. This causes
the session to fail.
• You cannot generate and install ABAP programs from mapping
shortcuts.
• If a mapping contains both hierarchies and tables, you must generate
the ABAP program using file mode.
• You cannot use an ABAP code block, an ABAP program variable and a
source filter if the ABAP program flow contains a hierarchy and no
other sources.
• You cannot use dynamic filters on IDOC source definitions in the ABAP
program flow.
• SAP R/3 stores all CHAR data with trailing blanks. When the
PowerCenter Server extracts CHAR data from SAP R/3, it treats it as
VARCHAR data and trims the trailing blanks. The PowerCenter Server
also trims trailing blanks for CUKY and UNIT data. This allows you to
compare R/3 data with other source data without having to use the
RTRIM function. If you are upgrading and your mappings use the
blanks to compare R/3 data with other data, you may not want the
PowerCenter Server to trim the trailing blanks. To avoid trimming the
trailing blanks, add the flag AllowTrailingBlanksForSAPCHAR=Yes to
the pmserver.cfg file.
• If the PowerCenter Server is on NT/2000, you have to add that parameter
as a string value to the key:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\PowerMart\Parameters\MiscInfo
• PowerCenter has the ability to generate the ABAP code for the
mapping. However, when this ABAP code is generated, PowerCenter does not
automatically create a transport for it. The transport must be created
manually within SAP and then transported to the Production environment.
• Given that the development and production SAP systems are identical,
you should be able to switch your mapping to point to either the
development or production instance at the session level. For
migration purposes, depending on which environment you are in, all you
need to do is change the database connections at the session level.
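
The effect of the trailing-blank handling described above on comparisons can be illustrated in Python. This is a conceptual sketch only; the function name and the `trim_trailing_blanks` switch are hypothetical stand-ins for the server behavior and are not PowerCenter settings.

```python
def matches(sap_char_value, other_value, trim_trailing_blanks=True):
    """Compare a blank-padded SAP CHAR value with a value from another source."""
    if trim_trailing_blanks:
        sap_char_value = sap_char_value.rstrip(" ")
    return sap_char_value == other_value


padded = "EUR  "  # CHAR value stored with trailing blanks

print(matches(padded, "EUR"))                               # trimmed: equal
print(matches(padded, "EUR", trim_trailing_blanks=False))   # untrimmed: not equal
print(matches(padded, "EUR  ", trim_trailing_blanks=False)) # padded on both sides
```

A mapping written before the upgrade that pads the other side of the comparison corresponds to the last case, which is why such mappings may need the blanks preserved.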



Incremental Loads

Challenge

Data warehousing incorporates large volumes of data, making the process of loading
into the warehouse without compromising its functionality increasingly difficult. The
goal is to create a load strategy that will minimize downtime for the warehouse and
allow quick and robust data management.

Description

As time windows shrink and data volumes increase, it is important to understand the
impact of a suitable incremental load strategy. The design should allow data to be
incrementally added to the data warehouse with minimal impact to the overall
system. The following pages describe several possible load strategies.

Considerations

• Incremental aggregation – loading deltas into an aggregate table.
• Error handling and unloading of data – strategies for recovering, reloading,
and unloading data.
• History tracking – keeping track of what has been loaded and when.
• Slowly changing dimensions – Informatica Wizards for generic mappings (a
good start to an incremental load strategy).

Source Analysis

Data sources typically fall into the following possible scenarios:

• Delta Records - Records supplied by the source system include only new or
changed records. In this scenario, all records are generally inserted or
updated into the data warehouse.
• Record Indicator or Flags - Records that include columns that specify the
intention of the record to be populated into the warehouse. Records can be
selected based upon this flag to allow for inserts, updates and deletes.
• Date stamped data - Data is organized by timestamps. Data will be loaded
into the warehouse based upon the last processing date or the effective date
range.



• Key values are present - When only key values are present, data must be
checked against what has already been entered into the warehouse. All values
must be checked before entering the warehouse.
• No Key values present - Surrogate keys will be created and all data will be
inserted into the warehouse based upon validity of the records.

Identify Which Records Need to be Compared

Once the sources are identified, it is necessary to determine which records will be
entered into the warehouse and how. Here are some considerations:

• Compare with the target table. Determine if the record exists in the target
table. If the record does not exist, insert the record as a new row. If it does
exist, determine if the record needs to be updated, inserted as a new record,
or removed (deleted from target or filtered out and not added to the
warehouse). This occurs in cases of delta loads, timestamps, keys or
surrogate keys.
• Record indicators. Record indicators can be beneficial when lookups into the
target are not necessary. Take care to ensure that the record exists for
updates or deletes or the record can be successfully inserted. More design
effort may be needed to manage errors in these situations.
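
The comparison against the target table can be sketched in Python as follows. This is a minimal illustration that assumes the target fits in memory and that rows are dictionaries; the function name, the `id` key column, and the sample rows are hypothetical.

```python
def classify(source_rows, target_rows, key="id"):
    """Decide, for each source row, whether to insert, update, or skip it."""
    target_by_key = {row[key]: row for row in target_rows}
    actions = []
    for row in source_rows:
        existing = target_by_key.get(row[key])
        if existing is None:
            actions.append(("insert", row))   # not in the target yet
        elif existing != row:
            actions.append(("update", row))   # present, but values changed
        else:
            actions.append(("skip", row))     # present and unchanged
    return actions


target = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
source = [
    {"id": 1, "name": "Ann"},
    {"id": 2, "name": "Robert"},
    {"id": 3, "name": "Cy"},
]
for action, row in classify(source, target):
    print(action, row)
```

In a mapping, the same decision is typically driven by a lookup or join against the target, with the resulting action routed to the appropriate update strategy.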

Determine the Method of Comparison

1. Joins of Sources to Targets. Records are directly joined to the target using Source
Qualifier join conditions or using joiner transformations after the source qualifiers
(for heterogeneous sources). When using joiner transformations, take care to ensure
the data volumes are manageable.

2. Lookup on target. Using the lookup transformation, look up the keys or critical
columns in the target relational database. Keep in mind the caching and indexing
possibilities.

3. Load table log. Generate a log table of records that have already been inserted
into the target system. You can use this table for comparison with lookups or joins,
depending on the need and volume. For example, store keys in a separate table
and compare source records against this log table to determine load strategy.
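
The load-log approach can be sketched with Python's built-in sqlite3 module. The table names, column names, and sample rows are hypothetical, and an in-memory SQLite database stands in for the source and log tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_orders (order_id INTEGER PRIMARY KEY, amount REAL);
    CREATE TABLE load_log (order_id INTEGER PRIMARY KEY);
    INSERT INTO source_orders VALUES (1, 10.0), (2, 20.0), (3, 30.0);
    INSERT INTO load_log VALUES (1), (2);
""")


def new_rows(conn):
    """Return source rows whose keys are not yet recorded in the load log."""
    return conn.execute("""
        SELECT s.order_id, s.amount
        FROM source_orders s
        LEFT JOIN load_log l ON l.order_id = s.order_id
        WHERE l.order_id IS NULL
        ORDER BY s.order_id
    """).fetchall()


def record_loaded(conn, rows):
    """Add the keys of the rows just loaded to the log table."""
    conn.executemany(
        "INSERT INTO load_log (order_id) VALUES (?)",
        [(row[0],) for row in rows],
    )
    conn.commit()


pending = new_rows(conn)       # rows still to be loaded
record_loaded(conn, pending)   # log them once the load succeeds
remaining = new_rows(conn)     # nothing left on the next run
print(pending, remaining)
```

Because the log holds only keys, it stays small relative to the target and can be joined or cached cheaply.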

Source Based Load Strategies

Complete Incremental Loads in a Single File/Table

The simplest method of incremental loads is from flat files or a database in which all
records will be loaded. This particular strategy requires bulk loads into the
warehouse, with no overhead on processing of the sources or sorting the source
records.

Loading Method

Data can be loaded directly from these locations into the data warehouse. There is
no additional overhead produced in moving these sources into the warehouse.



Date Stamped Data

This method involves data that has been stamped using effective dates or
sequences. The incremental load can be determined by dates greater than the
previous load date or data that has an effective key greater than the last key
processed.

Loading Method

With the use of relational sources, the records can be selected based on this effective
date and only those records past a certain date will be loaded into the warehouse.
Views can also be created to perform the selection criteria so the processing will not
have to be incorporated into the mappings. Placing the load strategy into the ETL
component is much more flexible and controllable by the ETL developers and
metadata.

Non-relational data can be filtered as records are loaded based upon the effective
dates or sequenced keys. A router transformation or a filter can be placed after the
source qualifier to remove old records.

To compare the effective dates, you can use mapping variables to provide the
previous date processed. The alternative is to use control tables to store the date
and update the control table after each load.
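
The control-table alternative can be sketched with Python's built-in sqlite3 module. All table names, column names, the job name, and the dates are hypothetical; dates are stored as ISO-8601 text so that string comparison matches chronological order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src (id INTEGER, modify_date TEXT);
    CREATE TABLE load_control (job TEXT PRIMARY KEY, last_load_date TEXT);
    INSERT INTO src VALUES (1, '2001-01-05'), (2, '2001-01-12'), (3, '2001-01-20');
    INSERT INTO load_control VALUES ('orders', '2001-01-10');
""")


def incremental_rows(conn, job):
    """Select rows newer than the stored date, then advance the stored date."""
    (last_date,) = conn.execute(
        "SELECT last_load_date FROM load_control WHERE job = ?", (job,)
    ).fetchone()
    rows = conn.execute(
        "SELECT id, modify_date FROM src WHERE modify_date > ? ORDER BY modify_date",
        (last_date,),
    ).fetchall()
    if rows:
        # Store the newest date processed so the next run starts after it.
        conn.execute(
            "UPDATE load_control SET last_load_date = ? WHERE job = ?",
            (rows[-1][1], job),
        )
        conn.commit()
    return rows


first = incremental_rows(conn, "orders")   # rows modified after 2001-01-10
second = incremental_rows(conn, "orders")  # nothing new on the second run
print(first, second)
```

Updating the control table only after a successful load keeps a failed run from silently skipping records on the next attempt.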

For detailed instructions on how to select dates, refer to the Best Practice Using
Parameters, Variables and Parameter Files.

Changed Data based on Keys or Record Information

Data that is uniquely identified by keys can be selected based upon selection criteria.
For example, records that contain key information, such as primary keys or alternate
keys, can be checked to determine whether they have already been entered into the
data warehouse. If they exist, you can also check whether you need to update these
records or discard the source record.

Load Method

It may be possible to join the source with the target tables so that only new
data is selected and loaded into the target. It may also be feasible to perform
a lookup on the target to see whether the data already exists.
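
The lookup approach can be sketched as a simple classification of source rows
against the target. This is a minimal illustration under stated assumptions:
rows are plain dictionaries, the key column is named "id", and the target is
held as a dictionary keyed by that column. None of these names come from
PowerCenter.

```python
def classify(source_rows, target_by_key):
    """Split source rows into inserts, updates, and discards by key lookup."""
    inserts, updates, discards = [], [], []
    for row in source_rows:
        existing = target_by_key.get(row["id"])  # the lookup in the target
        if existing is None:
            inserts.append(row)    # key not yet in the warehouse
        elif existing != row:
            updates.append(row)    # key exists but the record has changed
        else:
            discards.append(row)   # identical record already loaded
    return inserts, updates, discards
```

A row whose key is missing from the target becomes an insert, a row whose key
exists with different values becomes an update, and an unchanged row is
discarded.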

Target Based Load Strategies

Load Directly into the Target

Loading directly into the target is possible when the data will be bulk loaded. The
mapping will be responsible for error control, recovery and update strategy.

Load into Flat Files and Bulk Load using an External Loader

The mapping loads data directly into flat files. An external loader can then be
invoked to bulk load the data into the target. This method reduces load times
(with less downtime for the data warehouse) and also provides a means of
maintaining a history of the data being loaded into the target. Typically, this
method is only used for updates into the warehouse.

Load into a Mirror Database

The data is loaded into a mirror database to avoid downtime of the active data
warehouse. After the data has been loaded, the databases are switched, making
the mirror the active database and the formerly active database the mirror.

Using Mapping Variables and Parameter Files

A mapping variable can be used to perform incremental loading, and it is
important to understand how this works. The mapping variable is used in the
selection condition of the Source Qualifier in order to select only the new data
that has been entered, based on the create_date or the modify_date, whichever
date can be used to identify a newly inserted record. The source system must
have a reliable date to use. Here are the steps involved in this method:

Step 1: Create Mapping Variable

In the Informatica Designer, with the Mapping Designer open, go to the menu and
select Mappings, then select Parameters and Variables.

Name the variable and, in this case, make your variable a date/time. For the
Aggregation option, select MAX.

In the same screen, state your initial value. This is the date at which the load should
start. The date must follow one of these formats:

• MM/DD/RR
• MM/DD/RR HH24:MI:SS
• MM/DD/YYYY
• MM/DD/YYYY HH24:MI:SS

Step 2: Use the Mapping Variable in the Source Qualifier

The select statement will look like the following:

SELECT * FROM tableA
WHERE CREATE_DATE > TO_DATE('$$INCREMENT_DATE', 'MM-DD-YYYY HH24:MI:SS')
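
To see what the database receives, it can help to picture the variable being
replaced by its current value as plain text before the query is sent. The
sketch below assumes exactly that behavior; the function name expand_query is
hypothetical, and the date is rendered to match the format mask used in the
TO_DATE call above.

```python
from datetime import datetime

QUERY_TEMPLATE = (
    "SELECT * FROM tableA "
    "WHERE CREATE_DATE > TO_DATE('$$INCREMENT_DATE', 'MM-DD-YYYY HH24:MI:SS')"
)

def expand_query(template, increment_date):
    """Replace $$INCREMENT_DATE with the date rendered as MM-DD-YYYY HH24:MI:SS."""
    text = increment_date.strftime("%m-%d-%Y %H:%M:%S")
    return template.replace("$$INCREMENT_DATE", text)
```

For example, expand_query(QUERY_TEMPLATE, datetime(2001, 9, 1)) yields a query
that compares CREATE_DATE against '09-01-2001 00:00:00'.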

Step 3: Use the Mapping Variable in an Expression

For the purpose of this example, use an Expression transformation to set and use
the mapping variable through the variable functions.

In the Expression transformation, create a variable port and call the
SETMAXVARIABLE variable function as follows:

SETMAXVARIABLE($$INCREMENT_DATE, CREATE_DATE)

CREATE_DATE is the date for which you would like to store the maximum value.

You can use the variable functions in the following transformations:

• Expression
• Filter
• Router
• Update Strategy

The variable holds, row by row, the maximum of the incoming source value and the
current value of the variable. So if one row comes through with 9/1/2001, the
variable takes that value. If all subsequent rows are earlier than that,
9/1/2001 is preserved.

After the mapping completes, that value is the persistent value stored in the
repository for the next run of your session. You can view the value of the
mapping variable in the session log file.

The benefit of combining the mapping variable with incremental loading is that
the session processes only the new rows of data. No table is needed to store the
max(date), since the variable takes care of it.
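
The cycle described in Steps 1 through 3 can be simulated in a few lines. This
is a sketch of the behavior as described above, not of the server's
implementation: run_session is a hypothetical name, the filter stands in for
the Source Qualifier condition, and max() stands in for SETMAXVARIABLE.

```python
from datetime import datetime

def run_session(rows, persisted_value):
    """Simulate one run; return (rows loaded, value persisted for the next run)."""
    current = persisted_value
    loaded = []
    for create_date in rows:
        # Source filter: CREATE_DATE > $$INCREMENT_DATE (value at session start).
        if create_date > persisted_value:
            loaded.append(create_date)
            # SETMAXVARIABLE keeps the larger of the current value and this row.
            current = max(current, create_date)
    return loaded, current
```

Running the function twice over the same source rows, feeding the value
returned by the first run into the second, loads the new rows once and nothing
on the second pass.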

Mapping Design

Challenge

Use the PowerCenter tool suite to create an efficient execution environment.

Description

Although PowerCenter environments vary widely, most sessions and/or
mappings can benefit from the implementation of common objects and
optimization procedures. Follow these procedures and rules of thumb when
creating mappings to help ensure optimization. Use mapplets to leverage the
work of critical developers and minimize mistakes when performing similar
functions.

General Suggestions for Optimizing

1. Reduce the number of transformations

• There is always overhead involved in moving data between
transformations.
• Consider more shared memory for a large number of transformations.
Session shared memory between 12MB and 40MB should suffice.

2. Calculate once, use many times.

• Avoid calculating or testing the same value over and over.
• Calculate it once in an expression, and set a True/False flag.
• Within an expression, use variables to calculate a value used several
times.

3. Only connect what is used.

• Delete unnecessary links between transformations to minimize the
amount of data moved, particularly in the Source Qualifier.
• This is also helpful for maintenance, if you exchange transformations
(e.g., a Source Qualifier).

4. Watch the data types.

• The engine automatically converts compatible types.
• Sometimes conversion is excessive, and happens on every
transformation.
• Minimize data type changes between transformations by planning data
flow prior to developing the mapping.

5. Facilitate reuse.

• Plan for reusable transformations upfront.
• Use variables.
• Use mapplets to encapsulate multiple reusable transformations.

6. Only manipulate data that needs to be moved and transformed.

• Delete unused ports, particularly in Source Qualifiers and Lookups.
Reducing the number of records used throughout the mapping
provides better performance.
• Use active transformations that reduce the number of records as early
in the mapping as possible (e.g., place filters and aggregators as close
to the source as possible).
• Select the appropriate driving/master table when using joins. The table
with fewer rows should be the driving/master table.

7. When DTM bottlenecks are identified and session optimization has not
helped, use tracing levels to identify which transformation is causing the
bottleneck (use the Test Load option in session properties).

8. Utilize single-pass reads.

• Single-pass reading is the server’s ability to use one Source Qualifier
to populate multiple targets.
• For each additional Source Qualifier, the server reads the source again.
If you have different Source Qualifiers for the same source (e.g., one
for delete and one for update/insert), the server reads the source once
for each Source Qualifier.
• Remove or reduce field-level stored procedures.
• If you use field-level stored procedures, PowerMart has to make a call
to that stored procedure for every row, so performance will be slow.

9. Lookup Transformation Optimizing Tips

• When your source is large, cache lookup table columns for those
lookup tables of 500,000 rows or less. This typically improves
performance by 10-20%.
• The rule of thumb is not to cache any table over 500,000 rows. This is
only true if the standard row byte count is 1,024 or less. If the row
byte count is more than 1,024, then the 500K-row limit must be
adjusted down as the number of bytes increases (e.g., a 2,048-byte row
can drop the cache row count to 250K – 300K, so a lookup table above
that size will not be cached in this case).
• When using a Lookup transformation, improve lookup
performance by placing all conditions that use the equality operator ‘=’
first in the list of conditions under the condition tab.
• Cache lookup tables only if the number of lookup calls is more than
10-20% of the lookup table rows. For a smaller number of lookup calls,
do not cache if the number of lookup table rows is large. For small
lookup tables (fewer than 5,000 rows), cache if there are more than 5-10
lookup calls.
• Replace the lookup with DECODE or IIF (for small sets of values).
• If you are caching lookups and performance is poor, consider replacing
them with an unconnected, uncached lookup.

10. Review complex expressions.

11. Examine mappings via Repository Reporting.

12. Minimize aggregate function calls.

13. Replace the Aggregator Transformation object with an Expression
Transformation object and an Update Strategy Transformation for certain
types of aggregations.

14. Operations and Expression Optimizing Tips

• Numeric operations are faster than string operations.
• Optimize char-varchar comparisons (i.e., trim spaces before
comparing).
• Operators are faster than functions (e.g., || vs. CONCAT).
• Optimize IIF expressions.
• Avoid date comparisons in lookup; replace with string.
• Test expression timing by replacing with constant.

15. Use Flat Files

• Flat files located on the server machine load faster than a
database located on the server machine.
• Fixed-width files are faster to load than delimited files because
delimited files require extra parsing.
• If processing intricate transformations, consider first loading the
source flat file into a relational database, which allows the
PowerCenter mappings to access the data in an optimized fashion by
using filters and custom SQL SELECT statements where appropriate.

16. If working with a source that cannot return sorted data (e.g., web
logs), consider using the Sorter Advanced External Procedure.
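
The lookup caching rules of thumb in suggestion 9 can be written down as a
small decision function. This is only a sketch of those rules, with several
assumptions: the row ceiling is scaled down in strict proportion to row size
(which gives 250K rows at 2,048 bytes), the upper ends of the quoted ranges
(20% of rows, 10 calls) are used as thresholds, and the function names are
hypothetical. Actual results depend on the environment and should be measured.

```python
def max_cacheable_rows(row_bytes, base_rows=500_000, base_bytes=1_024):
    """Row-count ceiling for caching; shrinks once rows exceed base_bytes."""
    if row_bytes <= base_bytes:
        return base_rows
    # Assumption: scale the ceiling down in proportion to the row size.
    return base_rows * base_bytes // row_bytes

def should_cache(table_rows, row_bytes, lookup_calls):
    """Apply the size ceiling first, then the ratio of calls to table rows."""
    if table_rows > max_cacheable_rows(row_bytes):
        return False                        # table too large to cache
    if table_rows < 5_000:
        return lookup_calls > 10            # small table: a few calls suffice
    return lookup_calls > 0.2 * table_rows  # otherwise calls must exceed 20%
```

For instance, a 300,000-row table with 2,048-byte rows is rejected regardless
of the number of calls, while a 100,000-row table with narrow rows is cached
once the expected lookup calls exceed 20,000.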

Suggestions for Using Mapplets

A mapplet is a reusable object that represents a set of transformations. It
allows you to reuse transformation logic and can contain as many
transformations as necessary.

1. Create a mapplet when you want to use a standardized set of
transformation logic in several mappings. For example, if you have several
fact tables that require a series of dimension keys, you can create a mapplet
containing a series of Lookup transformations to find each dimension key. You
can then use the mapplet in each fact table mapping, rather than recreate the
same lookup logic in each mapping.

2. To create a mapplet, add, connect, and configure transformations to
complete the desired transformation logic. After you save a mapplet, you can
use it in a mapping to represent the transformations within the mapplet.
When you use a mapplet in a mapping, you use an instance of the mapplet.
All uses of a mapplet are tied to the ‘parent’ mapplet. Hence, all changes
made to the parent mapplet logic are inherited by every ‘child’ instance of the
mapplet. When the server runs a session using a mapplet, it expands the
mapplet. The server then runs the session as it would any other session,
passing data through each transformation in the mapplet as designed.

3. A mapplet can be active or passive depending on the transformations in
the mapplet. Active mapplets contain at least one active transformation.
Passive mapplets only contain passive transformations. Being aware of this
property when using mapplets can save time when debugging invalid
mappings.

4. Several transformations are unsupported and should not be used in
a mapplet. These include COBOL source definitions, Joiner, Normalizer,
non-reusable Sequence Generator, pre- or post-session stored procedures,
target definitions, and PowerMart 3.5 style lookup functions.

5. Do not reuse a mapplet if you only need one or two of its transformations
while all of its other calculated ports and transformations are unnecessary.

6. Source data for a mapplet can originate from one of two places:

• Sources within the mapplet. Use one or more source definitions
connected to a Source Qualifier or ERP Source Qualifier
transformation. When you use the mapplet in a mapping, the mapplet
provides source data for the mapping and is the first object in the
mapping data flow.
• Sources outside the mapplet. Use a mapplet Input transformation
to define input ports. When you use the mapplet in a mapping, data
passes through the mapplet as part of the mapping data flow.

7. To pass data out of a mapplet, create mapplet output ports. Each port in
an Output transformation connected to another transformation in the mapplet
becomes a mapplet output port.

• Active mapplets with more than one Output transformation.
You need one target in the mapping for each Output transformation in
the mapplet. You cannot use only one data flow of the mapplet in a
mapping.
• Passive mapplets with more than one Output transformation.
Reduce to one Output transformation; otherwise, you need one target
in the mapping for each Output transformation in the mapplet. This
means you cannot use only one data flow of the mapplet in a mapping.
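
As a loose analogy for suggestion 1, a mapplet behaves like a shared function
that several mappings call. The sketch below is not PowerCenter code; the
function names, the dictionary-based dimension tables, and the "_key" suffix
are all hypothetical. It shows one piece of dimension key lookup logic reused
by any fact-building routine.

```python
def lookup_dimension_keys(row, dimensions):
    """Mapplet-like unit: resolve each natural key in the row to its dimension key."""
    return {f"{name}_key": table.get(row[name]) for name, table in dimensions.items()}

def build_fact(row, dimensions, measures):
    """A fact 'mapping' that uses an instance of the shared lookup logic."""
    fact = lookup_dimension_keys(row, dimensions)
    fact.update({m: row[m] for m in measures})
    return fact
```

Because every fact routine calls the same function, a change to the lookup
logic is picked up everywhere it is used, much as changes to a parent mapplet
are inherited by its instances.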

Metadata Reporting and Sharing

Challenge

Using Informatica’s suite of metadata tools effectively in the design of the end-user
analysis application.

Description

The levels of metadata available in the Informatica tool suite are quite extensive.
The amount of metadata that is entered depends on the business requirements.
Description information can be entered for all repository objects: sources, targets,
transformations, etc. You can also drill down to the column level and give
descriptions of the columns in a table if necessary. In addition, all information
about column size and scale, data types, and primary keys is stored in the
repository. The decision on how much metadata to create is often driven by project
timelines. While it may be beneficial for a developer to enter detailed descriptions
of each column, expression, variable, etc., it will also require a substantial amount
of time to do so. Therefore, this decision should be made on the basis of how much
metadata will be required by the systems that use the metadata.

Informatica offers two recommended ways for accessing the repository metadata.

• Effective with the release of version 5.0, Informatica PowerCenter contains a
Metadata Reporter. The Metadata Reporter is a web-based application that
allows you to run reports against the repository metadata.

• Because Informatica does not support or recommend direct reporting access
to the repository, even for Select only queries, the second way of repository
metadata reporting is through the use of views written using Metadata
Exchange (MX). These views can be found in the Informatica Metadata
Exchange (MX) Cookbook.

Metadata Reporter

The need for the Informatica Metadata Reporter arose from the number of clients
requesting custom and complete metadata reports from their repositories. The
Metadata Reporter allows report access to every Informatica object stored in the
repository. The architecture of the Metadata Reporter is web-based, with an Internet
browser front end. You can install the Metadata Reporter on a server running either
UNIX or Windows that contains a supported web server. The Metadata Reporter
contains servlets that must be installed on a web server that runs the Java Virtual
Machine and supports the Java Servlet API. The currently supported web servers
are:

• iPlanet 4.1 or higher
• Apache 1.3 with JServ 1.1
• JRun 2.3.3

(Note: The Metadata Reporter will not run directly on Microsoft IIS because IIS does
not directly support servlets.)

The Metadata Reporter is accessible from any computer with a browser that has
access to the web server where the Metadata Reporter is installed, even without the
other Informatica Client tools being installed on that computer. The Metadata
Reporter connects to your Informatica repository using JDBC drivers. Make sure the
proper JDBC drivers are installed for your database platform.

(Note: You can also use the JDBC to ODBC bridge to connect to the repository.
Example syntax: jdbc:odbc:<data_source_name>)

Although the Repository Manager provides a number of Crystal Reports, the
Metadata Reporter has several benefits:

• The Metadata Reporter is comprehensive. You can run reports on any
repository. The reports provide information about all types of metadata
objects.

• The Metadata Reporter is easily accessible. Because the Metadata Reporter is
web-based, you can generate reports from any machine that has access to
the web server where the Metadata Reporter is installed. You do not need
direct access to the repository database, your sources or targets, or
PowerMart or PowerCenter.

• The reports in the Metadata Reporter are customizable. The Metadata
Reporter allows you to set parameters for the metadata objects to include in
the report.

• The Metadata Reporter allows you to go easily from one report to another.
The name of any metadata object that displays on a report links to an
associated report. As you view a report, you can generate reports for objects
on which you need more information.

The Metadata Reporter provides 15 standard reports that can be customized with the
use of parameters and wildcards. The reports are as follows:

• Batch Report
• Executed Session Report
• Executed Session Report by Date
• Invalid Mappings Report
• Job Report
• Lookup Table Dependency Report
• Mapping Report
• Mapplet Report
• Object to Mapping/Mapplet Dependency Report
• Session Report
• Shortcut Report
• Source Schema Report
• Source to Target Dependency Report
• Target Schema Report
• Transformation Report

For a detailed description of how to run these reports, consult the Metadata Reporter
Guide included in your PowerCenter Documentation.

Metadata Exchange: The Second Generation (MX2)

The MX architecture was intended primarily for Business Intelligence (BI) vendors
who wanted to create a PowerCenter-based data warehouse and then display the
warehouse metadata through their own products. The result was a set of relational
views that encapsulated the underlying repository tables while exposing the
metadata in several categories that were more suitable for external parties. Today,
Informatica and several key vendors, including Brio, Business Objects, Cognos, and
MicroStrategy, are effectively using the MX views to report and query the Informatica
metadata.

Informatica currently supports the second generation of Metadata Exchange called
MX2. Although the overall motivation for creating the second generation of MX
remains consistent with the original intent, the requirements and objectives of MX2
supersede those of MX.

The primary requirements and features of MX2 are:

Incorporation of object technology in a COM-based API. Although SQL
provides a powerful mechanism for accessing and manipulating records of data in a
relational paradigm, it is not suitable for procedural programming tasks that can be
achieved by C, C++, Java, or Visual Basic. Furthermore, the increasing popularity
and use of object-oriented software tools require interfaces that can take full
advantage of the object technology. MX2 is implemented in C++ and offers an
advanced object-based API for accessing and manipulating the PowerCenter
Repository from various programming languages.

Self-contained Software Development Kit (SDK). One of the key advantages of
MX views is that they are part of the repository database and thus can be used
independently of any of Informatica’s software products. The same requirement
also holds for MX2, thus leading to the development of a self-contained API Software
Development Kit that can be used independently of the client or server products.

Extensive metadata content, especially multidimensional models for OLAP. A
number of BI tools and upstream data warehouse modeling tools require complex
multidimensional metadata, such as hierarchies, levels, and various relationships.
This type of metadata was specifically designed and implemented in the repository to
accommodate the needs of our partners by means of the new MX2 interfaces.

Ability to write (push) metadata into the repository. Because of the limitations
associated with relational views, MX could not be used for writing or updating
metadata in the Informatica repository. As a result, such tasks could only be
accomplished by directly manipulating the repository’s relational tables. The MX2
interfaces provide metadata write capabilities along with the appropriate verification
and validation features to ensure the integrity of the metadata in the repository.

Complete encapsulation of the underlying repository organization by means
of an API. One of the main challenges with MX views and the interfaces that access
the repository tables is that they are directly exposed to any schema changes of the
underlying repository database. As a result, maintenance of the MX views and direct
interfaces becomes a major undertaking with every major upgrade of the repository.
MX2 alleviates this problem by offering a set of object-based APIs that are
abstracted away from the details of the underlying relational tables, thus providing
an easier mechanism for managing schema evolution.

Integration with third-party tools. MX2 offers the object-based interfaces needed
to develop more sophisticated procedural programs that can tightly integrate the
repository with the third-party data warehouse modeling and query/reporting tools.

Synchronization of metadata based on changes from up-stream and
down-stream tools. Given that metadata will reside in different databases and
files in a distributed software environment, synchronizing changes and updates
ensures the validity and integrity of the metadata. The object-based technology
used in MX2 provides the infrastructure needed to implement automatic metadata
synchronization and change propagation across different tools that access the
Informatica Repository.

Interoperability with other COM-based programs and repository interfaces.
MX2 interfaces comply with Microsoft’s Component Object Model (COM)
interoperability protocol. Therefore, any existing or future program that is
COM-compliant can seamlessly interface with the Informatica Repository by means
of MX2.

Support for Microsoft’s UML-based Open Information Model (OIM). The
Microsoft Repository and its OIM schema, based on the standard Unified Modeling
Language (UML), could become a de facto general-purpose repository standard.
Informatica has worked in close cooperation with Microsoft to ensure that the logical
object model of MX2 remains consistent with the data warehousing components of
the Microsoft Repository. This also facilitates robust metadata exchange with the
Microsoft Repository and other software that supports this repository.

Framework to support a component-based repository in a multi-tier
architecture. With the advent of the Internet and distributed computing, multi-tier
architectures are becoming more widely accepted for accessing and managing
metadata and data. The object-based technology of MX2 supports a multi-tier
architecture so that a future Informatica Repository Server could be accessed from a
variety of thin client programs running on different operating systems.

MX2 Architecture

MX2 provides a set of COM-based programming interfaces on top of the C++ object
model used by the client tools to access and manipulate the underlying repository.
This architecture not only encapsulates the physical repository structure, but also
leverages the existing C++ object model to provide an open, extensible API based
on the standard COM protocol. MX2 can be automatically installed on Windows 95,
98, or Windows NT using the install program provided with its SDK. After the
successful installation of MX2, its interfaces are automatically registered and
available to any software through standard COM programming techniques. The MX2
COM APIs support the PowerCenter XML Import/Export feature and provide a
COM-based programming interface through which to import and export repository
objects.

Naming Conventions

Challenge

Choosing a good naming standard for the repository and adhering to it.

Description

Repository Naming Conventions

Although naming conventions are important for all repository and database objects,
the suggestions in this document focus on the former. Choosing a convention and
sticking with it is the key point, and sometimes the most difficult part of
determining naming conventions. It is important to note that having a good naming
convention will help facilitate a smooth migration and improve readability for
anyone reviewing the processes.

FAQs

The following paragraphs present some of the questions that typically arise in
naming repositories and suggest answers:

Q: What are the implications of numerous repositories or numerous folders within a
repository, given that multiple development groups need to use the PowerCenter
server, and each group works independently?

• One consideration for naming conventions is how to segregate different
projects and data mart objects from one another. Whenever an object is
shared between projects, the object should be stored in a shared work area
so each of the individual projects can utilize a shortcut to the object.
Mappings are listed in alphabetical order.

Q: What naming convention is recommended for Repository Folders?

• Something specific (e.g., Company_Department_Project-Name_Prod) is
appropriate if multiple repositories are expected for various projects and/or
departments.

Note that incorporating functions in the object name makes the name more
descriptive at a higher level. The drawback is that when an object needs to be
modified to incorporate some other business logic, the name no longer accurately
describes the object. Use descriptive names cautiously and at a high enough level. It
is not advisable to rename an object that is currently being used in a production
environment.

The following tables illustrate some naming conventions for transformation objects
(e.g., sources, targets, joiners, lookups, etc.) and repository objects (e.g.,
mappings, sessions, etc.).

Transformation Objects                   Naming Convention

Advanced External Procedure Transform:   aep_ProcedureName
Aggregator Transform:                    agg_TargetTableName(s) that leverages the expression
                                         and/or a name that describes the processing being done.
Expression Transform:                    exp_TargetTableName(s) that leverages the expression
                                         and/or a name that describes the processing being done.
External Procedure Transform:            ext_ProcedureName
Filter Transform:                        fil_TargetTableName(s) that leverages the expression
                                         and/or a name that describes the processing being done.
Joiner Transform:                        jnr_SourceTable/FileName1_SourceTable/FileName2
Lookup Transform:                        lkp_LookupTableName
Mapplet:                                 mplt_Description
Mapping Variable:                        $$Function or Process that is being done
Mapping Parameter:                       $$Function or Process that is being done
Normalizer Transform:                    nrm_TargetTableName(s) that leverages the expression
                                         and/or a name that describes the processing being done.
Rank Transform:                          rnk_TargetTableName(s) that leverages the expression
                                         and/or a name that describes the processing being done.
Router:                                  rtr_TARGETTABLE that leverages the expression and/or a
                                         name that describes the processing being done.
                                         Group Name: Function_TargetTableName(s) (e.g.,
                                         INSERT_EMPLOYEE or UPDATE_EMPLOYEE)
Sequence Generator:                      seq_Function
Source Qualifier Transform:              sq_SourceTable1_SourceTable2
Stored Procedure:                        SpStoredProcedureName
Update Strategy:                         UpdTargetTableName(s) that leverages the expression
                                         and/or a name that describes the processing being done.

Repository Objects                       Naming Convention

Mapping Name:                            m_TargetTable1_TargetTable2
Session Name:                            s_MappingName
Batch Names:                             bs_BatchName for a sequential batch and bc_BatchName
                                         for a concurrent batch.
Folder Name:                             Folder names should logically group sessions and
                                         mappings. The grouping can be based on project, subject
                                         area, promotion group, or some combination of these.
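
A convention is easier to keep when it can be checked mechanically. The sketch
below is one possible checker, not an Informatica utility: it covers only the
underscore-delimited prefixes from the tables above, and the function names and
the rule that a letter must follow the prefix are assumptions made for
illustration.

```python
import re

# Prefixes taken from the naming tables; other object types are not covered.
PREFIXES = {
    "aggregator": "agg_", "expression": "exp_", "filter": "fil_",
    "joiner": "jnr_", "lookup": "lkp_", "mapplet": "mplt_",
    "normalizer": "nrm_", "rank": "rnk_", "router": "rtr_",
    "sequence generator": "seq_", "source qualifier": "sq_",
    "mapping": "m_", "session": "s_",
}

def follows_convention(object_type, name):
    """True if name is the expected prefix, then a letter, then word characters."""
    prefix = PREFIXES[object_type]
    return re.fullmatch(re.escape(prefix) + r"[A-Za-z]\w*", name) is not None

def violations(objects):
    """List the (object_type, name) pairs that break the convention."""
    return [(t, n) for t, n in objects if not follows_convention(t, n)]
```

Passing a list of (object type, name) pairs exported from a repository
inventory to violations() returns only the names that need attention.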

Target Table Names

There are often several instances of the same target, usually because of different
actions. When looking at a session run, each of these instances appears with its own
successful rows, failed rows, etc. To make observing a session run easier, targets
should be named according to the action being executed on that target.

For example, if a mapping has four instances of the CUSTOMER_DIM table according to
update strategy (Update, Insert, Reject, Delete), the tables should be named as
follows:

• CUSTOMER_DIM_UPD
• CUSTOMER_DIM_INS
• CUSTOMER_DIM_DEL
• CUSTOMER_DIM_REJ

Port Names

Port names should remain the same as the source unless some other action is
performed on the port. In that case, the port should be prefixed with the
appropriate name.

When you bring a source port into a lookup or expression, the port should be
prefixed with “IN_”. This helps the user immediately identify the input ports
without having to line up the ports with the input checkbox. It is also a
good idea to prefix generated output ports. This helps trace the port value
throughout the mapping as it may travel through many other transformations. For
variables inside a transformation, you should use the prefix 'var_' plus a meaningful
name.

Batch Names

Batch names follow basically the same rules as the session names. A prefix, such as
'b_' should be used and there should be a suffix indicating if the batch is serial or
concurrent.

Batch Session Postfixes

init_load   Initial Load; indicates that this session should only be used one time
            to load initial data to the targets.
incr_load   Incremental Load; an update of the target, normally run periodically.
wkly        Indicates a weekly run of this session / batches.
mtly        Indicates a monthly run of this session / batches.

Shared Objects

Any object within a folder can be shared. These objects are sources, targets,
mappings, transformations, and mapplets. To share objects in a folder, the folder
must be designated as shared. Once the folder is shared, the users are allowed to
create shortcuts to objects in the folder.

If you have an object that you want to use in several mappings or across multiple
folders, like an Expression transformation that calculates sales tax, you can place the
object in a shared folder. You can then use the object in other folders by creating a
shortcut to the object. In this case, the naming convention is the prefix ‘SC_’, for
instance SC_mltCREATION_SESSION or SC_DUAL.

ODBC Data Source Names

Set up all Open Database Connectivity (ODBC) data source names (DSNs) the same
way on all client machines. PowerCenter uniquely identifies a source by its Database
Data Source (DBDS) and its name. The DBDS is the same name as the ODBC DSN
since the PowerCenter Client talks to all databases through ODBC.

If ODBC DSNs are different across multiple machines, there is a risk of analyzing the
same table using different names. For example, machine1 has ODBC DSN Name0
that points to database1. TableA gets analyzed on machine1. TableA is uniquely
identified as Name0.TableA in the repository. Machine2 has ODBC DSN Name1 that
points to database1. TableA gets analyzed on machine2. TableA is uniquely
identified as Name1.TableA in the repository. The result is that the repository may
refer to the same object by multiple names, creating confusion for developers,
testers, and potentially end users.
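
The effect can be reproduced with a toy model of the repository. This sketch
assumes only what the paragraph above states, namely that a source is
identified by its DBDS (equal to the ODBC DSN) plus its table name; the
function names and the set-based repository are hypothetical.

```python
def analyze(repository, dsn, table):
    """Register a source under its DBDS (here equal to the ODBC DSN) and name."""
    repository.add(f"{dsn}.{table}")
    return repository

def entries_for(machines, table):
    """machines maps machine name -> DSN used there for the same physical database."""
    repository = set()
    for dsn in machines.values():
        analyze(repository, dsn, table)
    return sorted(repository)
```

With different DSNs on machine1 and machine2, the same table yields two
repository entries; with an identical DSN on both machines, it yields one.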

Also, refrain from using environment tokens in the ODBC DSN. For example, do not
call it dev_db01. As you migrate objects from dev, to test, to prod, you are likely to
wind up with source objects called dev_db01 in the production repository. ODBC
database names should clearly describe the database they reference to ensure that
users do not incorrectly point sessions to the wrong databases.

Database Connection Information

A good convention for database connection information is
UserName_ConnectString. Be careful not to include machine names or
environment tokens in the Database Connection Name. Database Connection names
must be very generic to be understandable and enable a smooth migration.

Using a convention like User1_DW allows you to know who the session is logging in
as and to what database. You should know which DW database, based on which
repository environment, you are working in. For example, if you are creating a
session in your QA repository using connection User1_DW, the session will write to
the QA DW database because you are in the QA repository.

Using this convention will allow for easier migration if you choose to use the Copy
Folder method. When you use Copy Folder, session information is also copied. If the
Database Connection information does not already exist in the folder you are copying
to, it is also copied. So, if you use connections with names like Dev_DW in your
development repository, they will eventually wind up in your QA, and even in your
Production repository as you migrate folders. Manual intervention would then be
necessary to change connection names, user names, passwords, and possibly even
connect strings. Instead, if you have a User1_DW connection in each of your three
environments, when you copy a folder from Dev to QA, your sessions will
automatically hook up to the connection that already exists in the QA repository.
Now, your sessions are ready to go into the QA repository with no manual
intervention required.
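The Copy Folder behavior described above can be modeled roughly as follows. This is a simplified Python sketch, not the actual repository logic; the `copy_folder` function, the dictionary layout, and the connect string values are assumptions made for illustration.

```python
def copy_folder(session_connections, source_repo, target_repo):
    """Copy connection definitions only when the name is missing in the target."""
    for name in session_connections:
        if name not in target_repo:
            # Missing connections are carried over with their source details.
            target_repo[name] = dict(source_repo[name])
    return target_repo

dev = {"User1_DW": {"user": "User1", "connect_string": "dev_dw_db"}}
qa = {"User1_DW": {"user": "User1", "connect_string": "qa_dw_db"}}

copy_folder(["User1_DW"], dev, qa)
print(qa["User1_DW"]["connect_string"])  # the existing QA definition is kept

# An environment-specific name does not exist in QA, so it is copied as-is.
dev_bad = {"Dev_DW": {"user": "User1", "connect_string": "dev_dw_db"}}
qa_bad = {}
copy_folder(["Dev_DW"], dev_bad, qa_bad)
print(qa_bad)  # QA now holds a connection that still points at development
```

With the generic name, the session picks up the QA definition automatically; with the environment-specific name, the development details travel with the folder and must be corrected by hand.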

Session and Data Partitioning

Challenge

Improving performance by identifying strategies for partitioning relational tables,
XML, COBOL and standard flat files, and by coordinating the interaction between
sessions, partitions, and CPUs. These strategies take advantage of the enhanced
partitioning capabilities in PowerCenter 5.1.

Description

On hardware systems that are under-utilized, it may be possible to improve
performance through parallel execution of the Informatica server engine.
However, parallel execution may impair performance on over-utilized systems
or systems with smaller I/O capacity.

Besides hardware, there are several other factors to consider when
determining if a session is an ideal candidate for partitioning. These
considerations include source and target database setup, target type, and
mapping design. (The Designer client tool is used to implement session
partitioning; see the Partitioning Rules and Validation section of the Designer
Help).

When these factors have been considered and a partitioning strategy has been
selected, the iterative process of adding partitions can begin. Continue adding
partitions to the session until the desired performance threshold is met or
degradation in performance is observed.

Follow these three steps when partitioning your session.

1. First, determine if you should partition your session. Parallel execution
benefits systems that have the following characteristics:

• Under-utilized or intermittently used CPUs. To determine if this is
the case, check the CPU usage of your machine:

- UNIX – type vmstat 1 10 on the command line. The column
“id” displays the percentage of time the CPU was idle during the
specified interval, without any I/O wait. If there are CPU cycles
available (twenty percent or more idle time), then this session’s
performance may be improved by adding a partition.

- NT – check the Task Manager Performance tab.

• Sufficient I/O. To determine the I/O statistics:

- UNIX – type iostat on the command line. The column
“%iowait” displays the percentage of CPU time spent idling while
waiting for I/O requests. The column “%idle” displays the total
percentage of the time that the CPU spends idling (i.e., the unused
capacity of the CPU).

- NT – check the Task Manager Performance tab.

• Sufficient memory. If too much memory is allocated to your session,
you will receive a memory allocation error. Check to see that you’re
using as much memory as you can. If the session is paging, increase
the memory. To determine if the session is paging, follow these steps:

- UNIX – type vmstat 1 10 on the command line. The column “pi” displays
the number of pages swapped in from the page space during the specified
interval. The column “po” displays the number of pages swapped out to the
page space during the specified interval. If these values indicate that paging
is occurring, it may be necessary to allocate more memory, if possible.

- NT – check the Task Manager Performance tab.
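The CPU and paging checks above can be scripted. The following Python sketch averages the idle and paging columns of captured vmstat output and applies the twenty percent idle rule of thumb. The column layout in `SAMPLE` is an assumption (column names and order vary by UNIX flavor; some platforms report `si`/`so` instead of `pi`/`po`), and the numbers are made up for illustration.

```python
def summarize_vmstat(output: str) -> dict:
    """Average the id, pi and po columns of vmstat-style output."""
    lines = [ln.split() for ln in output.strip().splitlines()]
    # The header is the row that names the columns we need.
    header = next(row for row in lines if "id" in row and "pi" in row)
    samples = [row for row in lines
               if len(row) == len(header) and all(tok.isdigit() for tok in row)]

    def avg(col):
        idx = header.index(col)
        return sum(int(row[idx]) for row in samples) / len(samples)

    return {"idle_pct": avg("id"), "page_in": avg("pi"), "page_out": avg("po")}

def is_partition_candidate(stats, min_idle_pct=20.0):
    """Apply the rule of thumb: spare CPU cycles and no paging."""
    return (stats["idle_pct"] >= min_idle_pct
            and stats["page_in"] == 0 and stats["page_out"] == 0)

# Hypothetical capture of two sample intervals.
SAMPLE = """
 r  b   avm   fre  re  pi  po  fr  sr  cy  in   sy  cs us sy id wa
 1  0 22478  1677   0   0   0   0   0   0 188 1380 157 40 10 45  5
 1  0 22506  1609   0   0   0   0   0   0 214 1476 186 50 12 35  3
"""

stats = summarize_vmstat(SAMPLE)
print(stats, is_partition_candidate(stats))
```

In practice the text would come from running the command during a representative session run rather than from a fixed string.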

2. The next step is to set up the partition. The following are selected hints
for session setup; see the Session and Server Guide for further directions on
setting up partitioned sessions.

• Add one partition at a time. To best monitor performance, add one
partition at a time, and note your session settings before you add each
partition.
• Set DTM Buffer Memory. For a session with n partitions, this value
should be at least n times the original value for the non-partitioned
session.
• Set cached values for Sequence Generator. For a session with n
partitions, there should be no need to use the “Number of Cached
Values” property of the sequence generator. If you must set this value
to a value greater than zero, make sure it is at least n times the
original value for the non-partitioned session.
• Partition the source data evenly. The source data should be
partitioned into equal sized chunks for each partition.
• Partition tables. A notable increase in performance can also be
realized when the actual source and target tables are partitioned.
Work with the DBA to discuss the partitioning of source and target
tables, and the setup of tablespaces.
• Consider using an external loader. As with any session, using an
external loader may increase session performance. You can only use
Oracle external loaders for partitioning. Refer to the Session and
Server Guide for more information on using and setting up the Oracle
external loader for partitioning.
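The "n times the original value" rule and the even split of source data can be worked through with a small Python sketch. Both helper functions are hypothetical, and `even_key_ranges` assumes that key values are spread uniformly, which is the only case in which equal key ranges give equal row counts.

```python
def scale_session_settings(n_partitions, dtm_buffer_bytes, cached_values=0):
    """Return minimum settings for a session split into n partitions."""
    return {
        "dtm_buffer_bytes": n_partitions * dtm_buffer_bytes,
        # Zero means the Number of Cached Values property is left unused.
        "sequence_cached_values": n_partitions * cached_values,
    }

def even_key_ranges(min_key, max_key, n_partitions):
    """Split the inclusive key range into n nearly equal sub-ranges."""
    total = max_key - min_key + 1
    base, extra = divmod(total, n_partitions)
    ranges, start = [], min_key
    for i in range(n_partitions):
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges

print(scale_session_settings(3, 12_000_000, cached_values=1000))
print(even_key_ranges(1, 10, 3))
```

Each resulting range could serve as the filter condition for one partition of the source.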

3. The third step is to monitor the session to see if the partition is
degrading or improving session performance. If the session performance is
improved and the session meets the requirements of step 1, add another
partition.

• Write throughput. Check the session statistics to see if you have
increased the write throughput.
• Paging. Check to see if the session is now causing the system to
page. When you partition a session and there are cached lookups, you
must make sure that DTM memory is increased to handle the lookup
caches. When you partition a source that uses a static lookup cache,
the Informatica Server creates one memory cache for each partition
and one disk cache for each transformation. Therefore, the memory
requirements will grow for each partition. If the memory is not
bumped up, the system may start paging to disk, causing degradation
in performance.
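The add-one-partition-and-measure cycle described in the three steps can be expressed as a simple loop. In this Python sketch, `measure_throughput` is a hypothetical callback that would run the session with a given partition count and return the observed write throughput; the figures in `observed` are invented.

```python
def tune_partitions(measure_throughput, max_partitions=8):
    """Add partitions one at a time until throughput stops improving."""
    best_n, best_rate = 1, measure_throughput(1)
    for n in range(2, max_partitions + 1):
        rate = measure_throughput(n)
        if rate <= best_rate:
            break  # degradation (or no gain): keep the previous setting
        best_n, best_rate = n, rate
    return best_n, best_rate

# Hypothetical measurements: rows per second for each partition count.
observed = {1: 1000, 2: 1800, 3: 2300, 4: 2100}
result = tune_partitions(lambda n: observed[n], max_partitions=4)
print(result)
```

A fuller version would also re-check CPU, I/O, and paging before each additional partition, as step 1 requires.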

Assumptions

The following assumptions pertain to the source and target systems of a
session that is a candidate for partitioning. These conditions can help to
maximize the benefits that can be achieved through partitioning.

• Indexing has been implemented on the partition key when using a
relational source.
• Source files are located on the same physical machine as the PMServer
process when partitioning flat files, COBOL and XML, to reduce
network overhead and delay.
• All possible constraints are dropped or disabled on relational targets.
• All possible indexes are dropped or disabled on relational targets.
• Table Spaces and Database Partitions are properly managed on the
target system.
• Target files are written to the same physical machine that hosts the
PMServer process, in order to reduce network overhead and delay.
• Oracle External Loaders are utilized whenever possible (Parallel Mode).

Using Parameters, Variables and Parameter Files

Challenge

Understanding how parameters, variables, and parameter files work and using them
for maximum efficiency.

Description

Prior to the release of PowerCenter 5.x, the only variables inherent to the product
were those defined in specific transformations and those Server variables that were
global in nature. Transformation variables were defined as variable ports in a
transformation and could only be used in that specific Transformation object (e.g.,
Expression, Aggregator and Rank Transformations). Similarly, global parameters
defined within Server Manager would affect the subdirectories for Source Files,
Target Files, Log Files, etc.

PowerCenter 5.x has made variables and parameters available across the entire
mapping rather than for a specific transformation object. In addition, it provides
built-in parameters for use within Server Manager. Using parameter files, these
values can change from session-run to session-run.

Mapping Variables

You declare mapping variables in PowerCenter Designer using the menu option
Mappings -> Parameters and Variables. After mapping variables are selected,
you use the pop-up window to create a variable by specifying its name, data type,
initial value, aggregation type, precision and scale. This is similar to creating a port
in most transformations.

Variables, by definition, are objects that can change value dynamically. Informatica
added four functions to change the value of mapping variables:

• SetVariable
• SetMaxVariable
• SetMinVariable
• SetCountVariable

A mapping variable can store the last value from a session run in the repository to be
used as the starting value for the next session run.

Name

The name of the variable should be descriptive and be preceded by ‘$$’ (so that it is
easily identifiable as a variable). A typical variable name is:
$$Procedure_Start_Date.

Aggregation Type

This entry creates specific functionality for the variable and determines how it stores
data. For example, with an aggregation type of Max, the value stored in the
repository would be the max value across ALL session runs until the value is deleted.

Initial Value

This value is used during the first session run when there is no corresponding and
overriding parameter file. This value is also used if the stored repository value is
deleted. If no initial value is identified, then a data type specific default value is
used.

Variable values are not stored in the repository when the session:

• Fails to complete.
• Is configured for a test load.
• Is a debug session.
• Runs in debug mode and is configured to discard session output.

Order of Evaluation

The start value is the value of the variable at the start of the session. The start value
can be a value defined in the parameter file for the variable, a value saved in the
repository from the previous run of the session, a user-defined initial value for the
variable, or the default value based on the variable data type.

The PowerCenter Server looks for the start value in the following order:

1. Value in session parameter file
2. Value saved in the repository
3. Initial value
4. Default value
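The search order above amounts to a first-match lookup. The Python sketch below is only an illustration of that order; the function name, the dictionaries standing in for the parameter file and the repository, and the `datatype_default` argument are assumptions.

```python
def resolve_start_value(name, param_file, repository, initial_value,
                        datatype_default):
    """Return the first value found, following the documented search order."""
    if name in param_file:            # 1. value in session parameter file
        return param_file[name]
    if name in repository:            # 2. value saved in the repository
        return repository[name]
    if initial_value is not None:     # 3. initial value from the mapping
        return initial_value
    return datatype_default           # 4. default for the data type

saved = {"$$Post_Date": "02/03/1998"}
override = {"$$Post_Date": "04/21/2001"}

print(resolve_start_value("$$Post_Date", override, saved, "01/01/1900", None))
print(resolve_start_value("$$Post_Date", {}, saved, "01/01/1900", None))
print(resolve_start_value("$$Post_Date", {}, {}, "01/01/1900", None))
```

For a mapping parameter, the same function applies with an empty repository dictionary, since parameter values are not saved between runs.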

Mapping Parameters and Variables

Since parameter values do not change over the course of the session run, the value
used is based on:

• Value in session parameter file
• Initial value
• Default value

Once defined, mapping parameters and variables can be used in the Expression
Editor section of the following transformations:

• Expression
• Filter
• Router
• Update Strategy

Mapping parameters and variables also can be used within the Source Qualifier in the
SQL query, user-defined join, and source filter sections.

Parameter Files

Parameter files can be used to override values of mapping variables or mapping
parameters, or to define Server-specific values for a session run. Parameter files
have a very simple and defined format; they are divided into session-specific
sections, with each section heading enclosed in brackets as [FOLDER.SESSION_NAME].
The naming is case sensitive. Parameters or variables must be defined in the mapping
to be used. A line can be commented out (‘REMed’ out) by placing a semicolon at the
beginning. Parameter files do not globally assign values.

Some parameter file examples:

[USER1.s_m_subscriberstatus_load]

$$Post_Date_Var=10/04/2001

[USER1.s_test_var1]

$$PMSuccessEmailUser=XXX@informatica.com

;$$Help_User
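To make the file format concrete, the following Python sketch reads a parameter file into a dictionary per section. It is a minimal reader written for illustration, not the parser used by the Server; names are kept exactly as written because the naming is case sensitive.

```python
def parse_parameter_file(text):
    """Parse [FOLDER.SESSION] sections of name=value lines into nested dicts."""
    sections, current = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(";"):
            continue  # blank line or commented-out entry
        if line.startswith("[") and line.endswith("]"):
            current = sections.setdefault(line[1:-1], {})
        elif "=" in line and current is not None:
            name, value = line.split("=", 1)
            current[name.strip()] = value.strip()
    return sections

EXAMPLE = """
[USER1.s_m_subscriberstatus_load]
$$Post_Date_Var=10/04/2001

[USER1.s_test_var1]
$$PMSuccessEmailUser=XXX@informatica.com
;$$Help_User
"""

parsed = parse_parameter_file(EXAMPLE)
print(parsed)
```

The commented-out `$$Help_User` line is skipped, so it does not appear in the result.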

A parameter file is declared for use by a session either within the session properties,
at the outer-most batch a session resides in, or as a parameter value when using the
PMCMD command.

The following parameters and variables can be defined or overridden within the
parameter file:

Parameter & Variable Type                 Parameter & Variable Name   Desired Definition
String Mapping Parameter                  $$State                     MA
Datetime Mapping Variable                 $$Time                      10/1/2000 00:00:00
Source File (Session Parameter)           $InputFile1                 Sales.txt
Database Connection (Session Parameter)   $DBConnection_Target        Sales (database connection)
Session Log File (Session Parameter)      $PMSessionLogFile           d:/session logs/firstrun.txt

Parameters and variables cannot be used in the following:

• Lookup SQL Override.
• Lookup Location (Connection String).
• Schema/Owner names within Target Objects/Session Properties.

Example: Variables and Parameters in an Incremental Strategy

Variables and parameters can enhance incremental strategies. The following example
uses a mapping variable, an expression transformation object, and a parameter file
for restarting.

Scenario

Company X wants to start with an initial load of all data but wants subsequent
process runs to select only new information. The data in this environment has an
inherent post date, stored in a column named Date_Entered, that can be used for this
purpose. The process will run once every twenty-four hours.

Sample Solution

Create a mapping with source and target objects. From the menu create a new
mapping variable named $$Post_Date with the following attributes:

• TYPE – Variable
• DATATYPE – Date/Time
• AGGREGATION TYPE – MAX
• INITIAL VALUE – 01/01/1900

Note that there is no need to encapsulate the INITIAL VALUE with quotation marks.
However, if this value is used within the Source Qualifier SQL, it is necessary to use
the native RDBMS function to convert it (e.g., TO_DATE(--,--)).

Within the Source Qualifier Transformation, use the following in the Source_Filter
Attribute: DATE_ENTERED > TO_DATE('$$Post_Date','MM/DD/YYYY HH24:MI:SS')

Also note that the initial value 01/01/1900 will be expanded by the PowerCenter
Server to 01/01/1900 00:00:00, hence the need to convert the parameter to a date
time.
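The effect of the expansion on the source filter can be previewed with a plain text substitution. This Python sketch assumes that the variable is replaced textually before the SQL reaches the database; the `expand_parameters` helper is hypothetical.

```python
def expand_parameters(text, values):
    """Replace each $$name with its value, longest names first."""
    for name in sorted(values, key=len, reverse=True):
        text = text.replace(name, values[name])
    return text

source_filter = "DATE_ENTERED > TO_DATE('$$Post_Date','MM/DD/YYYY HH24:MI:SS')"
expanded = expand_parameters(source_filter, {"$$Post_Date": "01/01/1900 00:00:00"})
print(expanded)
```

Replacing longer names first avoids a shorter name matching the prefix of a longer one, for example `$$Post` inside `$$Post_Date`.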

The next step is to pass $$Post_Date and Date_Entered to an Expression transformation.
This is where the function for setting the variable will reside. An output port named
Post_Date is created with a data type of date/time. In the expression code section,
place the following function:

SETMAXVARIABLE($$Post_Date,DATE_ENTERED)

The function evaluates each value for DATE_ENTERED and updates the variable with
the Max value to be passed forward. For example:

DATE_ENTERED   Resultant POST_DATE
9/1/2000       9/1/2000
10/30/2001     10/30/2001
9/2/2000       10/30/2001
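The row-by-row behavior in the table can be reproduced with a few lines of Python. This is a simulation of the Max aggregation only, not of the SETMAXVARIABLE function itself; the `set_max_variable` helper is hypothetical.

```python
from datetime import date

def set_max_variable(current, candidate):
    """Mimic a Max-aggregated variable: keep the larger of the two values."""
    return candidate if candidate > current else current

post_date = date(1900, 1, 1)  # initial value from the example
rows = [date(2000, 9, 1), date(2001, 10, 30), date(2000, 9, 2)]

trace = []
for date_entered in rows:
    post_date = set_max_variable(post_date, date_entered)
    trace.append(post_date)

for date_entered, resultant in zip(rows, trace):
    print(date_entered, resultant)
```

The third row shows the key point: an earlier DATE_ENTERED does not lower the value that has already been reached.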

Consider the following with regard to the functionality:

1. In order for the function to assign a value and ultimately store it in the
repository, the port must be connected to a downstream object. It need not
go to the target, but it must go to another Expression Transformation. The
reason is that the memory will not be instantiated unless it is used in a
downstream transformation object.
2. In order for the function to work correctly, the rows have to be marked for
insert. If the mapping is an update only mapping (i.e., Treat Rows As is set to
Update in the session properties) the function will not work. In this case,
make the session Data Driven and add an Update Strategy after the
transformation containing the SETMAXVARIABLE function, but before the
Target.
3. If the intent is to store the original Date_Entered per row and not the
evaluated date value, then add an ORDER BY clause to the Source Qualifier.
That way the dates are processed and set in order and data is preserved.

The first time this mapping is run the SQL will select from the source where
Date_Entered is > 01/01/1900 providing an initial load. As data flows through the
mapping, the variable gets updated to the Max Date_Entered it encounters. Upon
successful completion of the session, the variable is updated in the Repository for
use in the next session run. To view the current value for a particular variable
associated with the session, right-click on the session and choose View Persistent
Values.

The following graphic shows that after the initial run, the Max Date_Entered was
02/03/1998. The next time this session is run, based on the variable in the Source
Qualifier Filter, only source rows where Date_Entered > 02/03/1998 will be processed.

Resetting or Overriding Persistent Values

To reset the persistent value to the initial value declared in the mapping, view the
persistent value from Server Manager (see graphic above) and press Delete Values.
This will delete the stored value from the Repository, causing the Order of Evaluation
to use the Initial Value declared from the mapping.

If a session run is needed for a specific date, use a parameter file. There are two
basic ways to accomplish this:

• Create a generic parameter file, place it on the server, and point all sessions
to that parameter file. A session may (or may not) have a variable, and the
parameter file need not have variables and parameters defined for every
session ‘using’ the parameter file. To override the variable, either change,
uncomment or delete the variable in the parameter file.
• Run PMCMD for that session but declare the specific parameter file within the
PMCMD command.

Parameter files can be declared in Session Properties under the Log & Error Handling
Tab.

In this example, after the initial session is run the parameter file contents may look
like:

[Test.s_Incremental]

;$$Post_Date=

By using the semicolon, the variable override is ignored and the Initial Value or
Stored Value is used. If, in the subsequent run, the data processing date needs to be
set to a specific date (for example: 04/21/2001), then a simple Perl Script can
update the parameter file to:

[Test.s_Incremental]

$$Post_Date=04/21/2001

Upon running the sessions, the order of evaluation looks to the parameter file first,
sees a valid variable and value and uses that value for the session run. After
successful completion, run another script to reset the parameter file.
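The set-and-reset scripts mentioned above could take many forms. The text suggests Perl; the following is an equivalent sketch in Python that edits the parameter file contents as a string. The `set_override` helper is hypothetical, and reading and writing the actual file is left out.

```python
def set_override(text, section, name, value=None):
    """Set name=value in a section, or comment it out when value is None."""
    out, in_section = [], False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("["):
            in_section = stripped == f"[{section}]"
        elif in_section and stripped.lstrip(";").split("=")[0].strip() == name:
            line = f";{name}=" if value is None else f"{name}={value}"
        out.append(line)
    return "\n".join(out)

original = "[Test.s_Incremental]\n;$$Post_Date="

# Before the special run: activate the override for a specific date.
active = set_override(original, "Test.s_Incremental", "$$Post_Date", "04/21/2001")
print(active)

# After successful completion: comment the override out again.
reset = set_override(active, "Test.s_Incremental", "$$Post_Date")
print(reset)
```

Only the named entry in the named section is touched, so a generic parameter file shared by several sessions stays intact.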

Example: Using Session and Mapping Parameters in Multiple Database Environments

Reusable mappings that can source a common table definition across multiple
databases, regardless of differing environmental definitions (e.g. instances, schemas,
user/logins), are required in a multiple database environment.

Scenario

Company X maintains five Oracle database instances. All instances have a common
table definition for sales orders, but each instance has a unique instance name,
schema and login.

DB Instance   Schema     Table        User    Password
ORC1          aardso     orders       Sam     max
ORC99         environ    orders       Help    me
HALC          hitme      order_done   Hi      Lois
UGLY          snakepit   orders       Punch   Judy
GORF          gmer       orders       Brer    Rabbit

The sales order table may be named differently in each instance, but it has the same
definition:

ORDER_ID         NUMBER (28)    NOT NULL,
DATE_ENTERED     DATE           NOT NULL,
DATE_PROMISED    DATE           NOT NULL,
DATE_SHIPPED     DATE           NOT NULL,
EMPLOYEE_ID      NUMBER (28)    NOT NULL,
CUSTOMER_ID      NUMBER (28)    NOT NULL,
SALES_TAX_RATE   NUMBER (5,4)   NOT NULL,
STORE_ID         NUMBER (28)    NOT NULL

Sample Solution

Using Server Manager, create multiple connection strings. In this example, the
strings are named according to the DB Instance name. Using Designer create the
mapping that sources the commonly defined table. Then create a Mapping Parameter
named $$Source_Schema_Table with the following attributes:

Note that the parameter attributes vary based on the specific environment. Also, the
initial value is not required as this solution will use parameter files.

Open the source qualifier and use the mapping parameter in the SQL Override as
shown in the following graphic.

Open the Expression Editor and select Generate SQL. The generated SQL statement
will show the columns. Override the table names in the SQL statement with the
mapping parameter.

Using Server Manager, create a session based on this mapping. In the Source
Database connection drop-down, place the following parameter:
$DBConnection_Source. Point the target to the corresponding target and finish.

Now create the parameter file. In this example, there will be five separate parameter
files.

Parmfile1.txt

[Test.s_Incremental_SOURCE_CHANGES]
$$Source_Schema_Table=aardso.orders
$DBConnection_Source=ORC1

Parmfile2.txt

[Test.s_Incremental_SOURCE_CHANGES]
$$Source_Schema_Table=environ.orders
$DBConnection_Source=ORC99

Parmfile3.txt

[Test.s_Incremental_SOURCE_CHANGES]
$$Source_Schema_Table=hitme.order_done
$DBConnection_Source=HALC

Parmfile4.txt

[Test.s_Incremental_SOURCE_CHANGES]
$$Source_Schema_Table=snakepit.orders
$DBConnection_Source=UGLY

Parmfile5.txt

[Test.s_Incremental_SOURCE_CHANGES]
$$Source_Schema_Table=gmer.orders
$DBConnection_Source=GORF
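Because the five files differ only in schema, table, and connection name, they can be generated from the scenario table rather than typed by hand. The Python sketch below builds the file contents in memory; the `render_parameter_file` helper is hypothetical, and writing the files to disk is left to the reader.

```python
ENVIRONMENTS = [  # (connection name, schema, table) from the scenario table
    ("ORC1", "aardso", "orders"),
    ("ORC99", "environ", "orders"),
    ("HALC", "hitme", "order_done"),
    ("UGLY", "snakepit", "orders"),
    ("GORF", "gmer", "orders"),
]

def render_parameter_file(connection, schema, table,
                          section="Test.s_Incremental_SOURCE_CHANGES"):
    """Return the text of one parameter file for the given environment."""
    return (f"[{section}]\n"
            f"$$Source_Schema_Table={schema}.{table}\n"
            f"$DBConnection_Source={connection}\n")

files = {f"Parmfile{i}.txt": render_parameter_file(*env)
         for i, env in enumerate(ENVIRONMENTS, start=1)}

for file_name, content in files.items():
    print(file_name)
    print(content)
```

Adding a sixth database then only requires one more entry in `ENVIRONMENTS`.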

Use PMCMD to run the five sessions in parallel. The syntax for PMCMD for starting
sessions is as follows:

pmcmd start {user_name | %user_env_var} {password | %password_env_var}
{[TCP/IP:][hostname:]portno | IPX/SPX:ipx/spx_address}
[folder_name:]{session_name | batch_name}[:pf=param_file] session_flag
wait_flag

In this environment there would be five separate commands:

pmcmd start tech_user pwd 127.0.0.1:4001 Test:s_Incremental_SOURCE_CHANGES:pf='\$PMRootDir\ParmFiles\Parmfile1.txt' 1 1

pmcmd start tech_user pwd 127.0.0.1:4001 Test:s_Incremental_SOURCE_CHANGES:pf='\$PMRootDir\ParmFiles\Parmfile2.txt' 1 1

pmcmd start tech_user pwd 127.0.0.1:4001 Test:s_Incremental_SOURCE_CHANGES:pf='\$PMRootDir\ParmFiles\Parmfile3.txt' 1 1

pmcmd start tech_user pwd 127.0.0.1:4001 Test:s_Incremental_SOURCE_CHANGES:pf='\$PMRootDir\ParmFiles\Parmfile4.txt' 1 1

pmcmd start tech_user pwd 127.0.0.1:4001 Test:s_Incremental_SOURCE_CHANGES:pf='\$PMRootDir\ParmFiles\Parmfile5.txt' 1 1
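The five launches could also be driven from one script. The Python sketch below assembles the argument lists according to the syntax shown earlier and starts them side by side with `subprocess.Popen`. The helper names are hypothetical, the call that actually starts the sessions is commented out because it needs pmcmd on the PATH, and no shell quoting or backslash before `$PMRootDir` is used because the arguments are passed as a list rather than through a shell.

```python
import subprocess

def build_pmcmd_start(user, password, host_port, folder, session, param_file,
                      session_flag="1", wait_flag="1"):
    """Assemble the argument list for one pmcmd start call."""
    target = f"{folder}:{session}:pf={param_file}"
    return ["pmcmd", "start", user, password, host_port, target,
            session_flag, wait_flag]

def start_all(commands):
    """Launch every command without waiting, then collect the exit codes."""
    processes = [subprocess.Popen(cmd) for cmd in commands]
    return [proc.wait() for proc in processes]

commands = [
    build_pmcmd_start("tech_user", "pwd", "127.0.0.1:4001", "Test",
                      "s_Incremental_SOURCE_CHANGES",
                      rf"$PMRootDir\ParmFiles\Parmfile{i}.txt")
    for i in range(1, 6)
]

for cmd in commands:
    print(" ".join(cmd))

# start_all(commands)  # requires pmcmd on the PATH of the calling machine
```

Each pmcmd call waits for its own session, while the script as a whole runs the five calls concurrently.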

Alternatively, you could run the sessions in sequence with one parameter file. In this
case, a pre- or post-session script would change the parameter file for the next
session.

A Mapping Approach to Trapping Data Errors

Challenge

Addressing data content errors within mappings to facilitate re-routing erroneous
rows to a target other than the original target table.

Description

Identifying errors and creating an error handling strategy is an essential part of a
data warehousing project. In the production environment, data must be checked and
validated prior to entry into the data warehouse. One strategy for handling errors is
to maintain database constraints. Another approach is to use mappings to trap data
errors.

The first step in using mappings to trap errors is understanding and identifying the
error handling requirement.

The following questions should be considered:

• What types of errors are likely to be encountered?
• Of these errors, which ones should be captured?
• What process can capture the possible errors?
• Should errors be captured before they have a chance to be written to the
target database?
• Should bad files be used?
• Will any of these errors need to be reloaded or corrected?
• How will the users know if errors are encountered?
• How will the errors be stored?
• Should descriptions be assigned for individual errors?
• Can a table be designed to store captured errors and the error descriptions?

Capturing data errors within a mapping and re-routing these errors to an error table
allows for easy analysis for the end users and improves performance. For example,
suppose it is necessary to identify foreign key constraint errors within a mapping.
This can be accomplished by creating a lookup into a dimension table prior to loading
the fact table. Referential integrity is assured by including this functionality in a
mapping. The database still enforces the foreign key constraints, but erroneous data
will not be written to the target table. Also, if constraint errors are captured within
the mapping, the PowerCenter server will not have to write the error to the session
log and the reject/bad file.
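The lookup-before-load idea can be sketched outside of PowerCenter as a simple membership test. In the Python example below, the function name, the column name `CUSTOMER_ID`, and the sample rows are assumptions chosen for illustration.

```python
def split_by_dimension_lookup(fact_rows, dimension_keys, fk_column):
    """Separate fact rows whose foreign key is missing from the dimension."""
    valid, errors = [], []
    for row in fact_rows:
        (valid if row[fk_column] in dimension_keys else errors).append(row)
    return valid, errors

customer_dim_keys = {101, 102}  # keys already present in the dimension table
fact_rows = [
    {"ORDER_ID": 1, "CUSTOMER_ID": 101},
    {"ORDER_ID": 2, "CUSTOMER_ID": 999},  # no matching dimension row
]

valid, errors = split_by_dimension_lookup(fact_rows, customer_dim_keys,
                                          "CUSTOMER_ID")
print(valid)
print(errors)
```

Only the `valid` rows would continue to the fact table, so the database never has to reject a row for a missing parent key.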

Data content errors also can be captured in a mapping. Mapping logic can identify
data content errors and attach descriptions to the errors. This approach can be
effective for many types of data content errors, including: date conversion, null
values intended for not null target fields, and incorrect data formats or data types.

Error Handling Example

In the following example, we want to capture null values before they enter into a
target field that does not allow nulls.

After we’ve identified the type of error, the next step is to separate the error from
the data flow. Use the Router Transformation to create a stream of data that will be
the error route. Any row containing an error (or errors) will be separated from the
valid data and uniquely identified with a composite key consisting of a MAPPING_ID
and a ROW_ID. The MAPPING_ID refers to the mapping name and the ROW_ID is
generated by a Sequence Generator. The composite key allows developers to trace
rows written to the error tables.

Error tables are important to an error handling strategy because they store the
information useful to error identification and troubleshooting. In this example, the
two error tables are ERR_DESC_TBL and TARGET_NAME_ERR.

The ERR_DESC_TBL table will hold information about the error, such as the mapping
name, the ROW_ID, and a description of the error. This table is designed to hold all
error descriptions for all mappings within the repository for reporting purposes.

The TARGET_NAME_ERR table will be an exact replica of the target table with two
additional columns: ROW_ID and MAPPING_ID. These two columns allow the
TARGET_NAME_ERR and the ERR_DESC_TBL to be linked. The TARGET_NAME_ERR
table provides the user with the entire row that was rejected, enabling the user to
trace the error rows back to the source. These two tables might look like the
following:

The error handling functionality must assign a unique description to each error
in the rejected row. In this example, any null value intended for a not null target
field will generate an error message such as ‘Column1 is NULL’ or ‘Column2 is NULL’.
This step can be done in an Expression Transformation.

After field descriptions are assigned, we need to break the error row into several
rows, with each containing the same content except for a different error description.
You can use the Normalizer Transformation to break one row of data into many rows.
After a single row of data is separated based on the number of
possible errors in it, we need to filter the columns within the row that are actually
errors. For example, one row of data may have as many as three errors, but in this
case, the row actually has only one error so we need to write only one error with its
description to the ERR_DESC_TBL.

When the row is written to the ERR_DESC_TBL, we can link this row to the row in the
TARGET_NAME_ERR table using the ROW_ID and the MAPPING_ID. The following
chart shows how the two error tables can be linked. Focus on the bold selections in
both tables.

TARGET_NAME_ERR

Column1   Column2   Column3   ROW_ID   MAPPING_ID
NULL      NULL      NULL      1        DIM_LOAD

ERR_DESC_TBL

FOLDER_NAME   MAPPING_ID   ROW_ID   ERROR_DESC         LOAD_DATE   SOURCE   Target
CUST          DIM_LOAD     1        Column 1 is NULL   SYSDATE     DIM      FACT
CUST          DIM_LOAD     1        Column 2 is NULL   SYSDATE     DIM      FACT
CUST          DIM_LOAD     1        Column 3 is NULL   SYSDATE     DIM      FACT
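The routing, key assignment, and one-row-per-error expansion described in this example can be traced in a short Python sketch. It stands in for the Router, Sequence Generator, Expression, and Normalizer steps with ordinary code; the function name is hypothetical, and the LOAD_DATE, SOURCE, and Target columns are omitted for brevity.

```python
from itertools import count

def trap_null_errors(rows, not_null_columns, mapping_id, folder_name):
    """Route rows with NULLs to error structures.

    Returns (valid_rows, target_err_rows, err_desc_rows).
    """
    row_ids = count(1)  # stands in for the Sequence Generator
    valid, target_err, err_desc = [], [], []
    for row in rows:
        null_cols = [c for c in not_null_columns if row.get(c) is None]
        if not null_cols:
            valid.append(row)
            continue
        row_id = next(row_ids)
        target_err.append({**row, "ROW_ID": row_id, "MAPPING_ID": mapping_id})
        # One description row per column that is actually in error.
        for col in null_cols:
            err_desc.append({"FOLDER_NAME": folder_name,
                             "MAPPING_ID": mapping_id,
                             "ROW_ID": row_id,
                             "ERROR_DESC": f"{col} is NULL"})
    return valid, target_err, err_desc

rows = [
    {"Column1": None, "Column2": None, "Column3": None},
    {"Column1": 1, "Column2": 2, "Column3": 3},
]
valid, target_err, err_desc = trap_null_errors(
    rows, ["Column1", "Column2", "Column3"], "DIM_LOAD", "CUST")
print(target_err)
print(err_desc)
```

The shared ROW_ID and MAPPING_ID values are what allow the rejected row and its descriptions to be joined later for reporting.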

The solution example would look like the following in a mapping:

The mapping approach is effective because it takes advantage of reusable objects,
thereby using the same logic repeatedly within a mapplet. This makes error
detection easy to implement and manage in a variety of mappings.

By adding another layer of complexity within the mappings, errors can be flagged as
‘soft’ or ‘hard’. A ‘hard’ error can be defined as one that would fail when being
written to the database, such as a constraint error. A ‘soft’ error can be defined as a
data content error. A record flagged as a hard error is written to the error route,
while a record flagged as a soft error can be written to the target system and the
error tables. This gives business analysts an opportunity to evaluate and correct data
imperfections while still allowing the records to be processed for end-user reporting.
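The hard versus soft distinction can be reduced to a small routing rule. In the Python sketch below, each check is a pair of a description and a test function; the `route_record` helper, the check definitions, and the sample records are assumptions made for illustration.

```python
def route_record(record, hard_checks, soft_checks):
    """Return (write_to_target, error_descriptions) for one record."""
    hard = [desc for desc, failed in hard_checks if failed(record)]
    soft = [desc for desc, failed in soft_checks if failed(record)]
    # Hard errors keep the record out of the target; soft errors do not.
    return (not hard), hard + soft

hard_checks = [
    ("ORDER_ID is NULL", lambda r: r.get("ORDER_ID") is None),
]
soft_checks = [
    ("SALES_TAX_RATE out of range",
     lambda r: not 0 <= r.get("SALES_TAX_RATE", 0) <= 1),
]

soft_case = route_record({"ORDER_ID": 1, "SALES_TAX_RATE": 5},
                         hard_checks, soft_checks)
hard_case = route_record({"ORDER_ID": None, "SALES_TAX_RATE": 0.05},
                         hard_checks, soft_checks)
print(soft_case)
print(hard_case)
```

In both cases the descriptions would be written to the error tables; only the first element decides whether the record also reaches the target.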

Ultimately, business organizations need to decide if the analysts should fix the data
in the reject table or in the source systems. The advantage of the mapping approach
is that all errors are identified as either data errors or constraint errors and can be
properly addressed. The mapping approach also reports errors based on projects or
categories by identifying the mappings that contain errors. The most important
aspect of the mapping approach, however, is its flexibility. Once an error type is
identified, the error handling logic can be placed anywhere within a mapping. By
using the mapping approach to capture identified errors, data warehouse operators
can effectively communicate data quality issues to the business users.

Design Error Handling Infrastructure

Challenge

Understanding the need for an error handling strategy, identifying potential errors,
and determining an optimal plan for error handling.

Description

It is important to realize the need for an error handling strategy, then devise an
infrastructure to resolve the errors. Although error handling varies from project to
project, the typical requirement of an error handling system is to address data
quality issues (i.e., dirty data). Implementing an error handling strategy requires a
significant amount of planning and understanding of the load process. You should
prepare a high level data flow design to illustrate the load process and the role that
error handling plays in it.

Error handling is an integral part of any load process and directly affects the process
when it starts and stops. An error handling strategy should be capable of accounting
for unrecoverable errors during the load process and provide crash recovery, stop,
and restart capabilities. Stop and restart processes can be managed through the pre-
and post- session shell scripts for each PowerCenter session.

Although source systems vary widely in functionality and data quality standards, at
some point a record with incorrect data will be introduced into the data warehouse
from a source system. The error handling strategy should reject these rows, provide
a place to put the rejected rows, and set a limit on how many errors can occur
before the load process stops. It also should report on the rows that are rejected by
the load process, and provide a mechanism for reload.
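
The requirements in the preceding paragraph (reject bad rows, keep them somewhere, and stop the load once too many errors occur) can be sketched in a few lines of Python. This is a minimal illustration only and is not PowerCenter functionality; the names load_with_threshold, ErrorThresholdExceeded, is_valid, and the sample rows are hypothetical.

```python
class ErrorThresholdExceeded(Exception):
    """Raised when more rows are rejected than the load is allowed to tolerate."""


def load_with_threshold(rows, is_valid, error_threshold):
    """Split rows into (loaded, rejected); stop once rejects exceed the threshold."""
    loaded, rejected = [], []
    for row in rows:
        if is_valid(row):
            loaded.append(row)
            continue
        rejected.append(row)  # kept so the rows can be reported on and reloaded
        if len(rejected) > error_threshold:
            raise ErrorThresholdExceeded(
                f"{len(rejected)} rows rejected; limit is {error_threshold}"
            )
    return loaded, rejected


# Hypothetical sample data: a row is valid when 'amount' is present.
source_rows = [
    {"id": 1, "amount": 10},
    {"id": 2, "amount": None},
    {"id": 3, "amount": 7},
]
loaded, rejected = load_with_threshold(
    source_rows, lambda row: row["amount"] is not None, error_threshold=1
)
print(len(loaded), "loaded;", len(rejected), "rejected")
```

In a real load the rejected list would be written to a reject file or table, and the exception would trigger the notification process described below.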

Regardless of whether an error requires manual inspection, correction of data or a
rerun of the process, the owner needs to know if any rows were loaded or changed
during the load, especially if a response is critical to the continuation of the process.
Therefore, it is critical to have a notification process in place. PowerCenter includes
post-session e-mail functionality that can trigger the delivery of e-mail. Post-session
scripts can be written to extend the notification process so that it sends
detailed messages upon receipt of an error or file.



The following table presents examples of one company’s error conditions and the
associated notification actions:

Error Condition: Arrival of .DAT and .SENT files. A timer checks whether the files have
arrived by 3:00 AM for daily loads and 2:00 PM Saturday for weekly loads.
Notification Action: If the .DAT files or .SENT file do not arrive by 3:00 AM, send an
e-mail notification to the Production Support on-call resource.
1) E-mail

Error Condition: Tablespace check and database constraints check for creating target
tables.
Notification Action: If the required tablespace is not available, all loads that are
part of the system are aborted, and notification is sent to the DBA and Production
Support.
1) E-mail
2) Page

Error Condition: Timer to check if the load has completed by 5:00 AM.
Notification Action: If the load has not completed within the 2-hour window, by 5:00 AM,
send an e-mail notification to Production Support.
1) E-mail
2) Page

Error Condition: The rejected record number crosses the error threshold limit OR the
Informatica PowerCenter session fails for any other reason.
Notification Action: Load the rejected records to a reject file and send an e-mail
notification to Production Support.
1) E-mail
2) Page

Error Condition: Match the Hash Total and the Column Totals loaded in the target tables
with the contents of the .SENT file. If they do not match, roll back the records loaded
in the target.
Notification Action: If the Hash Total and the total number of records do not match,
roll back the data load and send notification to Production Support.
1) E-mail
2) Page
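
The last error condition, matching loaded totals against the control file, might be sketched as follows. The layout of the .SENT file is not described here, so this sketch simply assumes that its contents have already been parsed into a record count and a hash total; the function names, the control dictionary, and the 'amount' column are all hypothetical.

```python
from decimal import Decimal


def totals_match(loaded_rows, expected_count, expected_hash_total, hash_column):
    """Return True when the loaded rows agree with both control totals."""
    actual_count = len(loaded_rows)
    actual_hash_total = sum(
        (Decimal(str(row[hash_column])) for row in loaded_rows), Decimal("0")
    )
    return actual_count == expected_count and actual_hash_total == expected_hash_total


def reconcile(loaded_rows, control, hash_column="amount"):
    """Decide what to do with a load: 'commit' or 'rollback_and_notify'."""
    ok = totals_match(
        loaded_rows,
        control["record_count"],
        Decimal(control["hash_total"]),
        hash_column,
    )
    return "commit" if ok else "rollback_and_notify"


# Hypothetical loaded rows and control totals (assumed parsed from the .SENT file).
rows = [{"amount": "10.50"}, {"amount": "4.25"}]
good_control = {"record_count": 2, "hash_total": "14.75"}
bad_control = {"record_count": 2, "hash_total": "15.00"}

action_when_matching = reconcile(rows, good_control)
action_when_mismatched = reconcile(rows, bad_control)
```

Decimal is used instead of float so that the comparison of monetary totals is exact.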

Infrastructure Overview

A better way of identifying and trapping errors is to create tables within the mapping
to hold the rows that contain errors.

A Sample Scenario:

Each target table should have an identical error table, named
<TARGET_TABLE_NAME>_RELOAD, with two additional columns, MAPPING_NAME
and SEQ_ID. An additional error table, ENTERPRISE_ERR_TBL, captures descriptions
for all errors encountered during loading.

The two tables look like the following:



The <TARGET_TABLE_NAME>_RELOAD table is target specific. The
ENTERPRISE_ERR_TBL is a target table in each mapping that requires error
capturing.

The entire process of defining the error handling strategy within a particular mapping
depends on the type of errors that you expect to capture.

The following examples illustrate what is necessary for successful error handling.

<TARGET_TABLE_NAME>_RELOAD

LKP1   LKP2   LKP3   ASOF_DT    SEQ_ID   MAPPING_NAME
test   OCC    VAL    12/21/00   1        DIM_LOAD

ENTERPRISE_ERR_TBL

FOLDER_NAME   MAPPING_NAME   SEQ_ID   ERROR_DESC     LOAD_DATE   SOURCE   TARGET   LKP_TBL
Project_1     DIM_LOAD       1        LKP1 Invalid   SYSDATE     DIM      DIM      SAL
Project_1     DIM_LOAD       1        LKP2 Invalid   SYSDATE     DIM      DIM      CUST
Project_1     DIM_LOAD       1        LKP3 Invalid   SYSDATE     DIM      DIM      DEPT

The <TARGET_TABLE_NAME>_RELOAD table (TARGET_RELOAD in this example) captures rows of
data that failed the validation tests. The rows stored in ENTERPRISE_ERR_TBL show that
mapping DIM_LOAD had three errors for the SEQ_ID of 1. Because rows in TARGET_RELOAD
have a unique SEQ_ID, we can determine that the row of data in the TARGET_RELOAD table
with the SEQ_ID of 1 had three errors, and therefore which values failed the lookup.
For example, the error description in the first row of ENTERPRISE_ERR_TBL states that
‘LKP1’ was invalid. By using the MAPPING_NAME and SEQ_ID, we can determine that ‘test’
is the failed value in LKP1.
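
The lookup described above amounts to joining the two error tables on MAPPING_NAME and SEQ_ID. The following sketch reproduces the sample rows in an in-memory SQLite database, purely for illustration. The column types, the literal load date used in place of SYSDATE, and the assumption that ERROR_DESC begins with the name of the failing column are not part of the original design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TARGET_RELOAD (
    LKP1 TEXT, LKP2 TEXT, LKP3 TEXT, ASOF_DT TEXT,
    SEQ_ID INTEGER, MAPPING_NAME TEXT
);
CREATE TABLE ENTERPRISE_ERR_TBL (
    FOLDER_NAME TEXT, MAPPING_NAME TEXT, SEQ_ID INTEGER, ERROR_DESC TEXT,
    LOAD_DATE TEXT, SOURCE TEXT, TARGET TEXT, LKP_TBL TEXT
);
""")
conn.execute(
    "INSERT INTO TARGET_RELOAD VALUES (?, ?, ?, ?, ?, ?)",
    ("test", "OCC", "VAL", "12/21/00", 1, "DIM_LOAD"),
)
conn.executemany(
    "INSERT INTO ENTERPRISE_ERR_TBL VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        # LOAD_DATE is a placeholder literal standing in for SYSDATE.
        ("Project_1", "DIM_LOAD", 1, "LKP1 Invalid", "2000-12-21", "DIM", "DIM", "SAL"),
        ("Project_1", "DIM_LOAD", 1, "LKP2 Invalid", "2000-12-21", "DIM", "DIM", "CUST"),
        ("Project_1", "DIM_LOAD", 1, "LKP3 Invalid", "2000-12-21", "DIM", "DIM", "DEPT"),
    ],
)

# Join each error description to the rejected row it describes.
joined = conn.execute("""
    SELECT e.ERROR_DESC, e.LKP_TBL, r.LKP1, r.LKP2, r.LKP3
    FROM ENTERPRISE_ERR_TBL e
    JOIN TARGET_RELOAD r
      ON r.MAPPING_NAME = e.MAPPING_NAME AND r.SEQ_ID = e.SEQ_ID
    ORDER BY e.ERROR_DESC
""").fetchall()

failed_values = []
for error_desc, lkp_tbl, lkp1, lkp2, lkp3 in joined:
    column = error_desc.split()[0]  # assumed format: '<column> Invalid'
    value = {"LKP1": lkp1, "LKP2": lkp2, "LKP3": lkp3}[column]
    failed_values.append((column, value, lkp_tbl))

for column, value, lkp_tbl in failed_values:
    print(f"{column} value {value!r} failed the lookup against {lkp_tbl}")
```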



Documenting Mappings Using Repository Reports

Challenge

Documenting and reporting comments contained in each of the mapping objects.

Description

It is crucial to take advantage of the metadata contained in the repository to
document your Informatica mappings, but the Informatica mappings must be
properly documented to take full advantage of this metadata. This means that
comments must be included at all levels of a mapping, from the mapping itself, down
to the objects and ports within the mapping. With PowerCenter, you can enter
description information for all repository objects (sources, targets, transformations,
etc.), but the amount of metadata that you enter should be determined by the
business requirements. You can also drill down to the column level and give
descriptions of the columns in a table if necessary. All information about column size
and scale, datatypes, and primary keys is stored in the repository.

Once the mappings and sessions contain the proper metadata, it is important to
develop a plan for extracting this metadata. PowerCenter provides several ways to
access the metadata contained within the repository. One way of doing this is
through the generic Crystal Reports that are supplied with PowerCenter. These
reports are accessible through the Repository Manager. (Open the Repository
Manager, and click Reports.) You can choose from the following four reports:

Mapping report (map.rpt). Lists source column and transformation details for each
mapping in each folder or repository.

Source and target dependencies report (S2t_dep.rpt). Shows the source and
target dependencies as well as the transformations performed in each mapping.

Target table report (Trg_tbl.rpt). Provides target field transformation expressions,
descriptions, and comments for each target table.

Executed session report (sessions.rpt). Provides information about executed
sessions (such as the number of successful rows) in a particular folder.



Note: If your mappings contain shortcuts, these will not be displayed in the generic
Crystal Reports. You will have to use the MX2 Views to access the repository, or
create a custom SQL view.

In PowerCenter 5.1, you can develop a metadata access strategy using the Metadata
Reporter. The Metadata Reporter allows for customized reporting of all repository
information without direct access to the repository itself. For more information on the
Metadata Reporter, consult Metadata Reporting and Sharing, or the Metadata
Reporter Guide included with the PowerCenter documentation.

A printout of the mapping object flow is also useful for clarifying how objects are
connected. To produce such a printout, arrange the mapping in Designer so the full
mapping appears on the screen, then use Alt+PrtSc to copy the active window to the
clipboard. Use Ctrl+V to paste the copy into a Word document.



Error Handling Strategies

Challenge

Efficiently load data into the Enterprise Data Warehouse (EDW) and Data Mart (DM).
This Best Practice describes various loading scenarios, the use of data profiles, an
alternate method for identifying data errors, methods for handling data errors, and
alternatives for addressing the most common types of problems.

Description

When loading data into an EDW or DM, the loading process must validate that the
data conforms to known rules of the business. When the source system data does
not meet these rules, the process needs to handle the exceptions in an appropriate
manner. The business needs to be aware of the consequences of either permitting
invalid data to enter the EDW or rejecting it until it is fixed. Both approaches present
complex issues. The business must decide what is acceptable and prioritize two
conflicting goals:

• The need for accurate information


• The ability to analyze the most complete information with the understanding
that errors can exist.

Data Integration Process Validation

In general, there are three methods for handling data errors detected in the loading
process:

• Reject All. This is the simplest to implement since all errors are rejected
from entering the EDW when they are detected. This provides a very reliable
EDW that the users can count on as being correct, although it may not be
complete. Both dimensional and factual data are rejected when any errors are
encountered. Reports indicate what the errors are and how they affect the
completeness of the data.

Dimensional errors cause valid factual data to be rejected because a foreign
key relationship cannot be created. These errors need to be fixed in the
source systems and reloaded on a subsequent load of the EDW. Once the
corrected rows have been loaded, the factual data will be reprocessed and
loaded, assuming that all errors have been fixed. This delay may cause some
user dissatisfaction since the users need to take into account that the data
they are looking at may not be a complete picture of the operational systems
until the errors are fixed.

The development effort required to fix a Reject All scenario is minimal, since
the rejected data can be processed through existing mappings once it has
been fixed. Minimal additional code may need to be written since the data will
only enter the EDW if it is correct, and it would then be loaded into the data
mart using the normal process.

• Reject None. This approach gives users a complete picture of the data
without having to consider data that was not available due to it being rejected
during the load process. The problem is that the data may not be accurate.
Both the EDW and DM may contain incorrect information that can lead to
incorrect decisions.

With Reject None, data integrity is intact, but the data may not support
correct aggregations. Factual data can be allocated to dummy or incorrect
dimension rows, resulting in grand total numbers that are correct, but
incorrect detail numbers. After the data is fixed, reports may change, with
detail information being redistributed along different hierarchies.

The development effort to fix this scenario is significant. After the errors are
corrected, a new loading process needs to correct both the EDW and DM,
which can be a time-consuming effort based on the delay between an error
being detected and fixed. The development strategy may include removing
information from the EDW, restoring backup tapes for each night’s load, and
reprocessing the data. Once the EDW is fixed, these changes need to be
loaded into the DM.

• Reject Critical. This method provides a balance between missing
information and incorrect information. This approach involves examining each
row of data, and determining the particular data elements to be rejected. All
changes that are valid are processed into the EDW to allow for the most
complete picture. Rejected elements are reported as errors so that they can
be fixed in the source systems and loaded on a subsequent run of the ETL
process.

This approach requires categorizing the data in two ways: 1) as Key Elements
or Attributes, and 2) as Inserts or Updates.

Key elements are required fields that maintain the data integrity of the EDW
and allow for hierarchies to be summarized at different levels in the
organization. Attributes provide additional descriptive information per key
element.

Inserts are important for dimensions because subsequent factual data may
rely on the existence of the dimension data row in order to load properly.
Updates do not affect the data integrity as much because the factual data can
usually be loaded with the existing dimensional data unless the update is to a
Key Element.



The development effort for this method is more extensive than Reject All
since it involves classifying fields as critical or non-critical, and developing
logic to update the EDW and flag the fields that are in error. The effort also
incorporates some tasks from the Reject None approach in that processes
must be developed to fix incorrect data in the EDW and DM.

Informatica generally recommends using the Reject Critical strategy to maintain the
accuracy of the EDW. By providing the most fine-grained analysis of errors, this
method allows the greatest amount of valid data to enter the EDW on each run of
the ETL process, while at the same time screening out the unverifiable data fields.
However, business management needs to understand that some information may be
held out of the EDW, and also that some of the information in the EDW may be at
least temporarily allocated to the wrong hierarchies.
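
One possible reading of the Reject Critical approach is sketched below: a row is held back only when a Key Element fails validation, while invalid Attributes are dropped from the row and reported so that the remaining valid data can still be loaded. This is an illustrative policy, not a prescribed implementation; the field names, the KEY_ELEMENTS set, and the validators are hypothetical.

```python
KEY_ELEMENTS = {"location_id", "region_id"}  # hypothetical key elements


def apply_reject_critical(row, validators, is_insert):
    """Return (action, clean_row, errors) for one incoming dimension row.

    validators maps a field name to a function returning True when the value
    is valid. Fields that are not key elements are treated as attributes.
    """
    errors = [
        field for field, check in validators.items()
        if field in row and not check(row[field])
    ]
    if any(field in KEY_ELEMENTS for field in errors):
        # A bad key element threatens data integrity: hold the whole row back.
        return "reject", None, errors
    # Attribute errors only: load the valid fields and report the bad ones.
    clean_row = {field: value for field, value in row.items() if field not in errors}
    return ("insert" if is_insert else "update"), clean_row, errors


validators = {
    "location_id": lambda v: isinstance(v, int),
    "color": lambda v: v in {"Red", "Black"},
}

new_row = {"location_id": 5001, "region_id": 7, "color": "BRed", "address": "1 Main St"}
action, clean_row, errors = apply_reject_critical(new_row, validators, is_insert=True)

bad_key_row = {"location_id": "50X1", "color": "Red"}
key_action, key_clean_row, key_errors = apply_reject_critical(
    bad_key_row, validators, is_insert=True
)
```

The errors list is what would be reported back so that the source system can be corrected and the elements reloaded on a later run.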

Using Profiles

Profiles are tables used to track the history of dimensional data in the EDW. As
the source systems change, Profile records are created with date stamps that
indicate when the change took place. This allows power users to analyze the EDW
using either current (As-Is) or past (As-Was) views of dimensional data.

A Profile should be created once per change in the source systems. Problems occur when
two fields change in the source system and one of those fields produces an error.
When the second field is fixed, it is difficult for the ETL process to reflect the
data changes accurately, since there is now a question whether to update a
previous Profile or create a new one. The first value passes validation, which
produces a new Profile record, while the second value is rejected and is not included
in the new Profile. When this error is fixed, it would be desirable to update the
existing Profile rather than creating a new one, but the logic needed to perform this
UPDATE instead of an INSERT is complicated.

If a third field is changed before the second field is fixed, the correction process
cannot be automated. The following hypothetical example represents three field
values in a source system. The first row on 1/1/2000 shows the original values. On
1/5/2000, Field 1 changes from Closed Sunday to Open Sunday, and Field 2 changes from
Black to BRed, which is invalid. On 1/10/2000, Field 3 changes from Open 9-5 to Open
24hrs, but Field 2 is still invalid. On 1/15/2000, Field 2 is finally fixed to Red.

Date        Field 1 Value   Field 2 Value   Field 3 Value
1/1/2000    Closed Sunday   Black           Open 9-5
1/5/2000    Open Sunday     BRed            Open 9-5
1/10/2000   Open Sunday     BRed            Open 24hrs
1/15/2000   Open Sunday     Red             Open 24hrs

Three methods exist for handling the creation and update of Profiles:

1. The first method produces a new Profile record each time a change is detected
in the source. If a field value was invalid, then the original field value is maintained.

Date        Profile Date   Field 1 Value   Field 2 Value   Field 3 Value
1/1/2000    1/1/2000       Closed Sunday   Black           Open 9-5
1/5/2000    1/5/2000       Open Sunday     Black           Open 9-5
1/10/2000   1/10/2000      Open Sunday     Black           Open 24hrs
1/15/2000   1/15/2000      Open Sunday     Red             Open 24hrs

Applying all corrections as new Profiles, as in this method, simplifies the process
because every change in the source system is applied directly to the EDW. Each change,
regardless of whether it is a fix to a previous error, is applied as a new change that
creates a new Profile. However, this incorrectly shows in the EDW that two changes
occurred to the source information when, in reality, a mistake was entered on the first
change and the fix should be reflected in the first Profile. The second Profile should
not have been created.
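
The behavior of the first method can be simulated with the example values given earlier. The sketch below is illustrative only: build_profiles_method1, the field names, and the rule that only Black and Red are valid Field 2 values are assumptions made for the example.

```python
def build_profiles_method1(snapshots, is_valid):
    """Create one Profile per detected change, keeping the last accepted value
    for any field whose new value fails validation.

    snapshots -- list of (date, {field: value}) in chronological order
    is_valid  -- function (field, value) -> bool
    """
    profiles = []
    current = {}
    for date, fields in snapshots:
        candidate = dict(current)
        for field, value in fields.items():
            if is_valid(field, value):
                candidate[field] = value
            # An invalid value is ignored, so the last accepted value remains.
        if candidate != current:
            profiles.append((date, candidate))
            current = candidate
    return profiles


def is_valid(field, value):
    # Assumed rule for the example: Field 2 must be a known color.
    return field != "field2" or value in {"Black", "Red"}


snapshots = [
    ("1/1/2000", {"field1": "Closed Sunday", "field2": "Black", "field3": "Open 9-5"}),
    ("1/5/2000", {"field1": "Open Sunday", "field2": "BRed", "field3": "Open 9-5"}),
    ("1/10/2000", {"field1": "Open Sunday", "field2": "BRed", "field3": "Open 24hrs"}),
    ("1/15/2000", {"field1": "Open Sunday", "field2": "Red", "field3": "Open 24hrs"}),
]
profiles = build_profiles_method1(snapshots, is_valid)
for profile_date, values in profiles:
    print(profile_date, values["field1"], values["field2"], values["field3"])
```

The simulation yields four Profiles, including one dated 1/15/2000 that exists only because an earlier mistake was corrected.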

2. The second method updates the first Profile created on 1/5/2000 until all
fields are corrected on 1/15/2000, which loses the Profile record for the change to
Field 3.

Date        Profile Date        Field 1 Value   Field 2 Value   Field 3 Value
1/1/2000    1/1/2000            Closed Sunday   Black           Open 9-5
1/5/2000    1/5/2000            Open Sunday     Black           Open 9-5
1/10/2000   1/5/2000 (Update)   Open Sunday     Black           Open 24hrs
1/15/2000   1/5/2000 (Update)   Open Sunday     Red             Open 24hrs

If we try to apply changes to the existing Profile, as in this method, we run the risk
of losing Profile information. If the third field changes before the second field is fixed,
we show the third field changed at the same time as the first. When the second field
was fixed it would also be added to the existing Profile, which incorrectly reflects the
changes in the source system.

3. The third method creates only two new Profiles, but then causes an update to
the Profile records on 1/15/2000 to fix the Field 2 value in both.

Date        Profile Date         Field 1 Value   Field 2 Value   Field 3 Value
1/1/2000    1/1/2000             Closed Sunday   Black           Open 9-5
1/5/2000    1/5/2000             Open Sunday     Black           Open 9-5
1/10/2000   1/10/2000            Open Sunday     Black           Open 24hrs
1/15/2000   1/5/2000 (Update)    Open Sunday     Red             Open 9-5
1/15/2000   1/10/2000 (Update)   Open Sunday     Red             Open 24hrs

If we try to implement a method that updates old Profiles when errors are fixed, as
in this option, we need to create complex algorithms that handle the process
correctly. It involves being able to determine when an error occurred and examining
all Profiles generated since then and updating them appropriately. And, even if we
create the algorithms to handle these methods, we still have an issue of determining
if a value is a correction or a new value. If an error is never fixed in the source
system, but a new value is entered, we would identify it as a previous error, causing
an automated process to update old Profile records, when in reality a new Profile
record should have been entered.



Recommended Method

A method exists to track old errors so that we know when a value was rejected.
Then, when the process encounters a new, correct value it flags it as part of the load
strategy as a potential fix that should be applied to old Profile records. In this way,
the corrected data enters the EDW as a new Profile record, but the process of fixing
old Profile records, and potentially deleting the newly inserted record, is delayed until
the data is examined and an action is decided. Once an action is decided, another
process examines the existing Profile records and corrects them as necessary. This
method only delays the As-Was analysis of the data until the correction method is
determined because the current information is reflected in the new Profile.
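
The recommended method can be sketched as a small state machine: rejected values are remembered, every valid value enters as a new Profile, and a valid value that follows a rejection is queued for review rather than being applied to old Profiles automatically. The state dictionary, the function names, and the validation rule below are hypothetical and only illustrate the idea.

```python
def process_change(date, field, value, is_valid, state):
    """Apply one source field change using the deferred-correction approach.

    state holds three hypothetical structures:
      profiles        -- list of (profile_date, {field: value})
      open_rejections -- {field: date on which the field was first rejected}
      review_queue    -- potential fixes awaiting a decision
    """
    if not is_valid(field, value):
        state["open_rejections"].setdefault(field, date)
        return
    # A valid value always enters the EDW as a new Profile record.
    previous = state["profiles"][-1][1] if state["profiles"] else {}
    state["profiles"].append((date, {**previous, field: value}))
    if field in state["open_rejections"]:
        # It may be a fix or a genuinely new value, so flag it for review.
        state["review_queue"].append({
            "field": field,
            "rejected_on": state["open_rejections"].pop(field),
            "new_profile_date": date,
        })


def is_valid(field, value):
    # Assumed rule for the example: Field 2 must be a known color.
    return field != "field2" or value in {"Black", "Red"}


state = {
    "profiles": [
        ("1/1/2000", {"field1": "Closed Sunday", "field2": "Black", "field3": "Open 9-5"})
    ],
    "open_rejections": {},
    "review_queue": [],
}
changes = [
    ("1/5/2000", "field1", "Open Sunday"),
    ("1/5/2000", "field2", "BRed"),
    ("1/10/2000", "field3", "Open 24hrs"),
    ("1/15/2000", "field2", "Red"),
]
for date, field, value in changes:
    process_change(date, field, value, is_valid, state)
```

A separate process, run after someone has examined the queued item, would then update or delete Profile records as decided.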

Data Quality Edits

Quality indicators can be used to record definitive statements regarding the quality
of the data received and stored in the EDW. The indicators can be appended to existing
data tables or stored in a separate table linked by the primary key. Quality indicators
can be used to:

• show the record and field level quality associated with a given record at the
time of extract
• identify data sources and errors encountered in specific records
• support the resolution of specific record error types via an update and
resubmission process.

Quality indicators may be used to record several types of errors – e.g., fatal errors
(missing primary key value), missing data in a required field, wrong data
type/format, or invalid data value. If a record contains even one error, data quality
(DQ) fields will be appended to the end of the record, one field for every field in the
record. A data quality indicator code is included in the DQ fields corresponding to
the original fields in the record where the errors were encountered. Records
containing a fatal error are stored in a Rejected Record Table and associated to the
original file name and record number. These records cannot be loaded to the EDW
because they lack a primary key field to be used as a unique record identifier in the
EDW.

The following types of errors cannot be processed:

• A source record does not contain a valid key. This record would be sent to a
reject queue. Metadata will be saved and used to generate a notice to the
sending system indicating that x number of invalid records were received and
could not be processed. However, in the absence of a primary key, no
tracking is possible to determine whether the invalid record has been replaced
or not.
• The source file or record is illegible. The file or record would be sent to a
reject queue. Metadata indicating that x number of invalid records were
received and could not be processed may or may not be available for a
general notice to be sent to the sending system. In this case, due to the
nature of the error, no tracking is possible to determine whether the invalid
record has been replaced or not. If the file or record is illegible, it is likely
that individual unique records within the file are not identifiable. While
information can be provided to the source system site indicating there are file
errors for x number of records, specific problems may not be identifiable on a
record-by-record basis.

In these error types, the records can be processed, but they contain errors:

• A required (non-key) field is missing.


• The value in a numeric or date field is non-numeric.
• The value in a field does not fall within the range of acceptable values
identified for the field. Typically, a reference table is used for this validation.

When an error is detected during ingest and cleansing, the identified error type is
recorded.

Quality Indicators (Quality Code Table)

The requirement to validate virtually every data element received from the source
data systems mandates the development, implementation, capture and maintenance
of quality indicators. These are used to indicate the quality of incoming data at an
elemental level. Aggregated and analyzed over time, these indicators provide the
information necessary to identify acute data quality problems, systemic issues,
business process problems and information technology breakdowns.

The quality indicators are “0” (No Error), “1” (Fatal Error), “2” (Missing Data from a
Required Field), “3” (Wrong Data Type/Format), “4” (Invalid Data Value), and “5”
(Outdated Reference Table in Use). They give a concise indication of the quality of the
data within specific fields for every data type. These indicators provide the opportunity
for operations staff, data quality analysts and users to readily identify issues
potentially impacting the quality of the data. At the same time, these indicators
provide the level of detail necessary for acute quality problems to be remedied in a
timely manner.
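
As an illustration, the first five quality codes can be assigned field by field with a small validation routine. The schema format, the field names, and the function quality_codes below are hypothetical. Code “5” is not derived here because it depends on how reference table versions are tracked, which is not specified.

```python
NO_ERROR = "0"
FATAL_ERROR = "1"
MISSING_REQUIRED = "2"
WRONG_TYPE = "3"
INVALID_VALUE = "4"


def quality_codes(record, schema, key_field):
    """Return {field: quality code} for one source record.

    schema maps each field to a dict with:
      required -- True when the field must be present
      type     -- callable that converts the raw value or raises ValueError
      allowed  -- optional set of acceptable converted values
    """
    codes = {}
    for field, rules in schema.items():
        value = record.get(field)
        if value is None or value == "":
            if field == key_field:
                codes[field] = FATAL_ERROR  # no primary key: record is unusable
            elif rules.get("required"):
                codes[field] = MISSING_REQUIRED
            else:
                codes[field] = NO_ERROR
            continue
        try:
            converted = rules["type"](value)
        except (TypeError, ValueError):
            codes[field] = WRONG_TYPE
            continue
        allowed = rules.get("allowed")
        if allowed is not None and converted not in allowed:
            codes[field] = INVALID_VALUE
        else:
            codes[field] = NO_ERROR
    return codes


schema = {
    "store_id": {"required": True, "type": int},
    "qty": {"required": True, "type": int},
    "color": {"required": False, "type": str, "allowed": {"Red", "Black"}},
}
codes = quality_codes({"store_id": "17", "qty": "abc", "color": "BRed"}, schema, "store_id")
codes_without_key = quality_codes({"qty": "3"}, schema, "store_id")
```

The resulting dictionary corresponds to the DQ fields that are appended to a record, one code per original field.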

Handling Data Errors

The need to periodically correct data in the EDW is inevitable. But how often should
these corrections be performed?

The correction process can be as simple as updating field information to reflect actual
values, or as complex as deleting data from the EDW, restoring previous loads from
tape, and then reloading the information correctly. Although we try to avoid
performing a complete database restore and reload from a previous point in time, we
cannot rule this out as a possible solution.

Reject Tables vs. Source System

As errors are encountered, they are written to a reject file so that business analysts
can examine reports of the data and the related error messages indicating the
causes of error. The business needs to decide whether analysts should be allowed to
fix data in the reject tables, or whether data fixes will be restricted to source
systems. If errors are fixed in the reject tables, the EDW will not be synchronized
with the source systems. This can present credibility problems when trying to track
the history of changes in the EDW and DM. If all fixes occur in the source systems,
then these fixes must be applied correctly to the EDW.

Attribute Errors and Default Values

Attributes provide additional descriptive information about a dimension concept.
Attributes include things like the color of a product or the address of a store.
Attribute errors are typically things like an invalid color or inappropriate characters in
the address. These types of errors do not generally affect the aggregated facts and
statistics in the EDW; the attributes are most useful as qualifiers and filtering criteria
for drilling into the data (e.g., to find specific patterns for market research).
Attribute errors can be fixed by waiting for the source system to be corrected and
then reapplying the corrected values to the data in the EDW.

When attribute errors are encountered for a new dimensional value, default values
can be assigned to let the new record enter the EDW. Some rules that have been
proposed for handling defaults are as follows:

Value Types        Description                                     Default
Reference Values   Attributes that are foreign keys to other      Unknown
                   tables
Small Value Sets   Y/N indicator fields                            No
Other              Any other type of attribute                     Null or Business provided value

Reference tables are used to normalize the EDW model to prevent the duplication of
data. When a source value does not translate into a reference table value, we use
the ‘Unknown’ value. (All reference tables contain a value of ‘Unknown’ for this
purpose.)

The business should provide default values for each identified attribute. Fields that
are restricted to a limited domain of values (e.g., On/Off or Yes/No indicators) are
referred to as small value sets. When errors are encountered in translating these
values, we use the value that represents off or ‘No’ as the default. Other values, like
numbers, are handled on a case-by-case basis. In many cases, the data integration
process is set to populate ‘Null’ into these fields, which means “undefined” in the
EDW. After a source system value is corrected and passes validation, it is corrected
in the EDW.
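
The default rules above can be expressed as a simple lookup keyed by value type. In this sketch, the value-type labels, the field names, and the apply_defaults function are hypothetical; Python's None stands in for a database Null.

```python
DEFAULTS_BY_VALUE_TYPE = {
    "reference": "Unknown",   # attributes that are foreign keys to other tables
    "small_value_set": "No",  # Y/N style indicator fields
    "other": None,            # Null unless the business provides a value
}


def apply_defaults(row, field_types, error_fields, business_defaults=None):
    """Return a copy of row with each field in error replaced by its default."""
    business_defaults = business_defaults or {}
    result = dict(row)
    for field in error_fields:
        if field in business_defaults:
            result[field] = business_defaults[field]
        else:
            value_type = field_types.get(field, "other")
            result[field] = DEFAULTS_BY_VALUE_TYPE[value_type]
    return result


field_types = {"store_type": "reference", "open_sunday": "small_value_set"}
incoming = {"store_id": 42, "store_type": "XX", "open_sunday": "Q", "sq_feet": "big"}

defaulted = apply_defaults(
    incoming, field_types, error_fields=["store_type", "open_sunday", "sq_feet"]
)
with_business_value = apply_defaults(
    incoming, field_types, error_fields=["sq_feet"], business_defaults={"sq_feet": 0}
)
```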

Primary Key Errors

The business also needs to decide how to handle new dimensional values such as
locations. Problems occur when the new key is actually an update to an old key in
the source system. For example, a location number is assigned and the new location
is transferred to the EDW using the normal process; then the location number is
changed due to some source business rule such as: all Warehouses should be in the
5000 range. The process assumes that the change in the primary key is actually a
new warehouse and that the old warehouse was deleted. This type of error causes a
separation of fact data, with some data being attributed to the old primary key and
some to the new. An analyst would be unable to get a complete picture.

Fixing this type of error involves integrating the two records in the EDW, along with
the related facts. Integrating the two rows involves combining the Profile
information, taking care to coordinate the effective dates of the Profiles to sequence
properly. If two Profile records exist for the same day, then a manual decision is
required as to which is correct. If facts were loaded using both primary keys, then
the related fact rows must be added together and the originals deleted in order to
correct the data.

The situation is more complicated when the opposite condition occurs (i.e., two
primary keys mapped to the same EDW ID really represent two different IDs). In this
case, it is necessary to restore the source information for both dimensions and facts
from the point in time at which the error was introduced, deleting affected records
from the EDW and reloading from the restore to correct the errors.

DM Facts Calculated from EDW Dimensions

If information is captured as dimensional data from the source, but used as
measures residing on the fact records in the DM, we must decide how to handle the
facts. From a data accuracy view, we would like to reject the fact until the value is
corrected. If we load the facts with the incorrect data, the process to fix the DM can
be time consuming and difficult to implement.

If we let the facts enter the EDW and subsequently the DM, we need to create
processes that update the DM after the dimensional data is fixed. This involves
updating the measures in the DM to reflect the changed data. If we reject the facts
when these types of errors are encountered, the fix process becomes simpler. After
the errors are fixed, the affected rows can simply be loaded and applied to the DM.

Fact Errors

If there are no business rules that reject fact records except for relationship errors to
dimensional data, then when we encounter errors that would cause a fact to be
rejected, we save these rows to a reject table for reprocessing the following night.
This nightly reprocessing continues until the data successfully enters the EDW. Initial
and periodic analyses should be performed on the errors to determine why they are
not being loaded. After they are loaded, they are populated into the DM as usual.

Data Stewards

Data Stewards are generally responsible for maintaining reference tables and
translation tables, creating new entities in dimensional data, and designating one
primary data source when multiple sources exist. Reference data and translation
tables enable the EDW to maintain consistent descriptions across multiple source
systems, regardless of how the source system stores the data. New entities in
dimensional data include new locations, products, hierarchies, etc. Multiple source
data occurs when two source systems can contain different data for the same
dimensional entity.

Reference Tables

The EDW uses reference tables to maintain consistent descriptions. Each table
contains a short code value as a primary key and a long description for reporting
purposes. A translation table is associated with each reference table to map the
codes to the source system values. Using both of these tables, the ETL process can
load data from the source systems into the EDW and then load from the EDW into
the DM.

The translation tables contain one or more rows for each source value and map the
value to a matching row in the reference table. For example, the SOURCE column in
FILE X on System X can contain ‘O’, ‘S’ or ‘W’. The data steward would be
responsible for entering in the Translation table the following values:

Source Value   Code Translation
O              OFFICE
S              STORE
W              WAREHSE

These values are used by the data integration process to correctly load the EDW.
Other source systems that maintain a similar field may use a two-letter abbreviation
like ‘OF’, ‘ST’ and ‘WH’. The data steward would make the following entries into the
translation table to maintain consistency across systems:

Source Value   Code Translation
OF             OFFICE
ST             STORE
WH             WAREHSE

The data stewards are also responsible for maintaining the Reference table that
translates the Codes into descriptions. The ETL process uses the Reference table to
populate the following values into the DM:

Code Translation   Code Description
OFFICE             Office
STORE              Retail Store
WAREHSE            Distribution Warehouse

Error handling is required when the data steward enters incorrect information for these
mappings and needs to correct them after data has been loaded. Correcting the
above example could be complex (e.g., if the data steward entered ST as translating
to OFFICE by mistake). The only way to determine which rows should be changed is
to restore and reload source data from the first time the mistake was entered.
Processes should be built to handle these types of situations, including correction of
the EDW and DM.
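
The two-step lookup through the translation and reference tables can be sketched with dictionaries. The values come from the example tables in this section; the source system labels SYSTEM_X and SYSTEM_Y, the UNKNOWN code, and the translate function are assumptions made for illustration.

```python
# Translation table: (source system, source value) -> code.
TRANSLATION = {
    ("SYSTEM_X", "O"): "OFFICE",
    ("SYSTEM_X", "S"): "STORE",
    ("SYSTEM_X", "W"): "WAREHSE",
    ("SYSTEM_Y", "OF"): "OFFICE",
    ("SYSTEM_Y", "ST"): "STORE",
    ("SYSTEM_Y", "WH"): "WAREHSE",
}

# Reference table: code -> description used for reporting.
REFERENCE = {
    "OFFICE": "Office",
    "STORE": "Retail Store",
    "WAREHSE": "Distribution Warehouse",
    "UNKNOWN": "Unknown",  # assumed code for values that do not translate
}


def translate(source_system, source_value):
    """Return (code, description) for a source value, or the Unknown entry."""
    code = TRANSLATION.get((source_system, source_value), "UNKNOWN")
    return code, REFERENCE[code]


store = translate("SYSTEM_X", "S")
warehouse = translate("SYSTEM_Y", "WH")
untranslated = translate("SYSTEM_X", "Z")
```

Because every source value resolves through the same code, a mistaken entry such as mapping ST to OFFICE silently affects every row loaded after it, which is why the correction described above requires a restore and reload.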

Dimensional Data

New entities in dimensional data present a more complex issue. New entities in the
EDW may include Locations and Products, at a minimum. Dimensional data uses the
same concept of translation as Reference tables. These translation tables map the
source system value to the EDW value. For location, this is straightforward, but over
time, products may have multiple source system values that map to the same
product in the EDW. (Other similar translation issues may also exist, but Products
serves as a good example for error handling.)

There are two possible methods for loading new dimensional entities. Either require
the data steward to enter the translation data before allowing the dimensional data
into the EDW, or create the translation data through the ETL process and force the
data steward to review it. The first option requires the data steward to create the
translation for new entities, while the second lets the ETL process create the
translation, but marks the record as ‘Pending Verification’ until the data steward
reviews it and changes the status to ‘Verified’ before any facts that reference it can
be loaded.

When the dimensional value is left as ‘Pending Verification’ however, facts may be
rejected or allocated to dummy values. This requires the data stewards to review the
status of new values on a daily basis. A potential solution to this issue is to generate
an e-mail each night if there are any translation table entries pending verification.
The data steward then opens a report that lists them.
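
A nightly check of this kind might look like the sketch below. The row layout
(source_value, edw_value, status) is an assumption made for illustration;
actually sending the e-mail is left to whatever mail facility the site uses.

```python
def pending_entries(translation_rows):
    """Return the translation entries still awaiting data steward review."""
    return [row for row in translation_rows
            if row["status"] == "Pending Verification"]


def nightly_message(translation_rows):
    """Build the e-mail text, or return None when nothing is pending."""
    pending = pending_entries(translation_rows)
    if not pending:
        return None  # no e-mail is generated on a clean night
    lines = ["%d translation entries are pending verification:" % len(pending)]
    for row in pending:
        lines.append("  %s -> %s" % (row["source_value"], row["edw_value"]))
    return "\n".join(lines)


# Hypothetical sample data.
rows = [
    {"source_value": "SKU-1001", "edw_value": "P-17", "status": "Verified"},
    {"source_value": "SKU-2040", "edw_value": "P-98", "status": "Pending Verification"},
]
print(nightly_message(rows))
```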

A problem specific to Product is that a product created as new is sometimes really
just an existing product with a changed SKU number. This causes additional fact
rows to be created, which produces
an inaccurate view of the product when reporting. When this is fixed, the fact rows
for the various SKU numbers need to be merged and the original rows deleted.
Profiles would also have to be merged, requiring manual intervention.

The situation is more complicated when the opposite condition occurs (i.e., two
products are mapped to the same product, but really represent two different
products). In this case, it is necessary to restore the source information for all loads
since the error was introduced. Affected records from the EDW should be deleted and
then reloaded from the restore to correctly split the data. Facts should be split to
allocate the information correctly and dimensions split to generate correct Profile
information.

Manual Updates

Over time, any system is likely to encounter errors that are not correctable using
source systems. A method needs to be established for manually entering fixed data
and applying it correctly to the EDW, and subsequently to the DM, including
beginning and ending effective dates. These dates are useful for both Profile and
Date Event fixes. Further, a log of these fixes should be maintained to enable
identifying the source of the fixes as manual rather than part of the normal load
process.

Multiple Sources

The data stewards are also involved when multiple sources exist for the same data.
This occurs when two sources contain subsets of the required information. For
example, one system may contain Warehouse and Store information while another
contains Store and Hub information. Because they share Store information, it is
difficult to decide which source contains the correct information.

When this happens, both sources have the ability to update the same row in the
EDW. If both sources are allowed to update the shared information, data accuracy
and Profile problems are likely to occur. If we update the shared information on only
one source system, the two systems then contain different information. If the
changed system is loaded into the EDW, it creates a new Profile indicating the
information changed. When the second system is loaded, it compares its old
unchanged value to the new Profile, assumes a change occurred and creates another
new Profile with the old, unchanged value. If the two systems remain different, the
process causes two Profiles to be loaded every day until the two source systems are
synchronized with the same information.

To avoid this type of situation, the business analysts and developers need to
designate, at a field level, the source that should be considered primary for the field.
Then, only if the field changes on the primary source would it be changed. While this
sounds simple, it requires complex logic when creating Profiles, because multiple
sources can provide information toward the one Profile record created for that day.

One solution to this problem is to develop a system of record for all sources. This
allows developers to pull the information from the system of record, knowing that
there are no conflicts for multiple sources. Another solution is to indicate, at the field
level, a primary source where information can be shared from multiple sources.
Developers can use the field level information to update only the fields that are
marked as primary. However, this requires additional effort by the data stewards to
mark the correct source fields as primary and by the data integration team to
customize the load process.
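
The field-level approach can be sketched as follows. The field names and the
systems SYSTEM_A and SYSTEM_B are hypothetical, and a real load process would
also handle Profile creation; the sketch only shows that a non-primary source
never overwrites a field.

```python
# Hypothetical field-level designation: field name -> primary source system.
PRIMARY_SOURCE = {
    "store_name": "SYSTEM_A",
    "store_address": "SYSTEM_B",
}


def apply_update(edw_row, source_system, source_row):
    """Copy a field into the EDW row only when source_system is primary for it.

    Returns the list of fields that actually changed, so the caller can
    decide whether a new Profile is needed.
    """
    changed = []
    for field, value in source_row.items():
        if PRIMARY_SOURCE.get(field) != source_system:
            continue  # non-primary sources never overwrite this field
        if edw_row.get(field) != value:
            edw_row[field] = value
            changed.append(field)
    return changed


edw = {"store_name": "Main St", "store_address": "1 Main"}
# SYSTEM_A is primary for store_name, so its change is applied.
print(apply_update(edw, "SYSTEM_A", {"store_name": "Main Street"}))
# SYSTEM_B still holds the old name, but it is not primary: nothing changes.
print(apply_update(edw, "SYSTEM_B", {"store_name": "Main St", "store_address": "1 Main"}))
```

Because the second load reports no changed fields, it does not trigger the
daily pair of Profiles described earlier.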

Using Shortcut Keys in PowerCenter Designer

Challenge

Using shortcut keys in PowerCenter Designer to edit repository objects.

Description

General Suggestions

• To open a folder with the workspace open as well, click the Open folder
icon (rather than double-clicking the folder). Alternatively, right-click the
folder name, then scroll down and click “open”.
• When using the "drag & drop" approach to create Foreign Key/Primary
Key relationships between tables, be sure to start in the Foreign Key
table and drag the key/field to the Primary Key table. Set the Key
Type value to “NOT A KEY” prior to dragging.
• If possible, use an icon in the toolbar rather than a command from a
drop down menu.
• To open Create Customized Toolbars and tailor a toolbar for the functions
you commonly perform, press <Alt> <T> then <C>.
• To add or delete customized icons, go into Customize Toolbars under the Tools
menu. From here you can either add new icons to your toolbar by
“dragging and dropping” them from the toolbar menu, or “drag
and drop” an icon off the current toolbar if you no longer want to use
it.
• To dock or undock a window such as the Repository Navigator,
double-click the window’s title bar.
• To quickly select multiple transformations, hold the mouse down and
drag to view a box. Be sure the box touches every object you want to
select.
• To expedite mapping development, use multiple fields/ports selection
to copy or link.
• To copy a mapping from a shared folder, press and hold <Ctrl> and
highlight the mapping with the left mouse button, then drag and drop
into another folder or mapping and click OK. The same action, without
holding <Ctrl>, creates a Shortcut to an object.
• To start the Debugger, press <F9>.

Edit Tables/Transformation

• To edit any cell in the grid, press <F2>, then move the cursor to the
character you want to edit and click OK.
• To move the current field in a transformation Down, first highlight it,
then press <Alt><w> and click OK.
• To move the current field in a transformation Up, first highlight it, then
press <Alt><u> and click OK.
• To add a new field or port, first highlight an existing field or port, then
press <Alt><f> to insert the new field/port below it and click OK.
• To validate the Default value, first highlight the port you want to
validate, then press <Alt><v> and click OK.
• When adding a new port, just begin typing; you don't need to press
DEL first to remove the ‘NEWFIELD’ text. Click OK when you have
finished.
• When moving about the expression fields via arrow keys:
o Use the SPACE bar to check/uncheck the port type. The box
must be highlighted in order to check/uncheck the port type.
o Press <F2> then <F3> to quickly open the Expression Editor of
an OUT/VAR port. The expression must be highlighted.
• To cancel an edit in the grid, press <Esc> then click OK.
• For all combo/dropdown list boxes, just type the first letter on the list
to select the item you want.
• To copy a selected item in the grid, press <Ctrl><C>.
• To paste a selected item from the Clipboard to the grid, press
<Ctrl><V>.
• To delete a selected field or port from the grid, press <Alt><C>.
• To copy a selected row from the grid, press <Alt><O>.
• To paste a selected row from the grid, press <Alt><P>.

Expression Editor

• To expedite the validation of a newly created expression, simply press
OK to initiate the parsing/validation of the expression, then press OK
once again in the “Expression parsed successfully” pop-up.
• To select PowerCenter functions and ports during expression creation,
use the Functions and Ports Tab.

Creating Inventories of Reusable Objects & Mappings

Challenge

Successfully creating inventories of reusable objects and mappings, including
identifying potential economies of scale in loading multiple sources to the same
target.

Description

Reusable Objects

The first step in creating an inventory of reusable objects is to review the business
requirements and look for any common routines/modules that may appear in more
than one data movement. These common routines are excellent candidates for
reusable objects. In PowerCenter, reusable objects can be single transformations
(lookups, filters, etc.) or even a string of transformations (mapplets).

Evaluate potential reusable objects by two criteria:

• Is there enough usage and complexity to warrant the development of a
common object?
• Are the data types of the information passing through the reusable object the
same from case to case or is it simply the same high-level steps with different
fields and data?

Common objects are sometimes created just for the sake of creating common
components when in reality, creating and testing the object does not save
development time or future maintenance. For example, if there is a simple
calculation like subtracting a current rate from a budget rate that will be used for two
different mappings, carefully consider whether the effort to create, test, and
document the common object is worthwhile. Often, it is simpler to add the
calculation to both mappings. However, if the calculation were to be performed in a
number of mappings, if it were very difficult, and if all occurrences would need to be
updated following any change or fix, then this would be an ideal case for a reusable
object.

The second criterion for a reusable object concerns the data that will pass through
the reusable object. Many times developers see a situation where they may perform
a certain type of high-level process (e.g., filter, expression, update strategy) in two
or more mappings; at first look, this seems like a great candidate for a mapplet.
However, after performing half of the mapplet work, the developers may realize that
the actual data or ports passing through the high level logic are totally different from
case to case, thus making the use of a mapplet impractical. Consider whether there
is a practical way to generalize the common logic, so that it can be successfully
applied to multiple cases. Remember, when creating a reusable object, the actual
object will be replicated in one to many mappings. Thus, in each mapping using the
mapplet or reusable transformation object, the same size and number of ports must
pass into and out of the mapping/reusable object.

Document the list of the reusable objects that pass this criteria test, providing a
high-level description of what each object will accomplish. The detailed design will
occur in a future subtask, but at this point the intent is to identify the number and
functionality of reusable objects that will be built for the project. Keep in mind that it
will be impossible to identify 100 percent of the reusable objects at this point; the
goal here is to create an inventory of as many as possible, and hopefully the most
difficult ones. The remainder will be discovered while building the data integration
processes.

Mappings

A mapping is an individual movement of data from a source system to a target
system. In a simple world, a single source table would populate a single target table.
However, in practice, this is usually not the case. Sometimes multiple sources of
data need to be combined to create a target table, and sometimes a single source of
data creates many target tables. The latter is especially true for mainframe data
sources where COBOL OCCURS statements litter the landscape. In a typical
warehouse or data mart model, each OCCURS statement decomposes to a separate
table.

The goal here is to create an inventory of the mappings needed for the project. For
this exercise, the challenge is to think in individual components of data movement.
While the business may consider a fact table and its three related dimensions as a
single ‘object’ in the data mart or warehouse, five mappings may be needed to
populate the corresponding star schema with data (i.e., one for each of the
dimension tables and two for the fact table, each from a different source system).

Typically, when creating an inventory of mappings, the focus is on the target tables,
with an assumption that each target table has its own mapping, or sometimes
multiple mappings. While often true, if a single source of data populates multiple
tables, this approach yields multiple mappings. Efficiencies can sometimes be
realized by loading multiple tables from a single source. By simply focusing on the
target tables, however, these efficiencies can be overlooked.

A more comprehensive approach to creating the inventory of mappings is to create a
spreadsheet listing all of the target tables. Create a column with a number next to
each target table. For each of the target tables, in another column, list the source
file or table that will be used to populate the table. In the case of multiple source
tables per target, create one row per source for the target, each with the same
number, and list the additional source(s) of data.

The Table would look similar to the following:

Number  Target Table    Source
1       Customers       Cust_File
2       Products        Items
3       Customer_Type   Cust_File
4       Orders_Item     Tickets
4       Orders_Item     Ticket_Items

When completed, the spreadsheet can be sorted either by target table or source
table. Sorting by source table can help determine potential mappings that create
multiple targets.

When using a source to populate multiple tables at once for efficiency, be sure to
keep restartability/reloadability in mind. The mapping will always load two or more
target tables from the source, so there will be no easy way to rerun a single table. In
this example, the Customers and Customer_Type tables can potentially be
loaded in the same mapping.

When merging targets into one mapping in this manner, give both targets the same
number. Then, re-sort the spreadsheet by number. For the mappings with multiple
sources or targets, merge the data back into a single row to generate the inventory
of mappings, with each number representing a separate mapping.

The inventory would look similar to the following:

Number  Target Table    Source
1       Customers       Cust_File
        Customer_Type
2       Products        Items
4       Orders_Item     Tickets
                        Ticket_Items
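
The sort-and-merge steps can also be scripted instead of being done by hand in
a spreadsheet. The sketch below uses the example rows from this section, with
Customer_Type already renumbered to 1; the function names are illustrative
only.

```python
from collections import defaultdict

# (number, target table, source) rows, after Customer_Type has been given
# the same number as Customers.
ROWS = [
    (1, "Customers", "Cust_File"),
    (2, "Products", "Items"),
    (1, "Customer_Type", "Cust_File"),
    (4, "Orders_Item", "Tickets"),
    (4, "Orders_Item", "Ticket_Items"),
]


def targets_by_source(rows):
    """Group targets by source to spot sources that feed several targets."""
    grouped = defaultdict(list)
    for _number, target, source in rows:
        if target not in grouped[source]:
            grouped[source].append(target)
    return dict(grouped)


def mapping_inventory(rows):
    """Merge rows that share a number into one inventory entry per mapping."""
    inventory = {}
    for number, target, source in sorted(rows, key=lambda r: r[0]):
        entry = inventory.setdefault(number, {"targets": [], "sources": []})
        if target not in entry["targets"]:
            entry["targets"].append(target)
        if source not in entry["sources"]:
            entry["sources"].append(source)
    return inventory


for number, entry in mapping_inventory(ROWS).items():
    print(number, entry["targets"], entry["sources"])
```

Any source that `targets_by_source` maps to more than one target is a
candidate for a single multi-target mapping.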

At this point, it is often helpful to record some additional information about each
mapping to help with planning and maintenance.

First, give each mapping a name. Apply the naming standards generated in 2.2
DESIGN DEVELOPMENT ARCHITECTURE. These names can then be used to
distinguish mappings from each other and also can be put on the project plan as
individual tasks.

Next, determine for the project a threshold for a High, Medium, or Low number of
target rows. For example, in a warehouse where dimension tables are likely to
number in the thousands and fact tables in the hundred thousands, the following
thresholds might apply:

Low – 1 to 10,000 rows

Med – 10,000 to 100,000 rows

High – 100,000 rows +

Assign a likely row volume (High, Med or Low) to each of the mappings based on the
expected volume of data to pass through the mapping. These high level estimates
will help to determine how many mappings are of ‘High’ volume; these mappings will
be the first candidates for performance tuning.
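
Assigning the rating can be reduced to a small function. The thresholds are
the example values given above; because the example ranges share their
boundary values, the sketch assumes that exactly 10,000 rows counts as Med and
exactly 100,000 rows counts as High. The mapping names and row counts are
hypothetical.

```python
def classify_volume(expected_rows, low_limit=10_000, high_limit=100_000):
    """Return 'Low', 'Med' or 'High' for an expected target row count.

    Boundary handling is an assumption: a count equal to a limit falls
    into the higher category.
    """
    if expected_rows < low_limit:
        return "Low"
    if expected_rows < high_limit:
        return "Med"
    return "High"


# Hypothetical mapping names and expected volumes.
estimates = {"m_Customers": 4_500, "m_Products": 32_000, "m_Orders_Item": 750_000}
tuning_candidates = [name for name, rows in estimates.items()
                     if classify_volume(rows) == "High"]
print(tuning_candidates)
```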

Add any other columns of information that might be useful to capture about each
mapping, such as a high-level description of the mapping functionality, resource
(developer) assigned, initial estimate, actual completion time, or complexity rating.

Updating Repository Statistics

Challenge

The PowerCenter repository has more than eighty tables, and nearly all use one or
more indexes to speed up queries. Most databases keep and use column distribution
statistics to determine which index to use in order to optimally execute SQL queries.
Database servers do not update these statistics continuously, so they quickly become
outdated in frequently-used repositories, and SQL query optimizers may choose a
less-than-optimal query plan. In large repositories, choosing a sub-optimal query
plan can drastically affect performance. As a result, the repository becomes slower
and slower over time.

Description

The Database Administrator needs to continually update the database statistics to
ensure that they remain up-to-date. The frequency of updating depends on how
heavily the repository is used. Because the statistics need to be updated table by
table, it is useful for Database Administrators to create scripts to automate the task.

For the repository tables, it is helpful to understand that all PowerCenter repository
table and index names begin with "OPB_" or "REP_". The following information is
useful for generating scripts to update distribution statistics.

Oracle

Run the following queries:

select 'analyze table ', table_name, ' compute statistics;' from
user_tables where table_name like 'OPB_%';
select 'analyze index ', INDEX_NAME, ' compute statistics;' from
user_indexes where INDEX_NAME like 'OPB_%';

This will produce output like:

'ANALYZETABLE' TABLE_NAME 'COMPUTESTATISTICS;'


analyze table OPB_ANALYZE_DEP compute statistics;
analyze table OPB_ATTR compute statistics;
analyze table OPB_BATCH_OBJECT compute statistics;
.
.
.
'ANALYZEINDEX' INDEX_NAME 'COMPUTESTATISTICS;'
analyze index OPB_DBD_IDX compute statistics;
analyze index OPB_DIM_LEVEL compute statistics;
analyze index OPB_EXPR_IDX compute statistics;
.
.

Save the output to a file. Then edit the file and remove all the headers (i.e., the
lines that look like the following):

'ANALYZEINDEX' INDEX_NAME 'COMPUTESTATISTICS;'

Run this as a SQL script. This updates statistics for the repository tables.
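
The manual editing step can be automated. The sketch below is one possible
approach, not part of PowerCenter: it keeps only lines that look like
generated statements and drops headers, blank lines and separator dots.

```python
def clean_spool(lines):
    """Keep only the generated statements from saved query output."""
    statements = []
    for line in lines:
        text = line.strip()
        # Generated statements start with 'analyze' and end with ';'.
        # Header lines, blank lines and separators do not.
        if text.lower().startswith("analyze ") and text.endswith(";"):
            statements.append(text)
    return statements


sample = [
    "'ANALYZETABLE' TABLE_NAME 'COMPUTESTATISTICS;'",
    "",
    "analyze table OPB_ANALYZE_DEP compute statistics;",
    "analyze table OPB_ATTR compute statistics;",
    "'ANALYZEINDEX' INDEX_NAME 'COMPUTESTATISTICS;'",
    "analyze index OPB_DBD_IDX compute statistics;",
]
for statement in clean_spool(sample):
    print(statement)
```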

MS SQL Server

Run the following query:

select 'update statistics ', name from sysobjects where name like
'OPB_%'

This will produce output like:

name
update statistics OPB_ANALYZE_DEP
update statistics OPB_ATTR
update statistics OPB_BATCH_OBJECT
.
.
Save the output to a file, then edit the file and remove the header information (i.e.,
the top two lines) and add a 'go' at the end of the file.

Run this as a SQL script. This updates statistics for the repository tables.

Sybase

Run the following query:

select 'update statistics ', name from sysobjects where name like
'OPB_%'

This will produce output like:

name
update statistics OPB_ANALYZE_DEP
update statistics OPB_ATTR
update statistics OPB_BATCH_OBJECT
.
.
.

Save the output to a file, then remove the header information (i.e., the top two
lines), and add a 'go' at the end of the file.

Run this as a SQL script. This updates statistics for the repository tables.

Informix

Run the following query:

select 'update statistics low for table ', tabname, ' ;' from systables
where tabname like 'opb_%' or tabname like 'OPB_%';

This will produce output like:

(constant) tabname (constant)


update statistics low for table OPB_ANALYZE_DEP ;
update statistics low for table OPB_ATTR ;
update statistics low for table OPB_BATCH_OBJECT ;
.
.
.

Save the output to a file, then edit the file and remove the header information (i.e.,
the top line that looks like the following):

(constant) tabname (constant)

Run this as a SQL script. This updates statistics for the repository tables.

DB2

Run the following query:

select 'runstats on table ', (rtrim(tabschema)||'.')||tabname, ' and
indexes all;'
from sysstat.tables where tabname like 'OPB_%'

This will produce output like:

runstats on table PARTH.OPB_ANALYZE_DEP
and indexes all;
runstats on table PARTH.OPB_ATTR
and indexes all;
runstats on table PARTH.OPB_BATCH_OBJECT
and indexes all;
.
.
.

Save the output to a file.

Run this as a SQL script to update statistics for the repository tables.
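
As an alternative to editing saved query output, a script can be built
directly from a list of table names. The sketch below reuses the statement
text shown in each database section above; the function and dictionary names
are illustrative. It covers the table statements only (the Oracle index
statements would be generated the same way), and DB2 table names are assumed
to be passed in already schema-qualified.

```python
# Statement templates copied from the per-database sections above.
TEMPLATES = {
    "oracle": "analyze table {table} compute statistics;",
    "sqlserver": "update statistics {table}",
    "sybase": "update statistics {table}",
    "informix": "update statistics low for table {table} ;",
    "db2": "runstats on table {table} and indexes all;",
}


def build_script(database, tables):
    """Return the text of a statistics script for the given tables."""
    template = TEMPLATES[database]
    lines = [template.format(table=name) for name in tables]
    if database in ("sqlserver", "sybase"):
        lines.append("go")  # batch terminator added at the end of the file
    return "\n".join(lines) + "\n"


print(build_script("sybase", ["OPB_ANALYZE_DEP", "OPB_ATTR"]))
```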

Daily Operations

Challenge

Once the data warehouse has been moved to production, the most important task is
keeping the system running and available for the end users.

Description

In most organizations, the day-to-day operation of the data warehouse is the
responsibility of a Production Support team. This team is typically involved
with the support of other systems and has expertise in database systems and
various operating systems. The Data Warehouse Development team becomes,
in effect, a customer to the Production Support team. To that end, the
Production Support team needs two documents, a Service Level Agreement
and an Operations Manual, to help in the support of the production data
warehouse.

Service Level Agreement

The Service Level Agreement outlines how the overall data warehouse system
will be maintained. This high-level document describes the system to
be maintained and its components, and identifies the groups
responsible for monitoring the various components of the system. At a
minimum, it should contain the following information:

• Times when the system should be available to users
• Scheduled maintenance window
• Who is expected to monitor the operating system
• Who is expected to monitor the database
• Who is expected to monitor the Informatica sessions
• How quickly the support team is expected to respond to notifications of
system failures
• Escalation procedures that include data warehouse team contacts in
the event that the support team cannot resolve the system failure

Operations Manual

The Operations Manual is crucial to the Production Support team because it
provides the information needed to perform the maintenance of the data
warehouse system. This manual should be self-contained, providing all of the
information necessary for a production support operator to maintain the
system and resolve most problems that may arise. This manual should
contain information on how to maintain all components of the data warehouse
system. At a minimum, the Operations Manual should contain:

• Information on how to stop and re-start the various components of the
system
• IDs and passwords (or how to obtain passwords) for the system
components
• Information on how to re-start failed PowerCenter sessions
• A listing of all jobs that are run, their frequency (daily, weekly,
monthly, etc.), and the average run times
• Who to call in the event of a component failure that cannot be resolved
by the Production Support team

Load Validation

Challenge

Knowing that all data for the current load cycle has loaded correctly is essential for
good data warehouse management. However, the need for load validation varies,
depending on the extent of error checking, data validation or data cleansing
functionality inherent in your mappings.

Description

Methods for validating the load process range from simple to complex. The first step
is to determine what information you need for load validation (e.g., batch names,
session names, session start times, session completion times, successful rows and
failed rows). Then, you must determine the source of this information. All this
information is stored as metadata in the repository, but you must have a means of
extracting this information.

Finally, you must determine how you want this information presented to you. Do you
want it stored as a flat file? Do you want it e-mailed to you? Do you want it
available in a relational table, so that history can easily be preserved? All of these
factors weigh in finding the correct solution for you.

The following paragraphs describe three possible solutions for load validation,
beginning with a fairly simple solution and moving toward the more complex:

1. Post-session e-mails on either success or failure

Post-session e-mail is configured in the session, under the General tab and
‘Session Commands’.

A number of variables are available to simplify the text of the e-mail:

- %s Session name
- %e Session status
- %b Session start time
- %c Session completion time
- %i Session elapsed time
- %l Total records loaded
- %r Total records rejected
- %t Target table details
- %m Name of the mapping used in the session
- %n Name of the folder containing the session
- %d Name of the repository containing the session
- %g Attach the session log to the message
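
As an illustration of how these variables combine into message text, the
sketch below previews a template locally. The template wording, the session
name and the status text are hypothetical, and the preview function is only a
stand-in: the PowerCenter Server performs the real substitution at run time.

```python
# Hypothetical message text of the kind entered for a post-session e-mail.
TEMPLATE = ("Session %s in folder %n finished with status %e: "
            "%l rows loaded, %r rejected.")


def preview(template, values):
    """Expand two-character %x variables for a local preview only."""
    out = []
    i = 0
    while i < len(template):
        pair = template[i:i + 2]
        if pair in values:
            out.append(str(values[pair]))
            i += 2
        else:
            out.append(template[i])
            i += 1
    return "".join(out)


# Hypothetical values standing in for what the server would supply.
sample_values = {"%s": "SMW_NEW_LOANS", "%n": "Finance", "%e": "Succeeded",
                 "%l": 35987, "%r": 0}
print(preview(TEMPLATE, sample_values))
```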

TIP: One practical application of this functionality is the situation in which a key
business user waits for completion of a session to run a report. You can configure e-
mail to this user, notifying him/her that the session was successful and the report
can run.

2. Query the repository

Almost any query can be put together to retrieve data about the load execution from
the repository. The MX view REP_SESS_LOG is a great place to start. This view is
likely to contain all the information you need. The following sample query shows how
to extract folder name, session name, session end time, successful rows and session
duration:

select subject_area, session_name, session_timestamp, successful_rows,
(session_timestamp - actual_start) * 24 * 60 * 60 from rep_sess_log a where
session_timestamp = (select max(session_timestamp) from rep_sess_log
where session_name = a.session_name) order by subject_area, session_name

The sample output would look like this:

Folder Name    Session Name               Session End Time      Successful  Failed  Session Duration
                                                                Rows        Rows    (secs)
Web Analytics  SMW DYNMIC KEYS FILE LOAD  5/8/2001 7:49:18 AM   12900       0       126
Web Analytics  SMW LOAD WEB FACT          5/8/2001 7:53:01 AM   125000      0       478
Finance        SMW NEW LOANS              5/8/2001 8:06:01 AM   35987       0       178
Finance        SMW UPD LOANS              5/8/2001 8:10:32 AM   45          0       12
HR             SMW NEW PERSONNEL          5/8/2001 8:15:27 AM   5           0       10
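
The pattern in the sample query (a correlated subquery that keeps only the
most recent run of each session) can be tried without a repository. The sketch
below runs the same shape of query against an in-memory SQLite table that
merely imitates a few REP_SESS_LOG columns; the table contents are invented,
and the duration is computed with SQLite's strftime rather than the date
arithmetic used in the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A stand-in table with a handful of the view's columns (assumed layout).
conn.execute(
    "create table rep_sess_log ("
    " subject_area text, session_name text,"
    " actual_start text, session_timestamp text, successful_rows integer)"
)
conn.executemany(
    "insert into rep_sess_log values (?, ?, ?, ?, ?)",
    [
        ("Finance", "SMW_NEW_LOANS", "2001-05-07 08:03:00", "2001-05-07 08:06:10", 35001),
        ("Finance", "SMW_NEW_LOANS", "2001-05-08 08:03:03", "2001-05-08 08:06:01", 35987),
        ("HR", "SMW_NEW_PERSONNEL", "2001-05-08 08:15:17", "2001-05-08 08:15:27", 5),
    ],
)

LATEST_RUNS = """
select subject_area, session_name, session_timestamp, successful_rows,
       cast(strftime('%s', session_timestamp) as integer)
     - cast(strftime('%s', actual_start) as integer) as duration_secs
from rep_sess_log a
where session_timestamp = (select max(session_timestamp)
                           from rep_sess_log
                           where session_name = a.session_name)
order by subject_area, session_name
"""

latest = conn.execute(LATEST_RUNS).fetchall()
for row in latest:
    print(row)
```

Only the 8 May run of SMW_NEW_LOANS is returned, because the subquery filters
out the earlier run of the same session.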

3. Use a mapping

A more complex approach, and the most customizable, is to create a PowerCenter
mapping to populate a table or flat file with desired information. You can do this by
sourcing the MX view REP_SESS_LOG and then performing lookups to other
repository tables or views for additional information.

The following graphic illustrates a sample mapping:

This mapping selects data from REP_SESS_LOG and performs lookups to retrieve the
absolute minimum and maximum run times for that particular session. This enables
you to compare the current execution time with the minimum and maximum
durations.
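
The comparison itself is simple once the historical durations are available.
The sketch below shows the logic outside of a mapping, using invented
durations in seconds; it is an illustration, not a description of how the
lookups are implemented.

```python
def duration_check(history_secs, current_secs):
    """Compare a run's duration with the historical minimum and maximum."""
    low, high = min(history_secs), max(history_secs)
    if current_secs < low:
        verdict = "faster than any previous run"
    elif current_secs > high:
        verdict = "slower than any previous run"
    else:
        verdict = "within the historical range"
    return low, high, verdict


# Hypothetical earlier durations for one session, then the current run.
print(duration_check([170, 178, 190], 478))
```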

Third Party Scheduler

Challenge

Successfully integrating a third-party scheduler with PowerCenter. This Best Practice
describes the levels at which a third-party scheduler can be integrated.

Description

When moving into production, many companies require the use of a third-party
scheduler that is the company standard.

A third-party scheduler can start and stop an Informatica session or batch using the
PMCMD commands. Because PowerCenter has a scheduler, there are several levels
at which to integrate a third-party scheduler with PowerCenter. The correct level of
integration depends on the complexity of the batch/schedule and level and type of
production support.

Third Party Scheduler Integration Levels

In general, there are three levels of integration between a third-party scheduler and
Informatica: Low Level, Medium Level, and High Level.

Low Level

Low level integration refers to a third-party scheduler kicking off only one
Informatica session or a batch. That initial PowerCenter process subsequently kicks
off the rest of the sessions and batches. The PowerCenter scheduler handles all
processes and dependencies after the third-party scheduler has kicked off the initial
batch or session. In this level of integration, nearly all control lies with the
PowerCenter scheduler.

This type of integration should only be used as a loophole to fulfill
a corporate mandate on a standard scheduler. It is very
simple to implement because the third-party scheduler kicks off only one process,
but the third-party scheduler does not add any functionality that cannot be handled by
the PowerCenter scheduler.

Low level integration requires production support personnel to have a thorough
knowledge of PowerCenter. Because many companies only have Production Support
personnel with knowledge in the company’s standard scheduler, one of the main
disadvantages of this level of integration is that if a batch fails at some point, the
Production Support personnel may not be able to determine the exact breakpoint.
Thus, the majority of the production support burden falls back on the Project
Development team.

Medium Level

Medium level integration is when a third-party scheduler kicks off many different
batches or sessions, but not all sessions. A third-party scheduler may kick off several
PowerCenter batches and sessions but within those batches, PowerCenter may have
several sessions defined with dependencies. Thus, PowerCenter is controlling the
dependencies within those batches. In this level of integration, the control is shared
between PowerCenter and a third-party scheduler.

This type of integration is more complex than low level integration because there is
much more interaction between the third-party scheduler and PowerCenter.
However, to reduce the total amount of work required to integrate the third-party
scheduler and PowerCenter, many of the PowerCenter sessions may be left in
batches. This reduces the integration chores because the third-party scheduler is
only communicating with a limited number of PowerCenter batches.

Medium level integration requires Production Support personnel to have a fairly good
knowledge of PowerCenter. Because Production Support personnel in many
companies are knowledgeable only about the company’s standard scheduler,
one significant disadvantage of this level of integration is that if the batch
fails at some point, the Production Support personnel may not be able to determine
the exact breakpoint. They are probably able to determine the general area, but not
necessarily the specific session. Thus, the production support burden is shared
between the Project Development team and the Production Support team.

High Level

High level integration is when a third-party scheduler has full control of scheduling
and kicks off all PowerCenter sessions. Because the PowerCenter sessions are not
part of any batches, the third-party scheduler controls all dependencies among the
sessions.

This type of integration is the most complex to implement because there are many
more interactions between the third-party scheduler and PowerCenter. The third-
party scheduler controls all dependencies between the sessions.

High level integration allows the Production Support personnel to have only limited
knowledge of PowerCenter. Because the Production Support personnel in many
companies are knowledgeable only about the company’s standard scheduler, one of
the main advantages of this level of integration is that if the batch fails at some
point, the Production Support personnel are usually able to determine the exact
breakpoint. Thus, the production support burden lies with the Production Support
team.

Event Based Scheduling

Challenge

In an operational environment, the start of a session needs to be triggered by
another session or other event. The best method of event-based scheduling with the
PowerCenter Server is the use of indicator files.

Description

The indicator file configuration is specified in the session configuration, under
advanced options. The indicator file must be in a location where the PowerCenter
Server can find it, much like a flat file source. When the session starts, the
PowerCenter Server looks for this file and removes it as soon as it finds it.

If the session is waiting on its source file to be FTP’ed from another server, the FTP
process should be scripted so that it creates the indicator file upon successful
completion of the source file FTP. This file can be an empty, or dummy, file. The
mere existence of the dummy file is enough to indicate that the session should
start. The dummy file will be removed immediately after it is located. It is,
therefore, essential that you do not use your flat file source as the indicator file.
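
The pattern described above can be sketched as follows. This is a minimal illustration, assuming a hypothetical Python wrapper around the transfer step: the function name, the file names, and the completeness check (a non-empty source file) are assumptions, and the FTP step itself is not shown. The indicator is a separate, empty file that is created only once the source file is in place.

```python
import os
import tempfile


def signal_source_ready(source_path, indicator_path):
    """Create an empty indicator file once the source file is in place.

    Returns True if the indicator was created, False otherwise.
    """
    # Never reuse the source file as the indicator: the server removes
    # the indicator file as soon as it locates it.
    if os.path.abspath(source_path) == os.path.abspath(indicator_path):
        raise ValueError("indicator file must not be the flat file source")

    # Assumed completeness check: the transfer has finished and left a
    # non-empty file behind.
    if not os.path.isfile(source_path) or os.path.getsize(source_path) == 0:
        return False

    # An empty (dummy) file is enough; only its existence matters.
    with open(indicator_path, "w"):
        pass
    return True


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    source = os.path.join(workdir, "orders.dat")      # hypothetical source file
    indicator = os.path.join(workdir, "orders.ind")   # hypothetical indicator file
    with open(source, "w") as handle:
        handle.write("1,widget\n")
    print(signal_source_ready(source, indicator))
```

Creating the indicator as the last step of the transfer script means the session is never released against a partially transferred source file.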

Repository Administration

Challenge

The task of managing the repository, either in development or production, is
extremely important. A number of best practices are available to facilitate the tasks
involved with this responsibility.

Description

The following paragraphs describe several of the key tasks involved in managing the
repository:

Backing Up the Repository

Two methods are advisable for repository backup: (1) a native PowerCenter backup,
using either the PowerCenter Repository Manager or the ‘pmrep’ command line utility,
and (2) the traditional database backup method. The native PowerCenter backup is
required; Informatica recommends using both methods, although the database backup
is not essential. If database corruption occurs, the native PowerCenter backup
provides a clean backup that can be restored to a new database.

Analyzing Tables in the Repository

If operations in any of the client tools, including connectivity to the PowerCenter
repository, are slowing down, you may need to analyze the tables in the repository
to facilitate data retrieval, thereby increasing performance.

Purging Old Session Log Information

Similarly, if folder copies are taking an unusually long time, a likely cause is the
volume of data being transferred from the OPB_SESSION_LOG and/or
OPB_SESS_TARG_LOG tables. Removing unnecessary data from these tables will
expedite the repository backup process as well as the folder copy operation. To
determine which logs to eliminate, execute the following select statement to
retrieve the sessions with the most entries in OPB_SESSION_LOG:

select subj_name, sessname, count(*)
from opb_session_log a, opb_subject b, opb_load_session c
where a.session_id = c.session_id and b.subj_id = c.subj_id
group by subj_name, sessname
order by count(*) desc

The logs can then be removed in either of two ways:

1. Copy the original session, then delete the original session. When a session is
copied, its log entries in the repository tables are not duplicated. When you delete
the original session, its entries in the tables are deleted, eliminating all rows for
that session.

2. Log into Repository Manager and expand the sessions in a particular folder.
When you select one of the sessions, all of its session logs appear on the right-hand
side of the screen. You can manually delete any of these by highlighting a
particular log, then selecting Delete from the Edit menu. Respond ‘Yes’ when the
system prompts you with the question “Delete these logs from the Repository?”

pmrep Utility

The pmrep utility was introduced in PowerCenter 5.0 to facilitate repository
administration and server level administration. This utility is a command-line
program for Windows 95/98 or Windows NT/2000 to update session-related
parameters in a PowerCenter repository. It is a standalone utility that installs in the
PowerCenter Client installation directory. It is not currently available for UNIX.

The pmrep utility has two modes: command line mode and interactive mode.

• Command line mode lets you execute pmrep commands from the Windows
command line. In this mode, pmrep starts and exits each time a command is
issued. Command line mode is useful for batch files or scripts.
• Interactive mode invokes pmrep and allows you to issue a series of
commands from a pmrep prompt without exiting after each command.

The following examples illustrate the use of pmrep:

Example 1: Script to backup PowerCenter Repository

echo Connecting to repository <Informatica Repository Name>...

d:\PROGRA~1\INFORM~1\pmrep\pmrep connect -r <Informatica Repository Name> -n <Repository User Name> -x <Repository Password> -t <Database Type> -u <Database User Name> -p <Database Password> -c <Database Connection String>

echo Starting Repository Backup...

d:\PROGRA~1\INFORM~1\pmrep\pmrep backup -o <Output File Name>

echo Clearing Connection...

d:\PROGRA~1\INFORM~1\pmrep\pmrep cleanup

echo Repository Backup is Complete...

Example 2: Script to update database connection information

echo Connecting to repository <Informatica Repository Name>...

d:\PROGRA~1\INFORM~1\pmrep\pmrep connect -r <Informatica Repository Name> -n <Repository User Name> -x <Repository Password> -t <Database Type> -u <Database User Name> -p <Database Password> -c <Database Connection String>

echo Begin Updating Connection Information for <Database Connection Name>...

d:\PROGRA~1\INFORM~1\pmrep\pmrep updatedbconfig -d <Database Connection Name> -u <New Database Username> -p <New Database Password> -c <New Database Connection String> -t <Database Type>

echo Clearing Connection...

d:\PROGRA~1\INFORM~1\pmrep\pmrep cleanup

echo Completed Updating Connection Information for <Database Connection Name>...

Export and Import Registry

The Repository Manager saves repository connection information in the registry. To
simplify the process of setting up client machines, you can export the connection
information, and then import it to a different client machine (as long as both
machines use the same operating system).

The section of the registry that you can import and export contains the following
repository connection information:

• Repository name
• Database username and password (must be in US-ASCII)
• Repository username and password (must be in US-ASCII)
• ODBC data source name (DSN)

The exported registry section does not include the ODBC data source itself. If you
import a registry containing a DSN that does not exist on that client system, the
connection fails. For each imported DSN, be sure the appropriate data source is
configured on the client under exactly the same name as in the registry you are
going to import.

High Availability

Challenge

In a highly available environment, load schedules cannot be impacted by the failure
of physical hardware. The PowerCenter Server must be running at all times. If the
machine hosting the PowerCenter Server goes down, another machine must
recognize this, start another Server, and assume responsibility for running the
sessions and batches. This is best accomplished in a clustered environment.

Description

While there are many types of hardware and many ways to configure a clustered
environment, this example is based on the following hardware and software
characteristics:

• Two Sun 4500 servers running Solaris OS
• Sun High-Availability Clustering Software
• External EMC storage, with each server owning specific disks
• PowerCenter installed on a separate disk that is accessible by both servers in
the cluster, but only by one server at a time

One of the Sun 4500s serves as the primary data integration server, while the other
server in the cluster is the secondary server. Under normal operations, the
PowerCenter Server ‘thinks’ it is physically hosted by the primary server and uses
the resources of the primary server, although it is physically located on its own
server.

When the primary server goes down, the Sun high-availability software automatically
starts the PowerCenter Server on the secondary server using the basic auto
start/stop scripts that are used in many UNIX environments to automatically start
the PowerCenter Server whenever a host is rebooted. In addition, the Sun high-
availability software changes the ownership of the disk where the PowerCenter
Server is installed from the primary server to the secondary server. To facilitate this,
a logical IP address can be created specifically for the PowerCenter Server. This
logical IP address is specified in the pmserver.cfg file instead of the physical IP
addresses of the servers. Thus, only one pmserver.cfg file is needed.

Recommended Performance Tuning Procedures

Challenge

Efficient and effective performance tuning for PowerCenter products.

Description

Performance tuning procedures consist of the following steps, performed in a
pre-determined order to pinpoint where tuning efforts should be focused.

1. Perform Benchmarking. Benchmark the sessions to set a baseline to measure
improvements against.

2. Monitor the server. By running a session and monitoring the server, it should
immediately be apparent if the system is paging memory or if the CPU load is too
high for the number of available processors. If the system is paging, correcting the
system to prevent paging (e.g., increasing the physical memory available on the
machine) can greatly improve performance.

3. Use the performance details. Re-run the session and monitor the performance
details. This time, look at the details and watch the Buffer Input and Output values
for the sources and targets.

4. Tune the source system and target system based on the performance details.
When the source and target are optimized, re-run the session to determine the
impact of the changes.

5. Only after the server, source, and target have been tuned to their peak
performance should the mapping be analyzed for tuning.

6. After the tuning achieves a desired level of performance, the DTM should be the
slowest portion of the session details. This indicates that the source data is arriving
quickly, the target is inserting the data quickly, and the actual application of the
business rules is the slowest portion. This is the optimum desired performance. Only
minor tuning of the session can be conducted at this point and usually has only a
minor effect.

7. Finally, re-run the sessions that have been identified as the benchmark,
comparing the new performance with the old performance. In some cases,
optimizing one or two sessions to run quickly can have a disastrous effect on another
mapping and care should be taken to ensure that this does not occur.
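
The comparison in the final step can be sketched as follows. This is an illustration only, assuming elapsed times (in seconds) have been recorded per session before and after tuning; the function name, the session names, the timings, and the 10 percent tolerance are all hypothetical.

```python
def find_regressions(baseline, current, tolerance=0.10):
    """Return sessions whose run time grew by more than `tolerance`.

    `baseline` and `current` map session names to elapsed seconds.
    The result maps each regressed session to its fractional slowdown.
    """
    regressions = {}
    for session, old_seconds in baseline.items():
        new_seconds = current.get(session)
        if new_seconds is None or old_seconds <= 0:
            continue  # session not re-run, or no usable baseline
        change = (new_seconds - old_seconds) / old_seconds
        if change > tolerance:
            regressions[session] = round(change, 3)
    return regressions


if __name__ == "__main__":
    # Hypothetical timings in seconds, for illustration only.
    baseline = {"s_load_customer": 600.0, "s_load_sales": 1200.0}
    current = {"s_load_customer": 300.0, "s_load_sales": 1500.0}
    print(find_regressions(baseline, current))  # {'s_load_sales': 0.25}
```

In this made-up example, tuning halved the run time of one session but slowed another by 25 percent, which is the kind of side effect the benchmark re-run is meant to catch.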

Performance Tuning Databases

Challenge

Database tuning can result in tremendous improvement in loading performance. This
Best Practice covers tips on tuning several databases: Oracle, SQL Server and
Teradata.

Oracle

Performance Tuning Tools

Oracle offers many tools for tuning an Oracle instance. Most DBAs are already
familiar with these tools, so we’ve included only a short description of some of the
major ones here.

• V$ Views

V$ views are dynamic performance views that provide real-time
information on database activity, enabling the DBA to draw conclusions
about database performance. Because SYS is the owner of these
views, only SYS can query them by default. Keep in mind that querying
these views impacts database performance; each query has an
immediate cost. With this in mind, carefully consider which users should
be granted the privilege to query these views. You can grant viewing
privileges with either the ‘SELECT’ privilege, which allows a user to
view individual V$ views, or the ‘SELECT ANY TABLE’ privilege,
which allows the user to view all V$ views. Using the SELECT ANY
TABLE option requires that the ‘O7_DICTIONARY_ACCESSIBILITY’
parameter be set to ‘TRUE’, which allows the ‘ANY’ keyword to apply to
SYS-owned objects.

• Explain Plan

Explain Plan, SQL Trace, and TKPROF are powerful tools for revealing
bottlenecks and developing a strategy to avoid them.

Explain Plan allows the DBA or developer to determine the execution path of a
block of SQL code. The SQL in a source qualifier or in a lookup that is running
for a long time should be generated and copied to SQL*PLUS or other SQL
tool and tested to avoid inefficient execution of these statements. Review the
PowerCenter session log for long initialization time (an indicator that the
source qualifier may need tuning) and the time it takes to build a lookup
cache to determine if the SQL for these transformations should be tested.

• SQL Trace

SQL Trace extends the functionality of Explain Plan by providing statistical
information about the SQL statements executed in a session that has tracing
enabled. This utility is run for a session with the ‘ALTER SESSION SET
SQL_TRACE = TRUE’ statement.

• TKPROF

The output of SQL Trace is provided in a dump file that is difficult to read.
TKPROF formats this dump file into a more understandable report.

• UTLBSTAT & UTLESTAT

Executing ‘UTLBSTAT’ creates tables to store dynamic performance statistics
and begins the statistics collection process. Run this utility after the database
has been up and running (for hours or days). Accumulating statistics may
take time, so you need to run this utility for a long while and through several
operations (i.e., both loading and querying).

‘UTLESTAT’ ends the statistics collection process and generates an output file
called ‘report.txt.’ This report should give the DBA a fairly complete idea
about the level of usage the database experiences and reveal areas that
should be addressed.

Disk I/O

Tuning disk I/O at the database level provides the highest level of performance gain
in most systems. Database files should be identified and separated. Rollback files
should be placed on their own disks because they have significant disk I/O. Co-locate
tables that are heavily used with tables that are rarely used to help minimize disk
contention. Separate indexes from tables so that queries that access both are not
competing for the same resource. Also be sure to implement disk striping; striping
or RAID technology can help immensely in reducing disk contention. While this type
of planning is time consuming, the payoff is well worth the effort in terms of
performance gains.

Memory and Processing

Memory and processing configuration is done in the init.ora file. Because each
database is different and requires an experienced DBA to analyze and tune it
for optimal performance, a standard set of parameters to optimize
PowerCenter is not practical and will probably never exist.

TIP: Changes made in the init.ora file take effect only after a restart of the
instance. Use svrmgr to issue the commands “shutdown” and “startup”
(or, if necessary, “shutdown immediate”) to the instance.

The settings presented here are those used in a 4-CPU AIX server running
Oracle 7.3.4, configured to use the parallel query option to facilitate parallel
processing of queries and indexes. We’ve also included the descriptions and
documentation from Oracle for each setting so that DBAs of other (non-
Oracle) systems can determine what the settings do in the Oracle
environment and configure their native database settings in a similar
fashion.

• HASH_AREA_SIZE = 16777216
o Default value: 2 times the value of SORT_AREA_SIZE
o Range of values: any integer
o This parameter specifies the maximum amount of memory, in
bytes, to be used for the hash join. If this parameter is not set,
its value defaults to twice the value of the SORT_AREA_SIZE
parameter.
o The value of this parameter can be changed without shutting
down the Oracle instance by using the ALTER SESSION
command. (Note: ALTER SESSION refers to the Database
Administration command issued at the svrmgr command
prompt.)

• Optimizer_percent_parallel=33
o This parameter defines the amount of parallelism that the
optimizer uses in its cost functions. The default of 0 means that
the optimizer chooses the best serial plan. A value of 100
means that the optimizer uses each object's degree of
parallelism in computing the cost of a full table scan operation.
o The value of this parameter can be changed without shutting
down the Oracle instance by using the ALTER SESSION
command. Low values favor indexes, while high values favor
table scans.
o Cost-based optimization is always used for queries that
reference an object with a nonzero degree of parallelism. For
such queries, a RULE hint or optimizer mode or goal is ignored.
Use of a FIRST_ROWS hint or optimizer mode overrides a
nonzero setting of OPTIMIZER_PERCENT_PARALLEL.

• parallel_max_servers=40
o Used to enable parallel query.
o Initially not set on Install.
o Maximum number of query servers or parallel recovery
processes for an instance.

• Parallel_min_servers=8
o Used to enable parallel query.
o Initially not set on Install.
o Minimum number of query server processes for an instance.
This is also the number of query server processes Oracle
creates when the instance is started.

• SORT_AREA_SIZE=8388608
o Default value: Operating system-dependent
o Minimum value: the value equivalent to two database blocks
o This parameter specifies the maximum amount, in bytes, of
Program Global Area (PGA) memory to use for a sort. After the
sort is complete and all that remains to do is to fetch the rows
out, the memory is released down to the size specified by
SORT_AREA_RETAINED_SIZE. After the last row is fetched out,
all memory is freed. The memory is released back to the PGA,
not to the operating system.
o Increasing SORT_AREA_SIZE improves the efficiency of
large sorts. Multiple allocations never exist; there is only one
memory area of SORT_AREA_SIZE for each user process at any
time.
o The default is usually adequate for most database operations.
However, if very large indexes are created, this parameter may
need to be adjusted. For example, if one process is doing all
database access, as in a full database import, then an increased
value for this parameter may speed the import, particularly the
CREATE INDEX statements.
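
As a quick arithmetic check on the values listed above (a sketch only; the helper names are hypothetical and nothing here queries Oracle): the explicit HASH_AREA_SIZE of 16777216 bytes is exactly twice the SORT_AREA_SIZE of 8388608 bytes, so it coincides with the documented default relationship, and the two settings correspond to 16 MB and 8 MB respectively.

```python
# Values taken from the example init.ora settings above (in bytes).
SORT_AREA_SIZE = 8388608
HASH_AREA_SIZE = 16777216

BYTES_PER_MB = 1024 * 1024


def default_hash_area_size(sort_area_size):
    """Documented default: twice the value of SORT_AREA_SIZE."""
    return 2 * sort_area_size


def to_megabytes(size_in_bytes):
    """Convert a byte count to whole megabytes (1 MB = 1024 * 1024 bytes)."""
    return size_in_bytes // BYTES_PER_MB


if __name__ == "__main__":
    print(to_megabytes(SORT_AREA_SIZE))   # 8
    print(to_megabytes(HASH_AREA_SIZE))   # 16
    print(default_hash_area_size(SORT_AREA_SIZE) == HASH_AREA_SIZE)  # True
```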

IPC as an Alternative to TCP/IP on UNIX

On an HP-UX server with Oracle as a target (i.e., PMServer and the Oracle target on
the same box), using an IPC connection can significantly reduce the time it takes to
build a lookup cache. In one case, a fact mapping that was using a lookup to get five
columns (including a foreign key) and about 500,000 rows from a table was taking
19 minutes. Changing the connection type to IPC reduced this to 45 seconds. In
another mapping, the total time decreased from 24 minutes to 8 minutes for a
500,000-row write (array inserts) of ~120-130 bytes/row, with a primary key and
unique index in place. Performance went from about 2 MB/min (280 rows/sec) to
about 10 MB/min (1360 rows/sec).

A normal tcp (network tcp/ip) connection in tnsnames.ora would look like this:

DW.armafix =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS =
        (PROTOCOL = TCP)
        (HOST = armafix)
        (PORT = 1526)
      )
    )
    (CONNECT_DATA = (SID = DW))
  )

Make a new entry in tnsnames.ora like this, and use it for connection to the local
Oracle instance:

DWIPC.armafix =
  (DESCRIPTION =
    (ADDRESS =
      (PROTOCOL = ipc)
      (KEY = DW)
    )
    (CONNECT_DATA = (SID = DW))
  )

Improving Data Load Performance

• Alternative to Dropping and Reloading Indexes

Dropping and reloading indexes during very large loads to a data warehouse
is often recommended, but there is seldom any easy way to do this. For
example, writing a SQL statement to drop each index, then writing another
SQL statement to rebuild it, can be a very tedious process.

Oracle 7 (and above) offers an alternative to dropping and rebuilding indexes
by allowing you to disable and re-enable existing indexes. Oracle stores the
name of each index in a table that can be queried. With this in mind, it is an
easy matter to write a SQL statement that queries this table, then generates
SQL statements as output to disable and enable these indexes.

Run the following to generate output to disable the foreign keys in the data
warehouse:

SELECT 'ALTER TABLE ' || OWNER || '.' || TABLE_NAME || ' DISABLE CONSTRAINT ' || CONSTRAINT_NAME || ' ;'
FROM USER_CONSTRAINTS
WHERE (TABLE_NAME LIKE '%DIM' OR TABLE_NAME LIKE '%FACT')
AND CONSTRAINT_TYPE = 'R'

This produces output that looks like:

ALTER TABLE MDDB_DEV.AGREEMENT_DIM DISABLE CONSTRAINT SYS_C0011077 ;

ALTER TABLE MDDB_DEV.AGREEMENT_DIM DISABLE CONSTRAINT SYS_C0011075 ;

ALTER TABLE MDDB_DEV.CUSTOMER_DIM DISABLE CONSTRAINT SYS_C0011060 ;

ALTER TABLE MDDB_DEV.CUSTOMER_DIM DISABLE CONSTRAINT SYS_C0011059 ;

ALTER TABLE MDDB_DEV.CUSTOMER_SALES_FACT DISABLE CONSTRAINT SYS_C0011133 ;

ALTER TABLE MDDB_DEV.CUSTOMER_SALES_FACT DISABLE CONSTRAINT SYS_C0011134 ;

ALTER TABLE MDDB_DEV.CUSTOMER_SALES_FACT DISABLE CONSTRAINT SYS_C0011131 ;

Dropping or disabling primary keys will also speed loads. Run the results of
this SQL statement after disabling the foreign key constraints:

SELECT 'ALTER TABLE ' || OWNER || '.' || TABLE_NAME || ' DISABLE PRIMARY KEY ;'
FROM USER_CONSTRAINTS
WHERE (TABLE_NAME LIKE '%DIM' OR TABLE_NAME LIKE '%FACT')
AND CONSTRAINT_TYPE = 'P'

This produces output that looks like:

ALTER TABLE MDDB_DEV.AGREEMENT_DIM DISABLE PRIMARY KEY ;

ALTER TABLE MDDB_DEV.CUSTOMER_DIM DISABLE PRIMARY KEY ;

ALTER TABLE MDDB_DEV.CUSTOMER_SALES_FACT DISABLE PRIMARY KEY ;

Finally, disable any unique constraints with the following:

SELECT 'ALTER TABLE ' || OWNER || '.' || TABLE_NAME || ' DISABLE CONSTRAINT ' || CONSTRAINT_NAME || ' ;'
FROM USER_CONSTRAINTS
WHERE (TABLE_NAME LIKE '%DIM' OR TABLE_NAME LIKE '%FACT')
AND CONSTRAINT_TYPE = 'U'

This produces output that looks like:

ALTER TABLE MDDB_DEV.CUSTOMER_DIM DISABLE CONSTRAINT SYS_C0011070 ;

ALTER TABLE MDDB_DEV.CUSTOMER_SALES_FACT DISABLE CONSTRAINT SYS_C0011071 ;

Save the results in a single file and name it something like ‘DISABLE.SQL’.

To re-enable the indexes, rerun these queries after replacing ‘DISABLE’ with
‘ENABLE.’ Save the results in another file with a name such as ‘ENABLE.SQL’
and run it as a post-session command.

Re-enable constraints in the reverse order that you disabled them. Re-enable
the unique constraints first, and re-enable primary keys before foreign keys.
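
The two rules above (replace ‘DISABLE’ with ‘ENABLE’, and re-enable in the reverse of the disabling order) can be sketched as a small script. This is a hedged illustration, not part of PowerCenter or Oracle: the function name is hypothetical, and it assumes the DISABLE statements are available as a list in the order they were generated (foreign keys, then primary keys, then unique constraints).

```python
def build_enable_script(disable_statements):
    """Turn DISABLE statements into ENABLE statements in reverse order."""
    enable_statements = []
    for statement in reversed(disable_statements):
        statement = statement.strip()
        if not statement:
            continue  # ignore blank lines in the saved script
        # Replace only the first standalone keyword; object names cannot
        # contain the surrounding spaces, so they are left untouched.
        enable_statements.append(statement.replace(" DISABLE ", " ENABLE ", 1))
    return enable_statements


if __name__ == "__main__":
    disable_sql = [
        "ALTER TABLE MDDB_DEV.CUSTOMER_DIM DISABLE CONSTRAINT SYS_C0011060 ;",  # foreign key
        "ALTER TABLE MDDB_DEV.CUSTOMER_DIM DISABLE PRIMARY KEY ;",              # primary key
        "ALTER TABLE MDDB_DEV.CUSTOMER_DIM DISABLE CONSTRAINT SYS_C0011070 ;",  # unique
    ]
    for line in build_enable_script(disable_sql):
        print(line)
```

Because the input lists foreign keys first and unique constraints last, reversing it yields unique constraints first, then primary keys, then foreign keys.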

TIP: Dropping or disabling foreign keys will often boost loading, but this also
slows queries (such as lookups) and updates. If you do not use lookups or
updates on your target tables you should get a boost by using this SQL
statement to generate scripts. If you use lookups and updates (especially on
large tables), you can exclude the index that will be used for the lookup from
your script. You may want to experiment to determine which method is
faster.

SQL*Loader

• Loader Options

SQL*Loader is a bulk loader utility used for moving data from external files
into the Oracle database. To use the Oracle bulk loader, you need a control
file, which specifies how data should be loaded into the database. SQL*Loader
has several options that can improve data loading performance and are easy
to implement. These options are:

• DIRECT
• PARALLEL
• SKIP_INDEX_MAINTENANCE
• UNRECOVERABLE

A control file normally has the following format:

LOAD DATA

INFILE <dataFile>

APPEND INTO TABLE <tableName>

FIELDS TERMINATED BY '<separator>'

(<list of all attribute names to load>)

To use any of these options, merely add ‘OPTIONS (OPTION = TRUE)’ to the
beginning of the control file, such as ‘OPTIONS (DIRECT = TRUE)’.

The CONVENTIONAL path is the default method for SQL*Loader. It performs like a
typical INSERT statement that updates indexes, fires triggers, and evaluates
constraints.

The DIRECT path obtains an exclusive lock on the table being loaded and
writes the data blocks directly to the database files, bypassing all SQL
processing. Note that no other users can write to the loading table due to this
exclusive lock and no SQL transformations can be made in the control file
during the load.

The PARALLEL option can be used with the DIRECT option when loading
multiple partitions of the same table. If the partitions are located on separate
disks, the total load time can be reduced to that of loading a single
partition.

If the CONVENTIONAL path must be used (for example, because transformations
are performed during the load), then you can bypass index updates
by using the SKIP_INDEX_MAINTENANCE option. You will have to rebuild the
indexes after the load, but overall performance may improve significantly.

The UNRECOVERABLE option in the control file allows you to skip redo log writes
during a DIRECT load. Recoverability should not be an issue since the
data file still exists.

The DIRECT option automatically disables CHECK and foreign key
REFERENCES constraints, but not PRIMARY KEY, UNIQUE KEY and NOT NULL
constraints. Disabling these constraints with the SQL scripts described earlier
will benefit performance when loading data into a target warehouse.
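
The control file layout described in this section, with an OPTIONS line placed at the beginning, can be sketched as a small generator. This is an illustration under stated assumptions: the function name, file name, table and column names are hypothetical; quoting the INFILE name is a choice made here; and only the TRUE/FALSE style options named in the text (such as DIRECT and PARALLEL) are handled.

```python
def render_control_file(data_file, table_name, separator, columns, options=None):
    """Render a SQL*Loader control file in the layout shown in this section.

    `options` maps option names (for example "DIRECT") to booleans and is
    written as a single OPTIONS line at the beginning of the file.
    """
    lines = []
    if options:
        pairs = ", ".join(
            f"{name} = {'TRUE' if enabled else 'FALSE'}"
            for name, enabled in options.items()
        )
        lines.append(f"OPTIONS ({pairs})")
    lines.append("LOAD DATA")
    lines.append(f"INFILE '{data_file}'")
    lines.append(f"APPEND INTO TABLE {table_name}")
    lines.append(f"FIELDS TERMINATED BY '{separator}'")
    lines.append("(" + ", ".join(columns) + ")")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical file, table, and column names, for illustration only.
    print(render_control_file(
        "sales.dat", "SALES_FACT", ",", ["SALE_ID", "AMOUNT"],
        options={"DIRECT": True},
    ))
```

Keeping the options in one place makes it easy to generate a CONVENTIONAL and a DIRECT variant of the same load and compare their run times.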

Loading Partitioned Sessions

To improve performance when loading data to an Oracle database using a partitioned
session, create the Oracle target table with the same number of partitions as the
session.

Optimizing Query Performance

• Oracle Bitmap Indexing

With version 7.3.x, Oracle added bitmap indexing to supplement the
traditional b-tree index. A b-tree index can greatly improve query
performance on data that has high cardinality or contains mostly unique
values, but is not much help for low cardinality/highly duplicated data and
may even increase query time. A typical example of a low cardinality field is
gender: it is either male or female (or possibly unknown). This kind of data
is an excellent candidate for a bitmap index, which can significantly improve
query performance.

Keep in mind, however, that b-tree indexing is still the Oracle default. If you
don’t specify an index type when creating an index, Oracle will default to b-
tree. Also note that for certain columns, bitmaps will be smaller and faster to
create than a b-tree index on the same column.

Bitmap indexes are suited to data warehousing because of their performance,
size, and the speed with which they can be created and dropped. Since most
dimension tables in a warehouse have nearly every column indexed, the space
savings are dramatic. But it is important to note that when a bitmap-indexed
column is updated, every row associated with that bitmap entry is locked,
making bitmap indexing a poor choice for OLTP database tables with constant
insert and update traffic. Also, bitmap indexes are rebuilt after each DML
statement (e.g., inserts and updates), which can make loads very slow. For this
reason, it is a good idea to drop or disable bitmap indexes prior to the load and
re-create or re-enable them after the load.

The relationship between Fact and Dimension keys is another example of low
cardinality. With a b-tree index on the Fact table, a query processes by
joining all the Dimension tables in a Cartesian product based on the WHERE
clause, then joins back to the Fact table. With a bitmapped index on the Fact
table, a ‘star query’ may be created that accesses the Fact table first followed
by the Dimension table joins, avoiding a Cartesian product of all possible
Dimension attributes. This ‘star query’ access method is only used if the
STAR_TRANSFORMATION_ENABLED parameter is equal to TRUE in the init.ora
file and if there are single column bitmapped indexes on the fact table
foreign keys. Creating bitmap indexes is similar to creating b-tree indexes. To
specify a bitmap index, add the word ‘bitmap’ between ‘create’ and ‘index’. All
other syntax is identical.

• Bitmap indexes:

drop index emp_active_bit;

drop index emp_gender_bit;

create bitmap index emp_active_bit on emp (active_flag);

create bitmap index emp_gender_bit on emp (gender);

• B-tree indexes:

drop index emp_active;

drop index emp_gender;

create index emp_active on emp (active_flag);

create index emp_gender on emp (gender);

Information for bitmap indexes is stored in the data dictionary in
dba_indexes, all_indexes, and user_indexes with the word ‘BITMAP’ in the
Uniqueness column rather than the word ‘UNIQUE.’ Bitmap indexes cannot
be unique.
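
The choice between the two index types discussed in this section can be sketched as a simple cardinality check that emits the corresponding CREATE statement. This is only an illustration: the function name is hypothetical, and the 1 percent threshold is an assumption made for the example, not an Oracle or Informatica recommendation. The right cut-off depends on the data and should be tested.

```python
def create_index_statement(index_name, table, column,
                           distinct_values, total_rows,
                           bitmap_threshold=0.01):
    """Build a CREATE INDEX statement, choosing bitmap for low cardinality.

    The column is treated as low cardinality when the ratio of distinct
    values to rows is at or below `bitmap_threshold` (an assumed value).
    """
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    ratio = distinct_values / total_rows
    kind = "bitmap " if ratio <= bitmap_threshold else ""
    return f"create {kind}index {index_name} on {table} ({column});"


if __name__ == "__main__":
    # Three distinct values in a million rows: low cardinality.
    print(create_index_statement("emp_gender_bit", "emp", "gender", 3, 1000000))
    # One distinct value per row: high cardinality, so a b-tree index.
    print(create_index_statement("emp_empno", "emp", "empno", 1000000, 1000000))
```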

To enable bitmap indexes, you must set the following items in the instance
initialization file:

• compatible = 7.3.2.0.0 # or higher
• event = "10111 trace name context forever"
• event = "10112 trace name context forever"
• event = "10114 trace name context forever"

Also note that the parallel query option must be installed in order to create
bitmap indexes. If you try to create bitmap indexes without the parallel query
option, a syntax error will appear in your SQL statement; the keyword
‘bitmap’ won't be recognized.

• TIP: To check if the parallel query option is installed, start and log into
SQL*Plus. If the parallel query option is installed, the word ‘parallel’ appears
in the banner text.

Index Statistics

• Table Method

Index statistics are used by Oracle to determine the best method to access
tables and should be updated periodically as part of normal DBA procedures.
Updating the table and index statistics for the data warehouse improves
query performance on Fact and Dimension tables (including when appending
and updating records).

The following SQL statement generates the ANALYZE commands for the tables
in the database:

SELECT 'ANALYZE TABLE ' || TABLE_NAME || ' COMPUTE STATISTICS;'

FROM USER_TABLES

WHERE (TABLE_NAME LIKE '%DIM' OR TABLE_NAME LIKE '%FACT')

This generates the following results:

ANALYZE TABLE CUSTOMER_DIM COMPUTE STATISTICS;

ANALYZE TABLE MARKET_DIM COMPUTE STATISTICS;

ANALYZE TABLE VENDOR_DIM COMPUTE STATISTICS;

The following SQL statement can be used to analyze the indexes in the
database:

SELECT 'ANALYZE INDEX ' || INDEX_NAME || ' COMPUTE STATISTICS;'

FROM USER_INDEXES

WHERE (TABLE_NAME LIKE '%DIM' OR TABLE_NAME LIKE '%FACT')

This generates the following results:

ANALYZE INDEX SYS_C0011125 COMPUTE STATISTICS;

ANALYZE INDEX SYS_C0011119 COMPUTE STATISTICS;

ANALYZE INDEX SYS_C0011105 COMPUTE STATISTICS;

Save these results as a SQL script to be executed before or after a load.

• Schema Method

Another way to update index statistics is to compute statistics by schema
rather than by table. If data warehouse indexes are the only indexes located
in a single schema, then you can use the following command to update the
statistics:

EXECUTE SYS.DBMS_UTILITY.Analyze_Schema ('BDB', 'compute');

In this example, BDB is the schema for which the statistics should be
updated. Note that the DBA must grant the execution privilege for
dbms_utility to the database user executing this command.

TIP: These SQL statements can be very resource intensive, especially for
very large tables. For this reason, we recommend running them at off-peak
times when no other process is using the database. If you find the exact
computation of the statistics consumes too much time, it is often acceptable
to estimate the statistics rather than compute them. Use ‘estimate’ instead of
‘compute’ in the above examples.
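
The statement-generation step above can also be sketched outside the database. The following Python sketch is illustrative only: the function name is hypothetical, the table names are supplied by the caller rather than read from USER_TABLES, and the DIM/FACT suffix filter simply mirrors the LIKE patterns shown earlier.

```python
def build_analyze_table_statements(table_names, mode="COMPUTE"):
    """Return ANALYZE TABLE statements for tables whose names end in DIM or FACT.

    mode is either "COMPUTE" (exact) or "ESTIMATE" (cheaper, approximate),
    matching the two options discussed in the text.
    """
    if mode not in ("COMPUTE", "ESTIMATE"):
        raise ValueError("mode must be COMPUTE or ESTIMATE")
    return [
        f"ANALYZE TABLE {name} {mode} STATISTICS;"
        for name in table_names
        if name.endswith(("DIM", "FACT"))  # same filter as the LIKE patterns
    ]


if __name__ == "__main__":
    # Hypothetical table list; ORDERS is skipped because it matches neither suffix.
    for stmt in build_analyze_table_statements(
        ["CUSTOMER_DIM", "ORDERS", "ORDER_FACT"], mode="ESTIMATE"
    ):
        print(stmt)
```

The output can be saved as a SQL script in the same way as the results of the generating query.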

Parallelism

Parallel execution can be implemented at the SQL statement, database object, or
instance level for many SQL operations. The degree of parallelism should be
identified based on the number of processors and disk drives on the server, with the
number of processors being the minimum degree.

• SQL Level Parallelism

Hints are used to define parallelism at the SQL statement level. The following
examples demonstrate how to utilize four processors:

SELECT /*+ PARALLEL(order_fact,4) */ …;

SELECT /*+ PARALLEL_INDEX(order_fact, order_fact_ixl,4) */ …;

TIP: When using a table alias in the SQL Statement, be sure to use this alias
in the hint. Otherwise, the hint will not be used, and you will not receive an
error message.

Example of improper use of alias:

SELECT /*+PARALLEL (EMP, 4) */ EMPNO, ENAME
FROM EMP A

Here, the parallel hint will not be used because the alias “A” is used for table
EMP. The correct way is:

SELECT /*+PARALLEL (A, 4) */ EMPNO, ENAME
FROM EMP A

• Table Level Parallelism

Parallelism can also be defined at the table and index level. The following
example demonstrates how to set a table’s degree of parallelism to four for all
eligible SQL statements on this table:

ALTER TABLE order_fact PARALLEL 4;

Ensure that Oracle is not contending with other processes for these resources
or you may end up with degraded performance due to resource contention.

Additional Tips

• Executing Oracle SQL Scripts as Pre- and Post-Session Commands on UNIX

You can execute queries as both pre- and post-session commands. For a
UNIX environment, the format of the command is:

sqlplus -s user_id/password@database @script_name.sql

For example, to execute the ENABLE.SQL file created earlier (assuming the
data warehouse is on a database named ‘infadb’), you would execute the
following as a post-session command:

sqlplus -s pmuser/pmuser@infadb @/informatica/powercenter/Scripts/ENABLE.SQL

In some environments, this may be a security issue since both username and
password are hard-coded and unencrypted. To avoid this, use the operating
system’s authentication to log onto the database instance.

In the following example, the Informatica id “pmuser” is used to log onto the
Oracle database. Create the Oracle user “pmuser” with the following SQL
statement:

CREATE USER PMUSER IDENTIFIED EXTERNALLY
DEFAULT TABLESPACE . . .
TEMPORARY TABLESPACE . . .

In the following pre-session command, “pmuser” (the id Informatica is logged
onto the operating system as) is automatically passed from the operating
system to the database and used to execute the script:

sqlplus -s /@infadb @/informatica/powercenter/Scripts/ENABLE.SQL

You may want to use the init.ora parameter “os_authent_prefix” to
distinguish between “normal” Oracle users and externally identified ones.
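
The two command forms above (explicit credentials versus operating system authentication) can be sketched as a small helper that only assembles the argument list. This is a minimal sketch: the function name is hypothetical, and it does not launch sqlplus itself.

```python
def build_sqlplus_command(script_path, database, user=None, password=None):
    """Assemble the sqlplus argument list used for a pre- or post-session command.

    When no user is given, the "/@database" logon form is used so that the
    operating system account is passed to the database, as described above.
    """
    if user is None:
        logon = f"/@{database}"
    else:
        logon = f"{user}/{password}@{database}"
    return ["sqlplus", "-s", logon, f"@{script_path}"]


if __name__ == "__main__":
    # Prints the OS-authenticated form of the ENABLE.SQL example.
    print(" ".join(build_sqlplus_command(
        "/informatica/powercenter/Scripts/ENABLE.SQL", "infadb")))
```

The returned list could be handed to `subprocess.run(...)` on a machine where sqlplus is installed; that step is left out here because it depends on the local environment.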

• DRIVING_SITE ‘Hint’

If the source and target are on separate instances, the Source Qualifier
transformation should be executed on the target instance.

For example, suppose you want to join two source tables (A and B), which
may reduce the number of selected rows. By default, however, Oracle fetches all
of the data from both tables, moves the data across the network to the target
instance, and then processes everything on the target instance. If either data
source is large, this causes a great deal of network traffic. To force the Oracle
optimizer to process the join on the source instance, use the ‘Generate SQL’
option in the source qualifier and include the ‘driving_site’ hint, naming a
table that resides on the instance where the join should run, in the SQL
statement as:

SELECT /*+ DRIVING_SITE(A) */ …;

SQL Server

Description

Proper tuning of the source and target database is a very important
consideration for the scalability and usability of a business analytical
environment. Managing performance on a SQL Server encompasses the
following points.

• Manage system memory usage (RAM caching)
• Create and maintain good indexes
• Partition large data sets and indexes
• Monitor disk I/O subsystem performance
• Tune applications and queries
• Optimize active data

Manage RAM Caching

Managing random access memory (RAM) buffer cache is a major consideration in any
database server environment. Accessing data in RAM cache is much faster than
accessing the same information from disk. If database I/O (input/output operations
to the physical disk subsystem) can be reduced to the minimal required set of data
and index pages, these pages will stay in RAM longer. Too much unneeded data and
index information flowing into buffer cache quickly pushes out valuable pages. The
primary goal of performance tuning is to reduce I/O so that buffer cache is best
utilized.

Several settings in SQL Server can be adjusted to take advantage of SQL Server
RAM usage:

• Max async I/O is used to specify the number of simultaneous disk I/O
operations that SQL Server can submit to the operating system. Note
that this setting is automated in SQL Server 2000.
• SQL Server allows several selectable models for database recovery.
These include:

- Full Recovery

- Bulk-Logged Recovery

- Simple Recovery

Cost Threshold for Parallelism Option

Use this option to specify the threshold where SQL Server creates and executes
parallel plans. SQL Server creates and executes a parallel plan for a query only when
the estimated cost to execute a serial plan for the same query is higher than the
value set in cost threshold for parallelism. The cost refers to an estimated elapsed
time in seconds required to execute the serial plan on a specific hardware
configuration. Only set cost threshold for parallelism on symmetric multiprocessors
(SMP).

Max Degree of Parallelism Option

Use this option to limit the number of processors (a max of 32) to use in parallel plan
execution. The default value is 0, which uses the actual number of available CPUs.
Set this option to 1 to suppress parallel plan generation. Set the value to a number
greater than 1 to restrict the maximum number of processors used by a single query
execution.

Priority Boost Option

Use this option to specify whether SQL Server should run at a higher scheduling
priority than other processes on the same computer. If you set this option to 1,
SQL Server runs at a priority base of 13. The default is 0, which is a priority base of
7.

Set Working Set Size Option

Use this option to reserve physical memory space for SQL Server that is equal to the
server memory setting. The server memory setting is configured automatically by
SQL Server based on workload and available resources. It will vary dynamically
between min server memory and max server memory. Setting ‘set working set size’
means the operating system will not attempt to swap out SQL Server pages even if
they can be used more readily by another process when SQL Server is idle.

Optimizing Disk I/O Performance

When configuring a SQL Server that will contain only a few gigabytes of data
and not sustain heavy read or write activity, you need not be particularly
concerned with the subject of disk I/O and balancing of SQL Server I/O
activity across hard drives for maximum performance. To build larger SQL
Server databases, however, which will contain hundreds of gigabytes or even
terabytes of data and/or sustain heavy read/write activity (as in a
DSS application), it is necessary to design the configuration around maximizing
SQL Server disk I/O performance by load-balancing across multiple hard
drives.

Partitioning for Performance

For SQL Server databases that are stored on multiple disk drives,
performance can be improved by partitioning the data to increase the amount
of disk I/O parallelism.

Partitioning can be done using a variety of techniques. Methods for creating
and managing partitions include configuring your storage subsystem (i.e.,
disk, RAID partitioning) and applying various data configuration mechanisms
in SQL Server such as files, file groups, tables and views. Some possible
candidates for partitioning include:

• Transaction log
• Tempdb
• Database
• Tables
• Non-clustered indexes

Using bcp and BULK INSERT

Two mechanisms exist inside SQL Server to address the need for bulk movement of
data. The first mechanism is the bcp utility. The second is the BULK INSERT
statement.

• bcp is a command prompt utility that copies data into or out of SQL Server.
• BULK INSERT is a Transact-SQL statement that can be executed from within
the database environment. Unlike bcp, BULK INSERT can only pull data into
SQL Server. An advantage of using BULK INSERT is that it can copy data into
instances of SQL Server using a Transact-SQL statement, rather than having
to shell out to the command prompt.

TIP: Both of these mechanisms enable you to exercise control over the batch size.
Unless you are working with small volumes of data, it is good to get in the habit of
specifying a batch size for recoverability reasons. If none is specified, SQL Server
commits all rows to be loaded as a single batch. For example, you attempt to load
1,000,000 rows of new data into a table. The server suddenly loses power just as it
finishes processing row number 999,999. When the server recovers, those 999,999
rows will need to be rolled back out of the database before you attempt to reload the
data. By specifying a batch size of 10,000 you could have saved significant recovery
time, because SQL Server would only have had to roll back 9,999 rows instead of
999,999.
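
The arithmetic in the tip above can be checked with a short sketch. It assumes that every completed batch is committed and that only the rows of the unfinished batch are rolled back; the function name is hypothetical.

```python
def rows_to_roll_back(rows_processed, batch_size=None):
    """Rows that must be rolled back if the load fails after rows_processed rows.

    With no batch size the whole load is a single batch, so every processed
    row is rolled back. Otherwise only the rows of the incomplete batch are.
    """
    if batch_size is None:
        return rows_processed
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return rows_processed % batch_size


if __name__ == "__main__":
    print(rows_to_roll_back(999_999))          # no batch size: 999999
    print(rows_to_roll_back(999_999, 10_000))  # batch size 10,000: 9999
```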

General Guidelines for Initial Data Loads

While loading data:

• Remove indexes
• Use BULK INSERT or bcp
• Parallel load using partitioned data files into partitioned tables
• Run one load stream for each available CPU
• Set Bulk-Logged or Simple Recovery model
• Use TABLOCK option

After loading data:

• Create indexes
• Switch to the appropriate recovery model
• Perform backups

General Guidelines for Incremental Data Loads

• Load data with indexes in place.
• Performance and concurrency requirements should determine locking
granularity (sp_indexoption).
• Change from Full to Bulk-Logged Recovery mode unless there is an overriding
need to preserve point-in-time recovery, such as online users modifying the
database during bulk loads. Read operations should not affect bulk loads.

Teradata

Description

Teradata offers several bulk load utilities including FastLoad, MultiLoad, and TPump.
FastLoad is used for loading inserts into an empty table. One of TPump’s advantages
is that it does not lock the table that is being loaded. MultiLoad supports inserts,
updates, deletes, and “upserts” to any table. This best practice will focus on
MultiLoad since PowerCenter 5.x can auto-generate MultiLoad scripts and invoke the
MultiLoad utility per PowerCenter target.

Tuning MultiLoad

There are many aspects to tuning a Teradata database. With PowerCenter 5.x
several aspects of tuning can be controlled by setting MultiLoad parameters to
maximize write throughput. Other areas to analyze when performing a MultiLoad job
include estimating space requirements and monitoring MultiLoad performance.

Note: In PowerCenter 5.1, the Informatica server transfers data via a UNIX named
pipe to MultiLoad, whereas in PowerCenter 5.0, the data is first written to file.

MultiLoad Parameters

With PowerCenter 5.x, you can auto-generate MultiLoad scripts. This not only
enhances development, but also allows you to set performance options. Here are the
MultiLoad-specific parameters that are available in PowerCenter:

• TDPID. A client-based operand that is part of the logon string.
• Date Format. Ensure that the date format used in your target flat file is
equivalent to the date format parameter in your MultiLoad script. Also
validate that your date format is compatible with the date format specified in
the Teradata database.
• Checkpoint. A checkpoint interval is similar to a commit interval for other
databases. When you set the checkpoint value to less than 60, it represents
the interval in minutes between checkpoint operations. If the checkpoint is
set to a value greater than 60, it represents the number of records to write
before performing a checkpoint operation. To maximize write speed to the
database, try to limit the number of checkpoint operations that are
performed.
• Tenacity. Interval in hours between MultiLoad attempts to log on to the
database when the maximum number of sessions are already running.
• Load Mode. Available load methods include Insert, Update, Delete, and
Upsert. Consider creating separate external loader connections for each
method, selecting the one that will be most efficient for each target table.
• Drop Error Tables. Allows you to specify whether to drop or retain the three
error tables for a MultiLoad session. Set this parameter to 1 to drop error
tables or 0 to retain error tables.
• Max Sessions. Available only in PowerCenter 5.1, this parameter specifies
the maximum number of sessions that are allowed to log on to the database.
This value should not exceed one per working AMP (Access Module Processor).
• Sleep. Available only in PowerCenter 5.1, this parameter specifies the
number of minutes that MultiLoad waits before retrying a logon operation.
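
The Checkpoint rule in the list above (small values are minutes, large values are record counts) can be sketched as follows. The function name is hypothetical, and treating a value of exactly 60 as a record count is an assumption made for this sketch.

```python
def describe_checkpoint(value):
    """Interpret a MultiLoad checkpoint value as described in the parameter list.

    Values below 60 are read as an interval in minutes; larger values are
    read as a number of records between checkpoints. Treating exactly 60 as
    a record count is an assumption of this sketch.
    """
    if value < 0:
        raise ValueError("checkpoint value must not be negative")
    if value < 60:
        return ("minutes", value)
    return ("records", value)


if __name__ == "__main__":
    print(describe_checkpoint(30))      # interval in minutes
    print(describe_checkpoint(50_000))  # number of records
```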

Estimating Space Requirements for MultiLoad Jobs

Always estimate the final size of your MultiLoad target tables and make sure the
destination has enough space to complete your MultiLoad job. In addition to the
space that may be required by target tables, each MultiLoad job needs permanent
space for:

• Work tables
• Error tables
• Restart Log table

Note: Spool space cannot be used for MultiLoad work tables, error tables, or the
restart log table. Spool space is freed at each restart. By using permanent space for
the MultiLoad tables, data is preserved for restart operations after a system failure.
Work tables, in particular, require a lot of extra permanent space. Also remember to
account for the size of error tables since error tables are generated for each target
table.

Use the following formula to prepare the preliminary space estimate for one target
table, assuming no fallback protection, no journals, and no non-unique secondary
indexes:

PERM = (using data size + 38) x (number of rows processed) x (number of apply
conditions satisfied) x (number of Teradata SQL statements within the applied DML)

Make adjustments to your preliminary space estimates according to the
requirements and expectations of your MultiLoad job.
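
As a worked example of the PERM formula above, the following sketch multiplies the four factors. The function and parameter names are hypothetical, the input figures are made up for illustration, and the result is assumed to be in the same unit as the using data size (bytes here).

```python
def estimate_perm_space(using_data_size, rows_processed,
                        apply_conditions_satisfied, sql_statements_in_dml):
    """Preliminary PERM estimate for one MultiLoad target table.

    Implements: (using data size + 38) x rows processed
                x apply conditions satisfied x SQL statements in the applied DML.
    Assumes no fallback protection, no journals, and no NUSIs, as in the text.
    """
    return ((using_data_size + 38) * rows_processed
            * apply_conditions_satisfied * sql_statements_in_dml)


if __name__ == "__main__":
    # Illustrative figures only: 62-byte rows, 1,000,000 rows,
    # one apply condition, two SQL statements in the DML.
    print(estimate_perm_space(62, 1_000_000, 1, 2))  # 200000000
```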

Monitoring MultiLoad Performance

Here are some tips for analyzing MultiLoad performance:

1. Determine which phase of the MultiLoad job is causing poor performance.

• If the performance bottleneck is during the acquisition phase, as data is
acquired from the client system, then the issue may be with the client
system. If it is during the application phase, as data is applied to the target
tables, then the issue is not likely to be with the client system.
• The MultiLoad job output lists the job phases and other useful information.
Save these listings for evaluation.

2. Use the Teradata RDBMS Query Session utility to monitor the progress of the
MultiLoad job.

3. Check for locks on the MultiLoad target tables and error tables.

4. Check the DBC.Resusage table for problem areas, such as data bus or CPU
capacities at or near 100 percent for one or more processors.

5. Determine whether the target tables have non-unique secondary indexes
(NUSIs). NUSIs degrade MultiLoad performance because the utility builds a separate
NUSI change row to be applied to each NUSI sub-table after all of the rows have
been applied to the primary table.

6. Check the size of the error tables. Write operations to the fallback error tables are
performed at normal SQL speed, which is much slower than normal MultiLoad tasks.

7. Verify that the primary index is unique. Non-unique primary indexes can cause
severe MultiLoad performance problems.

Performance Tuning UNIX Systems

Challenge

The following tips have proven useful in performance tuning UNIX-based machines.
While some of these tips will be more helpful than others in a particular environment,
all are worthy of consideration.

Description

Running ps -axu

Run ps -axu to check for the following items:

• Are there any processes waiting for disk access or for paging? If so check the
I/O and memory subsystems.
• What processes are using most of the CPU? This may help you distribute the
workload better.
• What processes are using most of the memory? This may help you distribute
the workload better.
• Does ps show that your system is running many memory-intensive jobs? Look
for jobs with a large resident set size (RSS) or a high storage integral.

Identifying and Resolving Memory Issues

Use vmstat or sar to check swapping actions. Check the system to ensure that
swapping does not occur at any time during the session processing. By using sar 5
10 or vmstat 1 10, you can get a snapshot of page swapping. If page swapping does
occur at any time, increase memory to prevent swapping. Swapping, on any
database system, causes a major performance decrease and increased I/O. On a
memory-starved and I/O-bound server, this can effectively shut down the
PowerCenter process and any databases running on the server.

Some swapping will normally occur regardless of the tuning settings. This occurs
because some processes use the swap space by their design. To check swap space
availability, use pstat and swap. If the swap space is too small for the intended
applications, it should be increased.

Run vmstat 5 (sar -wpgr) or, for SunOS, vmstat -S 5 to detect and confirm
memory problems, and check for the following:

• Are page-outs occurring consistently? If so, you are short of memory.
• Are there a high number of address translation faults? (System V only) This
suggests a memory shortage.
• Are swap-outs occurring consistently? If so, you are extremely short of
memory. Occasional swap-outs are normal; BSD systems swap out inactive
jobs. Long bursts of swap-outs mean that active jobs are probably falling
victim and indicate extreme memory shortage. If you don’t have vmstat -S,
look at the w and de fields of vmstat. These should ALWAYS be zero.

If memory seems to be the bottleneck of the system, try the following remedial steps:

• Reduce the size of the buffer cache, if your system has one, by decreasing
BUFPAGES. The buffer cache is not used in System V.4 and SunOS 4.X
systems. Making the buffer cache smaller will hurt disk I/O performance.
• If you have statically allocated STREAMS buffers, reduce the number of large
(2048- and 4096-byte) buffers. This may reduce network performance, but
netstat -m should give you an idea of how many buffers you really need.
• Reduce the size of your kernel’s tables. This may limit the system’s capacity
(number of files, number of processes, etc.).
• Try running jobs requiring a lot of memory at night. This may not help the
memory problems, but you may not care about them as much.
• Try running jobs requiring a lot of memory in a batch queue. If only one
memory-intensive job is running at a time, your system may perform
satisfactorily.
• Try to limit the time spent running sendmail, which is a memory hog.
• If you don’t see any significant improvement, add more memory.

Identifying and Resolving Disk I/O Issues

Use iostat to check I/O load and utilization, as well as CPU load. iostat can be
used to monitor the I/O load on the disks on the UNIX server, including the load
on specific disks. Take notice of how fairly disk activity is
distributed among the system disks. If it is not evenly distributed, are the most
active disks also the fastest disks?

Run sadp to get a seek histogram of disk activity. Is activity concentrated in one
area of the disk (good), spread evenly across the disk (tolerable), or in two well-
defined peaks at opposite ends (bad)?

• Reorganize your file systems and disks to distribute I/O activity as evenly as
possible.
• Using symbolic links helps to keep the directory structure the same
throughout while still moving the data files that are causing I/O contention.
• Use your fastest disk drive and controller for your root filesystem; this will
almost certainly have the heaviest activity. Alternatively, if single-file
throughput is important, put performance-critical files into one filesystem and
use the fastest drive for that filesystem.
• Put performance-critical files on a filesystem with a large block size: 16KB or
32KB (BSD).

• Increase the size of the buffer cache by increasing BUFPAGES (BSD). This
may hurt your system’s memory performance.
• Rebuild your file systems periodically to eliminate fragmentation (backup,
build a new filesystem, and restore).
• If you are using NFS and using remote files, look at your network situation.
You don’t have local disk I/O problems.
• Check memory statistics again by running vmstat 5 (sar -rwpg). If your
system is paging or swapping consistently, you have memory problems; fix
the memory problem first. Swapping makes performance worse.

If your system has a disk capacity problem and is constantly running out of
disk space, try the following actions:

• Write a find script that detects old core dumps, editor backup and auto-save
files, and other trash and deletes it automatically. Run the script through
cron.
• If you are running BSD UNIX or V.4, use the disk quota system to prevent
individual users from gathering too much storage.
• Use a smaller block size on file systems that are mostly small files (e.g.,
source code files, object modules, and small data files).
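
The cleanup script suggested in the first item above could be sketched as follows. This is a minimal sketch, not a definitive tool: the file name patterns, the seven-day age limit, and the function names are assumptions, and listing is kept separate from deletion so the result can be reviewed before anything is removed.

```python
import fnmatch
import os
import time

# Assumed patterns for core dumps, editor backups and auto-save files.
TRASH_PATTERNS = ("core", "*~", "#*#", "*.bak")


def find_stale_files(root, max_age_days=7, patterns=TRASH_PATTERNS):
    """Return a sorted list of files under root that match a trash pattern
    and were last modified more than max_age_days ago."""
    cutoff = time.time() - max_age_days * 86400
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not any(fnmatch.fnmatch(name, p) for p in patterns):
                continue
            path = os.path.join(dirpath, name)
            try:
                if os.lstat(path).st_mtime < cutoff:
                    stale.append(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
    return sorted(stale)


def delete_files(paths):
    """Delete the given files; call only after reviewing the list."""
    for path in paths:
        os.remove(path)


if __name__ == "__main__":
    for path in find_stale_files("."):
        print(path)  # dry run: list candidates without deleting
```

Run from cron, a script like this would call `delete_files` on the list once the patterns have been confirmed as safe for the file systems in question.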

Identifying and Resolving CPU Overload Issues

Use sar -u to check for CPU loading. This provides the %usr (user), %sys
(system), %wio (waiting on I/O), and %idle (% of idle time). A target goal should be
%usr + %sys = 80 and %wio = 10, leaving %idle at 10. If %wio is higher, the disk
and I/O contention should be investigated to eliminate the I/O bottleneck on the UNIX
server. If the system shows a heavy %sys load together with a high %idle, this is
indicative of memory contention and swapping/paging problems. In this case, it is
necessary to make memory changes to reduce the load on the system server.
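
The target figures above can be turned into a simple check on one set of sar -u readings. This is a sketch only: the function name and finding labels are hypothetical, the values are assumed to have been read from sar output by hand or by another script, and the thresholds are the 80 and 10 percent targets quoted in the text.

```python
def assess_cpu_sample(usr, sys_, wio):
    """Compare one sar -u sample against the targets in the text.

    Flags I/O wait above 10 percent and combined user + system time above
    80 percent. Returns a list of finding labels (empty if on target).
    """
    findings = []
    if wio > 10:
        findings.append("investigate disk and I/O contention")
    if usr + sys_ > 80:
        findings.append("CPU busier than the 80 percent target")
    return findings


if __name__ == "__main__":
    print(assess_cpu_sample(usr=60, sys_=20, wio=10))  # on target: []
    print(assess_cpu_sample(usr=50, sys_=20, wio=25))  # I/O wait too high
```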

When you run iostat 5 above, also observe the CPU idle time. Is the idle time
always 0, without letup? It is good for the CPU to be busy, but if it is always busy
100 percent of the time, work must be piling up somewhere. This points to CPU
overload.

• Eliminate unnecessary daemon processes. rwhod and routed are particularly
likely to be performance problems, but any savings will help.
• Get users to run jobs at night with at or any other queuing system that’s
available. You may not care if the CPU (or the memory or I/O system)
is overloaded at night, provided the work is done in the morning.
• Using nice to lower the priority of CPU-bound jobs will improve interactive
performance. Also, using nice to raise the priority of CPU-bound jobs will
expedite them but will hurt interactive performance. In general though, using
nice is really only a temporary solution. If your workload grows, it will soon
become insufficient. Consider upgrading your system, replacing it, or buying
another system to share the load.

Identifying and Resolving Network Issues

You can suspect problems with network capacity or with data integrity if users
experience slow performance when they are using rlogin or when they are accessing
files via NFS.

Look at netstat -i. If the number of collisions is large, suspect an overloaded
network. If the number of input or output errors is large, suspect hardware
problems. A large number of input errors indicates problems somewhere on the
network. A large number of output errors suggests problems with your system and
its interface to the network.

If collisions and network hardware are not a problem, figure out which system
appears to be slow. Use spray to send a large burst of packets to the slow system.
If the number of dropped packets is large, the remote system most likely cannot
respond to incoming data fast enough. Look to see if there are CPU, memory or disk
I/O problems on the remote system. If not, the system may just not be able to
tolerate heavy network workloads. Try to reorganize the network so that this system
isn’t a file server.

A large number of dropped packets may also indicate data corruption. Run
netstat -s on the remote system, then spray the remote system from the local
system and run netstat -s again. If the increase of UDP socket full drops (as indicated
by netstat) is equal to or greater than the number of dropped packets that spray
reports, the remote system is a slow network server. If the increase of socket full drops
is less than the number of dropped packets, look for network errors.
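
The comparison described above reduces to one subtraction and one test. In this sketch the three counts are assumed to have been read manually from the two netstat -s runs and from the spray report; the function name and return strings are hypothetical.

```python
def diagnose_spray(socket_full_drops_before, socket_full_drops_after,
                   spray_dropped_packets):
    """Apply the rule from the text to counts gathered around a spray run.

    If the increase in UDP socket full drops accounts for all packets that
    spray reports as dropped, the remote system is a slow network server;
    otherwise the remaining drops point to network errors.
    """
    increase = socket_full_drops_after - socket_full_drops_before
    if increase >= spray_dropped_packets:
        return "remote system is a slow network server"
    return "look for network errors"


if __name__ == "__main__":
    # Illustrative counts only.
    print(diagnose_spray(100, 400, 250))
    print(diagnose_spray(100, 150, 250))
```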

Run nfsstat and look at the client RPC data. If the retrans field is more than 5
percent of calls, the network or an NFS server is overloaded. If timeout is high, at
least one NFS server is overloaded, the network may be faulty, or one or more
servers may have crashed. If badxid is roughly equal to timeout, at least one NFS
server is overloaded. If timeout and retrans are high, but badxid is low, some part of
the network between the NFS client and server is overloaded and dropping packets.
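
The 5 percent retransmission rule above is the only one of these checks with an exact threshold, so it is the one sketched here. The function name is hypothetical, and the calls and retrans values are assumed to have been copied from the client RPC section of the nfsstat output.

```python
def retrans_exceeds_limit(calls, retrans, limit=0.05):
    """Return True when retransmissions are more than `limit` of all calls.

    The default limit of 0.05 is the 5 percent figure quoted in the text.
    """
    if calls <= 0:
        raise ValueError("calls must be positive")
    return retrans / calls > limit


if __name__ == "__main__":
    # Illustrative counts only.
    print(retrans_exceeds_limit(calls=10_000, retrans=600))  # 6 percent
    print(retrans_exceeds_limit(calls=10_000, retrans=400))  # 4 percent
```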

Try to prevent users from running I/O-intensive programs across the
network. The grep utility is a good example of an I/O-intensive program. Instead,
have users log into the remote system to do their work.

Reorganize the computers and disks on your network so that as many users as
possible can do as much work as possible on a local system.

Use systems with good network performance as file servers.

If you are short of STREAMS data buffers and are running Sun OS 4.0 or System
V.3 (or earlier), reconfigure the kernel with more buffers.

General Tips and Summary of Other Useful Commands

• Use dirs instead of pwd.
• Avoid ps.
• If you use sh, avoid long search paths.
• Minimize the number of files per directory.
• Use vi or a native window editor rather than emacs.

• Use egrep rather than grep: it’s faster.
• Don’t run grep or other I/O-intensive applications across NFS.
• Use rlogin rather than NFS to access files on remote systems.

lsattr -E -l sys0 is used to determine some current settings on most UNIX
environments. Of particular interest is maxuproc, the setting that determines
the maximum number of user background processes. On most UNIX
environments, this defaults to 40 but should be increased to 250 on most
systems.

Avoid raw devices. In general, proprietary file systems from the UNIX vendor are
most efficient and well suited for database work when tuned properly. Be sure to
check the database vendor documentation to determine the best file system for the
specific machine. Typical choices include: s5, The UNIX System V File System; ufs,
The “UNIX File System” derived from Berkeley (BSD); vxfs, The Veritas File System;
and lastly raw devices that, in reality are not a file system at all.

Use the pmprocs utility (a PowerCenter utility) to view the current Informatica
processes. For example:

harmon 125: pmprocs

<------------ Current PowerMart processes --------------->

UID       PID PPID  C    STIME TTY TIME CMD
powermar 2711 1421 16 18:13:11 ?   0:07 dtm pmserver.cfg 0 202 -289406976
powermar 2713 2711 11 18:13:17 ?   0:05 dtm pmserver.cfg 0 202 -289406976
powermar 1421    1  1 08:39:19 ?   1:30 pmserver
powermar 2712 2711 17 18:13:17 ?   0:08 dtm pmserver.cfg 0 202 -289406976
powermar 2714 1421 11 18:13:20 ?   0:04 dtm pmserver.cfg 1 202 -289406976
powermar 2721 2714 12 18:13:27 ?   0:04 dtm pmserver.cfg 1 202 -289406976
powermar 2722 2714  8 18:13:27 ?   0:02 dtm pmserver.cfg 1 202 -289406976

<------------ Current Shared Memory Resources --------------->

IPC status from <running system> as of Tue Feb 16 18:13:55 1999
T   ID KEY        MODE        OWNER    GROUP    SEGSZ CPID LPID
Shared Memory:
m    0 0x094e64a5 --rw-rw---- oracle   dba   20979712 1254 1273
m    1 0x0927e9b2 --rw-rw---- oradba   dba   21749760 1331 2478
m  202 00000000   --rw------- powermar pm4    5000000 1421 2714
m 8003 00000000   --rw------- powermar pm4   25000000 2711 2711
m    4 00000000   --rw------- powermar pm4   25000000 2714 2714

<------------ Current Semaphore Resources --------------->

There are 19 Semaphores held by PowerMart processes

• pmprocs is a script that combines the ps and ipcs commands
• Only available for UNIX
• CPID - Creator PID
• LPID - Last PID that accessed the resource
• Semaphores - used to sync the reader and writer
• 0 or 1 - shows slot in LM shared memory

Finally, when tuning UNIX environments, the general rule of thumb is to tune the
server for a major database system. Most database systems provide a special tuning
supplement for each specific version of UNIX. For example, there is a specific IBM
Redbook for Oracle 7.3 running on AIX 4.3. Because PowerCenter processes data in
a similar fashion as SMP databases, by tuning the server to support the database,
you also tune the system for PowerCenter.

References: System Performance Tuning by Mike Loukides (O’Reilly) is the main
reference book for this Best Practice. For detailed information on each of
the parameters discussed here, and much more on performance tuning of
applications running on UNIX-based systems, refer to this book.

Performance Tuning Windows NT/2000 Systems

Challenge

The Microsoft Windows NT/2000 environment is easier to tune than UNIX
environments, but offers limited performance options. NT is considered a
“self-tuning” operating system because it attempts to configure and tune memory
to the best of its ability. However, this does not mean that the NT system
administrator is entirely free from performance improvement responsibilities.

The following tips have proven useful in performance tuning NT-based machines.
While some are likely to be more helpful than others in any particular environment,
all are worthy of consideration.

Note: Tuning is essentially the same for both NT and 2000 based systems, with
differences for Windows 2000 noted in the last section.

Description

The two places to begin when tuning an NT server are:

• The Performance Monitor.
• The Performance tab (press Ctrl+Alt+Del, choose Task Manager, and click the
Performance tab).

When using the Performance Monitor, check these performance indicators:

Processor: percent processor time. For SMP environments you need to add one
monitor for each CPU. If the system is “maxed out” (i.e. running at 100 percent for
all CPUs), it may be necessary to add processing power to the server. Unfortunately,
NT scalability is quite limited, especially in comparison with UNIX environments. Also
keep in mind NT’s inability to split processes across multiple CPUs. Thus, one CPU
may be at 100% utilization while the other CPUs are at 0% utilization. There is
currently no solution for optimizing this situation, although Microsoft is working on
the problem.

Memory: pages/second. For this counter, five pages per second or
less is acceptable. If the number is much higher, there is a need to tune the memory
to make better use of hardware rather than virtual memory. Remember that this is
only a guideline, and the recommended setting may be too high for some systems.

Physical disks: percent time. This is the best place to tune database performance
within NT environments. By analyzing the disk I/O, the load on the database can be
leveled across multiple disks. High values indicate possible contention for I/O;
files should be moved to less utilized disk devices to optimize overall performance.

Physical disks: queue length. This setting is used to determine the number of
users sitting idle waiting for access to the same disk device. If this number is greater
than two, moving files to less frequently used disk devices should level the load of
the disk device.

Server: bytes total/second. This is a very nebulous performance indicator. It
monitors the server network connection. It is nebulous because it bundles multiple
network connections together. Some connections may be fast while others are slow,
making it difficult to identify real problems, and very possibly resulting in a false
sense of security. Intimate knowledge of the network card, connections, and
hub-stacks is critical for optimal server performance when moving data across the
network. Careful analysis of the network card (or cards) and their settings, combined
with the use of a Network Analyzer, can eliminate bottlenecks and improve
throughput of network traffic by a factor of 10 to 1000, depending on the
hardware.
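The guideline values above can be collected into a small check. The following is a minimal Python sketch, not part of any Windows or Informatica tooling; the function name `flag_counters` and the warning texts are hypothetical, and the thresholds are simply the guideline values quoted in the text, which may need adjusting for a particular system.

```python
# Guideline thresholds quoted in the text; treat them as starting points only.
MAX_PAGES_PER_SEC = 5       # Memory: pages/second
MAX_DISK_QUEUE_LENGTH = 2   # Physical disks: queue length
MAX_CPU_PERCENT = 100       # Processor: percent processor time


def flag_counters(cpu_percent_by_cpu, pages_per_sec, disk_queue_length):
    """Return a list of warnings for one set of counter samples.

    cpu_percent_by_cpu holds one sample per CPU, as the text recommends
    one monitor per CPU in SMP environments.
    """
    warnings = []
    # The text treats the system as maxed out only when all CPUs are at 100 percent.
    if cpu_percent_by_cpu and all(p >= MAX_CPU_PERCENT for p in cpu_percent_by_cpu):
        warnings.append("all CPUs at 100 percent: consider adding processing power")
    if pages_per_sec > MAX_PAGES_PER_SEC:
        warnings.append("paging above 5 pages/sec: review memory tuning")
    if disk_queue_length > MAX_DISK_QUEUE_LENGTH:
        warnings.append("disk queue length above 2: move files to less-used disks")
    return warnings


if __name__ == "__main__":
    for message in flag_counters([100, 0], pages_per_sec=12, disk_queue_length=3):
        print(message)
```

Note that a single CPU at 100 percent does not raise the processor warning here, which mirrors the text's definition of a maxed-out system.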

Resolving Typical NT Problems

The following paragraphs describe some common performance problems in an NT
environment and suggest tuning solutions.

Load reasonableness. Assume that some software will not be well coded, and
some background processes, such as a mail server or web server running on the
same machine, can potentially starve the CPUs on the machine. Off-loading CPU
hogs may be the only recourse.

Device Drivers. The device drivers for some types of hardware are notorious for
wasting CPU clock cycles. Be sure to get the latest drivers from the hardware vendor
to minimize this problem.

Memory and services. Although adding memory to NT is always a good solution, it
is also expensive and usually must be planned to support the BANK system for EISA
and PCI architectures. Before adding memory, check the Services in Control Panel
because many background applications do not uninstall the old service when
installing a new update or version. Thus, both the unused old service and the new
service may be using valuable CPU and memory resources.

I/O Optimization. This is, by far, the best tuning option for database applications
in the NT environment. If necessary, level the load across the disk devices by
moving files. In situations where there are multiple controllers, be sure to level the
load across the controllers too.

Using electrostatic devices and fast-wide SCSI can also help to increase
performance, and fragmentation can be eliminated by using a Windows NT/2000 disk
defragmentation product. Using this type of product is a good idea whether the disk
is formatted for FAT or NTFS.

Finally, on NT servers, be sure to implement disk striping to split single data files
across multiple disk drives and take advantage of RAID (Redundant Arrays of
Inexpensive Disks) technology. Also increase the priority of the disk devices on the
NT server. NT, by default, sets the disk device priority low. Change the disk priority
setting in the Registry at service\lanman\server\parameters and add a key for
ThreadPriority of type DWORD with a value of 2.

Monitoring System Performance In Windows 2000

In Windows 2000, the Informatica server uses system resources to process
transformations, execute sessions, and read and write data. The Informatica
server also uses system memory for other data such as aggregate, joiner, rank, and
cached lookup tables. With Windows 2000, you can use System Monitor in the
Performance Console of the administrative tools, or system tools in the Task
Manager, to monitor the amount of system resources used by the Informatica server
and to identify system bottlenecks.

Windows 2000 provides the following tools (accessible under Control
Panel/Administrative Tools/Performance) for monitoring resource usage on your
computer:

• System Monitor
• Performance Logs and Alerts

These Windows 2000 monitoring tools enable you to analyze usage and detect
bottlenecks at the disk, memory, processor, and network level.

The System Monitor displays a graph which is flexible and configurable. You can
copy counter paths and settings from the System Monitor display to the Clipboard
and paste counter paths from Web pages or other sources into the System Monitor
display. The System Monitor is portable. This is useful in monitoring other systems
that require administration. Typing perfmon.exe at the command prompt causes the
system to start System Monitor, not Performance Monitor.

The Performance Logs and Alerts tool provides two types of performance-related
logs—counter logs and trace logs—and an alerting function. Counter logs record
sampled data about hardware resources and system services based on performance
objects and counters in the same manner as System Monitor. Therefore they can be
viewed in System Monitor. Data in counter logs can be saved as comma-separated or
tab-separated files that are easily viewed with Excel.

Trace logs collect event traces that measure performance statistics associated with
events such as disk and file I/O, page faults, or thread activity.

The alerting function allows you to define a counter value that will trigger actions
such as sending a network message, running a program, or starting a log. Alerts are
useful if you are not actively monitoring a particular counter threshold value, but
want to be notified when it exceeds or falls below a specified value so that you can
investigate and determine the cause of the change. You might want to set alerts
based on established performance baseline values for your system.

Note: You must have Full Control access to a subkey in the registry in order to create
or modify a log configuration. (The subkey is
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\SysmonLog\Log_Queries.)

The predefined log setting under Counter Logs, named System Overview, is
configured to create a binary log that, after manual start-up, updates every 15
seconds and logs continuously until it reaches a maximum size. If you start
logging with this setting, data is saved to the Perflogs folder on the root directory
and includes the counters Memory\Pages/sec, PhysicalDisk(_Total)\Avg. Disk
Queue Length, and Processor(_Total)\% Processor Time.

If you want to create your own log setting, right-click one of the log types.

Some other useful counters include Physical Disk: Reads/sec and Writes/sec and
Memory: Available Bytes and Cache Bytes.

Tuning Mappings for Better Performance

Challenge

In general, a PowerCenter mapping is the biggest ‘bottleneck’ in the load process as
business rules determine the number and complexity of transformations in a
mapping. This Best Practice offers some guidelines for tuning mappings.

Description

Analyze mappings for tuning only after you have tuned the system, source and
target for peak performance.

Consider Single-Pass Reading

If several mappings use the same data source, consider a single-pass reading.
Consolidate separate mappings into one mapping with either a single Source
Qualifier Transformation or one set of Source Qualifier Transformations as the data
source for the separate data flows.

Similarly, if a function is used in several mappings, a single-pass reading will reduce
the number of times that function will be called in the session.

Optimize SQL Overrides

When SQL overrides are required in a Source Qualifier, Lookup Transformation, or in
the update override of a target object, be sure the SQL statement is tuned. How,
and how far, the SQL can be tuned depends on the underlying source or
target database system.

Scrutinize Datatype Conversions

PowerCenter Server automatically makes conversions between compatible datatypes.
When these conversions are performed unnecessarily, performance slows. For
example, if a mapping moves data from an Integer port to a Decimal port, then back
to an Integer port, the conversion may be unnecessary.

In some instances however, datatype conversions can help improve performance.
This is especially true when integer values are used in place of other datatypes for
performing comparisons using Lookup and Filter transformations.

Eliminate Transformation Errors

Large numbers of evaluation errors significantly slow performance of the
PowerCenter Server. During transformation errors, the PowerCenter Server engine
pauses to determine the cause of the error, removes the row causing the error from
the data flow, and logs the error in the session log.

Transformation errors can be caused by many things, including conversion errors,
conflicting mapping logic, any condition that is specifically set up as an error, and so
on. The session log can help point out the cause of these errors. If errors recur
consistently for certain transformations, re-evaluate the constraints for these
transformations. Any source of errors should be traced and eliminated.

Optimize Lookup Transformations

There are a number of ways to optimize lookup transformations that are set up in a
mapping.

When to Cache Lookups

When caching is enabled, the PowerCenter Server caches the lookup table and
queries the lookup cache during the session. When this option is not enabled, the
PowerCenter Server queries the lookup table on a row-by-row basis.

NOTE: All the tuning options mentioned in this Best Practice assume that memory
and cache sizing for lookups are sufficient to ensure that caches will not page to
disks. Practices regarding memory and cache sizing for Lookup transformations are
covered in Best Practice: Tuning Sessions for Better Performance.

In general, if the lookup table needs less than 300MB of memory, lookup caching
should be enabled.

A better rule of thumb than memory size is to determine the ‘size’ of the potential
lookup cache with regard to the number of rows expected to be processed. Consider
the following example.

In Mapping X, the source and lookup contain the following number of records:

ITEMS (source):   5000 records
MANUFACTURER:     200 records
DIM_ITEMS:        100000 records

Number of Disk Reads

                           Cached Lookup   Un-cached Lookup
LKP_Manufacturer
  Build Cache                        200                  0
  Read Source Records               5000               5000
  Execute Lookup                       0               5000
  Total # of Disk Reads             5200              10000
LKP_DIM_ITEMS
  Build Cache                     100000                  0
  Read Source Records               5000               5000
  Execute Lookup                       0               5000
  Total # of Disk Reads           105000              10000

Consider the case where MANUFACTURER is the lookup table. If the lookup table is
cached, it will take a total of 5,200 disk reads to build the cache and execute the
lookup. If the lookup table is not cached, then it will take a total of 10,000 disk
reads to execute the lookup. In this case, the number of records in the lookup table
is small in comparison with the number of times the lookup is executed. So this
lookup should be cached. This is the more likely scenario.

Consider the case where DIM_ITEMS is the lookup table. If the lookup table is
cached, it will result in 105,000 total disk reads to build and execute the lookup. If
the lookup table is not cached, then the disk reads would total 10,000. In this case
the number of records in the lookup table is not small in comparison with the
number of times the lookup will be executed. Thus the lookup should not be cached.
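The disk-read comparison for both lookups can be reproduced with a short calculation. This is a simplified Python sketch of the counting model used in the example only (one disk read per cached lookup row, per source row, and per un-cached lookup call); the function name `disk_reads` is hypothetical and the model ignores real-world effects such as database buffering.

```python
def disk_reads(source_rows, lookup_rows, cached):
    """Count disk reads using the simplified model from the example."""
    build_cache = lookup_rows if cached else 0      # read the whole lookup table once
    execute_lookup = 0 if cached else source_rows   # one query per source row
    return build_cache + source_rows + execute_lookup


SOURCE_ROWS = 5000  # ITEMS (source)

if __name__ == "__main__":
    for name, lookup_rows in (("MANUFACTURER", 200), ("DIM_ITEMS", 100000)):
        cached = disk_reads(SOURCE_ROWS, lookup_rows, cached=True)
        uncached = disk_reads(SOURCE_ROWS, lookup_rows, cached=False)
        choice = "cache" if cached < uncached else "do not cache"
        print(f"{name}: cached={cached}, un-cached={uncached} -> {choice}")
```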

Use the following eight-step method to determine if a lookup should be cached:

1. Code the lookup into the mapping.
2. Select a standard set of data from the source. For example, add a WHERE
clause on a relational source to load a sample 10,000 rows.
3. Run the mapping with caching turned off and save the log.
4. Run the mapping with caching turned on and save the log to a different name
than the log created in step 3.
5. Look in the cached lookup log and determine how long it takes to cache the
lookup object. Note this time in seconds: LOOKUP TIME IN SECONDS = LS.
6. In the non-cached log, take the time from the last lookup cache to the end of
the load in seconds and divide the number of rows being processed by that time:
NON-CACHED ROWS PER SECOND = NRS.
7. In the cached log, take the time from the last lookup cache to the end of the
load in seconds and divide the number of rows being processed by that time:
CACHED ROWS PER SECOND = CRS.
8. Use the following formula to find the breakeven row point:

(LS*NRS*CRS)/(CRS-NRS) = X

Where X is the breakeven point. If the expected number of source records is
less than X, it is better not to cache the lookup. If the expected number of
source records is more than X, it is better to cache the lookup.

For example:

Assume the lookup takes 166 seconds to cache (LS=166).
Assume with a cached lookup the load is 232 rows per second (CRS=232).
Assume with a non-cached lookup the load is 147 rows per second (NRS=147).

The formula would result in: (166*147*232)/(232-147) = 66,603.

Thus, if the source has fewer than 66,603 records, the lookup should not be
cached. If it has more than 66,603 records, then the lookup should be
cached.
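The breakeven formula follows from setting the cached load time (LS + X/CRS) equal to the non-cached load time (X/NRS) and solving for X. A minimal Python sketch of the calculation is shown below; the function name `breakeven_rows` is hypothetical, and the inputs are the figures from the example.

```python
def breakeven_rows(ls, nrs, crs):
    """Return the row count X at which cached and non-cached loads take equal time.

    ls  -- seconds needed to build the lookup cache
    nrs -- rows per second with the lookup not cached
    crs -- rows per second with the lookup cached
    """
    if crs <= nrs:
        # Caching never pays off if the cached run is not faster per row.
        raise ValueError("cached rate must exceed non-cached rate")
    return (ls * nrs * crs) / (crs - nrs)


if __name__ == "__main__":
    x = breakeven_rows(ls=166, nrs=147, crs=232)
    print(round(x))  # 66603
```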

Sharing Lookup Caches

There are a number of methods for sharing lookup caches.

• Within a specific session run for a mapping, if the same lookup is used
multiple times in a mapping, the PowerCenter Server will re-use the cache for
the multiple instances of the lookup. Using the same lookup multiple times in
the mapping will be more resource intensive with each successive instance. If
multiple cached lookups are from the same table but are expected to return
different columns of data, it may be better to set up the multiple lookups to
bring back the same columns even though not all return ports are used in all
lookups. Bringing back a common set of columns may reduce the number of
disk reads.
• Across sessions of the same mapping, the use of an unnamed persistent
cache allows multiple runs to use an existing cache file stored on the
PowerCenter Server. If the option of creating a persistent cache is set in the
lookup properties, the memory cache created for the lookup during the initial
run is saved to the PowerCenter Server. This can improve performance
because the Server builds the memory cache from cache files instead of the
database. This feature should only be used when the lookup table is not
expected to change between session runs.
• Across different mappings and sessions, the use of a named persistent
cache allows sharing of an existing cache file.

Reducing the Number of Cached Rows

There is an option to use a SQL override in the creation of a lookup cache. Options
can be added to the WHERE clause to reduce the set of records included in the
resulting cache.

NOTE: If you use a SQL override in a lookup, the lookup must be cached.

Optimizing the Lookup Condition

In the case where a lookup uses more than one lookup condition, set the conditions
with an equal sign first in order to optimize lookup performance.

Indexing the Lookup Table

The PowerCenter Server must query, sort and compare values in the lookup
condition columns. As a result, indexes on the database table should include every
column used in a lookup condition. This can improve performance for both cached
and un-cached lookups.

• In the case of a cached lookup, an ORDER BY condition is issued in the SQL
statement used to create the cache. Columns used in the ORDER BY condition should
be indexed. The session log will contain the ORDER BY statement.

• In the case of an un-cached lookup, since a SQL statement is created for each row
passing into the lookup transformation, performance can be helped by indexing
columns in the lookup condition.

Optimize Filter and Router Transformations

Filtering data as early as possible in the data flow improves the efficiency of a
mapping. Instead of using a Filter Transformation to remove a sizeable number of
rows in the middle or end of a mapping, use a filter on the Source Qualifier or a Filter
Transformation immediately after the source qualifier to improve performance.

Avoid complex expressions when creating the filter condition. Filter
transformations are most effective when a simple integer or TRUE/FALSE expression
is used in the filter condition.

Filters or routers should also be used to drop rejected rows from an Update
Strategy transformation if rejected rows do not need to be saved.

Replace multiple filter transformations with a router transformation. This
reduces the number of transformations in the mapping and makes the mapping
easier to follow.

Optimize Aggregator Transformations

Aggregator Transformations often slow performance because they must group data
before processing it.

Use simple columns in the group by condition to make the Aggregator
Transformation more efficient. When possible, use numbers instead of strings or
dates in the GROUP BY columns. Also avoid complex expressions in the Aggregator
expressions, especially in GROUP BY ports.

Use the Sorted Input option in the aggregator. This option requires that data sent
to the aggregator be sorted in the order in which the ports are used in the
aggregator’s group by. The Sorted Input option decreases the use of aggregate
caches. When it is used, the PowerCenter Server assumes all data is sorted by group
and, as a group is passed through an aggregator, calculations can be performed and
information passed on to the next transformation. Without sorted input, the Server
must wait for all rows of data before processing aggregate calculations. Use of the
Sorted Input option is usually accompanied by a Source Qualifier which uses the
Number of Sorted Ports option.

Use an Expression and Update Strategy instead of an Aggregator
Transformation. This technique can only be used if the source data can be sorted.
Further, using this option assumes that a mapping is using an Aggregator with
Sorted Input option. In the Expression Transformation, the use of variable ports is
required to hold data from the previous row of data processed. The premise is to use
the previous row of data to determine whether the current row is a part of the
current group or is the beginning of a new group. Thus, if the row is a part of the
current group, then its data would be used to continue calculating the current group
function. An Update Strategy Transformation would follow the Expression
Transformation and set the first row of a new group to insert and the following rows
to update.
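The previous-row technique can be illustrated outside PowerCenter. The following Python sketch is an analogy only, not Informatica code: `prev_key` and `running_total` stand in for variable ports, the "insert" and "update" labels stand in for the Update Strategy decision, and the function name `tag_rows` and the SUM aggregate are assumptions made for the illustration. Input rows must already be sorted by the group key.

```python
_NO_PREVIOUS_ROW = object()  # sentinel: no row has been processed yet


def tag_rows(sorted_rows):
    """Yield (action, key, running_total) for rows pre-sorted by key.

    The first row of each group is tagged "insert"; later rows of the same
    group are tagged "update" and carry the group's running SUM so far.
    """
    prev_key = _NO_PREVIOUS_ROW  # plays the role of a variable port
    running_total = 0
    for key, amount in sorted_rows:
        if key != prev_key:
            # A new group starts: reset the calculation.
            running_total = amount
            action = "insert"
        else:
            # Same group as the previous row: continue the calculation.
            running_total += amount
            action = "update"
        prev_key = key
        yield action, key, running_total


if __name__ == "__main__":
    rows = [("A", 10), ("A", 5), ("B", 7)]
    for result in tag_rows(rows):
        print(result)
```

Because each row only needs the previous row's state, nothing has to be cached per group, which is the reason the approach depends on sorted input.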

Optimize Joiner Transformations

Joiner transformations can slow performance because they need additional space in
memory at run time to hold intermediate results.

Define the rows from the smaller set of data in the joiner as the Master
rows. The Master rows are cached to memory and the detail records are then
compared to rows in the cache of the Master rows. In order to minimize memory
requirements, the smaller set of data should be cached and thus set as Master.

Use Normal joins whenever possible. Normal joins are faster than outer joins
and the resulting set of data is also smaller.

Use the database to do the join when sourcing data from the same database
schema. Database systems usually can perform the join more quickly than the
Informatica Server, so a SQL override or a join condition should be used when
joining multiple tables from the same database schema.

Optimize Sequence Generator Transformations

Sequence Generator transformations need to determine the next available sequence
number, so increasing the Number of Cached Values property can improve
performance. This property determines the number of values the Informatica Server
caches at one time. If it is set to cache no values, the Informatica Server must
query the Informatica repository each time to determine the next number that can
be used. Consider configuring the Number of Cached Values to a value greater than
1000. Note that any cached values not used in the course of a session are ‘lost’,
because the sequence generator value in the repository is set to give the next set
of cached values the next time it is called.
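The trade-off between repository round trips and ‘lost’ values can be estimated with simple arithmetic. The Python sketch below is a simplified model under stated assumptions, not a description of the server's internals: it assumes one repository query per block of cached values, one query per row when no values are cached, and that every unused value in the last block is lost. The function names are hypothetical.

```python
import math


def repository_round_trips(rows, cached_values):
    """Estimate repository queries needed to number `rows` rows."""
    if cached_values <= 0:
        return rows  # no caching: one query per generated value
    return math.ceil(rows / cached_values)


def lost_values(rows, cached_values):
    """Estimate sequence values fetched but never used in the session."""
    if cached_values <= 0:
        return 0
    return repository_round_trips(rows, cached_values) * cached_values - rows


if __name__ == "__main__":
    for cache in (0, 1000, 5000):
        trips = repository_round_trips(100000, cache)
        print(f"cached values={cache}: {trips} repository queries")
```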

Avoid External Procedure Transformations

For the most part, making calls to external procedures slows down a session. If
possible, avoid the use of these Transformations, which include Stored Procedures,
External Procedures and Advanced External Procedures.

Field Level Transformation Optimization

As a final step in the tuning process, expressions used in transformations can be
tuned. When examining expressions, focus on complex expressions for possible
simplification.

To help isolate slow expressions, do the following:

1. Time the session with the original expression.

2. Copy the mapping and replace half the complex expressions with a constant.

3. Run and time the edited session.

4. Make another copy of the mapping and replace the other half of the complex
expressions with a constant.

5. Run and time the edited session.

Processing field level transformations takes time. If the transformation expressions
are complex, then processing will be slower. It is often possible to get a 10-20%
performance improvement by optimizing complex field level transformations. Use the
target table mapping reports or the Metadata Reporter to examine the
transformations. Likely candidates for optimization are the fields with the most
complex expressions. Keep in mind that there may be more than one field causing
performance problems.

Factoring out Common Logic

Factoring out common logic can reduce the number of times a mapping performs the
same logic. If a mapping performs the same logic multiple times, moving the task
upstream in the mapping may allow the logic to be done just once. For example, a
mapping has five target tables. Each target requires a Social Security Number
lookup. Instead of performing the lookup right before each target, move the lookup
to a position before the data flow splits.

Minimize Function Calls

Anytime a function is called it takes resources to process. There are several common
examples where function calls can be reduced or eliminated.

Aggregate function calls can sometimes be reduced. In the case of each aggregate
function call, the Informatica Server must search and group the data.

Thus the following expression:

SUM(Column A) + SUM(Column B)

Can be optimized to:

SUM(Column A + Column B)

In general, operators are faster than functions, so operators should be used
whenever possible.

For example, if you have an expression which involves a CONCAT function such as:

CONCAT(CONCAT(FIRST_NAME,' '), LAST_NAME)

It can be optimized to:

FIRST_NAME || ' ' || LAST_NAME

Remember that IIF() is a function that returns a value, not just a logical test.
This allows many logical statements to be written in a more compact fashion.

For example:

IIF(FLG_A='Y' and FLG_B='Y' and FLG_C='Y', VAL_A+VAL_B+VAL_C,
IIF(FLG_A='Y' and FLG_B='Y' and FLG_C='N', VAL_A+VAL_B,
IIF(FLG_A='Y' and FLG_B='N' and FLG_C='Y', VAL_A+VAL_C,
IIF(FLG_A='Y' and FLG_B='N' and FLG_C='N', VAL_A,
IIF(FLG_A='N' and FLG_B='Y' and FLG_C='Y', VAL_B+VAL_C,
IIF(FLG_A='N' and FLG_B='Y' and FLG_C='N', VAL_B,
IIF(FLG_A='N' and FLG_B='N' and FLG_C='Y', VAL_C,
IIF(FLG_A='N' and FLG_B='N' and FLG_C='N', 0.0))))))))

Can be optimized to:

IIF(FLG_A='Y', VAL_A, 0.0) + IIF(FLG_B='Y', VAL_B, 0.0) + IIF(FLG_C='Y', VAL_C, 0.0)

The original expression had 8 IIFs, 16 ANDs and 24 comparisons. The optimized
expression results in 3 IIFs, 3 comparisons and two additions.
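The equivalence of the nested and the flattened form can be checked by enumerating every flag combination. The Python sketch below emulates IIF with a small helper; it is an illustration, not Informatica expression code, and the helper `iif` and the sample values are assumptions. It only covers flags that are exactly 'Y' or 'N'; other flag values or nulls are outside what the sketch verifies.

```python
from itertools import product


def iif(condition, value_if_true, value_if_false=None):
    """Minimal stand-in for the IIF function used in the expressions above."""
    return value_if_true if condition else value_if_false


def nested(fa, fb, fc, va, vb, vc):
    """The original eight-branch expression."""
    return iif(fa == "Y" and fb == "Y" and fc == "Y", va + vb + vc,
           iif(fa == "Y" and fb == "Y" and fc == "N", va + vb,
           iif(fa == "Y" and fb == "N" and fc == "Y", va + vc,
           iif(fa == "Y" and fb == "N" and fc == "N", va,
           iif(fa == "N" and fb == "Y" and fc == "Y", vb + vc,
           iif(fa == "N" and fb == "Y" and fc == "N", vb,
           iif(fa == "N" and fb == "N" and fc == "Y", vc,
           iif(fa == "N" and fb == "N" and fc == "N", 0.0))))))))


def flattened(fa, fb, fc, va, vb, vc):
    """The optimized three-IIF expression."""
    return iif(fa == "Y", va, 0.0) + iif(fb == "Y", vb, 0.0) + iif(fc == "Y", vc, 0.0)


if __name__ == "__main__":
    values = (1.5, 2.25, 4.0)  # arbitrary sample amounts
    for flags in product("YN", repeat=3):
        assert nested(*flags, *values) == flattened(*flags, *values)
    print("both forms agree for all 8 flag combinations")
```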

Be creative in making expressions more efficient. The following rework of an
expression reduces three comparisons to one:

IIF(X=1 OR X=5 OR X=9, 'yes', 'no')

Can be optimized to:

IIF(MOD(X, 4) = 1, 'yes', 'no')

Note that the two expressions are equivalent only when X is known to be restricted
to a range such as 1 through 12; MOD(X, 4) = 1 is also true for values such as 13
and 17.
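Whether the MOD rewrite is safe depends on the values X can take, so it is worth checking against the expected domain before applying it. The following Python sketch compares both forms over two ranges of positive integers; the function names and the chosen ranges are assumptions for the illustration, and negative values are deliberately left out because remainder semantics for negatives differ between languages.

```python
def original(x):
    """IIF(X=1 OR X=5 OR X=9, 'yes', 'no')"""
    return "yes" if x in (1, 5, 9) else "no"


def rewritten(x):
    """IIF(MOD(X, 4) = 1, 'yes', 'no'), for positive integers only."""
    return "yes" if x % 4 == 1 else "no"


def mismatches(low, high):
    """Return the values in [low, high] where the two forms disagree."""
    return [x for x in range(low, high + 1) if original(x) != rewritten(x)]


if __name__ == "__main__":
    print(mismatches(1, 12))  # []
    print(mismatches(1, 20))  # [13, 17]
```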

Calculate Once, Use Many Times

Avoid calculating or testing the same value multiple times. If the same
sub-expression is used several times in a transformation, consider making the
sub-expression a local variable. The local variable can be used only within the
transformation, but calculating the variable only once can speed performance.

Choose Numeric versus String Operations

The Informatica Server processes numeric operations faster than string operations.
For example, if a lookup is done on a large amount of data on two columns,
EMPLOYEE_NAME and EMPLOYEE_ID, configuring the lookup around EMPLOYEE_ID
improves performance.

Optimizing Char-Char and Char-Varchar Comparisons

When the Informatica Server performs comparisons between CHAR and VARCHAR
columns, it slows each time it finds trailing blank spaces in the row. The Treat CHAR
as CHAR On Read option can be set in the Informatica Server setup so that the
Informatica Server does not trim trailing spaces from the end of CHAR source fields.

Use DECODE instead of LOOKUP

When a LOOKUP function is used, the Informatica Server must lookup a table in the
database. When a DECODE function is used, the lookup values are incorporated into
the expression itself so the Informatica Server does not need to lookup a separate
table. Thus, when looking up a small set of unchanging values, using DECODE may
improve performance.

Reduce the Number of Transformations in a Mapping

Whenever possible, the number of transformations should be reduced, as there is
always overhead involved in moving data between transformations. Along the same
lines, unnecessary links between transformations should be removed to minimize the
amount of data moved. This is especially important with data being pulled from the
Source Qualifier Transformation.

Tuning Sessions for Better Performance

Challenge

Running sessions is where ‘the pedal hits the metal’. A common misconception is
that this is the area where most tuning should occur. While it is true that various
specific session options can be modified to improve performance, this should not be
the major or only area of focus when implementing performance tuning.

Description

When you have finished optimizing the sources, target database and mappings, you
should review the sessions for performance optimization.

Caches

The greatest area for improvement at the session level usually involves tweaking
memory cache settings. The Aggregator, Joiner, Rank and Lookup Transformations
use caches. Review the memory cache settings for sessions where the mappings
contain any of these transformations.

When performance details are collected for a session, information about
readfromdisk and writetodisk counters for Aggregator, Joiner, Rank and/or Lookup
transformations can point to a session bottleneck. Any value other than zero for
these counters may indicate a bottleneck.

Because index and data caches are created for each of these transformations, both
the index cache and data cache sizes may affect performance, depending on the
factors discussed in the following paragraphs.

When the PowerCenter Server creates memory caches, it may also create cache
files. Both index and data cache files can be created for the following transformations
in a mapping:

• Aggregator transformation (without sorted ports)
• Joiner transformation
• Rank transformation
• Lookup transformation (with caching enabled)

The PowerCenter Server creates the index and data cache files by default in the
PowerCenter Server variable directory, $PMCacheDir. The naming convention used
by the PowerCenter Server for these files is PM [type of widget] [generated
number].dat or .idx. For example, an aggregate data cache file would be named
PMAGG31_19.dat. The cache directory may be changed, however, if disk space is a
constraint. Informatica recommends that the cache directory be local to the
PowerCenter Server. You may encounter performance or reliability problems when
you cache large quantities of data on a mapped or mounted drive.

If the PowerCenter Server requires more memory than the configured cache size, it
stores the overflow values in these cache files. Since paging to disk can slow session
performance, try to configure the index and data cache sizes to store the appropriate
amount of data in memory. Refer to Chapter 9: Session Caches in the Informatica
Session and Server Guide for detailed information on determining cache sizes.

The PowerCenter Server writes to the index and data cache files during a session in
the following cases:

• The mapping contains one or more Aggregator transformations, and the
session is configured for incremental aggregation.
• The mapping contains a Lookup transformation that is configured to use a
persistent lookup cache, and the Informatica Server runs the session for the
first time.
• The mapping contains a Lookup transformation that is configured to initialize
the persistent lookup cache.
• The DTM runs out of cache memory and pages to the local cache files. The
DTM may create multiple files when processing large amounts of data. The
session fails if the local directory runs out of disk space.

When a session is run, the PowerCenter Server writes a message in the session log
indicating the cache file name and the transformation name. When a session
completes, the DTM generally deletes the overflow index and data cache files.
However, index and data files may exist in the cache directory if the session is
configured for either incremental aggregation or to use a persistent lookup cache.
Cache files may also remain if the session does not complete successfully.

If a cache file handles more than 2 gigabytes of data, the PowerCenter Server
creates multiple index and data files. When creating these files, the PowerCenter
Server appends a number to the end of the filename, such as PMAGG*.idx1 and
PMAGG*.idx2. The number of index and data files is limited only by the amount of
disk space available in the cache directory.

• Aggregator Caches

Keep the following items in mind when configuring the aggregate
memory cache sizes.

  • Allocate enough space to hold at least one row in each
  aggregate group.
  • Remember that you only need to configure cache memory for
  an Aggregator transformation that does NOT use sorted ports.
  The PowerCenter Server uses memory, not cache memory, to
  process an Aggregator transformation with sorted ports.
  • Incremental aggregation can improve session performance.
  When it is used, the PowerCenter Server saves index and data
  cache information to disk at the end of the session. The next
  time the session runs, the PowerCenter Server uses this
  historical information to perform the incremental aggregation.
  The PowerCenter Server names these files PMAGG*.dat and
  PMAGG*.idx and saves them to the cache directory. Mappings
  that have sessions which use incremental aggregation should
  be set up so that only new detail records are read with each
  subsequent run.

• Joiner Caches

The source with fewer records should be specified as the master source
because only the master source records are read into cache. When a session
is run with a Joiner transformation, the PowerCenter Server reads all the rows
from the master source and builds memory caches based on the master rows.
After the memory caches are built, the PowerCenter Server reads the rows
from the detail source and performs the joins.

Also, the PowerCenter Server automatically aligns all data for joiner caches on
an eight-byte boundary, which helps increase the performance of the join.

• Lookup Caches

Several options can be explored when dealing with lookup
transformation caches.

• Persistent caches should be used when lookup data is not
expected to change often. Lookup cache files are saved the first
time a session with a persistent-cache lookup is run. These files
are reused for subsequent runs, bypassing the querying of the
database for the lookup. If the lookup table changes, be sure to
set the Recache from Database option so that the lookup cache
files are rebuilt.
• Lookup caching should be enabled for relatively small tables.
Refer to Best Practice: Tuning Mappings for Better Performance
to determine when lookups should be cached. When the Lookup
transformation is not configured for caching, the PowerCenter
Server queries the lookup table, rather than a lookup cache, for
each input row. The result of the lookup is the same whether or
not the lookup table is cached, but using a lookup cache can
sometimes increase session performance.
• As with joiner caches, the PowerCenter Server aligns all data
for lookup caches on an eight-byte boundary, which helps
increase the performance of the lookup.

Allocating Buffer Memory

When the PowerCenter Server initializes a session, it allocates blocks of memory to
hold source and target data. Sessions that use a large number of sources and targets
may require additional memory blocks.

You can tweak session properties to increase the number of available memory blocks
by adjusting:

• DTM Buffer Pool Size – the default setting is 12,000,000 bytes
• Default Buffer Block Size – the default size is 64,000 bytes

To configure these settings, first determine the number of memory blocks the
PowerCenter Server requires to initialize the session. Then you can calculate the
buffer pool size and/or the buffer block size based on the default settings, to create
the required number of session blocks.

If there are XML sources and targets in the mappings, use the number of groups in
the XML source or target in the total calculation for the total number of sources and
targets.
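
The relationship between the two settings can be sketched as simple arithmetic. The Python below is only an illustration, not PowerCenter code. The parameter `required_blocks` is a hypothetical input standing for the number of memory blocks you have determined the session needs (this section does not give the formula for that number), and the sketch assumes the whole pool is available for buffer blocks, ignoring any memory used for internal data structures.

```python
# Defaults quoted in this section.
DEFAULT_DTM_BUFFER_POOL_SIZE = 12_000_000  # bytes
DEFAULT_BUFFER_BLOCK_SIZE = 64_000         # bytes


def available_blocks(pool_size: int, block_size: int) -> int:
    """Approximate number of whole buffer blocks that fit in the pool."""
    return pool_size // block_size


def pool_size_for(required_blocks: int, block_size: int) -> int:
    """Smallest pool size (in bytes) that holds required_blocks whole blocks."""
    return required_blocks * block_size


if __name__ == "__main__":
    blocks = available_blocks(DEFAULT_DTM_BUFFER_POOL_SIZE, DEFAULT_BUFFER_BLOCK_SIZE)
    print("blocks available with the defaults:", blocks)
    # 250 is a made-up requirement used only to show the calculation.
    print("pool size for 250 blocks:", pool_size_for(250, DEFAULT_BUFFER_BLOCK_SIZE))
```

If the required block count exceeds what the defaults provide, either raise the pool size or lower the block size until the quotient covers the requirement.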

• Increasing the DTM Buffer Pool Size

The DTM Buffer Pool Size setting specifies the amount of memory the
PowerCenter Server uses as DTM buffer memory. The PowerCenter Server
uses DTM buffer memory to create the internal data structures and buffer
blocks used to bring data into and out of the Server. When the DTM buffer
memory is increased, the PowerCenter Server creates more buffer blocks,
which can improve performance during momentary slowdowns.

If a session’s performance details show low numbers for your source and
target BufferInput_efficiency and BufferOutput_efficiency counters, increasing
the DTM buffer pool size may improve performance.

Increasing DTM buffer memory allocation generally causes performance to
improve initially and then level off. When the DTM buffer memory allocation is
increased, the total memory available on the PowerCenter Server needs to be
evaluated. If a session is part of a concurrent batch, the combined DTM buffer
memory allocated for the sessions or batches must not exceed the total
memory for the PowerCenter Server system.

If you don’t see a significant performance increase after increasing DTM
buffer memory, then it was not a factor in session performance.

• Optimizing the Buffer Block Size

Within a session, you may modify the buffer block size by changing it in the
Advanced Parameters section. This specifies the size of a memory block that
is used to move data throughout the pipeline. Each source, each
transformation, and each target may have a different row size, which results
in different numbers of rows that can fit into one memory block. Row size
is determined in the server, based on the number of ports, their datatypes and
precisions. Ideally, block size should be configured so that it can hold roughly
100 rows, plus or minus a factor of ten. When calculating this, use the source
or target with the largest row size. The default is 64K. The buffer block size
does not become a factor in session performance until the number of rows
falls below 10 or goes above 1000. Informatica recommends that the size of
the shared memory (which determines the number of buffers available to the
session) should not be increased at all unless the mapping is “complex” (i.e.,
more than 20 transformations).
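
The rule of thumb above (about 100 rows per block, with 10 and 1000 as the outer limits) can be checked with a few lines of Python. This is a rough sketch: the row sizes are hypothetical values you would supply yourself, since the server derives the real row size from ports, datatypes, and precisions.

```python
TARGET_ROWS_PER_BLOCK = 100   # "roughly 100 rows" from the text above
MIN_ROWS_PER_BLOCK = 10       # below this, block size becomes a factor
MAX_ROWS_PER_BLOCK = 1000     # above this, block size becomes a factor


def rows_per_block(block_size: int, row_size: int) -> int:
    """Whole rows of row_size bytes that fit in one block."""
    return block_size // row_size


def block_size_advice(block_size: int, largest_row_size: int) -> str:
    """Classify a block size against the 10..1000 rows-per-block window."""
    rows = rows_per_block(block_size, largest_row_size)
    if rows < MIN_ROWS_PER_BLOCK:
        return "increase"
    if rows > MAX_ROWS_PER_BLOCK:
        return "decrease"
    return "ok"


def suggested_block_size(largest_row_size: int) -> int:
    """Block size that holds about 100 of the largest rows."""
    return TARGET_ROWS_PER_BLOCK * largest_row_size


if __name__ == "__main__":
    # 8,000-byte rows are an invented example of a very wide row.
    print(block_size_advice(64_000, 8_000), suggested_block_size(8_000))
```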

Running Concurrent Batches

Performance can sometimes be improved by creating a concurrent batch to run
several sessions in parallel on one PowerCenter Server. This technique should only
be employed on servers with multiple CPUs available. The first concurrent session
uses a maximum of 1.4 CPUs, and each additional session uses a maximum of 1 CPU.
Also, it has been noted that simple mappings (i.e., mappings with only a few
transformations) do not make the engine “CPU bound”, and therefore use a lot less
processing power than a full CPU.

If there are independent sessions that use separate sources and mappings to
populate different targets, they can be placed in a concurrent batch and run at the
same time.

If there is a complex mapping with multiple sources, you can separate it into several
simpler mappings with separate sources. This enables you to place the sessions for
each of the mappings in a concurrent batch to be run in parallel.
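
As a rough planning aid, the CPU ceilings quoted earlier in this section (1.4 CPUs for the first session and 1 CPU for each additional session) can be turned into an upper bound for a concurrent batch. The function below is a sketch with hypothetical names. It gives a worst case only, because simple mappings use much less than a full CPU.

```python
FIRST_SESSION_MAX_CPUS = 1.4
ADDITIONAL_SESSION_MAX_CPUS = 1.0


def max_cpu_demand(concurrent_sessions: int) -> float:
    """Upper bound on CPUs used by a batch of concurrent sessions."""
    if concurrent_sessions <= 0:
        return 0.0
    extra = concurrent_sessions - 1
    return FIRST_SESSION_MAX_CPUS + ADDITIONAL_SESSION_MAX_CPUS * extra


def fits_on_server(concurrent_sessions: int, cpus_available: int) -> bool:
    """True when the worst-case demand does not exceed the CPUs available."""
    return max_cpu_demand(concurrent_sessions) <= cpus_available


if __name__ == "__main__":
    for sessions in (1, 2, 4):
        print(sessions, "session(s):", round(max_cpu_demand(sessions), 1), "CPUs at most")
```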

Partitioning Sessions

If large amounts of data are being processed with PowerCenter 5.x, data can be
processed in parallel with a single session by partitioning the source via the source
qualifier. Partitioning allows you to break a single source into multiple sources and to
run each in parallel. The PowerCenter Server will spawn a Read and Write thread for
each partition, thus allowing for simultaneous reading, processing, and writing. Keep
in mind that each partition will compete for the same resources (i.e., memory, disk,
and CPU), so make sure that the hardware and memory are sufficient to support a
parallel session. Also, the DTM buffer pool size is split among all partitions, so it may
need to be increased for optimal performance.
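
The effect of partitioning on buffer memory can be illustrated with a short calculation. The sketch below assumes the DTM buffer pool is divided evenly among partitions. The text says only that the pool is split, so the even split is an assumption, and the function names are hypothetical.

```python
def per_partition_share(pool_size: int, partitions: int) -> int:
    """Bytes of DTM buffer pool per partition, assuming an even split."""
    return pool_size // partitions


def pool_size_to_preserve_share(unpartitioned_pool_size: int, partitions: int) -> int:
    """Pool size that gives each partition what the unpartitioned session had."""
    return unpartitioned_pool_size * partitions


if __name__ == "__main__":
    default_pool = 12_000_000  # default DTM Buffer Pool Size in bytes
    print(per_partition_share(default_pool, 4))
    print(pool_size_to_preserve_share(default_pool, 4))
```

Whether the larger pool is affordable still depends on the total memory of the PowerCenter Server system.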

Increasing the Target Commit Interval

One method of resolving target database bottlenecks is to increase the commit
interval. Each time the PowerCenter Server commits, performance slows. Therefore,
the smaller the commit interval, the more often the PowerCenter Server writes to the
target database, and the slower the overall performance. If you increase the commit
interval, the number of times the PowerCenter Server commits decreases and
performance may improve.

When increasing the commit interval at the session level, you must remember to
increase the size of the database rollback segments to accommodate this larger
number of rows. One of the major reasons that Informatica has set the default
commit interval to 10,000 is to accommodate the default rollback segment / extent
size of most databases. If you increase both the commit interval and the database
rollback segments, you should see an increase in performance. In some cases
though, just increasing the commit interval without making the appropriate database
changes may cause the session to fail part way through (you may get a database
error like “unable to extend rollback segments” in Oracle).
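
The trade-off can be made concrete by counting commits for a given load. The sketch below assumes one commit per full commit interval plus a final commit for any remaining rows. That is a simplification of how the server actually commits, and the row counts are invented for the example.

```python
DEFAULT_COMMIT_INTERVAL = 10_000  # rows, the default quoted above


def commit_count(total_rows: int, commit_interval: int) -> int:
    """Number of commits needed to load total_rows (ceiling division)."""
    return -(-total_rows // commit_interval)


if __name__ == "__main__":
    rows = 1_000_000  # hypothetical load size
    for interval in (DEFAULT_COMMIT_INTERVAL, 50_000, 100_000):
        print(f"interval {interval}: {commit_count(rows, interval)} commits")
```

Each increase in the interval also increases the number of uncommitted rows the rollback segments must hold, which is why the two settings need to be raised together.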

Disabling Session Recovery

You can improve performance by turning off session recovery. The PowerCenter
Server writes recovery information in the OPB_SRVR_RECOVERY table during each
commit. This can decrease performance. The PowerCenter Server setup can be set to
disable session recovery. But be sure to weigh the importance of improved session
performance against the ability to recover an incomplete session when considering
this option.

Disabling Decimal Arithmetic

If a session runs with decimal arithmetic enabled, disabling decimal arithmetic may
improve session performance.

The Decimal datatype is a numeric datatype with a maximum precision of 28. To use
a high-precision Decimal datatype in a session, the session must be configured so that
the PowerCenter Server recognizes this datatype by selecting Enable Decimal Arithmetic
in the session property sheet. However, since reading and manipulating a
high-precision datatype (i.e., a Decimal with a precision of up to 28) can slow the
PowerCenter Server, session performance may be improved by disabling decimal
arithmetic.

Reducing Error Tracing

If a session contains a large number of transformation errors, you may be able to
improve performance by reducing the amount of data the PowerCenter Server writes
to the session log.

To reduce the amount of time spent writing to the session log file, set the tracing
level to Terse. Terse tracing should only be set if the sessions run without problems
and session details are not required. At this tracing level, the PowerCenter Server
does not write error messages or row-level information for reject data. However, if
Terse is not an acceptable level of detail, you may want to consider leaving the
tracing level at Normal and focusing your efforts on reducing the number of
transformation errors.

Note that the tracing level must be set to Normal in order to use the reject loading
utility.

As an additional debug option (beyond the PowerCenter Debugger), you may set the
tracing level to Verbose to see the flow of data between transformations. However,
this will significantly affect the session performance. Do not use Verbose tracing
except when testing sessions. Always remember to switch tracing back to Normal
after the testing is complete.

The session tracing level overrides any transformation-specific tracing levels within
the mapping. Informatica does not recommend reducing error tracing as a long-term
response to high levels of transformation errors. Because there are only a handful of
reasons why transformation errors occur, it makes sense to fix and prevent any
recurring transformation errors.

Determining Bottlenecks

Challenge

Because there are many variables involved in identifying and rectifying performance
bottlenecks, an efficient method for determining where bottlenecks exist is crucial to
good data warehouse management.

Description

The first step in performance tuning is to identify performance bottlenecks. Carefully
consider the following five areas to determine where bottlenecks exist; use a process
of elimination, investigating each area in the order indicated:

1. Write
2. Read
3. Mapping
4. Session
5. System

Before you begin, establish an approach for identifying performance
bottlenecks. Start by isolating the problem with test sessions, so that you
can compare the session’s original performance with that of the tuned
session.

The swap method is very useful for determining the most common bottlenecks. It
involves the following five steps:

1. Make a temporary copy of the mapping and/or session that is to be tuned,
then tune the copy before making changes to the original.
2. Implement only one change at a time and test for any performance
improvements to gauge which tuning methods work most effectively in the
environment.
3. Document the change made to the mapping and/or session and the
performance metrics achieved as a result of the change. The actual execution
time may be used as a performance metric.
4. Delete the temporary sessions upon completion of performance tuning.
5. Make appropriate tuning changes to mappings and/or sessions.
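
Steps 2 and 3 amount to keeping a small log of one-change-at-a-time trials against a baseline run. A minimal way to record and compare them is sketched below. The class name, the change descriptions, and the timings are all hypothetical, and execution time is used as the metric, as step 3 suggests.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TuningTrial:
    change: str      # the single change applied to the temporary copy
    seconds: float   # measured execution time of the test session


def improvement_pct(baseline_seconds: float, trial_seconds: float) -> float:
    """Percentage reduction in run time relative to the baseline."""
    return (baseline_seconds - trial_seconds) / baseline_seconds * 100.0


def rank_trials(trials: List[TuningTrial]) -> List[TuningTrial]:
    """Trials ordered from fastest to slowest."""
    return sorted(trials, key=lambda trial: trial.seconds)


if __name__ == "__main__":
    baseline = 600.0  # seconds for the untuned session (invented figure)
    trials = [
        TuningTrial("write to flat file instead of database", 450.0),
        TuningTrial("increase DTM buffer pool size", 580.0),
    ]
    for trial in rank_trials(trials):
        print(f"{trial.change}: {improvement_pct(baseline, trial.seconds):.1f}% faster")
```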

Write Bottlenecks

Relational Targets

The most common performance bottleneck occurs when the PowerCenter Server
writes to a target database. This type of bottleneck can easily be identified with the
following procedure:

1. Make a copy of the original session.
2. Configure the test session to write to a flat file.

If the session performance improves significantly when writing to a flat file, you
have a write bottleneck.

Flat File Targets

If the session targets a flat file, you probably do not have a write bottleneck. You can
optimize session performance by writing to a flat file target local to the PowerCenter
server. If the local flat file is very large, you can optimize the write process by
dividing it among several physical drives.

Read Bottlenecks

Relational Sources

If the session reads from a relational source, first run a read test session
that uses a flat file as the source. You may also use a database query
to indicate if a read bottleneck exists.

Using a Test Session with a Flat File Source

1. Create a mapping and session that writes the source table data to a flat file.
2. Create a test mapping that contains only the flat file source, the source
qualifier, and the target table.
3. Create a session for the test mapping.

If the test session’s performance increases significantly, you have a read bottleneck.

Using a Database Query

To identify a source bottleneck by executing a read query directly against the
source database, follow these steps:

1. Copy the read query directly from the session log.
2. Run the query against the source database with a query tool such as
SQL*Plus.
3. Measure the query execution time and the time it takes for the query to
return the first row.

If there is a long delay between the two time measurements, you have a source
bottleneck.
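
The two measurements in step 3 can also be taken from a script rather than an interactive query tool. The sketch below uses Python's built-in sqlite3 module purely as a stand-in database so that the example is self-contained. The table `src` and the query are invented, and against a real source you would substitute the appropriate driver and the read query copied from the session log.

```python
import sqlite3
import time


def time_query(connection, sql):
    """Return (seconds to first row, seconds to last row, row count)."""
    start = time.perf_counter()
    cursor = connection.execute(sql)
    first = cursor.fetchone()
    first_row_seconds = time.perf_counter() - start
    row_count = 0 if first is None else 1
    for _ in cursor:          # drain the remaining rows
        row_count += 1
    total_seconds = time.perf_counter() - start
    return first_row_seconds, total_seconds, row_count


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE src (id INTEGER, name TEXT)")
    conn.executemany(
        "INSERT INTO src VALUES (?, ?)",
        [(i, f"row{i}") for i in range(1000)],
    )
    first, total, rows = time_query(conn, "SELECT id, name FROM src")
    print(f"first row after {first:.6f}s, all {rows} rows after {total:.6f}s")
```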

Flat File Sources

If your session reads from a flat file source, you probably do not have a read
bottleneck. Tuning the Line Sequential Buffer Length to a size large enough to hold
approximately four to eight rows of data at a time (for flat files) may help when
reading flat file sources. Ensure the flat file source is local to the PowerCenter
Server.

Mapping Bottlenecks

If you have eliminated the reading and writing of data as bottlenecks, you may have
a mapping bottleneck. Use the swap method to determine if the bottleneck is in the
mapping. After using the swap method, you can use the session’s performance
details to determine if mapping bottlenecks exist. High Rowsinlookupcache and
Errorrows counters indicate mapping bottlenecks. Follow these steps to identify
mapping bottlenecks:

Using a Test Mapping without transformations

1. Make a copy of the original mapping.
2. In the copied mapping, retain only the sources, source qualifiers, and any
custom joins or queries.
3. Remove all transformations.
4. Connect the source qualifiers to the target.

High Rowsinlookupcache counters:

Multiple lookups can slow the session. You may improve session performance by
locating the largest lookup tables and tuning those lookup expressions.

High Errorrows counters:

Transformation errors affect session performance. If a session has large numbers in
any of the Transformation_errorrows counters, you may improve performance by
eliminating the errors.

For further details on eliminating mapping bottlenecks, refer to the Best Practice:
Tuning Mappings for Better Performance.

Session Bottlenecks

Session performance details can be used to flag other problem areas in the session
Advanced Options Parameters or in the mapping.

Low Buffer Input and Buffer Output Counters

If the BufferInput_efficiency and BufferOutput_efficiency counters are low for all
sources and targets, increasing the session DTM buffer pool size may improve
performance.

Aggregator, Rank, and Joiner Readfromdisk and Writetodisk Counters

If a session contains Aggregator, Rank, or Joiner transformations, examine each
Transformation_readfromdisk and Transformation_writetodisk counter. If these
counters display any number other than zero, you can improve session performance
by increasing the index and data cache sizes.

For further details on eliminating session bottlenecks, refer to the Best Practice:
Tuning Sessions for Better Performance.

System Bottlenecks

After tuning the source, target, mapping, and session, you may also consider tuning
the system hosting the PowerCenter Server.

Windows NT/2000

Use system tools such as the Performance tab in the Task Manager or the
Performance Monitor to view CPU usage and total memory usage.

UNIX

On UNIX, use system tools like vmstat and iostat to monitor such items as system
performance and disk swapping actions.

For further information regarding system tuning, refer to the Best Practices:
Performance Tuning UNIX Systems and Performance Tuning Windows NT/2000
Systems.

The following table details the Performance Counters that can be used to flag session
and mapping bottlenecks. Note that these can only be found in the Session
Performance Details file.

Transformation          Counters                      Description

Source Qualifier and    BufferInput_Efficiency        Percentage reflecting how seldom the reader
Normalizer                                            waited for a free buffer when passing data to
Transformations                                       the DTM.
                        BufferOutput_Efficiency       Percentage reflecting how seldom the DTM
                                                      waited for a full buffer of data from the
                                                      reader.
Target                  BufferInput_Efficiency        Percentage reflecting how seldom the DTM
                                                      waited for a free buffer when passing data to
                                                      the writer.
                        BufferOutput_Efficiency       Percentage reflecting how seldom the
                                                      Informatica Server waited for a full buffer of
                                                      data from the reader.
Aggregator and Rank     Aggregator/Rank_readfromdisk  Number of times the Informatica Server read
Transformations                                       from the index or data file on the local disk,
                                                      instead of using cached data.
                        Aggregator/Rank_writetodisk   Number of times the Informatica Server
                                                      wrote to the index or data file on the local
                                                      disk, instead of using cached data.
Joiner Transformation   Joiner_readfromdisk           Number of times the Informatica Server read
(see Note below)                                      from the index or data file on the local disk,
                                                      instead of using cached data.
                        Joiner_writetodisk            Number of times the Informatica Server
                                                      wrote to the index or data file on the local
                                                      disk, instead of using cached data.
Lookup Transformation   Lookup_rowsinlookupcache      Number of rows stored in the lookup cache.
All Transformations     Transformation_errorrows      Number of rows in which the Informatica
                                                      Server encountered an error.

Note: The PowerCenter Server generates two sets of performance counters for a
Joiner transformation. The first set of counters refers to the master source. The
Joiner transformation does not generate output row counters associated with the
master source. The second set of counters refers to the detail source.
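
The counter checks described in this Best Practice can be collected into one small routine. The Python below is a sketch under stated assumptions. It expects the counters to have been copied into a dictionary by hand, the counter names in the example (a transformation name followed by the counter suffix) are invented, and the 50 percent cutoff for a "low" efficiency is an arbitrary illustrative threshold, because this document does not define one.

```python
from typing import Dict, List

# Assumption: the document only says "low"; 50 percent is an illustrative cutoff.
LOW_EFFICIENCY_PERCENT = 50.0

EFFICIENCY_SUFFIXES = ("bufferinput_efficiency", "bufferoutput_efficiency")
DISK_SUFFIXES = ("_readfromdisk", "_writetodisk")


def flag_bottlenecks(counters: Dict[str, float]) -> List[str]:
    """Return plain-text findings for a mapping of counter name to value."""
    findings: List[str] = []

    # Buffer efficiency matters only when it is low for ALL sources and targets.
    efficiency = [
        value for name, value in counters.items()
        if name.lower().endswith(EFFICIENCY_SUFFIXES)
    ]
    if efficiency and all(value < LOW_EFFICIENCY_PERCENT for value in efficiency):
        findings.append("buffer efficiency low everywhere: consider a larger DTM buffer pool")

    for name, value in counters.items():
        lowered = name.lower()
        if lowered.endswith(DISK_SUFFIXES) and value != 0:
            findings.append(f"{name}={value}: consider larger index and data caches")
        elif lowered.endswith("_errorrows") and value > 0:
            findings.append(f"{name}={value}: eliminate the transformation errors")
    return findings


if __name__ == "__main__":
    sample = {
        "SQ_orders_BufferInput_efficiency": 20.0,
        "T_sales_BufferOutput_efficiency": 35.0,
        "AGG_totals_readfromdisk": 3,
        "EXP_clean_errorrows": 12,
    }
    for finding in flag_bottlenecks(sample):
        print(finding)
```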

Advanced Client Configuration Options

Challenge

Setting the Registry in order to ensure consistent client installations, resolve
potential missing or invalid license key issues, and change the Server Manager
Session Log Editor to your preferred editor.

Description

Ensuring Consistent Data Source Names

To ensure the use of consistent data source names for the same data sources across
the domain, the Administrator can create a single "official" set of data sources, then
use the Repository Manager to export that connection information to a file. You can
then distribute this file and import the connection information for each client
machine.

Solution

• From Repository Manager, choose Export Registry from the Tools drop-down
menu.
• For all subsequent client installs, simply choose Import Registry from the
Tools drop-down menu.

Resolving the Missing or Invalid License Key Issue

The “missing or invalid license key” error occurs when attempting to install
PowerCenter Client tools on NT 4.0 or Windows 2000 with a userid other than
‘Administrator.’

This problem also occurs when the client software tools are installed under the
Administrator account, and subsequently a user with a non-administrator ID
attempts to run the tools. The user who attempts to log in using the normal ‘non-
administrator’ userid will be unable to start the PowerCenter Client tools. Instead,
the software will display the message indicating that the license key is missing or
invalid.

Solution

• While logged in as the installation user with administrator authority, use
regedt32 to edit the registry.
• Under HKEY_LOCAL_MACHINE open Software/Informatica/PowerMart Client
Tools/. From the menu bar, select Security/Permissions, and grant read
access to the users that should be permitted to use the PowerMart Client.
(Note that the registry entries for both PowerMart and PowerCenter server
and client tools are stored as PowerMart Server and PowerMart Client tools.)

Changing the Server Manager Session Log Editor

The session log editor is not automatically determined when the PowerCenter
Client tools are installed. A window appears the first time a session log is
viewed from the PowerCenter Server Manager, prompting the user to enter
the full path name of the editor to be used to view the logs. Users often set
this parameter incorrectly and must access the registry to change it.

Solution

• While logged in as the installation user with administrator authority,
use regedt32 to go into the registry.
• Move to the registry path location
HKEY_CURRENT_USER\Software\Informatica\PowerMart Client Tools\[CLIENT VERSION]\Server Manager\Session Files.
From the menu bar, select View Tree and Data.
• Select the Log File Editor entry by double clicking on it.
• Replace the entry with the appropriate editor entry, typically
WordPad.exe or Write.exe.
• Select Registry --> Exit from the menu bar to save the entry.

Advanced Server Configuration Options

Challenge

Configuring the Throttle Reader and File Debugging options, adjusting semaphore
settings in the UNIX environment, and configuring server variables.

Description

Configuring the Throttle Reader

If problems occur when running sessions, some adjustments at the Server level can
help to alleviate issues or isolate problems.

One technique that often helps resolve “hanging” sessions is to limit the number of
reader buffers by using the Throttle Reader parameter. This is particularly effective
if your mapping contains many target tables, or if the session employs
constraint-based loading. This parameter closely manages buffer blocks in memory by
restricting the number of blocks that can be utilized by the Reader.

Note for PowerCenter 5.x and above ONLY: If a session is hanging and it is
partitioned, it is best to remove the partitions before adjusting the throttle reader.
When a session is partitioned, the server makes separate connections to the source
and target for every partition. This will cause the server to manage many buffer
blocks. If the session still hangs, try adjusting the throttle reader.

Solution: To limit the number of reader buffers using Throttle Reader in NT/2000:

• Access the registry key
hkey_local_machine\system\currentcontrolset\services\powermart\parameters\miscinfo.
• Create a new String value with value name of 'ThrottleReader' and value data
of '10'.

To do the same thing in UNIX:

• Add this line to the .cfg file:

ThrottleReader=10

Configuring File Debugging Options

If problems occur when running sessions or if the PowerCenter Server has a stability
issue, help technical support to resolve the issue by supplying them with Debug files.

To turn the debug options on in NT/2000:

1. Select Start, Run, and type “regedit”.
2. Go to hkey_local_machine, system, current_control_set, services, powermart,
miscInfo.
3. Select Edit, then Add Value.
4. Enter "DebugScrubber" as the value name, then click OK. Enter "4" as the value
data.
5. Repeat steps 3 and 4 for "DebugWriter", "DebugReader", and "DebugDTM",
setting each of the three to "1".

To do the same in UNIX:

Insert the following entries in the pmserver.cfg file:

• DebugScrubber=4
• DebugWriter=1
• DebugReader=1
• DebugDTM=1
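
When the same entries have to be applied on several UNIX servers, the edit can be scripted. The sketch below works on the text of a configuration file and assumes only what the entries above show, namely one `key=value` pair per line. The function name and the `ServerName` entry in the example are hypothetical, and the real file may contain other constructs this sketch does not handle, so back up the file before changing it.

```python
from typing import Dict


def set_cfg_entries(cfg_text: str, entries: Dict[str, object]) -> str:
    """Update existing key=value lines and append any keys that are missing."""
    remaining = dict(entries)
    lines = []
    for line in cfg_text.splitlines():
        key = line.split("=", 1)[0].strip()
        if "=" in line and key in remaining:
            # Replace the existing value for this key.
            lines.append(f"{key}={remaining.pop(key)}")
        else:
            lines.append(line)
    # Anything not found in the file is appended at the end.
    lines.extend(f"{key}={value}" for key, value in remaining.items())
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    original = "ServerName=dev\nDebugWriter=0\n"  # invented file content
    debug_entries = {
        "DebugScrubber": 4,
        "DebugWriter": 1,
        "DebugReader": 1,
        "DebugDTM": 1,
    }
    print(set_cfg_entries(original, debug_entries), end="")
```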

Adjusting Semaphore Settings

The UNIX version of the PowerCenter Server uses operating system semaphores for
synchronization. You may need to increase these semaphore settings before
installing the server.

The number of semaphores required to run a session is 7. Most installations require
between 64 and 128 available semaphores, depending on the number of sessions the
server runs concurrently. This is in addition to any semaphores required by other
software, such as database servers.
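
The figures above translate directly into a capacity check. The sketch below uses the 7 semaphores per session quoted in the text. The parameter `other_software` is a hypothetical placeholder for whatever database servers and other programs on the machine require.

```python
SEMAPHORES_PER_SESSION = 7


def semaphores_needed(concurrent_sessions: int, other_software: int = 0) -> int:
    """Semaphores required for the sessions plus other software on the host."""
    return SEMAPHORES_PER_SESSION * concurrent_sessions + other_software


def max_concurrent_sessions(available_semaphores: int) -> int:
    """Sessions that can run concurrently with the semaphores available."""
    return available_semaphores // SEMAPHORES_PER_SESSION


if __name__ == "__main__":
    for available in (64, 128):
        print(available, "semaphores allow", max_concurrent_sessions(available), "sessions")
```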

The total number of available operating system semaphores is an operating system
configuration parameter, with a limit per user and system. The method used to
change the parameter depends on the operating system:

• HP-UX: Use sam(1M) to change the parameters.
• Solaris: Use admintool or edit /etc/system to change the parameters.
• AIX: Use smit to change the parameters.

Setting Shared Memory and Semaphore Parameters

Informatica recommends setting the following parameters as high as possible for the
operating system. However, if you set these parameters too high, the machine may
not boot. Refer to the operating system documentation for parameter limits.

Parameter   Recommended Value for Solaris     Description

SHMMAX      4294967295                        Maximum size in bytes of a shared memory
                                              segment.
SHMMIN      1                                 Minimum size in bytes of a shared memory
                                              segment.
SHMMNI      100                               Number of shared memory identifiers.
SHMSEG      10                                Maximum number of shared memory segments
                                              that can be attached by a process.
SEMMNS      200                               Number of semaphores in the system.
SEMMNI      70                                Number of semaphore set identifiers in the
                                              system. SEMMNI determines the number of
                                              semaphores that can be created at any one
                                              time.
SEMMSL      Equal to or greater than the      Maximum number of semaphores in one
            value of the PROCESSES            semaphore set. Must be equal to the
            initialization parameter          maximum number of processes.

For example, you might add the following lines to the Solaris /etc/system file to
configure the UNIX kernel:

set shmsys:shminfo_shmmax = 4294967295
set shmsys:shminfo_shmmin = 1
set shmsys:shminfo_shmmni = 100
set shmsys:shminfo_shmseg = 10
set semsys:seminfo_semmns = 200
set semsys:seminfo_semmni = 70

Always reboot the system after configuring the UNIX kernel.

Configuring Server Variables

One configuration best practice is to properly configure and leverage Server
variables. Benefits of using server variables:

• Ease of deployment from development environment to production
environment.
• Ease of switching sessions from one server machine to another without
manually editing all the sessions to change directory paths.
• All the variables are related to directory paths used by server.

Approach

In Server Manager, edit the server configuration to set or change the variables.

Each registered server has its own set of variables. The list is fixed, not user-
extensible.

Server Variable             Value

$PMRootDir                  (no default – user must insert a path)
$PMSessionLogDir            $PMRootDir/SessLogs
$PMBadFileDir               $PMRootDir/BadFiles
$PMCacheDir                 $PMRootDir/Cache
$PMTargetFileDir            $PMRootDir/TargetFiles
$PMSourceFileDir            $PMRootDir/SourceFiles
$PMExtProcDir               $PMRootDir/ExtProc
$PMSuccessEmailUser         (no default – user must insert a value)
$PMFailureEmailUser         (no default – user must insert a value)
$PMSessionLogCount          0
$PMSessionErrorThreshold    0

Where are these variables referenced?

• Server manager session editor: anywhere in the fields for session log
directory, bad file directory, etc.
• Designer: Aggregator/Rank/Joiner attribute for ‘Cache Directory’;
External Procedure attribute for ‘Location’

Does every session and mapping have to use these variables (are they mandatory)?

• No. If you remove any variable reference from the session or the widget
attributes then the server does not use that variable.

What if a variable is not referenced in the session or mapping?

• The variable is just a convenience; the user can choose to use it or not. The
variable will be expanded only if it is explicitly referenced from another
location. If the session log directory is specified as $PMSessionLogDir, then
the logs are put in that location.

Note that this location may be different on every server. This is in fact a primary
purpose for utilizing variables. But if the session log directory field is changed to
designate a specific location, e.g. ‘/home/john/logs’, then the session logs will
instead be placed in the directory location as designated. (The variable
$PMSessionLogDir will be unused so it does not matter what the value of the variable
is set to).
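
The behavior described in these answers can be illustrated with a small model of variable expansion. The Python below is not how the PowerCenter Server is implemented. It is a sketch that substitutes `$PM...` names from a per-server dictionary, using the defaults from the table above. The root directories `/opt/pm_dev` and `/opt/pm_prod` are invented for the example.

```python
import re
from typing import Dict

# Directory defaults from the server variable table above.
DEFAULTS = {
    "$PMSessionLogDir": "$PMRootDir/SessLogs",
    "$PMBadFileDir": "$PMRootDir/BadFiles",
    "$PMCacheDir": "$PMRootDir/Cache",
    "$PMTargetFileDir": "$PMRootDir/TargetFiles",
    "$PMSourceFileDir": "$PMRootDir/SourceFiles",
    "$PMExtProcDir": "$PMRootDir/ExtProc",
}

VARIABLE_PATTERN = re.compile(r"\$PM[A-Za-z]+")


def expand(value: str, variables: Dict[str, str], max_depth: int = 10) -> str:
    """Replace $PM... references until none are left or max_depth is reached."""
    for _ in range(max_depth):
        expanded = VARIABLE_PATTERN.sub(
            lambda match: variables.get(match.group(0), match.group(0)), value
        )
        if expanded == value:
            break
        value = expanded
    return value


if __name__ == "__main__":
    dev_server = {**DEFAULTS, "$PMRootDir": "/opt/pm_dev"}    # hypothetical path
    prod_server = {**DEFAULTS, "$PMRootDir": "/opt/pm_prod"}  # hypothetical path
    print(expand("$PMSessionLogDir", dev_server))
    print(expand("$PMSessionLogDir", prod_server))
    print(expand("/home/john/logs", prod_server))  # no variable, used as given
```

The same session definition resolves to a different directory on each server, while a hard-coded path is unaffected by the variable values.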

Platform Sizing

Challenge

Determining the appropriate platform size to support PowerCenter, considering
specific environmental and processing requirements. Sizing may not be an easy task
because it may be necessary to configure a single server to support numerous
applications.

Description

This Best Practice provides general guidance for sizing computing environments. It
also discusses some potential questions and pitfalls that may arise when migrating to
Production.

Certain terms used within this Best Practice are specific to Informatica’s
PowerCenter. Please consult the appropriate PowerCenter manuals for explanation of
these terms where necessary.

Environmental configurations may vary greatly with regard to hardware and software
sizing. In addition to requirements for PowerCenter, other applications may share the
server. Be sure to consider all mandatory server software components, including the
operating system and all of its components, the database engine, front-end engines,
etc. Regardless of whether or not the server is shared, it will be necessary to
research the requirements of these additional software components when estimating
the size of the overall environment.

Technical Information

Before delving into key sizing questions, let us review the PowerCenter engine and
its associated resource needs. Each session:

• Represents an active task that performs data loading. PowerCenter provides
session parameters that can be set to specify the amount of required shared
memory per session. This shared memory setting is important, as it will
dictate the amount of RAM required when running concurrent sessions, and
will also be used to provide a level of performance that meets your needs.
• Uses up to 140% of CPU resources. This is important to remember if sessions
will be executed concurrently.
• Requires 20-30 MB of memory per session if there are no aggregations,
lookups, or heterogeneous data joins contained within the mapping.
• May require additional memory for the caching of aggregations, lookups, or
joins. The amount of memory can be calculated per session. Refer to the
Session and Server guide to determine the exact amount of memory
necessary per session.

Note: It may be helpful to refer to the Performance Tuning section in Phase 4 of the
Informatica Methodology when determining memory settings. The Performance
Tuning section provides additional information on factors that typically affect session
performance, and offers general guidance for estimating session resources.

The PowerCenter engine:

Requires 20-30 MB of memory for the main server engine for session coordination.

Requires additional memory when caching for aggregation, lookups, or joins,
because:

• Lookup tables, when cached in full, result in memory consumption
commensurate with the size of the tables involved.
• Aggregate caches store the individual groups; more memory is used if there
are more groups.
• Joins cache the master table; memory consumed depends on the size of
the master.

Note: Sorting the input to aggregations will greatly reduce the need for memory.
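
The memory figures above can be combined into a rough upper-bound estimate. The following Python sketch is illustrative only: the class, function, and session names are hypothetical, the cache sizes are placeholder inputs that would come from the Session and Server guide calculations, and the upper end of each 20-30 MB range is used to stay conservative.

```python
from dataclasses import dataclass

# Baseline figures quoted in the text (MB); the upper bound is an assumption
# made here to keep the estimate conservative.
ENGINE_BASE_MB = 30    # main server engine: 20-30 MB
SESSION_BASE_MB = 30   # each session without caches: 20-30 MB


@dataclass
class SessionEstimate:
    """Hypothetical per-session sizing inputs (not a PowerCenter object)."""
    name: str
    lookup_cache_mb: float = 0.0     # roughly the size of fully cached lookup tables
    aggregate_cache_mb: float = 0.0  # grows with the number of groups
    joiner_cache_mb: float = 0.0     # roughly the size of the master table

    def total_mb(self) -> float:
        return (SESSION_BASE_MB + self.lookup_cache_mb
                + self.aggregate_cache_mb + self.joiner_cache_mb)


def estimate_ram_mb(concurrent_sessions: list[SessionEstimate]) -> float:
    """Upper-bound RAM for the engine plus every session running at once."""
    return ENGINE_BASE_MB + sum(s.total_mb() for s in concurrent_sessions)


if __name__ == "__main__":
    sessions = [
        SessionEstimate("s_load_sales", lookup_cache_mb=200, aggregate_cache_mb=50),
        SessionEstimate("s_load_customers"),
    ]
    print(f"Estimated RAM: {estimate_ram_mb(sessions):.0f} MB")
```

Only sessions that actually run at the same time should be passed in together; sessions that run one after another can reuse the same memory.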

Disk space is not a factor if the machine is dedicated exclusively to the server
engine. However, if the following conditions exist, disk space will need to be carefully
considered:

• Data is staged to flat files on the PowerCenter server.
• Data is stored in incremental aggregation files for adding data to aggregates.
The space consumed is about the size of the data aggregated.
• Temporary space is not used like a database on disk, unless the cache
requires it after filling system memory.
• Data does not need to be striped to prevent head contention. This includes
all types of data such as flat files and database tables.

Key Questions

The goal of this analysis is to size the machine so that the ETL processes can
complete within the specified load window.

Consider the following questions when estimating the required number of sessions,
the volume of data moved per session, and the caching requirements for the
session’s lookup tables, aggregation, and heterogeneous joins. Use these estimates
along with recommendations in the preceding Technical Information section to
determine the required number of processors, memory, and disk space to achieve
the required performance to meet the load window.

Please note that the hardware sizing analysis is highly dependent on the
environment in which the server is deployed. It is very important to understand the
performance characteristics of the environment before making any sizing
conclusions.

It is vitally important to remember that in addition to PowerCenter, other
applications may be vying for server resources. PowerCenter commonly runs on a
server that also hosts a database engine plus query/analysis tools. In an
environment where PowerCenter runs in parallel with all of these tools, the
query/analysis tool often drives the hardware requirements. However, if the ETL
processing is performed after business hours, the query/analysis tool requirements
may not impose a sizing limitation.

With these additional processing requirements in mind, consider platform size in light
of the following questions:

• What sources are accessed by the mappings?
• How do you currently access those sources?
• Do the sources reside locally, or will they be accessed via a network
connection?
• What kind of network connection exists?
• Have you decided on the target environment (database/hardware/operating
system)? If so, what is it?
• Have you decided on the PowerCenter server environment
(hardware/operating system)?
• Is it possible for the PowerCenter server to be on the same machine as the
target?
• How will information be accessed for reporting purposes (e.g., cube, ad-hoc
query tool, etc.) and what tools will you use to implement this access?
• What other applications or services, if any, run on the PowerCenter server?
• Has the database table space been distributed across controllers, where
possible, to maximize throughput by reading and writing data in parallel?

When considering the server engine size, answer the following questions:

• Are there currently extract, transform, and load processes in place? If so,
what are the processes, and how long do they take?
• What is the total volume of data that must be moved, in bytes?
• What is the largest table (bytes and rows)? Is there any key on this table that
could be used to partition load sessions, if necessary?
• How will the data be moved: via flat file processing or relational tables?
• What is the load strategy: is the data updated, incrementally loaded, or will
all tables be truncated and reloaded?
• Will the data processing require staging areas?
• What is the load plan? Are there dependencies between facts and
dimensions?
• How often will the data be refreshed?
• Will the refresh be scheduled at a certain time, or driven by external events?
• Is there a "modified" timestamp on the source table rows, enabling
incremental load strategies?
• What is the size of the batch window that is available for the load?
• Does the load process populate detail data, aggregations, or both?
• If data is being aggregated, what is the ratio of source/target rows for the
largest result set? How large is the result set (bytes and rows)?

The answers to these questions will provide insight into the factors that impact
PowerCenter's resource requirements. To simplify the analysis, focus on large,
"critical path" jobs that drive the resource requirement.

A Sample Performance Result

The following is a testimonial from a customer configuration. Please note that these
performance tests were run on a previous version of PowerCenter, which did not
include the performance and functional enhancements in release 5.1. These results
are offered as one example of throughput. However, results will definitely vary by
installation because each environment has a unique architecture and unique data
characteristics.

The performance tests were performed on a 4-processor Sun E4500 with 2GB of
memory. This server handled just under 20.5 million rows, and more than 2.8GB
of data, in less than 54 minutes. In this test scenario, 22 sessions ran in parallel,
populating a large product sales table. Four sessions ran after the set of 22,
populating various summarization tables based on the product sales table. All of the
mappings were complex, joining several sources and utilizing several Expression,
Lookup and Aggregation transformations. Oracle was used for both the source and
target databases, and both were hosted locally on the ETL server.
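
To compare these results with a planned load window, it can help to convert them into rates. The sketch below treats the quoted figures as exact, which is an assumption (the text says "just under", "more than", and "less than"), so the output is an approximation of aggregate throughput across all sessions, not a per-session figure.

```python
# Figures from the sample test, treated as exact for illustration.
ROWS = 20_500_000   # "just under 20.5 million rows"
DATA_GB = 2.8       # "more than 2.8GB of data"
MINUTES = 54        # "less than 54 minutes"


def throughput(rows: int, data_gb: float, minutes: float) -> dict[str, float]:
    """Convert totals into aggregate rates for the whole run."""
    seconds = minutes * 60
    data_bytes = data_gb * 1024 ** 3
    return {
        "rows_per_sec": rows / seconds,
        "mb_per_sec": data_bytes / 1024 ** 2 / seconds,
        "bytes_per_row": data_bytes / rows,
    }


if __name__ == "__main__":
    for metric, value in throughput(ROWS, DATA_GB, MINUTES).items():
        print(f"{metric}: {value:,.2f}")
```

Under these assumptions the run averages roughly 6,300 rows per second and a little under 1 MB per second, with an average row of about 147 bytes.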

Links

The following link may prove helpful when determining the platform
size: www.tpc.org. This website contains benchmarking reports that will help you
fine tune your environment and may assist in determining processing power
required.

Running Sessions in Recovery Mode

Challenge

Use PowerCenter standard functionality to recover data that is committed to a
session's targets, even if the session does not complete.

Description

When a network or other problem causes a session whose source contains a million
rows to fail after only half of the rows are committed to the target, one option is to
truncate the target and run the session again from the beginning. But that is not the
only option. Rather than processing the first half of the source again, you can tell the
server to keep data already committed to the target database and process the rest of
the source. This results in accurate and complete target data, as if the session
completed successfully with one run. This technique is called performing recovery.

When you run a session in recovery mode, the server notes the row id of the last row
committed to the target database. The server then reads all sources again, but only
processes from the subsequent row id. For example, if the server commits 1000 rows
before the session fails, when you run the session in recovery mode, the server
reads all source tables, and then passes data to the Data Transformation Manager
(DTM) starting from row 1001.

When necessary, the server can recover the same session more than once. That is, if
a session fails while running in recovery mode, you can re-run the session in
recovery mode until the session completes successfully. This is called nested
recovery.
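
The behavior described above (read every source row again, but process only the rows after the last committed row) can be illustrated with a small simulation. This is a conceptual Python sketch, not PowerCenter code: the function, the exception class, and the in-memory "target" list are hypothetical, and each row is treated as its own commit for simplicity.

```python
from typing import Iterable, Optional


class TargetFailure(Exception):
    """Stand-in for a network or database error during the load."""


def run_session(source: Iterable[dict], target: list,
                last_committed_row: int = 0,
                fail_at_row: Optional[int] = None) -> int:
    """Read every source row, but only process rows after last_committed_row.

    Returns the number of rows now held in the target.
    """
    for row_id, row in enumerate(source, start=1):
        if row_id <= last_committed_row:
            continue                 # row is read again but not reprocessed
        if fail_at_row is not None and row_id == fail_at_row:
            raise TargetFailure(f"session failed at row {row_id}")
        target.append(row)           # simulated commit
    return len(target)


if __name__ == "__main__":
    source = [{"id": i} for i in range(1, 2001)]
    target: list = []

    # First run fails after 1000 rows are committed.
    try:
        run_session(source, target, fail_at_row=1001)
    except TargetFailure:
        pass

    # Recovery run fails again (nested recovery will be needed).
    try:
        run_session(source, target, last_committed_row=len(target), fail_at_row=1501)
    except TargetFailure:
        pass

    # Second recovery run completes from row 1501.
    run_session(source, target, last_committed_row=len(target))
    print(len(target), target == source)
```

Each recovery run resumes from the row after the last committed one, so the final target matches what a single uninterrupted run would have produced, provided the source has not changed in the meantime.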

The server can recover committed target data if the following three criteria are met:

• All session targets are relational. The server can only perform recovery on
relational tables. If the session has file targets, the server cannot perform
recovery. If a session writing to file targets fails, delete the files, and run the
session again.
• The session is configured for a normal (not bulk) target load. The
server uses database logging to perform recovery. Since bulk loading
bypasses database logging, the server cannot recover sessions configured to
bulk load targets. Although recovering a large session can be more efficient
than running the session again, bulk loading increases general session
performance. When configuring session properties for sessions processing
large amounts of data, weigh the importance of performing recovery when
choosing a target load type.

When you configure a session to load in bulk, the server logs a message in
the session log stating that recovery is not supported.

• The server configuration parameter Disable Recovery is not selected.
When the Disable Recovery option is checked, the server does not create the
OPB_SRVR_RECOVERY table in the target database to store recovery-related
information. If the table already exists, the server does not write information
to that table.

In addition, to ensure accurate results from the recovery, the following must be true:

• Source data does not change before performing recovery. This includes
inserting, updating, and deleting source data. Changes in source files or
tables can result in inaccurate data.
• The mapping used in the session does not use a Sequence Generator
or Normalizer. Both the Sequence Generator and the Normalizer
transformations generate source values: the Sequence Generator generates
sequences, and the Normalizer generates primary keys. Therefore, sessions
using these transformations are not guaranteed to return the same values
when performing recovery.
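
The criteria above lend themselves to a pre-flight checklist. The following Python sketch shows one way to encode them; the `SessionConfig` fields are hypothetical names chosen for illustration and do not correspond to any PowerCenter API or repository table.

```python
from dataclasses import dataclass, field


@dataclass
class SessionConfig:
    """Hypothetical description of a session, filled in by hand."""
    target_types: list[str]                   # e.g. ["relational"] or ["relational", "file"]
    bulk_load: bool = False                   # target load type is bulk
    disable_recovery: bool = False            # server-level Disable Recovery setting
    transformations: list[str] = field(default_factory=list)
    source_changed_since_failure: bool = False


def recovery_problems(cfg: SessionConfig) -> list[str]:
    """Return the reasons recovery is unsupported or may give inaccurate results."""
    problems = []
    if any(t != "relational" for t in cfg.target_types):
        problems.append("session has a non-relational (for example, file) target")
    if cfg.bulk_load:
        problems.append("session is configured for bulk load")
    if cfg.disable_recovery:
        problems.append("Disable Recovery is selected on the server")
    if cfg.source_changed_since_failure:
        problems.append("source data changed since the failed run")
    for name in sorted({"Sequence Generator", "Normalizer"} & set(cfg.transformations)):
        problems.append(f"mapping uses a {name} transformation")
    return problems


if __name__ == "__main__":
    cfg = SessionConfig(["relational", "file"], bulk_load=True,
                        transformations=["Expression", "Normalizer"])
    for problem in recovery_problems(cfg):
        print("-", problem)
```

An empty result means none of the listed conditions was violated; it does not guarantee that recovery will succeed.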

Session Logs

If a session is configured to archive session logs, the server creates a new session
log for the recovery session. If you perform nested recovery, the server creates a
new log for each session run. If the session is not configured to archive session logs,
the server overwrites the existing log when you recover the session.

Reject Files

When performing recovery, the server creates a single reject file. The server
appends rejected rows from the recovery session (or sessions) to the session reject
file. This allows you to correct and load all rejected rows from the completed session.

Example

Session “s_recovery” reads from a Sybase source and writes to a target table in
“production_target”, a Microsoft SQL Server database. This session is configured for
a normal load. The mapping consists of:

Source Qualifier: SQ_LINEITEM
Expression transformation: EXP_TRANS
Target: T_LINEITEM

The session is configured to save 5 session logs.

First Run

The first time the session runs, the server creates a session log named
s_recovery.log. (If the session is configured to save logs by timestamp, the server
appends the date and time to the log file name.) The server also creates a reject file
for the target table named t_lineitem.bad.

The following section of the session log shows the server preparing to load normally
to the production_target database. Since the server cannot find
OPB_SRVR_RECOVERY, it creates the table.

CMN_1053 Writer: Target is database [TOMDB@PRODUCTION_TARGET], user [lchen], bulk mode [OFF]

...

CMN_1039 SQL Server Event

CMN_1039 [01/14/99 18:42:44 SQL Server Message 208 : Invalid object name 'OPB_SRVR_RECOVERY'.]

Thu Jan 14 18:42:44 1999

CMN_1040 SQL Server Event

CMN_1040 [01/14/99 18:42:44 DB-Library Error 10007 : General SQL Server error: Check messages from the SQL Server.]

Thu Jan 14 18:42:44 1999

CMN_1022 Database driver error...

CMN_1022 [Function Name : Execute

SqlStmt : SELECT SESSION_ID FROM OPB_SRVR_RECOVERY]

WRT_8017 Created OPB_SRVR_RECOVERY table in target database.

As the following session log shows, the server performs six target-based commits before the session fails.

TM_6095 Starting Transformation Engine...

Start loading table [T_LINEITEM] at: Thu Jan 14 18:42:50 1999

TARGET BASED COMMIT POINT Thu Jan 14 18:43:59 1999

=============================================

Table: T_LINEITEM

Rows Output: 10125

Rows Applied: 10125

Rows Rejected: 0

TARGET BASED COMMIT POINT Thu Jan 14 18:45:09 1999

=============================================

Table: T_LINEITEM

Rows Output: 20250

Rows Applied: 20250

Rows Rejected: 0

TARGET BASED COMMIT POINT Thu Jan 14 18:46:25 1999

=============================================

Table: T_LINEITEM

Rows Output: 30375

Rows Applied: 30375

Rows Rejected: 0

TARGET BASED COMMIT POINT Thu Jan 14 18:47:31 1999

=============================================

Table: T_LINEITEM

Rows Output: 40500

Rows Applied: 40500

Rows Rejected: 0

TARGET BASED COMMIT POINT Thu Jan 14 18:48:35 1999

=============================================

Table: T_LINEITEM

Rows Output: 50625

Rows Applied: 50625

Rows Rejected: 0

TARGET BASED COMMIT POINT Thu Jan 14 18:49:41 1999

=============================================

Table: T_LINEITEM

Rows Output: 60750

Rows Applied: 60750

Rows Rejected: 0

When a session fails, you can truncate the target and run the entire session again.
However, since the server committed more than 60,000 rows to the target, rather
than running the whole session again, you can configure the session to recover the
committed rows.
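
To confirm how many rows were committed before a failure, the commit points can be pulled out of the session log text. The Python sketch below assumes the log layout shown in the excerpt above (a TARGET BASED COMMIT POINT line followed by Table, Rows Output, Rows Applied, and Rows Rejected lines); other PowerCenter versions or commit types may format these entries differently, and the function name is illustrative.

```python
import re

# Matches one commit-point block in the layout shown in the log excerpt.
COMMIT_BLOCK = re.compile(
    r"TARGET BASED COMMIT POINT.*?"
    r"Table:\s*(?P<table>\S+).*?"
    r"Rows Output:\s*(?P<output>\d+).*?"
    r"Rows Applied:\s*(?P<applied>\d+).*?"
    r"Rows Rejected:\s*(?P<rejected>\d+)",
    re.DOTALL,
)

SAMPLE_LOG = """
TARGET BASED COMMIT POINT Thu Jan 14 18:48:35 1999
=============================================
Table: T_LINEITEM
Rows Output: 50625
Rows Applied: 50625
Rows Rejected: 0
TARGET BASED COMMIT POINT Thu Jan 14 18:49:41 1999
=============================================
Table: T_LINEITEM
Rows Output: 60750
Rows Applied: 60750
Rows Rejected: 0
"""


def last_commit_points(log_text: str) -> dict[str, int]:
    """Map each target table to the Rows Applied value of its last commit point."""
    last: dict[str, int] = {}
    for match in COMMIT_BLOCK.finditer(log_text):
        last[match.group("table")] = int(match.group("applied"))
    return last


if __name__ == "__main__":
    for table, applied in last_commit_points(SAMPLE_LOG).items():
        print(f"{table}: {applied} rows committed")
```

For the excerpt above this reports 60750 committed rows for T_LINEITEM, which is the point from which a recovery run would continue.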

Running a Recovery Session

To run a recovery session, check the Perform Recovery option on the Log Files tab of
the session property sheet.

To archive the existing session log, either increase the number of session logs saved,
or choose the Save Session Log By Timestamp option on the Log Files tab.

Start the session, or if necessary, edit the session schedule and reschedule the
session.

Second Run (Recovery Session)

When you run the session in recovery mode, the server creates a new session log.
Since the session is configured to save multiple logs, it renames the existing log to
s_recovery.log.0, and writes all new session information to s_recovery.log. The
server reopens the existing reject file (t_lineitem.bad) and appends any rejected
rows to that file.

When performing recovery, the server reads the source, and then passes data to the
DTM beginning with the first uncommitted row. In the session log below, the server
notes that the session is in recovery mode, and states the row at which it will begin
recovery (i.e., row 60751).

TM_6098 Session [s_recovery] running in recovery mode.

TM_6026 Recovering from row [60751] for target instance [T_LINEITEM].

When running the session with the Verbose Data tracing level, the server provides
more detailed information about the session. As seen below, the server sets row
60751 as the row from which to recover. It opens the existing reject file and begins
processing with the next row, 60752.

Note: Setting the tracing level to Verbose Data slows the server's performance and
is not recommended for most production sessions.

CMN_1053 SetRecoveryInfo for transform(T_LINEITEM): Rows To Recover From = [60751]:

CMN_1053 Current Transform [SQ_lineitem]: Rows To Consume From = [60751]:

CMN_1053 Output Transform [EXPTRANS]: Rows To Produce From = [60751]:

CMN_1053 Current Transform [EXPTRANS]: Rows To Consume From = [60751]:

CMN_1053 Output Transform [T_LINEITEM]: Rows To Produce From = [60751]:

CMN_1053 Writer: Opened bad (reject) file [C:\winnt\system32\BadFiles\t_lineitem.bad]

Third Run (Nested Recovery)

If the recovery session fails before completing, you can run the session in recovery
mode again. The server runs the session as it did the earlier recovery session,
creating a new session log and appending bad data to the reject file.

You can run the session in recovery mode as many times as necessary to complete
the session's target tables.

When the server completes loading target tables, it performs any configured post-
session stored procedures or commands normally, as if the session completed in a
single run.

Returning to Normal Session

After successfully recovering a session, you must edit the session properties to clear
the Perform Recovery option. If necessary, return the session to its normal schedule
and reschedule the session.

Things to Consider

In PowerCenter 5.1, the DisableRecovery server initialization flag defaults to Yes.
This means the OPB_SRVR_RECOVERY table will not be created, and ‘Perform
Recovery’ will not be possible unless this flag is changed to No during server
configuration. You will need to have “create table” permissions in the target database
in order to create this table.

Developing the Business Case

Challenge

Identifying the departments and individuals that are likely to benefit directly from the project implementation.
Understanding these individuals, and their business information requirements, is key to defining and scoping the
project.

Description

The following four steps summarize business case development and lay a good foundation for
proceeding into detailed business requirements for the project.

1. One of the first steps in establishing the business scope is identifying the project beneficiaries and
understanding their business roles and project participation. In many cases, the Project Sponsor can
help to identify the beneficiaries and the various departments they represent. This information can then
be summarized in an organization chart that is useful for ensuring that all project team members
understand the corporate/business organization.

• Activity - Interview project sponsor to identify beneficiaries, define their business roles and
project participation.
• Deliverable - Organization chart of corporate beneficiaries and
participants.

2. The next step in establishing the business scope is to understand the business problem or need that
the project addresses. This information should be clearly defined in a Problem/Needs Statement, using
business terms to describe the problem. For example, the problem may be expressed as "a lack of
information" rather than "a lack of technology" and should detail the business decisions or analysis that
is required to resolve the lack of information. The best way to gather this type of information is by
interviewing the Project Sponsor and/or the project beneficiaries.

• Activity - Interview (individually or in forum) Project Sponsor and/or
beneficiaries regarding problems and needs related to project.
• Deliverable - Problem/Need Statement

3. The next step in creating the project scope is defining the business goals and objectives for the
project and detailing them in a comprehensive Statement of Project Goals and Objectives. This
statement should be a high-level expression of the desired business solution (e.g., what strategic or
tactical benefits the business expects to gain from the project) and should avoid any technical
considerations at this point. Again, the Project Sponsor and beneficiaries are the best sources for this
type of information. It may be practical to combine information gathering for the needs assessment and
goals definition, using individual interviews or general meetings to elicit the information.

• Activity - Interview (individually or in forum) Project Sponsor and/or
beneficiaries regarding business goals and objectives for the project.
• Deliverable - Statement of Project Goals and Objectives

4. The final step is creating a Project Scope and Assumptions statement that clearly defines the
boundaries of the project based on the Statement of Project Goals and Objectives and the associated
project assumptions. This statement should focus on the type of information or analysis that will be
included in the project rather than what will not.

The assumptions statements are optional and may include qualifiers on the scope, such as assumptions
of feasibility, specific roles and responsibilities, or availability of resources or data.

• Activity - Business Analyst develops Project Scope and Assumptions
statement for presentation to the Project Sponsor.
• Deliverable - Project Scope and Assumptions statement

Assessing the Business Case

Challenge

Developing a solid business case for the project that includes both the tangible and intangible potential benefits
of the project.

Description

The Business Case should include both qualitative and quantitative assessments of the project.

The Qualitative Assessment portion of the Business Case is based on the Statement of Problem/Need and the
Statement of Project Goals and Objectives (both generated in Subtask 1.1.1) and focuses on discussions with the
project beneficiaries of expected benefits in terms of problem alleviation, cost savings or controls, and increased
efficiencies and opportunities.

The Quantitative Assessment portion of the Business Case provides specific measurable details of the
proposed project, such as the estimated ROI, which may involve the following calculations:

• Cash flow analysis - Projects positive and negative cash flows for the
anticipated life of the project. Typically, ROI measurements use the cash flow
formula to depict results.

• Net present value - Evaluates cash flow according to the long-term value of
current investment. Net present value shows how much capital needs to be
invested currently, at an assumed interest rate, in order to create a stream of
payments over time. For instance, to generate an income stream of $500 per
month over six months at an interest rate of eight percent per payment period
would require an investment (a net present value) of $2,311.44.

• Return on investment - Calculates net present value of total incremental
cost savings and revenue divided by the net present value of total costs
multiplied by 100. This type of ROI calculation is frequently referred to as
return on equity or return on capital employed.

• Payback - Determines how much time will pass before an initial capital
investment is recovered.
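
The calculations listed above can be sketched in a few lines of Python. The function names are illustrative. Note one assumption: the $2,311.44 figure in the net present value example is reproduced only when the eight percent rate is applied per monthly payment period, so the sketch uses that reading.

```python
from typing import Optional


def present_value_of_payments(payment: float, rate_per_period: float,
                              periods: int) -> float:
    """Present value of equal payments made at the end of each period."""
    return sum(payment / (1 + rate_per_period) ** t for t in range(1, periods + 1))


def roi_percent(npv_benefits: float, npv_costs: float) -> float:
    """NPV of incremental savings and revenue, divided by NPV of costs, times 100."""
    return npv_benefits / npv_costs * 100


def payback_period(initial_investment: float,
                   cash_flows: list[float]) -> Optional[int]:
    """First period in which cumulative cash flow covers the investment."""
    cumulative = 0.0
    for period, cash in enumerate(cash_flows, start=1):
        cumulative += cash
        if cumulative >= initial_investment:
            return period
    return None  # not recovered within the periods supplied


if __name__ == "__main__":
    pv = present_value_of_payments(500, 0.08, 6)
    print(f"Investment required: ${pv:,.2f}")
    print(f"Payback on $2,000 at $500 per month: {payback_period(2000, [500] * 6)} months")
```

The payback function here uses undiscounted cash flows, which is the simplest form of the measure; a discounted variant would apply the same discounting as the present value function before accumulating.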

The following are steps to calculate the quantitative business case or ROI:

Step 1. Develop Enterprise Deployment Map. This is a model of the project phases over a timeline,
estimating as specifically as possible customer participation (e.g., by department and location), subject area and
type of information/analysis, numbers of users, numbers of data marts and data sources, types of sources, and
size of data set.

Step 2. Analyze Potential Benefits. Discussions with representative managers and users or the Project
Sponsor should reveal the tangible and intangible benefits of the project. The most effective format for
presenting this analysis is often a "before" and "after" format that compares the current situation to the project
expectations.

Step 3. Calculate Net Present Value for all Benefits. Information gathered in this step should help the
customer representatives to understand how the expected benefits will be allocated throughout the organization
over time, using the enterprise deployment map as a guide.

Step 4. Define Overall Costs. Customers need specific cost information in order to assess the dollar impact of
the project. Cost estimates should address the following fundamental cost components:

• Hardware
• Networks
• RDBMS software
• Back-end tools
• Query/reporting tools
• Internal labor
• External labor
• Ongoing support
• Training

Step 5. Calculate Net Present Value for all Costs. Use either actual cost estimates or percentage-of-cost
values (based on cost allocation assumptions) to calculate costs for each cost component, projected over the
timeline of the enterprise deployment map. Actual cost estimates are more accurate than percentage-of-cost
allocations, but much more time-consuming. The percentage-of-cost allocation process may be valuable for initial
ROI snapshots until costs can be more clearly predicted.

Step 6. Assess Risk, Adjust Costs and Benefits Accordingly. Review potential risks to the project and make
corresponding adjustments to the costs and/or benefits. Some of the major risks to consider are:

• Scope creep, which can be mitigated by thorough planning and tight project
scope
• Integration complexity, which can be reduced by standardizing on vendors
with integrated product sets or open architectures
• Architectural strategy that is inappropriate
• Other miscellaneous risks from management or end users who may withhold
project support; from the entanglements of internal politics; and from
technologies that don't function as promised

Step 7. Determine Overall ROI. When all other portions of the business case are complete, calculate the
project's "bottom line". Determining the overall ROI is simply a matter of subtracting net present value of total
costs from net present value of (total incremental revenue plus cost savings).
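
Steps 3 through 7 can be tied together in a short calculation. The Python sketch below is a simplified illustration: the yearly amounts, the discount rate, and the risk factors are hypothetical inputs, and risk is modeled as a simple multiplier on each stream, which is only one of several ways to make the Step 6 adjustment.

```python
def npv(amounts: list[float], rate: float) -> float:
    """Discount per-period amounts; the first amount falls one period out."""
    return sum(a / (1 + rate) ** t for t, a in enumerate(amounts, start=1))


def overall_roi(benefits: list[float], costs: list[float], rate: float,
                benefit_risk_factor: float = 1.0,
                cost_risk_factor: float = 1.0) -> float:
    """Step 7: NPV of risk-adjusted benefits minus NPV of risk-adjusted costs."""
    adjusted_benefits = [b * benefit_risk_factor for b in benefits]  # Step 6
    adjusted_costs = [c * cost_risk_factor for c in costs]           # Step 6
    return npv(adjusted_benefits, rate) - npv(adjusted_costs, rate)  # Steps 3, 5, 7


if __name__ == "__main__":
    # Hypothetical three-year deployment map: revenue plus cost savings vs. costs.
    benefits = [200_000.0, 400_000.0, 500_000.0]
    costs = [450_000.0, 150_000.0, 100_000.0]
    result = overall_roi(benefits, costs, rate=0.08,
                         benefit_risk_factor=0.9, cost_risk_factor=1.1)
    print(f"Risk-adjusted net value: ${result:,.0f}")
```

Setting both risk factors to 1.0 gives the unadjusted figure, which makes it easy to show the Project Sponsor how much of the result depends on the risk assumptions.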

For more detail on these steps, refer to the Informatica White Paper: 7 Steps to Calculating Data
Warehousing ROI.

Defining and Prioritizing Requirements

Challenge

Defining and prioritizing business and functional requirements is often accomplished through a combination of
interviews and facilitated meetings (i.e., workshops) between the Project Sponsor and beneficiaries and the
Project Manager and Business Analyst.

Description

The following three steps are key for successfully defining and prioritizing requirements:

Step 1: Discovery

Individual (or small-group) interviews with high-level management often yield a focus and clarity of
vision that may be hindered in large meetings or unavailable from lower-level management. On the
other hand, detailed review of existing reports and current analysis from the company's "information providers"
can fill in helpful details.

As part of the initial "discovery" process, Informatica generally recommends several interviews at the Project
Sponsor and/or upper management level and a few with those acquainted with current reporting and analysis
processes. A few peer group forums can also be valuable.

However, this part of the process must be focused and brief; otherwise it can become unwieldy, with much time
spent trying to coordinate calendars among forum participants. Set a time period and target list of
participants with the Project Sponsor, but avoid lengthening the process if some participants aren't available.

Questions during these sessions should include the following:

• What are the target business functions, roles, and responsibilities?
• What are the key relevant business strategies, decisions, and processes (in
brief)?
• What information is important to drive, support, and measure success for
those strategies/processes? What key metrics? What dimensions for those
metrics?
• What current reporting and analysis is applicable? Who provides it? How is it
presented? How is it used?

Step 2: Validation and Prioritization

The Business Analyst, with the help of the Project Architect, documents the findings of the discovery process. The
resulting Business Requirements Specification includes a matrix linking the specific business requirements to their
functional requirements.

At the same time, the Architect develops the Information Requirements Specification in order to clearly represent
the structure of the information requirements. This document, based on the business requirements findings, will
facilitate discussion of informational details and provide the starting point for the target model definition.

The detailed business requirements and information requirements should be reviewed with the project
beneficiaries and prioritized based on business need and the stated project objectives and scope.

Step 3: The Incremental Roadmap

Concurrent with the validation of the business requirements, the Architect begins the Functional Requirements
Specification providing details on the technical requirements for the project.

As general technical feasibility is compared to the prioritization from Step 2, the Project Manager, Business
Analyst, and Architect develop consensus on a project "phasing" approach. Items of secondary priority and those
with poor near-term feasibility are relegated to subsequent phases of the project. Thus, they develop a phased,
or incremental, "roadmap" for the project (Project Roadmap).

This is presented to the Project Sponsor for approval and becomes the first "Increment" or starting point for the
Project Plan.

Developing a WBS

Challenge

Developing a comprehensive work breakdown structure that clearly depicts all of the tasks and subtasks
required to complete the project. Because project time and resource estimates are typically based on the Work
Breakdown Structure (WBS), it is critical to develop a thorough, accurate WBS.

Description

A WBS is a tool for identifying and organizing the tasks that need to be completed in a project. The WBS serves
as a starting point for both the project estimate and the project plan.

One challenge in developing a good WBS is obtaining the correct balance between enough detail and too much
detail. The WBS shouldn't be a 'grocery list' of every minor detail in the project, but it does need to break the
tasks down to a manageable level of detail. One general guideline is to keep task detail to a duration of at least a
day.

It is also important to remember that the WBS is not necessarily a sequential document. Tasks in the hierarchy
are often completed in parallel. At this stage of project planning, the goal is to list every task that must be
completed; it is not necessary to determine the critical path for completing these tasks. For example, we may
have multiple subtasks under a task (e.g., 4.3.1 through 4.3.7 under task 4.3). So, although subtasks 4.3.1
through 4.3.4 may have sequential requirements that force us to complete them in order, subtasks 4.3.5 through
4.3.7 can - and should - be completed in parallel if they do not have sequential requirements. However, it is
important to remember that a task is not complete until all of its corresponding subtasks are completed -
whether sequentially or in parallel. For example, the BUILD phase is not complete until tasks 4.1 through 4.7 are
complete, but some work can (and should) begin for the DEPLOY phase long before the BUILD phase is
complete.
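
The rule that a task is complete only when all of its subtasks are complete, regardless of the order in which they were worked, maps naturally onto a tree. The following Python sketch is one possible representation; the `Task` class and the task names are hypothetical, while the 4.3.x numbering follows the example in the text.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """One node in a WBS hierarchy; 'done' is only meaningful for leaf tasks."""
    wbs_id: str
    name: str
    done: bool = False
    subtasks: list["Task"] = field(default_factory=list)

    def is_complete(self) -> bool:
        """A parent is complete only when every subtask is complete."""
        if not self.subtasks:
            return self.done
        return all(t.is_complete() for t in self.subtasks)

    def open_items(self) -> list[str]:
        """WBS ids of the leaf tasks that are still outstanding."""
        if not self.subtasks:
            return [] if self.done else [self.wbs_id]
        return [wbs_id for t in self.subtasks for wbs_id in t.open_items()]


if __name__ == "__main__":
    task_4_3 = Task("4.3", "Example task", subtasks=[
        Task(f"4.3.{n}", f"Example subtask {n}", done=(n != 6)) for n in range(1, 8)
    ])
    print(task_4_3.is_complete())  # False while 4.3.6 is open
    print(task_4_3.open_items())
```

Nothing in the structure records sequence, which mirrors the point that the WBS lists what must be done and leaves ordering to the project plan.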

The Project Plan provides a starting point for further development of the project WBS. This sample is a Microsoft
Project file that has been "pre-loaded" with the Phases, Tasks, and Subtasks that make up the Informatica
Methodology. The Project Manager can use this WBS as a starting point, but should review it carefully to ensure
that it corresponds to the specific development effort, removing any steps that aren't relevant or adding steps as
necessary. Many projects will require the addition of detailed steps to accurately represent the development
effort.

If the Project Manager chooses not to use Microsoft Project, an Excel version of the Work Breakdown Structure is
available. The phases, tasks, and subtasks can be exported from Excel into many other project management
tools, simplifying the effort to develop the WBS.

After the WBS has been loaded into the selected project management tool and refined for the specific project
needs, the Project Manager can begin to estimate the level of effort involved in completing each of the steps.
When the estimate is complete, individual resources can be assigned and scheduled. The end result is the Project
Plan. Refer to Developing and Maintaining the Project Plan for further information about the project plan.



Developing and Maintaining the Project Plan

Challenge

Developing a first-pass project plan that incorporates all of the necessary components but is
sufficiently flexible to accept the inevitable changes.

Description

Use the following steps as a guide for developing the initial project plan:

• Define the project's major milestones based on the Project Scope.

• Break the milestones down into major tasks and activities. The Project Plan should be helpful as a
starting point or for recommending tasks for inclusion.

• Continue the detail breakdown, if possible, to a level at which tasks are of about one to three days'
duration. This level provides satisfactory detail to facilitate estimation and tracking. If the detail tasks
are too broad in scope, estimates are much less likely to be accurate.

• Confer with technical personnel to review the task definitions and effort estimates (or even to help
define them, if applicable).

• Establish the dependencies among tasks, where one task cannot be started until another is completed
(or must start or complete concurrently with another).

• Define the resources based on the role definitions and estimated number of resources needed for each
role.

• Assign resources to each task. If a resource will only be part-time on a task, indicate this in the plan.

At this point, especially when using Microsoft Project, it is advisable to create dependencies (i.e., predecessor
relationships) between tasks assigned to the same resource in order to indicate the sequence of that person's
activities.
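
The dependency step above can be sketched in a few lines of Python. This is a minimal illustration, not part of the Informatica Methodology or of Microsoft Project: the task names, durations, and the earliest_schedule helper are hypothetical. It performs a simple forward pass, where each task starts as soon as all of its predecessors have finished.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks: name -> (duration in days, list of predecessor names)
tasks = {
    "define_milestones": (2, []),
    "break_down_tasks": (3, ["define_milestones"]),
    "review_estimates": (1, ["break_down_tasks"]),
    "define_resources": (2, ["break_down_tasks"]),
    "assign_resources": (1, ["review_estimates", "define_resources"]),
}


def earliest_schedule(tasks):
    """Return {task: (start_day, finish_day)} using a forward pass."""
    # TopologicalSorter takes a mapping of node -> predecessors
    order = TopologicalSorter({name: preds for name, (_, preds) in tasks.items()})
    schedule = {}
    for name in order.static_order():
        duration, preds = tasks[name]
        # A task can start only when its latest predecessor has finished
        start = max((schedule[p][1] for p in preds), default=0)
        schedule[name] = (start, start + duration)
    return schedule


schedule = earliest_schedule(tasks)
project_length = max(finish for _, finish in schedule.values())
```

Tasks that share no dependency (here, review_estimates and define_resources) receive the same start day, which is where opportunities for parallel work show up.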

The initial definition of tasks and effort and the resulting schedule should be an exercise in pragmatic feasibility
unfettered by concerns about ideal completion dates. In other words, be as realistic as possible in your initial
estimates, even if the resulting schedule is likely to be a hard sell to company management.

This initial schedule becomes a starting point. Expect to review and rework it, perhaps several times. Look for
opportunities for parallel activities, perhaps adding resources, if necessary, to improve the schedule.



When a satisfactory initial plan is complete, review it with the Project Sponsor and discuss the assumptions,
dependencies, assignments, milestone dates, and so on. Expect to modify the plan as a result of this review.

Reviewing and Revising the Project Plan

Once the Project Sponsor and company managers agree to the initial plan, it becomes the basis for assigning
tasks to individuals on the project team and for setting expectations regarding delivery dates. The planning
activity then shifts to tracking tasks against the schedule and updating the plan based on status and changes to
assumptions.

One approach is to establish a baseline schedule (and budget, if applicable) and then track changes against it.
With Microsoft Project, this involves creating a "Baseline" that remains static as changes are applied to the
schedule. If company and project management do not require tracking against a baseline, simply maintain the
plan through updates without a baseline.
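
Tracking against a baseline amounts to comparing the current dates with a frozen copy of the original ones. The sketch below illustrates the idea only; it does not use Microsoft Project, and the task names, dates, and the finish_variance helper are hypothetical.

```python
from datetime import date

# Hypothetical finish dates: the baseline is frozen, the current plan is updated
baseline = {"build_mappings": date(2003, 3, 14), "unit_test": date(2003, 3, 21)}
current = {"build_mappings": date(2003, 3, 18), "unit_test": date(2003, 3, 21)}


def finish_variance(baseline, current):
    """Return {task: slip in calendar days}; positive means later than baseline."""
    return {
        task: (current[task] - baseline[task]).days
        for task in baseline
        if task in current
    }


variance = finish_variance(baseline, current)
# Tasks whose finish date has moved past the baseline date
slipped = sorted(task for task, days in variance.items() if days > 0)
```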

Regular status reporting should include any changes to the schedule, beginning with team members' notification
that dates for task completions are likely to change or have already been exceeded. These status report updates
should trigger a regular plan update so that project management can track the effect on the overall schedule and
budget.

Be sure to evaluate any changes to scope (see 1.2.4 Manage Project and Scope Change Assessment), or
changes in priority or approach, as they arise to determine if they impact the plan. It may be necessary to modify
the plan if changes in scope or priority require rearranging task assignments or delivery sequences, or if they add
new tasks or postpone existing ones.



Managing the Project Lifecycle

Challenge

Providing a structure for on-going management throughout the project lifecycle.

Description

It is important to remember that the quality of a project can be directly correlated to the amount of
review that occurs during its lifecycle.

Project Status and Plan Reviews

In addition to the initial project plan review with the Project Sponsor, schedule regular status meetings
with the sponsor and project team to review status, issues, scope changes and schedule updates.

Gather status, issues and schedule update information from the team one day before the status
meeting in order to compile and distribute the Status Report.

Project Content Reviews

The Project Manager should coordinate, if not facilitate, reviews of requirements, plans and deliverables
with company management, including business requirements reviews with business personnel and
technical reviews with project technical personnel.

Set a process in place beforehand to ensure appropriate personnel are invited, any relevant documents
are distributed at least 24 hours in advance, and that reviews focus on questions and issues (rather
than a laborious "reading of the code").

Reviews may include:

• Project scope and business case review
• Business requirements review
• Source analysis and business rules reviews
• Data architecture review
• Technical infrastructure review (hardware and software capacity and configuration planning)
• Data integration logic review (source to target mappings, cleansing and transformation logic,
etc.)
• Source extraction process review
• Operations review (operations and maintenance of load sessions, etc.)
• Reviews of operations plan, QA plan, deployment and support plan



Change Management

Directly address and evaluate any changes to the planned project activities, priorities, or staffing as
they arise, or are proposed, in terms of their impact on the project plan.

• Use the Scope Change Assessment to record the background problem or requirement and the
recommended resolution that constitutes the potential scope change.
• Review each potential change with the technical team to assess its impact on the project,
evaluating the effect in terms of schedule, budget, staffing requirements, and so forth.
• Present the Scope Change Assessment to the Project Sponsor for acceptance (with formal
sign-off, if applicable). Discuss the assumptions involved in the impact estimate and any
potential risks to the project.

The Project Manager should institute this type of change management process in response to any issue
or request that appears to add or alter expected activities and has the potential to affect the plan. Even
if there is no evident effect on the schedule, it is important to document these changes because they
may affect project direction and it may become necessary, later in the project cycle, to justify these
changes to management.

Issues Management

Any questions, problems, or issues that arise and are not immediately resolved should be tracked so
that someone is accountable for resolving them and their effect on the project remains visible.

Use the Issues Tracking template, or something similar, to track issues, their owner, and dates of entry
and resolution as well as the details of the issue and of its solution.

Significant or "showstopper" issues should also be mentioned on the status report.
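
As an illustration of the fields described above, an issue log can be modeled as a list of simple records. This is a sketch only and not the Issues Tracking template itself: the Issue class, its field names, the sample entries, and the choice to report only unresolved showstoppers are all assumptions.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Issue:
    title: str
    owner: str                       # person accountable for resolution
    entered: date                    # date of entry
    details: str = ""
    solution: str = ""
    resolved: Optional[date] = None  # date of resolution, None while open
    showstopper: bool = False

    @property
    def is_open(self) -> bool:
        return self.resolved is None


def status_report_issues(issues):
    """Open showstopper issues, oldest first, for the status report."""
    flagged = (i for i in issues if i.is_open and i.showstopper)
    return sorted(flagged, key=lambda i: i.entered)


# Hypothetical sample entries
issues = [
    Issue("Source extract fails on null keys", "jsmith", date(2003, 2, 3),
          showstopper=True),
    Issue("Report label typo", "mlee", date(2003, 2, 4)),
    Issue("Target database out of space", "jsmith", date(2003, 2, 1),
          resolved=date(2003, 2, 2), showstopper=True),
]
report = status_report_issues(issues)
```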

Project Acceptance and Close

Rather than simply walking away from a project when it seems complete, there should be an explicit
close procedure. For most projects this involves a meeting where the Project Sponsor and/or
department managers acknowledge completion or sign a statement of satisfactory completion.

• Even for relatively short projects, use the Project Close Report to finalize the project with a
final status report detailing:
o What was accomplished
o Any justification for tasks expected but not completed
o Recommendations
• Prepare for the close by considering what the project team has learned about the
environments, procedures, data integration design, data architecture, and other project plans.
• Formulate the recommendations based on issues or problems that need to be addressed.
Succinctly describe each problem or recommendation and, if applicable, briefly describe a
recommended approach.



Configuring Security

Challenge

Configuring a PowerCenter security scheme to prevent unauthorized access to
mappings, folders, sessions, batches, repositories, and data, in order to ensure
system integrity and data confidentiality.

Description

Configuring security is one of the most important components of building a Data
Warehouse. Security should be implemented with the goals of easy maintenance and
scalability. Determining an optimal security configuration for a PowerCenter
environment requires a thorough understanding of business requirements, data
content, and end users’ access requirements. Knowledge of PowerCenter’s security
facilities is also a prerequisite to security design. Before implementing security
measures, it is imperative to answer the following basic questions:

• Who needs access to the Repository? What do they need the ability to do?
• Is a central administrator required? What permissions are appropriate for
him/her?
• Is the central administrator responsible for designing and configuring the
repository security? If not, has a security administrator been identified?
• What levels of permissions are appropriate for the developers? Do they need
access to all the folders?
• Who needs to start sessions manually?
• Who is allowed to start and stop the Informatica Server?
• How will PowerCenter security be administered? Will it be the same as the
database security scheme?
• Do we need to restrict access to Global Objects?

The following pages offer some answers to these questions and some
suggestions for assigning user groups and access privileges.

In most implementations, the administrator takes care of maintaining the
Repository. There should be a limit to the number of administrator accounts for
PowerCenter. While this is less important in a development/unit test environment, it
is critical for protecting the production environment.



PowerCenter’s security approach is similar to database security environments. All
security management is performed through the Repository Manager. The internal
security enables multi-user development through management of users, groups,
privileges, and folders. Every user ID must be assigned to one or more groups.
These are PowerCenter users, not database users; all password information is
encrypted and stored in the repository.

The Repository may be connected to sources/targets that contain sensitive
information. The Server Manager provides another level of security for this
purpose. It is used to assign read, write, and execute permissions for global objects.

Global Objects include Database Connections, FTP Connections and External Loader
Connections. Global Object permissions, in addition to privileges and permissions
assigned using the Repository Manager, affect the ability to perform tasks in the
Server Manager, the Repository Manager, and the command line program, pmcmd.

The Server Manager also offers an enhanced security option that allows you to
specify a default set of privileges that applies restricted access controls for Global
Objects. Only the owner of the Object or a Super User can manage permissions for a
Global Object.

Choosing the Enable Security option activates the following set of default
privileges:

User         Default Global Object Permissions
Owner        Read, Write and Execute
Owner Group  Read and Execute
World        No Permissions

Enabling Enhanced Security does not lock the restricted access settings for Global
Objects. This means that the permissions for Global Objects can be changed after
enabling Enhanced Security.

Although privileges can be assigned to users or groups, privileges are commonly
assigned to groups, with users then added to each group. This approach is simpler
than assigning privileges on a user-by-user basis, since there are generally few
groups and many users, and any user can belong to more than one group.

The following table summarizes some possible privileges that may be granted:

Privilege                    Description
Session Operator             Can run any sessions or batches, regardless of
                             folder level permissions.
Use Designer                 Can edit metadata in the Designer.
Browse Repository            Can browse repository contents through the
                             Repository Manager.
Create Sessions and Batches  Can create, modify, and delete sessions and
                             batches in Server Manager.
Administer Repository        Can create and modify folders.
Administer Server            Can configure connections on the server and stop
                             the server through the Server Manager or the
                             command-line interface.
Super User                   Can perform all tasks with the repository and the
                             server.

The next table suggests a common set of initial groups and the privileges that may
be associated with them:

Group          Description                          Privileges
Developer      PowerCenter developers who are       Session Operator, Use Designer,
               creating the mappings.               Browse Repository, Create
                                                    Sessions and Batches
End User       Business end users who run reports   Browse Repository
               off of the data warehouse.
Operator       Operations department that runs and  Session Operator, Administer
               maintains the environment in         Server, Browse Repository
               production.
Administrator  Data warehouse Administrators who    Super User
               maintain the entire warehouse
               environment.
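
The group-based approach can be illustrated with a small model. This sketch is not a PowerCenter API. It assumes that a user's effective privileges are the union of the privileges of every group the user belongs to, and it treats Super User as covering every privilege; the effective_privileges and has_privilege helpers are hypothetical.

```python
# Group-to-privilege mapping taken from the suggested initial groups
GROUP_PRIVILEGES = {
    "Developer": {"Session Operator", "Use Designer", "Browse Repository",
                  "Create Sessions and Batches"},
    "End User": {"Browse Repository"},
    "Operator": {"Session Operator", "Administer Server", "Browse Repository"},
    "Administrator": {"Super User"},
}


def effective_privileges(user_groups):
    """Union of the privileges of all groups a user belongs to (assumed model)."""
    privileges = set()
    for group in user_groups:
        privileges |= GROUP_PRIVILEGES[group]
    return privileges


def has_privilege(user_groups, privilege):
    """Super User can perform all tasks, so it satisfies any privilege check."""
    granted = effective_privileges(user_groups)
    return "Super User" in granted or privilege in granted
```

For example, a user placed in both the Developer and Operator groups would hold five distinct privileges under this model, while an End User would hold only Browse Repository.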

Users with Administer Repository or Super User privileges may edit folder properties,
which must identify a folder owner and group, and also determine whether the folder
is shareable, meaning that shortcuts can be created pointing to objects within the
folder, thereby enabling object reuse. After a folder is flagged as shareable, this
property cannot be changed. For each folder, privileges are set for the owner, group,
and repository (i.e., any user).

The following table details the three folder level privileges: Read, Write, and
Execute:

Privilege  Description
Read       Can read, copy, and create shortcuts to repository objects in the
           folder. Users without read permissions cannot see the folder.
Write      Can edit metadata in the folder.
Execute    Can run sessions using mappings in the folder.

Allowing shortcuts enables other folders in the same repository to share objects such
as source/target tables, transformations, and mappings. A recommended practice is
to create only one shareable folder per repository, and to place all reusable objects
within that shareable folder. When another folder uses a shortcut to an object in the
shareable folder, the shortcut inherits the properties of that object, so changes to
common logic or elements can be managed more efficiently.



Users who own a folder or have Administer Repository or Super User privileges can
edit folder properties to change the owner, the group assigned to the folder, the
three levels of privileges, and the Allow Shortcuts option.

Note that users with the Session Operator privilege can run sessions or batches,
regardless of folder level permissions. A folder owner should be allowed all three
folder level permissions. However, members of the folder's group may be granted
only Read and Write, or possibly all three levels, depending on the desired level of
security. Repository-level permissions (those that apply to any other user) should
be restricted to Read only, if any are granted at all.
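
The interaction between folder level permissions and repository privileges can be sketched as follows. This is an illustrative model only, not PowerCenter code. It assumes that the owner level applies to the folder owner, the group level to members of the folder's group, and the repository level to everyone else; the Folder class, the user names, and the folder name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Folder:
    name: str
    owner: str
    group: str
    # Defaults follow the recommendations above
    owner_perms: frozenset = frozenset({"read", "write", "execute"})
    group_perms: frozenset = frozenset({"read", "write"})
    repository_perms: frozenset = frozenset({"read"})


def folder_permissions(folder, user, user_groups):
    """Pick the permission level that applies to this user (assumed precedence)."""
    if user == folder.owner:
        return folder.owner_perms
    if folder.group in user_groups:
        return folder.group_perms
    return folder.repository_perms


def can_run_sessions(folder, user, user_groups, privileges):
    """Session Operator and Super User can run sessions regardless of folder permissions."""
    if "Super User" in privileges or "Session Operator" in privileges:
        return True
    return "execute" in folder_permissions(folder, user, user_groups)


# Hypothetical folder for project ABC
abc = Folder("ABC_MAPPINGS", owner="alice", group="ABC Developers")
```

Under these assumptions, a member of the XYZ group sees only the repository level for an ABC folder and therefore cannot change it.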

You might also wish to add a group specific to each application if there are many
application development tasks being performed within the same repository. For
example, if you have two projects, ABC and XYZ, it may be appropriate to create a
group for ABC developers and another for XYZ developers. This enables you to
assign folder level security to the group and keep the two projects from accidentally
working in folders that belong to the other project team. In this example, you may
assign group level security for all of the ABC folders to the ABC group. In this way,
only members of the ABC group can make changes to those folders.

Tight security is recommended in the production environment to ensure that the
developers and other users do not accidentally make changes to production. Only a
few people should have Administer Repository or Super User privileges, while
everyone else should have the appropriate privileges within the folders they use.

Informatica recommends creating individual User IDs for all developers and
administrators on the system rather than using a single shared ID. One of the most
important reasons for this is session level locking. When a session is in use by a
developer, it cannot be opened and modified by anyone but that user. Locks thus
prevent repository corruption by preventing simultaneous uncoordinated updates.
Also, if multiple individuals share a common login ID, it is difficult to identify which
developer is making (or has made) changes to an object.
