Session and Data Partitioning

Challenge
Improving performance by identifying strategies for partitioning relational tables, XML, COBOL and standard flat
files, and by coordinating the interaction between sessions, partitions, and CPUs. These strategies take advantage
of the enhanced partitioning capabilities in PowerCenter.
Description
On hardware systems that are under-utilized, you may be able to improve performance by processing partitioned
data sets in parallel in multiple threads of the same session instance running on the PowerCenter Server engine.
However, parallel execution may impair performance on over-utilized systems or systems with smaller I/O capacity.
In addition to hardware, consider these other factors when determining if a session is an ideal candidate for
partitioning: source and target database setup, target type, mapping design, and certain assumptions that are
explained in the following paragraphs. Use the Workflow Manager client tool to implement session partitioning.
Assumptions
The following assumptions pertain to the source and target systems of a session that is a candidate for partitioning.
These factors can help to maximize the benefits that can be achieved through partitioning.
Indexing has been implemented on the partition key when using a relational source.
Source files are located on the same physical machine as the PowerCenter Server process when
partitioning flat files, COBOL, and XML, to reduce network overhead and delay.
All possible constraints are dropped or disabled on relational targets.
All possible indexes are dropped or disabled on relational targets.
Table spaces and database partitions are properly managed on the target system.
Target files are written to the same physical machine that hosts the PowerCenter process in order to reduce
network overhead and delay.
Oracle External Loaders are utilized whenever possible.
First, determine if you should partition your session. Parallel execution benefits systems that have the following
characteristics:
Check idle time and busy percentage for each thread. This gives a high-level indication of where the bottleneck
lies. To do this, open the session log and look for messages starting with PETL_ under the RUN
INFO FOR TGT LOAD ORDER GROUP section. These PETL messages give the following details for the
reader, transformation, and writer threads:
Total Run Time
Total Idle Time
Busy Percentage
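As a rough illustration of how these three figures relate, the sketch below derives a busy percentage from run time and idle time and picks the busiest thread as the likely bottleneck. The thread names and timings are hypothetical sample values, not output from a real session log, and the formula (run time minus idle time, divided by run time) is an assumption about how the percentage is computed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ThreadStats:
    name: str               # hypothetical thread label
    total_run_time: float   # seconds
    total_idle_time: float  # seconds

    @property
    def busy_percentage(self) -> float:
        # Assumed relationship: share of the run time the thread was not idle.
        if self.total_run_time <= 0:
            return 0.0
        busy = self.total_run_time - self.total_idle_time
        return 100.0 * busy / self.total_run_time


def likely_bottleneck(stats: List[ThreadStats]) -> ThreadStats:
    # The thread with the highest busy percentage is the first place to look.
    return max(stats, key=lambda s: s.busy_percentage)


# Hypothetical figures, as if copied by hand from a session log.
sample = [
    ThreadStats("reader", 200.0, 150.0),
    ThreadStats("transformation", 200.0, 10.0),
    ThreadStats("writer", 200.0, 120.0),
]
bottleneck = likely_bottleneck(sample)
```

With these sample values the transformation thread is the busiest, which would suggest looking at the transformation stage first.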
Under-utilized or intermittently-used CPUs. To determine if this is the case, check the CPU usage of your
machine. If there are CPU cycles available (i.e., twenty percent or more idle time), then this session's performance
may be improved by adding a partition.
Windows 2000/2003 - check the Task Manager Performance tab.
UNIX - type vmstat 1 10 on the command line. The id column displays the percentage of time that the CPU
was idle during the specified interval without any I/O wait.
Sufficient I/O. To determine the I/O statistics:
UNIX - type iostat on the command line. The %iowait column displays the percentage of CPU time
spent idling while waiting for I/O requests. The %idle column displays the total percentage of the time that
the CPU spends idling (i.e., the unused capacity of the CPU).
Sufficient memory. If too much memory is allocated to your session, you will receive a memory allocation error.
Check to see that you're using as much memory as you can. If the session is paging, increase the memory. To
determine if the session is paging:
UNIX - type vmstat 1 10 on the command line. The pi column displays the number of pages swapped in from the page
space during the specified interval. The po column displays the number of pages swapped out to the page space during
the specified interval. If these values indicate that paging is occurring, it may be necessary to allocate more
memory, if possible.
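The CPU and paging checks above can be sketched as a small script that averages the relevant columns of captured vmstat output. This is a minimal sketch under stated assumptions: the sample text is hypothetical and abbreviated, the header is assumed to name the columns id, pi, and po (column names and layout vary by platform), and the function names are illustrative only. The twenty percent idle threshold is the one given above.

```python
def average_columns(vmstat_text, columns):
    """Average the named columns over all sample rows of captured vmstat text."""
    rows = [line.split() for line in vmstat_text.splitlines() if line.strip()]
    # The header is taken to be the first row containing every requested name.
    header_index = next(
        i for i, row in enumerate(rows) if all(name in row for name in columns)
    )
    header = rows[header_index]
    samples = [row for row in rows[header_index + 1:] if len(row) == len(header)]
    if not samples:
        raise ValueError("no sample rows found below the header")
    return {
        name: sum(float(row[header.index(name)]) for row in samples) / len(samples)
        for name in columns
    }


def has_headroom(averages, min_idle_percent=20.0):
    """True if the CPU is idle enough and no paging was observed."""
    paging = averages["pi"] > 0 or averages["po"] > 0
    return averages["id"] >= min_idle_percent and not paging


# Hypothetical, abbreviated capture; real vmstat output has more columns.
SAMPLE = """
 r  b  pi  po  us  sy  id  wa
 1  0   0   0  40  10  45   5
 2  0   0   0  50  10  35   5
 1  0   0   0  30  10  55   5
"""

averages = average_columns(SAMPLE, ["id", "pi", "po"])
candidate = has_headroom(averages)
```

Looking up columns by header name rather than by position keeps the sketch usable across vmstat variants, as long as the names passed in match the local header.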
If you determine that partitioning is practical, you can begin setting up the partition.
Partition Types
PowerCenter provides increased control of the pipeline threads. Session performance can be improved by adding
partitions at various pipeline partition points. When you configure the partitioning information for a pipeline, you
must specify a partition type. The partition type determines how the PowerCenter Server redistributes data across
partition points. The Workflow Manager allows you to specify the following partition types:
Round-robin Partitioning
The PowerCenter Server distributes data evenly among all partitions. Use round-robin partitioning when you need
to distribute rows evenly and do not need to group data among partitions.
In a pipeline that reads data from file sources of different sizes, use round-robin partitioning. For example, consider
a session based on a mapping that reads data from three flat files of different sizes.
Source file 1: 100,000 rows
In this scenario, the recommended best practice is to set a partition point after the Source Qualifier and set the
partition type to round-robin. The PowerCenter Server distributes the data so that each partition processes
approximately one third of the data.
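The effect of round-robin distribution can be illustrated with a short sketch. It treats the rows arriving at the partition point as one stream and deals them out in turn, which is a simplification and not a description of PowerCenter internals. The file names and row counts are hypothetical and are not the figures from the scenario above.

```python
from itertools import cycle


def round_robin(rows, num_partitions):
    """Deal rows out to partitions in turn so that each gets an even share."""
    partitions = [[] for _ in range(num_partitions)]
    for row, target in zip(rows, cycle(range(num_partitions))):
        partitions[target].append(row)
    return partitions


# Hypothetical sources of very different sizes.
files = {"file_a": 10, "file_b": 4, "file_c": 7}
rows = [(name, i) for name, count in files.items() for i in range(count)]

parts = round_robin(rows, 3)
sizes = [len(p) for p in parts]
```

Although the three inputs differ in size, each of the three partitions ends up with the same number of rows.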
Hash Partitioning
The PowerCenter Server applies a hash function to a partition key to group data among partitions.
Use hash partitioning where you want to ensure that the PowerCenter Server processes groups of rows with the
same partition key in the same partition. For example, use it when you need to sort items by item ID but
do not know how many items have a particular ID number. If you select hash auto-keys, the PowerCenter
Server uses all grouped or sorted ports as the partition key. If you select hash user keys, you specify a number of
ports to form the partition key.
An example of this type of partitioning is when you are using Aggregators and need to ensure that groups of data
based on a primary key are processed in the same partition.
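The grouping guarantee described above can be sketched as follows. The hash function PowerCenter uses is not stated here, so zlib.crc32 stands in for it as an assumption; the function name, the port names, and the sample rows are likewise hypothetical.

```python
import zlib
from collections import defaultdict


def hash_partition(rows, key_ports, num_partitions):
    """Group rows so that equal partition keys always land in the same partition."""
    partitions = defaultdict(list)
    for row in rows:
        # Build a compound key from the chosen ports (as with hash user keys).
        key = "|".join(str(row[port]) for port in key_ports)
        # crc32 is a stand-in hash; any deterministic hash would serve here.
        index = zlib.crc32(key.encode("utf-8")) % num_partitions
        partitions[index].append(row)
    return partitions


# Hypothetical rows: five item IDs, ten rows each.
rows = [{"item_id": i % 5, "qty": i} for i in range(50)]
partitions = hash_partition(rows, ["item_id"], 3)

# For every item ID, collect the partitions in which it appears.
placement = defaultdict(set)
for index, part in partitions.items():
    for row in part:
        placement[row["item_id"]].add(index)
```

Each item ID appears in exactly one partition, which is what an Aggregator needs, but the partitions are not guaranteed to be the same size.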
Key Range Partitioning

With this type of partitioning, you specify one or more ports to form a compound partition key for a source or target.
The PowerCenter Server then passes data to each partition depending on the ranges you specify for each port.
Use key range partitioning where the sources or targets in the pipeline are partitioned by key range. Refer to
the Workflow Administration Guide for further directions on setting up key range partitions.
For example, with key range partitioning set at End range = 2020, the PowerCenter Server passes in data where
values are less than 2020. Similarly, for Start range = 2020, the PowerCenter Server passes in data where values
are equal to or greater than 2020. Null values or values that do not fall in either partition are passed through the first
partition.
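The routing rule in this example can be written down as a small function. This is a sketch of the rule as described above (start inclusive, end exclusive, nulls and unmatched values to the first partition), with a hypothetical function name and a single-port key for brevity.

```python
def key_range_partition(value, ranges):
    """Return the index of the partition whose range contains value.

    ranges is a list of (start, end) pairs: start is inclusive, end is
    exclusive, and None means the range is open on that side.
    """
    if value is not None:
        for index, (start, end) in enumerate(ranges):
            above_start = start is None or value >= start
            below_end = end is None or value < end
            if above_start and below_end:
                return index
    # Nulls and values outside every range go to the first partition.
    return 0


# Partition 0: End range = 2020. Partition 1: Start range = 2020.
RANGES = [(None, 2020), (2020, None)]
```

For instance, key_range_partition(2019, RANGES) selects the first partition and key_range_partition(2020, RANGES) selects the second.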
Pass-through Partitioning
In this type of partitioning, the PowerCenter Server passes all rows at one partition point to the next partition point
without redistributing them.
Use pass-through partitioning where you want to create an additional pipeline stage to improve performance, but
do not want to (or cannot) change the distribution of data across partitions. The Data Transformation Manager
spawns a master thread on each session run, which in turn creates three threads (reader, transformation, and
writer threads) by default. Each of these threads can, at the most, process one data set at a time and hence, three
data sets simultaneously. If there are complex transformations in the mapping, the transformation thread may take
a longer time than the other threads, which can slow data throughput.
It is advisable to define partition points at these transformations. This creates another pipeline stage and reduces
the overhead of a single transformation thread.
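The idea of splitting one transformation thread into two pipeline stages can be sketched with ordinary Python threads and queues. This illustrates the structure only and is not how the Data Transformation Manager is implemented; all names are hypothetical, and Python threads do not give true CPU parallelism for pure-Python work.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the data


def run_stage(func, inbox, outbox):
    """Apply func to every item from inbox and forward the result to outbox."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)
            return
        outbox.put(func(item))


def run_pipeline(rows, stage_funcs):
    """Run each function in stage_funcs in its own thread, linked by queues."""
    queues = [queue.Queue() for _ in range(len(stage_funcs) + 1)]
    threads = [
        threading.Thread(target=run_stage, args=(func, queues[i], queues[i + 1]))
        for i, func in enumerate(stage_funcs)
    ]
    for thread in threads:
        thread.start()
    for row in rows:          # the caller plays the role of the reader
        queues[0].put(row)
    queues[0].put(_DONE)
    results = []              # the caller also plays the role of the writer
    while True:
        item = queues[-1].get()
        if item is _DONE:
            break
        results.append(item)
    for thread in threads:
        thread.join()
    return results


# One stage doing all the work versus the same work split into two stages.
one_stage = run_pipeline(range(5), [lambda x: (x + 1) * 2])
two_stages = run_pipeline(range(5), [lambda x: x + 1, lambda x: x * 2])
```

Both runs produce the same rows in the same order; the second simply spreads the work over an extra stage, so one stage can start on the next row while the other finishes the previous one.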
When you have considered all of these factors and selected a partitioning strategy, you can begin the iterative
process of adding partitions. Continue adding partitions to the session until you meet the desired performance
threshold or observe degradation in performance.
Tips for Efficient Session and Data Partitioning

Add one partition at a time. To best monitor performance, add one partition at a time, and note your
session settings before adding additional partitions. Refer to the Workflow Administration Guide for more
information on restrictions on the number of partitions.
Set DTM buffer memory. For a session with n partitions, set this value to at least n times the original value
for the non-partitioned session.
Set cached values for sequence generator. For a session with n partitions, there is generally no need to
use the Number of Cached Values property of the sequence generator. If you must set this value to a value
greater than zero, make sure it is at least n times the original value for the non-partitioned session.
Partition the source data evenly. The source data should be partitioned into equal sized chunks for each
partition.
Partition tables. A notable increase in performance can also be realized when the actual source and target
tables are partitioned. Work with the DBA to discuss the partitioning of source and target tables, and the
setup of tablespaces.
Consider using external loader. As with any session, using an external loader may increase session
performance. You can only use Oracle external loaders for partitioning. Refer to the Session and Server
Guide for more information on using and setting up the Oracle external loader for partitioning.
Write throughput. Check the session statistics to see if you have increased the write throughput.
Paging. Check to see if the session is now causing the system to page. When you partition a session and
there are cached lookups, you must make sure that DTM memory is increased to handle the lookup caches.
When you partition a source that uses a static lookup cache, the PowerCenter Server creates one memory
cache for each partition and one disk cache for each transformation. Thus, memory requirements grow for
each partition. If the memory is not bumped up, the system may start paging to disk, causing degradation in
performance.
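The two sizing tips above (DTM buffer memory and sequence generator cached values) reduce to simple arithmetic, sketched below. The function name, the dictionary keys, and the starting values are hypothetical; the results are the minimums implied by the tips, not tuned recommendations.

```python
def scale_for_partitions(dtm_buffer_bytes, cached_values, num_partitions):
    """Return minimum settings for a session with num_partitions partitions."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    return {
        # At least n times the value used by the non-partitioned session.
        "dtm_buffer_bytes": dtm_buffer_bytes * num_partitions,
        # Zero stays zero; a non-zero value is scaled by n as well.
        "sequence_cached_values": (
            cached_values * num_partitions if cached_values > 0 else 0
        ),
    }


# Hypothetical starting point: a 12,000,000-byte buffer and 1,000 cached values.
settings = scale_for_partitions(12_000_000, 1_000, 3)
```

Lookup caches are not covered by this calculation: as the paging tip notes, each partition gets its own memory cache, so that memory has to be budgeted on top of these values.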
When you finish partitioning, monitor the session to see if the partition is degrading or improving session
performance. If the session performance is improved and the session meets your requirements, add another
partition.
Session on Grid and Partitioning Across Nodes

Session on Grid provides the ability to run a session on a multi-node Integration Service. This is most suitable for
large sessions. For small and medium-sized sessions, it is more practical to distribute whole sessions to
different nodes using Workflow on Grid. Session on Grid leverages the existing partitions of a session by executing
threads in multiple DTMs. The Log Service can be used to get the cumulative log. See PowerCenter Enterprise Grid
Option for detailed configuration information.
Dynamic Partitioning
Dynamic partitioning is also called parameterized partitioning because a single parameter can determine the
number of partitions. With the Session on Grid option, more partitions can be added when more resources are
available. The number of partitions in a session can also be tied to the number of partitions in the database, which
makes PowerCenter partitioning easier to maintain and lets it leverage database partitioning.
