Pentaho Data Integration
for Database Developers
July 2011
Welcome Agenda
Audience and prerequisites
Learning objectives
Class process
Course outline
Learning Objectives
At the end of the course, you should understand:
The basic architecture and features of Pentaho Data Integration.
The concept and features of the advanced Pentaho Data Integration
Enterprise Edition.
How PDI supports you in the Agile BI approach.
Learning Objectives
At the end of the course, you should be able to
Load and write data from and to different data sources
Join data from different sources
Use PDI and ETL design patterns (like restartable solutions)
Influence the performance aspects of databases and transformations
Build portable and flexible jobs and transformations
Schedule jobs and transformations
Use logging, monitoring and error handling features of PDI
Load, transform, and create complex XML structures
Use scripting (JavaScript, Formula, Java) in transformations
Apply clustering and partitioning solutions for high volumes
Learning Objectives
What are your objectives?
What are your expectations?
Course Process
Daily schedule:
9:00 am to 5:00 pm
1-hour lunch break at noon
15-minute morning break at 10:30
15-minute afternoon break at 3:30
The course is a combination of lecture, demo and labs.
Feel free to ask questions or to seek clarification!
Online survey will be provided for feedback and suggestions.
Common Uses
Data warehouse population:
Built-in support for slowly changing
dimensions, junk dimensions and other data
warehouse concepts.
Export of database(s) to text-file(s) or other
databases.
Import of data into databases, ranging from
text-files to Excel spreadsheets.
Data migration between database applications.
Common Uses
Exploration of data in an existing database
(tables, views, synonyms, ...).
Information enrichment by looking up data
in various information stores (databases,
text-files, Excel spreadsheets, ...).
Data cleansing by applying complex
conditions in data transformations.
Application integration.
Example
1
Agile BI
Modeling and Visualization perspectives
Kitchen
Command line tool for executing jobs
modeled in Spoon
Job Example
Transformations
Transformations are a network of logical tasks (Steps):
Read a flat file
Filter it
Sort it
Load it into MySQL
Threading mechanism
KETTLE_ROWSET_GET_TIMEOUT / KETTLE_ROWSET_PUT_TIMEOUT
Lazy Conversion
Lazy conversion is a delayed conversion of data types
Provides a performance boost.
Conversion takes place only where it is really needed
Ideally at output steps
Sometimes not at all, e.g. when reading from a text-file and writing back to a
text-file
If output format is the same as the input format, no conversion
Steps support lazy conversion
Specifically: CSV File Input, Fixed File Input and Table Input
Other steps support it transparently
Lazy Conversion
Binary form of data can cause issues:
For example: sorting on binary form ignores character set sorting
sequence.
New feature in the Select Values step
Covered in more detail in a separate module
Converts binary data to and from binary character data
Other methods:
Middle or scroll wheel button, click on the first step and drag onto the
second.
Use SHIFT+Click and drag from one step to another
Select 2 steps, right click on one of them and select New hop.
Drag Hops onto the canvas.
Inserting a step (or job entry) between others:
Move the step over the arrow until the arrow becomes drawn in bold
Release the mouse-button
Window sizes:
May have to resize some dialogs to see all parameters
Log View
Shows statistics associated with execution of a Transformation.
Used to understand performance and to check the results.
Logging can be very granular down to the row level if needed.
Safe Mode
Available in the Execute a Transformation/Job window
Used in cases that mix rows from various sources
Makes sure that these rows all have the same layout (metadata).
Forces each Transformation to check layout of each row.
Error thrown on row that differs in layout from first row
Step and offending row are reported
Has performance tradeoffs:
Checking on each row slows performance.
Source of an error found sooner, useful in trouble shooting.
Analyzing Errors
2010/05/18 16:38:00 - Generate Rows.0 - ERROR (version 4.0.0-) : Couldn't parse Integer
field [WrongType] with value [abc] -->
org.pentaho.di.core.exception.KettleValueException:
Debugging
Introduced in PDI 3.0
Provides condition break points
Replaying a Transformation
Is implemented for Text File Input and Excel Input
Allows files containing errors to be sent back to source and corrected.
Uses .line file to reprocess file:
ONLY lines that failed are processed during the replay.
Uses the date in the filename of the .line file to match the replay
date.
Database Connections
Multiple database connections to different databases can be created.
With a PDI repository:
Defined connections readily available to transformations and jobs.
Connection information for the repository itself is stored in
repositories.xml.
Without a PDI repository:
Connection definition contained in a single Transformation or Job.
Can share connection definitions in subsequent Transformations and
Jobs.
Database Connections
Available database connections appear in the Main Tree.
Database Connections
General database connection options
Quoting
Quoting is used when reserved names or special characters are used.
For example: field names such as sum, V.A.T., or overall sales.
PDI has an internal list of reserved names for most of the supported
database types.
PDI's automatic quoting can be overridden.
Feedback on quoting is always welcome to improve the quoting
algorithms.
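As a rough sketch (hypothetical table and field names), the generated SQL with quoting applied looks like this; the actual quote character depends on the database type (e.g. backticks on MySQL):

SELECT "sum", "V.A.T.", "overall sales"
FROM "sales data"
WHERE "sum" > 0;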
Database Explorer
In the toolbar:
In the connection context menu:
Impact Analysis
What impact does the Transformation have on the used databases?
SQL Editor
Creates the needed DDL for the output steps related to a database
table, often CREATE statements for tables or indices.
SQL button in the toolbar creates all needed DDL for tables.
No automatic mechanism to alter tables when the layout changes.
For example: A field type from a source table is changed
DDL can be easily and manually changed.
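As a sketch (hypothetical table and field names), the DDL produced by the SQL button typically looks like this and can be edited by hand before it is executed:

CREATE TABLE dim_customer
(
  customer_id INTEGER
, customer_name VARCHAR(100)
, city VARCHAR(50)
);
CREATE INDEX idx_dim_customer_lookup ON dim_customer (customer_id);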
Input and Output Steps
Table Input
Reads information from a database, using a connection and SQL.
Table Input Options
Step Name: The name has to be unique in a single Transformation.
Connection: The database connection used to read data from.
SQL: The statement used to read information from the database
connection; may be any query.
Insert data from step: The input step name where parameters for
the SQL come from, if appropriate.
Limit: Sets the number of lines that are read from the database.
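A minimal sketch of a Table Input query (hypothetical table and field names); the ? placeholder is filled from the step named under Insert data from step:

SELECT ordernumber, orderdate, customernumber
FROM orders
WHERE orderdate > ?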
Excel Input
Reads information from one or more Excel files.
The options provided by the PDI GUI for Excel input include:
Step Name: Name of the step.
File Tab: To define the filenames, with variable support.
Sheet Tab: To define the sheet(s) to import.
Fields Tab: To specify the fields that need to be read from the
Excel files.
Error handling Tab: Allows the user to define how to react when an error
is encountered.
Content Tab: Includes the sub options of Header, No empty rows,
Stop on empty rows, Field name, Sheet name field, Row number
field and Limit.
Access Input
Reads information from one or more Access files.
No ODBC connection necessary.
Allows Access files to be read on non-Windows platforms.
Access Input Options:
Step Name: Name of the step.
File Tab: To define the filenames, with variable support.
Content Tab: Specify the table name and the inclusion of the file
name, table name, row number and limit.
Fields Tab: Specify the fields that need to be read from the Access
files.
XBase Input
Reads data from most types of DBF file derivatives, known as the XBase
family (dBase III/IV, FoxPro, Clipper, ...).
Options:
Step name: Unique name (in transformation) of the step.
Filename: Name of XBase file with variable support.
Limit size: Only read this number of rows; zero means unlimited.
Add rownr?: Adds a field to the output with the specified name that
contains the row number.
Generate Rows
Outputs a number of rows; by default they are empty, but they can optionally
contain a number of static fields.
Options
Fields: Static fields user might want to include in the output row.
LDAP Input
Reads data from an LDAP server.
Options
Host: Hostname or IP address of the LDAP server.
Port: The TCP port to use, typically 389.
User Authentication: Enable to pass authentication credentials to
server.
Username/Password: For authenticating with the LDAP server.
Search base: Location in the directory from which the LDAP search
begins.
Filter String: The filter string for filtering the results.
Fields: Define the return fields and type.
Table Output
Insert (only) information in a database table.
Options
Target table
Commit size
Truncate table
Ignore insert errors
Partition data over tables
Use batch update for inserts
Return auto-generated key
Name auto-generated key field
Is the name of the table defined in a field
Insert / Update
Automates simple merge processing:
Look up a row using one or more lookup keys.
If a row is not found, insert the new row.
If found and the targeted fields are identical, do nothing.
If found and targeted fields are not identical, update the row.
Options
Step name
Connection
Target table
Commit size
Keys
Update fields
Do not perform any updates (If used, operates like Table Output,
but without any Insert errors caused by duplicate keys).
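Conceptually (not the literal SQL the step issues; table and field names are hypothetical), the step behaves like this lookup-then-insert-or-update logic:

-- look up the row on the key fields
SELECT customername, city FROM dim_customer WHERE customernumber = ?;
-- no row found: insert the new row
INSERT INTO dim_customer (customernumber, customername, city) VALUES (?, ?, ?);
-- row found but values differ: update the row
UPDATE dim_customer SET customername = ?, city = ? WHERE customernumber = ?;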
Update
Same as the Insert / Update step except no insert is performed in the
database table.
ONLY updates are performed.
Delete
Same as the Update step, except rows are deleted.
Excel Output
Exports data to an Excel file
Options
Sheet name
Protect sheet with a password
Use a template (e.g. with a preformatted sheet)
Append or override the contents of the template
Access Output
Exports data to an Access file
ODBC not required
Can be used on non-Windows platforms
Options
XML Output
Writes rows from any source to one or more XML files
Options
File name
Extension
Include stepnr in file name
Include date in file name
Include time in file name
Split every N rows
Parent XML element
Row XML element
Fields
Zipped
Encoding
[Star schema diagram: the pentaho_oltp source (OLTP) is loaded into an OLAP model with a Sales Facts fact table and Product, Customer, and Time dimension tables.]
Look at the first line: This is the default row returned when you look up
a dimension and the key is not found. This row (for not found entries) is
created automatically by PDI with null values.
Then enter the fields you want to retrieve (mind the change of the
columns, especially Type of return field instead of Type of dimension
update):
When the date field is empty, the current date and time is used for lookups
and new inserts.
When you have a date field with a valid-from date, you can use it
here.
Lookups
Lookups
The Lookup feature of PDI accesses a data source to find values according to
defined matching criteria, i.e. a key.
The following steps have lookup functionality in PDI:
Commonly Used
Database Lookup
Stream Lookup
Merge Join
Others
Database Join
Call database procedure
Dimension Update/Lookup
Combination Update/Lookup
HTTP Lookup
Database Lookup
Lookup attributes from a single table based on a key-matching criteria
Options for performing database lookup include:
Lookup table: The name of the table where the lookup is done.
Enable cache: This option caches database lookups for the duration of
the Transformation.
Enabling this option can increase performance.
Danger: If other processes are changing values in the table do not
set this option.
Load all data from table: Preload the complete data in memory at the
initialization phase. This can replace a Stream Lookup step in
combination with a Table Input step and is faster.
SELECT
ATTRIB1 as FullName
FROM
lookup_table
WHERE
ID = <<value of field in stream>>
Stream Lookup
Allows users to lookup data using information coming from other steps in the
transformation.
The data coming from the Source step is first read into memory (cache) and is
then used to look up data for each record in the main stream.
Options provided by Kettle GUI for performing stream lookup include:
Source step: The step from which to obtain the in-memory lookup data
Key(s) to lookup value(s): Allows user to specify names of fields that are used
to lookup values. Values are always searched using the equal comparison.
Fields to retrieve: User can specify the names of the fields to retrieve, as
well as the default value in case the value was not found or a new fieldname
in case the user wishes to change the output stream field name.
Merge Join
Takes TWO sorted streams and performs a traditional JOIN on EQUALITY
INNER = Only output a row when the key is in both streams
LEFT OUTER = Output a row even if there is no matching key in 2nd
Step
RIGHT OUTER = Output a row even if there is no matching key in 1st
Step
FULL OUTER = Output a row regardless of matching
Options provided by PDI GUI for merge join include:
First Step: Step to refer to as the 1st.
Second Step: Step to refer to as the 2nd.
Keys for 1st: The key fields from the 1st stream.
Keys for 2nd: The key fields from the 2nd stream.
Join Type: INNER, LEFT OUTER, RIGHT OUTER, or FULL OUTER.
Database Join
Options provided by Kettle GUI for database join procedure include:
SQL: The SQL query to launch towards the database.
Number of rows to return: 0 means all, any other number limits the
number of rows.
Outer join?: When checked, will always return a single record for each
input stream record, even if the query did not return a result.
The parameters to use in the query.
Parameters noted as ? in the query
Order of fields in parameter list must match the order of the ? in the
query.
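A minimal sketch of a Database Join query (hypothetical table and field names); the query is run for every input row, and each ? is replaced, in order, by the listed parameter fields:

SELECT c.customername, c.creditlimit
FROM customers c
WHERE c.customernumber = ?
  AND c.country = ?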
Dimension Lookups
Uses the same dimension step used for updating
Same fields, same setup
Stream Datefield: the stream date must lie between the effective and expiry dates of the dimension entry.
HTTP Lookup
Covered in Web Service module
Field Transformations
Field Transformations are steps that operate at the field level within a
stream record.
The step types covered in this section include:
Select Values
Calculator
Add Constants
Null If
Select Values
This step type is used to:
Select/remove fields from the process stream.
Rename fields
Specify/change the length and/or precision of fields.
3 Tabs are provided:
Select and Alter: Specify the exact order and name in which the fields
have to be placed in the output rows.
Remove: Specify the fields that have to be removed from the output
rows.
Meta-data: Change the name, type, length and precision (the metadata) of one or more fields.
Calculator
Provides a list of functions that can be executed on field values.
An important advantage Calculator has over custom JavaScript scripts is
that the execution speed of Calculator is many times that of a script.
Besides the arguments (Field A, Field B and Field C) the user also needs
to specify the return type of the function.
You can also opt to remove the field from the result (output) after all
values were calculated. This is useful for removing temporary values.
Calculator (cont)
The list of functions supported by
the calculator includes commonly
used mathematical and date
functions.
Add Constants
Adds constants to a stream.
The use is very simple:
Specify the name
Enter value in the form of a string
Specify the formats to convert the value into the chosen data type.
Null If
If the string representation of a field is equal to a specified value, then
the output value is set to null (empty).
Set Transformations
Set Transformations
Set Transformations are steps that operate on the entire set of data
within a stream.
These operations work across all rows, not strictly within a single row.
The steps covered in the section include:
Filter Rows
Sort Rows
Join Rows
Merge Rows
Unique Rows
Aggregate Rows
Group By
Filter Rows
Filter rows based upon conditions and comparisons with full boolean logic
supported.
Output can be diverted into 2 streams: Records which pass (true) the condition
and records which fail (false).
Often used to:
Identify exceptions that must be written to a bad file
Branch transformation logic if single source has two interpretations
The options provided for this step include:
Send true data to step: Which step receives those rows which pass the
condition.
Send false data to step: Which step receives those rows which fail the
condition.
Sort Rows
Sort rows based upon specified fields, including sub sorts, in ascending
or descending order.
The options provided for this step include:
A list of fields and whether they should be sorted ascending or not.
Sort directory: This is the directory in which the temporary files are
stored when needed. The default is the standard temporary directory
for the system.
Sort size: The more rows you can store in memory, the faster the
sort. Eliminating the need for temp files reduces costly disk I/O.
The TMP-file prefix: Choose a recognizable prefix to identify the
files when they show up in the temp directory.
Join Rows
Produces combinations of all rows on the input streams.
The options provided by PDI on this feature include:
Step name: Name of the step; name has to be unique.
Main step to read from: Specifies the step to read the most data
from. This step is not cached or spooled to disk, the others are.
The condition: User can enter a complex condition to limit the
number of output rows. If empty, the result is a cartesian product.
Temp directory: Specify the name of the directory where the system
stores temporary files.
Temporary file prefix: This is the prefix of the temporary files that
will be generated.
Max. cache size: The number of rows to cache before the system
reads data from temporary files.
Merge Join
The Merge Join step performs a classic merge join between data sets
with data coming from two different input steps. Join options include
INNER, LEFT OUTER, RIGHT OUTER, and FULL OUTER.
The options provided by PDI on this feature include:
Step name: Name of the step; name has to be unique.
First Step: Specify the first input step to the merge join.
Second Step: Specify the second input step to the merge join.
Join Type: INNER, LEFT OUTER, RIGHT OUTER, or FULL OUTER
Keys for 1st step: Specify the key fields on which the incoming data
is sorted.
Keys for 2nd step: Specify the key fields on which the incoming data
is sorted.
Sorted Merge
The Sorted Merge step merges rows coming from multiple input steps
provided these rows are themselves sorted on the given key fields.
The options provided by PDI on this feature include:
Fields: Specify the fieldname and sort direction
(ascending/descending).
Merge Rows
Compares and merges two streams of data
Reference Stream
Compare Stream
Mostly used to identify deltas in source data when no timestamp is
available
Reference Stream = The previously loaded data
Compare Stream = The newly extracted data from the source
Usage note: Ensure streams are sorted by comparison key fields
The output row is marked as follows:
identical: The key was found in both streams and the values to
compare were identical.
changed: The key was found in both streams but one or more
values is different.
new: The key was not found in the reference stream.
deleted: The key was not found in the compare stream.
Unique Rows
Removes duplicates from the input stream.
Usage Note: Only consecutive records will be compared for duplicates, thus the
stream must be sorted by comparison fields.
The options provided for this step include:
Add counter to output?: Enable this to know how many duplicate rows there
were for each row in the output.
Counter fields: Name of the numeric field containing the number of
duplicate rows for each output record.
Fieldnames: A list of field names on which the uniqueness is compared. Data
in the other fields of the row is ignored.
Ignore Case Flag: Allows case insensitive matching on string fields.
Aggregate Rows
Generates unique rows and produces aggregate metrics.
The available aggregation types include SUM, AVERAGE, COUNT, MIN,
MAX, FIRST and LAST.
THIS STEP TYPE IS DEPRECATED AND SHOULD NOT BE USED
Use Group By step type instead.
Group By
Calculates aggregated values over a defined group of fields.
Operates much like the group by clause in SQL.
The options provided for this step include:
Aggregates: Specify the fields that need to be aggregated, the method (SUM,
MIN, MAX, etc.) and the name of the resulting new field.
Include all rows: If checked, the output will include both the new aggregate
records and the original detail records. You must also specify the name of
the output field that will be created and hold a flag which tells whether the
row is an aggregate or a detail record.
A very useful feature: the aggregate function Concatenate strings separated by
can be used to create a list of keys like 117, 131, 145, ...
The input needs to be sorted; another option is to use the Memory Group
By step, which handles unsorted input.
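For comparison, a hedged SQL equivalent of a typical Group By configuration (hypothetical table and field names):

SELECT customernumber,
       SUM(totalprice) AS sum_totalprice,
       COUNT(*)        AS nr_of_orders,
       MAX(orderdate)  AS last_orderdate
FROM orders
GROUP BY customernumber;

Unlike SQL, the PDI step expects the incoming stream to be sorted on the grouping fields (or use Memory Group By).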
Pivot Transformations
Pivot Transformations
Pivot Transformations are steps which flip the axis of the data (from rows
to columns and vice-versa).
Steps that are covered in this section:
Row Normalizer
Denormalizer
Row Flattener
Row Normalizer
Normalizes rows of data
For example:
Input (denormalized):
weekdate      Miles   Loaded_miles   Empty_miles
2001-01-07    1996    1996           ...
2001-01-28    587     539            48
...

Output (normalized):
Weekdate      Metric Type    Quantity
2001-01-07    Miles          1996
2001-01-07    Loaded Miles   1996
2001-01-07    Empty Miles    ...
2001-01-28    Miles          587
2001-01-28    Loaded Miles   539
2001-01-28    Empty Miles    48
Step name: Name of the step. This name has to be unique in a single
transformation.
Type field: The name of the type field. (Metric Type in our example)
Row Denormalizer
Denormalizes data by looking up key-value pairs.
For example:
Input (normalized):
Weekdate      Metric Type    Quantity
2001-01-07    Miles          1996
2001-01-07    Loaded Miles   1996
2001-01-07    Empty Miles    ...
2001-01-28    Miles          587
2001-01-28    Loaded Miles   539
2001-01-28    Empty Miles    48

Output (denormalized):
weekdate      Miles   Loaded_miles   Empty_miles
2001-01-07    1996    1996           ...
2001-01-28    587     539            48
...
Row Flattener
Flattens sequentially provided rows
Usage Notes
Rows must be sorted in proper order.
Use denormalizer if Key-Value pair intelligence is required for
flattening.
For example:
Input:
Field1   Field2   Field3   Flatten
A        B        C        One
A        B        C        Two
D        E        F        Three
D        E        F        Four

Output:
Field1   Field2   Field3   Target1   Target2
A        B        C        One       Two
D        E        F        Three     Four
Step name- Name of the step. This name has to be unique in a single
transformation.
Closure Generator
This step was created to allow you to generate a Reflexive Transitive
Closure Table for Mondrian.
Technically, this step reads all input rows in memory and calculates all
possible parent-child relationships. It attaches the distance (in levels)
from parent to child.
The options provided for this step include:
Add Sequence
Adds a sequence number to the stream.
A sequence is an ever-changing integer value with a defined start value and
increment.
Options provided for this step include:
Name of value- Name of new field that is added to stream.
Use DB to get sequence- Option to be enabled when sequence is to be driven by a
database sequence.
Connection name- Choose the name of the connection on which the database sequence
resides.
Sequence name- The name of the database sequence.
Use counter to calculate sequence- Enable to have the sequence generated by Kettle. Be
careful: Kettle-generated sequences are created anew for each run of the
transformation.
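If the sequence is driven by the database, it corresponds to an ordinary database sequence; a hedged sketch (names hypothetical, syntax varies per database, PostgreSQL/Oracle style shown):

CREATE SEQUENCE order_seq START WITH 1 INCREMENT BY 1;
-- PDI then fetches the next value, roughly equivalent to:
SELECT nextval('order_seq');              -- PostgreSQL
-- SELECT order_seq.NEXTVAL FROM dual;    -- Oracle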
Regex Evaluation
Evaluates a Regular Expression
Field to Evaluate - name of the EXISTING field that contains the
string you want to perform the evaluation against.
Result Fieldname - name of the NEW field to put the result in. Values:
Y/N.
Regular Expression - The regular expression to evaluate.
Other Options: Case Sensitivity, Encodings, Whitespace, etc
Split Fields
Split fields based upon delimiter
Options provided for this step include:
Field to split- The name of the field you want to split.
Delimiter- Delimiter that determines the end of values in the field.
Fields- List of fields to split into.
Original field: 12/31/2007
Multiple fields: 12, 31, 2007
Value Mapper
Maps input value to a new output value based on a mapping table
This is usually done in a data driven manner with a database table,
however this step allows you to define the mapping table in your code
Useful if the mapping table is small and rarely or never changes
For example, if user wants to replace Gender Types:
Fieldname to use: gender_code
Target fieldname: gender_desc
Default upon non-matching: If there is no match, use this value (like an else statement)
Source/Target Mapping: F->Female, M->Male
Start with the details, like order lines, with the product-specific issues
Look up the order headers to get some customer specifics and the order
date
This replaces e.g. the product code with the technical key productid
Introduction to Jobs
Jobs
Jobs aggregate up individual
pieces of functionality to
implement an entire process
Job Entries
Job Hops
Job Settings
Job Entries
Job Hops
Job Hops
A job hop is a graphical representation of the execution flow between two
job entries.
A hop always links two job entries and can be set (depending on the type of
the originating job entry) to execute the next entry:
unconditionally,
after successful execution,
or after failed execution
Unconditional
True (Success)
False (Error)
Job Settings
Job settings are the options that control the behavior of a job and the
method of logging a job's actions.
Job Entry
A job entry is a primary building block of a job
Execute transformations, retrieve files, generate email, etc.
A single job entry can be placed multiple times on the canvas.
Start
Defines the starting point for job execution
Only unconditional job hops are available from a Start job entry.
The start icon also contains basic scheduling functionality.
Dummy
Use the Dummy job entry to do nothing in a job.
This can be useful to make job drawings clearer or for looping.
Dummy performs no evaluation.
Transformation
Execute a transformation.
The options provided for this job entry are:
Name of the job entry- This name has to be unique in a single job. A
job entry can be placed several times on the canvas, however it will
be the same job entry.
Name of transformation- The name of the transformation to start.
Repository directory- The directory in repository where
transformation is located.
Filename- Specify the XML filename of the transformation to start.
Specify log file- Check this if you want to specify a separate logging
file for the execution of this transformation.
Name of log file- The directory and base name of the log file (for
example C:\logs).
Extension of the log file- The filename extension (for example: log or
txt)
Transformation (cont.)
Include time in filename - Adds the system time to the filename.
Logging level - Specifies the logging level for the execution of the
transformation.
Copy previous results to arguments - The results from a previous transformation
can be sent to this one using the Copy rows to result step.
Arguments - Specify the strings to use as arguments for the transformation.
Execute once for every input row - This implements looping: the transformation is executed once for every result row passed in from the previous job entry.
Transformation (example)
Using Repository
Using Files
Name of the job entry- This name has to be unique in a single job. A
job entry can be placed several times, however it will be the same job
entry.
Specify log file- Check this if you want to specify a separate logging
file for the execution of this job.
Name of log file- The directory and base name of the log file (for
example C:\logs)
Copy previous results to arguments - The results from a previous transformation can be sent to this job using the Copy rows to result step in a transformation.
Arguments - Specify the strings to use as arguments for the job.
Execute once for every input row - This implements looping. If the previous job
entry returns a set of result rows, you can have this job executed once for every
row found. One row is passed to this job at every execution. For example you can
execute a job for each file found in a directory using this option.
Using Files
Shell
Executes a shell script on the host where the job is running.
The options provided for this job entry are:
Name of the job entry - The name of the job entry. This name has to
be unique in a single job. A job entry can be placed several times on
the canvas, however it will be the same job entry.
Specify log file - Check this if you want to specify a separate logging
file for the execution of this shell script.
Name of log file - The directory and base name of the log file (for
example C:\logs).
Extension of the log file - The filename extension (for example: log or
txt).
Shell (cont.)
Include date in filename - Adds the system date to the filename.
Include time in filename - Adds the system time to the filename.
Logging level - Specifies the logging level for the execution of the shell.
Copy previous results to arguments - The results from a previous transformation
can be sent to the shell script using Copy rows to result step.
Arguments - Specify the strings to use as arguments for the shell script.
Execute once for every input row - This implements looping. If the previous job
entry returns a set of result rows, you can have this shell script executed once for
every row found. One row is passed to this script at every execution in
combination with the copy previous result to arguments. The values of the
corresponding result row can then be found on command line argument $1, $2, ...
(%1, %2, %3, ... on Windows).
Mail
Send an e-Mail.
The options provided for this job entry are:
Name of the job entry - The name of the job entry. This name has to
be unique in a single job. A job entry can be placed several times on
the canvas, however it will be the same job entry.
SMTP server - The mail server to which the mail has to be sent.
Mail (cont.)
Include date in message - Check this if you want to include date in the e-Mail.
Contact person - The name of the contact person to be placed in the e-Mail.
Contact phone - The contact telephone number to be placed in the e-Mail.
Comment - Additional comment to be placed in the e-Mail.
Attach files to message - Check this if you want to attach files to this message.
Select the result files types to attach - When a transformation (or job) processes
files (text, excel, dbf, etc) an entry is being added to the list of files in the result
of that transformation or job. Specify the types of result files you want to add.
Zip files into a single archive - Check this if you want to zip all selected files into
a single archive.
Zip filename - Specify the name of the zip file that will be placed into the e-mail.
SQL
Execute an SQL script
You can execute more than one SQL statement, provided that they are
separated by semi-colons
The options for this job entry are:
Name of the job entry - The name of the job entry. This name has to
be unique in a single job. A job entry can be placed several times on
the canvas, however it will be the same job entry.
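A small sketch of such a script (hypothetical table names), several statements separated by semicolons:

TRUNCATE TABLE stage_orders;
DELETE FROM fact_sales WHERE batch_id = 27;
UPDATE job_status SET last_run = CURRENT_TIMESTAMP WHERE job_name = 'load_sales';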
FTP
Retrieve one or more files from an FTP server.
The options provided for this job entry are:
User name - The user name to log into the FTP server.
Remote directory - The remote directory on FTP server from which files are taken.
Target directory - The directory on the machine on which Kettle runs in which you want
to place the transferred files
Wildcard - Specify a regular expression here if you want to select multiple files.
Use binary mode? - Check this if the files need to be transferred in binary mode.
Remove files after retrieval? - Remove the files on the FTP server, but only after all
selected files have been successfully transferred.
Table Exists
Verifies if a certain table exists on a database.
The options provided for this job entry are:
Name of the job entry - The name of the job entry. This name has to
be unique in a single job. A job entry can be placed several times on
the canvas, however it will be the same job entry.
File Exists
Verifies if a certain file exists on the server on which PDI runs.
The options provided for this job entry are:
Name of the job entry - The name of the job entry. This name has to
be unique in a single job. A job entry can be placed several times on
the canvas, however it will be the same job entry
JavaScript Evaluation
Calculates a boolean expression.
This result can be used to determine which next step will be executed.
The following variables are available for the expression:
nr (integer): the job entry number. Increments at every next job entry.
SFTP
Retrieves one or more files from an FTP server using the Secure FTP
protocol.
The options provided for this job entry are:
Name of the job entry - The name of the job entry.
SFTP-server name (IP) - The name of the SFTP server or the IP
address.
SFTP port - The TCP port to use. This is usually 22.
User name - The user name to log into the SFTP server.
Password - The password to log into the SFTP server.
Remote directory - Remote directory on SFTP server from which get
files.
Target directory - The directory on the machine on which Kettle runs
in which you want to place the transferred files.
Wildcard - Specify a regular expression if you want to select multiple
files.
Remove files after retrieval? - Remove the files on the SFTP server,
but only after all selected files have been successfully transferred.
HTTP
Gets a file from a web server using the HTTP protocol.
The options provided by Chef on this feature are:
Run for every result row - Check this if you want to run this job entry
for every row that was generated by a previous transformation. Use
the Copy rows to result.
Fieldname to get URL from - The fieldname in the result rows to get
the URL from.
HTTP (cont.)
Add date and time to target filename - Check this if you want to add
date and time (yyyyMMdd_HHmmss) to the target filename.
In the transformation you get the argument with the Get System Info
step:
When Execute for every input row is not checked, only the first row
will be taken (when arguments are used).
Clear the list of result rows must be checked, otherwise this could
lead to an infinite loop (because this transformation is generating rows,
too).
For every file that is written to the file system e.g. by the Text File
Output step or Excel Output step, its filename is stored in a List of
result files.
This list can be processed by a transformation with the step Get files
from result.
You can programmatically create this list by the step Set files in
result:
The list of result files can be used to automatically send all produced
files via mail:
Besides the mail addresses (To, CC, BCC, Reply To, From) and the mail
server settings you can enter the following options:
When you uncheck Only send comment in mail body, you get the following
details in the mail message body together with the message comment:
Enclosed you find the latest ....
[Example mail body details: the job name, directory and job entry, followed by
the execution statistics (lines read/written/updated/rejected, errors) and the
overall result (true).]
Select other file types to add logfiles for different detail levels.
If you want to add a log you need to define this in a previous job entry
for a transformation, e.g.:
When you create more files than you want to send, there is the option
to Clear list of result files within a running transformation or sub-job.
Attention: In this case, the job entry is called two times and does not
wait until both entries are finished. (see the following slide)
The job entry will call the job that includes the parallel tasks.
Now the job entry will be executed when both entries are finished.
Conditions
With conditions you can change the pathway of your job.
Most of the job entries can give a result back as true or false.
E.g. if a transformation fails, the result is false.
More complex conditions can be handled by the JavaScript job entry.
This is different from the JavaScript step for transformations and only
evaluates an expression.
Conditions
Within the JavaScript job entry you can use the following variables:
lines_input
lines_output
lines_updated
lines_rejected
lines_read
lines_written
files_retrieved
errors
exit_status
nr
is_windows
Note: All variables beginning with lines_ need some preparation
Conditions
Example for checking the processed lines:
Conditions
You have to enable and define what step of the transformation should
be taken for the variable lines_written. Do this within the
transformation:
Conditions
Example for checking for a specific value within the result lines:
Since 3.0 there is only the Modified JavaScript step with a compatibility
mode:
Test-Script
The Test-Script button creates test data for the input fields, so
depending on the required format this can fail. Example:
Built-in Functions
There are a lot of built-in functions with samples:
With _step_ you can access almost all information from the context
of your transformation and environment.
If you want to add 3rd party libraries, you can add the jar to the
classpath and use them.
Note: Keep in mind that if you use PDI classes, they can change in newer
versions.
Replacing Values
Use case: You want to replace parts of a string.
Actually, it is not possible to change the value of an existing field in
version 3.0
In the old-style engine, it was possible to change strings to numbers,
etc. The result was a big mess where data types got mixed a lot.
Number / Integer mixups especially since JS resorts to using doubles
for almost anything.
As such, the developers want to discourage doing exactly that as
much as possible.
For doing a replacement you have to use the compatibility mode.
The developers actually discuss a solution like this:
Add a column in the "Fields" section: "Replace value? (Y/N)" - In that
situation it can indeed convert values and verify data types.
Since PDI version 3.0 the JavaScript engine is not sealed any more:
Sealing prevented the use of common JavaScript libraries.
This was actually very limiting for experienced users because there
are some very good JavaScript libraries that contain many useful
functions.
Since PDI version 3.0 you can use them by including them in the
classpath.
http://jslib.mozdev.org/
Formula step
The Formula step is based on the OpenFormula syntax
You can reference values with square brackets: [value]
Get help for every function by clicking on the function
Applying business rules (if / then / else) with more complex logic is
possible
Dynamic Transformations
Dynamic Transformations
Use case: One transformation fits all
You want to use only one master transformation and want to control
the
Input-type (e.g. CSV, Fixed File, Excel)
Preprocessing
Field Mapping
Validation
Enrichment
You can accomplish this easily by the use of sub transformations
(mappings) and variables
Dynamic Transformations
Use case: One transformation fits all
Here is a sample transformation that calls different sub-transformations,
controlled by variables set from a job; this makes it completely flexible.
Dynamic Transformations
Use case: Dynamic field mapping
You get a lot of different input files and need to output this into a
harmonized file structure.
In this case, you can use the ETL Metadata Injection step controlling
this transformation:
Dynamic Transformations
ETL Metadata Injection step: sample controlling transformation:
Dynamic Transformations
With the possibility of controlling your transformation by metadata you
will be very flexible and can accomplish e.g. processing invoice data
from many customers or suppliers with very few transformations or
jobs.
More steps will support this powerful feature in the next releases.
Using XML
XML Output
Basic XML Output for simple and flat structures
XML Join
The XML Join step allows you to add XML tags from one stream into a
leading XML structure from a second stream (similar to Add XML but
more powerful).
XSD Validator
Validates an XML file against an XML Schema Definition (XSD).
DTD Validator
This step provides the ability to validate an XML document against a
Document Type Definition (DTD).
Process large XML files: define Prune path to handle large files
Since the processing logic of some XML files can be very tricky, good
knowledge of the existing Kettle steps is recommended when using this
step. Please see the different samples of this step for illustrations of
its usage.
XML Output
Basic XML Output for simple and flat structures
The usage is easy: besides the filename, you have to define the
root and row XML element names.
XML Output
A result could look like this:
Add XML
Adds XML to a data stream for more complex XML Outputs
This step allows you to encode the content of a number of fields
in a row in XML. This XML is added to the row in the form of a
String field.
Add XML
Enter the field names and optionally the Element name or if this
should be included as an attribute.
Add XML
A use case is to build more complex (nested) XML structures.
Here is a basic example:
XML Join
XML Join allows you to add XML tags from one stream into a leading XML
structure from a second stream.
Together with the Add XML step, XML Join is used for building more
complex XML files. It replaces the stream join of fields and
simplifies the creation.
Referencing Variables
Whenever you see this icon, you can use variables:
Press Ctrl-Space to see a list of available variables.
Referencing Variables
If you want to test your transformation at design time, make sure you
set the variable for test purposes. (Edit / Set environment variables)
Spoon automatically detects variables that are referenced but not set
and lists them here.
Referencing Variables
Variables are very useful in:
Flexible file processing:
Flexible table processing:
Table Input:
Table Output:
And flexible database connections:
Named Parameters
Named parameters are a system that allows you to parameterize your
transformations and jobs. On top of the variables system that was
already in place prior to the introduction in version 3.2, named
parameters offer the setting of a description and a default value. That
allows you in turn to list the required parameters for a job or
transformation.
They can be set in the job or transformation properties.
Named Parameters
When starting a job or transformation you can overwrite the default:
Shared Objects
A variety of objects can be placed in a shared objects file on the local
machine.
The default location for the shared objects file is
$HOME/.kettle/shared.xml.
Objects that can be shared using this method include:
Database connections
Steps
Slave servers
Partition schemas
Cluster schemas
To share one of these objects, simply right-click on the object in the
tree control on the left and choose share.
Shared Objects
This is especially useful for connections, as we do it in this course.
Thus we do not have to enter the information again for new
transformations.
If we want to change one of the connection properties, like the user
name, we can do it once for all transformations.
The same applies when you use slave servers, partition schemas, and
cluster schemas.
Shared Objects
If you want to change the location of the shared objects file, you can do
this in the properties of your transformation.
This is recommended when it should be independent of the user's home
directory.
JNDI
To configure the connection for the use at design time, edit the file:
data-integration\simple-jndi\jdbc.properties
This file can be found in the DI server in:
data-integration-server\pentaho-solutions\system\simple-jndi
Here is a sample connection for a shared connection named
SampleData:
SampleData/type=javax.sql.DataSource
SampleData/driver=org.hsqldb.jdbcDriver
SampleData/url=jdbc:hsqldb:hsql://localhost/sampledata
SampleData/user=pentaho_user
SampleData/password=password
Logging
What is Logging?
Summarized Information about the Job or Transformation execution
Number of records Inserted
Total Elapsed Time spent in a Transformation
Job1
PDI Logging
Transform1
Transform2
Data Warehouse
Staging
Data Marts
Logging Channels
Very detailed information about every channel, e.g. the object type and
name:
Logging Channels
Deep information about the executed transformation filename or
repository details (Object and Revision) and the logging channel hierarchy
(Parent / Root):
This is also usable for lineage analysis when a report is built from this
information
Performance Logging
You need to enable performance Logging and Monitoring:
Performance Logging
You can see the results in the
Execution Results / Performance Graph:
Performance Logging
You can see the results also in the log table for further analysis:
A snapshot was taken every second
Performance Logging
The snapshot contains the processed rows:
Performance Logging
The snapshot also contains the buffer situation.
This is very useful for analyzing bottlenecks: e.g. when the number of
input buffer rows is often higher than the number of output buffer rows
(the ratio), this step takes more time for processing and is most likely
the bottleneck.
The normal preview looks like this (not the sniffing, but similar):
Note: Press the Close button
and not the Stop button, see
next slide....
You can analyze the detailed row before and after the step or at any
other position within one transformation:
The entire transformation will not fail and continue to process your
data.
Note: Not all steps support this error handling feature at this time.
With Always log rows, all rows in this data stream will be logged.
You see CR / LF within the data. If you want to eliminate them, you can
use a JavaScript step.
ETL Patterns
Introduction to Patterns
pattern / noun
1) a form or model proposed for imitation
2) something designed or used as a model for making things <a
dressmaker's pattern>
3) an artistic, musical, literary, or mechanical design or form
Informally
Common Data X
Common Format Y
Identify Candidates
Notice data SIMILARITIES, not
SPECIFICS
Data Characteristics
Similar To...
Processing Characteristics
Similar algorithms
Similar square pegs and round
holes
Housekeeping Characteristics
Similar loading/tracking
techniques
Pattern : Batching
Tag DML with something to identify it as part of one load process
Batch ID: 27, Load Date: 10-Oct-2007
FACT/SCD II records have a column that identifies which batch they
were inserted during
Logical Rollback
Committing every 1000 records is good for performance, but what happens
when you only get part way through a load?
DELETE from FACT where batch_id = (current batch_id) to cleanup
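A hedged sketch of the idea (hypothetical table and column names): every inserted record carries the batch id, so an incomplete load can be rolled back logically:

INSERT INTO fact_sales (batch_id, order_id, amount) VALUES (27, 10101, 199.98);
-- the load for batch 27 failed part way through, so clean it up:
DELETE FROM fact_sales WHERE batch_id = 27;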
Auditing
Get the BATCH_ID from somewhere and set it as a variable.
Use the variable ${BATCH_ID} in all INSERT / UPDATE operations and steps.
Your data ends up having an extra attribute (the batch id) on every record.
Detect changes
TWO inputs
What to compare
Flagfield
Created
Changed
var DAYS_BETWEEN_ORDERS;
var one_day = 1000*60*60*24;
if ( PREV_ORDER_DATE.getDate() != null ) {
  DAYS_BETWEEN_ORDERS = Math.round(Math.abs(orderdate.getDate().getTime() -
    PREV_ORDER_DATE.getDate().getTime()) / one_day);
}
else {
  DAYS_BETWEEN_ORDERS = 0;
}
Auditing Requirements
Track what process made what changes to what data
Easy for downstream ETL
SELECT * FROM ODS_TABLE where UPDATED_TIME >= last time I ran
[Example: the customer record carries audit columns such as customernumber, customername, CREATE_TIME, CREATE_USER and an update flag (Y/N).]
The best choice depends on your data sizes, your transactions and
your database behaviour.
Create temp swap tables that match the structures of the target tables.
When this is combined with delta loading, also copy the original table
to the swap table.
When you have referential integrity / foreign keys (which is not good
practice for a data warehouse), this concept can hardly be used.
Transformation
Post Processing
Pre Processing
Delete all rows with a dirty flag set in all target tables. Records in this
state mean the last job was not finished successfully and the partial data is
deleted.
Transformation
Post Processing
START TRANSACTION;
UPDATE oltp_orders SET dirty_flag=false where dirty_flag is true;
UPDATE oltp_orderdetails SET dirty_flag=false where dirty_flag is true;
COMMIT;
Pre Processing
Get the batch IDs from the logging table where the logging record indicates
an error. Delete all rows in all target tables for these batch IDs. Records
with these batch IDs mean the last job was not finished successfully and
the partial data will be deleted.
Transformation
Post Processing
Pre Processing
No action needed (eventually define loops and chunks of data, see next
slides)
Transformation
Load the table(s) starting from the previously saved keys (delta loading) and
save the last processed record; working in chunks is possible
Post Processing
Add constants
Add the table names (for the key
range).
Insert/Update
Store the last processed key and
batch id for the table.
Note: When the target key is different (e.g. auto generated), both keys
(source & target) would need to be stored.
Oltp_job_status
Get the last processed key.
Use max() to get at least one row.
Orders
Add a where clause: WHERE ordernumber > ?
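Sketched in SQL (column names are hypothetical): first read the last processed key, then feed it into the ? parameter of the Orders query:

SELECT COALESCE(MAX(last_key), 0) AS last_key
FROM oltp_job_status
WHERE table_name = 'orders';
-- Table Input for the delta, ? filled from the row above:
SELECT * FROM orders WHERE ordernumber > ? ORDER BY ordernumber;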
Results of delta loading in chunks (2560 rows in total, 3 chunks: 1000, 1000,
560 records)
Logging statistics of step orders and content of table oltp_job_status of
first, second and third run:
Pattern: Loops
In general loops are allowed in Jobs (in contrast to Transformations
were this is not possible). An Example for a loop is given here: Wait
for a file and process it with two transformations. After the processing
of the file is finished, the job should loop and wait for the next file.
Don't do it this way: This will work in general and for a long while.
But due to design reasons, sooner or later this will lead to a
StackOverflowError, even when the stack size is increased. This can
happen in production after some hours, days or weeks.
Pattern: Loops
Options:
Externalize the loop to the operating system within a shell or batch file.
The errorlevel could be checked also.
Iterations: Stop the job after a certain number of iterations and restart it
in a loop in the operating system (the loop could also be in a shell or batch
file)
Schedule it (e.g. from the DI server) at an interval, but avoid
overlapping runs (e.g. when a job takes longer than the interval)
Use the interval setting in the Start job entry (if this is suitable for the job)
Pattern: Loops
How to stop a job after a certain number of iterations?
Pattern: Loops
How to stop a job after a certain number of iterations?
The JavaScript step decrements a variable and validates if it should
continue:
Pattern: Loops
How to loop to load data in chunks until a maximum number of
iterations?
Pattern: Loops
How to loop to load data in chunks until a maximum number of
iterations?
Pattern: Loops
How to loop to load data in chunks until a maximum number of
iterations?
Pattern: Loops
How to loop to load data in chunks until a maximum number of
iterations?
You can try this with a maximum number of 2 iterations since it needs
3 when the chunk size is set to 1000. In this case it processes 2000
rows in two cycles for the transformation. When the job is run a
second time, it processes the remaining rows (560).
Pattern: Loops
How to avoid overlapping runs of jobs?
You can use a semaphore, e.g. setting a file or writing an entry in a
specific table and check this.
You can check the log entry of a job to execute if this job is still running:
Pattern: Loops
How to avoid overlapping runs of jobs?
Similar as before (loading chunks), but the SQL is different:
Pattern: Loops
How to avoid overlapping runs of jobs?
Similar as before (loading chunks), but the evaluation is different:
Enterprise Repository
Setup
The EE Data Integration Server must be started.
When Spoon starts up you will be prompted to connect to a
repository (or use the menu: Tools / Repository / Connect)
Add a new Enterprise Repository
Setup
Enter an ID (server reference) and name (local reference) for
your repository connection
Security
The Data Integration Server is configured out
of the box to use the Pentaho
default security provider
This has been pre-populated with a set of
sample users and roles including:
Joe - Member of the admin role with full
access and control of content on the Data
Integration Server
Suzy - Member of the CEO role with
permission to read and create content, but
not administer security
Please see the Best Practices for Deleting Users and Roles in the
Pentaho Enterprise Repository in the PDI Administration Guide
Content Management
Demo
Content Management
New repository based on JCR (Content Repository API for Java)
Improved Repository Browser
Enterprise Security
Configurable Authentication including support for LDAP and MSAD
Task Permissions defining what actions a user/role can perform such as
read/execute content, create content and administer security
Granular permissions on individual files and folders
Spoon
Carte
Pentaho BI Server
Scheduling
OR
Files
Db
Script (CRON)
OR Job A
Job B
Spoon
Db
OR
Files
OR
Job A
Job B
= Enterprise Repository
Scheduling options
Before PDI 4.0:
Via the operating system (e.g. CRON jobs, task scheduler)
Via the BI Suite scheduler (via xActions)
Via the Start-Job-Entry
It was complicated to schedule remote jobs (with Carte)
Spoon
Carte
Scheduling
BEFORE 4.0
OR
Script (CRON)
Files
Db
OR Job A
Spoon
Job B
AFTER 4.0
Scheduling
Db
OR
Files
OR
Job A
Job B
Scheduling - Demo
Scheduling - Demo
Pentaho's Agile BI
Pentaho's Agile BI initiative seeks to break down the barriers to
expanding your use of Business Intelligence through an iterative
approach to scoping, prototyping, and building complete BI
solutions.
Agile BI
Business users can operate with or
without IT resources
Faster time to value
Conventional BI apps can be built and
deployed rapidly within a single design
environment
Cloud or on-premise
Agile BI Phases
Individual / Departmental
Agile Exploration
Agile Data Transformation
Solution Prototyping
Institutional
Infrastructure Design
Dimensional Modeling
Iterative Solution Development
Operational Deployment
Pentaho's Agile BI
In support of the Agile BI methodology, the Spoon design
environment provides an integrated design environment for
performing all tasks related to building a BI solution including ETL,
reporting and OLAP metadata modeling and end user visualization.
Business users will be able to:
start interacting with data,
build reports with zero knowledge of SQL or MDX,
and work hand in hand with solution architects to refine the solution.
A Data Transformation becomes an Analysis Model.
UK is missing the territory in this example. This can be corrected very
fast to EMEA in the transformation.
Limitations
You need to use a Table Output step for the Visualization
Limitations of the 4.0 release: The table needs to have all fields
that you want to analyze. At this time there is no support to join
other tables (snowflake or star schema).
Agile BI - Models
Agile BI - Models
Create and change Models (like a Mondrian Schema for Analysis or
Ad-Hoc Reporting) on the fly
Integrated in the
database dialog
Types of Fields
The following types of fields are available:
Text Fields (Names, Types, Categories, etc.): Product Name is an
example of a Text field.
Time Period Fields: Fiscal Year and Order Month are examples of
Time Period fields.
Number Fields: These types of fields are designed for summing,
dividing, creating averages, etc.
Fields are color-coded by type in both the report and the Available
Fields pane.
Text Fields and Time Period Fields: Orange
Number Fields: Blue
The Result
Using Filters
Pentaho Analyzer offers the following options for filtering reports:
Filtering Text Fields: Text fields contain non-numeric information, so
you can choose to include or exclude certain values at will. Time
Periods, Names, Types, and Categories are examples of text field
groups; Product Line is an example of a specific text field.
Types of Filters
Pentaho Analyzer offers the following options for filtering reports:
Filtering Text Fields: Text fields contain non-numeric information, so
you can choose to include or exclude certain values at will.
Selecting from a list of values. Pentaho will display a list of values,
and you choose to include or exclude certain values
Match part of a string. You type in part of the name (string) that the
name Contains or Does not Contain
Filtering Number Fields: Number fields include numeric information.
Greater/Less Than...
Top 10, etc...
You can have only one numeric filter on a report at any given time.
When the report is generated, the numeric filter is applied after
other filters are applied.
Note: The various forms of totals (max, min, etc) will only display if
your report is set to display totals
Please see:
http://www.pentaho.com/what_is_agile/
http://www.pentaho.com/agile_bi/
Logical
Physical
Constraints
Foreign Key Constraints can kill performance
DML require a lookup to ensure consistency
Usually not a big deal for 10s and 100s of DML statements / sec
With batching we hope for 1000s and 10000s of DML statements / sec
Most efficient to do INSERTs and check consistency once in a batch
system
Pre Processing
Post Processing
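A hedged sketch of the pre and post processing (hypothetical names; exact syntax differs per database): drop the foreign key before the bulk load and re-create it afterwards, so consistency is checked once for the whole batch:

-- pre processing
ALTER TABLE fact_sales DROP CONSTRAINT fk_fact_sales_customer;
-- post processing
ALTER TABLE fact_sales
  ADD CONSTRAINT fk_fact_sales_customer
  FOREIGN KEY (customer_id) REFERENCES dim_customer (customer_id);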
Indexes
Rebuilding indexes while you're doing DML might be wasting resources
Why do the work 1000x times when you can do it once?
Many IDXes will need to be rebuilt entirely if enough DML has occurred
Bitmap Index is popular index for BI
Pre Processing
Post Processing
Create Indexes
STEPS TO USE:
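The PDI steps and job entries to use are shown in Spoon. Purely as an illustration, the SQL behind this pre-/post-processing might look as follows (hypothetical table and index names; syntax varies by database):

-- Pre processing: drop the index before the batch load
DROP INDEX idx_fact_sales_customer;

-- ... PDI batch load runs here ...

-- Post processing: build the index once, after all rows are loaded
CREATE INDEX idx_fact_sales_customer ON fact_sales (customer_id);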
Statistics
Statistics help databases create effective plans for queries
If they're off, you'll have poorly performing queries
Some databases have thresholds that automatically trigger a statistics collection
You DO want updated statistics at the end of your load
You DO NOT want statistics updates to occur throughout your load
Pre Processing
Post Processing
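Again only as an illustration, a post-processing sketch for refreshing statistics once after the load. The table and schema names are hypothetical and the statement differs per database:

-- Post processing: refresh optimizer statistics once, after the load has finished
ANALYZE fact_sales;                        -- PostgreSQL
ANALYZE TABLE fact_sales;                  -- MySQL
EXEC DBMS_STATS.GATHER_TABLE_STATS(ownname => 'DW', tabname => 'FACT_SALES');  -- Oracle (SQL*Plus)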
Summary Tables
For this example, OLAP AGGREGATEs and SUMMARIES are the same thing
Summary tables can improve performance and ease of use of reporting
tools
Pre Processing
Post Processing
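A hedged post-processing sketch for (re)building a summary table once per load, assuming a hypothetical fact_sales table and PostgreSQL-style SQL:

-- Post processing: rebuild the monthly sales summary from the fact table
DROP TABLE IF EXISTS sum_sales_by_month;
CREATE TABLE sum_sales_by_month AS
SELECT customer_id,
       DATE_TRUNC('month', order_date) AS order_month,
       SUM(amount)                     AS total_amount,
       COUNT(*)                        AS nr_orders
FROM   fact_sales
GROUP  BY customer_id, DATE_TRUNC('month', order_date);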
Post Processing
Clean up Staging
Your staging area (database, files) contains scratch data that is unnecessary after the DW has been loaded:
Temporary tables
Files from source systems that are no longer needed
Pre Processing
None
Post Processing
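As a sketch only (hypothetical staging and temporary table names), the database side of the clean-up; file clean-up is typically handled in the PDI job itself:

-- Post processing: clean out scratch data once the DW load has succeeded
TRUNCATE TABLE stg_orders;            -- empty the staging table, keep its structure
DROP TABLE IF EXISTS tmp_order_keys;  -- the temporary helper table is no longer needed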
ETL patterns
Most pre- and post-processing is needed in common situations, e.g. transactions
Please see also the chapter on ETL patterns
Tuning Strategy
[Diagram: the tuning cycle — start here: instrument and identify tuning candidates; tune individual transforms, jobs, and the database; measure and monitor improvement; repeat]
Scalability
[Chart: elapsed hours (0-40) versus number of records (100k-600k), comparing a good and a bad scaling curve]
Database Logging
Covered in a different module
Provides the data needed for the tuning strategy and planning
SQL Query (pseudo)
select
  TRANSNAME,
  RECORDS,
  ELAPSED_TIME (Start - End)
from PDI_TRANSFORM_LOG
[Result table: one row per run of load_csv_data with its record count and elapsed time]
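A possible concrete version of this pseudo query, assuming the default Kettle transformation log table columns (TRANSNAME, LINES_OUTPUT, STATUS, STARTDATE, ENDDATE; check your own log table definition) and the hypothetical table name PDI_TRANSFORM_LOG:

SELECT TRANSNAME,
       LINES_OUTPUT         AS records,
       ENDDATE - STARTDATE  AS elapsed_time   -- date arithmetic is database specific
FROM   PDI_TRANSFORM_LOG
WHERE  STATUS = 'end'       -- status values may differ per PDI version
ORDER  BY elapsed_time DESC;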
Approach
1) Identify tuning candidate (previous slides)
2) Review against basic tuning concepts
   1) Make change
   2) Measure
   3) Rinse and repeat
3) Tune SQL and database
   1) Make change
   2) Measure
   3) Rinse and repeat
Sort
Stream Lookup
Sort
Join Rows
Sort (did we mention this one already?)
Cons of DB Sort
Lookups
Row Passing
....
Runtime Data
Tabular data in Spoon that helps a developer understand information
about the transformation as it runs
Information about Number of Records streaming
Time
Records / Second
Status of Steps
Input / Output records on hops
Basics
Basics Example
Speed
Input / Output
Input / Output figure gives information about the
# of Records on the Input Hop
# of Records on the Output Hop
Hops can hold a configurable number of rows (0..N)
Example
Hop2
Step2 output = 0
Step3 input = 0
Look for the furthest downstream step with few records on its OUTPUT
and many records on its INPUT
Informally: look for the first 0 on the OUTPUT side and 10000 on the INPUT side
Backlog
Slow steps cause a backlog
Look at all the steps' Input / Output figures
[Example: the hops feeding the slow step each show 10000 buffered rows]
[Image: Columbia, the (2004) supercomputer built of 20 SGI Altix clusters, a total of 10,240 CPUs. Credit: NASA Ames Research Center/Tom Trower]
Note: When you get an error that the Socket port is already in use, try another
port. This can also happen when an error arises and the port is not closed.
Cx2 means these steps are executed clustered on two slave servers.
All other steps are executed at the master server
To execute the transformation:
You will also see a preloaded Row generator test as a test transformation in every Carte instance.
But you need a Sorted Merge step that does the following in the background, but in a clustered environment:
When you want to use clustered databases, uncheck the Dynamically create the schema option and the database clusters are taken into account.
Remember to check "Running in parallel" (otherwise you get all rows on all clusters).
The file is divided internally into chunks of data for each cluster node to process.
The same principle works for the Fixed file input.
Hadoop
Why Hadoop?
Low cost, reliable scale-out architecture for storing massive
amounts of data
Parallel, distributed computing framework for processing data
Proven success in solving Big Data problems at Fortune 500 companies like
Google, Yahoo!, IBM and GE
Vibrant community, exploding interest, strong commercial
investments
[Diagram: log files, DBs and other sources feed Hadoop; from Hadoop, data flows to data marts for batch reporting and ad hoc query, to interactive analysis, and to Agile BI]
[Diagram: Pentaho Data Integration with Hadoop — the workflow covers Load, Orchestrate, Optimize, Deploy and Visualize across Files / HDFS, Hadoop, Hive and an RDBMS]
Operations Patterns
Pattern: Watchdog
Watchdog:
A watchdog timer is a computer hardware or software timer that
triggers a system reset or other corrective action if the main program,
due to some fault condition, such as a hang, neglects to regularly
service the watchdog (writing a service pulse to it, also referred to
as kicking the dog, petting the dog, feeding the watchdog or
waking the watchdog).
The intention is to bring the system back from the unresponsive state
into normal operation. [...] (see e.g. Wikipedia for more)
Most of the PDI health checks can be accomplished with the
Watchdog concept
Pattern: Watchdog
Watchdog timers for multitasking (e.g. many PDI jobs and cluster nodes):
A software crash might go undetected by conventional watchdog strategies
Success lies in weaving the watchdog into the fabric of all of the system's
tasks, which is much easier than it sounds.
Build a watchdog task
Create a data structure (database table) that has one entry per task
When a task starts it increments its entry in the structure. Tasks that only
start once and stay active forever can increment the appropriate value
each time through their main loops, e.g. every 10,000 rows
As the job or transformation runs the number of counts for each task
advances.
Infrequently but at regular intervals the watchdog runs.
The watchdog scans the structure, checking that the count stored for each
task is reasonable. One that runs often should have a high count; another
which executes infrequently will produce a smaller value.
If the counts are unreasonable, halt and let the watchdog timeout and fire
an event. If everything is OK, set all of the counts to zero and exit.
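As a rough sketch only (using the hypothetical table and column names described on the following slides), the counter increment issued by a task and the reset issued by the watchdog could be plain SQL statements:

-- Inside a task (or its mapping), e.g. once per 10,000 processed rows:
UPDATE op_watchdog
SET    wd_counter  = wd_counter + 1,
       wd_last_run = CURRENT_TIMESTAMP
WHERE  wd_task_id  = 42;   -- hypothetical task id

-- Inside the watchdog, after all counters were checked and found OK:
UPDATE op_watchdog
SET    wd_counter = 0;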
Pattern: Watchdog
An example implementation with PDI:
This is task oriented, not server oriented
This means it will be checked, if the task (a Transformation or Job) is
running as expected independently in what environment (e.g.
clustered or not)
Task 1
Watchdog
Task
Definitions
1..n
Counter
1..n
Task 2
Event
Task n
Pattern: Watchdog
The Watchdog Task Definitions table
Environment variable: ${watchdog_task_table}
Example table name for operations:
op_watchdog_task
Fields & Descriptions:
wd_task_id — Unique Task ID
wd_task_description
wd_task_disabled — 1 = do not check the task (the counters are still incremented by the tasks)
wd_task_min_count
wd_task_max_count — when > 0: check if the counter is below this value after the cycle time
wd_task_cycle_minutes
Pattern: Watchdog
Fields & Descriptions (Watchdog Task Definitions table continued)
wd_task_lenient_count
wd_task_event_type
wd_task_event_details
wd_task_last_reset
wd_task_last_detection_count
Pattern: Watchdog
The Watchdog table for counting
Environment variable: ${watchdog_table}
Example table name for operations: op_watchdog
Fields & Descriptions
wd_task_id
Unique Task ID
wd_hostname
wd_ip_address
wd_slave_server
wd_last_run
wd_counter
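A minimal DDL sketch for these two tables. Column types are not specified in the course material, so the ones below are placeholders (generic ANSI-style SQL):

-- Watchdog task definitions (example name: op_watchdog_task)
CREATE TABLE op_watchdog_task (
    wd_task_id                    INTEGER PRIMARY KEY,
    wd_task_description           VARCHAR(255),
    wd_task_disabled              INTEGER,      -- 1 = do not check this task
    wd_task_min_count             INTEGER,
    wd_task_max_count             INTEGER,
    wd_task_cycle_minutes         INTEGER,
    wd_task_lenient_count         INTEGER,
    wd_task_event_type            VARCHAR(50),
    wd_task_event_details         VARCHAR(255),
    wd_task_last_reset            TIMESTAMP,
    wd_task_last_detection_count  INTEGER
);

-- Watchdog counters (example name: op_watchdog)
CREATE TABLE op_watchdog (
    wd_task_id      INTEGER,
    wd_hostname     VARCHAR(100),
    wd_ip_address   VARCHAR(45),
    wd_slave_server  VARCHAR(100),
    wd_last_run     TIMESTAMP,
    wd_counter      INTEGER
);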
Pattern: Watchdog
In the sample implementation, an event is fired when:
the cycle time is reached AND
  the counter is zero OR
  the counter is below min OR
  the counter is above max
When a lenient count is defined, the watchdog waits to fire an event until the number of detections reaches the lenient count.
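One way to express the detection as a single query (a sketch only, against the hypothetical tables sketched above; in the sample implementation this logic lives in the watchdog_check job and transformation). PostgreSQL-style interval arithmetic is assumed:

-- Tasks for which a detection should be raised
SELECT t.wd_task_id,
       w.wd_counter
FROM   op_watchdog_task t
JOIN   op_watchdog      w ON w.wd_task_id = t.wd_task_id
WHERE  t.wd_task_disabled = 0
  AND  t.wd_task_last_reset < CURRENT_TIMESTAMP - (t.wd_task_cycle_minutes * INTERVAL '1 minute')
  AND  (   COALESCE(w.wd_counter, 0) = 0
        OR (t.wd_task_min_count > 0 AND w.wd_counter < t.wd_task_min_count)
        OR (t.wd_task_max_count > 0 AND w.wd_counter > t.wd_task_max_count) );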
Watchdog Environment Variables
watchdog_task_table The Watchdog Task Definitions table
watchdog_table
watchdog_mode
Pattern: Watchdog
Overview of the sample implementation jobs and transformations:
Create a sample environment
Define a database connection operations_db and share it
Create test tables with job test_watchdog_create_tables
Fill the watchdog task table with samples by transformation
test_fill_watchdog_task
Samples for incrementing the watchdog counters by tasks
When you want to run the watchdog task within a job, have a look at
test_watchdog_job that is calling transformation watchdog_task to increment
the counter
When you want to run the watchdog task within a transformation, have a look
at transformation test_watchdog_task_streaming that is calling a subtransformation (mapping) watchdog_task_streaming. Make sure to define a
threshold at what number of processed rows the counter should be
incremented. This is used to avoid performance problems. It is also possible to
call the watchdog_task_streaming at the end with a blocking step or at the
beginning.
Pattern: Watchdog
Sample for the watchdog to check the counter:
Job watchdog_main should be run in an interval, e.g. every minute
This job is calling other jobs and transformations to implement the logic: job
watchdog_check_wrapper, job watchdog_check and transformation
watchdog_check
When an event is triggered, it is calling the job event
Test run:
When you let watchdog_main run for the first time, you will get the following
in the log entries for all tasks to check:
Result from watchdog_check - Detection: 0 (0=ok, 1=detection, 2=lenient ok)
[last_date is not valid, looks like the first run: initializing]
After the cycle time is reached and no task was run, you will get:
Result from watchdog_check - Detection: 1 (0=ok, 1=detection, 2=lenient ok)
[wd_counter is null or 0]
And the event is fired (in this case, the logging):
PDI Operations Event - Log ERROR: Task 99 exceeded
Feel free to let the tasks run to increment the counters, change some settings
and watch the different results!
You may define thresholds in a different table, check these similarly to the
other operations patterns and fire events accordingly, e.g. when the
available memory goes below 20 percent.
Pattern: Master Controller Failover
Definitions & Status table
fm_description
fm_status_url
fm_user
fm_password
fm_is_controller
fm_is_primary
fm_is_active
fm_failover_order
fm_last_check
1=Online, 0=Offline
fm_last_response_time
fm_last_response_message
fm_last_nr_jobs
fm_last_nr_transformations
fm_controlled_switch_to
fm_controlled_switch_initiated
failover_master_table
operations_instance_id=1
Fields of the log-analysis tables (analyze_log_table / analyze_log_check_table, see below):
al_type
1=Job, 2=Transformation
al_name
al_is_disabled
al_cycle_minutes
al_last_batch_id
This is the last completed batch_id (set to 1 at the beginning). Used to limit the
number of log entries to check.
al_deadlock_minutes
When >0: Check for new log entries after this time
to detect deadlock situations.
al_deadlock_event_type
al_deadlock_event_details
al_min_minutes
al_max_minutes
al_time_event_type
al_time_event_details
al_min_rows
al_max_rows
al_row_event_type
al_row_event_details
al_is_status_check_failed
al_status_event_type
al_status_event_details
al_last_check
al_last_check_batch_id
al_last_channel_id
al_last_status
al_last_log_crc_change
al_is_finished
al_detection
analyze_log_table
analyze_log_check_table
Further Information
More Resources
Kettle project page:
http://kettle.pentaho.com
Enterprise Edition Documentation, Knowledge Base Articles and more
http://kb.pentaho.com/
Community Documentation (WIKI):
http://wiki.pentaho.com/display/EAI/
For up-to-date information, check the forums:
http://forums.pentaho.org/forumdisplay.php?f=69
Bug and Feature Requests with Road Maps (JIRA):
http://jira.pentaho.com
FAQ for Bug and Feature Requests:
http://wiki.pentaho.com/display/EAI/Bug+Reports+and+Feature+Requests+FAQ
More Resources
Community:
http://community.pentaho.com
Pentaho Open Source Business Intelligence Suite - European User Group
http://xing.com/net/pug
Pentaho Open Source Business Intelligence at LinkedIn
http://www.linkedin.com/groups?gid=105573
Other user groups
http://wiki.pentaho.com/display/COM/Pentaho+User+Groups