Pentaho Data Integration
for Database Developers
July 2011
Welcome Agenda
Audience and prerequisites
Learning objectives
Class process
Course outline
Learning Objectives
At the end of the course, you should understand:
The basic architecture and features of Pentaho Data Integration.
The concept and features of the advanced Pentaho Data Integration
Enterprise Edition.
How PDI supports you in the Agile BI approach.
Learning Objectives
At the end of the course, you should be able to
Load and write data from and to different data sources
Join data from different sources
Use PDI and ETL design patterns (like restartable solutions)
Influence the performance aspects of databases and transformations
Build portable and flexible jobs and transformations
Schedule jobs and transformations
Use logging, monitoring and error handling features of PDI
Load, transform, and create complex XML structures
Use scripting (JavaScript, Formula, Java) in transformations
Apply clustering and partitioning solutions for high volumes
Learning Objectives
What are your objectives?
What are your expectations?
Course Process
Daily schedule:
9:00 am to 5:00 pm
1-hour lunch break at noon
15-minute morning break at 10:30
15-minute afternoon break at 3:30
The course is a combination of lecture, demo and labs.
Feel free to ask questions or to seek clarification!
Online survey will be provided for feedback and suggestions.
Common Uses
Data warehouse population:
Built-in support for slowly changing
dimensions, junk dimensions and other data
warehouse concepts.
Export of database(s) to text-file(s) or other
databases.
Import of data into databases, ranging from
text-files to Excel spreadsheets.
Data migration between database applications.
Common Uses
Exploration of data in an existing database
(tables, views, synonyms, ...).
Information enrichment by looking up data
in various information stores (databases,
text-files, Excel spreadsheets, ...).
Data cleansing by applying complex
conditions in data transformations.
Application integration.
Example
1
Agile BI
Modeling and Visualization perspectives
Kitchen
Command line tool for executing jobs
modeled in Spoon
Job Example
Transformations
Transformations are a network of logical tasks (Steps):
Read a flat file
Filter it
Sort it
Load it into MySQL
Threading mechanism
KETTLE_ROWSET_GET_TIMEOUT / KETTLE_ROWSET_PUT_TIMEOUT
Lazy Conversion
Lazy conversion is a delayed conversion of data types
Provides a performance boost.
Conversion takes place only where it is really needed
Ideally at output steps
Sometimes not at all, e.g. when reading from a text-file and writing back to a
text-file
If output format is the same as the input format, no conversion
Steps support lazy conversion
Specifically: CSV File Input, Fixed File Input and Table Input
Other steps support it transparently
Lazy Conversion
Binary form of data can cause issues:
For example: sorting on binary form ignores character set sorting
sequence.
New feature in the Select Values step
Covered in more detail in a separate module
Converts binary data to and from binary character data
Other methods:
Middle or scroll wheel button, click on the first step and drag onto the
second.
Use SHIFT+Click and drag from one step to another
Select 2 steps, right click on one of them and select New hop.
Drag Hops onto the canvas.
Inserting a step (or job entry) between others:
Move the step over the arrow until the arrow becomes drawn in bold
Release the mouse-button
Window sizes:
May have to resize some dialogs to see all parameters
Log View
Shows statistics associated with execution of a Transformation.
Used to understand performance and to check the results.
Logging can be very granular down to the row level if needed.
Safe Mode
Available in the Execute a Transformation/Job window
Used in cases that mix rows from various sources
Makes sure that these rows all have the same layout (metadata).
Forces each Transformation to check layout of each row.
Error thrown on row that differs in layout from first row
Step and offending row are reported
Has performance tradeoffs:
Checking on each row slows performance.
Source of an error found sooner, useful in trouble shooting.
Analyzing Errors
2010/05/18 16:38:00 - Generate Rows.0 - ERROR (version 4.0.0-) : Couldn't parse Integer
field [WrongType] with value [abc] -->
org.pentaho.di.core.exception.KettleValueException:
Debugging
Introduced in PDI 3.0
Provides condition break points
Replaying a Transformation
Is implemented for Text File Input and Excel Input
Allows files containing errors to be sent back to source and corrected.
Uses .line file to reprocess file:
ONLY lines that failed are processed during the replay.
Uses the date in the filename of the .line file to match the replay
date.
Database Connections
Multiple database connections to different databases can be created.
With a PDI repository:
Defined connections readily available to transformations and jobs.
Connection information for the repository itself is stored in
repositories.xml.
Without a PDI repository:
Connection definition contained in a single Transformation or Job.
Can share connection definitions in subsequent Transformations and
Jobs.
Database Connections
Available database connections appear in the Main Tree.
Database Connections
General database connection options
Quoting
Quoting is used when reserved names or special characters are used.
For example: field names such as sum, V.A.T., or overall sales.
PDI has an internal list of reserved names for most of the supported
database types.
PDI's automatic quoting can be overridden.
Feedback on quoting is always welcome to improve the quoting
algorithms.
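As a rough sketch (hypothetical table and field names), the generated SQL with quoting applied looks like this; the actual quote character depends on the database type (e.g. backticks on MySQL):

SELECT "sum", "V.A.T.", "overall sales"
FROM "sales data"
WHERE "sum" > 0;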
Database Explorer
In the toolbar:
In the connection context menu:
Impact Analysis
What impact does the Transformation have on the used databases?
SQL Editor
Creates the needed DDL for the output steps related to a database
table, often CREATE statements for tables or indices.
SQL button in the toolbar creates all needed DDL for tables.
No automatic mechanism to alter tables when the layout changes.
For example: A field type from a source table is changed
DDL can be easily and manually changed.
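As a sketch (hypothetical table and field names), the DDL produced by the SQL button typically looks like this and can be edited by hand before it is executed:

CREATE TABLE dim_customer
(
  customer_id INTEGER
, customer_name VARCHAR(100)
, city VARCHAR(50)
);
CREATE INDEX idx_dim_customer_lookup ON dim_customer (customer_id);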
Input and Output Steps
Table Input
Reads information from a database, using a connection and SQL.
Table Input Options
Step Name: The name has to be unique in a single Transformation.
Connection: The database connection used to read data from.
SQL: The statement used to read information from the database
connection; may be any query.
Insert data from step: The input step name where parameters for
the SQL come from, if appropriate.
Limit: Sets the number of lines that are read from the database.
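A minimal sketch of a Table Input query (hypothetical table and field names); the ? placeholder is filled from the step named under Insert data from step:

SELECT ordernumber, orderdate, customernumber
FROM orders
WHERE orderdate > ?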
Excel Input
Reads information from one or more Excel files.
The options provided by the PDI GUI for Excel input include:
Step Name: Name of the step.
File Tab: To define the filenames, with variable support.
Sheet Tab: To define the sheet(s) to import.
Fields Tab: To specify the fields that need to be read from the
Excel files.
Error handling Tab: Allows the user to define how to react when an error
is encountered.
Content Tab: Includes the sub options of Header, No empty rows,
Stop on empty rows, Field name, Sheet name field, Row number
field and Limit.
Access Input
Reads information from one or more Access files.
No ODBC connection necessary.
Allows Access files to be read on non-Windows platforms.
Access Input Options:
Step Name: Name of the step.
File Tab: To define the filenames, with variable support.
Content Tab: Specify the table name and the inclusion of the file
name, table name, row number and limit.
Fields Tab: Specify the fields that need to be read from the Access
files.
XBase Input
Reads data from most types of DBF file derivatives, known as the XBase
family (dBase III/IV, FoxPro, Clipper, ...).
Options:
Step name: Unique name (in transformation) of the step.
Filename: Name of XBase file with variable support.
Limit size: Only read this number of rows; zero means unlimited.
Add rownr?: Adds a field to the output with the specified name that
contains the row number.
Generate Rows
Outputs a number of rows; by default they are empty, but they can optionally
contain a number of static fields.
Options
Fields: Static fields user might want to include in the output row.
LDAP Input
Reads data from an LDAP server.
Options
Host: Hostname or IP address of the LDAP server.
Port: The TCP port to use, typically 389.
User Authentication: Enable to pass authentication credentials to
server.
Username/Password: For authenticating with the LDAP server.
Search base: Location in the directory from which the LDAP search
begins.
Filter String: The filter string for filtering the results.
Fields: Define the return fields and type.
Table Output
Insert (only) information in a database table.
Options
Target table
Commit size
Truncate table
Ignore insert errors
Partition data over tables
Use batch update for inserts
Return auto-generated key
Name auto-generated key field
Is the name of the table defined in a field
Insert / Update
Automates simple merge processing:
Look up a row using one or more lookup keys.
If a row is not found, insert the new row.
If found and the targeted fields are identical, do nothing.
If found and targeted fields are not identical, update the row.
Options
Step name
Connection
Target table
Commit size
Keys
Update fields
Do not perform any updates (If used, operates like Table Output,
but without any Insert errors caused by duplicate keys).
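Conceptually (not the literal SQL the step issues; table and field names are hypothetical), the step behaves like this lookup-then-insert-or-update logic:

-- look up the row on the key fields
SELECT customername, city FROM dim_customer WHERE customernumber = ?;
-- no row found: insert the new row
INSERT INTO dim_customer (customernumber, customername, city) VALUES (?, ?, ?);
-- row found but values differ: update the row
UPDATE dim_customer SET customername = ?, city = ? WHERE customernumber = ?;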
Update
Same as the Insert / Update step except no insert is performed in the
database table.
ONLY updates are performed.
Delete
Same as the Update step, except rows are deleted.
Excel Output
Exports data to an Excel file
Options
Sheet name
Protect sheet with a password
Use a template (e.g. with a preformatted sheet)
Append or override the contents of the template
Access Output
Exports data to an Access file
ODBC not required
Can be used on non-Windows platforms
Options
XML Output
Writes rows from any source to one or more XML files
Options
File name
Extension
Include stepnr in file name
Include date in file name
Include time in file name
Split every N rows
Parent XML element
Row XML element
Fields
Zipped
Encoding
[Star schema diagram: the pentaho_oltp source (OLTP) is loaded into an OLAP model with a Sales Facts fact table and Product, Customer, and Time dimension tables.]
Look at the first line: This is the default row returned when you look up
a dimension and the key is not found. This row (for not found entries) is
created automatically by PDI with null values.
Then enter the fields you want to retrieve (mind the change of the
columns, especially Type of return field instead of Type of dimension
update):
When the date field is empty, the current date and time is used for lookups
and new inserts.
When you have a date field with a valid-from date, you can use it
here.
Lookups
Lookups
The Lookup feature of PDI accesses a data source to find values according to
defined matching criteria, i.e. a key.
The following steps have lookup functionality in PDI:
Commonly Used
Database Lookup
Stream Lookup
Merge Join
Others
Database Join
Call database procedure
Dimension Update/Lookup
Combination Update/Lookup
HTTP Lookup
Database Lookup
Lookup attributes from a single table based on a key-matching criteria
Options for performing database lookup include:
Lookup table: The name of the table where the lookup is done.
Enable cache: This option caches database lookups for the duration of
the Transformation.
Enabling this option can increase performance.
Danger: If other processes are changing values in the table do not
set this option.
Load all data from table: Preload the complete data in memory at the
initialization phase. This can replace a Stream Lookup step in
combination with a Table Input step and is faster.
SELECT
ATTRIB1 as FullName
FROM
lookup_table
WHERE
ID = <<value of field in stream>>
Stream Lookup
Allows users to lookup data using information coming from other steps in the
transformation.
The data coming from the Source step is first read into memory (cache) and is
then used to look up data for each record in the main stream.
Options provided by Kettle GUI for performing stream lookup include:
Source step: The step from which to obtain the in-memory lookup data
Key(s) to lookup value(s): Allows user to specify names of fields that are used
to lookup values. Values are always searched using the equal comparison.
Fields to retrieve: User can specify the names of the fields to retrieve, as
well as the default value in case the value was not found or a new fieldname
in case the user wishes to change the output stream field name.
Merge Join
Takes TWO sorted streams and performs a traditional JOIN on EQUALITY
INNER = Only output a row when the key is in both streams
LEFT OUTER = Output a row even if there is no matching key in 2nd
Step
RIGHT OUTER = Output a row even if there is no matching key in 1st
Step
FULL OUTER = Output a row regardless of matching
Options provided by PDI GUI for merge join include:
First Step: Step to refer to as the 1st.
Second Step: Step to refer to as the 2nd.
Keys for 1st: The key fields from the 1st stream.
Keys for 2nd: The key fields from the 2nd stream.
Join Type: INNER, LEFT OUTER, RIGHT OUTER, or FULL OUTER.
Database Join
Options provided by Kettle GUI for database join procedure include:
SQL: The SQL query to launch towards the database.
Number of rows to return: 0 means all, any other number limits the
number of rows.
Outer join?: When checked, will always return a single record for each
input stream record, even if the query did not return a result.
The parameters to use in the query.
Parameters noted as ? in the query
Order of fields in parameter list must match the order of the ? in the
query.
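A minimal sketch of a Database Join query (hypothetical table and field names); the query is run for every input row, and each ? is replaced, in order, by the listed parameter fields:

SELECT c.customername, c.creditlimit
FROM customers c
WHERE c.customernumber = ?
  AND c.country = ?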
Dimension Lookups
Uses the same dimension step used for updating
Same fields, same setup
Stream Datefield: the stream date must lie between the effective and expiry dates of the dimension entry.
HTTP Lookup
Covered in Web Service module
Field Transformations
Field Transformations are steps that operate at the field level within a
stream record.
The step types covered in this section include:
Select Values
Calculator
Add Constants
Null If
Select Values
This step type is used to:
Select/remove fields from the process stream.
Rename fields
Specify/change the length and/or precision of fields.
3 Tabs are provided:
Select and Alter: Specify the exact order and name in which the fields
have to be placed in the output rows.
Remove: Specify the fields that have to be removed from the output
rows.
Meta-data: Change the name, type, length and precision (the metadata) of one or more fields.
Calculator
Provides a list of functions that can be executed on field values.
An important advantage Calculator has over custom JavaScript scripts is
that the execution speed of Calculator is many times that of a script.
Besides the arguments (Field A, Field B and Field C) the user also needs
to specify the return type of the function.
You can also opt to remove the field from the result (output) after all
values were calculated. This is useful for removing temporary values.
Calculator (cont)
The list of functions supported by
the calculator includes commonly
used mathematical and date
functions.
Add Constants
Adds constants to a stream.
The use is very simple:
Specify the name
Enter value in the form of a string
Specify the formats to convert the value into the chosen data type.
Null If
If the string representation of a field is equal to a specified value, then
the output value is set to null (empty).
Set Transformations
Set Transformations
Set Transformations are steps that operate on the entire set of data
within a stream.
These operations work across all rows, not strictly within a single row.
The steps covered in the section include:
Filter Rows
Sort Rows
Join Rows
Merge Rows
Unique Rows
Aggregate Rows
Group By
Filter Rows
Filter rows based upon conditions and comparisons with full boolean logic
supported.
Output can be diverted into 2 streams: Records which pass (true) the condition
and records which fail (false).
Often used to:
Identify exceptions that must be written to a bad file
Branch transformation logic if single source has two interpretations
The options provided for this step include:
Send true data to step: Which step receives those rows which pass the
condition.
Send false data to step: Which step receives those rows which fail the
condition.
Sort Rows
Sort rows based upon specified fields, including sub sorts, in ascending
or descending order.
The options provided for this step include:
A list of fields and whether they should be sorted ascending or not.
Sort directory: This is the directory in which the temporary files are
stored when needed. The default is the standard temporary directory
for the system.
Sort size: The more rows you can store in memory, the faster the
sort. Eliminating the need for temp files reduces costly disk I/O.
The TMP-file prefix: Choose a recognizable prefix to identify the
files when they show up in the temp directory.
Join Rows
Produces combinations of all rows on the input streams.
The options provided by PDI on this feature include:
Step name: Name of the step; name has to be unique.
Main step to read from: Specifies the step to read the most data
from. This step is not cached or spooled to disk, the others are.
The condition: User can enter a complex condition to limit the
number of output rows. If empty, the result is a cartesian product.
Temp directory: Specify the name of the directory where the system
stores temporary files.
Temporary file prefix: This is the prefix of the temporary files that
will be generated.
Max. cache size: The number of rows to cache before the system
reads data from temporary files.
Merge Join
The Merge Join step performs a classic merge join between data sets
with data coming from two different input steps. Join options include
INNER, LEFT OUTER, RIGHT OUTER, and FULL OUTER.
The options provided by PDI on this feature include:
Step name: Name of the step; name has to be unique.
First Step: Specify the first input step to the merge join.
Second Step: Specify the second input step to the merge join.
Join Type: INNER, LEFT OUTER, RIGHT OUTER, or FULL OUTER
Keys for 1st step: Specify the key fields on which the incoming data
is sorted.
Keys for 2nd step: Specify the key fields on which the incoming data
is sorted.
Sorted Merge
The Sorted Merge step merges rows coming from multiple input steps
provided these rows are themselves sorted on the given key fields.
The options provided by PDI on this feature include:
Fields: Specify the fieldname and sort direction
(ascending/descending).
Merge Rows
Compares and merges two streams of data
Reference Stream
Compare Stream
Mostly used to identify deltas in source data when no timestamp is
available
Reference Stream = The previously loaded data
Compare Stream = The newly extracted data from the source
Usage note: Ensure streams are sorted by comparison key fields
The output row is marked as follows:
identical: The key was found in both streams and the values to
compare were identical.
changed: The key was found in both streams but one or more
values is different.
new: The key was not found in the reference stream.
deleted: The key was not found in the compare stream.
Unique Rows
Removes duplicates from the input stream.
Usage Note: Only consecutive records will be compared for duplicates, thus the
stream must be sorted by comparison fields.
The options provided for this step include:
Add counter to output?: Enable this to know how many duplicate rows there
were for each row in the output.
Counter fields: Name of the numeric field containing the number of
duplicate rows for each output record.
Fieldnames: A list of field names on which the uniqueness is compared. Data
in the other fields of the row is ignored.
Ignore Case Flag: Allows case insensitive matching on string fields.
Aggregate Rows
Generates unique rows and produces aggregate metrics.
The available aggregation types include SUM, AVERAGE, COUNT, MIN,
MAX, FIRST and LAST.
THIS STEP TYPE IS DEPRECATED AND SHOULD NOT BE USED
Use Group By step type instead.
Group By
Calculates aggregated values over a defined group of fields.
Operates much like the group by clause in SQL.
The options provided for this step include:
Aggregates: Specify the fields that need to be aggregated, the method (SUM,
MIN, MAX, etc.) and the name of the resulting new field.
Include all rows: If checked, the output will include both the new aggregate
records and the original detail records. You must also specify the name of
the output field that will be created and hold a flag which tells whether the
row is an aggregate or a detail record.
A very useful feature: the aggregate function Concatenate strings separated by
can be used to create a list of keys like 117, 131, 145, ...
The input needs to be sorted; another option is to use the Memory Group
By step, which handles unsorted input.
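For comparison, a hedged SQL equivalent of a typical Group By configuration (hypothetical table and field names):

SELECT customernumber,
       SUM(totalprice) AS sum_totalprice,
       COUNT(*)        AS nr_of_orders,
       MAX(orderdate)  AS last_orderdate
FROM orders
GROUP BY customernumber;

Unlike SQL, the PDI step expects the incoming stream to be sorted on the grouping fields (or use Memory Group By).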
Pivot Transformations
Pivot Transformations
Pivot Transformations are steps which flip the axis of the data (from rows
to columns and vice-versa).
Steps that are covered in this section:
Row Normalizer
Denormalizer
Row Flattener
Row Normalizer
Normalizes rows of data
For example:
Input (denormalized):
weekdate      Miles   Loaded_miles   Empty_miles
2001-01-07    1996    1996           ...
2001-01-28    587     539            48
...

Output (normalized):
Weekdate      Metric Type    Quantity
2001-01-07    Miles          1996
2001-01-07    Loaded Miles   1996
2001-01-07    Empty Miles    ...
2001-01-28    Miles          587
2001-01-28    Loaded Miles   539
2001-01-28    Empty Miles    48
Step name: Name of the step. This name has to be unique in a single
transformation.
Type field: The name of the type field. (Metric Type in our example)
Row Denormalizer
Denormalizes data by looking up key-value pairs.
For example:
Input (normalized):
Weekdate      Metric Type    Quantity
2001-01-07    Miles          1996
2001-01-07    Loaded Miles   1996
2001-01-07    Empty Miles    ...
2001-01-28    Miles          587
2001-01-28    Loaded Miles   539
2001-01-28    Empty Miles    48

Output (denormalized):
weekdate      Miles   Loaded_miles   Empty_miles
2001-01-07    1996    1996           ...
2001-01-28    587     539            48
...
Row Flattener
Flattens sequentially provided rows
Usage Notes
Rows must be sorted in proper order.
Use denormalizer if Key-Value pair intelligence is required for
flattening.
For example:
Input:
Field1   Field2   Field3   Flatten
A        B        C        One
A        B        C        Two
D        E        F        Three
D        E        F        Four

Output:
Field1   Field2   Field3   Target1   Target2
A        B        C        One       Two
D        E        F        Three     Four
Step name- Name of the step. This name has to be unique in a single
transformation.
Closure Generator
This step was created to allow you to generate a Reflexive Transitive
Closure Table for Mondrian.
Technically, this step reads all input rows in memory and calculates all
possible parent-child relationships. It attaches the distance (in levels)
from parent to child.
The options provided for this step include:
Add Sequence
Adds a sequence number to the stream.
A sequence is an ever-changing integer value with a defined start value and
increment.
Options provided for this step include:
Name of value- Name of new field that is added to stream.
Use DB to get sequence- Option to be enabled when sequence is to be driven by a
database sequence.
Connection name- Choose the name of the connection on which the database sequence
resides.
Sequence name- The name of the database sequence.
Use counter to calculate sequence- Enable to have the sequence generated by Kettle. Be
careful: Kettle-generated sequences are created anew for each run of the
transformation.
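If the sequence is driven by the database, it corresponds to an ordinary database sequence; a hedged sketch (names hypothetical, syntax varies per database, PostgreSQL/Oracle style shown):

CREATE SEQUENCE order_seq START WITH 1 INCREMENT BY 1;
-- PDI then fetches the next value, roughly equivalent to:
SELECT nextval('order_seq');              -- PostgreSQL
-- SELECT order_seq.NEXTVAL FROM dual;    -- Oracle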
Regex Evaluation
Evaluates a Regular Expression
Field to Evaluate - name of the EXISTING field that contains the
string you want to perform the evaluation against.
Result Fieldname - name of the NEW field to put the result in. Values:
Y/N.
Regular Expression - The regular expression to evaluate.
Other Options: Case Sensitivity, Encodings, Whitespace, etc
Split Fields
Split fields based upon delimiter
Options provided for this step include:
Field to split- The name of the field you want to split.
Delimiter- Delimiter that determines the end of values in the field.
Fields- List of fields to split into.
Original field: 12/31/2007
Multiple fields: 12, 31, 2007
Value Mapper
Maps input value to a new output value based on a mapping table
This is usually done in a data driven manner with a database table,
however this step allows you to define the mapping table in your code
Useful if the mapping table is small and rarely or never changes
For example, if user wants to replace Gender Types:
Fieldname to use: gender_code
Target fieldname: gender_desc
Default upon non-matching: If there is no match, use this value (like an else statement)
Source/Target Mapping: F->Female, M->Male
Start with the details, like order lines, with the product-specific issues
Look up the order headers to get some customer specifics and the order
date
This replaces e.g. the product code with the technical key productid
Introduction to Jobs
Jobs
Jobs aggregate up individual
pieces of functionality to
implement an entire process
Job Entries
Job Hops
Job Settings
Job Entries
Job Hops
Job Hops
A job hop is a graphical representation of the execution flow between two
job entries.
A hop always links two job entries and can be set (depending on the type of
the originating job entry) to execute the next entry:
unconditionally,
after successful execution,
or after failed execution
Unconditional
True (Success)
False (Error)
Job Settings
Job settings are the options that control the behavior of a job and the
method of logging a job's actions.
Job Entry
A job entry is a primary building block of a job
Execute transformations, retrieve files, generate email, etc.
A single job entry can be placed multiple times on the canvas.
Start
Defines the starting point for job execution
Only unconditional job hops are available from a Start job entry.
The start icon also contains basic scheduling functionality.
Dummy
Use the Dummy job entry to do nothing in a job.
This can be useful to make job drawings clearer or for looping.
Dummy performs no evaluation.
Transformation
Execute a transformation.
The options provided for this job entry are:
Name of the job entry- This name has to be unique in a single job. A
job entry can be placed several times on the canvas, however it will
be the same job entry.
Name of transformation- The name of the transformation to start.
Repository directory- The directory in repository where
transformation is located.
Filename- Specify the XML filename of the transformation to start.
Specify log file- Check this if you want to specify a separate logging
file for the execution of this transformation.
Name of log file- The directory and base name of the log file (for
example C:\logs).
Extension of the log file- The filename extension (for example: log or
txt)
Transformation (cont.)
Include time in filename - Adds the system time to the filename.
Logging level - Specifies the logging level for the execution of the
transformation.
Copy previous results to arguments - The results from a previous transformation
can be sent to this one using the Copy rows to result step.
Arguments - Specify the strings to use as arguments for the transformation.
Execute once for every input row - This implements looping: the transformation is executed once for every result row passed in from the previous job entry.
Transformation (example)
Using Repository
Using Files
Name of the job entry- This name has to be unique in a single job. A
job entry can be placed several times, however it will be the same job
entry.
Specify log file- Check this if you want to specify a separate logging
file for the execution of this job.
Name of log file- The directory and base name of the log file (for
example C:\logs)
Copy previous results to arguments - The results from a previous transformation can be sent to this job using the Copy rows to result step in a transformation.
Arguments - Specify the strings to use as arguments for the job.
Execute once for every input row - This implements looping. If the previous job
entry returns a set of result rows, you can have this job executed once for every
row found. One row is passed to this job at every execution. For example you can
execute a job for each file found in a directory using this option.
Using Files
Shell
Executes a shell script on the host where the job is running.
The options provided for this job entry are:
Name of the job entry - The name of the job entry. This name has to
be unique in a single job. A job entry can be placed several times on
the canvas, however it will be the same job entry.
Specify log file - Check this if you want to specify a separate logging
file for the execution of this shell script.
Name of log file - The directory and base name of the log file (for
example C:\logs).
Extension of the log file - The filename extension (for example: log or
txt).
Shell (cont.)
Include date in filename - Adds the system date to the filename.
Include time in filename - Adds the system time to the filename.
Logging level - Specifies the logging level for the execution of the shell.
Copy previous results to arguments - The results from a previous transformation
can be sent to the shell script using Copy rows to result step.
Arguments - Specify the strings to use as arguments for the shell script.
Execute once for every input row - This implements looping. If the previous job
entry returns a set of result rows, you can have this shell script executed once for
every row found. One row is passed to this script at every execution in
combination with the copy previous result to arguments. The values of the
corresponding result row can then be found on command line argument $1, $2, ...
(%1, %2, %3, ... on Windows).
Mail
Send an e-Mail.
The options provided for this job entry are:
Name of the job entry - The name of the job entry. This name has to
be unique in a single job. A job entry can be placed several times on
the canvas, however it will be the same job entry.
SMTP server - The mail server to which the mail has to be sent.
Mail (cont.)
Include date in message - Check this if you want to include date in the e-Mail.
Contact person - The name of the contact person to be placed in the e-Mail.
Contact phone - The contact telephone number to be placed in the e-Mail.
Comment - Additional comment to be placed in the e-Mail.
Attach files to message - Check this if you want to attach files to this message.
Select the result files types to attach - When a transformation (or job) processes
files (text, excel, dbf, etc) an entry is being added to the list of files in the result
of that transformation or job. Specify the types of result files you want to add.
Zip files into a single archive - Check this if you want to zip all selected files into
a single archive.
Zip filename - Specify the name of the zip file that will be placed into the e-mail.
SQL
Execute an SQL script
You can execute more than one SQL statement, provided that they are
separated by semi-colons
The options for this job entry are:
Name of the job entry - The name of the job entry. This name has to
be unique in a single job. A job entry can be placed several times on
the canvas, however it will be the same job entry.
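A small sketch of such a script (hypothetical table names), several statements separated by semicolons:

TRUNCATE TABLE stage_orders;
DELETE FROM fact_sales WHERE batch_id = 27;
UPDATE job_status SET last_run = CURRENT_TIMESTAMP WHERE job_name = 'load_sales';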
FTP
Retrieve one or more files from an FTP server.
The options provided for this job entry are:
User name - The user name to log into the FTP server.
Remote directory - The remote directory on FTP server from which files are taken.
Target directory - The directory on the machine on which Kettle runs in which you want
to place the transferred files
Wildcard - Specify a regular expression here if you want to select multiple files.
Use binary mode? - Check this if the files need to be transferred in binary mode.
Remove files after retrieval? - Remove the files on the FTP server, but only after all
selected files have been successfully transferred.
Table Exists
Verifies if a certain table exists on a database.
The options provided for this job entry are:
Name of the job entry - The name of the job entry. This name has to
be unique in a single job. A job entry can be placed several times on
the canvas, however it will be the same job entry.
File Exists
Verifies if a certain file exists on the server on which PDI runs.
The options provided for this job entry are:
Name of the job entry - The name of the job entry. This name has to
be unique in a single job. A job entry can be placed several times on
the canvas, however it will be the same job entry
JavaScript Evaluation
Calculates a boolean expression.
This result can be used to determine which next step will be executed.
The following variables are available for the expression:
nr (integer): the job entry number. Increments at every next job entry.
SFTP
Retrieves one or more files from an FTP server using the Secure FTP
protocol.
The options provided for this job entry are:
Name of the job entry - The name of the job entry.
SFTP-server name (IP) - The name of the SFTP server or the IP
address.
SFTP port - The TCP port to use. This is usually 22.
User name - The user name to log into the SFTP server.
Password - The password to log into the SFTP server.
Remote directory - Remote directory on SFTP server from which get
files.
Target directory - The directory on the machine on which Kettle runs
in which you want to place the transferred files.
Wildcard - Specify a regular expression if you want to select multiple
files.
Remove files after retrieval? - Remove the files on the SFTP server,
but only after all selected files have been successfully transferred.
HTTP
Gets a file from a web server using the HTTP protocol.
The options provided by Chef on this feature are:
Run for every result row - Check this if you want to run this job entry
for every row that was generated by a previous transformation. Use
the Copy rows to result.
Fieldname to get URL from - The fieldname in the result rows to get
the URL from.
HTTP (cont.)
Add date and time to target filename - Check this if you want to add
date and time (yyyyMMdd_HHmmss) to the target filename.
In the transformation you get the argument with the Get System Info
step:
When Execute for every input row is not checked, only the first row
will be taken (when arguments are used).
Clear the list of result rows must be checked, otherwise this could
lead to an infinite loop (because this transformation is generating rows,
too).
For every file that is written to the file system e.g. by the Text File
Output step or Excel Output step, its filename is stored in a List of
result files.
This list can be processed by a transformation with the step Get files
from result.
You can programmatically create this list by the step Set files in
result:
The list of result files can be used to automatically send all produced
files via mail:
Besides the mail addresses (To, CC, BCC, Reply To, From) and the mail
server settings you can enter the following options:
When you uncheck Only send comment in mail body, you get the following
details in the mail message body together with the message comment:
Enclosed you find the latest ....
[Example mail body details: the job name, directory and job entry, followed by
the execution statistics (lines read/written/updated/rejected, errors) and the
overall result (true).]
Select other file types to add logfiles for different detail levels.
If you want to add a log you need to define this in a previous job entry
for a transformation, e.g.:
When you create more files than you want to send, there is the option
to Clear list of result files within a running transformation or sub-job.
Attention: In this case, the job entry is called two times and does not
wait until both entries are finished. (see the following slide)
The job entry will call the job that includes the parallel tasks.
Now the job entry will be executed when both entries are finished.
Conditions
With conditions you can change the pathway of your job.
Most of the job entries can give a result back as true or false.
E.g. if a transformation fails, the result is false.
More complex conditions can be handled by the JavaScript job entry.
This is different from the JavaScript step for transformations and only
evaluates an expression.
Conditions
Within the JavaScript job entry you can use the following variables:
lines_input
lines_output
lines_updated
lines_rejected
lines_read
lines_written
files_retrieved
errors
exit_status
nr
is_windows
Note: All variables beginning with lines_ need some preparation
Conditions
Example for checking the processed lines:
Conditions
You have to enable and define what step of the transformation should
be taken for the variable lines_written. Do this within the
transformation:
Conditions
Example for checking for a specific value within the result lines:
Since 3.0 there is only the Modified JavaScript step with a compatibility
mode:
Test-Script
The Test-Script button creates test data for the input fields, so
depending on the required format this can fail. Example:
Built-in Functions
There are a lot of built-in functions with samples:
With _step_ you can access almost all information from the context
of your transformation and environment.
If you want to add 3rd party libraries, you can add the jar to the
classpath and use them.
Note: Keep in mind that if you use PDI classes, they can change in newer
versions.
Replacing Values
Use case: You want to replace parts of a string.
Actually, it is not possible to change the value of an existing field in
version 3.0
In the old-style engine, it was possible to change strings to numbers,
etc. The result was a big mess where data types got mixed a lot.
Number / Integer mixups especially since JS resorts to using doubles
for almost anything.
As such, the developers want to discourage doing exactly that as
much as possible.
For doing a replacement you have to use the compatibility mode.
The developers actually discuss a solution like this:
Add a column in the "Fields" section: "Replace value? (Y/N)" - In that
situation it can indeed convert values and verify data types.
Since PDI version 3.0 the JavaScript engine is not sealed any more:
Sealing prevented the use of common JavaScript libraries.
This was actually very limiting for experienced users because there
are some very good JavaScript libraries that contain many useful
functions.
Since PDI version 3.0 you can use them by including them in the
classpath.
http://jslib.mozdev.org/
Formula step
The Formula step is based on the OpenFormula syntax
You can reference values with square brackets: [value]
Get help for every function by clicking on the function
Applying business rules (if / then / else) with more complex logic is
possible
Dynamic Transformations
Dynamic Transformations
Use case: One transformation fits all
You want to use only one master transformation and want to control
the
Input-type (e.g. CSV, Fixed File, Excel)
Preprocessing
Field Mapping
Validation
Enrichment
You can accomplish this easily by the use of sub transformations
(mappings) and variables
Dynamic Transformations
Use case: One transformation fits all
Here is a sample transformation that calls different sub-transformations,
controlled by variables set from a job; this makes it completely flexible.
Dynamic Transformations
Use case: Dynamic field mapping
You get a lot of different input files and need to output this into a
harmonized file structure.
In this case, you can use the ETL Metadata Injection step controlling
this transformation:
Dynamic Transformations
ETL Metadata Injection step: sample controlling transformation:
Dynamic Transformations
With the possibility of controlling your transformation by metadata you
will be very flexible and can accomplish e.g. processing invoice data
from many customers or suppliers with very few transformations or
jobs.
More steps will support this powerful feature in the next releases.
Using XML
XML Output
Basic XML Output for simple and flat structures
XML Join
The XML Join step allows you to add XML tags from one stream into a
leading XML structure from a second stream (similar to Add XML but
more powerful).
XSD Validator
Validates an XML file against an XML Schema Definition (XSD).
DTD Validator
This step provides the ability to validate an XML document against a
Document Type Definition (DTD).
Process large XML files: define Prune path to handle large files
Since the processing logic of some XML files can be very tricky, good
knowledge of the existing Kettle steps is recommended when using this
step. Please see the different samples of this step for illustrations of
its usage.
XML Output
Basic XML Output for simple and flat structures
The usage is easy: besides the filename, you have to define the
root and row XML element names.
XML Output
A result could look like this:
Add XML
Adds XML to a data stream for more complex XML Outputs
This step allows you to encode the content of a number of fields
in a row in XML. This XML is added to the row in the form of a
String field.
Add XML
Enter the field names and optionally the Element name or if this
should be included as an attribute.
Add XML
A use case is to build more complex (nested) XML structures.
Here is a basic example:
XML Join
XML Join allows you to add XML tags from one stream into a leading XML
structure from a second stream.
Together with the Add XML step, XML Join is used for building more
complex XML files. It replaces the stream join of fields and
simplifies the creation.
Referencing Variables
Whenever you see this icon, you can use variables:
Press Ctrl-Space to see a list of available variables.
Referencing Variables
If you want to test your transformation at design time, make sure you
set the variable for test purposes. (Edit / Set environment variables)
Spoon automatically detects variables that are referenced but not set
and lists them here.
Referencing Variables
Variables are very useful in:
Flexible file processing:
Flexible table processing:
Table Input:
Table Output:
And flexible database connections:
Named Parameters
Named parameters are a system that allows you to parameterize your
transformations and jobs. On top of the variables system that was
already in place prior to the introduction in version 3.2, named
parameters offer the setting of a description and a default value. That
allows you in turn to list the required parameters for a job or
transformation.
They can be set in the job or transformation properties.
Named Parameters
When starting a job or transformation you can overwrite the default:
Shared Objects
A variety of objects can be placed in a shared objects file on the local
machine.
The default location for the shared objects file is
$HOME/.kettle/shared.xml.
Objects that can be shared using this method include:
Database connections
Steps
Slave servers
Partition schemas
Cluster schemas
To share one of these objects, simply right-click on the object in the
tree control on the left and choose share.
Shared Objects
This is especially useful for connections, as we do it in this course.
Thus we do not have to enter the information again for new
transformations.
If we want to change one of the connection properties, like the user
name, we can do it once for all transformations.
The same applies when you use slave servers, partition schemas, and
cluster schemas.
Shared Objects
If you want to change the location of the shared objects file, you can do
this in the properties of your transformation.
This is recommended when it should be independent of the user's home
directory.
JNDI
To configure the connection for the use at design time, edit the file:
data-integration\simple-jndi\jdbc.properties
This file can be found in the DI server in:
data-integration-server\pentaho-solutions\system\simple-jndi
Here is a sample connection for a shared connection named
SampleData:
SampleData/type=javax.sql.DataSource
SampleData/driver=org.hsqldb.jdbcDriver
SampleData/url=jdbc:hsqldb:hsql://localhost/sampledata
SampleData/user=pentaho_user
SampleData/password=password
Logging
What is Logging?
Summarized Information about the Job or Transformation execution
Number of records Inserted
Total Elapsed Time spent in a Transformation
Job1
PDI Logging
Transform1
Transform2
Data Warehouse
Staging
Data Marts
Logging Channels
Very detailed information about every channel, e.g. the object type and
name:
Logging Channels
Deep information about the executed transformation filename or
repository details (Object and Revision) and the logging channel hierarchy
(Parent / Root):
This is also usable for lineage analysis when a report is built from this
information
Performance Logging
You need to enable performance Logging and Monitoring:
Performance Logging
You can see the results in the
Execution Results / Performance Graph:
Performance Logging
You can see the results also in the log table for further analysis:
A snapshot was taken every second
Performance Logging
The snapshot contains the processed rows:
Performance Logging
The snapshot also contains the buffer situation.
This is very useful for analyzing bottlenecks: e.g. when the number of
input buffer rows is often higher than the number of output buffer rows
(the ratio), this step takes more time for processing and is most likely
the bottleneck.
The normal preview looks like this (not the sniffing, but similar):
Note: Press the Close button
and not the Stop button, see
next slide....
You can analyze the detailed row before and after the step or at any
other position within one transformation:
The entire transformation will not fail and continue to process your
data.
Note: Not all steps support this error handling feature at this time.
With Always log rows, all rows in this data stream will be logged.
You see CR / LF within the data. If you want to eliminate them, you can
use a JavaScript step.
ETL Patterns
Introduction to Patterns
pattern / noun
1) a form or model proposed for imitation
2) something designed or used as a model for making things <a
dressmaker's pattern>
3) an artistic, musical, literary, or mechanical design or form
Informally
Common Data X
Common Format Y
Identify Candidates
Notice data SIMILARITIES, not
SPECIFICS
Data Characteristics
Similar To...
Processing Characteristics
Similar algorithms
Similar square pegs and round
holes
Housekeeping Characteristics
Similar loading/tracking
techniques
Pattern : Batching
Tag DML with something to identify it as part of one load process
Batch ID: 27, Load Date: 10-Oct-2007
FACT/SCD II records have a column that identifies which batch they
were inserted during
Logical Rollback
Committing every 1000 records is good for performance, but what happens
when you only get part way through a load?
DELETE from FACT where batch_id = (current batch_id) to cleanup
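A hedged sketch of the idea (hypothetical table and column names): every inserted record carries the batch id, so an incomplete load can be rolled back logically:

INSERT INTO fact_sales (batch_id, order_id, amount) VALUES (27, 10101, 199.98);
-- the load for batch 27 failed part way through, so clean it up:
DELETE FROM fact_sales WHERE batch_id = 27;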
Auditing
Get the BATCH_ID from somewhere and set it as a variable.
Use the variable ${BATCH_ID} in all INSERT / UPDATE operations and steps.
Your data ends up having an extra attribute (the batch id) on every record.
Detect changes
TWO inputs
What to compare
Flagfield
Created
Changed
var DAYS_BETWEEN_ORDERS;
var one_day = 1000*60*60*24;
if ( PREV_ORDER_DATE.getDate() != null ) {
  DAYS_BETWEEN_ORDERS = Math.round(Math.abs(orderdate.getDate().getTime() -
    PREV_ORDER_DATE.getDate().getTime()) / one_day);
}
else {
  DAYS_BETWEEN_ORDERS = 0;
}
Auditing Requirements
Track what process made what changes to what data
Easy for downstream ETL
SELECT * FROM ODS_TABLE where UPDATED_TIME >= last time I ran
[Example: the customer record carries audit columns such as customernumber, customername, CREATE_TIME, CREATE_USER and an update flag (Y/N).]
The best choice depends on your data sizes, your transactions and
your database behaviour.
Create temp swap tables that match the structures of the target tables.
When this is combined with delta loading, also copy the original table
to the swap table.
When you have referential integrity / foreign keys (which is not good
practice for a data warehouse), this concept can hardly be used.
Transformation
Post Processing
Pre Processing
Delete all rows with a dirty flag set in all target tables. Records in this
state mean the last job was not finished successfully and the partial data is
deleted.
Transformation
Post Processing
START TRANSACTION;
UPDATE oltp_orders SET dirty_flag=false where dirty_flag is true;
UPDATE oltp_orderdetails SET dirty_flag=false where dirty_flag is true;
COMMIT;
Pre Processing
Get the batch IDs from the logging table where the logging record indicates
an error. Delete all rows in all target tables for these batch IDs. Records
with these batch IDs mean the last job was not finished successfully and
the partial data will be deleted.
Transformation
Post Processing
Pre Processing
No action needed (eventually define loops and chunks of data, see next
slides)
Transformation
Load the table(s) starting from the previously saved keys (delta loading) and
save the last processed record; working in chunks is possible
Post Processing
Add constants
Add the table names (for the key
range).
Insert/Update
Store the last processed key and
batch id for the table.
Note: When the target key is different (e.g. auto generated), both keys
(source & target) would need to be stored.
Oltp_job_status
Get the last processed key.
Use max() to get at least one row.
Orders
Add a where clause: WHERE ordernumber > ?
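Sketched in SQL (column names are hypothetical): first read the last processed key, then feed it into the ? parameter of the Orders query:

SELECT COALESCE(MAX(last_key), 0) AS last_key
FROM oltp_job_status
WHERE table_name = 'orders';
-- Table Input for the delta, ? filled from the row above:
SELECT * FROM orders WHERE ordernumber > ? ORDER BY ordernumber;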
Results of delta loading in chunks (2560 rows in total, 3 chunks: 1000, 1000,
560 records)
Logging statistics of step orders and content of table oltp_job_status of
first, second and third run:
Pattern: Loops
In general loops are allowed in Jobs (in contrast to Transformations
were this is not possible). An Example for a loop is given here: Wait
for a file and process it with two transformations. After the processing
of the file is finished, the job should loop and wait for the next file.
Don't do it this way: This will work in general and for a long while.
But due to design reasons, sooner or later this will lead to a
StackOverflowError, even when the stack size is increased. This can
happen in production after some hours, days or weeks.
Pattern: Loops
Options:
Externalize the loop to the operating system within a shell or batch file.
The errorlevel could be checked also.
Iterations: Stop the job after a certain number of iterations and restart it
in a loop in the operating system (the loop could also be in a shell or batch
file)
Schedule it (e.g. from the DI server) at an interval, but avoid
overlapping runs (e.g. when a job takes longer than the interval)
Use the interval setting in the Start job entry (if this is suitable for the job)
Pattern: Loops
How to stop a job after a certain number of iterations?
Pattern: Loops
How to stop a job after a certain number of iterations?
The JavaScript step decrements a variable and validates if it should
continue:
Pattern: Loops
How to loop to load data in chunks until a maximum number of
iterations?
Pattern: Loops
How to loop to load data in chunks until a maximum number of
iterations?
Pattern: Loops
How to loop to load data in chunks until a maximum number of
iterations?
Pattern: Loops
How to loop to load data in chunks until a maximum number of
iterations?
You can try this with a maximum number of 2 iterations since it needs
3 when the chunk size is set to 1000. In this case it processes 2000
rows in two cycles for the transformation. When the job is run a
second time, it processes the remaining rows (560).
Pattern: Loops
How to avoid overlapping runs of jobs?
You can use a semaphore, e.g. setting a file or writing an entry in a
specific table and check this.
You can check the log entry of a job to execute if this job is still running:
Pattern: Loops
How to avoid overlapping runs of jobs?
Similar as before (loading chunks), but the SQL is different:
Pattern: Loops
How to avoid overlapping runs of jobs?
Similar as before (loading chunks), but the evaluation is different:
Enterprise Repository
Setup
The EE Data Integration Server must be started.
When Spoon starts up you will be prompted to connect to a
repository (or use the menu: Tools / Repository / Connect)
Add a new Enterprise Repository
Setup
Enter an ID (server reference) and name (local reference) for
your repository connection
Security
The Data Integration Server is configured out
of the box to use the Pentaho
default security provider
This has been pre-populated with a set of
sample users and roles including:
Joe - Member of the admin role with full
access and control of content on the Data
Integration Server
Suzy - Member of the CEO role with
permission to read and create content, but
not administer security
Please see the Best Practices for Deleting Users and Roles in the
Pentaho Enterprise Repository in the PDI Administration Guide
Content Management
Demo
Content Management
New repository based on JCR (Content Repository API for Java)
Improved Repository Browser
Enterprise Security
Configurable Authentication including support for LDAP and MSAD
Task Permissions defining what actions a user/role can perform such as
read/execute content, create content and administer security
Granular permissions on individual files and folders
Spoon
Carte
Pentaho BI Server
Scheduling
OR
Files
Db
Script (CRON)
OR Job A
Job B
Spoon
Db
OR
Files
OR
Job A
Job B
= Enterprise Repository
Scheduling options
Before PDI 4.0:
Via the operating system (e.g. CRON jobs, task scheduler)
Via the BI Suite scheduler (via xActions)
Via the Start-Job-Entry
It was complicated to schedule remote jobs (with Carte)
Spoon
Carte
Scheduling
BEFORE 4.0
OR
Script (CRON)
Files
Db
OR Job A
Spoon
Job B
AFTER 4.0
Scheduling
Db
OR
Files
OR
Job A
Job B
Scheduling - Demo
Scheduling - Demo
Pentaho's Agile BI
Pentaho's Agile BI initiative seeks to break down the barriers to
expanding your use of Business Intelligence through an iterative
approach to scoping, prototyping, and building complete BI
solutions.
Agile BI
Business users can operate with or
without IT resources
Faster time to value
Conventional BI apps can be built and
deployed rapidly within a single design
environment
Cloud or on-premise
Agile BI Phases
Individual / Departmental
Agile Exploration
Agile Data Transformation
Solution Prototyping
Institutional
Infrastructure Design
Dimensional Modeling
Iterative Solution Development
Operational Deployment
Pentaho's Agile BI
In support of the Agile BI methodology, the Spoon design
environment provides an integrated design environment for
performing all tasks related to building a BI solution including ETL,
reporting and OLAP metadata modeling and end user visualization.
Business users will be able to:
start interacting with data,
build reports with zero knowledge of SQL or MDX,
and work hand in hand with solution architects to refine the solution.
A Data Transformation becomes an Analysis Model.
UK is missing the territory in this example. This can be corrected very
fast to EMEA in the transformation.
Limitations
You need to use a Table Output step for the Visualization
Limitations of the 4.0 release: The table needs to have all fields
that you want to analyze. At this time there is no support to join
other tables (snowflake or star schema).
Agile BI - Models
Agile BI - Models
Create and change Models (like a Mondrian Schema for Analysis or
Ad-Hoc Reporting) on the fly
Integrated in the
database dialog
Types of Fields
The following types of fields are available:
Text Fields (Names, Types, Categories, etc.): Product Name is an
example of a Text field.
Time Period Fields: Fiscal Year and Order Month are examples of
Time Period fields.
Number Fields: These types of fields are designed for summing,
dividing, creating averages, etc.
Fields are color-coded by type in both the report and the Available
Fields pane.
Text Fields and Time Period Fields: Orange
Number Fields: Blue
The Result
Using Filters
Pentaho Analyzer offers the following options for filtering reports:
Filtering Text Fields: Text fields contain non-numeric information, so
you can choose to include or exclude certain values at will. Time
Periods, Names, Types, and Categories are examples of text field
groups; Product Line is an example of a specific text field.
Types of Filters
Pentaho Analyzer offers the following options for filtering reports:
Filtering Text Fields: Text fields contain non-numeric information, so
you can choose to include or exclude certain values at will.
Selecting from a list of values. Pentaho will display a list of values,
and you choose to include or exclude certain values
Match part of a string. You type in part of the name (string) that the
name Contains or Does not Contain
Filtering Number Fields: Number fields include numeric information.
Greater/Less Than...
Top 10, etc...
You can have only one numeric filter on a report at any given time.
When the report is generated, the numeric filter is applied after
other filters are applied.
Note: The various forms of totals (max, min, etc) will only display if
your report is set to display totals
Please see:
http://www.pentaho.com/what_is_agile/
http://www.pentaho.com/agile_bi/
Logical
Physical
Constraints
Foreign Key Constraints can kill performance
DML require a lookup to ensure consistency
Usually not a big deal for 10s and 100s of DML statements / sec
With batching we hope for 1000s and 10000s of DML statements / sec
Most efficient to do INSERTs and check consistency once in a batch
system
Pre Processing
Post Processing
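A hedged sketch of the pre and post processing (hypothetical names; exact syntax differs per database): drop the foreign key before the bulk load and re-create it afterwards, so consistency is checked once for the whole batch:

-- pre processing
ALTER TABLE fact_sales DROP CONSTRAINT fk_fact_sales_customer;
-- post processing
ALTER TABLE fact_sales
  ADD CONSTRAINT fk_fact_sales_customer
  FOREIGN KEY (customer_id) REFERENCES dim_customer (customer_id);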
Indexes
Rebuilding indexes while you're doing DML might be wasting resources
Why do the work 1000x times when you can do it once?
Many IDXes will need to be rebuilt entirely if enough DML has occurred
Bitmap Index is popular index for BI
Pre Processing
Post Processing
Create Indexes
STEPS TO USE:
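The PDI steps and job entries to use are shown in Spoon. Purely as an illustration, the SQL behind this pre-/post-processing might look as follows (hypothetical table and index names; syntax varies by database):

-- Pre processing: drop the index before the batch load
DROP INDEX idx_fact_sales_customer;

-- ... PDI batch load runs here ...

-- Post processing: build the index once, after all rows are loaded
CREATE INDEX idx_fact_sales_customer ON fact_sales (customer_id);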
Statistics
Statistics help databases create effective plans for queries
If they're off, you'll have poorly performing queries
Some databases have thresholds that automatically trigger a statistics collection
You DO want updated statistics at the end of your load
You DO NOT want statistics updates to occur throughout your load
Pre Processing
Post Processing
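Again only as an illustration, a post-processing sketch for refreshing statistics once after the load. The table and schema names are hypothetical and the statement differs per database:

-- Post processing: refresh optimizer statistics once, after the load has finished
ANALYZE fact_sales;                        -- PostgreSQL
ANALYZE TABLE fact_sales;                  -- MySQL
EXEC DBMS_STATS.GATHER_TABLE_STATS(ownname => 'DW', tabname => 'FACT_SALES');  -- Oracle (SQL*Plus)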
Summary Tables
For this example, OLAP AGGREGATEs and SUMMARIES are the same thing
Summary tables can improve performance and ease of use of reporting
tools
Pre Processing
Post Processing
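A hedged post-processing sketch for (re)building a summary table once per load, assuming a hypothetical fact_sales table and PostgreSQL-style SQL:

-- Post processing: rebuild the monthly sales summary from the fact table
DROP TABLE IF EXISTS sum_sales_by_month;
CREATE TABLE sum_sales_by_month AS
SELECT customer_id,
       DATE_TRUNC('month', order_date) AS order_month,
       SUM(amount)                     AS total_amount,
       COUNT(*)                        AS nr_orders
FROM   fact_sales
GROUP  BY customer_id, DATE_TRUNC('month', order_date);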
Post Processing
Clean up Staging
Your staging area (database, files) contains scratch data that is unnecessary after the DW has been loaded:
Temporary tables
Files from source systems that are no longer needed
Pre Processing
None
Post Processing
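As a sketch only (hypothetical staging and temporary table names), the database side of the clean-up; file clean-up is typically handled in the PDI job itself:

-- Post processing: clean out scratch data once the DW load has succeeded
TRUNCATE TABLE stg_orders;            -- empty the staging table, keep its structure
DROP TABLE IF EXISTS tmp_order_keys;  -- the temporary helper table is no longer needed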
ETL patterns
Most pre- and post-processing is needed in common situations, e.g. transactions
Please see also the chapter on ETL patterns
Tuning Strategy
[Diagram: the tuning cycle — start here: instrument and identify tuning candidates; tune individual transforms, jobs, and the database; measure and monitor improvement; repeat]
Scalability
[Chart: elapsed hours (0-40) versus number of records (100k-600k), comparing a good and a bad scaling curve]
Database Logging
Covered in a different module
Provides the data needed for the tuning strategy and planning
SQL Query (pseudo)
select
  TRANSNAME,
  RECORDS,
  ELAPSED_TIME (Start - End)
from PDI_TRANSFORM_LOG
[Result table: one row per run of load_csv_data with its record count and elapsed time]
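A possible concrete version of this pseudo query, assuming the default Kettle transformation log table columns (TRANSNAME, LINES_OUTPUT, STATUS, STARTDATE, ENDDATE; check your own log table definition) and the hypothetical table name PDI_TRANSFORM_LOG:

SELECT TRANSNAME,
       LINES_OUTPUT         AS records,
       ENDDATE - STARTDATE  AS elapsed_time   -- date arithmetic is database specific
FROM   PDI_TRANSFORM_LOG
WHERE  STATUS = 'end'       -- status values may differ per PDI version
ORDER  BY elapsed_time DESC;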
Approach
1) Identify tuning candidate (previous slides)
2) Review against basic tuning concepts
   1) Make change
   2) Measure
   3) Rinse and repeat
3) Tune SQL and database
   1) Make change
   2) Measure
   3) Rinse and repeat
Sort
Stream Lookup
Sort
Join Rows
Sort (did we mention this one already?)
Cons of DB Sort
Lookups
Row Passing
....
Runtime Data
Tabular data in Spoon that helps a developer understand information
about the transformation as it runs
Information about Number of Records streaming
Time
Records / Second
Status of Steps
Input / Output records on hops
Basics
Basics Example
Speed
Input / Output
Input / Output figure gives information about the
# of Records on the Input Hop
# of Records on the Output Hop
Hops can hold a configurable number of rows (0..N)
Example
Hop2
Step2 output = 0
Step3 input = 0
Look for the furthest downstream step with few records on its OUTPUT
and many records on its INPUT
Informally: look for the first 0 on the OUTPUT side and 10000 on the INPUT side
Backlog
Slow steps cause a backlog
Look at all the steps' Input / Output figures
[Example: the hops feeding the slow step each show 10000 buffered rows]
[Image: Columbia, the (2004) supercomputer built of 20 SGI Altix clusters, a total of 10,240 CPUs. Credit: NASA Ames Research Center/Tom Trower]
Note: When you get an error that the Socket port is already in use, try another
port. This can also happen when an error arises and the port is not closed.
Cx2 means these steps are executed clustered on two slave servers.
All other steps are executed at the master server
To execute the transformation:
You will also see a preloaded Row generator test as a test transformation in every Carte instance.
But you need a Sorted Merge step that does the following in the background, but in a clustered environment:
When you want to use clustered databases, uncheck the Dynamically create the schema option and the database clusters are taken into account.
Remember to check "Running in parallel" (otherwise you get all rows on all clusters).
The file is divided internally into chunks of data for each cluster node to process.
The same principle works for the Fixed file input.
Hadoop
Why Hadoop?
Low cost, reliable scale-out architecture for storing massive
amounts of data
Parallel, distributed computing framework for processing data
Proven success in solving Big Data problems at Fortune 500 companies like
Google, Yahoo!, IBM and GE
Vibrant community, exploding interest, strong commercial
investments
[Diagram: log files, DBs and other sources feed Hadoop; from Hadoop, data flows to data marts for batch reporting and ad hoc query, to interactive analysis, and to Agile BI]
[Diagram: Pentaho Data Integration with Hadoop — the workflow covers Load, Orchestrate, Optimize, Deploy and Visualize across Files / HDFS, Hadoop, Hive and an RDBMS]
Operations Patterns
Pattern: Watchdog
Watchdog:
A watchdog timer is a computer hardware or software timer that
triggers a system reset or other corrective action if the main program,
due to some fault condition, such as a hang, neglects to regularly
service the watchdog (writing a service pulse to it, also referred to
as kicking the dog, petting the dog, feeding the watchdog or
waking the watchdog).
The intention is to bring the system back from the unresponsive state
into normal operation. [...] (see e.g. Wikipedia for more)
Most of the PDI health checks can be accomplished with the
Watchdog concept
Pattern: Watchdog
Watchdog timers for multitasking (e.g. many PDI jobs and cluster nodes):
A software crash might go undetected by conventional watchdog strategies
Success lies in weaving the watchdog into the fabric of all of the system's
tasks, which is much easier than it sounds.
Build a watchdog task
Create a data structure (database table) that has one entry per task
When a task starts it increments its entry in the structure. Tasks that only
start once and stay active forever can increment the appropriate value
each time through their main loops, e.g. every 10,000 rows
As the job or transformation runs the number of counts for each task
advances.
Infrequently but at regular intervals the watchdog runs.
The watchdog scans the structure, checking that the count stored for each
task is reasonable. One that runs often should have a high count; another
which executes infrequently will produce a smaller value.
If the counts are unreasonable, halt and let the watchdog timeout and fire
an event. If everything is OK, set all of the counts to zero and exit.
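As a rough sketch only (using the hypothetical table and column names described on the following slides), the counter increment issued by a task and the reset issued by the watchdog could be plain SQL statements:

-- Inside a task (or its mapping), e.g. once per 10,000 processed rows:
UPDATE op_watchdog
SET    wd_counter  = wd_counter + 1,
       wd_last_run = CURRENT_TIMESTAMP
WHERE  wd_task_id  = 42;   -- hypothetical task id

-- Inside the watchdog, after all counters were checked and found OK:
UPDATE op_watchdog
SET    wd_counter = 0;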
Pattern: Watchdog
An example implementation with PDI:
This is task oriented, not server oriented
This means it will be checked, if the task (a Transformation or Job) is
running as expected independently in what environment (e.g.
clustered or not)
Task 1
Watchdog
Task
Definitions
1..n
Counter
1..n
Task 2
Event
Task n
Pattern: Watchdog
The Watchdog Task Definitions table
Environment variable: ${watchdog_task_table}
Example table name for operations:
op_watchdog_task
Fields & Descriptions:
wd_task_id — Unique Task ID
wd_task_description
wd_task_disabled — 1 = do not check the task (the counters are still incremented by the tasks)
wd_task_min_count
wd_task_max_count — when > 0: check if the counter is below this value after the cycle time
wd_task_cycle_minutes
Pattern: Watchdog
Fields & Descriptions (Watchdog Task Definitions table continued)
wd_task_lenient_count
wd_task_event_type
wd_task_event_details
wd_task_last_reset
wd_task_last_detection_count
Pattern: Watchdog
The Watchdog table for counting
Environment variable: ${watchdog_table}
Example table name for operations: op_watchdog
Fields & Descriptions
wd_task_id
Unique Task ID
wd_hostname
wd_ip_address
wd_slave_server
wd_last_run
wd_counter
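A minimal DDL sketch for these two tables. Column types are not specified in the course material, so the ones below are placeholders (generic ANSI-style SQL):

-- Watchdog task definitions (example name: op_watchdog_task)
CREATE TABLE op_watchdog_task (
    wd_task_id                    INTEGER PRIMARY KEY,
    wd_task_description           VARCHAR(255),
    wd_task_disabled              INTEGER,      -- 1 = do not check this task
    wd_task_min_count             INTEGER,
    wd_task_max_count             INTEGER,
    wd_task_cycle_minutes         INTEGER,
    wd_task_lenient_count         INTEGER,
    wd_task_event_type            VARCHAR(50),
    wd_task_event_details         VARCHAR(255),
    wd_task_last_reset            TIMESTAMP,
    wd_task_last_detection_count  INTEGER
);

-- Watchdog counters (example name: op_watchdog)
CREATE TABLE op_watchdog (
    wd_task_id      INTEGER,
    wd_hostname     VARCHAR(100),
    wd_ip_address   VARCHAR(45),
    wd_slave_server  VARCHAR(100),
    wd_last_run     TIMESTAMP,
    wd_counter      INTEGER
);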
Pattern: Watchdog
In the sample implementation, an event is fired when:
the cycle time is reached AND
  the counter is zero OR
  the counter is below min OR
  the counter is above max
When a lenient count is defined, the watchdog waits to fire an event until the number of detections reaches the lenient count.
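One way to express the detection as a single query (a sketch only, against the hypothetical tables sketched above; in the sample implementation this logic lives in the watchdog_check job and transformation). PostgreSQL-style interval arithmetic is assumed:

-- Tasks for which a detection should be raised
SELECT t.wd_task_id,
       w.wd_counter
FROM   op_watchdog_task t
JOIN   op_watchdog      w ON w.wd_task_id = t.wd_task_id
WHERE  t.wd_task_disabled = 0
  AND  t.wd_task_last_reset < CURRENT_TIMESTAMP - (t.wd_task_cycle_minutes * INTERVAL '1 minute')
  AND  (   COALESCE(w.wd_counter, 0) = 0
        OR (t.wd_task_min_count > 0 AND w.wd_counter < t.wd_task_min_count)
        OR (t.wd_task_max_count > 0 AND w.wd_counter > t.wd_task_max_count) );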
Watchdog Environment Variables
watchdog_task_table The Watchdog Task Definitions table
watchdog_table
watchdog_mode
Pattern: Watchdog
Overview of the sample implementation jobs and transformations:
Create a sample environment
Define a database connection operations_db and share it
Create test tables with job test_watchdog_create_tables
Fill the watchdog task table with samples by transformation
test_fill_watchdog_task
Samples for incrementing the watchdog counters by tasks
When you want to run the watchdog task within a job, have a look at
test_watchdog_job that is calling transformation watchdog_task to increment
the counter
When you want to run the watchdog task within a transformation, have a look
at transformation test_watchdog_task_streaming that is calling a subtransformation (mapping) watchdog_task_streaming. Make sure to define a
threshold at what number of processed rows the counter should be
incremented. This is used to avoid performance problems. It is also possible to
call the watchdog_task_streaming at the end with a blocking step or at the
beginning.
Pattern: Watchdog
Sample for the watchdog to check the counter:
Job watchdog_main should be run in an interval, e.g. every minute
This job is calling other jobs and transformations to implement the logic: job
watchdog_check_wrapper, job watchdog_check and transformation
watchdog_check
When an event is triggered, it is calling the job event
Test run:
When you let watchdog_main run for the first time, you will get the following
in the log entries for all tasks to check:
Result from watchdog_check - Detection: 0 (0=ok, 1=detection, 2=lenient ok)
[last_date is not valid, looks like the first run: initializing]
After the cycle time is reached and no task was run, you will get:
Result from watchdog_check - Detection: 1 (0=ok, 1=detection, 2=lenient ok)
[wd_counter is null or 0]
And the event is fired (in this case, the logging):
PDI Operations Event - Log ERROR: Task 99 exceeded
Feel free to let the tasks run to increment the counters, change some settings
and watch the different results!
You may define thresholds in a different table, check these similarly to the
other operations patterns and fire events accordingly, e.g. when the
available memory goes below 20 percent.
Pattern: Master Controller Failover
Definitions & Status table
fm_description
fm_status_url
fm_user
fm_password
fm_is_controller
fm_is_primary
fm_is_active
fm_failover_order
fm_last_check
1=Online, 0=Offline
fm_last_response_time
fm_last_response_message
fm_last_nr_jobs
fm_last_nr_transformations
fm_controlled_switch_to
fm_controlled_switch_initiated
failover_master_table
operations_instance_id=1
Fields of the log-analysis tables (analyze_log_table / analyze_log_check_table, see below):
al_type
1=Job, 2=Transformation
al_name
al_is_disabled
al_cycle_minutes
al_last_batch_id
This is the last completed batch_id (set to 1 at the beginning). Used to limit the
number of log entries to check.
al_deadlock_minutes
When >0: Check for new log entries after this time
to detect deadlock situations.
al_deadlock_event_type
al_deadlock_event_details
al_min_minutes
al_max_minutes
al_time_event_type
al_time_event_details
al_min_rows
al_max_rows
al_row_event_type
al_row_event_details
al_is_status_check_failed
al_status_event_type
al_status_event_details
al_last_check
al_last_check_batch_id
al_last_channel_id
al_last_status
al_last_log_crc_change
al_is_finished
al_detection
analyze_log_table
analyze_log_check_table
Further Information
More Resources
Kettle project page:
http://kettle.pentaho.com
Enterprise Edition Documentation, Knowledge Base Articles and more
http://kb.pentaho.com/
Community Documentation (WIKI):
http://wiki.pentaho.com/display/EAI/
For up-to-date information, check the forums:
http://forums.pentaho.org/forumdisplay.php?f=69
Bug and Feature Requests with Road Maps (JIRA):
http://jira.pentaho.com
FAQ for Bug and Feature Requests:
http://wiki.pentaho.com/display/EAI/Bug+Reports+and+Feature+Requests+FAQ
More Resources
Community:
http://community.pentaho.com
Pentaho Open Source Business Intelligence Suite - European User Group
http://xing.com/net/pug
Pentaho Open Source Business Intelligence at LinkedIn
http://www.linkedin.com/groups?gid=105573
Other user groups
http://wiki.pentaho.com/display/COM/Pentaho+User+Groups