
XYZ Client

Kenilworth, NJ

Analysis of ETL Tools


System Name: CDW - Scoping v1.0

Prepared By:

Name: ____________________________________________________ Date: ______________


Ken

ANALYSIS OF ETL TOOLS


SL No. Chapter

1.Characteristics of an ETL Tool

2.Requirement by the Project

3.ETL Tool Comparison


Characteristics of an ETL Tool
1.Access data from multiple, operational data sources

2.Re-map source data into a common format

3.Standardize data to enable load to conformed, target databases

4.Filter data, convert codes, perform table lookups, calculate derived values

5.Automated slowly changing dimension support (Type I, Type II, Type III)

6.Incremental aggregation & computation of aggregates by the ETL tool in one pass of the source
data

7.Support for Unicode & multi-byte character sets localized for Japanese and other languages

8.Support graphical job sequencer, re-usable containers, and nesting of sessions

9.Validate data to check content and range of field values

10.Perform procedural data cleansing functions

11.Support complete development environment, including versioning and run-time debugger

12.Load cleansed data to the target data mart or central DW

13.Produce audit and operational reports for each data load

14.Automatic generation of centralized Metadata

15.Automatic generation of data extract programs

16.Native interfaces to legacy files, relational databases, ERP sources (e.g., SAP R/3 and
PeopleSoft), eBusiness applications, Web log files, IBM MQ-Series, XML sources, etc.

17.Support for near real-time clickstream data warehousing

18.Support for an enterprise eBusiness environment, including integration at the metadata level with
BI tools, ERP applications, CRM applications, analytic applications, corporate portals, etc.

19.Platform independence and scalability to enterprise data warehousing applications, directly
executable in-memory, multi-threaded processing for fast and parallel operation

20.No requirement to generate and compile source code

21.No requirement for intermediate disk files

22.Support for concurrent processing of multiple source data streams, without writing procedural code

23.Specification of ETL functions using pre-packaged transformation objects, accessible via an
intuitive graphical user interface

24.Extensible transformation objects at a high level of significance

25.Ability to specify complex transformations using only built-in transformation objects. The goal is to
specify transformations without writing any procedural code

26.Automatic generation of central metadata, including source data definitions, transformation
objects, target data models, and operational statistics

27.Metadata exchange architecture that supports automatic synchronization of central metadata with
local metadata for multiple end-user BI tools

28.Central management of distributed ETL engines and metadata using a central console and a
global metadata repository

29.End-user access to central metadata repository via a right-mouse click

30.Metadata exchange API compliant with COM, UML, and XML

31.Support of metadata standards, including OLE DB for OLAP

32.Ability to schedule ETL sessions on time or the occurrence of a specified event, including support
for command-line scheduling using external scheduling programs

33.Ability to schedule FTP sessions on time or event

34.Integration with data cleansing tools

35.Import of complete data models from external data modeling tools

36.Strong data warehouse administration functions

37.Support for the analysis of transformations that failed to be accepted by the ETL process

38.Extensive reporting of the results of an ETL session, including automatic notification of significant
failures of the ETL process
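
Item 5 above asks for automated slowly changing dimension support. As a rough illustration of what a tool would automate for Type II, the Python sketch below expires the current dimension row and adds a new version when a tracked attribute changes. The `DimRow` layout, the `apply_scd2` helper, and the sample values are hypothetical; they are not taken from any of the tools evaluated in this document.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class DimRow:
    surrogate_key: int
    customer_id: str                  # natural (business) key
    city: str                         # attribute whose history is tracked
    valid_from: date
    valid_to: Optional[date] = None   # None marks the current version


def apply_scd2(dim: List[DimRow], customer_id: str, city: str, load_date: date) -> None:
    """Apply one source record to the dimension using Type II rules."""
    current = next(
        (r for r in dim if r.customer_id == customer_id and r.valid_to is None),
        None,
    )
    if current is not None and current.city == city:
        return  # nothing changed, keep the current version
    if current is not None:
        current.valid_to = load_date  # expire the old version
    next_key = max((r.surrogate_key for r in dim), default=0) + 1
    dim.append(DimRow(next_key, customer_id, city, load_date))


dim: List[DimRow] = []
apply_scd2(dim, "C1", "Kenilworth", date(2003, 1, 1))
apply_scd2(dim, "C1", "Newark", date(2003, 6, 1))   # change: new version
apply_scd2(dim, "C1", "Newark", date(2003, 7, 1))   # no change: ignored
```

A Type I load would overwrite `city` in place instead of adding a row, and a Type III load would keep the previous value in an extra column.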

Requirement by the Project
1.Ease of Use and Maintenance

2.Security Features and Administrative Functions

3.Sources - Oracle Database, Files, XML Files, COBOL Files

4.Targets - Oracle

5.Must be able to use Oracle Functions and specifically Date Functions

6.Must be able to take advantage of Oracle parallel Architecture

7.Must be able to Partition Data for Reading and Writing Purpose

8.Must be able to Support Oracle External Load, i.e. SQL*Loader

9.Able to call Oracle Functions and Procedures

10.Must be able to work with Lookups

11.Must be able to work with Containers

12.Graphical Interface for Design, Development and Implementations

13.Must have a debugger or way to debug the Mapping/Graph/JobStream

14.Version Controlling

15.Load of Failure Data

16.Incremental Load

17.It should be able to work with Unicode multi-byte characters localized for different countries (data
is in different places)

18.Sort, Aggregate, Join Transformations (Look for Inner, Semi Outer, Outer Join)

19.Must be able to work with Oracle BLOB and CLOB data types

20.Meta Data Management - Global Metadata Repository

21.Meta Data can be shared across the enterprise

22.Able to Import Metadata from Erwin or Designer 2000


23.Able to Export Metadata to Cognos

24.Meta Data change Impact analysis

25.Must be able to Log errors

26.Keeping Track of Slowly Changing Dimensions (CDC - Change Data Capture)

27.Must have a Sequence Generator for Primary Key

28.Able to Perform Audit on Load and Failure of Data

29.Able to validate data and check content and range of data

30.Can be scheduled through Shell scripts or Scheduling tools

31.Ability to merge data from more than one source in a single Mapping/Graph/JobStream
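
Requirements 16 (incremental load), 25 (error logging) and 27 (sequence generator for primary keys) interact in a typical load. The Python sketch below shows one possible shape of that logic, assuming each source row carries an ISO-formatted `updated_at` timestamp. The function name, field names, and sample data are hypothetical and only illustrate the requirement, not how any of the evaluated tools implement it.

```python
import itertools
from datetime import datetime


def incremental_load(source_rows, target, last_loaded, key_seq, error_log):
    """Load only rows changed after last_loaded; return the new high-water mark."""
    high_water = last_loaded
    for row in source_rows:
        try:
            changed = datetime.fromisoformat(row["updated_at"])
        except (KeyError, ValueError) as exc:
            error_log.append((row, str(exc)))  # requirement 25: log the error
            continue
        if changed <= last_loaded:
            continue  # already loaded in an earlier run
        # requirement 27: the sequence supplies the primary key
        target.append({"id": next(key_seq), **row})
        high_water = max(high_water, changed)
    return high_water


source = [
    {"name": "a", "updated_at": "2003-01-01T00:00:00"},
    {"name": "b", "updated_at": "2003-02-01T00:00:00"},
    {"name": "c", "updated_at": "not a date"},
]
target, errors = [], []
sequence = itertools.count(1)
mark = incremental_load(source, target, datetime(2003, 1, 15), sequence, errors)
```

The returned high-water mark would be stored and passed in as `last_loaded` on the next run.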

ETL Tool Comparison

Feature | Ascential Software DataStage XE | Informatica PowerCenter | Ab Initio GDE (Graphical Development Environment, Ab Initio Co>Operating System) | Cognos Decision Stream

Development and Maintenance Process Support
Visual Metaphor
  Ascential DataStage XE: Multiple screens for handling development, scheduling, and administrative tasks.
  Informatica PowerCenter: Has different screens for source import, target import, scheduling, and administrative tasks. The mappings become complex when a number of transformations are added, and are easy to handle as the project evolves.
  Ab Initio GDE: The GDE (Graphical Development Environment) is present. It is not very intuitive. It has many components, each represented by a square box. The newer version has a GUI for running the graphs and administrative jobs.
  Cognos Decision Stream: Two environments, one for development and one for job scheduling.

Multiple Sources and Targets
  Ascential DataStage XE: DataStage XE includes an unlimited number of heterogeneous data sources and targets. Multiple, unlimited targets can be of mixed origin, can have multiple destinations, and can receive data through a variety of loading strategies (bulk loader, flat file I/O, direct SQL) in the same job.
  Informatica PowerCenter: Informatica PowerCenter has access to multiple sources and targets. Within one mapping, however, it can access data from heterogeneous sources but cannot write to heterogeneous targets, a limitation that will be taken care of in a newer version.
  Ab Initio GDE: It can read from and write to heterogeneous sources because of its own operating system, which has greater flexibility.
  Cognos Decision Stream: Though it says that it ... heterogeneous sourc... heterogeneous targe... (entry cut off in the original)

Native Mainframe Data Extraction
  Ascential DataStage XE: It has a separate product, DataStage XE/390, that generates COBOL programs that run directly on the mainframe, providing native mainframe extraction, transformation, and loading capabilities.
  Informatica PowerCenter: It has a separate product called Power Connect for mainframe data. It accesses DB2 on the mainframe using DB2 Connect.
  Ab Initio GDE: On mainframes, Ab Initio can read, write, update, and delete rows in DB2 databases and records in VSAM files directly. It can read and write any MVS dataset. Ab Initio reads mainframe IMS data. All data types are supported.
  Cognos Decision Stream: Extraction from the mainframe is by the use of 3rd party software.

Built-in functions and routines
  Ascential DataStage XE: Many built-in functions, plus the use of a script language like Basic or SQL to create your own functions.
  Informatica PowerCenter: Many built-in functions, plus the use of a script language like SQL to create your own functions. About 50 built-in components cover the very common jobs in the data warehouse.
  Ab Initio GDE: Many built-in functions are available to do the different jobs, including Compress, Database, Datasets, Departition, FTP, Miscellaneous, Partition, Sort, Transform, and Validate.
  Cognos Decision Stream: Yes

Advanced transformation support
  Ascential DataStage XE: A script language like Basic is available, so the meta data stays within the tool and advanced transformations can be written.
  Informatica PowerCenter: The script language uses SQL only. External procedures can be called for advanced transformations, but the meta data of those transformations is not integrated.
  Ab Initio GDE: Ab Initio includes a full programming language (called DML) that can express if-then-else, case, cascading (prioritized) rules, looping, and much more.
  Cognos Decision Stream: Can call C-like functions.

Version Control and Configuration Management
  Ascential DataStage XE: DataStage XE includes a component for version control that saves the history of all the data integration development. It preserves application components such as table definitions, transformation rules, and source/target column mappings within a 2-part numbering scheme.
  Informatica PowerCenter: Version control is implemented at the 'folder' level in PowerCenter. It is not great, but it is taken care of in the 6.0 or 7.0 release. Currently, Informatica users use third-party version control applications like PVCS.
  Ab Initio GDE: The built-in version management system allows users to navigate to old versions of an object, navigate to the version of a graph used in a particular run of a job, view the state of the entire repository at a particular point in time, and find the differences between two versions of a graph, data transformation, or data type definition.
  Cognos Decision Stream: Through source control.

Graphical Job Sequencer
  Ascential DataStage XE: Very good tool to visually create the sequence of jobs to be run through the GUI.
  Informatica PowerCenter: The newer version 6.0 has this in PowerCenter. Currently it is done through the session manager.
  Ab Initio GDE: Not available up to the last version. The new version has it, as far as I have come to know.
  Cognos Decision Stream: Do not know


External function support
  Ascential DataStage XE: External functions are easily integrated within DataStage. More importantly, the flexibility and completeness of the tool allow the developer to stay inside the tool using easy-to-use scripting languages like Basic and SQL.
  Informatica PowerCenter: Yes, but limited to C++, Java, PL/SQL, etc. Most jobs and transformations can be achieved with the built-in functions and transformations. There is no way to import meta data from the external function. This will be taken care of in a newer version.
  Ab Initio GDE: Custom components can be implemented in any language, including COBOL, C, C++, Perl, Java, Forte, PL/SQL, etc.
  Cognos Decision Stream: Yes, for C-like functions.
Containers (for reuse of business logic)
  Ascential DataStage XE: For business logic reuse in the form of discrete components, DataStage uses the concept of Containers. Containers are independent, shareable DataStage objects that can be included in many jobs within the same project or across multiple projects. When business logic changes and updates are made to the shared Container, jobs using the shared Container are updated with the new information when they are recompiled. Shared Containers are useful for building up a library of standardized, granular, and repetitive data integration tasks such as test data generation, job recovery, or slowly changing dimensions.
  Informatica PowerCenter: It has the same concept for business logic reuse across the project and mappings, with the help of Mapplets.
  Ab Initio GDE: Ab Initio provides facilities for developing graphs with the connections and use of subgraphs. A subgraph can be stored away in a library, reused as many times as desired, and used across multiple applications.

Validation
  Ascential DataStage XE: DataStage XE permits validation of transforms and user connections. In validation mode, DataStage jobs will fail immediately at run-time, because connections and SQL statements are checked prior to processing any rows.
  Informatica PowerCenter: Through scripts and built-in functions like is_numeric, is_spaces, etc. The connections are validated for the mapping to run. The designer validates the map for any syntactical error or data type error before it is ready to be executed by the scheduler.
  Ab Initio GDE: It has many built-in components (like database transformation) that can be used in graphs to do validations such as Compare Checksum, Validate Records, etc. The GDE will let the developer know of any error in the graph.

Canvas Annotation
  Ascential DataStage XE: Users can "write notes" or add text onto the canvas or screen as they create DataStage jobs. Users can now add comments, labels, or other explanations to the designs.
  Informatica PowerCenter: Every mapping and transformation has 2 types of annotation: a name or object description, and a comment box. The name is displayed when you point the mouse at the transformation.
  Ab Initio GDE: Component-specific or graph-specific annotations are available.

Meta Data Analysis and Management
Source schema change management - Graphical Impact Analysis Across Tools
  Ascential DataStage XE: Developers can come to know all the relationships associated with an object. With this feature, developers can assess the impact of changes across the environment before the change actually occurs.
  Informatica PowerCenter: Yes, through PowerPlug. Informatica's PowerPlug product allows easy comparison between industry-leading modeling tools and the Informatica repository. Differences can be highlighted and meta data imported. (A newer version will have features such as: when you change the source, all the connected fields from the source through to the target are automatically updated.) It has different built-in reports for knowing the source-to-target column relationships, the sources and targets used in a mapping, and the mapplets used in the mapping.
  Ab Initio GDE: The Repository keeps track of dependencies and the relationships among objects stored in it, so you can assess the impact of any change. For example, it is possible to determine the possible impact of program changes by locating the parts of an application that depend on an object being changed.
  Cognos Decision Stream: No

Global meta data repository
  Ascential DataStage XE: Meta data browsing and exploring, analysis and usage functions for impact analysis, data lineage, and the publication and subscription of meta data through MetaStage Explorer.
  Informatica PowerCenter: Browsing of meta data can be done through the client tools. In addition, there is a web-based meta data reporter distributed with PowerCenter.
  Ab Initio GDE: The Browser Interface provides users access to the Repository through a standard web browser.
  Cognos Decision Stream: Yes

Publication of meta data directory
  Ascential DataStage XE: Meta data information can be published in XML and/or HTML format, complete with hyperlinks for easy end-user access and navigation.
  Informatica PowerCenter: Through Webzine, the web-based Meta Data Reporter. It is a web-based meta data reporting tool to facilitate publication of meta data.
  Ab Initio GDE: Not very clear. But the specified user can show all or any of their properties, such as business area, subject area, steward group, database, modified by, created, and last modified date.
  Cognos Decision Stream: Do not know

Meta data reuse
  Ascential DataStage XE: DataStage XE uses a publish and subscribe mechanism to distribute standard meta data from a variety of sources. Other users can subscribe to meta data publications on a one-time or recurring basis. When meta data changes, subscribers are automatically notified.
  Informatica PowerCenter: PowerCenter collects (loads) and distributes (unloads) meta data between tools using bridges, i.e. import from Oracle Designer, Erwin, or PowerDesigner, or export to BI tools such as BO, MicroStrategy, and Impromptu. This is one-directional. There is another tool, called Metadata Exchange SDK, which is used for bi-directional exchange of meta data between designer tools or OLAP tools and PowerCenter. Moreover, meta data can be exported and imported through XML files.
  Ab Initio GDE: It has a way of integrating with designer tools and OLAP tools. It is not clear how the reuse of meta data is achieved.
  Cognos Decision Stream: Yes

Data Lineage - Integration of design and event meta data
  Ascential DataStage XE: A way to know from which sources the data has been populated to the target when you have more than one source to extract in a graph.
  Informatica PowerCenter: Possible through scripts.
  Ab Initio GDE: Possible through scripts.
  Cognos Decision Stream: No

Integration with E/R modeling tools
  Ascential DataStage XE: Meta data sharing capabilities with the industry's leading E/R modeling tools, including Erwin, Designer 2000, PowerDesigner, and E/R Studio.
  Informatica PowerCenter: Meta data sharing capabilities with the industry's leading E/R modeling tools, including Erwin, Designer 2000, PowerDesigner, etc., with the help of PowerPlug and Metadata Exchange SDK.
  Ab Initio GDE: Ab Initio has an ERwin unloader, which will look at an ERwin diagram, create dataset and column objects in the Ab Initio Repository, and annotate them with comments and descriptive information. But whether it can be integrated with Erwin or D2K remains an unanswered question. Even if it can be integrated, how can it be done?
  Cognos Decision Stream: Yes

Integration with Business Intelligence Tools
  Ascential DataStage XE: With Business Objects, Brio, Cognos Impromptu, MicroStrategy.
  Informatica PowerCenter: Through Power Bridges it has integration with Brio, Cognos, MicroStrategy, Business Objects.
  Ab Initio GDE: Not very clear. How is this done?
  Cognos Decision Stream: Cognos PowerPlay, through Architect.

Performance
Parallelism
  Ascential DataStage XE: It has advanced parallel processing capabilities which enable it to effectively and efficiently use the capabilities of SMP, SMP clusters, and MPP platforms. In addition, DataStage XE efficiently leverages database parallelism (e.g. DB2 EEE, Oracle Enterprise Edition with parallel tables).
  Informatica PowerCenter: The PowerCenter architecture limits the product's ability to perform parallel processing. While PowerCenter can partition a data set and run separate processes, the newer version has greater capabilities for Sort, Join, and Aggregate transformations. It is faster than previous versions. Pipeline bulk loading of data, memory caching.
  Ab Initio GDE: Because of its parallel architecture (aka MPP) and the concept of partitioning (Component, Pipe, and Data), it has been able to handle very high volumes of data. It can do parallel and pipeline loading.
  Cognos Decision Stream: Hashing techniques for Sort/Join and Aggregate.

Named Pipe Support
  Ascential DataStage XE: Ability to break a large job into smaller jobs, which can then communicate with each other, making the whole process run faster.
  Informatica PowerCenter: The Sort and Aggregate transformations have the same facility.
  Ab Initio GDE: When the graphs are running on the same server, they communicate through named pipes or shared memory; when they are on different servers, they communicate through the network. All of this is determined by the Ab Initio operating system.
  Cognos Decision Stream: Do not know
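
The data partitioning mentioned under Parallelism can be pictured with a small Python sketch: rows are hash-partitioned on a key so that equal keys always land in the same partition, each partition is aggregated independently, and the partial results are merged. This is only an illustration under assumed names (`partition_by_key`, `sum_by_key`, `parallel_sum`) and made-up sample data. The threads stand in for the separate processes or servers a real ETL engine would use, and nothing here reflects the internals of any evaluated product.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def partition_by_key(rows, key, n_partitions):
    """Hash-partition rows so equal keys always land in the same partition."""
    partitions = [[] for _ in range(n_partitions)]
    for row in rows:
        index = zlib.crc32(str(row[key]).encode("utf-8")) % n_partitions
        partitions[index].append(row)
    return partitions


def sum_by_key(rows, key, value):
    """Aggregate one partition on its own."""
    totals = {}
    for row in rows:
        totals[row[key]] = totals.get(row[key], 0) + row[value]
    return totals


def parallel_sum(rows, key, value, n_partitions=4):
    """Aggregate every partition concurrently, then merge the partial results."""
    partitions = partition_by_key(rows, key, n_partitions)
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partials = list(pool.map(lambda part: sum_by_key(part, key, value), partitions))
    merged = {}
    for partial in partials:
        merged.update(partial)  # a key lives in exactly one partition, so no overlap
    return merged


sales = [
    {"store": "A", "amount": 10},
    {"store": "B", "amount": 5},
    {"store": "A", "amount": 7},
]
totals = parallel_sum(sales, "store", "amount")
```

Partitioning on the aggregation key is what makes the final merge trivial; partitioning on any other field would require a second aggregation pass over the partial results.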

Shared In-Memory Hash Tables
  Ascential DataStage XE: Allows multiple "instances" of a job, or even multiple disparate jobs, running in parallel, to share the same in-memory image of a hash table and thus make better use of machine resources.
  Informatica PowerCenter: Yes. For Lookup, Joiner, Aggregate, and Sort transformations this technique is used to boost performance. Dynamic lookup uses this technique too.
  Ab Initio GDE: There are in-memory versions of the Join, Sort, and Lookup components. For dynamic data lookups, Ab Initio includes several mechanisms. A Lookup File represents one or multiple serial files, or a parallel flat file, of data records small enough to be held in main memory, letting a transform function retrieve records much more quickly than it could if they were stored on disk.
  Cognos Decision Stream: Yes
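
The in-memory lookup idea described in this row can be sketched in a few lines of Python: a small reference data set is loaded once into a hash table (a `dict`), and each incoming record is enriched with a constant-time probe instead of a disk read. The helper names and sample data are hypothetical and serve only as an illustration of the technique.

```python
def build_lookup(reference_rows, key):
    """Load a small reference data set into an in-memory hash table (dict)."""
    return {row[key]: row for row in reference_rows}


def enrich(records, lookup, key, field, default=None):
    """Attach `field` from the lookup table to each record; misses get `default`."""
    for record in records:
        match = lookup.get(record[key])
        yield {**record, field: match[field] if match is not None else default}


stores = [
    {"store_id": 1, "region": "East"},
    {"store_id": 2, "region": "West"},
]
sales = [
    {"store_id": 1, "amount": 10},
    {"store_id": 3, "amount": 4},   # no matching store
]
store_lookup = build_lookup(stores, "store_id")
enriched = list(enrich(sales, store_lookup, "store_id", "region", default="UNKNOWN"))
```

The approach only pays off while the reference data fits comfortably in main memory, which matches the "small enough to be held in main memory" condition stated above.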

Platform Support

Hardware support
  Ascential DataStage XE: Unix, NT, and mainframe platforms.
  Informatica PowerCenter: Unix and NT platforms only.
  Ab Initio GDE: Unix, NT, and Windows platforms.

Extensibility
  Ascential DataStage XE: Central to DataStage's architecture is the DataStage Plug-in API. This allows Ascential Software engineering, VARs, and end-users alike to code their own interfaces to external databases, processes, and bulk loaders.
  Informatica PowerCenter: Not aware of any such thing.
  Ab Initio GDE: With a simple mechanism for 'wrapping' user programs, the Co>Operating System delivers the ultimate in extensibility.

Integrated Bulk Loader
  Ascential DataStage XE: Bundled within the software and can be invoked within the scripts.
  Informatica PowerCenter: Bundled within the software and can be invoked within the scripts for Oracle, Sybase, and SQL Server.
  Ab Initio GDE: Has features to load and unload bulk data through the Load and Unload components.

Data Quality and Cleansing
Ability to assess current data quality
  Ascential DataStage XE: The development teams and business users have the ability to audit, monitor, and certify the quality of data at key points throughout the data integration lifecycle.
  Informatica PowerCenter: Data cleansing can be achieved by building the rules within the logic. Rows can be flagged as a problem or routed to a problem file/table.
  Ab Initio GDE: This can be achieved through the Data Profiler product, which sits on top of the Ab Initio Co>Operating System. It analyzes the graph's data and stores the results in the metadata repository, which the developer can then use to look at the quality of the data. Moreover, it has built-in components like checksum and compare records, which can be incorporated into graphs to assess the data quality.

Integrate 3rd party data cleansing tools
  Ascential DataStage XE: With Trillium and First Logic (how easy it is, I do not know, as I have not tried it out).
  Informatica PowerCenter: With Trillium and First Logic (how easy it is, I do not know, as I have not tried it out).
  Ab Initio GDE: With Trillium and First Logic (how easy it is, I do not know, as I have not tried it out).
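
Building cleansing rules into the load logic, with rows flagged as a problem or routed to a problem file/table, could look roughly like the Python sketch below. The rules (a non-blank `customer_id`, a numeric `amount` within an assumed range), the function names, and the sample rows are all hypothetical and chosen only to illustrate content and range checks.

```python
def validate(row):
    """Return a list of rule violations for one row (an empty list means clean)."""
    problems = []
    if not row.get("customer_id", "").strip():
        problems.append("customer_id is blank")
    try:
        amount = float(row.get("amount", ""))
    except ValueError:
        problems.append("amount is not numeric")
    else:
        if not 0 <= amount <= 1_000_000:  # assumed valid range
            problems.append("amount out of range")
    return problems


def route(rows):
    """Split rows into clean rows and rejects annotated with their problems."""
    clean, rejects = [], []
    for row in rows:
        problems = validate(row)
        if problems:
            rejects.append({**row, "problems": problems})
        else:
            clean.append(row)
    return clean, rejects


rows = [
    {"customer_id": "C1", "amount": "120.50"},
    {"customer_id": "  ", "amount": "abc"},
    {"customer_id": "C3", "amount": "-5"},
]
clean, rejects = route(rows)
```

The `rejects` list plays the role of the problem file/table: it keeps the original row together with the reasons it failed, so it can be audited or reloaded after correction.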

Product
  Ascential DataStage XE: Component-wise product; buy what you use. But DataStage XE comes with all the features that we require, like communicating with designer tools and OLAP tools.
  Informatica PowerCenter: Component-wise product; buy what you use (it now has its own OLAP tools). You have to buy the products for transferring metadata from designer tools and transferring metadata to OLAP tools.
  Ab Initio GDE: One product, with add-on products for transferring metadata (but it comes with access to mainframe and other sources).

Cost
  Ascential DataStage XE: Cost can be spread because of its component nature. Per CPU.
  Informatica PowerCenter: Cost can be spread because of its component nature. It is based on per server, per repository (we can have many servers pointing to the repository) and also depends on the number of sources and targets. The licensing is done by number of CPUs, like 4 CPUs or 6 CPUs, etc.
  Ab Initio GDE: One-time cost. Per CPU and per developer.

Ease of Use and Maintenance
  Ascential DataStage XE: It has improved a lot over time (had a demo of the tool over the web).
  Informatica PowerCenter: Very easy to use and maintain (own experience).
  Ab Initio GDE: Users say the new GUI is good, but I have no experience.

New Releases and Patches (how much cost is involved)
  Ascential DataStage XE: 18% to 22% of the purchase price every year.
  Informatica PowerCenter: 20% of the purchase price every year.
  Ab Initio GDE: Do not know

Customer Support and Services (can we have a dedicated person for our questions to be solved?)
  Ascential DataStage XE: Do not know
  Informatica PowerCenter: Do not know
  Ab Initio GDE: Do not know

Case Studies and Access to Site under Production Environment using the ETL tool (can we have the client number and visit their site using the ETL tool for their warehousing?)
  Ascential DataStage XE: Yes
  Informatica PowerCenter: Yes
  Ab Initio GDE: Yes

Availability of Manpower (Price & Consultancy)
  Ascential DataStage XE: Good
  Informatica PowerCenter: Very good (low price and high quality)