

THE GREEN FIELD, MARCH 2010





March 2010 - Edition 1

TABLE OF CONTENTS

From the Editor
The Saga of ETL
Traversing the open route
- Talend - The Vanguard
- Kettle - Transform Data into Profits
- Apatar - Bringing Business Closer to IT
- CloverETL - ETL made Easy




Time for a walk in the clouds
ROI on Data Integration with Informatica
Datastage - The solution to enterprise data integration
Ab Initio - A new beginning
Redefine Data Quality with ODI
SSIS - Microsoft's Bet in the ETL Market


Banita Rout, a BI Team member, is a key contributor to this newsletter. You can read more of her articles:
Talend - The Vanguard, pg. 4
DataStage - The Solution to Enterprise Data Integration, pg. 9

Merging Horizons

ETL in the times to come

From the Editor's Desk

by Sweta Gupta

Business Intelligence is the buzzword that tops the list of priorities for CIOs. The main challenge of Business Intelligence is to gather and serve organized information on all the relevant factors that drive the business, to let end users access that knowledge easily and efficiently, and in effect to maximize the success of the organization. As competition in the market gets fiercer, choosing the correct BI solution becomes more important.

With this growing popularity of Business Intelligence, we at Reverside BI Labs focus on exploring the GREEN areas, with the idea of forming our own independent opinion about the leading BI strategies, tools and vendors. A team of IT professionals is dedicated to researching and surveying the BI market and to building capacity. We dig into the propositions that the leading tools and vendors promise, so that we can showcase the available alternatives and help our valued clients find the best-suited solution.

We are pleased to bring you GREEN FIELD, our BI newsletter, as a space for showcasing the research and views of the BI Labs and sharing them with all members of the organization.

In this first issue, we focus on one of the most important areas that Business Intelligence (BI) covers from a technical standpoint: ETL, or Extract, Transform and Load. We have tried to present a unified view of the past, the present and the future of ETL, an integral part of BI. We discuss some leading proprietary ETL tools such as Oracle Data Integrator, Ab Initio and Datastage, as well as open source tools such as KETTLE, Talend, CloverETL and Apatar. The comparison of these tools is based on statistics and findings by the BI Lab team. So the next time you are looking for what's new in the BI market, ASK US!

Thanks to the BI Lab team members for their contributions. We hope that you like this edition. Please do share your comments to help us make GREEN FIELD better.





The Saga of ETL

by Sweta Gupta

These have been the causes of nightmares for organizations both big and small. So how do we prepare to face these challenges and make sure that they do not haunt us again?

In this era of high-end technology and information, anything that an organization buys or uses becomes a source of a huge amount of data. We are flooded with information from numerous sources, which makes it harder to harness and to draw substantial conclusions from. There is too much data but very little insight. The absence of a single version of the truth causes chaos. Business decisions are driven by the availability of the right information at the right time. For example, Gross Margin can be defined differently by Finance and Marketing, which influences how and what numbers are reported.

Data warehousing emerged as the savior for organizations by harnessing information and translating it into profits. With growth in mind, it has been on the minds of organizations across all domains. As the keeper of highly refined and detailed information, data warehouses form the base and core of the strategic decisions taken by the business. This is thanks to ETL (an acronym for Extract, Transform and Load), which sieves out only the required, fine-grained data from the transactional systems and routes it to the data warehouse.

As an acronym, however, ETL only tells part of the story. ETL tools also commonly move or transport data between sources and targets, document how data elements change as they move between source and target (i.e., metadata), exchange this metadata with other applications as needed, and administer all run-time processes and operations (e.g., scheduling, error management, audit logs, and statistics).
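To make the three letters concrete, here is a minimal sketch of a hand-written ETL run in Python. Everything in it is an assumption made for illustration: the `SOURCE_CSV` sample, the `fact_orders` table and the function names are hypothetical, and an in-memory SQLite database stands in for the warehouse. It also returns simple run statistics, one of the run-time duties mentioned above.

```python
import csv
import io
import sqlite3

# Hypothetical export from a transactional system (illustrative data only).
SOURCE_CSV = """order_id,customer,amount,status
1,alice,120.50,SHIPPED
2,bob,,CANCELLED
3,carol,75.00,SHIPPED
4,dave,abc,SHIPPED
"""


def extract(text):
    """Extract: read the raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: keep shipped orders with a valid amount, set aside bad rows."""
    clean, rejected = [], []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            rejected.append(row)  # error management: keep bad rows for review
            continue
        if row["status"] == "SHIPPED":
            clean.append((int(row["order_id"]), row["customer"], amount))
    return clean, rejected


def load(conn, rows):
    """Load: write the cleaned rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(conn, text):
    """Run the three stages and return simple run statistics."""
    rows = extract(text)
    clean, rejected = transform(rows)
    load(conn, clean)
    return {"read": len(rows), "loaded": len(clean), "rejected": len(rejected)}
```

Calling `run_etl(sqlite3.connect(":memory:"), SOURCE_CSV)` loads the two valid shipped orders and reports the two rows whose amount could not be parsed as rejected.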

With ETL, STOP crunching numbers and START crushing them!

Since its introduction, ETL has continuously improved and evolved to help users take better and more informed decisions. In the early nineties, the ETL process was hand coded. Developers used a combination of languages such as Shell, SAS, Perl and database languages to write custom code for the ETL task. But these hand-written ETL programs had several drawbacks:

1. They were lengthy and hard to document.
2. Hand coding of ETL required the metadata tables to be maintained separately.
3. Any new change required manual changes to the metadata tables.
4. Being single threaded, these programs had a slower rate of execution.

In the mid-nineties, vendors recognized the opportunity and started shipping ETL tools that would lessen the arduous task of writing programs for ETL. And thus Code Generation ETL tools came into the market. These tools provided a graphical user interface which would generate the code for the ETL process.

However, this did not succeed in the long run. The reasons?

1. The tools produced code in third-generation languages like COBOL, so the code was difficult to maintain as it required extensive knowledge of the specific language.
2. They did not automate the run-time environment.
3. Often, administrators had to manually distribute and manage compiled code, schedule and run jobs, or copy and transport files.

All these were reasons enough for vendors to bring Engine-based ETL tools to the market. These products, launched in the mid to late nineties, employed proprietary scripting languages running within an ETL or DBMS server. Developers use the graphical interface to design the ETL workflows, which are stored in a metadata repository. The ETL engine, which typically sits on a Windows or UNIX machine, connects directly to a relational data source and reads the repository at run time to determine how to process the incoming data. It can also connect to non-relational data sources through third-party gateways or by creating a flat file. It is also possible to process ETL workflows in parallel across multiple processors.

Although the engine-based approach unifies the design and run-time environments, it does not necessarily eliminate all custom coding and maintenance. Users still needed to write code for complex requirements, which made maintenance difficult. The volume of data kept increasing exponentially over time and the parameters for measuring performance grew more complex. The vendors soon realized the weight of the T in ETL.

Bulk processing had to be adopted to meet the challenge. The only way to achieve it was to move the transformation overhead from the ETL engines to the source and target databases. With the transformations being in-database, data could flow from source to target and then be transformed by the target database. This eliminates row-by-row processing, thereby improving the efficiency and performance of the ETL process.

Today, several DBMS vendors embed ETL capabilities in their DBMS products (as well as OLAP and data mining capabilities). Since these Database-Centric ETL vendors offer ETL capabilities at little or no extra charge, organizations are seriously exploring this option because it promises to reduce costs and simplify their BI environments.

So it's not strange to hear users asking, "Why should we purchase a third-party ETL tool when we can get ETL capabilities for free from our database vendor of choice? What would be the additional benefits of buying a third-party ETL tool?"

Is it that ETL tools are used only with warehouses?

The answer is NO. ETL is used not only with data warehouses but also for moving data among (web-based) applications, customer data integration and database consolidation. Non-data-warehouse usage is growing rapidly and already accounts for more than 40% of total ETL industry usage. User practices for ETL continue to evolve to keep pace with new requirements for data integration. The result is a growing market and innovative user practices.

Market growth proves ETL is here to stay.

Hey, wait! This is not the end of the story. There is a lot more in store for you. Read more about ETL in the times to come...
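The difference between row-by-row processing in an ETL engine and an in-database transformation can be sketched in a few lines. This is only an illustration under stated assumptions: the `src_sales` and `tgt_sales` tables are hypothetical, and SQLite stands in for the source and target databases. Both functions produce the same totals, but the second hands the whole transformation to the database as one set-based statement.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a hypothetical source and target table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE src_sales (id INTEGER, region TEXT, amount REAL)")
    conn.execute("CREATE TABLE tgt_sales (region TEXT, total REAL)")
    conn.executemany(
        "INSERT INTO src_sales VALUES (?, ?, ?)",
        [(1, "north", 10.0), (2, "south", 5.0), (3, "north", 2.5)],
    )
    return conn


def engine_style(conn):
    """Row-by-row: pull every row out of the database and aggregate in the ETL process."""
    totals = {}
    for _id, region, amount in conn.execute("SELECT id, region, amount FROM src_sales"):
        totals[region] = totals.get(region, 0.0) + amount
    conn.executemany("INSERT INTO tgt_sales VALUES (?, ?)", sorted(totals.items()))


def in_database_style(conn):
    """Set-based: a single SQL statement, so the database does the transformation."""
    conn.execute(
        "INSERT INTO tgt_sales "
        "SELECT region, SUM(amount) FROM src_sales GROUP BY region"
    )
```

With three rows the difference is invisible. With millions of rows, the first version moves every row across the connection while the second moves none, which is the efficiency argument made above.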


The Vanguard
by Banita Rout

There is nothing new about the fact that organizations' information systems tend to grow in complexity. The reasons for this include the trend of stacking up layers and the fact that information systems need to be more and more connected to those of vendors, partners and customers. Another reason is the multiplication of data storage formats, protocols and database technologies. So how do we manage a proper integration of this data scattered throughout the company's information systems? Two functions lie behind the data integration principle: business intelligence (or analytics) integration and operational integration.

By 2000, data warehousing had begun to emerge as a concept that was applicable to companies of all sizes, from small and medium sized to large, but there was a gap in the marketplace. While there certainly was a need for ETL, the deal size was so large that midsize and smaller companies could not afford the ETL software.

It was then that a new type of ETL was introduced: ETL for the midsize marketplace. And Talend emerged as the numero uno in this space. Both ETL for analytics and ETL for operational integration needs are addressed by Talend Open Studio. Talend has a functional ETL tool set, but at open systems prices, which makes it affordable for the midsize world. Talend offers its basic kernel for free, and it can be downloaded from the Internet. Other features and services sit on top of the basic kernel.

Talend fits into the marketplace with good functionality at a price significantly below that of any other competitor. This is indeed good news for midsize companies who need ETL but do not need the price tag of a full-blown ETL package offered to and used by much larger companies.

Talend hits all the highlights one would look for in traditional integration platforms: batch delivery, transforms, ETL, data governance, and a strong set of connectivity adapters.

At the same time, it keeps pace with important trends through features such as change data capture, metadata support, federated views, and SOA-based access to data services. Talend is capable of scaling from small departmental file migrations to large-scale enterprise warehousing projects.

Talend Open Studio operates as a code generator, allowing data transformation scripts and underlying programs to be generated in either Java or Perl. Its GUI is made up of a metadata repository and a graphical designer. Talend Open Studio is a metadata-driven solution: all metadata is stored and managed in the repository shared by all the modules. Jobs are designed using graphical components for transformation, connectivity, or other operations. The jobs created can be executed from within the studio or as standalone scripts.

In a nutshell, Talend Open Studio is a solution that addresses all of an organization's data integration needs:
- Synchronization or replication of databases
- Right-time or batch exchanges of data
- ETL for analytics
- Data migration
- Complex data transformation and loading
- Data quality

Continuous efforts are being made to make Talend a tough competitor to products from the commercial space. With every version we see enhanced features being added. Some of the features seen in the latest version of Talend are:
- Open Bravo components
- Die on Error on tMap
- Informix bulk inserts (tInformixBulkOutput)
- tELTMSSQL, tELTSybase and tELTPostgreSQL components
- PreparedStatement enabled for all the DB Row components
- MacOS X ini file points to the correct launcher
- Enhanced analysis of a set of columns
- Ability to use Java user-defined indicators
- New type of UDI (with numeric values)
- Menus to drill down into the values on the pattern matching indicator

Here is a comparative analysis of Talend with some of its biggest open source and proprietary competitors.

Talend vs Pentaho: Pentaho is a metadata-driven framework which is tightly integrated into a BI framework, whereas Talend is a code generator which can be easily integrated with any BI platform. Pentaho supports Java as the programming language, whereas Talend supports both Perl and Java. There are no limitations on loading data in the case of Talend. And the most important thing is that we don't need to install and configure the Talend software.

Talend vs CloverETL: CloverETL is also a metadata-driven framework, with a limitation on loading huge numbers of records. CloverETL doesn't accept .xlsx files, whereas Talend handles them easily.

Talend vs Informatica: Informatica is cost effective, and Talend is open source, so we can download and configure it easily. In Talend we can export a job as a script and run it from the command prompt, which is not the case with Informatica.

Talend is now the recognized market leader in open source data integration and has become a competitor to the other market leaders. There are now more than 1,000 paying customers around the globe, including Yahoo!, Virgin Mobile, Sony and Swiss Life. It's not unreasonable to say that Talend will go a long way.


by Harapriya Montry

Transform Data into Profits

Stuck between Build or Buy? Well, nothing strange about that. Both are very attractive choices, which makes the decision even more difficult. The scenario is the same in any data warehousing project, where you need to decide whether to populate your data warehouse manually using custom code or to choose a proprietary ETL tool like Informatica or Oracle Warehouse Builder. And then there is always this one good thing about open source tools: you get exactly what you need for free. This has brought smiles to the faces of the business in a lot of organizations, and a handful of them went on to choose an open source product for ETL.

Let's dig a little deeper. The build solution is appealing in that there are no upfront costs associated with software licensing and you can build the solution to your exact specifications. However, businesses today are in a constant state of change, and the ongoing costs of maintaining a custom solution often negate the initial savings. Proprietary ETL offerings will get your project off the ground faster and provide dramatic savings in maintenance costs over time, but often carry a six-figure price tag just to get started.

Pentaho Data Integration delivers the best of both worlds, with no upfront license costs and a significant reduction in TCO compared to custom-built solutions. An annual subscription providing professional support, certified builds, and IP indemnification is also available at a fraction of the cost of proprietary offerings.

When the KETTLE open source product moved under the Pentaho umbrella, it gave the product a new lease on life. It was already one of the most popular open source ETL tools (if not the most popular), with a vibrant developer community. However, it was at risk of falling behind open source ETL offerings (like Talend) that were backed by a funded company. Pentaho is a good match for KETTLE as it puts it into a complementary suite and offers some out-of-the-box integration between products. Companies looking for all-in-one open source business intelligence are going to like this suite.

Pentaho Data Integration

Unlike the traditional ETL process (extract, transform and load), KETTLE follows a slightly modified concept, ETTL:
- Data extraction from source databases
- Transport of the data
- Data transformation
- Loading of data into a data warehouse

Kettle comes with 4 tools:
- Spoon: GUI allowing you to design complex transformations
- Pan: Batch executor of transformations (XML or in repository)
- Chef: GUI allowing you to design complex jobs
- Kitchen: Batch executor of jobs (XML or in repository)

Highlights:
- It supports a parallel processing architecture by distributing ETL tasks across multiple servers.
- Out-of-the-box integration with other Pentaho open source products such as BI, EII and EAI.
- The GUI designer interface, the out-of-the-box transformer objects and the support for slowly changing dimensions should increase developer productivity.
- Community articles show an enthusiastic sharing of tips and tricks.
- Enterprise-class performance and scalability.
- SAP Connector also available.

Why should we go for Kettle?
- It is one of the oldest open source ETL tools.
- It has a large user community and a new drive from Pentaho's support.
- It can run on Windows, UNIX and Linux.
- It integrates with other Pentaho open source products such as BI, EII and EAI.
- There is no license fee.
- It has a strong, easy-to-use GUI that requires less training.
- It includes a transformation library with over 70 mapping objects.
- Almost every popular database is supported.
- Many advanced features exist to allow fast inserts, such as batch updates.
- A Pentaho forum, an issue tracker and the Pentaho Community offer in-depth technical articles that are better than some premium ETL vendor sites.

Pentaho Data Integration is a full-featured ETL solution, including:
- Transformations and jobs are made up of 100% metadata. This metadata is parsed by Kettle and executed. No code generation is involved.
- Pentaho Kettle has relatively richer features in its open source version compared to other open source alternatives like Talend, CloverETL, etc.
- Fairly large connectivity options to support all databases and systems.
- A very rich library of transformation objects, which can be extended.
- Support for real-time debugging.
- A command line or application interface to control and run jobs, available in both the open source and commercial editions.
- One very important ETL need, Dimension Lookup/Update to handle slowly changing dimensions, is available and easy to use.
- Error logs are easily available and easy to configure, with no need to code them explicitly.
- A Pentaho Services monitoring console is available for monitoring Pentaho-related services.
- Error recovery is manual, though the Text File Input and Excel Input operators can log the error rows and re-run only those rows on the next run.
- The clustering feature is available in the open source edition.
- It provides a plug-in mechanism that allows us to create plug-ins for any possible data acquisition or transformation purpose.
- It is one of the only ETL tools on the market to support partitioned tables on PostgreSQL, by allowing records to be inserted into different inherited tables.
- It can schedule tasks, but needs a scheduler for that.
- It can run remote jobs on slave servers on other machines.
- It has data quality features: from its own GUI, by writing more customized SQL queries, JavaScript and regular expressions.
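For readers who have not met slowly changing dimensions before, the following sketch shows what a Type 2 update does: when an attribute changes, the current row is closed and a new version is appended, so history is preserved. This is not Kettle's implementation of Dimension Lookup/Update, only a plain-Python illustration under assumed names (`apply_scd2`, `customer_id`, `city`, `valid_from`, `valid_to`).

```python
from datetime import date


def apply_scd2(dimension, incoming, today):
    """Apply Type 2 changes: expire the current row and append a new version.

    dimension: list of dicts with keys customer_id, city, valid_from, valid_to
               (valid_to is None for the current version of a customer).
    incoming:  dict mapping customer_id -> latest city from the source system.
    """
    current = {row["customer_id"]: row for row in dimension if row["valid_to"] is None}
    for customer_id, city in incoming.items():
        row = current.get(customer_id)
        if row is not None and row["city"] == city:
            continue  # nothing changed, keep the current version
        if row is not None:
            row["valid_to"] = today  # close the old version
        dimension.append(
            {"customer_id": customer_id, "city": city,
             "valid_from": today, "valid_to": None}
        )
    return dimension
```

A customer who moves city ends up with two rows, the old one with a `valid_to` date and the new one open-ended, so reports can still be run as of any past date.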

KETTLE vs Talend

Both Talend and Kettle are among the industry-leading open source ETL tools. Let's compare a few of their features.

Ease of use
- Pentaho Kettle: It has the easiest-to-use GUI of all the ETL tools. Training can also be found online or within the community.
- Talend: It also has a GUI, but as an add-on inside Eclipse RCP.

Speed
- Pentaho Kettle: It is faster than Talend, but the Java connector slows it down somewhat. Like Talend, it requires manual tweaking. It can be clustered by placing it on many machines to reduce network traffic.
- Talend: It is slower than Pentaho. It requires manual tweaking and prior knowledge of the specific data source to reduce network traffic and processing.

Data quality
- Pentaho Kettle: It has DQ features in its GUI and allows for customized SQL statements, JavaScript and regular expressions. It also has some additional modules available on subscription.
- Talend: It has DQ features in its GUI and allows for customized SQL statements and Java.

Connectivity
- Pentaho Kettle: It can connect to a very wide variety of databases, flat files, XML files, Excel files and web services.
- Talend: It can connect to all the current databases, flat files, XML files, Excel files and web services, but is reliant on Java drivers to connect to those data sources.

The best part is that Pentaho Data Integration's metadata-driven approach lets you simply specify WHAT you want to do, not HOW you want to do it. Administrators can create complex transformations and jobs in a graphical, drag-and-drop environment without having to write any custom code. And Pentaho is definitely a good match for KETTLE, as it puts it into a complementary suite and offers some out-of-the-box integration between products. KETTLE transforms data into profits.

Kettle is 100% metadata based and requires no code generation in order to run. Metadata-driven ETL tools are worth their weight in gold because they don't require code changes in order to fully manage and control the tool.
It uses an innovative metadata-driven approach and has a strong and very easy-to-use GUI. It has a strong community of 13,500 registered users. It has a stand-alone Java engine that processes the jobs and tasks.
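The idea of a metadata-driven engine can be sketched as follows: the transformation is plain data, and a generic engine interprets it at run time, so changing the job means editing metadata rather than regenerating code. This is a toy illustration, not Kettle's file format or API. The step types and the `TRANSFORMATION` description are hypothetical.

```python
def filter_step(rows, cfg):
    """Keep rows whose field is at least the configured minimum."""
    return [r for r in rows if r[cfg["field"]] >= cfg["min"]]


def rename_step(rows, cfg):
    """Rename one field in every row."""
    return [
        {(cfg["to"] if key == cfg["from"] else key): value for key, value in r.items()}
        for r in rows
    ]


def sort_step(rows, cfg):
    """Sort rows by the configured field."""
    return sorted(rows, key=lambda r: r[cfg["field"]])


# The engine's library of step types (hypothetical; real tools ship far more).
STEP_TYPES = {"filter": filter_step, "rename": rename_step, "sort": sort_step}

# The transformation itself is only metadata: WHAT to do, not HOW.
TRANSFORMATION = {
    "name": "top_customers",
    "steps": [
        {"type": "filter", "field": "revenue", "min": 100},
        {"type": "rename", "from": "cust", "to": "customer"},
        {"type": "sort", "field": "revenue"},
    ],
}


def run_transformation(metadata, rows):
    """Interpret the metadata step by step; no code is generated."""
    for step in metadata["steps"]:
        rows = STEP_TYPES[step["type"]](rows, step)
    return rows
```

A code generator such as the one described for Talend would instead turn the same description into a Java or Perl program that is then compiled or run on its own.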


by Jagyanseni Das

Bringing Business Closer to IT

A programmer spent hours of his spare time switching from one database format to another, with big troubles related to handling duplicates and converting formats. Then he was supposed to integrate thousands of records from several MS Excel spreadsheets into an Oracle DB. A business analyst wonders how much time would be required to integrate all the data. How would the company's data warehouse be fed with current and past records?

This is where an ETL tool comes to the rescue of developers as well as the business. ETL tools are meant to extract, transform and load data into the data warehouse for decision making. They make a programmer's job easier by providing an easy way to update and insert records in different sources very quickly. There are various ETL tools available in the market, and Apatar is one of the market-leading ones. The Apatar open source project was founded in 2005 by Apatar, Inc. At the beginning of 2007, Alex Malkov, Apatar Product Manager, felt it was time to share the results of their hard work with business users, to save them from spending endless hours coding and building pipes between data sources and applications.

Why Apatar?
- It is very user-friendly: even a non-technical user would need just a couple of hours to get trained.
- Where data integration is required on a regular basis, users will benefit from reusable connectors and mappings, with a further quantum leap in productivity.
- Customers don't have to pay for the software, as it is open source.
- Once the programmer installs it and starts working in the environment, we don't need to pay the programmer more for maintenance.
- Only a little technical knowledge and training is required to start working with Apatar.

Apatar has been developed using sophisticated techniques to achieve data integration by drag and drop of connectors, operations and data quality services.

Features of Apatar:
- Integrate data across the enterprise
- Populate data warehouses and data marts
- Cross systems such as source systems, flat files, FTP logic, queue-to-queue and application-to-application
- Cross time zones and currency barriers efficiently
- Overcome brittle mainframe or legacy code uplinks that transfer data, sometimes unreliably
- Schedule and maintain no-code or little-code connections to many different systems
- Platform independent
- Unicode-compliant (meaning it can handle any language)
- It provides a mailing facility for notification
- It provides the CDYNE Death Index for verifying Social Security information and preventing deceased credit fraud. If the customer is deceased, it provides the date of death, date of birth, and zip code of last known residence. All of the information is cross-referenced with CDYNE's Death Index master file, which is updated directly from the U.S. Social Security Administration once a month.
- It provides the CDYNE Phone Verification Web service, which determines the validity of any U.S. or Canadian phone number.

New Releases:
- Apatar allows text to be automatically truncated to a specified length
- Apatar Data Integration parses the text of emails, omitting system information
- Apatar controls the number of scheduled data integration launches
- Apatar Data Integration replaces multiple field values

Advantage: No coding. A visual job designer is used to develop all kinds of mappings and transformations. It provides connectivity for more than 40 different data sources.

With its ease of use, Apatar is bridging the gap between business and IT. Development is still under way to provide connectors for SAP, Microsoft Exchange Server, and Microsoft Dynamics CRM. Let's see what the next release has for us.

by Sodyam Bebarta

ETL made Easy

Total Cost of Ownership is the major concern for all organizations, irrespective of geography, domain and area of operation. The ETL market is hot, with many vendors and products, but very few products address all of these concerns. One such product is CloverETL. It is platform-independent and resource-efficient. Thanks to its high scalability, it can be used on low-cost PCs as well as high-end multi-processor servers.

CloverETL is enhanced with a visual designer for data transformations, CloverETL Designer. It allows for the easy design of any data manipulating application through a suitable combination of standard predefined ETL components in a visual editor.

CloverETL Engine is an open source tool distributed under a dual license, which allows total transparency and control over the tool, as the complete source code for the engine is available to all customers and end users.

In order to use any software tool in a professional environment, it is necessary to have competent support and service to provide bug fixes or application enhancements, and to have expert consultants at hand who have practical knowledge of and experience with the tool. Customer support as well as training and consulting services for CloverETL Engine and CloverETL Designer are offered by the OpenSys company.

How does CloverETL work?

It is based on the Transformation Graph. A Transformation Graph is a set of specialized components which are interlinked via data pipes. Every such component performs a certain operation. Data processed by CloverETL flows through the transformation graph and is transformed step by step into the required format. While performing the transformation, data can be merged, sorted, split or enriched in many other ways.

Pros

Embedded technology: Being completely platform independent, CloverETL can be easily embedded in other applications as a powerful transformation library.

Small footprint: Compared to its competitors, CloverETL shows modest memory requirements even when performing complex data transformation tasks.

Rapid customization: Thanks to its modular structure, CloverETL can be easily extended by custom Java-coded components. Such components can be used like any other component contained in the standard package.

Reduced cost of ownership: The CloverETL suite offers a wide range of solutions to meet any user requirements. Ranging from the developer-oriented CloverETL Engine to the enterprise-oriented CloverETL Server, CloverETL delivers the best price-performance ratio.
Short development time: CloverETL is continuously developed by a stable team of programmers, which allows flexible reaction to customer needs and feedback. Customization can be delivered within a few days.

Easy installation: There is no need for expensive on-site assistance; CloverETL can be easily installed, configured, and run by its users. There is no need to install expensive proprietary applications as a running environment.

New Features and Components:

Infobright Data Writer: This component writes data into Infobright, a column-oriented relational database. Infobright is a provider of solutions designed to deliver a scalable data warehouse optimized for analytic queries. Its product is a highly compressed column-oriented database based on the MySQL engine. In this database, data is stored column by column instead of the more typical row by row. Column orientation has many advantages, including more efficient data compression and the ability to optimize compression for each particular data type. The higher efficiency can be achieved because each column stores a single data type, as opposed to rows, which typically contain several data types. However, the main purpose of Infobright software is to deliver a scalable data warehouse database optimized for analytic queries.
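Why a column layout compresses better can be shown with a toy example. The sketch below pivots a few rows into columns and applies run-length encoding, which is used here purely as an illustrative compression technique. It is not a description of how Infobright compresses data, and the sample `ROWS` are made up.

```python
from itertools import groupby

# Illustrative fact rows: (date, country, amount).
ROWS = [
    ("2010-03-01", "ZA", 100),
    ("2010-03-01", "ZA", 250),
    ("2010-03-01", "IN", 75),
    ("2010-03-02", "IN", 75),
]


def to_columns(rows):
    """Pivot row tuples into one list per column."""
    return [list(column) for column in zip(*rows)]


def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]
```

Stored row by row, adjacent values in this sample never repeat, so run-length encoding produces 12 runs for 12 values. Stored column by column, the same data collapses into 7 runs, because each column holds values of one type that tend to repeat.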

Web Services component: The new component makes communication with Web Services easier than ever. It provides a user-friendly graphical interface for mapping your data into Web Service fields, automatically generates requests and processes responses.

Excel component: In addition to reading plain data from Microsoft Excel sheets, the Excel component is now also capable of reading user-formatted values such as currencies, dates or numbers.

New tracking option: Customers can now see all absolute speed rates for finished data transformations, facilitating comparative analysis in pursuit of process improvements.

New Aspell Lookup table: A brand new implementation of this component brings better performance, improved configuration and better customization.

Improved treatment of empty (NULL) values: Developers can now specify special strings that should be treated as empty (NULL) when data is being parsed. This feature simplifies the processing of typical application export files, which often contain values insignificant for ETL processing. Additionally, it may lead to improved processing throughput and lower memory consumption of the data transformation.

A more user-friendly File URL dialog and improved LDAP functionality.

CloverETL vs Others
- CloverETL is a metadata-based tool; it does not require any code generation in order to run jobs. Changes made to transformations outside the visual editor are reflected once they are loaded back into the designer.
- It brings a smaller palette of components, but their functionality is more complex; they beat Talend's equivalents in many aspects. It's also easier to choose a suitable component in CloverETL's palette than in Talend's palette.
- It has a special ETL scripting language, CTL (Clover Transformation Language), which is easy to learn and enables users without programming skills to develop a complex transformation in a short time.
- Both CloverETL and Talend support component and pipeline parallelism to speed up executions. Test results show that Talend is not able to efficiently utilize more CPUs to speed up an execution.
- Even for an experienced ETL developer, Pentaho Data Integration is definitely more difficult to learn than CloverETL. Components often have unexpected names and a confusing interface. Many components require sorted input, making graphs more cluttered.

News of the Hour: Customers can evaluate these new features along with CloverETL's other leading capabilities with a free 30-day trial of the CloverETL Designer Pro evaluation. Information management professionals can also evaluate the enterprise integration features of CloverETL Server via an online demo.
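The transformation graph idea behind CloverETL, with components linked by data pipes, can be pictured with Python generators, where each generator plays the role of a component and the iterator between two of them plays the role of a pipe. This is a sketch for illustration only: it is not CTL and not CloverETL's API, and all component names here are hypothetical. One component treats special strings as NULL, in the spirit of the feature described above.

```python
def reader(records):
    """Source component: emit one record at a time."""
    for record in records:
        yield dict(record)


def null_normalizer(stream, null_markers):
    """Replace special marker strings with None (NULL)."""
    for record in stream:
        yield {key: (None if value in null_markers else value)
               for key, value in record.items()}


def filter_component(stream, predicate):
    """Pass on only the records that satisfy the predicate."""
    for record in stream:
        if predicate(record):
            yield record


def sorter(stream, key):
    """Sort the records; this component must see all its input first."""
    yield from sorted(stream, key=key)


def writer(stream):
    """Sink component: collect the records that reach the end of the graph."""
    return list(stream)


def run_graph(records):
    """Wire the components together; each assignment is one data pipe."""
    stream = reader(records)
    stream = null_normalizer(stream, null_markers={"", "N/A"})
    stream = filter_component(stream, lambda r: r["amount"] is not None)
    stream = sorter(stream, key=lambda r: r["amount"])
    return writer(stream)
```

Because generators are lazy, records flow through the first three components one at a time, which is a simple form of pipelining. The sorter is the exception, since sorting needs the whole input before it can emit anything.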

ETL - Time for a walk in the clouds

by Sudip Basu

Enterprises are slowly moving into Cloud Computing, and IT giants like Microsoft, Amazon, IBM and many others are gearing up to facilitate this change. Businesses use technology to cut costs, and shifting to cloud computing will not only cut costs but also allow companies to focus on their core business. Cloud computing is a way of computing, via the Internet, that broadly shares computer resources instead of having a local personal computer handle specific applications.

Cloud Computing can be sliced into three main layers:
1. Hardware/Infrastructure: storage or CPU power.
2. Software/Application: you can use the software as a utility, like renting a car.
3. Platform: here you can build and deploy your web applications.
Cloud Computing can be viewed as a stack of the above layers.

Now, with cloud computing becoming popular, we will soon see data in different formats in many different clouds. With time we will require cloud-to-cloud integration or even cloud-to-enterprise integration, and that is where ETL (Extract, Transform and Load) comes into the picture. Because the adoption of cloud computing leaves data scattered all around, we need to be sure that the data is up to date, accurate and complete.

Cloud Data Integration:
Importing and exporting data needs the ETL to read data in different formats and convert them to the right format of the target system. We would need the ETL to map different formats, such as relational database to flat file or flat file to web service. Using the power of graphical tools, all this can be just a click, a drag and a drop.

Synchronization of data:
With several applications, we need proper synchronization of data between them, and this too can be done using the power of the ETL: the look-up functionality of the transformation can sync the data between the different data sources.

Optimization of the data:
Using the power of the ETL we can look for duplicate data or even check for data integrity. Data quality checks can be run on the data and the errors can be logged.

Replication of data:
ETL can be used to replicate data to back it up, or even to move it from an on-premise database to the cloud.

You may be wondering: our common ETL does all these things already, so what is new with a cloud?
1. As we can get Infrastructure as a Service (IaaS), we can do the transformations in no time. We can increase the speed of the processing by increasing the infrastructure allotted. Bulk data can be processed in parallel, and the infrastructure can be released when the job is completed.
2. We no longer need to worry about software installation and license maintenance; it would all be a web-based experience, delivered as Software as a Service (SaaS).
3. We will be able to experience the true integration capabilities of an ETL.

The combination of ETL technology with a cloud, with proper planning, could set any small business up and running, using the same technologies that any big company may be using. Integration is key here, as ETL can make or break the harmony between the clouds.

Large companies that have already spent huge sums of money on data centers and applications would not want to move into cloud computing and forgo what they have already invested in; rather, they would integrate the existing infrastructure with a cloud. This could give them the flexibility to experiment with new technology without having to worry about the infrastructure or the cost of licenses.

With clouds we will also see unstructured data all around. The challenge of maintaining this data can be met using Content ETL, which can map the different models, look for the permissions, metadata and users, and then perform the actual transfer. Thus the ETL can now be used for content transfers between clouds.

The cloud, using the ETL, could take data wherever and whenever required. This could also mean optimized use of resources, which in turn could reduce the cooling cost of servers and the maintenance of large data centers.

Could it actually be that a proper integration using ETL and cloud computing is here to make the world a greener place?
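The format mapping and data optimization steps discussed in this article can be sketched in a few lines. This is an illustrative Python example under stated assumptions, not the behaviour of any particular ETL product: the function names, the "id" key column and the sample flat file are hypothetical.

```python
import csv
import io
import json


def flat_file_to_records(text):
    """Extract: read comma-separated text into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(records, key="id"):
    """Transform: drop duplicate keys and log rows that fail an integrity check."""
    seen, good, errors = set(), [], []
    for rec in records:
        if not rec.get(key):
            errors.append(("missing key", rec))
        elif rec[key] in seen:
            errors.append(("duplicate", rec))
        else:
            seen.add(rec[key])
            good.append(rec)
    return good, errors


def to_web_payload(records):
    """Load: serialise to JSON, as a web-service target might expect."""
    return json.dumps(records, sort_keys=True)


SOURCE = "id,city\n1,Pune\n2,Kolkata\n1,Pune\n,Delhi\n"
good, errors = clean(flat_file_to_records(SOURCE))
payload = to_web_payload(good)
```

Rejected rows are kept in an error log rather than silently dropped, so the quality of the source data can be reviewed later.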


by BI Labs Members

ROI on Data Integration with Informatica

What better introduction could there be for the world's number one independent leader in data integration software: Informatica. Informatica Corporation provides data integration and data quality software and services for various businesses, industries and government organizations, including telecommunications, health care, insurance, and financial services. Informatica comprises six business units: Data Integration, Data Quality, Cloud Data Integration, Application Information Lifecycle Management (ILM), Complex Event Processing (CEP) and B2B.

What gives way to ETL tools like Informatica?

Think of GE: the company has over 100 years of history and a presence in almost all industries. Over these years the company's management style has changed from book keeping to SAP. This transition did not happen in a single day. Along the way the company used a wide array of technologies, ranging from mainframes to PCs, data storage ranging from flat files to relational databases, and programming languages ranging from COBOL to Java. This transformation resulted in different businesses, or to be precise different sub-businesses within a business, running different applications, different hardware and different architectures. Technologies were introduced as and when they were invented and as and when they were required.

This led directly to a scenario where the HR department of the company runs on Oracle Applications, Finance runs SAP, some part of the process chain is supported by mainframes, some data is stored on Oracle, some on mainframes, some in VSAM files, and the list goes on. If one day the company requires a consolidated report of assets, there are two ways to get it. The first is completely manual: generate different reports from different systems and integrate them. The second is to fetch all the data from the different systems and applications, build a Data Warehouse, and generate reports as per the requirement.

Obviously the second approach is the better bet. Fetching the data from different systems, making it coherent and loading it into a Data Warehouse requires some kind of extraction, cleansing, integration, and load. ETL stands for Extraction, Transformation and Load. ETL tools provide the facility to extract data from different non-coherent systems, cleanse it, merge it and load it into target systems.

Informatica: what and how?

Informatica is an easy-to-use ETL tool. It has a simple visual interface, like forms in Visual Basic. You just need to drag and drop different objects (known as transformations) and design a process flow for data extraction, transformation and load. These process flow diagrams are known as mappings. Once a mapping is made, it can be scheduled to run as and when required. In the background, the Informatica server takes care of fetching data from the source, transforming it, and loading it into the target systems or databases.

Informatica can communicate with all major data sources (mainframe, RDBMS, flat files, XML, VSAM, SAP etc.) and can move and transform data between them. It can move huge volumes of data very effectively, many times better than even bespoke programs written for a specific data movement. It can throttle transactions (do big updates in small chunks to avoid long locking and filling the transaction log). It can effectively join data from two distinct data sources (even an XML file can be joined with a relational table). In all, Informatica has the ability to effectively integrate heterogeneous data sources and convert raw data into useful information.

Architecture Illustration:

The Informatica ETL product, known as Informatica PowerCenter, consists of 3 main components.

1. Informatica PowerCenter Client Tools: These are the development tools installed at the developer's end. These tools enable a developer to:
Define the transformation process, known as a mapping (Designer)
Define run-time properties for a mapping, known as sessions (Workflow Manager)
Monitor execution of sessions (Workflow Monitor)
Manage the repository, useful for administrators (Repository Manager)
Report metadata (Metadata Reporter)

2. Informatica PowerCenter Repository: The repository is the heart of the Informatica tools. It is a kind of data inventory where all the data related to mappings, sources, targets etc. is kept. This is the place where all the metadata for your application is stored. All the client tools and the Informatica Server fetch data from the repository. An Informatica client and server without a repository is the same as a PC without memory or a hard disk: it has the ability to process data but has no data to process. The repository can be treated as the backend of Informatica.

3. Informatica PowerCenter Server: The server is the place where all the executions take place. The server makes physical connections to sources and targets, fetches data, applies the transformations specified in the mapping and loads the data into the target system.

Informatica Transformations: A Value Add

A transformation is a repository object that generates, modifies, or passes data. The Designer provides a set of transformations that perform specific functions. For example, an Aggregator transformation performs calculations on groups of data. Informatica has a strong list of built-in transformations that it provides to ease an ETL developer's work.

The Information Cloud

SaaS has gained enormous ground in the competitive market today. Informatica sensed this shift towards Cloud Computing before any other ETL tool provider and launched a dozen adaptors like Affymetrix, Brocade Communications Systems and PowerData. A change in direction was observed after Informatica joined the ApexConnect program on the AppExchange. This has turned out to be a strategic relationship for its customers, ensuring they can manage and share all of their enterprise data and information on demand.

As cloud computing has become widely adopted in organizations of all sizes, Informatica has continued to expand its focus on cloud data integration. In 2009 the company announced Informatica Cloud 9, a comprehensive offering for cloud data integration. It featured:
The Informatica Cloud Platform: a multitenant, enterprise-class data integration platform-as-a-service (PaaS).
Informatica Cloud Services: purpose-built, software-as-a-service (SaaS) data integration applications designed for non-technical users.
Informatica Data Quality and PowerCenter Cloud Editions: the ability for customers to run Informatica software on infrastructure-as-a-service platforms such as Amazon EC2.

Challengers?

In spite of being the best-of-breed product in the Data Integration space, Informatica faces tough competition from hand coding. Some of the proprietary competing ETL tools are IBM DataStage, Ab Initio, Business Objects Data Integrator, and Microsoft's SQL Server Integration Services. Not to be forgotten are the open source offerings like Apatar, CloverETL, Pentaho, Kettle and Talend, which are pushing from the low end by offering less expensive solutions that meet quality expectations. Informatica has answered its competitors by continuously acquiring the best available in the market, and the latest HOT news reads "Informatica Acquires Siperian". Details at http://
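The idea of throttling transactions, mentioned in this article as doing big updates in small chunks, can be shown with a small sketch. This is a generic Python and SQLite illustration of the technique, not how Informatica implements it; the orders table, the status values and the function update_in_chunks are assumptions made for the example.

```python
import sqlite3


def update_in_chunks(conn, chunk_size=1000):
    """Apply one large update as a series of small, separately committed batches."""
    batches = 0
    while True:
        cur = conn.execute(
            "UPDATE orders SET status = 'archived' "
            "WHERE id IN (SELECT id FROM orders WHERE status = 'open' LIMIT ?)",
            (chunk_size,),
        )
        conn.commit()  # short transactions keep locks brief and the log small
        if cur.rowcount == 0:
            break
        batches += 1
    return batches


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany("INSERT INTO orders (status) VALUES (?)", [("open",)] * 2500)
conn.commit()

batches = update_in_chunks(conn)
remaining = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE status = 'open'"
).fetchone()[0]
```

With 2,500 open rows and a chunk size of 1,000, the update runs as three committed batches instead of one long transaction.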


by Banita Rout

The solution to Enterprise Data Integration

Over the past decade, IT departments at many organizations have built large, sophisticated data integration and management infrastructures using industry-leading products to deliver business value in terms of better customer understanding, faster time to market, and business agility, all at lower costs. Most of today's critical business initiatives cannot succeed without effective integration of information. Initiatives such as single view of the customer, business intelligence, supply chain management, and Basel II and Sarbanes-Oxley compliance require consistent, complete, and trustworthy information.

IBM Information Server is the industry's first comprehensive, unified foundation for enterprise information architectures. It is an integrated set of components, including WebSphere DataStage and QualityStage, WebSphere Information Analyzer, Federation Server, and Business Glossary, that share a common metadata repository, common administration, common logging and common reporting. It is capable of scaling to meet any information volume requirement, so that companies can deliver business results within these initiatives faster and with higher quality.

DataStage has seen major transformations in the past years, from an extract-transform-load tool running in what was called the UniVerse engine to what is now a DataStage engine. With the need to adapt to the demands of volume processing, the parallel processing engine has been integrated into DataStage.

IBM Information Server supports all of these initiatives:
Business intelligence
Master data management
Infrastructure rationalization
Business transformation
Risk and compliance

Capabilities

IBM Information Server features a unified set of separately orderable product modules, or suite components, that solve multiple types of business problems. Information validation, access and processing rules can be reused across projects, leading to a higher degree of consistency, stronger control over data, and improved efficiency in IT projects.

There are many new features and added functionalities in WebSphere DataStage that help cut development time, simplify job design and improve job performance. Among these are some more advanced features which make it the first choice of retailers:
Single interface to integrate heterogeneous applications
Flexible development environment
Data Connection Object
ODBC Connector
Slowly Changing Dimension stage
Range Look-up
Advanced and Quick Find
Parameter Set
Common Logging
Reuse, Versioning and Sharing
Resource Estimation tool
Performance Analysis tool and Job Compare

Business advantages of using DataStage as an ETL tool:

Apart from the other advantages, DataStage provides retailers with further benefits that make it the preferred ETL tool:
Significant ROI over hand-coding
Learning curve: quick development and reduced maintenance with a GUI tool
Development partnerships: easy integration with top market products interfaced with the data warehouse, such as SAP, Cognos, Oracle, Teradata and SAS
Single vendor solution for bulk data transfer and complex transformations (DataStage versus DataStage TX)
Transparent and wide range of licensing options

And now let's see how it is better than its leading market competitors.

DataStage Vs Informatica

DataStage is a more powerful transformation engine: using functions and routines, we can do almost any transformation. Informatica is more visual and programmer friendly.
Lookups in DataStage are much faster than in Informatica because of the way the hash files are built. We can tune the hash files to get optimal performance.
DataStage has a command line interface. The dsjob command can be used by any scheduling tool, or from the command line, to run jobs and check the results and logs of jobs.

DataStage Vs SSIS

SSIS introduces the partitioned sort, but DataStage shows far more evidence of a parallel processing architecture built to handle very high volume transformation, cleansing and load. Almost every type of transformation in a DataStage and/or QualityStage parallel job can partition data and run on multiple nodes.

IBM is one of the market's leading vendors, so it has always tried to maintain the performance level of its products, as it did with DataStage. Every new release of the product fixes some bugs and adds advanced features to compete with other market-leading ETL tools and to establish itself as a leading solution.
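The reason hash-based lookups are fast, as discussed in the DataStage Vs Informatica comparison, can be shown with a generic sketch. This Python example only illustrates the idea of an in-memory hash lookup; it is not DataStage's hash file format, and the names build_lookup, enrich and the sample rows are hypothetical.

```python
def build_lookup(reference_rows, key):
    """Build an in-memory hash table keyed on the lookup column."""
    return {row[key]: row for row in reference_rows}


def enrich(stream, lookup, key):
    """Attach the matching reference name to each row with one hash probe per row."""
    for row in stream:
        match = lookup.get(row[key])
        yield {**row, "product_name": match["name"] if match else None}


products = [{"sku": "A1", "name": "Kettle"}, {"sku": "B2", "name": "Toaster"}]
sales = [{"sku": "B2", "qty": 3}, {"sku": "Z9", "qty": 1}]

lookup = build_lookup(products, "sku")
enriched = list(enrich(sales, lookup, "sku"))
```

Each incoming row costs a single dictionary probe, regardless of how many reference rows exist, which is why the reference data is hashed once up front.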


Ab Initio
by BI Labs Members

A new beginning
While the selection of a database and a hardware platform is a must, the selection of an ETL tool is highly recommended but not a must. When you evaluate ETL tools, it pays to look for the following characteristics: functional capability, the ability to read directly from your data source, and metadata support.

The Ab Initio software is a suite of products which together provide a platform for data processing applications. The core Ab Initio products are:
Co>Operating System
The Component Library
Graphical Development Environment
Enterprise Meta>Environment
Data Profiler
Conduct>It

Ab Initio Vs Informatica

Informatica and Ab Initio both support parallelism, but Informatica supports only one type of parallelism whereas Ab Initio supports three:
1. Component parallelism
2. Data parallelism
3. Pipeline parallelism

Ab Initio supports different types of text files that are not possible in Informatica.

Informatica is an engine-based ETL tool, so we cannot see or modify the code that it generates after development. Ab Initio is a code-based ETL tool that generates ksh or bat code, which can be modified to achieve goals that cannot be taken care of through the ETL tool itself.

In Ab Initio you can attach error and reject files to each transformation and capture and analyze the message and data separately. Informatica has one huge log, which is very inefficient when working on a large process with numerous points of failure.

Informatica is very basic as far as transformations go, whereas Ab Initio is much more robust.

So go ahead and open up your ETL options with Ab Initio.
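The benefit of attaching error and reject outputs to each transformation, rather than writing one large log, can be sketched generically. The Python below is an illustration of the idea only, not Ab Initio code; run_step, parse_amount, add_tax and the sample rows are hypothetical names introduced for the example.

```python
def run_step(name, func, rows):
    """Apply func to every row, collecting failures in a reject list for this step only."""
    output, rejects = [], []
    for row in rows:
        try:
            output.append(func(row))
        except (ValueError, KeyError) as exc:
            rejects.append({"step": name, "row": row, "error": str(exc)})
    return output, rejects


def parse_amount(row):
    """Convert the amount field from text to a number."""
    return {**row, "amount": float(row["amount"])}


def add_tax(row):
    """Add a total that includes a 10% tax (an arbitrary rate for the example)."""
    return {**row, "total": round(row["amount"] * 1.1, 2)}


rows = [{"id": 1, "amount": "10.0"}, {"id": 2, "amount": "oops"}]
parsed, parse_rejects = run_step("parse_amount", parse_amount, rows)
taxed, tax_rejects = run_step("add_tax", add_tax, parsed)
```

Because each step keeps its own rejects, a failure can be traced to the transformation that caused it without searching a combined log.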

However, it is an absolute piece of cake to use. It does require some thought, but that has more to do with the logic of the process than with the use of the tool itself. Understanding what you want to achieve is stage one, establishing which graph components to use is stage two, and the easy stage is the last one: putting the graph together. Ab Initio is a suite of applications containing various components, but generally when people say Ab Initio they mean the Ab Initio Co>Operating System, which is primarily a GUI-based ETL application. It gives the user the ability to drag and drop different components and attach them, quite akin to drawing. The strength of Ab Initio as an ETL tool is massively parallel processing, which gives it the capability to handle large volumes of data.

Ab Initio has added lots of features over the years, especially in response to prospect or customer requests:
IBM OS/390 support
SOAP/XML support
A compressed file system that can directly store 100s of TBs of user data
Dynamic script generation
PDL and component folding
Handling run-time related errors
Efficient use of components
Documentation tools
Run history tracking
Mastery of parallel processing, high performance computing and ETL job performance
Understanding of associated environments and technologies

by BI Labs Members

Data Quality with ODI

Today's integration project teams face the daunting challenge of deploying integrations that fully meet functional, performance, and quality specifications on time and within budget. These processes must be maintainable over time, and the completed work should be reusable. Traditional Extract, Transform, Load tools closely intermix data transformation rules with the integration process procedures, requiring the development of both data transformations and data flow. ODI-EE takes a different approach to integration by clearly separating the declarative rules (the "what") from the actual implementation (the "how").

Integrating data and applications throughout the enterprise, and presenting a unified view of them, is a complex proposition. Not only are there broad disparities in data structures and application functionality, but there are also fundamental differences in integration architectures. Some integration needs are data oriented, especially those involving large data volumes. Other integration projects lend themselves to an event-oriented architecture for asynchronous or synchronous integration. Changes tracked by Changed Data Capture constitute data events. The ability to track these events and process them regularly in batches or in real time is the key to the success of an event-driven integration architecture. ODI-EE provides rapid implementation and maintenance for all types of integration projects.

The ODI-EE architecture is organized around a modular repository, which is accessed in client-server mode by components (graphical modules and execution agents) that are written entirely in Java. The architecture also includes a Web application, Metadata Navigator, which enables users to access information through a Web interface.

Poor-quality data afflicts almost every company of moderate size and operational complexity. In fact, inconsistent, inaccurate, incomplete, and out-of-date data are often the root cause of expensive business problems such as operational inefficiencies, faulty analysis for business optimization, unrealized economies of scale, and dissatisfied customers. Savvy IT managers can solve a host of these and other business-level problems by committing to a program of comprehensive data quality. Oracle Data Integrator offers a comprehensive data quality solution to meet any data quality challenge for any type of global data with a single, well-integrated technology package.
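The separation of declarative rules (the "what") from the implementation (the "how") described in this article can be sketched in a few lines. This is a generic Python illustration of the principle, not ODI-EE's rule syntax or engine; RULES, apply_rules and the sample record are assumptions made for the example.

```python
# Declarative part (the "what"): hypothetical mapping rules expressed as plain data.
RULES = [
    {"target": "full_name", "expr": lambda r: f"{r['first']} {r['last']}"},
    {"target": "country", "expr": lambda r: r["country"].upper()},
]


def apply_rules(rows, rules):
    """Procedural part (the "how"): a generic engine that knows nothing about the rules."""
    return [{rule["target"]: rule["expr"](row) for rule in rules} for row in rows]


source = [{"first": "Ada", "last": "Lovelace", "country": "uk"}]
result = apply_rules(source, RULES)
```

Adding or changing a mapping only touches the rule list; the engine that executes the rules stays the same.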


Oracle's solution for comprehensive data quality includes three products: Oracle Data Integrator, Oracle Data Profiling, and Oracle Data Quality for Oracle Data Integrator. These three best-of-breed technologies work seamlessly together to solve the most challenging enterprise data governance problems.

The first step in a comprehensive data quality program is to assess the quality of your data through data profiling. Profiling data means reverse-engineering metadata from various data stores, detecting patterns in the data so that additional metadata can be inferred, and comparing the actual data values to expected data values. Profiling provides an initial baseline for understanding the ways in which actual data values in the systems fail to conform to expectations. Oracle Data Integrator's profiling capabilities ensure that data assessment is not a one-time activity, but an ongoing practice that ensures data quality over time.

Once data problems are well understood, the rules to repair those problems can be created and executed by data quality engines. For both standard and advanced data quality, an initial set of rules can be generated based on the results of profiling; users who understand the data can then refine and extend those rules.

Comprehensive data quality should be a key enabling technology for any IT infrastructure, and it is critical to solving a range of expensive business problems. It is particularly important in the context of any data integration process, to prevent data quality problems from proliferating. Oracle Data Integrator's inline, stepped approach to comprehensive data quality ensures that data is adequately verified, validated, and cleansed at every point of the integration process.

After quality, security is the main concern that needs to be taken care of. The first steps in securing an integration project are setting up access to Oracle Data Integrator objects and defining user profiles and access privileges for those users. Oracle Data Integrator can provide the security an integration project requires, even in the most highly sensitive environments.

The next challenge is version management. Development teams face a great deal of trouble in managing a project's hundreds (sometimes thousands) of work units throughout the development process and beyond. Success or failure of the version management process can greatly affect the integrity and success of any development project.

Regardless of the database or applications within the IT ecosystem, the ODI solution can be optimized to drive the highest-performance bulk or real-time transformations. Oracle's vision is to combine and enable these capabilities from within a next-generation, unbreakable Service-Oriented Architecture that will continue to drive business value within the enterprise for many years to come.
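The profiling step described in this article, detecting patterns in the data and comparing actual values to expected ones, can be illustrated with a small sketch. This is a generic Python example, not Oracle Data Profiling; the functions pattern_of and profile, the expected pattern and the sample postcodes are hypothetical.

```python
import re
from collections import Counter


def pattern_of(value):
    """Reduce a value to its shape: every digit becomes 9, every letter becomes A."""
    return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", value))


def profile(values, expected_pattern):
    """Count the shapes found and list the values that do not match the expected shape."""
    shapes = Counter(pattern_of(v) for v in values)
    violations = [v for v in values if pattern_of(v) != expected_pattern]
    return shapes, violations


postcodes = ["560001", "700091", "7000A1", "56-0001"]
shapes, violations = profile(postcodes, "999999")
```

The shape counts give a baseline of what the data actually looks like, and the violations list is a natural starting point for writing repair rules.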


Microsofts Bet in the ETL Market

by BI Labs Members

The data to which a company has access is key to its future success, but obtaining meaningful information from data can be far from straightforward. Companies may need to harvest data from multiple geographical locations, and it is unlikely that all the data will be stored in a single format. Microsoft Office Excel spreadsheets, Microsoft Access databases, XML documents, SQL Server databases, Oracle databases, Teradata data warehouses, and SAP systems are just a few of the data stores that contemporary organizations use. Other issues, such as data ownership and compliance with regulatory requirements, can further complicate matters. Data consolidation can be time consuming and resource intensive, and batch windows can be hard to find in an increasingly globalized environment. Furthermore, the value of data can depreciate in a relatively short period of time. Consequently, making reliable data available in a timely and efficient manner is a major challenge for the modern data worker.

SQL Server 2008 includes SQL Server Integration Services, an enterprise-level data integration and workflow solutions platform for performing extract, transform, and load (ETL) operations. Integration Services provides a set of powerful features that enable the merging and consolidation of data from heterogeneous sources, and includes tools for extracting, cleaning, standardizing, transforming, and loading data. A wide variety of built-in connectors support these operations, enabling Integration Services to interact not just with SQL Server databases, but with many other proprietary and non-proprietary data sources.

The SQL Server 2008 implementation of Integration Services builds upon the strengths of the previous release, and as a result the new release is a robust enterprise ETL platform that is even more productive and extensible. Two key areas of development in SQL Server 2008 Integration Services are improved options for connectivity and significant gains in performance.

The built-in connectors available are OLE DB, ADO.NET, FLATFILE, MULTI FLATFILE, FILE, FTP, HTTP, MSMQ, MSOLAP100, SMOSERVER, SMTP, SQLMOBILE, WMI and XML. In addition to the extensive range of built-in connectors, there are many more that we can install as add-ons. Some of these connectors are provided by Microsoft and others by third parties. There are two main reasons why vendors create add-on connectors: to facilitate access to a data source that is not supported by any of the built-in connectors, and to provide an improvement in performance over existing connectors.

Integration Services provides a wide range of data source connectors out of the box, and many add-on connectors are available from Microsoft and from third-party vendors. As a result, Integration Services is able to work with a broader range of sources than ever before. The new connectivity options have also contributed to improving performance, and Microsoft positions Integration Services as the fastest ETL tool available. We can use Integration Services to create packages that encapsulate a specific business requirement, such as extracting data from an Oracle database, cleaning the data, and then loading it into an Analysis Services database. Packages consist of one or more control flow tasks, where each task feeds into the next. Add-on connectors offer connectivity to sources that have no built-in connector, such as Teradata and SAP BI, or improved performance over existing connectors for sources that are already supported, such as Oracle.

Fig shows an Oracle Source and Oracle Destination in use as part of the data flow.



How many of you have heard the myth that Microsoft SQL Server Integration Services (SSIS) does not scale? Well, here is a question as an answer: does your system need to scale beyond 4.5 million sales transaction rows per second? SQL Server Integration Services is a high-performance Extract-Transform-Load (ETL) platform that scales to the most extreme environments and can process at the scale of 4.5 million sales transaction rows per second. That should bring a smile to a lot of faces.

Merging Horizons
of ETL, EAI and EII
by Gitanjali Kahaly

When managing complex database environments, IT vendors and buyers agree on the three top priorities: Integration, Integration and Integration. Restrictions on IT spending and staffing have further fuelled the need to integrate existing systems rather than invest in new technology. Data Integration refers to the organization's inventory of data and information assets as well as the tools, strategies and philosophies by which fragmented data assets are aligned to support business goals. Data Integration problems are becoming a barrier to business success, and a company must have an enterprise-wide data integration strategy if it is to overcome this barrier. That explains why so many vendors seem to be bragging about their integration capabilities these days.

Broadly speaking, enterprise business integration can occur at four different levels in an IT system: data, application, business process and user interaction. Many technologies fit neatly into one of these categories, but there is a trend in the industry towards IT applications supporting multiple integration levels. It is therefore very important to design an integration architecture that can incorporate all four levels of enterprise business integration.

Data Integration provides a unified view of the business data that is scattered throughout an organization. It may be a physical view of data that has been captured from multiple disparate data sources and consolidated into an integrated data store like a data warehouse or operational data store, or it may be a virtual federated view of disparate data that is assembled dynamically at data access time. A third option is to provide a view of data that has been integrated by propagating data from one database to another, like merging customer data from a CRM database into an ERP database.
Over time, companies are migrating to the philosophy of a service-oriented architecture (SOA) that applies Web protocols and standards for self-identifying application and data end points. This transition is proceeding slowly and selectively as companies are reluctant to abandon proven systems, including mainframes and traditional messaging, which remain mission-critical to business operations.

The four levels of enterprise business integration do not operate in isolation from each other. In a fully integrated business environment interaction often occurs between the different integration levels. In the data warehousing environment, some data integration tools work with application integration software to capture events from an application workflow, and transform and load the event data into an operational data store (ODS) or data warehouse. The results of analyzing this integrated data are often presented to users through business dashboards that operate under the control of an enterprise portal that implements user interaction integration. Both IT staff and vendors now realize that data integration cannot be considered in isolation. Instead, a data integration strategy and infrastructure must take into account the application, business process, and user interaction integration strategies of the organization.

As with any technology, there's convergence in the marketplace: convergence across EII, EAI, ETL and web services. SOA is the architectural icing on the cake. Let's analyze these three Es of data integration in a little more detail.

EII (Enterprise Information Integration) provides an optimized and transparent data access and transformation layer that presents a single relational interface across all enterprise data. It enables the integration of structured and unstructured data to provide real-time read and write access, to transform data for business analysis and data interchange, and to manage data placement for performance, currency and availability.

ETL (Extract, Transform and Load) is designed to process very large amounts of data. It provides a suitable platform for:
- Improved productivity through reuse of objects and transformations
- Strict methodology
- Better metadata support, including impact analysis

EAI (Enterprise Application Integration) provides message-based, transaction-oriented, point-to-point (or point-to-hub) brokering and transformation for application-to-application integration. The core benefits offered by EAI are:
- A focus on integrating both business-level processes and data
- A focus on reuse and distribution of business processes and data
- A focus on simplifying application integration by reducing the amount of detailed, application-specific knowledge required by users

EII, ETL and EAI differ along two axes: real-time versus batch integration, and integration of data versus integration of applications. So we need to judge our needs carefully and identify a matching solution. For organizations that need real-time data integration, EII fits the bill. For those that require batch data integration, ETL is the best bet. And for those that need either batch or real-time application integration, EAI is the most appropriate tool.

But gradually the horizons of these three Es are merging. EAI, ETL and EII can co-exist, and in most of today's organizations they already do. Every organization needs EAI so that its various data entry systems, such as inventory, payroll, marketing and operations, can talk to each other. This is followed by ETL, which stores the same data in central repositories from where it can be extracted as required. After all this is done, EII comes into play by providing the decision maker with a customizable view that might extract data from a single database, multiple databases or even OLTP applications.

A classic reference architecture in which all three tools play a part is one where transactional applications are integrated through EAI, data from these applications flows into an Enterprise Data Warehouse (EDW) by leveraging ETL capability, and EII tools then help to combine data from OLTP applications, the EDW, external data repositories and local Excel sheets for business decision making. EAI, EII and ETL complement each other and, when implemented together within an organization's data integration architecture, strengthen the foundations of all decisions and hence promise growth.


ETL in the times to come

by Sweta Gupta

Even after so much has been written, blogged and discussed about the three-letter word ETL, which spans the Extract, Transform and Load of data from varied sources to specific targets, there is still much left to tell. Here is a candid look at the current and future perspective of ETL.

ETL technology has continuously evolved from legacy code generators to proprietary engines and on to the current third generation of ETL, i.e. ELT. This exchange of places between T and L has made all the difference. With the early generations of ETL already discussed in "The Saga of ETL", let's talk about how the letter T in ETL has proved to be the determining factor when it comes to measuring performance, efficiency and ease of use.

The proprietary engine-based ETL tools had an ETL hub server sitting between the source and the target. All data had to be routed through this server, where it was transformed row by row before it reached the target (the warehouse). This made the ETL process slow and inefficient, and with it arose the need for an alternative that would reduce this overhead. The database vendors heard the cry and invested significantly in improving their RDBMSs, adding new functionality so that complex transformation logic could be built inside the database itself, leveraging the power of traditional SQL.

ETL architecture then turned into ELT architecture, in which users are presented with a highly interactive GUI that can generate native SQL to execute data transformations on the data warehouse server. ELT also enabled bulk processing of data after it has been loaded into the target. Performance improved by 1000 times and only increases as the volume of incoming data grows. Since database engines can be both the source and the target, the SQL code can be distributed among them to achieve the best performance. Today's RDBMSs have the power to perform any data integration work. Third-generation ELT tools take advantage of this power by leveraging and orchestrating the work of these systems and processing all data transformations in bulk.

With change being the trend, the needs of business have never been constant. This is an era of real-time applications generating huge volumes of data every millisecond. We live in a browser-centric world today. The next-generation integration technology has to support and service data integration, data warehousing and e-business applications and services. One solution is Enterprise Application Integration (EAI) tools, which respond to the real-time needs of the internet and other applications but lack the extract and load capability of ETL tools. Even DQM (Data Query Management) has not proved to be an ideal solution: though it bypasses the data warehousing architecture and provides real-time data access and integration in heterogeneous DBMS/platform environments, it is not suited to large volumes of data.

So what should we call a complete solution? The answer is a blend of ETL, EAI and DQM tools that routes data to and from information-craving entities based on prescribed business rules. GartnerGroup has named this flexible, scalable and intelligent solution the ILN, or Information Logistics Network.

ETL vendors are now focused on getting their tools to encompass the full range of data integration capabilities needed for the integration and management of business processes and transactions across ERP and CRM systems. The tools cannot afford to stay the same: they need to integrate both structured and unstructured real-time data, and to manage and share data ranging from technical to external industry data. ETL must evolve into an integration technology that solves issues at the data level and beyond.

Nowadays real-time databases are distributed over the web and contain specific information. Hence the challenge for an ILN is to collect the interactions taking place and send them securely to the data warehouse for analysis and action. ILNs can use the underlying software and hardware to leverage parallel processing from source to target. Some ETL vendors are making their tools critical to the process of sharing data, from data warehousing to business-to-business (B2B). They foresee that metadata will eventually be XML-based because, as an interchange format, XML offers a great deal of flexibility.

"ETL for HTML" is a popular phrase used to describe how most of us will access web data. It encompasses Web 2.0 and Enterprise Data Management. Unlike traditional ETL, Web Data Services provide two-way access to data. This means we can leave the data where it resides best and get full programmatic access by using a Web Data Server to wrap applications in standard service APIs such as REST, SOAP or .NET. With the data explosion around us, it becomes impractical to move and synchronize data into one common data repository. The data we need to perform our analysis and drive business decisions will change more and more rapidly. We will need new data sources daily, or at least weekly, to react to the ever-changing business needs of the future. Let's hope the marriage of ETL, EAI and DQM begins a new trend in the world of data integration.
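To make the ETL-versus-ELT contrast discussed in this article concrete, here is a minimal sketch. It assumes a hypothetical sales feed (the rows, table names and cleansing rules are invented for illustration) and uses an in-memory SQLite database to stand in for the warehouse. The first function transforms each row on the "hub" before loading it; the second loads the raw rows first and then lets the target database do the transformation in a single set-based SQL statement. It shows only where the transformation runs, not the performance difference.

```python
import sqlite3

# Hypothetical extracted rows: (region with messy formatting, amount in cents).
SOURCE_ROWS = [(" north ", 1250), ("South", 300), ("north", 450)]

def etl_row_by_row(source_rows) -> sqlite3.Connection:
    """ETL: transform each row on the hub (this Python process), then load it."""
    target = sqlite3.connect(":memory:")
    target.execute("CREATE TABLE fact_sales (region TEXT, amount REAL)")
    for region, cents in source_rows:
        transformed = (region.strip().upper(), cents / 100.0)  # the 'T' step
        target.execute("INSERT INTO fact_sales VALUES (?, ?)", transformed)
    return target

def elt_in_database(source_rows) -> sqlite3.Connection:
    """ELT: bulk-load raw rows, then transform inside the target with SQL."""
    target = sqlite3.connect(":memory:")
    target.execute("CREATE TABLE staging_sales (region TEXT, amount_cents INTEGER)")
    target.execute("CREATE TABLE fact_sales (region TEXT, amount REAL)")
    # 'L' comes first: the raw data lands in a staging table untouched.
    target.executemany("INSERT INTO staging_sales VALUES (?, ?)", source_rows)
    # 'T' comes last: one set-based statement executed by the database engine.
    target.execute(
        "INSERT INTO fact_sales "
        "SELECT UPPER(TRIM(region)), amount_cents / 100.0 FROM staging_sales"
    )
    return target

def totals(conn: sqlite3.Connection) -> list:
    """Summarize the loaded fact table so both approaches can be compared."""
    return conn.execute(
        "SELECT region, SUM(amount) FROM fact_sales GROUP BY region ORDER BY region"
    ).fetchall()

if __name__ == "__main__":
    print(totals(etl_row_by_row(SOURCE_ROWS)))
    print(totals(elt_in_database(SOURCE_ROWS)))
```

Both functions produce the same fact table; the difference is that the second expresses the whole transformation as SQL, which is the part an ELT tool would generate and hand to the database engine.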

Editor: Sweta Gupta Content & Layout Design: Jakkie Swart Assistant editors: Sodyam Bebarta, Sudip Basu, Harapriya Montry, Jagyanseni Das, Banita Rout & Gitanjali Kahaly.
