
Data Integration: What’s Next?

Neil Raden
May 1, 2008

530 Showers Drive #7348


Mountain View CA 94306
info@smartenoughsystems.com
www.smartenoughsystems.com

© 2008 Smart (enough) Systems, LLC.


Duplication only as authorized in writing by Smart (enough) Systems LLC

Sponsored by expressor software corporation


www.expressor-software.com
ABOUT SMART (ENOUGH) SYSTEMS

Smart (enough) Systems LLC helps clients automate and improve the
decisions underpinning their day-to-day business operations.

Smart (enough) Systems LLC is the only full-service, vendor-neutral
company focused on the enterprise decision management marketplace,
providing research, advisory services and implementation.

The company's singular focus on the automation and improvement of
operational decisions allows it to bring multiple classes of technology to
bear on solving business problems such as business agility, operational
business intelligence, analytic competition and business process
management.

ABOUT THE AUTHOR

Neil Raden is the co-founder of Smart (enough) Systems LLC (SES). Prior to founding SES, Neil
was the founder of Hired Brains, Inc., a consulting, systems integration and implementation
services firm specializing in Business Intelligence, Decision Automation and Business Process Integration for
clients worldwide. Raden is an active consultant and widely published author and speaker. He
is the coauthor of Smart (Enough) Systems (Prentice-Hall, 2007) and welcomes your
comments at neil@smartenoughsystems.com.

Table of Contents

Executive Summary
Defining DI: Then and Now
    A Little History of DI
    Better, but not Enough
    Managing from Scarcity
Next Generation DI Qualities
    Semantic Rationalization
    Abstraction
    Scaling
Factors Driving the Need for Better DI Solutions
    Growing Volumes
    New Applications
    Need for Speed
Cost and Benefit Factors
    Implementation Productivity and Reuse
Conclusion

“Everybody gets so much information all day long that they
lose their common sense.”
-Gertrude Stein

Executive Summary
No one can deny that the conduct of business today is remarkably different - faster, more
competitive and more diverse than it was a decade ago. Whether this can be attributed to
market forces and innovation or, rather, to the advance of technology, is an open question. It
does seem, however, that technology has played the leading role and business has had to adapt
to keep up. Paraphrasing the late Jerry Garcia speaking about his band's enduring popularity, a
business might say of how it has changed, "We didn't invent it. It invented us."

Two factors have played a leading role in this process: Moore’s Law and ubiquitous network
bandwidth. It is clear that both the absolute and relative cost of computing components (on a
unit basis) have declined drastically. Gordon Moore predicted 40 years ago 1 that
microprocessor components would drop in price 25% a year and double in capacity every 18-24
months. After 40 years, this trend shows no signs of abating. Similar gains have been achieved
in storage devices, and networks have actually exceeded these predictions. But these gains
would be just islands of efficiency were it not for the widespread adoption of the Internet as the
means for conducting business.

In the same way that bullet trains and high-definition TV need new infrastructure to perform at
their design capacity, information technology innovation makes demands on physical and
intellectual infrastructure too. This exposes some painful deficiencies in Information
Technology (IT) departments: new technology challenges existing methodologies, and roles
need to be revised. Not only do new tools need to be learned, but new ways of thinking about the
problems must be adopted as well. The sooner the "ah ha!" effect sets in, when practitioners realize
the efficiency they can gain by rearranging their mental models and modifying their work
habits, the sooner the return on investment can be realized.

Nowhere is this more evident than in matters concerning data, where promising technology
implementations can quickly flounder and fail. Existing tools and techniques are too dependent
on programmers manually inspecting and coding transformation rules, on data stewards
arbitrating the conflicting uses of data and cataloguing its semantics, and on DBAs constantly
modifying physical artifacts of databases to accommodate each new discovery. Data integration
(DI) efforts typically consume a significant part of the budget in IT implementations,
particularly data warehouses. While it may be economically feasible, for example, to install the
hardware and software backbone to perform real-time decision-making applications such as
credit fraud detection, the costs of the associated DI development and maintenance may
overwhelm the budget.

Missing from today’s DI tools are two key capabilities that can reduce or even eliminate the
slow work and redundancy – semantic rationalization and abstraction.

1 What Moore actually stated was that the number of transistors that can be inexpensively placed on an
integrated circuit is increasing exponentially, doubling approximately every two years.

Computers are adept at sorting, searching and comparing things at speeds and volumes that
humans can hardly imagine. Semantic rationalization capabilities detect the identity correlation
between various sources of data based on algorithms the computer can process at tremendous
speed. This is a role for computers and software, not people who work at a comparatively
glacial pace. In addition, each new project or subject area can start with what is already known
in a metadata repository, applying it to both sources and targets of information. The linchpin of
this capability is abstraction.

A formal definition of data abstraction is the enforcement of a clear separation between the
abstract properties of data and the concrete details of its implementation. What this means is that
processes that require access to the physical representation of data, for example, rows from a
database table, can refer to the data through its meaningful abstracted data properties without
knowledge of its location, type or access method. Why this indirection is useful may not be
immediately apparent, as it adds a step. However, when many different processes, applications
or services read or write that data, its physical characteristics can be modified without affecting
any of the programs. Perhaps more importantly, the frequent churn in definitions – brought
about by changing business needs, mergers of companies or any number of other factors – can
be dealt with more quickly and more accurately.
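
To make the indirection concrete, here is a minimal Python sketch of the idea; all table, column
and term names are hypothetical, and a real DI tool would of course do far more.

```python
# Minimal sketch of data abstraction: consumers ask for data by business
# term; only the mapping layer knows the physical table and column.
# All table, column and term names here are hypothetical.
import sqlite3

# Logical-to-physical mapping: business term -> (table, column).
PHYSICAL_MAP = {
    "customer_identifier": ("crm_accounts", "custno"),
    "claim_incurred_date": ("claims_history", "clm_inc_dt"),  # another hypothetical mapping
}

def fetch(conn, business_term):
    """Return all values for a business term; the caller never sees table or column names."""
    table, column = PHYSICAL_MAP[business_term]
    return [row[0] for row in conn.execute(f"SELECT {column} FROM {table}")]

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE crm_accounts (custno TEXT)")
    conn.executemany("INSERT INTO crm_accounts VALUES (?)", [("C-1001",), ("C-1002",)])
    # If the DBA later renames crm_accounts.custno, only PHYSICAL_MAP changes;
    # every consumer of "customer_identifier" is untouched.
    print(fetch(conn, "customer_identifier"))  # ['C-1001', 'C-1002']
```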

As a result of all of these factors, DI tools and techniques, most of which were developed based
on conditions prevalent over a decade ago, are in dire need of some fresh thinking.

Defining DI: Then and Now


Data Integration is a collection of technologies that support both business intelligence (BI)
applications and operational systems, for the most part, as separate applications. BI-oriented
tools are most often referred to as ETL for the extract, transform and load processes they
support, moving data from original source systems to persistent repositories such as data
warehouses, data marts and other downstream or separate locations. DI for operational
systems, because its service requirements are more immediate, includes integration brokers and
enterprise application integration (EAI) interfaces. These tools are either incorporated into the design of new
programs or "wrapped" around existing ones, but their focus has always been limited: integrating
and conforming only the data needed to support the transactions that were part of the target
applications.

The differences between the two flavors of DI are manifested in the way they approach latency
and persistency. ETL has historically been a bulk process since data warehouses are typically
read-only and not used for up-to-the-minute requirements. Conversely, operational
requirements for DI have been immediate and transient. A request for data is made in-process
and the transaction is closed.

Going forward, these distinctions are beginning to blur. Analytical applications are becoming
much more time-sensitive and operational systems are becoming smarter, requiring some
analytical insight.

A Little History of DI
From the time the first IT department wrote its second program, there was a need for data
integration. Since programmers tend to regard programs written by other programmers as, at
best, inadequate, organizations often found themselves saddled with an inventory of
incompatible systems, each with its own way of defining
business objects, concepts and rules. As technology platforms changed, the programs became
not only logically incompatible; they were physically incapable of communicating without
developing yet another program to facilitate data movement between the original ones. When
organizations acquired other organizations, the problem multiplied. When the Internet opened
up business to the outside world, the problem multiplied geometrically. IT departments spent
an inordinate amount of time building and maintaining these one-to-one interfaces. 2

Before the popular conception of data warehousing arose 15 or so years ago, building
applications based on data culled from various sources was a difficult, expensive, time-
consuming and error-prone process. There was always a scarcity of people who had a solid
understanding of the physical artifacts in sourcing application systems; documentation was
poor or missing; and the systems themselves were devilishly difficult to access. The semantics
of the source systems' data were more often recorded in the "go-to guy's" brain than in
any documentation, electronic or otherwise.

With data warehousing, the problem only grew worse. The presumption was that a single
extract from each source system would replace a mountain of redundant and diverse
“interfaces” to support downstream reporting applications. Granted, once a data warehouse
was in production, sourcing integrated data from it for reporting and analytical applications
became somewhat easier, but only somewhat. Data still had to be integrated and moved into a
data warehouse common model first (an intermediate step), before it could be again
transformed into feeds for the portfolio of existing applications. Initially, this job was handled
by programmers who had been given source and target specifications from data modelers and
data analysts. The only difference between this arrangement and the pre-data warehouse world
was that there was now a lot more to do.

Writing programs is a slow process and maintaining source code is tedious. At first, some
software products arose to assist this effort, but they were strictly source code generators. They
were not present at runtime, nor did they provide error-logging, metadata, reusable
components or any of the other essential elements of an application infrastructure such as
security, role definition, load balancing, etc. In short, they offered some productivity in the
source code production effort, but in something as complicated as data integration, that
amounts to only a small fraction of the effort.

2 They still do. A recent Aberdeen study reported that the average Fortune 500 Company spends 76% of
its IT budget on maintenance, not new initiatives.

Within a few short years, a new generation of tools arrived, now defined by the market analysts
as Extract, Transform and Load (ETL) tools, and based on SQL structures rather than code
generation. Some of these tools began to introduce the concept of metadata to the discipline
and eventually added capabilities touted to provide maintenance and reuse advantages. This
has been the state of the art for a decade, but it is no longer adequate.

Better, but not Enough


Nevertheless, populating data warehouses was, and still is, fraught with problems. Consider as
an example life insurance valuation.

For regulatory purposes, life insurance companies are required to periodically value their entire
portfolio of risks and assets in order to demonstrate solvency. There are different techniques
based on statutory requirements, but in essence, a great deal of data is gathered about every
policy, every coverage, every insured, premiums and claims and matched against all of the
various assets the company keeps as reserves. Each item is valued according to actuarial
principles and assumptions about mortality, interest rate risk, etc. Then cash flow scenarios are
developed and analyzed to ensure that asset maturities are always adequate and timed to cover
expected outlays.

This is an especially difficult process because data is often pulled from dozens of separate
application systems, some of them quite old. Integrating this data can be arduous because
quality controls are often applied loosely over time, naming conventions change, data is passed
between systems but not identically, and even the semantics change over time (e.g., a column
stored the claim date until 1994, but the claim-incurred date from that point on). It's easy to see that the
discovery period, finding the data and understanding its meaning sufficiently to map it to the
target application, is long and tedious. An ETL project requires a team of players whose efforts
may overlap and even conflict. Ongoing maintenance and enhancement, as team members
rotate out and new players enter, can result in even lower productivity. Without comprehensive
lifecycle management the risk of underachieving, or even failure, is high.

Managing from Scarcity


The Y2K saga is the most glaring example of the managing-from-scarcity mindset. Computing
resources used to be so scarce and so expensive that omitting the "19" from date fields to save
two bytes was a nearly universal trick. In 2008, a programmer is as likely to think about saving two bytes as
she is saving two gigabytes. In the 25-year period from 1980 to 2005, the density of hard disk
drives grew five orders of magnitude: from one megabyte to 100 gigabytes (see chart below). If
automobile technology progressed at the same pace, a new Ferrari would be capable of going
from 0-60 miles an hour in five one-hundred-thousandths of a second, and getting 50 BILLION
miles per gallon (which probably wouldn't take that long with a top speed of 1/6 the speed of
light).

[Chart: hard disk drive density growth, 1980-2005. Source: San Jose Research Center – Hitachi Global Storage Technologies 3]

Storage and memory weren’t the only problems. A good programmer was one who wrote tight,
efficient code that could execute in a small address space. This usually meant keeping features
to a minimum, and most importantly, optimizing the code for the application in isolation,
without consideration of the other applications in the portfolio. Obviously, developing efficient
code is still an important quality, but it’s no longer the most important one. Today, when
Fortune 500 companies spend more than 75% of their budgets on maintenance 4 and are
massively constrained in their ability to create new applications, speed is king. Computing
resources are not free, but they are inexpensive relative to missed regulatory requirements,
competitive problems or lost partnering opportunities.

Next Generation DI Qualities


ETL is extremely useful for designing and executing the physical movement and transformation
of data artifacts. However, for the most part, ETL does not provide the actual rules for this
work. Instead, the work is mostly done by hand, because rules and semantics are not found with the
source data; they must be discovered and validated by people. The inherent problem, though, is
that the imprecision of these specifications, and of the subsequent designs of these processes, stands in
stark contrast to the formality needed in the actual applications.

3 http://www.hitachigst.com/hdd/hddpdf/tech/chart02.pdf

4 Estimate based on industry sources, including Business Technographics® November 2004 United States
SMB Benchmark Study, Forrester Research, Inc., November 2004; and Application Portfolio Management
Tools, Forrester Research, Inc., April 12, 2004.

A next generation DI product has to perform predictably (and well) when accessing data
volumes that range in size from a single token to a massive, time-critical, multi-source pile of
data. The mixed use and workload characteristics of converged analytical and operational
processing demand it. In addition, metadata cannot be just a word. Metadata discovery, on the
one hand, and metadata-driven operation, on the other, are the keys to maintainability,
usability and agility. An active metadata repository enables discovery and rationalization of
data, dependency analysis and indirection/abstraction of the data. Naturally, features should
include data quality functionality such as the usual matching, cleansing and
measuring/metrics. In a high-performance environment, hand-offs and interfaces slow things
down, so it is essential that the next-generation DI tool incorporates many of the diverse
features now found in separate offerings.
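
For instance, the data quality functionality mentioned above can start with simple profiling
metrics. The following Python sketch is illustrative only; the column name, sample values and
null conventions are invented.

```python
# Tiny data-quality profiling sketch: null rate and duplicate rate for one
# column of a candidate source. Column name and sample values are invented.
from collections import Counter

def profile(values):
    """Return simple quality metrics for a list of raw column values."""
    total = len(values)
    nulls = sum(1 for v in values if v in (None, "", "N/A"))
    non_null = [v for v in values if v not in (None, "", "N/A")]
    dupes = sum(c - 1 for c in Counter(non_null).values() if c > 1)
    return {
        "null_rate": round(nulls / total, 3) if total else 0.0,
        "duplicate_rate": round(dupes / total, 3) if total else 0.0,
    }

custno_values = ["C-1001", "C-1002", "", "C-1001", None, "C-1003"]
print(profile(custno_values))  # {'null_rate': 0.333, 'duplicate_rate': 0.167}
```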

Semantic Rationalization
All organizations are different, yet they are similar in many areas. Within
an organization, different applications may refer to a common element, such as Federal Income
Tax, with very different names, from Fed_Inc_Tax to Tx_Fed to TDFIC. While it may be unlikely
for these names to repeat across systems within a small organization, they are increasingly
likely to repeat when the population of systems examined increases.

Semantic Rationalization vs. Current ETL Practice

Definitions
    Semantic rationalization: rationalized semantic definitions, rules and terminology derived from business usage.
    Current ETL practice: descriptions based on physical schema metadata and data inspection.

Sharing/reuse
    Semantic rationalization: a sharable model linking and reconciling multiple existing collections; reconciliation across collections for sharing via a graph-based model.
    Current ETL practice: multiple collections, each specific to its application and/or data source; sharing requires consolidation into a single collection.

Mediating differences
    Semantic rationalization: standardize where possible, then use the sharable model to reconcile remaining different definitions.
    Current ETL practice: "Single Version of the Truth"; attempts to reach agreement on a single definition.

Updating/modifying data classifications
    Semantic rationalization: dynamically, via easily managed meta-models of business definitions.
    Current ETL practice: ETL scripts and SQL.

Business rules
    Semantic rationalization: rules included in the model; managed mostly by business users.
    Current ETL practice: managed mostly by IT, typically with different tools; typically requires explicitly coding all rules.

Any useful next-generation ETL tool should come with the built-in ability to identify these tens
(or hundreds) of thousands of alternate spellings. In this way, it is possible for the tool to
provide the basis for mappings by rationalizing the fields using semantic techniques to match
them or find matches based on context. When a hose kit is labeled as a “kit_4_rplcmt,” the
semantic rationalization process combines the information with other knowledge to determine
that the part is a “motorcycle_hose_replacement_kit.” With the volumes and time constraints
present today, there simply aren’t enough people to do this work.
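
As a toy illustration of the idea, the Python sketch below proposes a canonical business term for
a physical field name by fuzzy-matching its tokens against each term's vocabulary. The field
names, vocabularies and scoring are invented and far simpler than what a production matcher
would do.

```python
# Toy illustration of semantic rationalization: propose a canonical business
# term for a physical field name. Terms, vocabularies and fields are invented.
import difflib

CANONICAL_TERMS = {
    "federal_income_tax": {"fed", "federal", "income", "tax", "fit"},
    "motorcycle_hose_replacement_kit": {"motorcycle", "hose", "replacement", "kit"},
}

def tokens(name):
    """Split a physical name like 'Fed_Inc_Tax' into lowercase tokens."""
    return [t for t in name.lower().replace("#", " ").replace("_", " ").split() if t]

def rationalize(physical_name):
    """Score each canonical term by fuzzy overlap with the field's tokens."""
    best_term, best_score = None, 0.0
    for term, vocab in CANONICAL_TERMS.items():
        score = 0.0
        for tok in tokens(physical_name):
            # Closest match of this token within the term's vocabulary.
            close = difflib.get_close_matches(tok, vocab, n=1, cutoff=0.5)
            if close:
                score += difflib.SequenceMatcher(None, tok, close[0]).ratio()
        if score > best_score:
            best_term, best_score = term, score
    return best_term, round(best_score, 2)

for field in ["Fed_Inc_Tax", "kit_4_rplcmt", "Tx_Fed"]:
    print(field, "->", rationalize(field))
```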

Beyond removing some work from the DI effort, semantic rationalization leads to gains in
visibility, which in turn support conformance, governance (when rules are associated with
semantic definitions), and productivity and simplification at the development level, because
semantic rationalization provides the abstraction that drives these benefits.

Abstraction
Abstraction, in the context of data integration, is the separation of the meaning of things
from their physical implementation. It also has an active aspect: one should be able to interact
with the abstraction without regard to the form or location of the physical artifacts it represents.
The problem is that the meaning and relationships of data can be implemented in many
different ways simultaneously, and it is too inefficient to have to cater to each one. It is more
efficient to present these meanings to people and processes than to expose database schemas,
application code and so forth.

Creating a usable abstraction for data integration requires semantically rationalized business
terms as well as abstracted transformation and business rules (see chart below). Like current
ETL tools, this facilitates the creation of both persistent and transient transformation and
integration of data, but it has an added value. The abstraction provides the means for
applications that use the data, such as reporting, operational intelligence, real-time BI and even
decision automation systems, to address the more understandable and enduring semantically
abstracted business terms rather than the actual data. The major benefit of this is that non-
technical people are able to interact with the abstraction in multiple ways, including the
development of reusable business rules.

A concrete example makes this clearer: a customer is represented in a semantic dictionary
as "customer_identifier." In various artifacts, this may be represented by different names such as
custno, cust# or cnum. In some cases, it may only be represented implicitly by combinations
with other values, or it may not be represented at all. If you consider the activities above the
abstraction layer (meta-models), such as BI, operational applications or even Master Data
purposes, it is only necessary to keep a consistent definition of the abstract term
customer_identifier. No matter how much the individual instances of customer_identifier change in
source systems, no matter how much error and duplication exist in the artifacts below the
abstraction (including persistent, transformed data such as a data warehouse), nothing is
disturbed or needs introspection or maintenance. When new instances of customer_identifier are
encountered in new sources or targets, everything we know about customer_identifier (the
relationships and rules) is automatically available to the architects and analysts during design
and to the developer during assembly and implementation.
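
The following Python sketch shows the kind of behavior described above; the repository class,
the sample rule and the source names are invented for illustration, not any particular product's
API. Once a rule and related terms are recorded against customer_identifier, a newly registered
physical instance inherits them with no additional coding.

```python
# Sketch of an active metadata repository: rules and relationships are
# attached to abstract terms, so new physical instances inherit them.
# All names and the sample rule are invented for illustration.

class Repository:
    def __init__(self):
        self.terms = {}  # abstract term -> metadata dict

    def define_term(self, term, rule=None, related_to=()):
        self.terms[term] = {"rule": rule, "related_to": list(related_to), "instances": []}

    def register_instance(self, term, system, column):
        """Map a newly discovered physical column onto an abstract term."""
        self.terms[term]["instances"].append((system, column))

    def validate(self, term, value):
        """Apply the term's rule to a value, regardless of which source it came from."""
        rule = self.terms[term]["rule"]
        return rule(value) if rule else True

repo = Repository()
repo.define_term("customer_identifier",
                 rule=lambda v: isinstance(v, str) and v.strip() != "",
                 related_to=["customer_name", "customer_segment"])

# Existing sources already mapped to the abstract term.
repo.register_instance("customer_identifier", "billing", "custno")
repo.register_instance("customer_identifier", "crm", "cust#")

# A new source appears; one registration and everything known about the
# term (rule, relationships) applies to it automatically.
repo.register_instance("customer_identifier", "web_orders", "cnum")
print(repo.validate("customer_identifier", "C-1001"))   # True
print(repo.terms["customer_identifier"]["related_to"])  # ['customer_name', 'customer_segment']
```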

Scaling
Although ETL is a smooth, predictable, batch process today, system resources can rarely be
accurately estimated, provisioned and managed in a straightforward way. In fact, applications
are often run in isolation on dedicated hardware, since the most "scalable" tools today are architected
so that all available resources are applied to the problem in order to gain the greatest possible
throughput. Managing resources across multiple parallel applications has been one of the
motivating factors for the popularity of grid architectures. But the whole process of
transforming data is changing, and the size and timing of loads are becoming far less
predictable. When data warehousing was a fairly static enterprise (fixed schema), requirements
for new data were controllable. In other words, a critical requirement would not arise overnight
for a few terabytes of data never mapped before, with continuous updating. The ETL
mechanism, whether an engine (a computing platform that processes data) or an agent (a
service that orchestrates the processing of other computing platforms), was not likely to scale
smoothly by one or more orders of magnitude, unless it was configured for the infrequent peak
periods with resources unused in the interim.

Real-time data has a short shelf-life. It may have lingering value as historical information for,
say, analysis of historical trends or spotting patterns and anomalies in predictive mining.
However, when it is used for the purposes that demand it to be fresh and current, it has to
arrive in a timely fashion. For this reason, it is very likely that the same information will flow to
more than one application. In other words, the mapping will be many-to-many. If you assume
you have one terabyte of transaction information to feed to downstream applications in real-
time, the load on the ETL systems may very well be a multiple of that.
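
A rough Python sketch of the many-to-many point (the subscriber names are made up): when
the same stream must reach several downstream applications, the volume the integration layer
delivers is the input volume multiplied by the fan-out.

```python
# Rough sketch of many-to-many delivery: the same stream is pushed to every
# subscribing application, so DI throughput is input volume times fan-out.
# Subscriber names are made up.
subscribers = {
    "fraud_detection": [],
    "operational_bi": [],
    "history_archive": [],
}

def publish(record):
    """Deliver one incoming record to every downstream subscriber."""
    for queue in subscribers.values():
        queue.append(record)

incoming = [{"txn_id": i, "amount": 10.0 * i} for i in range(1000)]
for rec in incoming:
    publish(rec)

delivered = sum(len(q) for q in subscribers.values())
print(f"{len(incoming)} records in, {delivered} records delivered")  # 1000 in, 3000 delivered
```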

Factors Driving the Need for Better DI Solutions


Early generation ETL tools were a huge improvement over hand-coding transformations for
data warehousing and BI applications, and integration brokers and EAI provided a similar
boost to distributed applications, especially for e-commerce. Even though these may seem like
recent innovations, 10 years is a very long time. Incremental efficiencies in the use of new
technologies eventually fall behind as the purely technical component of the advantage matures,
and a new approach to using the technology is needed. Given the staggering volumes of
information and the increasing diversity in data types (both structured and unstructured) that
need to be handled today, the new kinds of applications that arise and the speed of decisions
that are needed demand a new round of innovation in DI.

Growing Volumes
Why is the amount of data growing so rapidly? One answer lies in the phenomenon that was
written off (prematurely) after the dotcom bust – the externalization of business as a result of
the Internet. Before the Internet, networking between businesses was the domain of
organizations that could afford proprietary networks and EDI projects, and it was very
expensive. That has all changed, and the Internet has enabled the speed and volume of business data traffic to increase
dramatically. The cost for conducting business this way has likewise fallen and, of course, the
reach of connected businesses is virtually limitless. Combined, these factors allow even the
smallest organizations to participate - greatly increasing not only the volume of data, but the
disparity of it. The result is an explosion in the number of sources, languages, and classification
schemes well in advance of any agreement on ways to standardize them. It is not the inability to
move or map this data that interferes with smoothly processing the deluge; it is the lack of
understanding that renders it unusable.

Beyond the Internet, government regulatory and compliance requirements have also forced
businesses to retain data not previously considered, such as email, logs, and unstructured
material. Seeking out the increasingly small margins required by competitive business
pressures has brought data mining technologies back into vogue, typically requiring vast
amounts of digestible data.

Before the Internet, business-to-business transactions were orderly processes, typically designed
around a one-to-one relationship where both partners understood the shared handshake, or a
one-to-many relationship, where one partner was in control of the process and the other players
had to conform to the structure and controls set down. In today’s connected businesses, many-
to-many relationships rule, where the standards belong to everyone and no one.

Even when externalized processes are more or less in sync, integrating incoming data with
legacy systems may still be required. In addition, new issues and opportunities can arise
spontaneously that call for integrating data from partners’ legacy systems, and there isn’t time
to standardize the data interchange. A new business opportunity, a joint-venture, a new market,
a new country: how can that be handled? For example, suppose a partner wishes to shift
sourcing to Asia, where the operation uses Baan instead of SAP, or even a locally-developed
package?

Of course, the aforementioned effects of Moore's Law are at play here, too. When enterprise
disk storage that cost $2.00 per megabyte in 1996 now costs less than a penny, it's not unreasonable to assume
that the increase in data volumes has been as much a push from new business conditions as a
pull from sheer capacity. Nature abhors a vacuum.

Richard Winter 5 is very specific about the growth trends in data:

“I believe that virtually all organizations are facing continued rapid growth in data volumes –
typically 1.5x to 2.5x per year – and for business reasons that are more or less inescapable.
Executives are keenly aware of this ongoing growth already as they grapple with its year-to-year
budget effects. However, at any given point in time, I believe the more profound effects are in the
5-year time frame. In that time frame, the growth factors are in the 10x – 100x range.”

Pushing this phenomenon are data practices borne of business requirements such as longer data
retention periods, full atomic detail of integrated historical data, unstructured, image, location
and sensor data. One data warehouse appliance vendor 6 announced in a press release that its
upper limit of storage grew from six terabytes in 2002 to 100 terabytes in 2005 and to 1,000
terabytes in 2008.

5 Richard Winter is the President of Winter Corporation, specializing in large scale data management,
writing at www.B-Eye_Network.com, April 10, 2008
6 According to Jit Saxena, chairman and CEO, Netezza, "When we launched the Netezza appliance in
2002, our largest system capacity was six terabytes. In 2004, we increased the maximum capacity of our
systems to 27 terabytes, and in 2005 we increased the top end again, to 100 terabytes. Now with this
expansion (to 1000tb), we have a very broad range of high-performance system options to mirror the
needs of our customers as they manage and analyze their growing data volumes.”

New Applications
With so much data available, and economically feasible to process, new ways of doing business
(or conducting government) naturally emerge. For example, a consumer products company
may have used focus groups and third-party market data to understand pricing of its products.
The initiation of a pricing study, subsequent gathering of information and analysis and
discussion might take a year or longer. Today, consumer product companies routinely gather
sell-through information directly from their retailers, and make micro-decisions on pricing and
promotions in near real-time. Without Moore's Law, the Internet, and new approaches, this
would not be possible.

There are thousands of examples like this, and thousands more not yet conceived. The
combination of factors (processing all the data, not a sample; the computing power to process it;
the storage capacity to gather it; and the bandwidth to move it or just use it in place) opens new
ways of thinking about data and new applications to use it and reuse it. The net effect is that the
gap between analytical and operational systems, which have been separated for so long while
managing from scarcity, is disappearing and all of our tools, methodologies and concepts are
showing their age.

A tangential aspect of this is that an ETL tool must be as capable of orchestrating the process at
run time as it is at moving data within the enterprise. In fact, more than just run time, it has to
support the entire lifecycle of the process, including design and developer tools, optimization
and intelligent automation of as many steps as possible.

Need for Speed


Data warehousing and BI have operated under the “why do today what can be put off until
tomorrow” principle. There wasn’t much urgency when the output of the applications, reports,
would languish for hours or even days before people read them. The requirement for high
throughput was generally confined to off hours and limited to a few applications within large
companies. All that is changing now. Whether the need to do things faster arose because
technology was available to do it or because we needed to catch up to competitors who did it
first, is a topic of debate. What is not debatable is that tolerance for latency in operations is
disappearing. Monthly sales reports from distributors have given way to RFID tracking in real
time; truckers and ships at sea transmit telemetry that is analyzed for instantaneous logistics
planning. In a sense, all data is in motion until it disappears or is placed in a folder. Previously,
we dealt with the folders because it wasn’t possible to catch much data on the fly. But now that
technology has made that possible, new applications for fresh, streaming data are emerging.

Cost and Benefit Factors


What drives up the cost, duration and risk of any IT implementation? Perhaps more than
anything else, it is hand-offs, disconnects and interfaces. Current ETL practice is a serial effort
involving a number of individuals, none of whom has the whole picture. The subject matter
expert, or SME, is the person who has an experiential and/or anecdotal knowledge of the
usages and semantics of the data in question. The SME has to communicate this to designers
with lists and time-consuming explanations, only to have the information rendered into
technical documents that rarely capture the entire nuance of the information conveyed. In yet
another hand-off, this design is given to the developer(s) who are tasked with implementing a
process from information that is already twice removed.

Early data warehousing efforts brutally demonstrated that successful ETL is rarely a “boil the
ocean” effort, attempting to map and transform all of the data at once. It is broken down into
subject areas to limit the complexity of each step. Unfortunately, with each new step or subject
area, a greater proportion of the data needed to populate the target schema has already been
mapped. This might seem like an advantage, but it is often the opposite. When certain
information needs to be shared across subject areas, earlier mappings often turn out to be
insufficient or even incorrect in the wider context. Coupled with this problem is that team
members, who have a recent familiarity with the material, often move on to other projects and
their implicit knowledge is lost. Reuse of existing designs and procedures is critically important,
but support for reuse in current tools is often limited to one-time cut and paste with no
enduring relationship between the modules.

Next-generation DI tools must provide superior features for development productivity and
reuse.

Implementation Productivity and Reuse


In a typical DI development group, there are data stewards and DBAs working together with DI
tool experts. In medium to large projects, it is a
virtual, conglomerated soup-to-nuts group of people
who usually have never worked together before and
approach the work from very different perspectives,
if not cultures and time zones. In well-run efforts,
these differences dissipate in time and the team finds
a way to work cooperatively. However, once an initial
effort goes into production, there is often large-scale
turnover in the group and the shared understandings
are often lost. The human element is often difficult to
manage, and, as a fallback, roles at different stages
are managed by the software infrastructure, which
tends to simplify the process by segregating efforts.
The net effect is that it is difficult to develop synergy
in the team and, instead, the result is the sum (or less
than the sum) of the individual efforts. In the worst
case, the whole effort generates even more ambiguity and complexity than it replaced.

What is lacking in current tools is an understanding of the entire lifecycle of the project, from
initial conception to routine maintenance and enhancement. One approach is to lower overall
project costs by differentiating the work so that the tasks are matched to skill level and cost. To
be successful, this approach requires a solid foundation of consistent interfaces between the
work groups. Semantic rationalization creates an abstraction: an active metadata repository
that can represent abstracted transformation rules. A single metadata facility covers every
aspect of the tool: project management, design collaboration and reuse can all be managed
centrally and effectively. Another vital feature that is needed is the ability for the tool to
introspect metadata. This supports drawing conclusions and informing about dependencies,
conflicts and other lurking problems that befall development and, perhaps to an even greater
extent, maintenance and enhancement. Current tools have limited capabilities in these areas,
which results in much longer develop/test/production cycles and higher bug rates.

Reuse has been an elusive goal for a long time. Being able to apply pieces of applications to
other uses repeatedly carries the promise of appreciable gains in efficiency. However, reuse can
come in a number of different guises, and some are much more useful than others. For example,
designing components whose operations are either general in nature or configurable to a broad
range of effects would appear to be helpful. The problem is that most of these "libraries" of reusable
code or components are only reusable manually. Manual reuse requires knowing or being able
to find the objects you need to apply. Its value is dependent on the knowledge and skill of the
individual practitioner, and in most cases the work required to reuse an artifact ends up
outweighing the expected savings.

Computers, on the other hand, are much more efficient at making these connections and
applying them. Once something is understood intellectually, rendered in code or some form of
executable representation, and centrally catalogued, reuse is much more likely to deliver the
expected lift in productivity. In the case of DI, abstraction of the transformation rules provides
the platform for reuse and durability.
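
As a small, hypothetical Python illustration of catalogued reuse: a transformation rule is
registered once against an abstract term, and any dataset whose columns are mapped to that
term picks it up automatically, with no copy and paste. The rule names and field names are
invented.

```python
# Small illustration of automated reuse: transformation rules are catalogued
# once per abstract term and applied to any mapped dataset. Rule and field
# names are invented.
RULE_CATALOG = {
    # abstract term -> transformation applied wherever the term appears
    "customer_identifier": lambda v: v.strip().upper(),
    "claim_amount": lambda v: round(float(v), 2),
}

def transform(dataset, mapping):
    """Apply catalogued rules to a dataset, given its column-to-term mapping."""
    out = []
    for row in dataset:
        new_row = {}
        for column, value in row.items():
            term = mapping.get(column)
            rule = RULE_CATALOG.get(term)
            new_row[column] = rule(value) if rule else value
        out.append(new_row)
    return out

# Two sources with different physical names reuse the same rules.
billing = [{"custno": " c-1001 ", "claim_amount": "100.456"}]
crm = [{"cnum": "c-2002"}]
print(transform(billing, {"custno": "customer_identifier", "claim_amount": "claim_amount"}))
print(transform(crm, {"cnum": "customer_identifier"}))
```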

Conclusion
Current ETL tools offer little in terms of developing and proofing business and transformation
rules. Their functional style is that of a singular effort by someone who has learned to build
procedures and has gained some understanding of the meaning of the data, but is still
predominantly a developer. Neither do these tools provide any assistance with the
development of the rules of transformation. Next-generation DI tools must involve the business
users with direct knowledge of data semantics and business requirements of the DI effort. To
achieve this, they have to do two things very well – simplify the development of rules through
semantic rationalization of source metadata, and provide a capability for people and processes
to interact with the rules and the common business vocabulary across project roles. Innovations
applicable to the entire application lifecycle as well as innovations within individual roles are
necessary for us to keep pace with the escalating demands of the business.

Because data integration has become imperative for businesses of all sizes, lower entry points
for procuring DI technologies and improved pricing structures for high-end businesses are also
needed. Current DI tool pricing follows an enterprise pricing model, meaning very large up-
front costs and hefty annual maintenance fees. Pricing is based on server CPUs, which are
usually over-estimated to avoid running out of resources in the middle of a project and/or costly
and sometimes embarrassing requests to management to quickly upgrade the license. Next-
generation DI tools will need to find a way to rectify this situation to remove this chokehold on
the business. As we know from all performance efforts, removal of a bottleneck will open the
floodgates, allowing increased price performance until we encounter the next bottleneck.
Clearly the time is at hand to embrace the next generation.
