
Tired of the Big Data hype?

Get Real with

SQL on Hadoop
Real-Time

Real Scale

Real Apps

Real SQL

Splice Machine is the real-time, SQL-on-Hadoop database.


For companies contemplating a costly scale up of a traditional RDBMS, struggling
to extract value out of their data inside of Hadoop, or looking to build new
data-driven applications, the power of Big Data can feel just out of reach.
Splice Machine powers real-time queries and real-time updates on both operational
and analytical workloads, delivering real answers and real results to companies
looking to harness their Big Data streams.

Learn more at www.splicemachine.com

CONTENTS

BIG DATA SOURCEBOOK
DECEMBER 2013
From the publishers of Database Trends and Applications (DBTA)

introduction
2   The Big Picture   Joyce Wells

industry updates
4   The Battle Over Persistence and the Race for Access Hill   John O'Brien
10  The Age of Big Data Spells the End of Enterprise IT Silos   Alex Gorbachev
16  Big Data Poses Legal Issues and Risks   Alon Israely

20  Unlocking the Potential of Big Data in a Data Warehouse Environment   W. H. Inmon
26  Cloud Technologies Are Maturing to Address Emerging Challenges and Opportunities   Chandramouli Venkatesan
30  Data Quality and MDM Programs Must Evolve to Meet Complex New Challenges   Elliot King
34  In Today's BI and Advanced Analytics World, There Is Something for Everyone   Joe McKendrick
40  Social Media Analytic Tools and Platforms Offer Promise   Peter J. Auditore
46  Big Data Is Transforming the Practice of Data Integration   Stephen Swoyer

PUBLISHED BY Unisphere Media, a Division of Information Today, Inc.
EDITORIAL & SALES OFFICE 630 Central Avenue, Murray Hill, New Providence, NJ 07974
CORPORATE HEADQUARTERS 143 Old Marlton Pike, Medford, NJ 08055

Thomas Hogan Jr., Group Publisher; 908-795-3701; thoganjr@infotoday.com
Joyce Wells, Managing Editor; 908-795-3704; joyce@dbta.com
Joseph McKendrick, Contributing Editor; joseph@dbta.com
Sheryl Markovits, Editorial and Project Management Assistant; 908-795-3705; smarkovits@dbta.com
Celeste Peterson-Sloss, Deborah Poulson, Alison A. Trotta, Editorial Services
Denise M. Erickson, Senior Graphic Designer
Jackie Crawford, Ad Trafficking Coordinator
Alexis Sopko, Advertising Coordinator; 908-795-3703; asopko@dbta.com
Sheila Willison, Marketing Manager, Events and Circulation; 859-278-2223; sheila@infotoday.com
DawnEl Harris, Director of Web Events; dawnel@infotoday.com

ADVERTISING Stephen Faig, Business Development Manager; 908-795-3702; Stephen@dbta.com

INFORMATION TODAY, INC. EXECUTIVE MANAGEMENT
Thomas H. Hogan, President and CEO
Roger R. Bilboul, Chairman of the Board
John C. Yersak, Vice President and CAO
Thomas Hogan Jr., Vice President, Marketing and Business Development
Richard T. Kaser, Vice President, Content
M. Heide Dengler, Vice President, Graphics and Production
Bill Spence, Vice President, Information Technology

DATABASE TRENDS AND APPLICATIONS EDITORIAL ADVISORY BOARD
Michael Corey, Chief Executive Officer, Ntirety
Bill Miller, Vice President and General Manager, BMC Software
Mike Ruane, President/CEO, Revelation Software
Robin Schumacher, Vice President of Product Management, DataStax
Susie Siegesmund, Vice President and General Manager, U2 Brand, Rocket Software

BIG DATA SOURCEBOOK is published annually by Information Today, Inc., 143 Old Marlton Pike, Medford, NJ 08055.

POSTMASTER
Send all address changes to: Big Data Sourcebook, 143 Old Marlton Pike, Medford, NJ 08055

Copyright 2013, Information Today, Inc. All rights reserved.
PRINTED IN THE UNITED STATES OF AMERICA

The Big Data Sourcebook is a resource for IT managers and professionals, providing information on the enterprise and technology issues surrounding the big data phenomenon and the need to better manage and extract value from large quantities of structured, unstructured, and semi-structured data. The Big Data Sourcebook provides in-depth articles on the expanding range of NewSQL, NoSQL, Hadoop, and private/public/hybrid cloud technologies, as well as new capabilities for traditional data management systems. Articles cover business- and technology-related topics, including business intelligence and advanced analytics, data security and governance, data integration, data quality and master data management, social media analytics, and data warehousing.

No part of this magazine may be reproduced in any form or by any means (print, electronic, or any other) without written permission of the publisher.

COPYRIGHT INFORMATION
Authorization to photocopy items for internal or personal use, or the internal or personal use of specific clients, is granted by Information Today, Inc., provided that the base fee of US $2.00 per page is paid directly to Copyright Clearance Center (CCC), 222 Rosewood Drive, Danvers, MA 01923, phone 978-750-8400, fax 978-750-4744, USA. For those organizations that have been granted a photocopy license by CCC, a separate system of payment has been arranged. Photocopies for academic use: Persons desiring to make academic course packs with articles from this journal should contact the Copyright Clearance Center to request authorization through CCC's Academic Permissions Service (APS), subject to the conditions thereof. Same CCC address as above. Be sure to reference APS.
Creation of derivative works, such as informative abstracts, unless agreed to in writing by the copyright owner, is forbidden.
Acceptance of advertisement does not imply an endorsement by Big Data Sourcebook. Big Data Sourcebook disclaims responsibility for the statements, either of fact or opinion, advanced by the contributors and/or authors.

2013 Information Today, Inc.

The Big Picture


By Joyce Wells

DBTA's Big Data Sourcebook is a guide to the enterprise and technology issues IT professionals are being asked to cope with as business or organizational leadership increasingly defines strategies that leverage the big data phenomenon.

It has been well-documented that social media, web, transactional, as well as machine-generated and traditional relational data, are being collected within organizations at an accelerated pace. Today, according to common industry estimates, 80% of enterprise data is unstructured or schema-less.

The reality of what is taking place in IT organizations today is more than hype. According to an SAP-sponsored survey of 304 data managers and professionals, conducted earlier this year by Unisphere Research, a division of Information Today, Inc., between one-third and one-half of respondents have high levels of volume, variety, velocity, and value in their data, the well-known four characteristics that define big data. The 2013 Big Data Opportunities Survey found that two-fifths of respondents have data stores reaching into the hundreds of terabytes and greater. Eleven percent of respondents said the total data they manage ranges from 500TB to 1PB, 8% had between 1PB and 10PB, and 9% had more than 10PB.

In addition, data stores are growing rapidly. According to another study produced by Unisphere Research and sponsored by Oracle, almost nine-tenths of the 322 respondents say they are experiencing year-over-year growth in their data assets. Respondents to the survey were data managers and professionals who are members of the Independent Oracle Users Group (IOUG). For many, this growth is in double-digit ranges. Forty-one percent report significant growth levels, defined as exceeding 25% a year. Seventeen percent report that the rate of growth has been more than 50% ("Achieving Enterprise Data Performance: 2013 IOUG Database Growth Survey").

Big data offers enormous potential to organizations and represents a major transformation of information technology. Beyond the obvious need to effectively store and protect this data, IT organizations are increasingly seeking to integrate their disparate forms of data and to perform analytics in order to uncover information that will result in their organizations' competitive advantage. What makes big data valuable is the ability to deliver insights to decision makers that can propel organizations forward and grow revenue.

As might be expected, the largest organizations in the SAP-Unisphere study, those with 1,000 employees and up, are engaged in big data initiatives, but many smaller firms are pursuing big data projects as well. More than a third of the smallest companies or agencies in the survey, 37%, say they are involved in big data efforts, along with 43% of organizations with employees in the hundreds. According to the study, three-fourths of respondents have users at their organizations who are pushing for access to more data to do their jobs.

Products to address the big data challenge are coming to the rescue. The expanding range of NewSQL, NoSQL, Hadoop, and private/public/hybrid cloud technologies, as well as newer capabilities for traditional data management systems, presents extraordinary advantages in effectively dealing with the data deluge. But which approaches are best for each individual organization? Which approaches will have staying power, and which will fall by the wayside? As Radiant Advisors' John O'Brien rightly points out in his overview of the state of big data, we are in the infancy of a new era, and moving into a new era has never been easy.

To help advance the discussion, in this issue DBTA has assembled a cadre of expert authors who each drill down on a single key area of the big data picture. In addition, leading vendors showcase their products and unique approaches on how to achieve value and mitigate risk in big data projects. Together, these articles and sponsored content provide rich insight on the current trends and opportunities, as well as pitfalls to avoid, when addressing the emerging big data challenge.

sponsored content

Does Big Data = Big Business Value?

With big data, the world has gotten far more complex for IT managers and those in charge of keeping a business moving forward. So how do you simplify your architecture and operations while raising the value of the innovative tools you've crafted to meet your business goals? With the emergence of simple key/value-type data stores such as MongoDB, Cassandra, social media databases, and Hadoop, data connectivity is evolving to meet requirements for speed and consistency.

AN EXAMPLE
Every year, NASA and the National
Science Foundation host a contest across
the scientific communities, the results
often resonating in both the academic
and business worlds. The latest challenge:
How can organizations pull together all the
right data from a variety of sources before
performing analysis, drawing conclusions
and making decisions? Sounds like big
data, right?
Consider the problem of determining
if life ever existed on Mars. A huge variety
of data collected by the Mars rover is
fed into clusters of databases around the
world. It then gets transmitted as a whole
to a variety of data sets and Hadoop
clusters. What do we do with it? How does
the scientific community organize itself
to deal with this influx?
There are similar examples in every
industry, all leading to key integration
challenges: How do we make dissimilar data
sets uniformly accessible? And how do we
extract the most relevant information in a
fast, scalable and consistent way?
The problems of data access and relevancy
are complicated by three additional data
processing realities:

1. Big data is driven by economics. When the cost of keeping information is less than the cost of throwing it away, more data survives.

2. Applications are driven by data. Big data applications drive data analysis. That's what they're for. And they all have the same marching orders: Get the right data to the right people at the right time.

3. Dark data happens. Because nothing is thrown away, some data may linger for years without being valued or used. This dark data might not be relevant for one analysis, but could be critical for another. In theory and in future practice, nothing is irrelevant.

THE BIG DATA MARKET

According to a recent Progress DataDirect survey, most respondents use Hadoop file systems or plan to use them within two years. Respondents also included Microsoft HDInsight, Cloudera, Oracle BDA, and Amazon EMR in the list of technology they plan to use in the next two years. This indicates growing market awareness that it is now economically feasible to store and process many large data sets and analyze them in their entirety.

The survey also asked respondents to rank leading new data storage technologies. MongoDB and Cassandra have both gained a large foothold. Progress DataDirect will soon be supporting them.
TECHNOLOGY ADDRESSES THE NEED

Market growth and maturation have led to new approaches for storage and analysis of both structured and multi-structured data. Recent breakthroughs include:

Integration of external and social data with corporate data for a more complete perspective.

Adoption of exploratory analytic approaches to identify new patterns in data.

Predictive analytics coming on strong as a fundamental component of business intelligence (BI) strategies.

Increased adoption of in-memory databases for rapid data ingestion.

Real-time analysis of data prior to storage within data warehouses and Hadoop clusters.

A requirement for interactive, native, SQL-based analysis of data in Hadoop and HBase.

As the cost of keeping collected data plummets, new data sources are proliferating. To address the growing need, organizations must be able to connect a variety of BI applications to a variety of data sources, all with different APIs and designs, without forcing developers to learn new APIs or to constantly re-code applications. The connection has to be fast, consistent, scalable, and efficient. And most importantly, it should provide real-time data access for smarter operations and decision making.

SQL connectivity, the central value of our Progress DataDirect solutions, is the answer. It delivers a high-performance, scalable, and consistent way to access new data sources both on-premise and in the cloud. With SQL, we treat every data source as a relational database, a fundamentally more efficient and simplified way of processing data.
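As a purely illustrative sketch of what that looks like in practice, the snippet below queries a document store through a generic SQL/ODBC interface; the DSN, credentials, and table and column names are hypothetical placeholders rather than actual Progress DataDirect driver names.

```python
# Purely illustrative sketch: querying a document store through a SQL/ODBC
# driver so that BI tools and applications see an ordinary relational table.
# The DSN, credentials, and table/column names are hypothetical placeholders,
# not actual Progress DataDirect driver or product names.
import pyodbc

conn = pyodbc.connect("DSN=NoSQLSourceDSN;UID=analyst;PWD=example")
cursor = conn.cursor()

# The driver exposes the underlying collection as a relational-style table,
# so standard SQL (filters, aggregates, joins) works against it.
cursor.execute(
    "SELECT region, COUNT(*) AS order_count "
    "FROM customer_orders "
    "WHERE order_date >= '2013-01-01' "
    "GROUP BY region"
)

for region, order_count in cursor.fetchall():
    print(region, order_count)

conn.close()
```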

PROGRESS DATADIRECT
www.datadirect.com


industry updates

The State of Big Data in 2013

The Battle Over Persistence and the Race for Access Hill
By John O'Brien




Shifting gears into a new era has never been easy during the transition. Only in hindsight do we clearly see what was right in front of our faces, and probably had been the whole time. This is one nugget of wisdom I have been sharing with audiences through keynotes at data warehousing (DW) and big data conferences and at major company onsite briefings. Having been part of the data management and business intelligence (BI) industry for 25 years, I have witnessed emerging technologies, business management paradigms, and Moore's Law reshape our industry time and time over.

Big data and business analytics have all the promise to usher in the information age, but we are still in the infancy of our next era, and frankly that's what makes it so exciting!

In 2013, the marketplace for big data, BI, NoSQL, and cloud computing has seen emerging vendors, adapting incumbents, and maturing technologies as each competes for market position. Some of these battles are being resolved in 2013, while others will be resolved in later years, or potentially not at all. Either way, understanding the challenges on the landscape will assist with technology decision making, strategies, and architecture road maps today and when planning for years ahead.

Two of the more dominant shifts occurring around us this year can be called the Battle Over Persistence and the Race for Access Hill.

The Battle Over Persistence


The Battle Over Persistence didn't just start 5 years ago with the emergence of big data or the Apache Foundation's Hadoop; it's been an ongoing battle for decades in the structured data world. As the pendulum swings broadly between centralized data and distributed, disparate data, the Battle Over Persistence is somewhat of a holy war between the data consistency inherently derived from a singular data store and the performance derived from data stores optimized for specific workloads. The consistency camp argues that with enough resources, the single data store can overcome performance challenges, while the performance camp argues that it can manage the complexity of mixed heterogeneous data stores to ensure consistency.

Decades ago, multidimensional databases, or MOLAP cubes, were optimized to persist and work with data in a way different from how row-based relational database management systems (RDBMSs) did. It wasn't just about representing data in star schemas derived from a dimensional modeling paradigm (both of which are very powerful) but about how that data should be persisted when you knew how users would access and interact with it. OLAP cubes represent the first highly interactive user experience: the ability to swiftly slice and dice through summarized dimensional data, a behavior that could not be delivered by relational databases, given the price-performance of computing resources at the time.

Persisting data in two different data stores for different purposes has been a part of BI architecture for decades already, and today's debates challenge the core notion of transactional system and analytical system workloads: They could be run from the same data store in the near future.

Data Is Data
The NoSQL family of data stores was born out of the business demands to capitalize on the orders of magnitude of data volume and complexity inherent to instrumented data acquisition, first from internet websites and search engines tracking your every click, then from the mobile revolution tracking your every post. What's different about NoSQL and Hadoop is the paradigm on which they are built: Data is data.

Technically speaking, data is free; what does cost money, and what contributes to return on investment calculations, is the cost to store and access data: infrastructure. So, developing a software solution that leverages the lowest-cost infrastructure, operating costs, and footprint was required to tackle the order of magnitude that big data represented, i.e., the lowest capital cost of servers, the lowest data center costs for supplying power and cooling, and the highest density of servers to fit the most in a smaller space. With the "data is data" mantra, we don't require an understanding of how the data needs to be structured beforehand, and we accept that the applications creating the data may be continuously changing structure or introducing new data elements. Fortunately, at the heart of data abstraction and flexibility is the key-value pair, and this simple elemental data unit enables the highest scalability.
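To make the contrast concrete, here is a minimal Python sketch, with purely illustrative field names, of the difference between a fixed relational row and a schema-less key-value record that can pick up new fields as the producing application evolves.

```python
# Minimal sketch of the "data is data" idea: a key-value record carries its own
# structure, so the producing application can add fields without a schema
# change. All field names and values here are illustrative.

# Relational thinking: the schema is fixed up front.
columns = ("user_id", "page", "event_time")
row = (42, "/products/widget", "2013-11-02T10:15:00Z")

# Key-value thinking: each value is a self-describing blob keyed by an ID.
clickstream = {
    "event:0001": {"user_id": 42, "page": "/products/widget",
                   "event_time": "2013-11-02T10:15:00Z"},
    # A later release of the application emits an extra field; nothing breaks.
    "event:0002": {"user_id": 42, "page": "/checkout",
                   "event_time": "2013-11-02T10:16:30Z", "device": "mobile"},
}

for key, value in clickstream.items():
    # Consumers late-bind their own interpretation of each record.
    print(key, value.get("device", "unknown"))

print(dict(zip(columns, row)))
```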

A Modern Data Platform Has Emerged


The Battle Over Persistence principle argues that there are multiple databases (or data technologies), each with its own clear strengths, and each best suited for different kinds of data and different kinds of workloads with that data. For now, the pendulum has swung back into the distributed and federated data architecture. We can embrace the flexibility and overall manageability of big data platforms, such as Hadoop and MongoDB. Entity-relationship modeled data in enterprise data warehouses and master data management fuses consistent and standard context into schemas and supports temporal aspects of reference data with rich attribution to fuel analytics. Even analytics-optimized databases, such as columnar, MPP (massively parallel processing), appliance, and even multidimensional databases, can be combined with in-memory databases, cloud computing, and high-performance networks. Separately, highly specialized NoSQL or analytic databases, such as graph databases, document stores, or text-based analytic engines, have their place, and those workloads can be executed natively in these specialized databases.

Companies and vendors are beginning to accept that there need to be multiple database technologies interwoven together to deliver the much-needed Modern Data Platform (MDP), but keep in mind that the pendulum will continue to swing. It may be 5 or 10 years from now, but some things about technology we know hold true. Computing price-performance will continue as it has with Moore's Law, so we can converge higher numbers of CPU cores in parallel with lower-cost, more abundant memory, faster solid state storage, and higher-capacity mechanical disk drives. Tack on the rate of technology innovation and maturity that is driving big data today, and we could see the capabilities of Hadoop derivatives, MongoDB, or some emerging data technologies eclipse the highly specialized and optimized data technologies being deployed today to meet demands. There are great debates about disparate database ecosystems versus the all-in-one Hadoop; it's simply a matter of timing and vision versus the reality of today's demanding, data-centric environments.

The Race for Access Hill


When you accept the premise of a federated data architecture based primarily on workloads rather than logical data subjects, the next question that arises is, "How do I find anything, and where do I start?" The ability to manage the semantic context of all data and its usage for monitoring and compliance, or to provide users with a single or simple point of access, is the Race for Access Hill.

When you think about the internet, you realize that it's used as a singular noun, similar to how Google has become a verb meaning to search through the millions of servers that comprise the internet. Therefore, if the Modern Data Platform represents all the disparate data stores and information assets of the enterprise in a singular noun form, we need a point of access and navigation. Otherwise, the MDP is simply a bunch of databases.

One major concept at stake for modern data architects in the Race for Access Hill is how to centralize semantic context for consistency, collaboration, and navigation. Previously, in the organized world of data schemas, there were many database vendors and technologies that made data access heterogeneous, but it was still unified SQL data access under a single paradigm. Federated data architectures were predominantly still SQL schema in nature and easier to unify. Today's key-value stores, such as Hadoop, have the ability to separate the context of data, or its schema, from the data itself, which has great discovery-oriented benefits for late-binding the schema with the data, rather than analyzing and designing a schema prior to loading data in as a traditional RDBMS requires.

Centralizing context can be done in a Hadoop cluster's HCatalog or Hive components for semantic integration with other SQL-oriented databases for federation, hence joining the SQL world where possible. (This reminds me of my favorite recent Twitter quote: "Who knew the future of NoSQL was SQL?") Data virtualization (DV) can serve as a single access point for the broad, SQL-based consumer community, therefore becoming the glue of the Modern Data Platform that unifies persistence across many data store workloads. The later addition of HCatalog and Hive to Hadoop also has this capability, but only for the data that can fit this paradigm; MapReduce functionality was designed to enable any analytic capability through a programming model. Other NoSQL data stores, such as graph databases, don't inherently speak SQL, so in order to be comprehensive, an access layer (or point) needs to be service-oriented as well. Consumers will need a simple navigation map that allows them to access and consume information from data services, as well as virtual data tables. The long-term strategy will lean further toward a service orientation more and more over time; however, virtualized data will still be needed for information access situations.
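As a small illustration of the late-binding idea, the following sketch layers a Hive table definition over files already sitting in HDFS so the schema is applied at read time; the path, columns, and delimiter are assumptions for the example, and the statements are printed rather than executed to keep the sketch self-contained.

```python
# Hedged sketch of late-binding the schema in Hive/HCatalog: the files already
# live in HDFS, and the table definition is layered over them afterwards so the
# schema is applied at read time. The path, column names, and delimiter are
# illustrative; in practice these statements would be run through the hive or
# beeline CLI, or a Hive client library.

ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS clickstream (
    user_id    BIGINT,
    page       STRING,
    event_time STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
STORED AS TEXTFILE
LOCATION '/data/raw/clickstream';
"""

query = """
SELECT page, COUNT(*) AS hits
FROM clickstream
GROUP BY page
ORDER BY hits DESC
LIMIT 10;
"""

print(ddl)
print(query)
```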

Competing for the Hill


The resolution for this portion of the Race for Access Hill will be gradual within the coming years; as the need arises, a technology and strategy are already in place for companies to adopt. However, this is not the case with the "hill" portion of the race: Vendors are racing to position their products to be that single point of access (the hill) with compelling arguments and case studies to support them. Aside from the SQL/services centralization of semantic context, the next question becomes, "Where should this access point live within the architecture?"

There are four different locations or layers where centralized access and context could be effectively managed: a continuum between two points with the data at one end and the consumer or user at the other, if you will. Along this continuum are several points where you could introduce centralized access and information context. Starting from the data end, you could make the single point of access within a database; this database could have connections to other data stores and virtualization as the representation for the users. Next could be to centralize the access and information context above the database layer but between the BI app and consumer layers with a data virtualization technology. Third could be to move further along the path toward the user into the BI application layer, where BI tools have the ability to create meta catalogs and data objects in a managed order for reporting, dashboards, and other consumers. Finally, some argue that the user (or desktop) application is the place where users can freely navigate and work with data within the context they need locally and in a much more agile fashion.

Not All Data Is Created Equal


Despite database, data virtualization, and BI tool vendors racing to be the single point of access for all the data assets in the modern data platform for their own gains, there isn't one answer for where singular access and context should live, because it's not necessarily an architectural question but perhaps a more philosophical one, a classic "it depends." With so many options available from the vendors today, understanding how to blend and inherit context under which circumstances or workload is key.

First, understand which data needs to be governed vigorously; not all data is created equal. When the semantic context of data needs to be governed absolutely, moving the context closer to the data itself ensures that access inherits that context every time. For relational databases, this is the physical tables, columns, and data types that define entities and attribution within a schema of the data. For Hadoop, instead, this would be the definition of the table and columns, with the Hive or HCatalog abstraction layer bound to the data within the Hadoop Distributed File System (HDFS). Therefore, a data virtualization tool or BI server could integrate multiple data stores' schemas as a single virtual access point. Counter to this approach is certain data that does not have a set definition yet (discovery), or cases when local interpretation is more valuable than enterprise consistency; here it makes more sense for the context to be managed by users or business analysts in a self-service or collaborative manner. The semantic life cycle of data can be thought of as discovery, verification, governance, and, finally, adoption by different users in different ways.

As for the "it depends" comment regarding different analytic workloads, let's examine another new hot topic of 2013: Analytic Discovery, or specifically, the analytic discovery process. Analytic databases have been positioned as higher-performing, analytics-optimized databases that sit between the vast amounts of big data in Hadoop and the enterprise reference data, such as data warehouses and master data management hubs. The analytic database is highly optimized for performing dataset operations and statistics by combining the ease of use of SQL with the performance of MPP database technology, columnar data storage, or in-memory processing. Discovery is a highly iterative mental process, somewhat trial and error and verification. Analytic databases may not be as flexible or scalable as Hadoop, but they are faster out of the box. So, when an analytic database is used for a discovery workload, some degree of semantics and remote database connections should live within it. Whether the analytic sandbox remains a place for discovery or ends up running production analytics, accumulating more analytic jobs over time, is still unknown.

What's Ahead

In 2013, two major shifts in the data landscape occurred. The acceptance of leveraging the strengths of various database technologies in an optimized Modern Data Platform has more or less been resolved, but the recognition of a single point of access and context is next. Likewise, the race for access will continue well into 2014, and while one solution may win out over the others with enough push and marketing from vendors, the overall debate will continue for years, with blended approaches being the reality at companies.

And, get ready: The next wave in data is now emerging, once again pushing beyond web and mobile data. The Internet of Things (IoT), or Machine-to-Machine (M2M), data comes from a ratio of thousands of devices per person that create, share, and perform analytics, in some cases every second. Whether it's every device in your home, car, office, or everywhere in between that has a plug or battery generating and sharing data in a cloud somewhere, or it's the 10,000 data points being generated every second by each jet engine on the flight I'm on right now, there will be new forms of value created by business intelligence, energy efficiency intelligence, operational intelligence, and many other forms and families of artificial intelligence.

John O'Brien is principal and CEO of Radiant Advisors. With more than 25 years of experience delivering value through data warehousing and business intelligence programs, O'Brien's unique perspective comes from the combination of his roles as a practitioner, consultant, and vendor CTO in the BI industry. As a globally recognized business intelligence thought leader, O'Brien has been publishing articles and presenting at conferences in North America and Europe for the past 10 years. His knowledge in designing, building, and growing enterprise BI systems and teams brings real-world insights to each role and phase within a BI program. Today, through Radiant Advisors, O'Brien provides research, strategic advisory services, and mentoring that guide companies in meeting the demands of next-generation information management, architecture, and emerging technologies.

WHAT HAS YOUR BIG DATA DONE FOR YOU LATELY?

TransLattice helps solve the world's Big Data problems. Bridge your federated systems with effortless visibility and data control to get real benefit from your data.

www.TransLattice.com

industry updates

The State of Big Data Management

The Age of Big Data Spells the End of Enterprise IT Silos
By Alex Gorbachev

Data management has been a hot topic in recent years, topping even cloud computing. Here is a look at some of the trends and how they are going to impact data management professionals.

The Rise of Datafication

Today, businesses are ending up with more and more critical dependency on their data infrastructure. Before widespread electrification was implemented, most businesses were able to operate well without electricity, but in a matter of a couple of decades, dependency on electricity became so strong and so broad that almost no business could continue to operate without it. Similarly, datafication is what's happening right now. If underlying database systems are not available, manufacturing floors cannot operate, stock exchanges cannot trade, retail stores cannot sell, banks cannot serve customers, mobile phone users cannot place calls, stadiums cannot host sports games, and gyms cannot verify their subscribers' identity. The list keeps growing as more and more companies rely on data to run their core business.

Consolidation and Private Database Clouds

Database consolidation has been lagging behind application server consolidation. The latter long ago moved to virtual platforms, while the database posed unique challenges with host-based virtualization. However, with server virtualization improvements and database software innovations such as Oracle's Multitenant, database consolidation has moved to the next level and most recently reemerged as database as a service, with SLA management, resource accounting and chargeback, self-service capabilities, and elastic capacity.

Commodity Hardware and Software

Hardware performance has been rising consistently for decades with Moore's Law, high-speed networking, solid-state storage, and the abundance of memory. On the other hand, the cost of hardware has been consistently decreasing, to the point where we now call it a commodity resource. Public cloud infrastructure as a service (IaaS) has dropped the last barriers to adoption.

On the software side, the open source phenomenon has resulted in the availability of free or inexpensive database software that, combined with access to affordable hardware, allows practically any company to build its own data management systems; there are no barriers to datafication.



The Future of Database Outsourcing

Datafication, consolidation, virtualization, Moore's Law, engineered systems, cloud computing, big data, and software innovations will all result in more eggs (business applications) ending up in one basket (a single data management system). Consequently, the impact of an incident on such a system is significantly higher, affecting larger numbers of more critical business applications and functions: for example, a major U.S. retailer that has $1 billion of annual revenue dependent on a single engineered system, or another single engineered system handling 2% of Japan's retail transactions.

Operating such critical data systems becomes much more skills-intensive than labor-intensive, and, as companies follow the trend of moving from a zillion low-importance systems to just a few highly critical systems, outsourcing vendors will have to adapt. The modern database outsourcing industry is broken because it's designed to source an army of cheap but mediocre workers. The future of database outsourcing is with the vendors focused on enabling their clients to build an A-team to manage the critical data systems of today and tomorrow.
Breaking Enterprise IT Silos

The age of big data spells the end of enterprise IT silos. Big data projects are very difficult to tackle by orchestrating a number of very specialized teams such as storage administrators, system engineers, network specialists, DBAs, application developers, data analysts, etc.

It's difficult to specialize due to the quickly changing scope of roles as well as the rapid evolution of the software. Getting things done in a siloed environment takes a very long time; this is misaligned with the need to be more agile and adaptable to changing requirements and timelines. A single, well-jelled big data team is able to get work done quickly and in a more optimal way. Big data systems are basically the new commercial supercomputers in the age of datafication and, just like traditional supercomputers, they require a team of professionals responsible for the management of the complete system end-to-end.

Pre-integrated solutions and engineered systems also break enterprise IT silos by forcing companies to build a cross-skilled single team responsible for the whole engineered system.

The Future for Hadoop and NoSQL


Whether Hadoop is the best big data platform from a technology perspective or not,
it has such a broad (and growing) adoption
in the industry nowadays that there is little
chance for it to be displaced by any other
technology stack.
While, traditionally, core Hadoop has been thought of as a combination of HDFS and MapReduce, today both HDFS and MapReduce are really optional. For example, the MapR Hadoop distribution uses MapR-FS, and Amazon EMR uses S3. The same applies to MapReduce: Cloudera Impala has its own parallel execution engine, Apache Spark is a new low-latency parallel execution framework, and many more are becoming popular.
Even Apache Hive and Apache Pig are moving from pure MapReduce to Apache Tez, yet
another big data real-time distributed execution framework.
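For a sense of what these newer execution engines look like in practice, here is a minimal PySpark sketch of the kind of job that once called for hand-written MapReduce; a working Spark installation is assumed and the HDFS path is illustrative.

```python
# Minimal PySpark sketch (circa the Spark 1.x API) of a job that previously
# required hand-written MapReduce: a word count over files in HDFS.
# Requires a Spark installation; the HDFS path is illustrative.
from operator import add

from pyspark import SparkContext

sc = SparkContext(appName="wordcount-sketch")

counts = (
    sc.textFile("hdfs:///data/raw/logs/part-*")   # distributed read
      .flatMap(lambda line: line.split())          # "map" side
      .map(lambda word: (word, 1))
      .reduceByKey(add)                            # "reduce" side
)

for word, count in counts.take(10):
    print(word, count)

sc.stop()
```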
Hadoop is here to stay, and that means the Hadoop ecosystem at large. It will evolve and add new capabilities at a blazing-fast pace. Some components will die out and others will move into the mainstream. Core Hadoop as we know it will change.

There are many commercial off-the-shelf (COTS) applications available that use relational databases as a data platform: CRM, ERP, ecommerce, health records management, and more. Deploying COTS applications on one of the supported relational database platforms is a relatively straightforward task, and application vendors have a proven track record of deployments with clearly defined guidelines. It can be argued that the majority of relational database deployments today host a third-party application rather than an in-house developed application.

Big data projects, on the other hand, are pretty much 100% custom-developed solutions and not easily repeatable at another company. As Hadoop has become the standard platform of the big data industry, expect a slew of COTS applications to deploy on top of Hadoop platforms just as they are deployed on top of relational databases such as Oracle and SQL Server.

For example, all retail players have to solve the challenges of providing a seamless experience to their clients across both physical and online channels. All city governments have the same needs for traffic planning and real-time control to minimize traffic jams and, at the same time, to minimize the cost of operations and ownership. Companies will be able to buy a COTS application and deploy it on their own Hadoop infrastructure, no matter what Hadoop distribution it is.

It is, however, quite possible that the new big data COTS applications will be dominated by software as a service (SaaS) offerings or completely integrated solution appliances (as an evolution of engineered systems), and that means a completely different, repeatable deployment model for big data.

Unlike Hadoop, however, the world of NoSQL is still represented by a huge variety of incompatible platforms, and it's not obvious who will dominate the market. Each of the NoSQL technologies has a certain specialization, and no one size fits all, unlike relational databases.

Relational Databases
Are Not Going Anywhere
While there is much speculation about how modern data processing technologies are displacing proven relational databases, the reality is that most companies will be better served by relational technologies for most of their needs.

As the saying goes, "if all you have is a hammer, everything looks like a nail." When database professionals drink enough of the big data Kool-Aid, many of their challenges look like big data problems. In reality, though, most of their problems are self-inflicted. A bad data model is not a big data problem. Using 7-year-old hardware is not a big data problem. Lack of a data purging policy is not a big data problem. Misconfigured databases, operating systems, and storage arrays are not big data problems.

There is one good rule of thumb to assess whether you have a big data problem or not: if you are not using new data sources, you likely don't have a big data problem. If you are consuming new information from new data sources, you might have a big data problem.

What's Ahead

There are a few areas in which we can certainly expect many innovations over the next few years.

Real-time analytics on massive data volumes is in more and more demand. While there are many in-memory database technologies, including many proprietary solutions, I believe the future is with the Hadoop ecosystem and open standards. However, proprietary solutions such as SAP HANA or the just-announced Oracle In-Memory Database are very credible alternatives.

Graph databases will see significant uptake. There are several graph databases and libraries available, but they all have unique weaknesses when it comes to scalability, availability, in-memory requirements, data size, modification consistency, and plain stability. As we have more and more data generated that is based on dynamic relations between entities, graph theory becomes a very convenient way to model data. Thus, the graph database space is bound to evolve at a fast pace.

Continuously increasing security demands are a general trend in many industries, although most modern data processing technologies have weak security capabilities out of the box. This is where established relational databases, with very strong security models and the ability to integrate easily with central security controls, have a strong edge. While it's possible to deploy a Hadoop-based solution with encryption of data in transit and at rest, strong authentication, granular access controls, and access audit, it takes significantly more effort than deploying mature database technologies. It's especially difficult to satisfy strict security standards compliance with newer technologies, as there are no widely accepted and/or certified secure deployment blueprints.
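As a rough illustration of the kind of configuration such hardening entails, the sketch below lists a few standard Hadoop security properties; the selection and values are illustrative only and are not a deployment blueprint.

```python
# Illustrative, non-exhaustive sample of the kinds of Hadoop settings involved
# in the hardening described above. The property names are standard Hadoop
# configuration keys (core-site.xml / hdfs-site.xml), but treat the selection
# and values as a sketch, not a security baseline; a real deployment also needs
# a KDC, keytabs, audit tooling, and so on.
security_settings = {
    # core-site.xml
    "hadoop.security.authentication": "kerberos",  # strong authentication
    "hadoop.security.authorization": "true",       # service-level authorization
    "hadoop.rpc.protection": "privacy",            # encrypt RPC traffic in transit
    # hdfs-site.xml
    "dfs.encrypt.data.transfer": "true",           # encrypt HDFS block transfers
    "dfs.namenode.acls.enabled": "true",           # finer-grained HDFS ACLs
}

for name, value in sorted(security_settings.items()):
    print(f"{name} = {value}")
```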
The future of the database professional: One of the challenges holding companies back from adopting new data processing technologies is the lack of skilled people to implement and maintain them. Those of us with a strong background in traditional database technologies are already in high demand and are in even higher demand when it comes to the bleeding-edge, not-yet-proven databases. If you want to be ahead of the industry, look for opportunities to invest in learning one of the new database technologies, and do not be afraid that it might be one of those technologies that becomes nonexistent in a couple of years. What you learn will take you to the next level in your professional career and make it much easier to adapt to the quickly changing database landscape.

Alex Gorbachev, chief technology officer at Pythian, has architected and designed numerous successful database solutions to address challenging business requirements. He is a respected figure in the database world and a sought-after leader and speaker at conferences. Gorbachev is an Oracle ACE Director, a Cloudera Champion of Big Data, and a member of the OakTable Network. He serves as director of communities for the Independent Oracle Users Group (IOUG). In recognition of his industry leadership, business achievements, and community contributions, he received the 2013 Forty Under 40 award from the Ottawa Business Journal and the Ottawa Chamber of Commerce.



sponsored content

Elephant Traps
How to Avoid Them With Data Virtualization
Big Data is being talked about everywhere: in IT and business conferences, venture capital, legal, medical, and government summits, blogs and tweets, even Fox News! The prevailing mindset is that if you don't have a Big Data project, you're going to be left behind. In turn, CIOs are feeling pressured to do something, anything, about Big Data. So while they are putting up Hadoop clusters and crunching some data, it seems that the really big (data) questions all of them should be asking are: Where is the value going to come from? What are the real use cases? And, finally, how can they prevent this from becoming yet another money pit, or elephant trap, of technologies and consultants?

TRAP 1: NOT FOCUSING ON VALUE

Much of the talk about Big Data is focused on the data, not the value in it. Perhaps we should start with value: identify those business entities and processes where having infinitely more information could directly influence revenue, profitability, or customer satisfaction. Take, for example, the customer as an entity. If we had perfect knowledge of current and potential customers, past transactions and future intentions, demographics and preferences, how would we take advantage of that to drive loyalty and increase share of wallet and margins? Or, to focus on a process such as delivering healthcare services, how would Big Data impact clinical quality and cost, and reduce relapse rates? Enumerating the possible impact of Big Data on real business goals (or social goals for non-profits) should be the first step of your Big Data strategy, followed by prioritizing them, which would involve weeding out the whimsical and focusing instead on the practical.

TRAP 2: SEEKING DATA PERFECTION

With value in mind, you must be willing to experiment with many different types of Big Data (structured to highly unstructured) and sources: machine and sensor data (weather sensors, machine logs, web click streams, RFID), user-generated data (social media, customer feedback), Open Government and public data (financial data, court records, yellow pages), corporate data (transactions, financials), and many more. In many cases the broader view might yield more value than the deep and narrow view. And this allows companies to experiment with data that may be less than perfect quality but more than fit for purpose.

While quality, trustworthiness, performance, and security are valid concerns, over-zealously filtering out new sources of data using old standards will fail to achieve the full value of Big Data. Also, data integration technologies and approaches are themselves siloed, with different technology stacks for analytics (ETL/DW), business process (BPM, ESB), and content and collaboration (ECM, search, portals). Companies need to think more broadly about data acquisition and integration capabilities if they want to acquire, normalize, and integrate multi-structured data from internal and external sources and turn the collective intelligence into relevant and timely information through a unified/common/semantic layer of data.

TRAP 3: COST, TIME AND RIGIDITY

While all the data in the world, and its potential value, can excite companies, it would not be economically attractive, except to the largest organizations, if Big Data integration and analytics were done using traditional high-cost approaches such as ETL, data warehouses, and high-performance database appliances. From the start, Big Data projects should be designed with low cost, speed, and flexibility as the core objectives of the project. Big Data is still nascent, meaning both business needs and data realms are likely to evolve faster than previous generations of analytics, requiring tremendous flexibility. Traditional analytics relied heavily on replicated data, but Big Data is too large for replication-based strategies and must be leveraged in place or in flight where possible. This also applies in the output direction, where Big Data results must be easy to reuse across unanticipated new projects in the future.

[Figure: The Denodo Platform connects Big Data in the Web/Cloud (Data.gov, the WWW, cloud storage, Hadoop, web streams, unstructured content, log files) and Big Data in the Enterprise (enterprise and cloud apps; relational, parallel, and columnar stores) through Connect, Combine (a unified data layer), and Publish (universal data publishing) steps that feed agile BI and analytics, query, chart, and map widgets for users.]

AVOIDING THE TRAPS


To prevent Big Data projects from becoming yet another money pit and suffering from the same rigidity as data warehouses, there are four areas in particular to consider: data access, data storage, data processing, and data services. The middle two areas (storage and processing) have received the most attention, as open source and distributed storage and processing technologies like Hadoop have raised hopes that big value can be squeezed out of Big Data using small budgets. But what about data access and data services?

Companies should be able to harness Big Data from disparate realms cost-effectively, conform multi-structured data, minimize replication, and provide real-time integration. The Big Data and analytic result sets may need to be abstracted and delivered as reusable data services in order to allow different interaction models such as discover, search, browse, and query. These practices ensure a Big Data solution that is not only cost-effective, but also flexible enough to be leveraged across the enterprise.

DATA VIRTUALIZATION

Several technologies and approaches serve Big Data needs, of which two categories are particularly important. The first has received a lot of attention and involves distributed computing across standard hardware clusters or cloud resources, using open source technologies. Technologies that fall in this category and have all received a lot of attention include Hadoop, Amazon S3, Google BigQuery, etc. The other is data virtualization, which has been less talked about until now, but is particularly important to address the challenges of Big Data mentioned above:
Data virtualization accelerates time to
value in Big Data projects: Because data
virtualization is not physical, it can rapidly
expose internal and external data assets
and allow business users and application
developers to explore and combine
information into prototype solutions that can
demonstrate value and validate projects faster.
Best of breed data virtualization solutions provide better and more efficient connectivity: Best of breed data virtualization solutions connect diverse data realms and sources ranging from legacy to relational to multi-dimensional to hierarchical to semantic to Big Data/NoSQL to semi-structured web, all the way to fully unstructured content and indexes. These diverse sources are exposed as normalized views so they can be easily combined into semantic business entities and associated across entities as linked data.

Virtualized data inherently provides lower costs and more flexibility: The output of data virtualization is data services, which hide the complexity of the underlying data and expose business data entities through a variety of interfaces, including RESTful linked data services, SOA web services, data widgets, and SQL views, to applications and end users. This makes Big Data reusable, discoverable, searchable, browsable, and queryable using a variety of visualization and reporting tools, and makes the data easily leveraged in real-time operational applications as well.
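As a hedged illustration of that kind of reuse, the sketch below consumes a virtualized customer entity as a RESTful data service; the endpoint, parameters, and field names are hypothetical.

```python
# Hedged sketch of consuming a virtualized business entity as a RESTful data
# service. The endpoint, query parameters, and field names are hypothetical;
# the point is that the consumer sees one simple interface regardless of which
# underlying stores (relational, Hadoop, web, unstructured) feed the view.
import requests

BASE_URL = "https://dataservices.example.com/views/customer"  # hypothetical

response = requests.get(
    BASE_URL,
    params={"state": "NJ", "format": "json"},
    timeout=30,
)
response.raise_for_status()

for customer in response.json():
    print(customer["customer_name"], customer["address"])
```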

CONCLUSION

CIOs and Chief Data Officers alike would do well to keep the dangers of elephant traps in mind before they find themselves ensnared. The truth is that every Big Data project needs a balance between the Big Data technologies for storage and processing on the one hand and data virtualization for data access and data services delivery on the other.


industry updates

The State of Data Security and Governance

Big Data Poses Legal Issues and Risks
By Alon Israely

The use of big data by organizations today raises some important legal and regulatory concerns. The use of big data systems and cloud-based systems is expanding faster than the rules or legal infrastructure to manage it. Risk management implications are becoming more critical to business strategy. Businesses must get ahead of the practice to protect themselves and their data.

Before a discussion of those legal and risk issues, it's important that we speak the same language, as the terms "big data" and "cloud" are overused and mean many different things. For our purposes here, big data is the continuously growing collection of datasets that derive from different sources, under individualized conditions, and which form an overall set of information to be analyzed and mined in a manner for which traditional database technologies and methods are not sufficient. Big data analysis requires powerful computing systems that sift through massive amounts of information with large numbers of variables to produce results and reporting that can be used to determine trends and discover patterns to ultimately make smarter and more accurate (business) decisions.
Big data analysis is used to spot everything from business or operational trends to QA issues, new products, new diseases, new ways of socializing, etc. Cloud technologies are required to help manage big data analysis. Big data leverages cloud technologies such as utility computing and distributed storage; that is, massively parallel software runs to crunch, correlate, and present data in new ways. Cloud infrastructure is highly scalable and allows for an on-demand, usage-based economic model that translates to low-cost yet powerful IT resources, with a low capital expense and low maintenance costs.

Cloud infrastructure becomes even more important as the creation and use of the data continues to grow. Every day, Google processes more than 24,000TB of data, and a few of the largest banks process more than 75TB of internal corporate data daily across the globe. Those massive sets of data form the basis for big data analysis. And as big data becomes more widely used and those datasets continue to grow, so do the legal and risk issues.

Legal and risk management implications are typically sidelined in the quest for big data mining and analysis because the organization is typically focused, first and foremost, on trying to use the data effectively and efficiently for its own internal business purposes, rather than on ensuring that any legal and risk management implications are also covered. The potential value of the results of using big data analysis to increase income (or lower expenses) for the company tends to drown out the calls for risk oversight. Big data can be a Siren, whose beautiful call lures unsuspecting sailors to a rocky destruction.



Understanding the legal and regulatory consequences will help keep your company safe
from those dangerous rocks.

Developing Protection Strategies


In order to protect the organization from
legal risks when using big data, businesses must
assess issues and develop protection strategies.
The main areas typically discussed related to
legal risks and big data are in the realm of consumer privacy; but, the legal compliance, such
as legal discovery and preservation obligations,
are also critical to address. Records information management, information governance,
legal, and IT/IS professionals must know how
to identify, gather, and manage big datasets in
a defensible manner when that data and associated systems are implicated in legal matters
such as lawsuits, regulatory investigations, and
commercial arbitrations. Organizations must
understand the risks, obligations, and standards associated with storing and managing
big data for legal purposes. As with all technology decisions, there should be a cost/benet
analysis completed to quantify all risks, including soft risks such as the risk to reputation of
data breaches or the misuse of data.
Big data can be a sensitive topic when lawsuits or regulators come knocking, especially if the potential legal risks have not been thoroughly considered by companies early on as they put big data systems in place and then rely upon the associated analysis. Thus, it's important to bring in the lawyers together with the technologists early, though this is not always easy to do. Big data from a legal perspective includes consumer privacy and international data transfer (cross-border) issues, but riskier still is the potential exposure of using that data in the normal course of business and maintaining the underlying raw data and analyses (e.g., trending reports). For example, one question raised is which parts of an organization's big data may be protected by a legal privilege.
Some examples of big data usage in the
market that carry critical legal implications
and ramifications and which have their own
tough questions include:
Determining customer trends to identify
new products and markets
Finding combinations of proteins and
other biological components to identify
and cure diseases
Using social-networking data (e.g.,
Twitter) to predict financial market
movements
Consumer-level support for finding better deals, products, or info (e.g., Amazon "just like this" or LinkedIn "people you may know" functions)
Using satellite and other geo-related
imagery and data to determine
movement of goods across shipping
lanes and to spot trends in global
manufacturing/distribution
Corporate reputation management
by following social media and other
internet-based mentions, and comparing
those with internal customer trend data
Use by government and others to
determine voting possibilities and
accuracy for demographic-related issues

The Legal Risks


With respect to the legal risks involved, what's good for the goose is good for the gander. That is, it's important to remember that use of big data by a company may open the door for discovery by opposing litigants, government regulators, and other legal
adversaries.
Technical limitations of identifying,
storing, searching, and producing raw data
underlying big data analysis may not guard
against discovery, and being forced to produce raw data underlying the big data
analysis used by the organization to make
important (possibly trade secret classified) decisions can be potentially dangerous for a company, especially as that data may
end up in the hands of competitors. Thus,
an organization should perform a legal/risk
evaluation before any analysis using big data
is formulated, used, or published.
A major risk faced by organizations utilizing big data analysis is a legal request by
opposing parties and regulators (e.g., for discovery or legal investigation purposes) for
big datasets or their underlying raw data. It can be very difficult to maintain a limited scope
related only to the legal issues at hand. This
means the organization can end up turning
over far more data than is either necessary
or appropriate due to technical limitations
for segmenting or identifying the relevant
data subsets. Challenges associated with such
issues are still new and thus there are no
known industry best practices, and no legal
authority yet exists. Though this is not good
news for organizations currently using big
data analysis that may be also implicated in
lawsuits or other legal matters, there are ways
to mitigate exposure and protect the organization as best as possible, even though this is still very much unknown territory from a legal compliance perspective.


Information security risks are also important factors to consider within the larger legal
and risk context. If they are not mitigated
early on, they alone can lead to opening the
door for broader discovery related to big
datasets and systems. Information security in
a broad sense can include:
Data Integrity and Privacy
Encryption
Access Control
Chain-of-Custody
Relevant Laws/Regulations
Corporate Policies
Specific examples of situations where information security policies should be monitored
include:
Vendor Agreements
Data Ownership & Custody
Requirements
International Regulations
Confidentiality Terms
Data Retention/Archiving
Geographical Issues
Entering into contracts with third-party
big data-related providers is an area that warrants special attention and where legal or risk
problems may arise. Strict controls related to
third parties are important. More and more big data systems and technologies are supplied by third parties, so the organization must have certain restrictions and protections in place to ensure that side-door and backdoor discovery doesn't occur.
When dealing with third-party control,
avoiding common pitfalls leads to better data
risk and cost control. Common problems that
arise include:
Inadvertent data spoliation, which
can include stripping metadata and
truncating communication threads
Custody and control of the data,
including access rights and issues with
data removal
Problems with relevant policies/
procedures, which can include a lack
of planning and a lack of enforcement
of rules
International rules and regulations,
including cross-border issues
Big data sources are no different from traditional data sources in that big data sources, and the use of big data, should be protected


like any other critical corporate document, dataset, or record.

Mitigating Risk
To best mitigate risk from both internal and third-party users, certain procedures related to data access and handling should be implemented via IT controls (a brief sketch of one such control follows the list below):
Auditing and validation of logins and access
Logging of actions
Monitoring
Chain-of-custody
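As a purely illustrative sketch (not any specific product's API), the Python snippet below shows one way these controls might be wired together: routine data-access calls are wrapped so every read is logged with the requesting user, the action taken, and a timestamp. The function, table, and user names here are hypothetical.

import logging
from datetime import datetime, timezone
from functools import wraps

# Minimal sketch: wrap data-access functions so every call is logged
# with who asked, what was requested, and when. Names are hypothetical.
audit_log = logging.getLogger("data_access_audit")
logging.basicConfig(level=logging.INFO)

def audited(func):
    @wraps(func)
    def wrapper(user, *args, **kwargs):
        timestamp = datetime.now(timezone.utc).isoformat()
        audit_log.info("user=%s action=%s args=%s kwargs=%s time=%s",
                       user, func.__name__, args, kwargs, timestamp)
        return func(user, *args, **kwargs)
    return wrapper

@audited
def read_customer_records(user, segment):
    # Placeholder for a real query against the big data store.
    return ["record-1", "record-2"]

if __name__ == "__main__":
    read_customer_records("analyst42", segment="loyalty-program")

In a real deployment, the log entries would typically feed a tamper-evident store so that chain-of-custody can be demonstrated later.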
Executive oversight, however, is also an
extremely important method of managing data
risk. Organizational commitment to appropriate control procedures evidenced through
executive support is a key factor in creating, deploying, and maintaining a successful information risk management program. Employees who see the value of the procedures reflected in the actions and attitudes of management are more likely to appreciate the importance of those procedures themselves.
All in all, a practical, holistic approach is
best for risk mitigation. Here are some tips for
managing legal information/data risk:
Use a team approach: Include
representatives from legal, IT, risk, and
executives to cover all bases.
Use written SOPs and protocols:
Standard ways of operating, responding, and managing processes, combined with following written protocols, are key to consistency.
Consistency helps defend the process in
legal proceedings if needed.
Leverage native functionality when
responding to legal requests: Reporting
that is sufficient for the business should
be appropriate for the courts. Also be sure
to establish a strong separation of the
presentation layer from the underlying
data for implicated system identification
purposes.
Multi-departmental involvement is also
very important to creating and maintaining
a successful risk mitigation environment and
plan. It is easy to lose track of weak spots in
data handling when only one group is trying to guess the activities of all the others in
an organization. Executives, IT, legal, and
risk all have experiences to share that could
reveal weaknesses in the systems. Review by
a team helps cover all the bases.

Implementation across departments also


reinforces the importance of the risk procedures to the organization. Organizations
that create risk programs but choose not to
implement them, or that implement them
inconsistently, face their own challenges when
dealing with the courts in enforcing data and
document requests, even those requests with
a broad scope.

What's Ahead
This is a new field for legal professionals
and the courts. Big data is here to stay and
will become increasingly ubiquitous and a
necessary part of running an efficient and
successful business. Because of that, those systems and data (including derived analysis and
underlying raw information) will be implicated in legal matters and will thus be subject
to legal rules of preservation, discovery, and
evidence. Those types of legal requirements
are typically burdensome and expensive
when processes are not in place and people
are not trained. Relevant big data systems and
applications are not designed for the type of
operations required by legal rules of preservation and discovery: requirements related to maintaining evidentiary integrity, chain-of-custody, data origination, use, metadata
information, and historical access control.
This new technical domain will quickly
become critical to the legal fact-finding process. Thus, organizations must begin to think
about how the data is used and maintained
during the normal course of business and
how that may affect their legal obligations if
big data or related systems are implicated, which may well be the case in every legal situation an organization faces.

Alon Israely, Esq., CISSP, is


a co-founder of Business
Intelligence Associates. As
a licensed attorney and
IT professional, together
with the distinction of the
CISSP credential, he brings
a unique perspective to articles and lectures, which has made him one of the most
sought-after speakers and contributors in his
field. Israely has worked with corporations and law firms to address management, identification, gathering, and handling of data involved
in e-discovery for more than a decade.

sponsored content

With HPCC Systems,


LexisNexis Data Enrichment
is Achieved in Less Than One Day
OVERVIEW
The LexisNexis Global Content
Systems Group provides content to a wide
array of market facing delivery systems,
including Lexis for Microsoft Office and
LEXIS.COM. These services deliver access
to content to more than a million end users.
The LexisNexis content collection consists
of more than 2.3 billion documents of various sizes, and is more than 20 terabytes
of data. New documents are added to the
collection every day.
The raw text documents are prospectively
enhanced by recognizing and resolving
embedded citations, performing multiple topical classifications, recognizing entities, and creating statistical summaries and
other data mining activities.
The older documents in the collection
require periodic retrospective processing to
apply new or modified topical classification
rules, and to account for changes on the basis
of the other data enhancements. Without
the periodic retrospective processing, the
collection of documents would become
increasingly inconsistent. The inconsistent
application of the above enhancements
materially reduces the effectiveness of the
data enhancements.
THE CHALLENGE
The LexisNexis Content management
system had evolved over a 40-year period
into a complex heterogeneous distributed
environment of proprietary and commodity
servers. The systems acting as repository
nodes were separated from the systems
that performed the data enhancements.
The separation of the repository nodes
from the processing systems required that
copies of the documents be transmitted
from the repository systems to the data
enhancement system, and then transmitted
back to the repository after the enhancement
process completed. The transmission of the
documents created additional processing
latencies, and the elapsed time to perform

a retrospective topical classification or


indexing became several months.
The delay in applying a new classification to the collection retrospectively created a
situation where older documents might not
be found by a researcher via the topical index
when the index topic was new or recently modified. The lack of certainty about the
coverage of the indexing required the
researcher to conduct additional searches,
especially when the classification covered
a new or emerging topic.

THE SOLUTION
LexisNexis Global Content Systems Group
consolidated the content management and
document enhancement and mining systems
onto HPCC Systems to solve multiple data
challenges, including content enrichment, since data enrichment must be applied across
all the content simultaneously to provide a
superior search result.
HPCC Systems from LexisNexis is an
open-source, enterprise-ready solution
designed to help detect patterns and hidden
relationships in Big Data across disparate data
sets. Proven for more than 10 years, HPCC
Systems helped LexisNexis Risk Solutions
scale to a $1.4 billion information company
now managing several petabytes of data on
a daily basis from 10,000 different sources.
HPCC Systems is proven in entity
recognition/resolution, clustering and content
analytics. The massively parallel nature of the
HPCC platform provides both the processing
and storage resources required to fulfill the
dual missions of content storage and content
enrichment.
HPCC Systems was easily integrated with
the existing content management workflow
engine to provide document level locking and
other editorial constraints.
The migration of the content repository
and data enhancement processing to the
HPCC platform involved creating several
HPCC worker clusters of varying sizes
to perform data enrichments and a single

HPCC Data Management cluster to house


the content. This configuration provides
the ability to send document workloads of
varying sizes to appropriately sized worker
clusters while reserving a substantially sized
Data Management cluster for content storage
and update promotions. Interactive access is
also provided to support search and browse
operations.

THE RESULTS
The new system achieves the goal
of having a tightly integrated content
management and enrichment system that
takes full advantage of HPCC Systems' supercomputing capabilities for both computation and high-speed data access.
The elapsed time to perform an
enrichment pass of the entire data collection
dropped from six to eight weeks to less
than a day. This change is so significant that LexisNexis has already extended the degree of enrichment into other capabilities that were previously out of reach.
ABOUT HPCC SYSTEMS
HPCC Systems was built for small
development teams and offers a single
architecture and one programming
language for efficient data processing
of large or complex queries. Customers,
such as financial institutions, insurance
companies, law enforcement agencies,
federal government and other enterprise
organizations, leverage the HPCC Systems
technology through LexisNexis products and
services. For more information, visit
www.hpccsystems.com.

LEXISNEXIS
www.hpccsystems.com
LexisNexis and the Knowledge Burst Logo are
registered trademarks of Reed Elsevier Properties Inc.,
used under license. HPCC Systems is a registered
trademark of LexisNexis Risk Data Management Inc.
Copyright 2012 LexisNexis. All rights reserved.


industry updates

The State of Data Warehousing

Unlocking the
Potential of Big Data in a
Data Warehouse Environment
By W. H. Inmon

In the beginning, the data warehouse


was a concept that was not accepted by the
database fraternity. From that humble beginning, the data warehouse has become conventional wisdom and is a standard part of the
infrastructure in most organizations. The data warehouse has become the foundation of
corporate data. When an organization wants
to look at data from a corporate perspective,
not an application perspective, the data warehouse is the tool of choice.

Data Warehousing and


Business Intelligence
A data warehouse is the enabling foundation of business intelligence. Data warehousing and business intelligence are linked as closely as fish and water.

The spending on data warehousing and


business intelligence long ago passed that
of spending on transaction-based operational
systems. Once, operational systems dominated the budget of IT. Now, data warehousing and business intelligence dominate.
Through the years, data warehouses have
grown in size and sophistication. Once, data
warehouse capacity was measured in gigabytes. Today, many data warehouses are measured in terabytes. Once, single processors
were sufficient to manage data warehouses.
Today, parallel processors are the norm.
Today, also, most corporations understand
the strategic significance of a data warehouse.
Most corporations appreciate that being able
to look at data uniformly across the corporation is an essential aspect of doing business.

But in many ways, the data warehouse


is like a river. It is constantly moving, never
standing still. The architecture of data warehouses has evolved with time. First, there was
just the warehouse. Then, there was the corporate information factory (CIF). Then, there
was DW 2.0. Now there is big data.

Enter Big Data


Continuing the architectural evolution is the
newest technology: big data. Big data technology arrived on the scene as an answer to the need to service very large amounts of data. There are several definitions of big data. The definition
discussed here is the one typically discussed in
Silicon Valley. Big data technology:
Is capable of handling lots and lots
of data


Is capable of operating on inexpensive storage
Big data:
Is managed by the Roman census method
Resides in an unstructured format
Organizations are finding that big data
extends their capabilities beyond the scope
of their current horizon. With big data technology, organizations can search and analyze data well beyond what would have ever
fit in their current environment. Big data
extends well beyond anything that would
ever fit in the standard DBMS environment.
As such, big data technology extends the
reach of data warehousing as well.

Some Fundamental Challenges


But with big data there come some fundamental challenges. The biggest challenge is
that big data is not able to be analyzed using
standard analytical software. Standard analytical software makes the assumption that data
is organized into standard fields, columns,
rows, keys, indexes, etc. This classical DBMS
structuring of data provides context to the
data. And analytical software greatly depends
on this form of context. Stated differently, if
standard analytical software does not have the
context of data that it assumes is there, then
the analytical software simply does not work.
Therefore, without context, unstructured
data cannot be analyzed by standard analytical software. If big data is to fulfill its destiny,
there must be a means by which to analyze big
data once the data is captured.

Determining Context
There have been several earlier attempts
to analyze unstructured data. Each of the
attempts has its own major weakness. The
previous attempts to analyze unstructured
data include:
1. NLP (natural language processing). NLP is intuitive. But the flaw with NLP is
that NLP assumes context can be determined
from the examination of text. The problem
with this assumption is that most context is
nonverbal and never finds its way into any
form of text.
2. Data scientists. The problem with
throwing a data scientist at the problem of
needing to analyze unstructured data is that
the world only has a finite supply of those scientists. Even if the universities of the world started to turn out droves of data scientists, the
demand for data scientists everywhere there is
big data would far outstrip the supply.
3. MapReduce. The leading technology of
big data, Hadoop, has technology called MapReduce. In MapReduce, you can create and
manage unstructured data to the nth degree.
But the problem with MapReduce is that it
requires very technical coding in order to be
implemented. In many ways MapReduce is like
coding in Assembler. Thousands and thousands
of lines of custom code are required. Furthermore, as business functionality changes, those
thousands of lines of code need to be maintained. And no organization likes to be stuck
with ongoing maintenance of thousands of lines of detailed, technical custom code (a brief sketch of this kind of hand-coding appears after this list).
4. MapReduce on steroids. Organizations
have recognized that creating thousands of
lines of custom code is no real solution.

Instead, technology has been developed that


accomplishes the same thing as MapReduce
except that the code is written much more
efficiently. But even here there are some
basic problems. The MapReduce on steroids
approach is still written for the technician, not
the business person. And the raw data found
in big data is essentially missing context.
5. Search engines. Search engines have
been around for a long time. Search engines
have the capability of operating on unstructured data as well as structured data. The only
problem is that search engines still need data to have context in order for a search to
produce sophisticated results. While search
engines can produce some limited results
while operating on unstructured data, sophisticated queries are out of the reach of search
engines. The missing ingredient that search
engines need is the context of data which is
not present in unstructured data.
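To make the earlier MapReduce point concrete, here is a minimal word-count job sketched in Python in the style of Hadoop Streaming; it assumes the mapper and reducer would normally be separate scripts passed to the streaming jar, and it is an illustration only, not code from any particular deployment.

import sys

# Minimal Hadoop Streaming-style word count. In practice, the mapper
# and reducer are separate scripts; real jobs over unstructured text
# need far more custom code than this.

def mapper(lines):
    # Emit tab-separated (word, 1) pairs, one per line.
    for line in lines:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer(lines):
    # Input arrives sorted by key; sum the counts for each word.
    current_word, current_count = None, 0
    for line in lines:
        word, count = line.rstrip("\n").split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

if __name__ == "__main__":
    role = sys.argv[1] if len(sys.argv) > 1 else "map"
    (mapper if role == "map" else reducer)(sys.stdin)

Multiply this by every extraction, classification, and correlation a business needs, and the maintenance burden described above becomes clear.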
So the data warehouse has arrived at the
point where it is possible to include big data
in the realm of data warehousing. But in order
to include big data, it is necessary to overcome
a very basic problem: the data found in big data is void of context, and without context, it is very difficult to do meaningful analysis
on the data.
While it is possible that data warehousing
will be extended to include big data, unless
the basic problem of achieving or creating
context in an unstructured environment is
solved, there will always be a gap between big
data and the potential value of big data.
Deriving context, then, is the forthcoming major issue of data warehousing and big data for
the future. Without being able to derive context


for unstructured data, there are limited uses for


big data. So exactly how can context of text be
derived, especially when context of text cannot
be derived from the text itself?

Deriving Context
In fact, there are two ways to derive context for unstructured data: general context and specific context. General context can be derived by merely declaring a document to be of a particular variety. A document may be about fishing. A document
may be about legislation. A document may
be about healthcare, and so forth. Once the
general context of the document is declared,
then the interpretation of text can be made in
accordance with the general category.
As a simple example, suppose the raw text contained this sentence: "President Ford drove a Ford." If the general context were about motor cars, then "Ford" would be interpreted to be an automobile. If the general
context were about the history of presidents
of the U.S., then "Ford" would be interpreted to be a reference to a former president.
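A toy sketch of that idea follows, in Python. The context labels and glosses are invented purely for illustration and are not drawn from any actual textual-disambiguation product; the point is only that the declared document category, not the sentence itself, decides how the ambiguous term is read.

# Toy illustration of general context: the declared document category
# determines how an ambiguous term is interpreted. The categories and
# glosses below are invented for the example.
GENERAL_CONTEXT_RULES = {
    "automobiles":   {"Ford": "Ford Motor Company vehicle"},
    "us_presidents": {"Ford": "Gerald Ford, former U.S. president"},
}

def interpret(term, document_category):
    rules = GENERAL_CONTEXT_RULES.get(document_category, {})
    return rules.get(term, term)  # fall back to the raw term

sentence = "President Ford drove a Ford"
print(interpret("Ford", "automobiles"))    # Ford Motor Company vehicle
print(interpret("Ford", "us_presidents"))  # Gerald Ford, former U.S. president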

Textual Disambiguation
The other type of context is specific context. Specific context can be derived in many different ways: by the structure of a word, the text surrounding a word, the placement of words in proximity to each other, and so forth. There is new technology called textual disambiguation which allows raw unstructured text to have its context specifically determined.
In addition, textual disambiguation allows
the output of its processing to be placed in
a standard database format so that classical
analytical tools can be used.
At the end of textual disambiguation,
analytical processing can be done on the
raw unstructured text that has now been
disambiguated.

The Value of Determining Context


The determination of the context of unstructured data opens the door to many types of
processing that previously were impossible. For
example, corporations can now:
Read, understand, and analyze their corporate contracts en masse. Prior to textual


disambiguation, it was not possible to look at contracts and other documents collectively.
Analyze medical records. For all the work
done in the creation of EMRs (electronic
medical records), there is still much narrative
in a medical record. The ability to understand
narrative and restructure that narrative into a
form and format that can be analyzed automatically is a powerful improvement over the
techniques used today.
Analyze emails. Today after an email is read,
it is placed on a huge trash heap and is never
seen again. There is, however, much valuable
information in most corporations' emails. By using textual disambiguation, the organization can start to determine what important information is passing through its hands.
Analyze and capture call center data.
Today, most corporations look at and analyze
only a sampling of their call center conversations. With big data and textual disambiguation, now corporations can capture and analyze all of their call center conversations.
Analyze warranty claims data. While a
warranty claim is certainly important to the
customer who has made the claim, warranty
analysis is equally important to the manufacturer to understand what manufacturing processes need to be improved. By being able to
automatically capture and analyze warranty
data and to put the results in a database, the
manufacturer can benefit mightily.
And the list goes on and on. This short
list is merely the tip of the tip of the iceberg
when it comes to the advantages of being
able to capture and analyze unstructured
data. Note that with standard structured
processing, none of these opportunities have
come to fruition.

Some Architectural Considerations


One of the architectural considerations
of managing big data through textual disambiguation technology is that raw data on
a big data platform cannot be analyzed in a
sophisticated manner. In order to set the stage
for sophisticated analysis, the designer must
take the unstructured text from big data,
pass the text through textual disambiguation,
then return the text back to big data. However, when the raw text passes through textual
disambiguation, it is transformed into disambiguated text. In other words, when the raw


text passes through textual disambiguation, it
passes back into big data, where the context of
the raw text has been determined.
Once the context of the unstructured text
has been determined, it can then be used for
sophisticated analytical processing.
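A minimal sketch of that round trip appears below, in Python, with a dictionary standing in for the big data store and a hypothetical disambiguate() function standing in for the real textual-disambiguation processing: raw text goes out, rows with explicit context come back and sit alongside the original documents, ready for standard analytical tools.

# Sketch only: the "big data store" is a dict, and disambiguate() is a
# stand-in for real textual-disambiguation processing.
raw_store = {
    "doc-1": "President Ford drove a Ford",
}

def disambiguate(doc_id, text, category):
    # Return context-resolved rows in a flat, database-ready shape.
    rows = []
    for position, token in enumerate(text.split()):
        rows.append({
            "doc_id": doc_id,
            "position": position,
            "token": token,
            "category": category,
        })
    return rows

# Pass the raw text through disambiguation and write the results back
# next to the originals, where analytical tools can reach them.
disambiguated_store = {
    doc_id: disambiguate(doc_id, text, category="us_presidents")
    for doc_id, text in raw_store.items()
}
print(disambiguated_store["doc-1"][:2])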

What's Ahead
The argument can be made that the process of disambiguating the raw text then
rewriting it to big data in a disambiguated
state increases the amount of data in the
environment. Such an observation is absolutely true. However, given that big data is
cheap and that the big data infrastructure is
designed to handle large volumes of data, it
should be of little concern that there is some
degree of duplication of data after raw text
passes through the disambiguation process.
Only after big data has been disambiguated is the big data store fit to be called a data warehouse. However, once the big data is disambiguated, it makes a really valuable and really
innovative addition to the analytical, data
warehouse environment.
Big data has much potential. But unlocking that potential is going to be a real challenge. Textual disambiguation promises to be
as profound as data warehousing once was.
Textual disambiguation is still in its infancy,
but then again, everything was once in its
infancy. However, the early seeds sown in textual disambiguation are bearing some very interesting fruit.

W. H. Inmon, the father of data warehousing, has
written 52 books published
in nine languages. Inmon
speaks at conferences regularly. His latest adventure
is the building of Textual ETL (textual disambiguation) technology
that reads raw text and allows raw text to be
analyzed. Textual disambiguation is used to
create business value from big data. Inmon
was named by Computerworld as one of the 10 most influential people in the history of
the computer profession, and lives in Castle
Rock, Colo.

sponsored content

Filling the Content Blind Spot


The adage "Every company is a data company" is more true today than ever. The problem is most companies don't realize how much valuable data they're actually sitting on, nor how to access
and use this untapped data. Companies
must exploit whatever data enters their
enterprise in every format and from every
source to gain a comprehensive view of
their business.
Most IT professionals focus all
their resources on figuring out how to
effectively access structured data sources.
Projects associated with data warehousing
and business intelligence get all the
attention. And in some cases they yield
valuable insights into the business. But
the fact is that structured data sources
are just the tip of the iceberg inside most
companies. There is so much intelligence
that goes unseen and unanalyzed simply because companies don't know how to get at it.
For that reason, forward-looking CIOs
and IT organizations have begun exploring
new strategies for tapping into other
non-traditional sources of information
to get a more complete picture of their
business. These strategies attempt to gather
and analyze highly unstructured data like
websites, tweets and blogs to discover trends
that might impact the business.
While this is a step in the right direction,
it misses the bigger picture of the Big Data
landscape. The blind spots in these data
strategies are both the unstructured and
semi-structured data that is contained in
content like reports, EDI streams, machine
data, PDF files, print spools, ticker feeds,
message buses, and many other sources.

UNDERSTANDING THE
CONTENT BLIND SPOT
A growing number of IT organizations
now see value in information contained
within these content blind spots. The key
reason: It enhances their business leaders'
ability to make smarter decisions because
much of this data provides a link to past
decisions.
Companies also realize that these nontraditional data sources are growing at an

exponential rate. They have become the


language of business for industries like
healthcare, financial services, and retail. So where do you find these untapped sources of information? Easy; they're everywhere.
As companies have rolled out ERP, CRM
and other enterprise systems (including
enterprise content management tools), they
have also created thousands of standard
reports. Companies are also stockpiling
volumes of commerce data with EDI
exchanges. Excel spreadsheets are ubiquitous
as well. And as PDF files of invoices and bills of lading are exchanged, vital data is being saved. All these sources possess semi-structured data that can reveal valuable
business insight.
But how do you get to these sources,
and what do you do with them?

OPTIMIZING INFORMATION
THROUGH VISUAL
DATA DISCOVERY
Next-generation analytics enable
businesses to analyze any data variety,
regardless of structure, at real-time velocity
for fast decision making in a visual data
discovery environment. These analytic tools
link diverse data types with traditional
decision-making tools like spreadsheets and

business intelligence (BI) systems to offer


a richer decision making capability than
previously possible.
By tapping into semi-structured and
unstructured content from varied sources
throughout an organization, next-gen
analytics solutions are able to map these
sources to models so that they can be
combined, restructured and analyzed.
While it sounds simple, the technology
actually requires significant intelligence
regarding the structural components of the
content types to be ingested and the ability
to break these down into atomic level items
that can be combined and mapped together
in different ways.
For organizations to fully exploit the
power of their information, they have to
uncover the content blind spots in their
enterprise that hold so much underutilized
value. Leveraging structured, unstructured
and semi-structured content in a visual
discovery environment can deliver enormous
improvements in decision making and
operational effectiveness.

DATAWATCH
www.datawatch.com



sponsored content

Overcoming the
Big Data Transfer Bottleneck
Businesses all over the world are
beginning to realize the promise of Big
Data. After all, being able to extract data
from various sources across the enterprise,
including operational data, customer
data, and machine/sensor data, and then
transform it all into key business insights can
provide significant competitive advantage. In
fact, having up-to-date, accurate information
for analytics can make the difference
between success and failure for companies.
However, it's not easy. A recent study by
Wikibon noted that returns thus far on Big
Data investments are only 50 cents to the
dollar. A number of challenges stand in the way of maximizing return. The data transfer bottleneck is but one real and pervasive issue that's causing many headaches in IT today.

REASONS FOR THE BIG DATA


TRANSFER BOTTLENECK
Outdated technology. Moving data is
hard. Moving Big Data is harder. When
companies rely on heritage platforms
engineered to support structured data
exclusively, such as ETL, they quickly find
out that the technology simply cannot scale
to handle the volume, velocity or variety of
data and, therefore, cannot meet the real-time information needs of the business.
Lagging system performance. Even if
source and target systems are in the same
physical location, data latency can still be
a problem. Data often resides in systems
that are used daily for operational and
transaction processing. Using complex
queries to extract data and launching
bulk data loads mean extra work for CPU
and disk resources, resulting in delayed
processing for all users.

Complex setup and implementation.


Sometimes companies manage to deliver
data using complex, proprietary scripts and
programs that take months of IT time and
effort to develop and implement. With SLAs
to meet and business opportunities at risk of
being lost, most companies simply don't have the luxury of wading through this difficult
and time-consuming process.
Delays caused by writing data to disk.
When information is extracted from systems,
it is often sent to a staging area and then
relayed to the target to be loaded. Storage to
disk causes delays as data is written and then
read in preparation for loading.

decisions made without real-time data may


also be called into question.

THE ANSWER
There is a solution to overcoming this
challenge. Attunity beats the Big Data
bottleneck by providing high-performance
data replication and loading for the broadest
range of databases and data warehouses
in the industry. Its easy, Click-2-Replicate
design and unique TurboStream DX data
transfer and CDC technologies give it the
power to stand up to the largest bottlenecks
and win. Partner with Attunity. You too can
beat the data transfer bottleneck!

Proliferation of sources and targets.


With data that can reside in a variety of
transactional databases such as Oracle, SQL
Server, IBM Mainframe, and with newer
data warehouse targets such as Vertica,
Pivotal, Teradata UDA and Microsoft PDW
on the rise, setup time can increase and
performance can be lost using solutions that
are not optimized to each platform.
Limited Internet bandwidth. If source
and target systems are in different physical
locations, or if the target is in the cloud,
insufficient Internet bandwidth can be a
major cause of data replication lag. Most
networks are configured to handle general
operations but are not built for massive data
migrations.

HARSH REALITY
When timely information isn't available,
key decisions need to be deferred. This
can lead to lost revenues, decreased
competitiveness, or lower levels of customer
satisfaction. Additionally, the reliability of

Learn more!
Download this eBook by data
management expert, David Loshin:
Big Data Analytics Strategies
Beating the Data Transfer Bottleneck
for Competitive Gain
http://bit.ly/ATTUeBook

ATTUNITY
For more information,
visit www.Attunity.com
or call (800) 288-8648 (toll free)
+1 (781) 730-4070.


industry updates

The State of Cloud Technologies

Cloud Technologies
Are Maturing to Address Emerging
Challenges and Opportunities
By Chandramouli Venkatesan

Cloud technologies and frameworks


have matured in recent years, and enterprises are starting to realize the benefits of cloud adoption, including savings in infrastructure costs and a pay-as-you-go service model similar to Amazon Web Services. Here
is a look at the cloud market and its convergence with the big data market, including key
technologies and services, challenges, and
opportunities.

Evolution of Cloud Adoption


The technology, platform, and services
that were available in the early 1990s were
similar to the cloud adoption of the last
decade. We had distributed systems with Sun
RISC-based server workstations, IBM mainframes, millions of Intel-based Windows
desktops, Oracle Database Servers (including Grid Computing, 10g), and J2EE N-tier
architecture. There were application service
providers (ASPs), managed service providers
(MSPs), and internet service providers (ISPs)

offering services similar to cloud offerings


today. What has changed?
There were significant events that triggered the emergence of cloud offerings and their adoption. The first one was the Amazon Web Services (S3, EC2, RDS, SQS) scale-out and the development of IaaS (infrastructure as a service) once Amazon was able to realize the benefits of its offering for its own
internal use. The second major event was
the search engine, advertising platform, and
Google Big Table (memcache), and the realization that millions of nodes of cheap commodity hardware could be leveraged to harness MapReduce and other frameworks to distribute the search query and provide results at a millisecond response time (unheard of with
even mainframes).
In the mid-2000s, the traditional telecom and mobile phone service providers saw that they needed to move to scale-out platforms (the cloud) to manage their mobile customer base, which grew from a few million to a billion (a factor of 1,000).


The mobile data grew from a few terabytes
to petabytes and they needed newer scale-out
platforms and wanted on-premise as well as
hybrid cloud deployments.
The creators of Hadoop ran TeraSort
benchmarks with large clusters of nodes in
order to determine the benefits of MapReduce
frameworks. It resulted in the emergence of
the Hadoop Cluster Distribution; NoSQL data
stores such as columnar, document, and graph
databases; and massively parallel processing (MPP) analytical databases. An ecosystem of vendors emerged to reap the benefits of the
scale-out cloud infrastructure, MapReduce
frameworks, and Hadoop and NoSQL data
stores. The applications included data migration, predictive analytics, fraud detection, and
data aggregation from multiple data sources.
The new paradigm shift addressed the
key issue of scale as well as the handling of
unstructured data that was lacking in traditional relational databases. The paradigm


shift occurred as a result of the availability of


commodity hardware and a framework to run
massively parallel data processing across clusters of nodes, including a distributed file system, high-performance analytical databases, and NoSQL
data stores for handling unstructured data.

Hybrid Cloud
For enterprises that are adopting the hybrid (public/private/community) cloud pay-as-you-go model for IaaS, PaaS, and SaaS cloud deployments, the key drivers are cost, flexibility, and speed (time to set up hardware, software, and services). The primary use cases for
the new hybrid model include the ability to do
data migration, fraud detection, and the ability
to manage unstructured data in real time.
But the move to hybrid cloud deployment comes with new challenges and risks.
The biggest challenge for cloud deployments
today is in the area of data security and identity. There are several cloud providers who
offer IaaS, PaaS, SaaS, network as a service,
and everything as a service and probably
offer good firewalls to protect data within the boundaries of their data center. The challenges include data at rest, data in flight used in mobile devices accessing the cloud provider, and data derived from multiple cloud providers and the provision of a single view to
the mobile customer.

BYOD
Ubiquitous mobile computing is driving the new cloud adoption model faster than
anticipated and a key driver is BYOD (bring
your own device). The traditional IT shop
had control of its assets whether on-premise
or on cloud. However, the demands of BYOD
and the myriad mobile devices, applications,

and mobile stores have resulted in the IT


organization losing control of users' identities, as one can have more than one profile. The use of biometric information such as fingerprint and eye scans is still in its infancy for mobile users. There are some efforts
in standardization in cloud identity management such as OpenID Connect, OAuth, and
SIEM, but the adoption is slow, and it will
take time to work seamlessly across many
cloud providers.

Trust the Cloud Providers


The key security issue for cloud and mobility deployment is establishment of trust and
trust boundaries. There are several players
in the cloud and mobile deployments offering different services, and they need to work
seamlessly end-to-end. Trustworthiness is enabled by the ability to automatically sign off or hand off to another cloud/mobile service provider in the trust boundary and still
maintain the data integrity at each hand-off.
The automatic sign-off would need to verify
the validity of the cloud provider, protect
the identity of the users, as well as guarantee
the nontampering of content. The intermediate trust verification provider would also be a cloud provider, similar to the verification of ecommerce internet sites. The trust verification provider must support the SLAs for security, identity, and trust between mobile and
cloud service provider. The key requirement
is to ensure the integrity and trust between
mobile and cloud providers, inter-cloud, and
intra-cloud providers.
The mobile end user will have a trust
boundary with mobile/telecom service provider (cellular) or managed service provider
(Wi-Fi). The trust will be recorded, and some

portion of identity will be passed on to one or


more cloud providers offering different services. Each trust boundary will have a negotiation between mobile and cloud providers
or between cloud providers to establish the
identity, security, and integrity of data as well
as the mobile user.
The future of cloud is in the convergence
of simple standards for security, identity, and
trust, and it involves all participants in the
cloud: mobile device vendors; service providers; cloud IaaS, PaaS, and SaaS vendors; and
the network. The pay-as-you-go model would
have a price tag factored in for a minimum
SLA level in terms of guarantee and additional pricing based on additional levels of
security, including security locks at the CPU
boundary.

Cloud/Big Data Frameworks


In addition to cloud security, identity verification, and trust regarding data integrity,
the technology of cloud/big data frameworks
will have rapid changes and adoption in the
next few years. One such adoption is the
standardization of a query language for the
NoSQL data stores similar to SQL for relational database management systems. The
query language will result in query nodes that
accept incoming queries and in turn result
in distributed queries across the cluster of
nodes, handling all issues dealing with data,
including security, speed, and reliability of the
transaction.
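The scatter-gather pattern behind such a query node can be sketched briefly in Python. Everything here, the node list, the query shape, and the merge step, is illustrative only and not tied to any particular NoSQL product or emerging standard.

from concurrent.futures import ThreadPoolExecutor

# Illustrative scatter-gather: a query node fans a request out to the
# cluster, each node answers for its own shard of data, and the
# partial results are merged before returning to the caller.
CLUSTER_NODES = {
    "node-1": [{"user": "a", "clicks": 3}],
    "node-2": [{"user": "b", "clicks": 5}],
    "node-3": [{"user": "c", "clicks": 2}],
}

def query_shard(node_name, predicate):
    # Stand-in for a network call to one node's local data store.
    return [row for row in CLUSTER_NODES[node_name] if predicate(row)]

def distributed_query(predicate):
    with ThreadPoolExecutor(max_workers=len(CLUSTER_NODES)) as pool:
        partials = pool.map(lambda n: query_shard(n, predicate), CLUSTER_NODES)
        merged = [row for part in partials for row in part]
    return sorted(merged, key=lambda row: row["clicks"], reverse=True)

print(distributed_query(lambda row: row["clicks"] >= 3))

A real implementation would also have to address the security, speed, and reliability concerns noted above at each hop.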
The price per TB (not GB) of flash and
random access memory will drive the future
adoption of cloud/big data predictive analytics
and learning models. This is key in generating
value in different verticals such as healthcare,
education, energy, and finance. The ability


to keep big data in an in-memory cache in a


smaller footprint, including on mobile devices, will result in improvements in gathering, collecting, and storing data from trillions of mobile devices and in performing data, predictive, behavioral, and visual analytics in near-real time (microseconds). The key cloud adoption driver today is the number of cores per computing node. The future
of cloud adoption will involve a large memory
cache in addition to many cores per computing
node (commodity hardware).
The technology of Hadoop frameworks
has evolved since 2004, and it includes the
MapReduce framework, the Hadoop Distributed File System, and additional technologies.
There is a need for a "beyond Hadoop" framework, and the future of Hadoop will be built into the platform (iOS, Linux, Windows, etc.), similar to a task scheduler in a platform
OS. The new frameworks beyond Hadoop
will need to provide distributed query search
engines out of the box, the ability to easily
manage custom queries, and the ability to
provide a mechanism to have an audit trail
of data transformations end-to-end across
several mobile and cloud providers. The audit
trail or probe will be similar to a ping or traceroute command, and it should be available to
ensure the integrity of data for end-to-end
deployment.

Emerging Standards
There are several emerging standards
for cloud deployments, primarily to address
identity, security, and software-defined
networking (SDN). IaaS, PaaS, and SaaS
cloud deployments have matured, and there
are several players that coexist in the cloud
ecosystem today. Standards such as
OpenID, Open Connect, OAuth, and Open
Data Center Alliance have several cloud providers and enterprises signing up every day,
but the adoption will take a few more years
to evolve and mature. Open standards are the
key to the future adoption of cloud and the seamless flow of secure data among different cloud providers. This offers a paradigm
similar to a free market economy, which is a
goal, but in reality, the goal to be strived for
by future cloud players is about 60% open
standards and 40% proprietary frameworks


in order to promote competition and an even


playing field. Customers will demand faster adoption of open standards for cloud deployments, and the keys to adoption are speed, flexibility, cost, and a focus on solving their problems efficiently. The current approach of enterprises spending time and money in the evaluation, selection, and use of cloud providers will pave the way for pay-as-you-go cloud providers on demand for blended

services. There will be a blend of services leveraging mobile and cloud deployment, such as
single sign-on, presales, actual customer sale,
post-sales, recommendation systems, etc. The
cloud adoption of IaaS, PaaS, and SaaS will
give way to business models similar to prepay,
post-pay debit/credit cards for products and
services with cloud-ready offerings.
The cloud/big data deployments will
see the emergence of multiple data centers
managed by multiple cloud providers, and
the cloud will have to support distributed
query-based search, with results that can be
provided to the mobile user in near real time.
This would require open standards to allow
seamless data exchange between multiple
data centers, maintaining the SLA levels for
performance, scalability, security, and identity. It is a clear challenge and opportunity for
the future of cloud, but it is likely that new
mobile apps will drive the need for cooperation between cloud providers or result in consolidation of several players into a few mobile
and cloud providers.

Billing Systems for the Cloud


Future cloud deployments will require
both mobile and cloud provider payment
processing to keep pace with other aspects of
the cloud deployment model, such as security, scalability, cost savings, and reliability.


We would require, at a minimum, a billing
provider to provide platform billing and reconciliation of payments between cloud and
mobile service providers. The challenge of the
future for cloud-based billing providers is the
payment processing for a blend of services.
For example, payment of different rates for
providers in the cloud, such as device, mobile,
cloud infrastructure/platform provider, storage, network, and payment service providers.
The break-even and moderate margin for a
pay-as-you-go model in the cloud will be 40% cost and 60% revenue; the cost reduction over time would be a result of consolidation from both the mobile and cloud service providers offering integrated services. The pay-as-you-go business model, with SLA guarantees,
will be appealing for mom-and-pop stores
that want to adopt cloud services, coexist, and
compete with big retail stores, and will ultimately result in better service and lower cost
for the consumer.
The future of cloud deployments will
involve rapid adoption of new technology
frameworks beyond Hadoop, open standards
in the area of cloud security, identity, and
trust, as well as a universal and simple query
language for aggregating data from legacy and
emerging data stores. Future cloud adoption
will involve trillions of mobile devices, ubiquitous computing, zettabytes of data, and
improved SLAs between cloud providers, as
well as larger, cheaper memory cache and
multiple cores per computing node.
Chandramouli Venkatesan

(Mouli) has more than 20


years of experience in the
telecom industry, including
technical leadership roles
at Fujitsu Networks and
Cisco Systems, and as a
big data integration architect in the financial and healthcare industries. Venkatesan's company, MEICS, Inc. (www.meics.org), provides the analytics and learning platform for
cloud deployments. Venkatesan evangelizes
emerging technologies and platforms and
innovation in cloud, big data, mobility, and
content delivery networks.

sponsored content

Get Real With Big Data


Many organizations recognize
the value of generating insights from
their rapidly increasing web, social,
mobile and machine-generated data.
However, traditional batch analysis is
not fast enough.
If data is even one day old, the
insights may already be obsolete.
Companies need to analyze data in
near-real-time, often in seconds.
Additionally, no matter the timeliness
of these insights, they are worthless
without action. The faster a company
acts, the more likely there is a return on
the insight, such as increased customer conversion, loyalty, and satisfaction, or
lower inventory, manufacturing, or
distribution costs.
Companies seek technology solutions
that allow them to become real-time,
data-driven businesses, but have been
challenged by existing solutions.

LEGACY DATABASE OPTIONS


Traditional RDBMSs (Relational
Database Management Systems) such
as Oracle or IBM DB2 can support
real-time updates, but require expensive
specialized hardware to scale up to
support terabytes to petabytes of data.
At millions of dollars per installation,
this becomes cost-prohibitive quickly.
Traditional open source databases such as MySQL and PostgreSQL are unable
to scale beyond a few terabytes without
manual sharding. However, manual sharding
requires a partial rewrite of every application
and becomes a maintenance nightmare to
periodically rebalance shards.
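To illustrate what "a partial rewrite of every application" means in practice, here is a small, hypothetical example of the hash-based routing logic that manual sharding pushes into application code; the shard URLs are placeholders. Rebalancing shards means revisiting code like this everywhere it appears.

import hashlib

# Hypothetical manual-sharding helper: the application, not the
# database, decides which MySQL instance holds a given customer.
SHARDS = [
    "mysql://shard0.example.internal/orders",
    "mysql://shard1.example.internal/orders",
    "mysql://shard2.example.internal/orders",
]

def shard_for(customer_id):
    digest = hashlib.md5(str(customer_id).encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# Every query path has to route through logic like this; adding or
# rebalancing shards changes the mapping and the code around it.
print(shard_for(1001))
print(shard_for(1002))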
New Big Data technologies such as
Hadoop and HBase are cost-effective
platforms that are proven to scale from
terabytes to petabytes, but they provide
little or no SQL support. This lack of SQL
support is a major barrier to Hadoop
adoption and is also a major shortcoming
of NoSQL solutions, because of the massive
retraining required. Companies adopting
these technologies cannot leverage existing
investments in SQL-trained people, or SQL
Business Intelligence (BI) tools.

WITH SPLICE MACHINE,


COMPANIES CAN:
Unlock the Value of Hadoop. Splice
Machine provides a standard ANSI SQL engine, so any SQL-trained analyst or SQL-based application can unlock the value of
the data in a current Hadoop deployment,
across most major distributions.
Combine NoSQL and SQL. Splice
Machine enables application developers
to enjoy the best of both SQL and NoSQL,
bringing NoSQL scalability with SQL
language support.
Avoid Expensive Big Iron. Splice
Machine frees companies with specialized
server hardware from the spiraling costs of
scaling up to handle more than a few terabytes.
Scale Beyond MySQL. Splice Machine
can help those companies scale beyond
a few terabytes with the proven auto-sharding capability of HBase.
Future-proof New Apps. Splice
Machine provides a future-proof
database platform that can scale cost-effectively from gigabytes to petabytes
for new applications.

SPLICE MACHINE: THE REAL-TIME


SQL-ON-HADOOP DATABASE
Splice Machine brings the best of these
worlds together. It is a standard SQL
database supporting real-time updates and
transactions implemented on the scalable,
Hadoop distributed computing platform.
Designed to meet the needs of real-time,
data-driven businesses, Splice Machine is
the only transactional SQL-on-Hadoop
database. Like Oracle and MySQL, it is a
general-purpose database that can handle
operational (OLTP) or analytical (OLAP)
workloads, but can also scale out cost-effectively on inexpensive commodity
servers.
Splice Machine marries two proven
technology stacks: Apache Derby, a Java-based, full-featured ANSI SQL database, and
HBase/Hadoop, the leading platforms for
distributed computing.

SPLICE MACHINE ENABLES YOU


TO GET REAL WITH BIG DATA
As the only transactional SQL-on-Hadoop database, Splice Machine presents
unlimited possibilities to application
developers and database architects. Best of
all, it eliminates the compromises that have
been part of any Big Data database platform
selection to date.
Splice Machine is uniquely qualified to power applications that can harness real-time data to create more valuable insights
and drive better, more timely actions. This
enables companies that use Splice Machine
to become real-time, data-driven businesses
that can leapfrog their competition and get
real results from Big Data.
SPLICE MACHINE
info@splicemachine.com
www.splicemachine.com


industry updates

The State of Data Quality and Master Data Management

Data Quality and MDM


Programs Must Evolve to Meet
Complex New Challenges
By Elliot King

Data quality has been one of the central


issues in information management since the beginning; not the beginning of modern computing and the development of the corporate information infrastructure, but since
the beginning of modern economics and
probably before that. Data quality is what
audits are all about.
Nonetheless, the issues surrounding data
quality took on added importance with the
data explosion sparked by the large-scale integration of computing into every aspect of
business activity. The need for high-quality
data was captured in the punch-card days of the computer revolution with the epigram "garbage in, garbage out." If the data isn't good, the outcome of the business process that uses that data isn't good either.
Data growth has always been robust, and
the rate keeps accelerating with every new generation of computing technology. Mainframe
computers generated and stored huge amounts
of information, but then came minicomputers
and then personal computers. At that point,
everybody in a corporation and many people
at home were generating valuable data that was
used in many different ways. Relational databases became the repositories of information


across the enterprise, from financial data to


product development efforts, from manufacturing to logistics to customer relationships to
marketing. Unfortunately, given the organizational structure of most companies, frequently
data was captured in divisional silos and could
not be shared among different departments: finance and sales, for example, or manufacturing and logistics. Since data was captured
in different ways by different organizational
units, integrating the data to provide a holistic
picture of business activities was very difficult.
The explosion in the amount of structured
data generated by a corporation sparked two
key developments. First, it cast a sharp spotlight
on data quality. The equation was pretty simple.
Bad data led to bad business outcomes. Second,
efforts were put in place to develop master data
management programs so data generated by
different parts of an organization could be coordinated and integrated, at least to some degree.

Challenges to Data Quality and MDM


Efforts in both data quality and master data management have only been partially successful. Not only is data quality difficult to achieve, it is a difficult problem even to approach. In addition, the scope of the problem keeps broadening. Master data management presents many of the same challenges that data quality itself presents. Moreover, the complexity of implementing master data management solutions has restricted them to relatively large companies. At the bottom line, both data quality programs and master data management solutions are tricky to implement successfully, in part because, to a large degree, the impact of poor quality and disjointed data is hidden from sight. Too often, data quality seems to be nobody's specific responsibility.
Despite the difficulties in gathering corporate resources to address these issues, during the past decade the high cost of poor quality and poorly integrated data has become clearer, and a better understanding of what defines data quality, as well as a general methodology for implementing data quality programs, has emerged. The establishment of this general foundation for data quality and master data management programs is significant, particularly because the corporate information environment is undergoing a tremendous upheaval, generating turbulence as vigorous as that created by mainframe and personal computers.
The spread of the internet and mobile
devices such as smartphones and tablets is not

only generating more data than ever before, but many kinds of data, much of it largely unstructured or semistructured, have also become very important. The use of RFID and other kinds of sensor data has led to a data tsunami of epic proportions. Cloud computing has created an imperative for companies to integrate data from many different sources both inside and outside the corporation. And compliance with regulations in a wide range of industries means that data has to be held for longer periods of time and must be correct. In short, the basics for data quality and master data management are in place, but the basics are not nearly sufficient.

The Current Situation


In 2002, the Data Warehousing Institute estimated that poor data quality cost American businesses about $600 billion a year. Through the years, that figure has been the number most commonly bandied about as the price tag for bad data. Of course, the accuracy of such an eye-popping number covering the entire scope of American industry is hard to assess.
However, a more recent study of businesses in the U.K. presented an even starker picture. It found that as much as 16% of many companies' budgets is squandered because of poor data quality. Departments such as sales, operations, and finance waste on average 15% of their budgets, according to the study. That figure climbs to 18% for IT. And the number is even higher for customer-facing activities such as customer loyalty programs. In all, 90% of the companies surveyed said their activities were hindered by poor data.
When specific functional areas are assessed, the substantial cost that poor data quality exacts can become pretty clear. For example, contact information was one of the first targets for data quality programs. Obviously, inaccurate, incomplete, and duplicated address information hurts the results of direct marketing campaigns. In one particularly egregious example, a major pharmaceutical company once reported that 25% of the glossy brochures it mailed were returned. Not only are potential sales missed, current customers can be alienated. Marketing material that arrives in error somewhere represents sheer cost.
Marketing is only one area in which the impact of poor information is visible. One European bank found that 100% of customer complaints had their roots in poor or outright incorrect information. Moreover, this study showed, customers who register complaints are much more likely to shop for alternative suppliers than those who don't. The difference in churn between customers who complain, and whose complaints are rooted in poor data quality, and those who don't is a direct cost of poor data quality.
And the list goes on. Poor data quality in
manufacturing slows time to market, leads
to inventory management problems, and can
result in product defects. Bad logistics data can
have a material impact on both the front end
and back end of the manufacturing process.

The Benefits of Improving Data Quality


On the other side of the equation, improving data quality can lead to huge benefits. One company reported that improving the quality of data available to its call center personnel resulted in nearly $1 million in savings. Another realized $150,000 in billing efficiencies by improving its customer contact information.
As the cost/benefit equation of data quality has become more apparent, the need to define data quality has become more pressing. In addition to the core characteristics of accuracy and timeliness, the most concise expression of the attributes of high-quality data is consistency, completeness, and conciseness. Consistency means that each fact is represented in the same way across the information ecosystem.
For example, a date is represented by two digits
for the month, two for the day, and four for the
year and is represented in that order across the
informational ecosystem in a company. Moreover, the facts represented must be logical. An
order due date, for example, cannot be earlier
than an order placed date.
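The consistency rules just described lend themselves to a brief illustration. The sketch below is mine, not the author's; the record layout and field names (order_placed, order_due) are hypothetical. It checks both a shared date format and the logical constraint that an order cannot be due before it is placed.

# A minimal consistency-check sketch, assuming an enterprise-wide
# MM/DD/YYYY date standard and illustrative field names.
from datetime import datetime

DATE_FORMAT = "%m/%d/%Y"  # two-digit month, two-digit day, four-digit year

def parse_date(value):
    """Reject any date that does not match the agreed-upon format."""
    return datetime.strptime(value, DATE_FORMAT)

def check_order(record):
    """Return a list of consistency violations for a single order record."""
    try:
        placed = parse_date(record["order_placed"])
        due = parse_date(record["order_due"])
    except (KeyError, ValueError) as exc:
        return ["format error: %s" % exc]
    errors = []
    if due < placed:
        errors.append("order_due is earlier than order_placed")
    return errors

print(check_order({"order_placed": "03/01/2013", "order_due": "02/15/2013"}))
# ['order_due is earlier than order_placed']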
Maintaining consistency is more difficult than it may appear at first. Companies capture data in a multitude of ways. In many cases, customers are entering data via web forms, and both the accuracy and the consistency of the data can be an issue. Moreover, data is often imported from third-party source systems, which may use alternative formats to represent facts. Indeed, even separate operational units within a single enterprise may represent data differently.

Maintaining Data Consistency


Master data management is one approach companies have used to maintain data consistency. MDM technology consolidates, cleanses, and augments corporate data, synchronizing it among all applications, business processes, and analytical tools. Master data management tools provide the central repository for cross-referenced data in the organization, building a single view of organizational data.
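At its core, the "single view" idea means cross-referencing records from different systems under one key. The following is a minimal sketch, not a real MDM product; the source systems, record layouts, and matching rule (normalized email) are invented for illustration.

# A toy "golden record" builder: consolidate records from two hypothetical
# systems, keyed on a normalized email address, keeping cross-references
# back to each source system.
def normalize_email(email):
    return email.strip().lower()

def build_master(*sources):
    master = {}
    for system_name, records in sources:
        for rec in records:
            key = normalize_email(rec["email"])
            golden = master.setdefault(key, {"source_ids": {}})
            golden["source_ids"][system_name] = rec["id"]   # cross-reference
            for field, value in rec.items():
                if field not in ("id", "email") and value:
                    golden.setdefault(field, value)          # first non-empty value wins
    return master

finance = [{"id": "F-1", "email": "Ann@Example.com", "name": "Ann Smith", "phone": ""}]
sales = [{"id": "S-9", "email": "ann@example.com", "name": "A. Smith", "phone": "555-0100"}]
print(build_master(("finance", finance), ("sales", sales)))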
The second element of data quality is completeness. Different stakeholders in an organization need different information. For example, the academic records department in a university may be most interested in a student's grade point average, the courses in which the student is enrolled, and the student's progress toward graduation. The dean of students wants to know if the student is living on campus, the extracurricular activities in which the student participates, and any disciplinary problems the student has had. The bursar's office wants to know the scholarships the student has received

and the student's payment history. A good data system will not only capture all that information but also ensure that none of the key elements are missing.
The last element of good-quality data is conciseness. Information is flowing into organizations through several different avenues. Inevitably, records will be duplicated and information commingled, and nobody likes to receive three copies of the same piece of direct mail.
Because companies currently operate within such a dynamic information environment, no matter how diligent enterprises are, their systems will contain faulty, incorrect, duplicate, and incomplete information. Indeed, if companies do nothing at all, the quality of their data will degrade. Time decay is an ongoing, consistent cause of data errors. People move. They get married and change their names. They get divorced and change their names again. Corporate records have no way to keep up.
But time is only one of the root causes of bad data. Corporate change also poses a problem. As companies grow, they add new applications and systems, making other applications and systems obsolete. In addition, an enterprise may merge with or purchase another organization whose data is in completely different formats. Finally, companies are increasingly incorporating data from outside sources. If not managed correctly, each of these events can introduce large-scale problems with corporate data.
The third root cause of data quality problems is that old standby: human error. People already generate a lot of data and are generating even more as social media content and unstructured data become more significant. Sadly, people make mistakes. People are inconsistent. People omit things. People enter data multiple times. Inaccuracies, omissions, inconsistencies, and redundancies are hallmarks of poor data quality.
Given that data deterioration is an ongoing facet of enterprise information, for a data quality program to work, it must be ongoing and iterative. Modern data quality programs rest on a handful of key activities: data profiling and assessment, data improvement, data integration, and data augmentation.
In theory, data improvement programs are not complicated. The first step is to characterize or profile the data at hand and measure how closely it conforms to what is expected. The next step is to fix the mistakes. The third step


is to eliminate duplicated and redundant data.


Finally, data quality improvement programs
should address holes in the enterprise information environment by augmenting existing
data with data from appropriate sources. Frequently, data improvement programs do not
address enterprise data in its entirety but focus
on high-value, high-impact information used
in what can be considered mission-critical
business processes.
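The first of the steps just described, profiling, is easy to picture in code. The sketch below is a simplified illustration of my own, not a vendor tool; the column names and the ZIP code rule are hypothetical. It measures completeness, format conformance, and probable duplication before any fixes are attempted.

# A minimal data profiling sketch: report null rates, format violations,
# and likely duplicates for a small set of records.
import re
from collections import Counter

ZIP_RULE = re.compile(r"^\d{5}$")

def profile(rows):
    total = len(rows)
    missing = Counter()
    bad_zip = 0
    for row in rows:
        for col, value in row.items():
            if value in (None, ""):
                missing[col] += 1
        if row.get("zip") and not ZIP_RULE.match(row["zip"]):
            bad_zip += 1
    dupes = total - len({(r.get("name"), r.get("zip")) for r in rows})
    return {"rows": total,
            "missing_by_column": dict(missing),
            "invalid_zip_codes": bad_zip,
            "probable_duplicates": dupes}

rows = [
    {"name": "Acme Corp", "zip": "08055"},
    {"name": "Acme Corp", "zip": "08055"},   # duplicate
    {"name": "Widget Co", "zip": "ABCDE"},   # bad format
    {"name": "", "zip": "07974"},            # missing name
]
print(profile(rows))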

The Big Data Challenge

To date, most data quality programs have


been focused on structured data. But, ironically, while the tools, processes, and organizational structures needed to implement an
effective data quality program have developed,
the emergence of big data has the potential to
completely rewrite the rules of the game.
Though the term big data is still debated,
it represents something qualitatively new.
Big data does not just mean the explosion of
transactional data driven by the widespread
use of sensors and other data-generating
devices. It also refers to the desire and ability
to extract analytic value from new data types
such as video and audio. And it refers to the
trend toward capturing huge amounts of data
produced by the internet, mobile devices, and
social media.
The availability of more data, new types of data, and data from a wider array of sources has had a major impact on data analysis and business intelligence. In the past, people would identify a problem they wanted to solve and then gather and analyze the data needed to solve that problem. With big data, that workflow is reversed. Companies are realizing that they have access to huge amounts of new data (tweets, for example) and are working to determine how to extract value from that data.
Data quality programs will have to evolve to meet these new challenges. Perhaps the first step will be methods for developing appropriate metadata. In general, big data is complex and messy and can come from a variety of different sources, so good metadata is essential. Data classification, efficient data integration, and the establishment of standards and data governance will also be critical elements of data quality programs that encompass big data elements.
Ensuring data quality has been a serious challenge in many organizations. Frequently, data quality problems are masked. Business processes seem to be working well enough, and it is hard to determine beforehand what the return on investment in a data quality program would be. In addition, in many organizations, nobody seems to own responsibility for the overall quality of corporate data. People are responsible for, or are sensitive to, their own slice of the data pie but are not concerned with the overall pie itself.

What's Ahead
It should not be a surprise that in a recent survey of data quality professionals, two-thirds of the respondents felt the data quality programs in their organizations were only OK (that is, some goals were met) or poor. On the brighter side, however, 70% indicated that their company's management felt data and information were important corporate assets and recognized the value of improving their quality. On balance, however, data quality must be improved. In another survey, 61% of IT and business professionals said they lacked confidence in their company's data.
During the next several years, data quality professionals will face a series of complex challenges. Perhaps the most immediate is to be able to view data quality issues within their organizations holistically. Data generated by one division (marketing, let's say) may be consumed by another (manufacturing, perhaps). Data quality professionals need to be able to respond to the needs of both.
Second, data quality professionals must develop tools, processes, and procedures to manage big data. Since a lot of big data is also real-time data, data quality must become a real-time process integrated into the enterprise information ecosystem. And finally, and perhaps most important, data quality professionals will have to set priorities. Nobody can do everything at once.
Elliot King has reported on
IT for 30 years. He is the
chair of the Department of
Communication at Loyola
University Maryland, where
he is a founder of an M.A.
program in Emerging Media.
He has written six books and hundreds
of articles about new technologies. Follow
him on Twitter @joyofjournalism. He blogs at
emergingmedia360.org.

sponsored content

Big Data ... Big Deal?


Data is growing exponentially, at unprecedented speed. According to IDC
(International Data Corporation), by
2015, nearly 3 billion people will be online,
generating nearly 8 zettabytes of data.
Analyzing large data sets and leveraging
new data-driven strategies will be essential
for establishing competitive differentiation
in the foreseeable future.
Big Data represents a fundamental shift
in the way companies conduct business
and interact with customers. Deriving value
from data sets requires that companies
across all industries be aggressive about
data collection, integration, cleansing
and analysis.

DATA SOURCES
BEYOND THE TRADITIONAL
Enterprises understand the intrinsic
value in mining and analyzing traditional
data sources such as demographics,
consumer transactions, behavior models,
industry trends, and competitor information.
However, the age of Big Data and advanced
technologies necessitate the analysis of new
data universes, such as social media and
mobile technologies.
Social media is one of the major elements
driving the overall Big Data phenomenon.
Twitter streams, Facebook posts and
blogging forums flood organizations
with massive amounts of data. Successful
Big Data strategies include the adoption
of technologies to pull relevant social
media into a single stream and integrate
the information into the core functions
of the enterprise. Automated processes,
matching technology and filters extract
content and consumer sentiment. When
social stream data is cleansed and integrated
into a database, enterprises gain invaluable
information on customer insights,

competitive intelligence, product feedback,


and market trends.
Mobile technology is also contributing
to the data inux as mobile devices become
more powerful, networks run faster and
apps more numerous. According to a report
by Cisco, global traffic on data networks grew by 70% in 2012. The traffic on mobile data networks in 2012 (885 petabytes, or 885 quadrillion bytes) was nearly 12 times greater than total Internet traffic around the world in 2000. As consumer behavior shifts
to new digital technologies, enterprises
are in a prime position to take advantage
of opportunities such as location-based
marketing.
GPS technologies are much more precise,
allowing marketers to deliver targeted
real-time messaging based on a consumers
location. Geofencing, a technology gaining
popularity among industries such as retail,
establishes a virtual perimeter around a
real-world site. For example, geofences
may be set up around a storefront. When
a customer carrying a smart device enters
the area, the device emits geodata, allowing
companies to send locally-targeted content
and promotions. According to research by
Placecast, a company specializing in location-based services, one of every two consumers
visits a location after receiving an alert.

MANAGING BIG DATA


When properly managed, Big Data brings
big opportunities. Solid data management
processes and well-designed procedures for
data stewardship are crucial investments
for Big Data projects to be successful.
Structured and unstructured data must be
properly formatted, integrated and cleansed
to fully extract actionable and agile business
intelligence.
As the speed of business continues
to accelerate, data is generated instantly.

Traditional data quality batch processing is


no longer enough to fully sustain effective
operational decision-making. Integrating,
cleansing and analyzing data in real-time
allows a company to engage in opportunities
instantly. For example, using real-time data
processing, a company can personalize a customer's online website visit, enhancing the overall customer experience. Monitoring
of transactions in real-time also has
important benets for security. Security
threats can instantly be identied, such
as fraudulent activity or individuals on a
security watch list. The applications are
numerous. Corporations able to react to
information the fastest will have the greatest
competitive advantage.
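As a rough illustration of this kind of real-time screening, consider the sketch below. It is a simplified example of my own, not a DataMentors capability; the watch list, threshold, and account identifiers are invented. Each transaction is checked as it arrives, rather than in a nightly batch.

# A minimal streaming-check sketch: flag transactions immediately on arrival.
from collections import defaultdict

WATCH_LIST = {"acct-666"}      # hypothetical security watch list
VELOCITY_LIMIT = 3             # max transactions per account in the window
window_counts = defaultdict(int)

def screen(txn):
    """Return a list of alerts for one incoming transaction."""
    alerts = []
    if txn["account"] in WATCH_LIST:
        alerts.append("account on watch list")
    window_counts[txn["account"]] += 1
    if window_counts[txn["account"]] > VELOCITY_LIMIT:
        alerts.append("unusual transaction velocity")
    return alerts

stream = [{"account": "acct-123", "amount": 40.0}] * 4 + [{"account": "acct-666", "amount": 9.99}]
for txn in stream:
    alerts = screen(txn)
    if alerts:
        print(txn["account"], alerts)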
Big Data initiatives require planning
and dedication to be successful. According
to Gartner Predicts 2012 research, more
than 85% of Fortune 500 organizations
will be unable to effectively exploit Big
Data by 2015. Companies that successfully incorporate Big Data projects into the overall business strategy will gain significant returns, including better customer relationships, improved operational efficiency, identification
of marketing opportunities, security risk
mitigation, and more.

DATAMENTORS provides award-winning


data quality and database marketing
solutions. Offered as either a customer-premise installation or an ASP-delivered solution, DataMentors leverages
proprietary data discovery, analysis,
campaign management, data mining
and modeling practices to identify
proactive, knowledge-driven decisions.
Learn more at www.DataMentors.com,
including how to obtain a complimentary
customer database quality analysis.


industry updates

The State of Business Intelligence and Advanced Analytics

In Today's BI
and Advanced
Analytics World,
There Is Something
for Everyone
By Joe McKendrick

There is something for everyone within


today's generation of business intelligence and advanced analytics solutions. Built on open, flexible frameworks and designed for users who expect and need information at internet speeds, BI and analytics are undergoing their first revolutionary transformation since computers became mainstream business tools.
Not only are the tools evolving, end users
are evolving as well. People are demanding
more of their analytics solutions, but analytics are also changing the way people across
enterprises, from end-users to infrastructure
specialists to top-level executives, work and
run their businesses.

All About Choice


For today's data infrastructure managers charged with capturing, cleansing, processing, and storing data, the new BI/analytics world is all about choice, and lots of it. An array of
technologies and solutions is now surging into
the marketplace that offers smarter ways to
capture, manage, and store big data of all types
and volumes.
A company doesn't need to be an enterprise on the scale of a Google or eBay, turning


huge datasets into real-time insights on
millions of customers. Organizations of all
sizes are now getting into the game. In fact,
more than two-fifths of 304 data managers
surveyed from all types and sizes of businesses report they have formal big data initiatives in progress, with the goals of delivering predictive analytics, customer analysis,
and growing new business revenue streams
(2013 Big Data Opportunities Survey,
sponsored by SAP and conducted by Unisphere Research, a division of Information
Today, Inc., May 2013).
There are a variety of data infrastructure
tools and platforms that are paving the way to
big data analysis:
Open Source/NoSQL/NewSQL Databases:
Alternative forms of databases are filling the
need to manage and store unstructured data.
These new databases often hail from the open
source space, meaning that they are immediately available to administrators and developers for little or no charge. NewSQL databases
tend to be cloud-based systems. NoSQL (Not
only SQL)-based databases are designed
to store unstructured or nonrelational data.
There are four categories of NoSQL databases:

key-value stores (for the storage of schema-less


data); column family databases (storing data
within columns); graph databases (employing
structures with nodes, edges, and properties to
represent and store data); and document databases (for the simple storage and retrieval of
document aggregates).
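A library-free sketch of my own (not drawn from the article) may help make the four categories concrete. The customer data, keys, and column names below are invented; real stores such as Redis, Cassandra, MongoDB, or Neo4j add distribution, indexing, and query languages on top of these basic shapes.

# How the same customer fact might be shaped in each NoSQL category.

# Key-value store: an opaque value looked up by key.
kv = {"customer:42": '{"name": "Ann", "city": "Austin"}'}

# Column family: rows addressed by key, holding sparse named columns.
column_family = {"customer:42": {"profile:name": "Ann",
                                 "profile:city": "Austin",
                                 "orders:last_id": "o-981"}}

# Document store: self-describing nested documents, queryable by field.
documents = [{"_id": 42, "name": "Ann", "address": {"city": "Austin"},
              "orders": [{"id": "o-981", "total": 59.90}]}]

# Graph: nodes, edges, and properties for relationship-heavy queries.
graph = {
    "nodes": {"cust-42": {"type": "customer", "name": "Ann"},
              "prod-7": {"type": "product", "name": "Widget"}},
    "edges": [("cust-42", "BOUGHT", "prod-7", {"when": "2013-11-02"})],
}

print(column_family["customer:42"]["profile:city"])  # Austin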
Hadoop/MapReduce Open Source Ecosphere: Apache Hadoop, an open source
framework, is designed for processing and
managing big data stores of unstructured data,
such as log files. Hadoop is a parallel-processing framework, linked to the MapReduce
analytics engine, that captures and packages
both unstructured and structured data into
digestible les that can be accessed by other
enterprise applications.
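The MapReduce pattern itself is simple enough to show in a few lines. The sketch below is a toy, in-process illustration of my own, not Hadoop code: it counts status codes in made-up log lines, with the map step emitting key-value pairs, a shuffle grouping them by key, and the reduce step aggregating each group. Hadoop distributes exactly this pattern across a cluster.

# A toy MapReduce simulation over sample log lines.
from collections import defaultdict

def map_phase(line):
    parts = line.split()
    yield (parts[-1], 1)            # emit (status_code, 1)

def reduce_phase(key, values):
    return (key, sum(values))

log_lines = ["GET /index.html 200", "GET /missing 404", "POST /cart 200"]

# Shuffle: group intermediate pairs by key (Hadoop does this between phases).
groups = defaultdict(list)
for line in log_lines:
    for key, value in map_phase(line):
        groups[key].append(value)

print([reduce_phase(k, v) for k, v in groups.items()])
# [('200', 2), ('404', 1)]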
A survey of 298 data managers affiliated
with the Independent Oracle Users Group
(IOUG) has found that Hadoop adoption is
likely to triple during the coming years. At
the time of the survey, 13% of respondents
had deployed or were in the process of implementing or piloting Hadoop, with an additional 22% considering adoption of the open
source framework at some point in the future
(Big Data, Big Challenges, Big Opportunities:

2012 IOUG Big Data Strategies Survey, sponsored by Oracle and conducted by Unisphere
Research, September 2012).
Relational Database Management Systems: RDBMSs, on the market for close to 3
decades, structure data into tables that can
be cross-indexed within applications and are
increasingly being tweaked for the data surge
ahead. The IOUG survey finds nine out of
10 enterprises intend to continue using relational databases for the foreseeable future,
and it is likely that many organizations will
have hybrid environments with both SQL and
NoSQL running side by side.
Cloud: Cloud-based BI solutions offer
functionality on demand, along with more
rapid deployment, low upfront cost, and scalability. Many database vendors now support
data management and storage capabilities via
a cloud or software as a service environment.
In addition, other vendors are also optimizing their data products to be able to leverage
cloud resources, either as the foundation
of private clouds, or running in on-premises
server environments that also access application programming interfaces (APIs) or web
services for additional functions.
In another survey of 262 data managers,
37% say their organizations are either running private clouds (defined as on-demand shared services provided to internal departments or lines of business within enterprises) at full or limited scale, or are in pilot stages (Enterprise Cloudscapes, Deeper and More Strategic: 2012-13 IOUG Cloud Computing Survey, sponsored by Oracle and conducted by Unisphere Research, February 2013). This is up from 29% in 2010, the first year this survey was conducted. In addition, adoption of public clouds (defined as on-demand services provided by public cloud providers) is on the upswing. Twenty-six percent of respondents say they now use public cloud services either in full or limited ways, or within pilot projects. This is up by 86%


from the first survey in this series, conducted


in 2010, when 14% reported adoption.
In addition, 50% of private cloud users
report they run database as a service, up from
35% 2 years ago. Among public cloud users,
37% run database as a service, up from 12%
2 years ago.
Data Virtualization: Just as IT assets are
now offered through service layers via software as a service or platform as a service,
information can be available through a data
as a service approach. In tandem with the
rise of private cloud and server virtualization
within enterprises, there has been a similar
movement to data virtualization, or database as a service. By decoupling the database
layer from hardware and applications, users
are able to access disparate data sources from
anywhere across the enterprise, regardless of
location or underlying platform.
In-Memory Technologies: Many vendors are adding in-memory capabilities to offerings in which data and processing are moved into a machine's random access memory. In-memory eliminates what is probably the slowest part of data processing: pulling data off disks. In an environment with large datasets, scaling into the hundreds of terabytes, this becomes a bottleneck for rapid analysis, limiting the amount of data that can be analyzed at one time. Some estimate that the capacity of such systems can already go as high as that of large, disk-based databases; all the data stored in a RAID array could potentially be moved right into machine memory.
A recent survey of 323 data managers
demonstrates that in-memory technology is
poised for rapid growth. While in-memory is
seen within many organizations, it is mainly
focused on specific sites or pilot projects at
this time. A handful of respondents to the
survey, 5%, report the technology is currently
in widespread use across their enterprises,
while another 8% say it is in limited use across

more than three departments within their


organizations. Close to one-third, 31%, report
that they are either piloting or considering this
technology (Accelerating Enterprise Insights:
2013 IOUG In-Memory Strategies Survey,
sponsored by SAP and conducted by Unisphere Research, January 2013).

Technologies to Connect Data to Business


For quants, data analysts, data scientists,
and business users, the new BI/analytics
world is all about diving deep into datasets
and being able to engage in storytelling as a
way to connect data to the business.
There is a perception that developing
and supporting data scientist-type skill sets
require specially trained statisticians and
mathematicians supported by sophisticated
algorithms. However, with the help of tools
and platforms now widely available in today's
market, members of existing data departments
can also be brought up-to-speed and made
capable of delivering insightful data analysis.
Open Source: The revolutionary framework that broke open the big data analysis scene is Hadoop and MapReduce. One of the most potent tools in the quants' toolboxes is R, the open source, object-oriented analytics
language. R is rapidly deployable, tends to
be well-suited for building analytics against
large and highly diverse datasets, and has been
embedded in many applications. There are a
number of solutions that build upon R and
make the language easy to work with to visually manipulate data for the more effective
delivery of business insights.
Predictive Analytics: Predictive analytics
technology is a key mission awaiting quants,
data analysts, and data scientists. The technology is available; all it takes is a little imagination. For example, during the presidential
election in the fall of 2012, Nate Silver of The
New York Times put predictive analytics on the
map with his almost dead-on prediction of the
winning candidate. The same principles can

industry updates

The State of Business Intelligence and Advanced Analytics

be applied for more routine business problems, which potentially can uncover unforeseen outcomes. For example, one bank found
that its most profitable customers were not
high-wealth individuals, but rather those who
were not meeting minimums and overdrafting
accounts and thus anteing up fees. In another
case, an airline found that passengers specifying vegetarian preferences in their on-board
meals were less likely to miss flights. Or even counterintuitive findings, such as the dating
site that found people rated the most attractive
received less attention than average-looking
members. (Suitors felt they faced more competition with more attractive members.)
Programming Tools: A range of scripting and open source languages, including Python, Ruby, and Perl, also include extensions for parallel programming and machine
learning.

Opening Up Analytics to the Business


For business users, the new BI/analytics world is all about "analytics for all." There has been a growing movement to open up analytics across the organization, pushing these capabilities down to all levels of decision makers, including frontline customer service representatives, production personnel, and information workers. A recent survey of 250 data managers finds that in
most companies, fewer than one out of 10
employees have access to BI and analytic
systems (Opening Up Business Intelligence
to the Enterprise: 2012 Survey On Self-Service BI and Analytics, sponsored by Tableau Software and published by Unisphere
Research, October 2012).
Now, a new generation of front-end tools
is making this possible:
Visualization: Visual analytics is the new frontier for end-user data access. Data visualization tools provide highly graphic, yet relatively simple, interfaces that help end users dig deep into queries. This represents a departure from the ubiquitous spreadsheet (rows of numbers) as well as from static dashboards or PDF-based reports with their immovable variables.
Self-Service: There is a growing trend among
enterprises to enable end users to build or design
their own interfaces and queries. Self-service
may take the form of enterprise mashups, in
which end users build their own front ends that
are combined with one or more data sources,


or through highly configurable portals. According to the 2012 Tableau-Unisphere self-service


BI and analytics study, self-service BI is now
offered to some extent in half of the organizations surveyed.
Pervasive BI: Pervasive BI and analytics are increasingly being embedded within
applications or devices, in which the end user
is oblivious to the software and data feeds
running in the background.
Cloud: Many users are looking to the
cloud to support BI data and tools in a more
cost-effective way than on-premises desktop tools. Third-party cloud providers have
almost unlimited capacity and can support
and provide big data analytics in a way that
is prohibitive for most organizations. Cloud
opens up business intelligence and analytics
to more users (nonanalysts) within organizations. With the drive to make BI more ubiquitous, the cloud will only accelerate this move toward simplified access.
Mobile: Mobile technology, which is only
just starting to seep into the BI and analytics
realm, promises to be a source of disruption.
The availability of analytics on an easy-to-use
mobile app, for example, will bring analytics
to decision makers almost instantaneously.
With many employees now bringing their
own devices to work, analytics may be readily used by users who previously did not have
access to those capabilities.

The Opportunity to Compete on Analytics


For top-level executives, the new BI/
analytics presents opportunities to compete
on analytics. The ability to employ analytics
means understanding customers and markets
better, as well as spotting trends as they are
starting to happen, or before they happen.
As found in the Unisphere Research survey on big data opportunities, most executives instinctively understand the advantages
big data can bring to their operations, especially with predictive analytics and customer
analytics. A majority of the respondents with
such efforts under way, 59%, seek to improve
existing business processes, while another
41% are concerned with the need to create
new business processes/models.
BI and advanced analytics not only provide snapshots of aspects of the business, such as sales or customer churn, but also make it possible to apply key performance indicators against data to develop a picture of a business's overall performance.

What's Ahead
To compete in today's hyper-competitive global marketplace, businesses need to understand what's around the corner. Predictive
analytics technology enables this to happen,
and the new generation of tools incorporates
such predictive capabilities.
The ability to automate low-level decisions is freeing up organizations to apply
their mind power against tougher, more strategic decisions. These days, analytical applications are being embedded into processes
and applied against business rules engines to
enable applications and machines to handle
the more routine, day-to-day decisions that
come up: rerouting deliveries, extending
up-sell offers to customers, or canceling or
revising a purchase order.
Many organizations beginning their journey into the new BI and analytics space are
starting to discover all the possibilities it offers.
But, in an era in which data is now scaling into the petabyte range, BI and analytics are more than technologies. They are a disruptive force. And with disruption come new opportunities for growth. Companies interested in capitalizing on the big data revolution need to move forward with BI and analytics as a strategic and tactical part of their business road map. The benefits are profound, including vastly
accelerated business decisions and lower IT
costs. This will open new and often surprising
avenues to value.

Joe McKendrick is an author and independent
researcher covering innovation, information technology trends, and markets.
Much of his research work
is in conjunction with Unisphere Research, a division of Information
Today, Inc. (ITI), for user groups including
SHARE, the Oracle Applications Users Group,
the Independent Oracle Users Group, and the
International DB2 Users Group. He is also a
regular contributor to Database Trends and
Applications, published by ITI.


sponsored content

Five Key Pieces in the Big Data Analytics Puzzle
Big data continues to be a mystery
to many companies. Industry research
validates our experience that there are five major stages that companies go through when working with big data. We call these the 5 Es (Evading, Envisioning, Evaluating, Executing, and Expanding) of the big data
journey. Today, approximately 40% of
companies are still in the Evading stage,
waiting to get the clarity, means and purpose
for tackling big data.
To provide some clarity on the subject, here we present five essential technological means needed for this inevitable journey. If your purpose is to find measurable returns from big data, any one of these will be sufficient to begin tasting the value. When
blended together, these means will provide
an irresistible recipe for big data success.


1) ENABLE VISUAL SELF-EXPLORATION OF DATA
Human beings are visual creatures.
Big data analytics is all about seeing
relationships, anomalies and outliers present
in large quantities of data. Techniques for
advanced ways to graph, map and visualize
data, therefore, are a core requirement.
Secondly, visualizations need to be
intuitive and easy to work with. Business
users need the control to define what data
will be visualized and iterate through ideas
to determine the best visual representation.
They need the flexibility to share their
output through web browsers, mobile apps,
email, and other presentation modes.
Finally, the tools used need to be highly
responsive to a users needs. Effective
analysis can only happen when users move
uninterrupted at the speed-of-thought with
every exploration.

2) DEMOCRATIZE ADVANCED ANALYTICS
Big data has no voice without analytics. Often the reason to work with large quantities of low-level data is to apply sophisticated analytic models, which can tease out valuable insights not readily apparent in aggregated information.
In business, analytical modeling is the job of trained data scientists who use a variety of tools for developing these models. Frontline business users do not have such skills, but the everyday decisions they make can be vastly improved based on such big data insights. Challenges arise in this transfer of knowledge, since most tools don't typically talk to one another.
Organizations can enable data scientists and trained analysts to easily transfer business insights to frontline workers by adopting tools that expose the widest support for advanced analytics and predictive techniques, either natively or through open integration with other tools.

3) COMBINE DATA FROM MULTIPLE SOURCES
Organizations never keep all data in one place. Even with big data storage like Hadoop, businesses will be hard pressed to unify all data under one roof, owing to ever-proliferating systems. To date, IT has solved this problem by transforming and moving data between sources before analysis is conducted. In today's age, exponentially larger datasets make data movement virtually impossible, especially when organizations want to be more nimble but keep costs in check.
New technologies allow business users to blend data from multiple sources, in place, and without involving IT. IT can take this a step further by providing a scalable analytic architecture that masks data complexity while providing common business terminology. Such an architecture will easily facilitate analyses that span customer information, sales transactions, cost data, service history, marketing promotions, and more.

4) GIVE STRUCTURE TO
ACTIONABLE UNSTRUCTURED DATA
Unstructured data accounts for 80% of all data in a business. It typically comprises text-heavy formats like internal documents, service records, web logs, emails, etc.

First, unstructured data has to be


structured to enable any analysis. While
trained analysts can do this interactively
at small scale, larger scale and general
access would demand an offline process.
Second, analysis of unstructured data will
often be useful only in conjunction with
other structured enterprise data. Third, the
insights from such analyses can be quite
amorphous. Unless businesses can take
concrete action based on the insights from
a certain unstructured source, its ROI will
be hard to justify.
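The first step above, imposing structure on text-heavy records, is easy to sketch. The example below is a simplified illustration of my own, not a product feature; the log format, field names, and the "refund" flag are hypothetical. Once text is turned into rows like these, it can be joined with structured enterprise data.

# A minimal text-structuring sketch: extract fields from free-text records.
import re

PATTERN = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) ticket (?P<ticket>\d+): (?P<body>.+)")

def structure(raw_records):
    rows = []
    for text in raw_records:
        match = PATTERN.match(text)
        if not match:
            continue                      # route unparseable text for manual review
        row = match.groupdict()
        row["mentions_refund"] = "refund" in row["body"].lower()
        rows.append(row)
    return rows

raw = ["2013-11-02 ticket 8841: customer requests a refund for order o-981",
       "free-form note with no recognizable structure"]
print(structure(raw))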

5) SET UP CONNECTIVITY TO REAL-TIME DATA
Not all big data use cases lend themselves
to real-time analysis. But some do. When
decisions need to be taken in real-time (or
near real-time), this capability becomes a
key success factor. Analytic solutions for
financial trading, customer service, logistics planning, etc., can all be beneficiaries of tying
live actual data to historical information or
forecasted outcomes.
In the end, big data analytics initiatives
are very much like traditional business
intelligence initiatives. These five technological needs demand significantly greater emphasis for your big data journey. Will you stop
evading it now?
MICROSTRATEGY To learn how
MicroStrategy can help craft solutions
for your big data analytics needs, visit
microstrategy.com/bigdatabook.


industry updates

The State of Social Media

Social Media
Analytic Tools
and Platforms
Offer Promise
By Peter J. Auditore

Social media networks are creating large


datasets that are now enabling companies and
organizations to gain competitive advantage
and improve performance by understanding customer needs and brand experience
in nearly real time. These datasets provide
important insights into real-time customer
behavior, brand reputation, and the overall customer experience. Intelligent or data analysis-driven organizations are now monitoring, and some are collecting, this data from proprietary social media networks, such
as Salesforce Chatter and Microsoft Yammer
and open social media networks such as
LinkedIn, Twitter, Facebook, and others.
The majority of organizations today are
not harvesting and staging data from these
networks but are leveraging a new breed of


social media listening tools and social analytics platforms. Many are tapping their public
relations agencies to execute this new business
process. Smarter data-driven organizations
are extrapolating social media datasets and
performing predictive analytics in real time
and in-house.
There are, however, significant regulatory issues associated with harvesting, staging, and hosting social media data. These
regulatory issues apply to nearly all data
types in regulated industries such as healthcare and financial services in particular.
The SEC and FINRA, along with Sarbanes-Oxley, require different types of electronic communications to be organized, indexed in
a taxonomy schema, and then be archived
and easily discoverable over defined time

periods. Data protection, security, governance, and compliance have entered an


entirely new frontier with the introduction and
management of social data.
This article provides a broad overview of the current state of analytical tools and platforms that enable accelerated and real-time decision making based on customer data. Social media is driving organizational demand for insights on everything related to the customer, in addition to BI and analytics tools. Providing enterprise BI that includes social analytics will be a significant challenge for many enterprises in the near future. This is one of the primary reasons for the success of the new wave of innovative and easy-to-use BI and social media analytical tools within the last several years.


Analytic Tools Overview


In the beginning, there were SPSS and SAS Institute, the first analytical and statistical platforms to be computerized and go mainstream in the early 1980s. There is no way, in my view, you can talk about anything analytical without mentioning them. When I was a young marine scientist, these were the first DOS-based analytical tools we used to do basic statistical analysis, in addition to rudimentary predictive analytics employed to forecast fisheries populations.
During the last 40 years, these platforms
evolved to include a host of new capabilities
and functionality and are now considered
business intelligence tools. For the last 20
years, the majority of business intelligence tools accessed structured datasets in various databases; however, now that nearly 80% of enterprise data is unstructured, many of the BI platforms incorporate sophisticated enterprise search capabilities that rely on metadata, inferences, and connections to multiple data sources. The vast majority of social media data is unstructured, as we know, and this presents significant challenges to many organizations in its overall management: collection, staging, archiving, analysis, governance, and security.
Many organizations today are leveraging their legacy business intelligence tools and platforms to perform analysis on social media datasets, in addition to the use of sophisticated tagging and automated taxonomy tools that make search (finding the right contents
and/or objects) easier. The most basic and
easy analytical tool used by nearly everyone
is a simple alert, which combs/crawls the web
for topics related to your alert criteria.


Modern capabilities of business intelligence tools and platforms include:

Enterprise Search: structured and unstructured data
Ad Hoc Query Analysis and Reporting
OLAP, ROLAP, MOLAP
Data Mining
Predictive and Advanced Analytics
In-Database Analytics
In-Memory Analytics
Performance Management Dashboards
Advanced Visualization, Modeling,
Simulation, and Scenario Planning
Cloud and Mobile BI

Cloud-Based and Mobile BI and the New Innovative Business Intelligence Tools
Within the last several years, a new class
of BI tools has emerged including some open
source and cloud-based platforms/tools, some
of which are specialized for specific vertical market segments or business processes. They are easy to use, highly collaborative via workflow, and some include standard and custom
reporting in addition to including some rudimentary ETL tools. Mobile BI is one of the
fastest growing areas; however, many legacy
vendors have been slow to develop applications for BYOD, especially tablets.
These new products have innovative
semantic layers and new ways of visualizing
data, both structured and unstructured. In
some cases, these new tools tout the fact that
they can work with any database and don't
require the building of a data warehouse or
data mart but provide access to any data anywhere. Innovative visualization dashboard
platforms and implementations have been
very attractive to business managers and have

found their way into many organizations, in


some cases, without the knowledge of the IT
department.

In-Memory
In-memory database technology, the next
major innovation in the world of business
intelligence and social media analytics,
is the game changer that will provide the
unfair advantage that leads to the competitive advantage every CEO wants today.
In-memory technologies and built-in analytics are beginning to play major roles in
social analytics. The inherent business value
of in-memory technology revolves around
the ability to make real-time decisions based
on accurate information about seminal business processes such as social media.
The ability to know and understand the
customer experience is paramount in the new
millennium as organizations strive to improve
customer service, keep customers loyal, and
gain greater insights into customer purchasing
patterns. This has become even more important as a result of social media and social media
networks that are now the new word-of-mouth platforms. In-memory promises to
provide real-time data not only from transactional systems but also to allow organizations
to harvest and manage unstructured data
from the social media sphere.

Predictive Analytics and Graph Databases


Graph databases are sometimes faster
than SQL and greatly enhance and extend
the capabilities of predictive analytics by
incorporating multiple data points and
interconnections across multiple sources
in real time. Predictive analytics and graph


databases are a perfect fit for the social media


landscape where various data points are
interconnected.
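To see why interconnected data points favor a graph representation, consider the following library-free sketch of my own (the users and edges are invented). Once followers and the accounts they follow are modeled as nodes and edges, ranking likely influencers can be as simple as counting inbound connections; production graph databases add richer centrality measures and real-time traversal on top of this idea.

# A minimal graph sketch: rank candidate influencers by in-degree.
from collections import Counter

# Directed edges: (follower, followed) -- sample data only.
edges = [("bob", "ann"), ("carol", "ann"), ("dave", "ann"),
         ("ann", "carol"), ("dave", "carol")]

in_degree = Counter(followed for _, followed in edges)
for user, score in in_degree.most_common():
    print(user, score)
# ann 3, carol 2 -> ann is the best-connected candidate influencer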
Social media analytic tools enable businesses and organizations to enhance:
Brand and sentiment analysis
Identification and ranking of key influencers
Campaign tracking and measurement
Product launches
Product innovation through
crowdsourcing
Digital channel influence
Purchase intent analysis
Customer care
Risk management
Competitive intelligence
Partner monitoring
Category analysis

The Social Media Listening Centers


Many organizations are just starting to
use social data, few are at the forefront, and
most are using off-the-shelf vendor products
to create social media listening/monitoring
centers. These platforms operate in real time
and visually display sentiment and brand
analysis for products and services. The majority of organizations today are at this stage of
social analytics, and again, few appear to be
collecting, staging, and archiving data for
further analysis and predictive analytics.
Monitoring and performing predictive
analytics on social media datasets are the most
obvious and common uses of analytic solutions today. Many solutions use natural language processing in the indexing and staging of
social media data. Predictive analytics enable
a wide array of business functions including
marketing, sales, product development, competitive intelligence, customer service, and
human resources to identify common and
unusual patterns and opportunities in the
unstructured world of social media data.

Social Media Analytical Tools


Social media analytical tools identify and
analyze text strings that contain targeted
search terms, which are then loaded into
databases or data staging platforms such as
Hadoop. This can enable database queries, for
example, by date, region, keyword, or sentiment.
This can then enable insights and analysis into
customer attitudes toward brand, product,


services, employees, and partners. The majority of products work at multiple levels and drill
down into conversations with results depicted
in customizable charts and dashboards.
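The basic pipeline described above, matching targeted terms and attaching a sentiment label before the results are loaded into a database or Hadoop, can be sketched in a few lines. This is an illustration of my own, not any vendor's engine; the term lists and sample posts are invented, and real tools use natural language processing models rather than simple word lists.

# A minimal social media filtering and sentiment-tagging sketch.
SEARCH_TERMS = {"acmephone"}
POSITIVE = {"love", "great", "awesome"}
NEGATIVE = {"broken", "hate", "terrible"}

def tag(post):
    words = set(post.lower().split())
    if not words & SEARCH_TERMS:
        return None                                    # not about our brand
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    sentiment = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    return {"text": post, "sentiment": sentiment}

stream = ["I love my new AcmePhone",
          "AcmePhone screen broken again",
          "nice weather today"]
hits = []
for post in stream:
    tagged = tag(post)
    if tagged:
        hits.append(tagged)
print(hits)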
Often analytic results are provided in
customizable charts and dashboards that are
easy to visualize and interpret and can be
shared on enterprise collaborative platforms
for decision makers. Some social media analytic platforms integrate easily with existing
analytic platforms and business processes to
help you act on social media insights, which
can lead to improved customer satisfaction,
enhanced brand reputation, and can even
enable your organization to anticipate new
opportunities or resolve problems.


On the bleeding edge of social media


analytics is a new wave of tools and highly
integrated platforms that have emerged to
provide not only social media listening tools
but also enable organizations to understand
content preferences (or content intelligence) by affinity groups and the brands they are following or trending. Some of the innovators taking social media data to a new level include Attensity, InfiniGraph, Brandwatch, Bamboo, Kapow, Crimson Hexagon, Sysomos, Simply
Measured, NetBase, and Gnip.

Current Use of Social Media BI Tools


In 2012, the SHARE users group and
Guide SHARE Europe conducted a Social
Media and Business Intelligence Survey, produced by Unisphere Research, a division of
Information Today, Inc., and sponsored by
IBM and Marist College. The survey, which
examined the current state of social media
data monitoring and collection and use of
business intelligence tools in more than 500
organizations, found that IBM, SAS, Oracle,
and SAP were the entrenched BI platform
market leaders. The majority of the sample

base indicated that they were not using thirdparty BI tools for social media analytics.

What's Ahead
The 2012 social media and BI survey data
still provide a relevant picture of the state of
social media analytics. A majority of organizations will leverage legacy business intelligence vendors with familiar semantic layers
to perform rudimentary social media data
analysis. The big issue is that line-of-business managers will not wait for nonagile IT
departments to collect, harvest, stage/build,
and perform analytics on new social media
data marts or data warehouses.
New bleeding-edge social media analytical platforms are addressing the needs of
line-of-business professionals in real time.
They are also leveraging the economics of utility computing and the cloud to bring cost-effective analytical platforms to nearly all organizations. These highly integrated platforms
include simple social media listening tools,
along with embedded analytics and predictive
analytics that incorporate content and sometimes advertising abilities to meet the needs of
modern digital marketers. There are also other
new vendors that specialize in collecting and
delivering raw social media for those organizations which are building their own in-house
social media analytics platforms.
Traditionally, marketing has always had four Ps. Today, marketing has five Ps: product, place, position, price, and people, because in this millennium, the social media network is the new platform for word-of-mouth marketing.
Peter J. Auditore is currently

the principal researcher at


Asterias Research, a boutique consultancy focused
on information management, traditional and social
analytics, and big data
(www.thedatadog@wordpress.com).
Auditore was a member of SAP's Global Communications team for 7 years and most recently head of the SAP Business Influencer Group. He is a veteran of four technology startups: Zona Research (co-founder);
Hummingbird (VP, marketing, Americas);
Survey.com (president); and Exigen Group
(VP, corporate communications).

sponsored content

Data Virtualization Brings Velocity and Value to Big Data Analytics
Big data analytic opportunities are
abundant, with business value the driver.
According to Professors Andrew McAfee and Erik Brynjolfsson of MIT: "Companies that inject big data and analytics into their operations show productivity rates and profitability that are 5% to 6% higher than those of their peers."

DATA IS THE LIFEBLOOD OF ANALYTICS
Enterprises, flooded with a deluge of data about their customers, prospects, business processes, suppliers, partners, and competitors, understand data's critical role
as the lifeblood of analytics.
THE ANALYTIC DATA CHALLENGE
However, integrating data consumes the better half of any analytic project, as variety and volume complexity constrain progress.
Diverse data types: In the past, most analytic data was tabular, typically relational. That changed with the rise of web services and other non-relational and big data sources. Analysts must now work with multiple data types, including tabular, XML, key-value pairs, and semi-structured log data.
Multiple interfaces and protocols: Accessing data is now more complicated. Before, analysts used ODBC to access a database or a spreadsheet. Now, analysts must access data through a variety of protocols, including web services via SOAP or REST, Hadoop data through Hive, and other types of NoSQL data via proprietary APIs.
Larger data sets: Data sets are significantly larger. Analysts can no longer assemble all data in one place, especially if that place is their desktop. Analysts must be able to work with data where it is, intelligently subsetting it and combining it with relevant data from other high-volume sources.
Iterative analytic methods: Exploration and experimentation define the analytic process. Finding, accessing, and pulling together data is difficult on its own, and continuous updating and reassembling of data sets is also a must-have.

CONSOLIDATING EVERYTHING,
SLOW AND COSTLY
Providing analytics with the data required has always been difficult, with data integration long considered the biggest
bottleneck in any analytics or BI project.
No longer is consolidating all analytics
data into a data warehouse the answer.
When you need to integrate data from
new sources to perform a wider, more
far-reaching analysis, does it make sense
to create yet another silo that physically
consolidates other data silos?
Or is it better to federate these silos
using data virtualization?
DATA VIRTUALIZATION
TO THE RESCUE
Cisco's Data Virtualization Suite addresses your difficult analytic data challenges.
Rapid Data Gathering Accelerates Analytics Impact: Cisco's nimble data discovery and access tools make it faster and easier to gather together the data sets each new analytic project requires.
Data Discovery Addresses Data Proliferation: Data discovery automates entity and relationship identification, accelerating data modeling so your analysts can better understand and leverage your distributed data assets.
Query Optimization for Timely Business Insight: Optimization algorithms and techniques deliver the timely information your analytics require.
Data Federation Provides the Complete Picture: Virtual data integration in memory provides the complete picture without the cost and overhead of physical data consolidation.
Data Abstraction Simplifies Complex Data: Data abstraction transforms data from native structures to common semantics your analysts understand.
Analytic Sandbox and Data Hub Options Provide Deployment Flexibility: Data virtualization supports your diverse analytic requirements, from ad hoc analyses via sandboxes to recurring analyses via data hubs.
Data Governance Maximizes Control: Built-in governance ensures data security, data quality, and 7x24 operations to balance business agility with needed controls.
Layered Data Architecture Enables Rapid Change: Loose coupling and rapid development tools provide the agility required to keep pace with your ever-changing analytic needs.

CONCLUSION
The business value of analytics has never been greater. But data volumes and variety impact the velocity of analytic success. Data virtualization helps overcome data challenges to fulfill critical analytic data needs significantly faster, with far fewer resources, than other data integration techniques.
Empower your people with instant access to all the data they want, the way they want it
Respond faster to your changing analytics and business intelligence needs
Reduce complexity and save money
Better analysis equals business advantage. So take advantage of data virtualization.

LEARN MORE
To learn more about Cisco's data virtualization offerings for big data analytics, visit www.compositesw.com.


industry updates

The State of Data Integration

Big Data Is Transforming the Practice of Data Integration
By Stephen Swoyer

Big data is transforming both the scope and the practice of data integration. After all, the tools and methods of classic data integration evolved over time to address the requirements of the data warehouse and its orbiting constellation of business intelligence tools. In a sense, then, the single biggest change wrought by big data is a conceptual one: Big data has displaced the warehouse from its position as the focal point for data integration.
The warehouse remains a critical system and will continue to service a critical constituency of users; for this reason, data integration in the context of data warehousing and BI will continue to be important. Nevertheless, we now conceive of the warehouse as just one system among many systems, as one provider in a universe of providers. In this respect, the impact of big data isn't unlike that of the Copernican Revolution: The universe, after Copernicus, looked a lot bigger. The same can be said about data integration after big data: The size and scope of its projects (to say nothing of the problems or challenges it's tasked with addressing) look a lot bigger.
This isn't so much a function of the bigness of big data (its celebrated volumes, varieties, or velocities) as of the new use cases, scenarios, projects, or possibilities that stem


from our ability to collect, process, and, most important, to imaginatively conceive of big data management. To say that big data is the sum of its volume, variety, and velocity is a lot like saying that nuclear power is simply and irreducibly a function of fission, decay, and fusion. It is to ignore the societal and economic factors that, for good or ill, ultimately determine how big data gets used. In other words, if we want to understand how big data has changed data integration, we need to consider the ways in which we're using, or in which we want to use, big data.

Big Data Integration in Practice

In this respect, no application, no use case, is more challenging than that of advanced analytics. This is an umbrella term for a class of analytics that involves statistical analysis, machine learning, and the use of new techniques such as numerical linear algebra. From a data integration perspective, what's most challenging about advanced analytics is that it involves the combination of data from an array of multistructured sources. Multistructured is a category that includes structured hierarchical databases (such as IMS or ADABAS on the mainframe or, a recent innovation, HBase on Hadoop); semistructured sources (such as graph and network databases, along with human-readable formats, including JSON, XML, and text documents); and a host of so-called unstructured file types: documents, emails, audio and video recordings, etc. (The term unstructured is misleading: Syntax is structure; semantics is structure. Understood in this context, most so-called unstructured artifacts, such as emails, tweets, PDF files, even audio and video files, have structure. Much of the work of the next decade will focus on automating the profiling, preparation, analysis, and, yes, integration of unstructured artifacts.)
If all of this multistructured information is to be analyzed, it needs to be prepared; however, the tools and techniques required to prepare multistructured data for analysis far outstrip the capabilities of the handiest tools (e.g., ETL) in the data integration toolset. For one thing, multistructured information can't efficiently, or, more to the point, cost-effectively, be loaded into a data warehouse or OLTP database. The warehouse, for example, is a schema-mandatory platform; it needs to store and manage information in terms of facts or dimensions. It is most comfortable speaking SQL, and to the extent that information from nonrelational sources (such as hierarchical
databases, sensor events, or machine logs) can be transformed into tabular format, it can be expressed in SQL and ingested by the data warehouse. But what about information from all multistructured sources?
Enter the category of the NoSQL data store, which includes a raft of open source software (OSS) projects, such as the Apache Cassandra distributed database, MongoDB, CouchDB, and, last but not least, the Hadoop stack. Increasingly, Hadoop and its Hadoop Distributed File System (HDFS) are being touted as an all-purpose landing zone or staging area for multistructured information.

ETL Processing and Hadoop

Hadoop is a schema-optional platform; it can function as a virtual warehouse, i.e., as a general-purpose storage area, for information of any kind. In this respect, Hadoop can be used to land, to stage, to prepare, and, in many cases, to permanently store data. This approach makes sense because Hadoop comes with its own baked-in data processing engine: MapReduce.
For this reason, many data integration vendors now market ETL products for Hadoop. Some use MapReduce itself to perform ETL operations; others substitute their own ETL-optimized libraries for the MapReduce engine. Traditionally, programming for MapReduce is a nontrivial task: MapReduce jobs can be coded in Java, Pig Latin (the high-level language used by Pig, a platform designed to abstract the complexity of the MapReduce engine), Perl, Python, and (using open source libraries) C, C++, Ruby, and other languages. Moreover, using MapReduce as an ETL technology also presupposes a detailed knowledge of data management structures and concepts. For this reason, ETL tools that support Hadoop usually


generate MapReduce jobs in the form of Java code, which can be fed into Hadoop. In this scheme, users design Hadoop MapReduce jobs just like they'd design other ETL jobs or workflows: in a GUI-based design studio.
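For readers who have not seen the hand-coded alternative, here is a minimal, hypothetical Hadoop Streaming mapper written in Python; the web-log field layout is invented for illustration and is not drawn from any particular tool:

#!/usr/bin/env python3
# Hypothetical Hadoop Streaming mapper for a simple ETL step: read raw
# web-log lines on stdin, drop malformed records, and emit cleansed
# tab-separated rows on stdout. Field positions are assumed, not real.
import sys

def clean(line):
    parts = line.rstrip("\n").split(" ")
    if len(parts) < 3:
        return None                      # drop malformed records
    ip, timestamp, url = parts[0], parts[1], parts[2]
    return "\t".join([ip, timestamp.strip("[]"), url.lower()])

if __name__ == "__main__":
    for line in sys.stdin:
        row = clean(line)
        if row is not None:
            print(row)

A map-only job of this kind would be submitted with the Hadoop Streaming jar, pointing its -input, -output, and -mapper options at the relevant HDFS paths and script.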
The benefits of doing ETL processing in Hadoop are manifold: For starters, Hadoop is a massively parallel processing (MPP) environment. An ETL workload scheduled as a MapReduce job can be efficiently distributed, i.e., parallelized, across a Hadoop cluster. This makes MapReduce ideal for crunching massive datasets, and, while the sizes of the datasets used in decision support workloads aren't all that big, those used in advanced analytic workloads are. From a data integration perspective, they're also considerably more complicated, inasmuch as they involve a mix of analytic methods and traditional data preparation techniques.
Let's consider the steps involved in an analysis of several hundred terabytes of image or audio files sitting in HDFS. Before this data can be analyzed, it must be profiled; this means using MapReduce (or custom-coded analytic libraries) to run a series of statistical and numerical analyses, the results of which will contain information about the working dataset. From there, a series of traditional ETL operations, performed via MapReduce, can be used to prepare the data for additional analysis.
There's still another benefit to doing ETL processing in Hadoop: The information is already there. And Hadoop has an adequate, though by no means spectacular, data management toolset. For example, Hive, an interpreter that compiles its own language (HiveQL) into Hadoop MapReduce jobs, exposes a SQL-like query facility; HBase is a hierarchical data store for Hadoop that supports high user concurrency levels as well as basic insert and update operations. Finally, HCatalog is a primitive metadata catalog for Hadoop.
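As a rough sketch of that SQL-like facility, the snippet below submits a HiveQL aggregation from Python by shelling out to the Hive command line with "hive -e"; the weblogs table and its columns are hypothetical, and a working Hive installation is assumed:

# Submit a HiveQL query via the Hive CLI; Hive compiles it into MapReduce jobs.
# The "weblogs" table and its columns are invented for illustration.
import subprocess

HIVEQL = """
    SELECT url, COUNT(*) AS hits
    FROM weblogs
    WHERE action = 'login'
    GROUP BY url
    ORDER BY hits DESC
    LIMIT 10
"""

result = subprocess.run(["hive", "-e", HIVEQL], capture_output=True, text=True)
print(result.stdout)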

Data Integration Use Cases

Right now, most data integration use cases involve getting information out of Hadoop. This is chiefly because Hadoop's data management feature set is primitive compared to those of more established platforms. Hadoop, for example, isn't ACID-compliant. In the advanced analytic example cited above, a SQL platform, not Hadoop, would be the most likely destination for the resultant dataset. Almost all database vendors and a growing number of analytic applications boast connectivity of some kind into Hadoop. Others promote the use of Hadoop as a kind of queryable archive. This use case could involve using Hadoop to persist historical data, e.g., cold or infrequently accessed data that (by virtue of its sheer volume) could impact the performance or cost of a data warehouse. Still another emerging scenario involves using Hadoop as a repository in which to persist the raw data that feeds a data warehouse. In traditional data integration, this data is often staged in a middle tier, which can consist of an ETL repository or an operational data store (ODS). On a per-gigabyte or per-terabyte basis, both the ETL and ODS stores are more expensive than Hadoop. In this scheme, some or all of this data could be shifted into Hadoop, where it could be used to (inexpensively) augment analytic discovery (which prefers denormalized or raw data) or to assist with data warehouse maintenance, e.g., in case dimensions are added or have to be rekeyed.
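One hedged sketch of the queryable-archive pattern: the script below drives Apache Sqoop from Python to copy cold warehouse rows into HDFS. The JDBC connection string, table name, and date cutoff are hypothetical, and the flags should be checked against the Sqoop version in use:

# Copy cold, infrequently accessed warehouse rows into HDFS with Apache Sqoop.
# Connection details, table, and cutoff date are invented for illustration.
import subprocess

cmd = [
    "sqoop", "import",
    "--connect", "jdbc:oracle:thin:@dw-host:1521/DW",    # hypothetical warehouse
    "--username", "etl_user",
    "--password-file", "/user/etl/.pw",
    "--table", "SALES_FACT",
    "--where", "SALE_DATE < DATE '2011-01-01'",          # only cold history
    "--target-dir", "/archive/sales_fact/pre_2011",
    "--num-mappers", "4",
]
subprocess.run(cmd, check=True)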
Still another use case involves offloading workloads from Hadoop to SQL analytic platforms. Some of these platforms are able to execute analytic algorithms inside their database engines. Some SQL DBMS vendors claim that an advanced analysis will


run faster on their own MPP platforms than on Hadoop using MapReduce. They note that MapReduce is a brute-force data processing tool, and while it's ideal for certain kinds of workloads, it's far from ideal as a general-purpose compute engine. This is why so much Hadoop development work has focused on YARN (Yet Another Resource Negotiator), which will permit Hadoop to schedule, execute, and manage non-MapReduce jobs. The benefits of doing so are manifold, especially from a data integration perspective. First, even though some ETL tools run in Hadoop and replace MapReduce with their own engines, Hadoop itself provides no native facility to schedule or manage non-MapReduce jobs. (Hadoop's existing JobTracker and TaskTracker paradigm is tightly coupled to the MapReduce compute engine.) Second, YARN should permit users to run optimized analytic libraries, much as the SQL analytic database vendors do, in the Hadoop environment. This promises to be faster and more efficient than the status quo, which involves coding analytic workloads as MapReduce jobs. Third, YARN could help stem the flow of analytic workloads out of Hadoop and encourage analytic workloads to be shifted from the SQL world into Hadoop. Even though it might be faster to run an analytic workload in an MPP database platform, it probably isn't cheaper, relative, that is, to running the same workload in Hadoop.

Alternatives to Hadoop
But while big data is often discussed through the prism of Hadoop, owing to the popularity and prominence of that platform, alternatives abound. Among NoSQL platforms, for example, there's Apache Cassandra, which is able to host and run Hadoop MapReduce workloads, and which, unlike Hadoop, has no single point of failure. There's also Spanner, Google's successor to BigTable. Google runs its F1 DBMS, a SQL- and ACID-compliant database platform, on top of Spanner, which has already garnered the sobriquet "NewSQL." (And F1, unlike Hadoop, can be used as a streaming database. Here and elsewhere, Hadoop's file-based architecture is a significant constraint.) Remember, a primary contributor to Hadoop's success is its cost: as an MPP storage and compute platform, Hadoop is significantly less expensive than


existing alternatives. But Hadoop by itself isn't ACID-compliant and doesn't expose a native SQL interface. To the extent that technologies such as F1 address existing data management requirements, enable scalable parallel workload processing, and expose more intuitive programming interfaces, they could be compelling alternatives to Hadoop.

What's Ahead
Big data, along with related technologies such as Hadoop and other NoSQL platforms, is just one of several destabilizing forces on the IT horizon, however. Other forces are changing the practice of data integration as well, such as the shift to the cloud and the emergence of data virtualization.
Cloud will change how we consume and interact with, and, for that matter, what we expect of, applications and services. From a data integration perspective, cloud, like big data, entails its own set of technological, methodological, and conceptual challenges. Traditional data integration evolved in a client-server context; it emphasizes direct connectivity between resources, e.g., a requesting client and a providing server. The conceptual model for cloud, on the other hand, is that of representational state transfer, or REST. In place of client-server's emphasis on direct, stateful connectivity between resources, REST emphasizes abstract, stateless connectivity. It prescribes the use of new and nontraditional APIs or interfaces. Traditional data integration makes use of tools such as ODBC, JDBC, or SQL to query for and return a subset of source data. REST components, on the other hand, structure and transfer information in the form of files, e.g., HTML, XML, or JSON documents, that are representations of a subset of source data. For this reason, data integration in the context of the cloud entails new constraints, makes use of new tools, and will require the development of new practices and techniques.
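The contrast is easy to see in code. In the minimal Python sketch below, a SQL interface returns exactly the queried subset of rows, while a REST interface transfers a JSON representation of the resource; sqlite3 stands in for an ODBC/JDBC connection, and the REST endpoint is hypothetical:

import json
import sqlite3
import urllib.request

# Traditional integration: ask the engine for exactly the subset you need.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "EMEA", 250.0), (2, "APAC", 90.0)])
rows = conn.execute("SELECT id, amount FROM orders WHERE region = 'EMEA'").fetchall()
print(rows)

# RESTful integration: fetch a stateless representation of the resource and parse it.
def fetch_orders(url="https://api.example.com/orders?region=EMEA"):   # hypothetical endpoint
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))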
That said, it doesn't mean throwing out existing best practices: If you want to run sales analytics on data in your Salesforce.com cloud, you've either got to load it into an existing, on-premises repository or, alternatively, expose it to a cloud analytics provider. In the former case, you're going to have to extract your data from Salesforce, prepare it, and load it into the analytic repository of your choice, much as you would do with data from any other source. The shift to the cloud isn't going to mean the complete abandonment of on-premises systems. Both will coexist.
Data Virtualization, or DV, is another technology that should be of interest to data integration practitioners. DV could play a role in knitting together the fabric of the post-big data, post-cloud application-scape. Traditionally, data integration was practiced under fairly controlled conditions: Most systems (or most consumables, in the case of flat files or files uploaded via FTP) were internal to an organization, i.e., accessible via a local area network. In the context of both big data and the cloud, data integration is a far-flung practice. Data virtualization technology gives data architects a means to abstract resources, regardless of architecture, connectivity, or physical location.
Conceptually, DV is REST-esque in that it exposes canonical representations (i.e., so-called business views) of source data. In most cases, in fact, a DV business view is a representation of subsets of data stored in multiple distributed systems. DV can provide a virtual abstraction layer that unifies resources strewn across, and outside of, the information enterprise, from traditional data warehouse systems to Hadoop and other NoSQL platforms to the cloud. DV platforms are polyglot: They speak SQL, ODBC, JDBC, and other data access languages, along with procedural languages such as Java and (of course) REST APIs.
Moreover, DV's prime directive is to move as little data as possible. As data volumes scale into the petabyte range, data architects must be alert to the practical physics of data movement. It's difficult, if not impossible, to move even a subset of a multi-petabyte repository in a timely or cost-effective manner.
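A toy sketch makes the principle concrete: the federated "business view" below pushes its filter down to each source and moves only qualifying rows, rather than copying either repository. Two in-memory SQLite databases stand in for distributed systems; a real DV platform would speak ODBC, JDBC, REST, Hive, and so on:

import sqlite3

def make_source(rows):
    # Each in-memory database stands in for one distributed source system.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE customers (id INTEGER, country TEXT, revenue REAL)")
    db.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)
    return db

warehouse = make_source([(1, "US", 1200.0), (2, "DE", 300.0)])
cloud_crm = make_source([(3, "US", 450.0), (4, "JP", 800.0)])

def customer_view(country):
    # Federated "business view": filter at each source, then combine the results.
    query = "SELECT id, country, revenue FROM customers WHERE country = ?"
    results = []
    for source in (warehouse, cloud_crm):
        results.extend(source.execute(query, (country,)).fetchall())
    return results

print(customer_view("US"))   # only matching rows ever leave either source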
Stephen Swoyer is a technology writer with more than 15 years of experience. His writing has focused on business intelligence, data warehousing, and analytics for almost a decade. He's particularly intrigued by the thorny people and process problems most BI and DW vendors almost never want to acknowledge, let alone talk about. You can contact him at stephen.swoyer@gmail.com.

industry directory

Appfluent transforms the economics of Big Data and Hadoop. Appfluent provides IT organizations with unprecedented visibility into usage and performance of data warehouse and business intelligence systems. IT decision makers can view exactly which data is being used or not used, determine how business intelligence systems are performing and identify causes of database performance issues. With Appfluent, enterprises can address exploding data growth with confidence, proactively manage performance of BI and data warehouse systems, and realize the tremendous economies of Hadoop. Learn more at www.appfluent.com.
APPFLUENT TECHNOLOGY, INC.
6001 Montrose Road, Suite 1000
Rockville, MD 20852
301-770-2888
sales@appfluent.com
www.appfluent.com

Attunity is a leading provider of data integration software solutions that make Big Data available where and when needed across heterogeneous enterprise platforms and the cloud. Attunity solutions accelerate mission-critical initiatives including BI/Big Data Analytics, Disaster Recovery, Content Distribution and more. Solutions include data replication, change data capture (CDC), data connectivity, enterprise file replication (EFR), managed-file-transfer (MFT), and cloud data delivery. For 20 years, Attunity has supplied innovative software solutions to thousands of enterprise-class customers worldwide to enable real-time access and availability of any data, anytime, anywhere across the maze of systems making up today's IT environment. Learn more at www.attunity.com.
ATTUNITY
www.attunity.com

CodeFutures is the provider of dbShards, the Big Data platform that makes your database scalable and reliable. dbShards is not a database; instead, dbShards works with proven DBMS engines you know and trust. dbShards gives your application transparent access to one or more DBMS engines, providing the Big Data scalability, High-Availability, and Disaster Recovery you need for demanding always-on operation. You can even use dbShards to seamlessly migrate your database from one environment to another, between regions, cloud vendors and your own data center. For more information, go to www.dbshards.com.
CODEFUTURES CORPORATION
11001 West 120th Avenue, Suite 400
Broomfield, CO 80021
(303) 625-4084
sales@codefutures.com
www.dbshards.com

Composite Software, now part of Cisco, is the data virtualization market leader. Hundreds of organizations use the Composite Data Virtualization Platform's streamlined approach to data integration to gain more insight from their data, respond faster to ever-changing analytics and BI needs, and save 50-75% over data replication and consolidation. Cisco Systems, Inc. completed the acquisition of Composite Software, Inc. on July 29, 2013.
COMPOSITE SOFTWARE
Please call us: (650) 227-8200
Follow us on Twitter: http://twitter.com/compositesw
www.compositesw.com



DataMentors provides award-winning data quality and database marketing solutions. Offered as either a customer-premise installation or ASP-delivered solution, DataMentors leverages proprietary data discovery, analysis, campaign management, data mining and modeling practices to identify proactive, knowledge-driven decisions. DataFuse, DataMentors' data quality and integration solution, is consistently recognized by industry-leading analysts for its extreme flexibility and ease of householding. DataMentors' marketing database solution, PinPoint, quickly and accurately analyzes, segments, and profiles customers' preferences and behaviors. DataMentors also offers social media marketing, drive time analysis, email marketing, data enhancements and behavior models to further enrich the customer experience across all channels.
DATAMENTORS
2319-104 Oak Myrtle Lane
Wesley Chapel, FL 33544
Phone: 813-960-7800
Email: info@datamentors.com
www.DataMentors.com

Datawatch is the leading provider of visual data discovery solutions that allow organizations to optimize the use of any information, whether it is structured, unstructured, or semi-structured data locked in content like static reports, PDF files, and EDI streams, or in real-time sources like CEP engines, tick data, and machine data. Through an unmatched visual data discovery environment and the industry's leading information optimization software, Datawatch allows you to utilize ALL data to deliver a complete picture of your business from every aspect and then manage, secure, and deliver that information to transform business processes, increase visibility to critical Big Data sources, and improve business intelligence applications offering broader analytical capabilities. Datawatch provides the solution to Get the Whole Story!
DATAWATCH CORPORATION
271 Mill Road, Quorum Office Park
Chelmsford, MA 01824
978-441-2200
Sales@datawatch.com
www.datawatch.com

Delphix delivers agility to enterprise application projects, addressing the largest source of inefficiency and inflexibility in the datacenter: provisioning, managing, and refreshing databases for business-critical applications. With Delphix in place, QA engineers spend more time testing and less time waiting for new data, increasing utilization of expensive test infrastructure. Analysts and managers make better decisions with fresh data in data marts and warehouses. Leading global organizations use Delphix to dramatically reduce the time, cost, and risk of application rollouts, accelerating packaged and custom applications projects and reporting.
DELPHIX
275 Middlefield Road
Menlo Park, CA 94025
sales@delphix.com
www.delphix.com

Denodo is the leader in data virtualization. Denodo enables hybrid data storage for big data warehouse and analytics, providing unmatched performance, unified virtual access to the broadest range of enterprise, big data, cloud and unstructured sources, and agile data services provisioning, which has allowed reference customers in every major industry to minimize the cost and pitfalls of big data technology and accelerate its adoption and value by making it transparent to business users. Denodo is also used for cloud integration, single-view applications, and RESTful linked data services. Founded in 1999, Denodo is privately held.
DENODO TECHNOLOGIES
info@denodo.com
www.denodo.com


Nearly 80% of all existing data is generally only available in unstructured form and does not contain additional, descriptive metadata. This content, therefore, cannot be machine-processed automatically with conventional IT. It demands human interaction for interpretation, which is impossible to achieve when faced with the sheer volume of information. Based on the highly scalable Information Access System, Empolis offers methods for analyzing unstructured content perfectly suitable for a wide range of applications. For instance, Empolis technology is able to semantically annotate and process an entire day of traffic on Twitter in less than 20 minutes, or the German version of Wikipedia in three minutes. In addition to statistical algorithms, this also covers massive parallel processing utilizing linguistic methods for information extraction. These, in turn, form the basis for our Smart Information Management solutions, which transform unstructured content into structured information that can be automatically processed with the help of content analysis.
EMPOLIS INFORMATION MANAGEMENT GMBH
Europaallee 10 | 67657 Kaiserslautern | Germany
Phone +49 631 68037-0 | Fax +49 631 68037-77
info@empolis.com
www.empolis.com

DBMoto is the preferred solution for heterogeneous Data Replication and Change Data Capture requirements in an enterprise environment. Whether replicating data to a lower-TCO database, synchronizing data among disparate operational systems, creating a new columnar or high-speed analytic database or data mart, or building a business intelligence application, DBMoto is the solution of choice for fast, trouble-free, easy-to-maintain Data Replication and Change Data Capture projects. DBMoto is mature and approved by enterprises ranging from midsized to Fortune 1000 worldwide. HiT Software, Inc., a BackOffice Associates LLC Company, is based in San Jose, CA. For more information see www.info.hitsw.com/DBTA-bds2013/
HIT SOFTWARE, INC., A BACKOFFICE ASSOCIATES LLC COMPANY
Contact: Giacomo Lorenzin
408-345-4001
info@hitsw.com
www.hitsw.com

Kapow Software, a Kofax company, harnesses the power of legacy data and big data, making it actionable and accessible across organizations. Hundreds of large global enterprises including Audi, Intel, Fiserv, Deutsche Telekom, and more than a dozen federal agencies rely on its agile big data integration platform to make smarter decisions, automate processes, and drive better outcomes faster. They leverage the platform to give business consumers a flexible 360-degree view of information across any internal and external source, providing organizations with a data-driven advantage. For more information, please visit: www.kapowsoftware.com.
KAPOW SOFTWARE
260 Sheridan Avenue, Suite 420
Palo Alto, CA 94306
Phone: +1 800 805 0828
Fax: +1 650 330 1062
Email: marketing@kapowsoftware.com
www.kapowsoftware.com

HPCC Systems from LexisNexis is an open-source, enterprise-ready solution designed to help detect patterns and hidden relationships in Big Data across disparate data sets. Proven for more than 10 years, HPCC Systems helped LexisNexis Risk Solutions scale to a $1.4 billion information company now managing several petabytes of data on a daily basis from 10,000 different sources. HPCC Systems was built for small development teams and offers a single architecture and one programming language for efficient data processing of large or complex queries. Customers, such as financial institutions, insurance companies, law enforcement agencies, federal government and other enterprise organizations, leverage the HPCC Systems technology through LexisNexis products and services. HPCC Systems is available in an Enterprise and Community version under the Apache license.
LEXISNEXIS
Phone: 877.316.9669
www.hpccsystems.com
www.lexisnexis.com/risk



Founded in 1989, MicroStrategy (Nasdaq: MSTR) is a leading worldwide provider of enterprise software platforms. Millions of users use the MicroStrategy Analytics Platform to analyze vast amounts of data and distribute actionable business insight throughout the enterprise. Our analytics platform delivers interactive dashboards and reports that users can access and share via web browsers, information-rich mobile apps, and inside Microsoft Office applications. Big data analytics delivered with MicroStrategy will enable businesses to analyze big data visually without writing code and apply advanced analytics to obtain deep insights from all of their data. To learn more and try MicroStrategy free, visit microstrategy.com/bigdatabook.
MICROSTRATEGY
1850 Towers Crescent Plaza
Tysons Corner, VA 22182 USA
Phone: 888.537.8135
Email: info@microstrategy.com
www.microstrategy.com/bigdatabook

Since 1988, Objectivity, Inc. has been the Enterprise NoSQL leader, helping customers harness the power of Big Data. Our leading edge technologies, InfiniteGraph, The Distributed Graph Database, and Objectivity/DB, a distributed and scalable object management database, enable organizations to discover hidden relationships for improved Big Data analytics and develop applications with significant time-to-market advantages and technical cost savings, achieving greater return on data-related investments. Objectivity, Inc. is committed to our customers' success, with representatives worldwide. Our clients include: AWD Financial, CUNA Mutual, Draeger Medical, Ericsson, McKesson, IPL, Siemens and the US Department of Defense.
OBJECTIVITY, INC.
3099 North First Street, Suite 200
San Jose, CA 95134 USA
408-992-7100
info@objectivity.com
www.objectivity.com

Percona has made MySQL and integrated MySQL/big data solutions faster and more reliable for over 2,000 customers worldwide. Our experts help companies integrate MySQL with big data solutions including Hadoop, HBase, Hive, MongoDB, Vertica, and Redis. Percona provides enterprise-grade Support, Consulting, Training, Remote DBA, and Server Development services for MySQL or integrated MySQL/big data deployments. Our founders authored the book High Performance MySQL and the MySQL Performance Blog. We provide open source software including Percona Server, Percona XtraDB Cluster, Percona Toolkit, and Percona XtraBackup. We also host Percona Live conferences for MySQL users worldwide. For more information, visit www.percona.com.
PERCONA
www.percona.com

Progress DataDirect provides high-performance, real-time connectivity to applications and data deployed anywhere. From SaaS applications like Salesforce to Big Data sources such as Hadoop, DataDirect makes these sources appear just like a regular relational database. Whether you are connecting your own application or your favorite BI and reporting tools, DataDirect makes it easy to access your critical business information. More than 300 leading independent software vendors embed Progress Software's DataDirect components in over 400 commercial products. Further, 96 of the Fortune 100 turn to Progress Software's DataDirect to simplify and streamline data connectivity.
PROGRESS DATADIRECT
www.datadirect.com


Splice Machine is the only transactional SQL-on-Hadoop database for real-time Big Data applications. Splice Machine provides all the benefits of NoSQL databases, such as auto-sharding, scalability, fault tolerance and high availability, while retaining SQL, the industry standard. It optimizes complex queries to power real-time OLTP and OLAP apps at scale without rewriting existing SQL-based apps and BI tool integrations. Splice Machine provides fully ACID transactions and uses Multiple Version Concurrency Control (MVCC) with lockless snapshot isolation to enable real-time database updates with very high throughput.
SPLICE MACHINE
info@splicemachine.com
www.splicemachine.com

TransLattice provides its customers corporate-wide visibility, dramatically improved system availability, simple scalability and significantly reduced deployment complexity, all while enabling data location compliance. Computing resources are tightly integrated to enable enterprise databases to be spread across an organization as needed, whether on-premise or in the cloud, providing data where and when it is needed. Nodes work seamlessly together, and if a portion of the system goes down, the rest of the system is not affected. Data location is policy-driven, enabling proactive compliance with regulatory requirements. This simplified approach is fundamentally more reliable, more scalable, and more cost-effective than traditional approaches.
TRANSLATTICE
+1 408 749-8478
info@TransLattice.com
www.TransLattice.com


Need Help Unlocking the Full Value of Your Information?

DBTA magazine is here to help.

Each issue of DBTA features original and valuable content, providing you with clarity, perspective, and objectivity in a complex and exciting world where data assets hold the key to organizational competitiveness.
Don't miss an issue! Subscribe FREE* today!
*Print edition free to qualified U.S. subscribers.

Best Practices and Thought Leadership Reports
Get the inside scoop on the hottest topics in data management and analysis:
Big Data technologies, including Hadoop, NoSQL, and in-memory databases
Solving complex data and application integration challenges
Increasing efficiency through cloud technologies and services
Tools and techniques reshaping the world of business intelligence
New approaches for agile data warehousing
Key strategies for increasing database performance and availability

For information on upcoming reports: http://iti.bz/dbta-editorial-calendar
To review past reports: http://iti.bz/dbta-whitepapers
