CONTENTS
BIG DATA
SOURCEBOOK
DECEMBER 2013
introduction
PUBLISHED BY Unisphere Media, a Division of Information Today, Inc.
EDITORIAL & SALES OFFICE 630 Central Avenue, Murray Hill, New Providence, NJ 07974
CORPORATE HEADQUARTERS 143 Old Marlton Pike, Medford, NJ 08055
Thomas Hogan Jr., Group Publisher; 908-795-3701; thoganjr@infotoday
Joyce Wells, Managing Editor; 908-795-3704; joyce@dbta.com
Joseph McKendrick, Contributing Editor; joseph@dbta.com
Sheryl Markovits, Editorial and Project Management Assistant; (908) 795-3705; smarkovits@dbta.com
Denise M. Erickson, Senior Graphic Designer
Jackie Crawford, Ad Trafficking Coordinator
Alexis Sopko, Advertising Coordinator; 908-795-3703; asopko@dbta.com
Sheila Willison, Marketing Manager, Events and Circulation; 859-278-2223; sheila@infotoday.com
DawnEl Harris, Director of Web Events; dawnel@infotoday.com
Celeste Peterson-Sloss, Deborah Poulson, Alison A. Trotta, Editorial Services
POSTMASTER
Send all address changes to:
Big Data Sourcebook, 143 Old Marlton Pike, Medford, NJ 08055
COPYRIGHT INFORMATION
Authorization to photocopy items for internal or personal use, or the internal or personal use of specific clients, is granted by Information Today, Inc., provided that the base fee of US $2.00 per page is paid directly to Copyright Clearance Center (CCC), 222 Rosewood Drive, Danvers, MA 01923, phone 978-750-8400, fax 978-750-4744, USA. For those organizations that have been granted a photocopy license by CCC, a separate system of payment has been arranged. Photocopies for academic use: Persons desiring to make academic course packs with articles from this journal should contact the Copyright Clearance Center to request authorization through CCC's Academic Permissions Service (APS), subject to the conditions thereof. Same CCC address as above. Be sure to reference APS.
Creation of derivative works, such as informative abstracts, unless agreed to in writing by
the copyright owner, is forbidden.
Acceptance of advertisement does not imply an endorsement by Big Data Sourcebook.
Big Data Sourcebook disclaims responsibility for the statements, either of fact or opinion,
advanced by the contributors and/or authors.
DBTA's Big Data Sourcebook is a guide to the enterprise and technology issues IT professionals are being asked to cope with as business or organizational leadership increasingly defines strategies that leverage the big data phenomenon.
It has been well-documented that social media, web, transactional, machine-generated, and traditional relational data are being collected within organizations at an accelerating pace. Today, according to common industry estimates, 80% of enterprise data is unstructured or schema-less.
The reality of what is taking place in IT organizations
today is more than hype. According to an SAP-sponsored survey of 304 data managers and professionals, conducted earlier this year by Unisphere Research, a division of Information Today, Inc., between one-third and one-half of respondents have high levels of volume, variety, velocity, and value in their data, the four well-known characteristics that define big data. The 2013 Big Data Opportunities Survey found that two-fifths of respondents have data stores reaching into the hundreds of terabytes and greater. Eleven percent of respondents said the total data they manage ranges from 500TB to 1PB, 8% had between 1PB and 10PB, and 9% had more than 10PB.
In addition, data stores are growing rapidly. According to another study produced by Unisphere Research
and sponsored by Oracle, almost nine-tenths of the 322
respondents say they are experiencing year-over-year
growth in their data assets. Respondents to the survey were
data managers and professionals who are members of the
Independent Oracle Users Group (IOUG). For many, this
growth is in double-digit ranges. Forty-one percent report
significant growth levels, defined as exceeding 25% a year.
Seventeen percent report that the rate of growth has been
more than 50% ("Achieving Enterprise Data Performance: 2013 IOUG Database Growth Survey").
Big data offers enormous potential to organizations
and represents a major transformation of information
technology. Beyond the obvious need to effectively store
and protect this data, IT organizations are increasingly
sponsored content
AN EXAMPLE
Every year, NASA and the National
Science Foundation host a contest across
the scientific communities, the results
often resonating in both the academic
and business worlds. The latest challenge:
How can organizations pull together all the
right data from a variety of sources before
performing analysis, drawing conclusions
and making decisions? Sounds like big
data, right?
Consider the problem of determining
if life ever existed on Mars. A huge variety
of data collected by the Mars rover is
fed into clusters of databases around the
world. It then gets transmitted as a whole
to a variety of data sets and Hadoop
clusters. What do we do with it? How does
the scientific community organize itself
to deal with this influx?
There are similar examples in every
industry, all leading to key integration
challenges: How do we make dissimilar data
sets uniformly accessible? And how do we
extract the most relevant information in a
fast, scalable and consistent way?
The problems of data access and relevancy
are complicated by three additional data
processing realities:
PROGRESS DATADIRECT
www.datadirect.com
DBTA.COM
industry updates
The Battle
Over Persistence
and the Race
for Access Hill
By John O'Brien
Data Is Data
The NoSQL family of data stores was born
out of the business demands to capitalize on
need a point of access and navigation. Otherwise, the MDP is simply a bunch of databases.
One major concept at stake for modern
data architects in the Race for Access Hill is
how to centralize semantic context for consistency, collaboration, and navigation. Previously in the organized world of data schemas, there were many database vendors and
technologies that made data access heterogeneous, but it was still unified SQL data access under a single paradigm. Federated data architectures were predominantly still SQL schema in nature and easier to unify. Today's key-value stores, such as Hadoop, have the ability to separate the context of data, or its schema, from the data itself, which has great discovery-oriented benefits for late-binding the schema with the data, rather than analyzing and designing a schema prior to loading data in as a traditional RDBMS.
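The late-binding idea can be sketched in a few lines: raw records land in the store with no declared schema, and a field layout is bound only at read time. A minimal Python illustration (the record format and field names here are hypothetical, not from any particular system):

```python
import csv
import io

# Raw data lands in the store as-is, with no declared schema (schema-on-read).
raw = "1001,acme,2013-12-01\n1002,globex,2013-12-02\n"

# The schema is bound only when an analyst reads the data, not at load time.
schema = ["order_id", "customer", "order_date"]

def read_with_schema(blob, fields):
    """Apply a late-bound schema to raw delimited text."""
    return [dict(zip(fields, row)) for row in csv.reader(io.StringIO(blob))]

rows = read_with_schema(raw, schema)
print(rows[0]["customer"])  # -> acme; a different schema could be bound to the same bytes tomorrow
```

The same bytes can be reinterpreted under a new field layout without reloading, which is the discovery-oriented benefit the author describes.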
Centralizing context can be done in a Hadoop cluster's HCatalog or Hive components for semantic integration with other SQL-oriented databases for federation, hence joining the SQL world where possible. (Reminds me of my favorite recent Twitter quote: "Who knew the future of NoSQL was SQL?") Data virtualization (DV) can serve as
What's Ahead
In 2013, two major shifts in the data landscape occurred. The acceptance of leveraging the strengths of various database technologies in an optimized Modern Data Platform has more or less been resolved, but the recognition of a single point of access and context is next. Likewise, the race for access will continue well into 2014, and while one solution
www.TransLattice.com
Relational Databases
Are Not Going Anywhere
While there is much speculation about
how modern data processing technologies
are displacing proven relational databases, the
reality is that most companies will be better
served with relational technologies for most
of their needs.
As the saying goes, if all you have is a
hammer, everything looks like a nail. When
database professionals drink enough of the
big data Kool-Aid, many of their challenges
look like big data problems. In reality, though,
most of their problems are self-inflicted. A
bad data model is not a big data problem.
Using 7-year-old hardware is not a big data
problem. Lack of data purging policy is not
a big data problem. Misconfigured databases,
operating systems, and storage arrays are not
big data problems.
There is one good rule of thumb to assess whether you have a big data problem or not: if you are not using new data sources, you likely don't have a big data problem. If you are con-
What's Ahead
There are a few areas in which we can certainly expect many innovations over the next few years.
Real-time analytics on massive data volumes is in more and more demand. While there are many in-memory database technologies, including many proprietary solutions, I believe the future is with the Hadoop ecosystem and open standards. However, proprietary solutions such as SAP HANA or the just-announced Oracle In-Memory Database are very credible alternatives.
sponsored content
Elephant Traps
How to Avoid Them With Data Virtualization
Big Data is being talked about everywhere: in IT and business conferences, venture capital, legal, medical, and government summits, blogs and tweets, even Fox News! The prevailing mindset is that if you don't have a Big Data project, you're going to be left behind. In turn, CIOs are feeling pressured to do something, anything, about Big Data. So while they are putting up Hadoop clusters and crunching some data, it seems that the really big (data) question all of them should be asking is: where is the value going to come from? What are the real use cases? And finally, how can they prevent this from becoming yet another money pit, or elephant trap, of technologies and consultants?
sponsored content
[Figure: The Denodo Platform bridging Big Data in the Web/Cloud (Data.gov, WWW, cloud storage, Hadoop, web streams, log files, unstructured content) and Big Data in the Enterprise (relational/parallel/columnar stores, enterprise apps, users) through a Connect -> Combine -> Publish flow: unified data access, a unified data layer, and universal data publishing to query, chart, and map widgets.]
CONCLUSION
CIOs and Chief Data Officers alike would
do well to keep the dangers of elephant
traps in mind before they find themselves
ensnared. The truth is that every Big Data
project needs a balance between the Big Data
technologies for storage and processing on
the one hand and data virtualization for data
access and data services delivery on the other.
DATA VIRTUALIZATION
Several technologies and approaches serve Big Data needs, of which two categories are particularly important. The first has received a lot of attention and involves distributed computing across standard hardware clusters or cloud resources, using open source technologies. Technologies that fall in this category and
Understanding the legal and regulatory consequences will help keep your company safe
from those dangerous rocks.
Information security risks are also important factors to consider within the larger legal
and risk context. If they are not mitigated
early on, they alone can lead to opening the
door for broader discovery related to big
datasets and systems. Information security in
a broad sense can include:
Data Integrity and Privacy
Encryption
Access Control
Chain-of-Custody
Relevant Laws/Regulations
Corporate Policies
Specific examples of situations where information security policies should be monitored include:
Vendor Agreements
Data Ownership & Custody
Requirements
International Regulations
Confidentiality Terms
Data Retention/Archiving
Geographical Issues
Entering into contracts with third-party big data-related providers is an area that warrants special attention and where legal or risk problems may arise. Strict controls related to third parties are important. More and more big data systems and technologies are supplied by third parties, so the organization must have certain restrictions and protections in place to ensure side-door and backdoor discovery doesn't occur.
When dealing with third-party control,
avoiding common pitfalls leads to better data
risk and cost control. Common problems that
arise include:
Inadvertent data spoliation, which
can include stripping metadata and
truncating communication threads
Custody and control of the data,
including access rights and issues with
data removal
Problems with relevant policies/
procedures, which can include a lack
of planning and a lack of enforcement
of rules
International rules and regulations,
including cross-border issues
Big data sources are no different than traditional data sources in that big data sources
and the use of big data should be protected
Mitigating Risk
To best mitigate risk from both internal
and third-party users, certain procedures
related to data access and handling should be
implemented via IT control:
Auditing and validation of logins and access
Logging of actions
Monitoring
Chain-of-custody
Executive oversight, however, is also an
extremely important method of managing data
risk. Organizational commitment to appropriate control procedures evidenced through
executive support is a key factor to creating,
deploying, and maintaining a successful information risk management program. Employees who are able to see the value of the procedures through the actions and attitudes of those in management better appreciate the importance of those procedures themselves.
All in all, a practical, holistic approach is
best for risk mitigation. Here are some tips for
managing legal information/data risk:
Use a team approach: Include
representatives from legal, IT, risk, and
executives to cover all bases.
Use written SOPs and protocols: Standardized ways of operating, responding, and managing processes, together with adherence to written protocols, are key to consistency. Consistency helps defend the process in legal proceedings if needed.
Leverage native functionality when responding to legal requests: Reporting that is sufficient for the business should be appropriate for the courts. Also be sure to establish a strong separation of the presentation layer from the underlying data for implicated system identification purposes.
Multi-departmental involvement is also
very important to creating and maintaining
a successful risk mitigation environment and
plan. It is easy to lose track of weak spots in
data handling when only one group is trying to guess the activities of all the others in
an organization. Executives, IT, legal, and
risk all have experiences to share that could
implicate weakness in the systems. Review by
a team helps cover all the bases.
What's Ahead
This is a new field for legal professionals and the courts. Big data is here to stay and will become increasingly ubiquitous and a necessary part of running an efficient and
successful business. Because of that, those systems and data (including derived analysis and
underlying raw information) will be implicated in legal matters and will thus be subject
to legal rules of preservation, discovery, and
evidence. Those types of legal requirements
are typically burdensome and expensive
when processes are not in place and people
are not trained. Relevant big data systems and applications are not designed for the type of operations required by legal rules of preservation and discovery: requirements related to maintaining evidentiary integrity, chain-of-custody, data origination, use, metadata information, and historical access control.
This new technical domain will quickly become critical to the legal fact-finding process. Thus, organizations must begin to think about how the data is used and maintained during the normal course of business and how that may affect their legal obligations if big data or related systems are implicated, which may well be the case with every legal situation an organization faces.
sponsored content
THE SOLUTION
LexisNexis Global Content Systems Group
consolidated the content management and
document enhancement and mining systems
onto HPCC Systems to solve multiple data
challenges, including content enrichment,
since data enrichment must be applied across
all the content simultaneously to provide a
superior search result.
HPCC Systems from LexisNexis is an
open-source, enterprise-ready solution
designed to help detect patterns and hidden
relationships in Big Data across disparate data
sets. Proven for more than 10 years, HPCC
Systems helped LexisNexis Risk Solutions
scale to a $1.4 billion information company
now managing several petabytes of data on
a daily basis from 10,000 different sources.
HPCC Systems is proven in entity
recognition/resolution, clustering and content
analytics. The massively parallel nature of the HPCC platform provides both the processing and storage resources required to fulfill the dual missions of content storage and content enrichment.
HPCC Systems was easily integrated with the existing Content Management workflow engine to provide document-level locking and other editorial constraints.
The migration of the content repository
and data enhancement processing to the
HPCC platform involved creating several
HPCC worker clusters of varying sizes
to perform data enrichments and a single
THE RESULTS
The new system achieves the goal of having a tightly integrated content management and enrichment system that takes full advantage of HPCC Systems' supercomputing capabilities for both computation and high-speed data access.
The elapsed time to perform an enrichment pass of the entire data collection dropped from six to eight weeks to less than a day. This change is so significant that LexisNexis has already extended enrichment into other capabilities that were previously out of reach.
ABOUT HPCC SYSTEMS
HPCC Systems was built for small development teams and offers a single architecture and one programming language for efficient data processing of large or complex queries. Customers, such as financial institutions, insurance companies, law enforcement agencies, federal government and other enterprise
organizations, leverage the HPCC Systems
technology through LexisNexis products and
services. For more information, visit
www.hpccsystems.com.
LEXISNEXIS
www.hpccsystems.com
LexisNexis and the Knowledge Burst Logo are
registered trademarks of Reed Elsevier Properties Inc.,
used under license. HPCC Systems is a registered
trademark of LexisNexis Risk Data Management Inc.
Copyright 2012 LexisNexis. All rights reserved.
Unlocking the
Potential of Big Data in a
Data Warehouse Environment
By W. H. Inmon
Determining Context
There have been several earlier attempts
to analyze unstructured data. Each of the
attempts has its own major weakness. The
previous attempts to analyze unstructured
data include:
1. NLP: natural language processing. NLP is intuitive. But the flaw with NLP is that it assumes context can be determined from the examination of text. The problem with this assumption is that most context is nonverbal and never finds its way into any form of text.
2. Data scientists. The problem with throwing a data scientist at the problem of analyzing unstructured data is that the world has only a finite supply of those scientists. Even if the universities of the world started to turn out droves of data scientists, the demand for data scientists everywhere there is big data would far outstrip the supply.
3. MapReduce. The leading technology of big data, Hadoop, has a technology called MapReduce. With MapReduce, you can create and manage unstructured data to the nth degree. But the problem with MapReduce is that it requires very technical coding in order to be implemented. In many ways, MapReduce is like coding in Assembler. Thousands and thousands of lines of custom code are required. Furthermore, as business functionality changes, those thousands of lines of code need to be maintained. And no organization likes to be stuck with ongoing maintenance of thousands of lines of detailed, technical custom code.
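Even the canonical toy example hints at the hand-coding Inmon describes: counting words over raw text requires writing an explicit map step and reduce step before any analysis begins. A minimal Python sketch of the pattern (a real Hadoop job would add Java boilerplate, job configuration, and cluster plumbing on top of this):

```python
from collections import defaultdict

# The "map" step: emit (key, value) pairs from each raw record.
def map_words(line):
    return [(word.lower(), 1) for word in line.split()]

# The "reduce" step: combine all values that share a key.
def reduce_counts(pairs):
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["President Ford drove a Ford", "a Ford is a car"]
pairs = [pair for line in lines for pair in map_words(line)]
print(reduce_counts(pairs)["ford"])  # -> 3
```

Every new question over the raw text means writing and maintaining another map/reduce pair of this kind, which is the maintenance burden the author is pointing at.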
4. MapReduce on steroids. Organizations
have recognized that creating thousands of
lines of custom code is no real solution.
Deriving Context
In fact, there are two ways to derive context for unstructured data: general context and specific context. General context can be derived by merely declaring a document to be of a particular variety. A document may be about fishing. A document may be about legislation. A document may be about healthcare, and so forth. Once the general context of the document is declared, then the interpretation of text can be made in accordance with the general category.
As a simple example, suppose the raw text contained this sentence: "President Ford drove a Ford." If the general context were about motor cars, then Ford would be interpreted to be an automobile. If the general context were about the history of presidents of the U.S., then Ford would be interpreted to be a reference to a former president.
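The effect of declaring a general context can be sketched as a simple lookup: the same token resolves differently depending on the declared document category. A hedged Python illustration (the sense table and category names are made up for this example, not drawn from any real disambiguation product):

```python
# Hypothetical sense table: token -> {declared document context: interpretation}
SENSES = {
    "ford": {
        "automobiles": "car manufacturer",
        "us_presidents": "Gerald Ford, 38th U.S. president",
    },
}

def interpret(token, context):
    """Resolve a token using the general context declared for the document."""
    senses = SENSES.get(token.lower(), {})
    return senses.get(context, token)  # fall back to the raw token

print(interpret("Ford", "automobiles"))    # -> car manufacturer
print(interpret("Ford", "us_presidents"))  # -> Gerald Ford, 38th U.S. president
```

Declaring the document's variety up front is what makes the second lookup key available; without it, "Ford" stays ambiguous.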
Textual Disambiguation
The other type of context is specific context. Specific context can be derived in many different ways: by the structure of a word, the text surrounding a word, the placement of words in proximity to each other, and so forth. There is new technology called textual disambiguation which allows raw unstructured text to have its context specifically determined. In addition, textual disambiguation allows the output of its processing to be placed in a standard database format so that classical analytical tools can be used.
At the end of textual disambiguation,
analytical processing can be done on the
raw unstructured text that has now been
disambiguated.
What's Ahead
The argument can be made that the process of disambiguating the raw text then
rewriting it to big data in a disambiguated
state increases the amount of data in the
environment. Such an observation is absolutely true. However, given that big data is
cheap and that the big data infrastructure is
designed to handle large volumes of data, it
should be of little concern that there is some
degree of duplication of data after raw text
passes through the disambiguation process.
Only after big data has been disambiguated is the big data store fit to be called a data warehouse. However, once the big data is disambiguated, it makes a really valuable and really innovative addition to the analytical, data warehouse environment.
Big data has much potential. But unlocking that potential is going to be a real challenge. Textual disambiguation promises to be
as profound as data warehousing once was.
Textual disambiguation is still in its infancy, but then again, everything was once in its infancy. However, the early seeds sown in textual disambiguation are bearing some most interesting fruit.
sponsored content
UNDERSTANDING THE
CONTENT BLIND SPOT
A growing number of IT organizations now see value in information contained within these content blind spots. The key reason: it enhances their business leaders' ability to make smarter decisions because much of this data provides a link to past decisions.
Companies also realize that these nontraditional data sources are growing at an
OPTIMIZING INFORMATION
THROUGH VISUAL
DATA DISCOVERY
Next-generation analytics enable
businesses to analyze any data variety,
regardless of structure, at real-time velocity
for fast decision making in a visual data
discovery environment. These analytic tools
link diverse data types with traditional
decision-making tools like spreadsheets and
DATAWATCH
www.datawatch.com
sponsored content
Overcoming the
Big Data Transfer Bottleneck
Businesses all over the world are
beginning to realize the promise of Big
Data. After all, being able to extract data
from various sources across the enterprise,
including operational data, customer
data, and machine/sensor data, and then
transform it all into key business insights can provide significant competitive advantage. In fact, having up-to-date, accurate information for analytics can make the difference between success and failure for companies. However, it's not easy. A recent study by Wikibon noted that returns thus far on Big Data investments are only 50 cents on the dollar. A number of challenges stand in the way of maximizing return. The data transfer bottleneck is but one real and pervasive issue that's causing many headaches in IT today.
THE ANSWER
There is a solution to overcoming this
challenge. Attunity beats the Big Data
bottleneck by providing high-performance
data replication and loading for the broadest
range of databases and data warehouses
in the industry. Its easy Click-2-Replicate
design and unique TurboStream DX data
transfer and CDC technologies give it the
power to stand up to the largest bottlenecks
and win. Partner with Attunity. You too can
beat the data transfer bottleneck!
HARSH REALITY
When timely information isn't available,
key decisions need to be deferred. This
can lead to lost revenues, decreased
competitiveness, or lower levels of customer
satisfaction. Additionally, the reliability of
Learn more!
Download this eBook by data
management expert, David Loshin:
Big Data Analytics Strategies
Beating the Data Transfer Bottleneck
for Competitive Gain
http://bit.ly/ATTUeBook
ATTUNITY
For more information,
visit www.Attunity.com
or call (800) 288-8648 (toll free)
+1 (781) 730-4070.
Cloud Technologies
Are Maturing to Address Emerging
Challenges and Opportunities
By Chandramouli Venkatesan
Hybrid Cloud
For enterprises that are adopting the hybrid (public/private/community) cloud pay-as-you-go model for IaaS, PaaS, and SaaS cloud deployments, the key drivers are cost, flexibility, and speed (time to set up hardware, software, and services). The primary use cases for the new hybrid model include the ability to do data migration, fraud detection, and the ability to manage unstructured data in real time.
But the move to hybrid cloud deployment comes with new challenges and risks. The biggest challenge for cloud deployments today is in the area of data security and identity. There are several cloud providers who offer IaaS, PaaS, SaaS, network as a service, and everything as a service, and who probably offer good firewalls to protect data within the boundaries of their data centers. The challenges include data at rest, data in flight used by mobile devices accessing the cloud provider, and data derived from multiple cloud providers and provision of a single view to the mobile customer.
BYOD
Ubiquitous mobile computing is driving the new cloud adoption model faster than anticipated, and a key driver is BYOD (bring your own device). The traditional IT shop had control of its assets, whether on-premises or in the cloud. However, the demands of BYOD and the myriad mobile devices, applications,
Emerging Standards
There are several emerging standards
for cloud deployments, primarily to address
identity, security, and software-defined
networking (SDN). IaaS, PaaS, and SaaS
cloud deployments have matured, and there
are several players that coexist in the cloud
ecosystem today. The standards such as
OpenID, Open Connect, OAuth, and Open
Data Center Alliance have several cloud providers and enterprises signing up every day,
but the adoption will take a few more years
to evolve and mature. Open standards are the key to the future adoption of cloud and the seamless flow of secure data among different cloud providers. This offers a paradigm similar to a free market economy, which is a goal, but in reality, the goal to be strived for by future cloud players is about 60% open standards and 40% proprietary frameworks
broadening. Master data management presents many of the same challenges that data quality itself presents. Moreover, the complexity of implementing master data management solutions has restricted them to relatively large companies. At the bottom line, both data quality programs and master data management solutions are tricky to implement successfully, in part because, to a large degree, the impact of poor-quality and disjointed data is hidden from sight. Too often, data quality seems to be nobody's specific responsibility.
Despite the difficulties in gathering corporate resources to address these issues, during the past decade, the high cost of poor-quality and poorly integrated data has become clearer, and a better understanding of what defines data quality, as well as a general methodology for implementing data quality programs, has emerged. The establishment of the general foundation for data quality and master data management programs is significant, particularly because the corporate information environment is undergoing a tremendous upheaval, generating turbulence as vigorous as that created by mainframe and personal computers.
The spread of the internet and mobile
devices such as smartphones and tablets is not
What's Ahead
sponsored content
DATA SOURCES
BEYOND THE TRADITIONAL
Enterprises understand the intrinsic
value in mining and analyzing traditional
data sources such as demographics,
consumer transactions, behavior models,
industry trends, and competitor information.
However, the age of Big Data and advanced
technologies necessitate the analysis of new
data universes, such as social media and
mobile technologies.
Social media is one of the major elements driving the overall Big Data phenomenon. Twitter streams, Facebook posts, and blogging forums flood organizations with massive amounts of data. Successful Big Data strategies include the adoption of technologies to pull relevant social media into a single stream and integrate the information into the core functions of the enterprise. Automated processes, matching technology, and filters extract content and consumer sentiment. When social stream data is cleansed and integrated into a database, enterprises gain invaluable information on customer insights,
In Today's BI
and Advanced
Analytics World,
There Is Something
for Everyone
By Joe McKendrick
[T]he new BI/analytics world is all about diving deep into datasets
and being able to engage in storytelling as a way to connect
data to the business.
("2012 IOUG Big Data Strategies Survey," sponsored by Oracle and conducted by Unisphere Research, September 2012).
Relational Database Management Systems: RDBMSs, on the market for close to three decades, structure data into tables that can be cross-indexed within applications and are increasingly being tweaked for the data surge ahead. The IOUG survey finds nine out of 10 enterprises intend to continue using relational databases for the foreseeable future, and it is likely that many organizations will have hybrid environments with both SQL and NoSQL running side by side.
Cloud: Cloud-based BI solutions offer
functionality on demand, along with more
rapid deployment, low upfront cost, and scalability. Many database vendors now support
data management and storage capabilities via
a cloud or software as a service environment.
In addition, other vendors are optimizing their data products to leverage cloud resources, either as the foundation of private clouds, or running in on-premises server environments that also access application programming interfaces (APIs) or web services for additional functions.
In another survey of 262 data managers,
37% say their organizations are either running private clouds, defined as on-demand shared services provided to internal departments or lines of business within enterprises, at full or limited scale, or are in pilot stages (Enterprise Cloudscapes, Deeper and More Strategic: 2012–13 IOUG Cloud Computing Survey, sponsored by Oracle and conducted by Unisphere Research, February 2013). This is up from 29% in 2010, the first year this survey was conducted. In addition, adoption of public clouds, defined as on-demand services provided by public cloud providers, is on the upswing. Twenty-six percent of respondents say they now use public cloud services either in full or limited ways, or within pilot projects. This is up by 86%
be applied to more routine business problems, which potentially can uncover unforeseen outcomes. For example, one bank found that its most profitable customers were not high-wealth individuals, but rather those who were not meeting minimums and overdrafting accounts, and thus anteing up fees. In another case, an airline found that passengers specifying vegetarian preferences in their on-board meals were less likely to miss flights. There are even counterintuitive findings, such as the dating site that found people rated the most attractive received less attention than average-looking members. (Suitors felt they faced more competition for the most attractive members.)
Programming Tools: A range of scripting and open source languages, including Python, Ruby, and Perl, offer extensions for parallel programming and machine learning.
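As a minimal illustration of the parallel-scoring pattern such extensions support, the sketch below uses Python's standard multiprocessing module; the score function is a stand-in for a real trained model, and the feature names are invented for the example:

```python
from multiprocessing import Pool

def score(record):
    """Stand-in for a trained model's scoring function:
    a simple weighted sum of two invented features."""
    return 0.7 * record["recency"] + 0.3 * record["frequency"]

if __name__ == "__main__":
    customers = [
        {"recency": 0.9, "frequency": 0.2},
        {"recency": 0.1, "frequency": 0.8},
        {"recency": 0.5, "frequency": 0.5},
    ]
    # Distribute scoring across worker processes
    with Pool(processes=2) as pool:
        scores = pool.map(score, customers)
    print(scores)
```

Real parallel machine learning extensions do far more, of course, but the shape is the same: a scoring function mapped across a dataset by a pool of workers.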
What's Ahead
To compete in today's hyper-competitive global marketplace, businesses need to understand what's around the corner. Predictive
analytics technology enables this to happen,
and the new generation of tools incorporates
such predictive capabilities.
The ability to automate low-level decisions is freeing up organizations to apply
their mind power to tougher, more strategic decisions. These days, analytical applications are being embedded into processes
and applied against business rules engines to
enable applications and machines to handle
the more routine, day-to-day decisions that
come up: rerouting deliveries, extending
up-sell offers to customers, or canceling or
revising a purchase order.
Many organizations beginning their journey into the new BI and analytics space are
starting to discover all the possibilities it offers.
But, in an era in which data is now scaling into the petabyte range, BI and analytics are more than technologies; they are a disruptive force. And, with disruption comes new opportunities for growth. Companies interested in capitalizing on the big data revolution need to move forward with BI and analytics as a strategic and tactical part of their business road map. The benefits are profound, including vastly accelerated business decisions and lower IT costs. This will open new and often surprising avenues to value.
Joe McKendrick is an author and independent researcher covering innovation, information technology trends, and markets. Much of his research work is in conjunction with Unisphere Research, a division of Information Today, Inc. (ITI), for user groups including SHARE, the Oracle Applications Users Group, the Independent Oracle Users Group, and the International DB2 Users Group. He is also a regular contributor to Database Trends and Applications, published by ITI.
sponsored content
In business, analytical
modeling is the job of trained
data scientists who use a variety
of tools for developing these
models. Frontline business
users do not have such skill, but
everyday decisions they make can
be vastly improved based on such
big data insights. Challenges arise
in this transfer of knowledge,
since most tools don't typically
talk to one another.
Organizations can enable
data scientists and trained analysts to
easily transfer business insights to frontline
workers by adopting tools that can expose
the widest support for advanced analytics
and predictive techniques, either natively or
through open integration with other tools.
2) DEMOCRATIZE
ADVANCED ANALYTICS
Big data has no voice without analytics.
Often the reason to work with large
quantities of low-level data is to apply
sophisticated analytic models, which can
tease out valuable insights not readily
apparent in aggregated information.
4) GIVE STRUCTURE TO
ACTIONABLE UNSTRUCTURED DATA
Unstructured data accounts for 80% of all
data in a business. It typically comprises
text-heavy formats like internal documents,
service records, web logs, emails, etc.
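As a sketch of that step, the following Python snippet uses a regular expression to turn a text-heavy service record into a structured row; the line format and field names are illustrative assumptions, not from any product mentioned here:

```python
import re

# Hypothetical service-record line; the layout is assumed for illustration
log_line = "2013-11-04 14:22:01 ticket=8812 customer=ACME status=resolved"

pattern = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"ticket=(?P<ticket>\d+) customer=(?P<customer>\w+) status=(?P<status>\w+)"
)

match = pattern.match(log_line)
record = match.groupdict() if match else {}
print(record)
# A structured record like this can now be loaded into a table or data mart
```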
5) SET UP CONNECTIVITY
TO REAL-TIME DATA
Not all big data use cases lend themselves
to real-time analysis. But some do. When
decisions need to be taken in real time (or near real time), this capability becomes a key success factor. Analytic solutions for financial trading, customer service, logistics planning, etc., can all be beneficiaries of tying live actual data to historical information or
forecasted outcomes.
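A minimal Python sketch of tying a live reading to historical information; the route names, baselines, and threshold are illustrative assumptions:

```python
# Historical baselines, e.g., precomputed in a data warehouse
historical_avg_latency = {"route_a": 120.0, "route_b": 95.0}

def flag_anomaly(route, live_latency, threshold=1.5):
    """Flag a live reading that exceeds its historical average
    by more than the given factor."""
    baseline = historical_avg_latency.get(route)
    if baseline is None:
        return False  # no history to compare against
    return live_latency > threshold * baseline

print(flag_anomaly("route_a", 200.0))  # 200 > 1.5 * 120 -> True
print(flag_anomaly("route_b", 100.0))  # 100 < 1.5 * 95 -> False
```

Production systems would stream the live values and keep the baselines current, but the core pattern, joining a live event against a historical aggregate, is the same.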
In the end, big data analytics initiatives
are very much like traditional business
intelligence initiatives. These five technological needs, however, demand significantly greater emphasis on your big data journey. Will you stop evading them now?
MICROSTRATEGY To learn how
MicroStrategy can help craft solutions
for your big data analytics needs, visit
microstrategy.com/bigdatabook.
Social Media
Analytic Tools
and Platforms
Offer Promise
By Peter J. Auditore
social media listening tools and social analytics platforms. Many are tapping their public
relations agencies to execute this new business
process. Smarter data-driven organizations
are extrapolating social media datasets and
performing predictive analytics in real time
and in-house.
There are, however, significant regulatory issues associated with harvesting, staging, and hosting social media data. These
regulatory issues apply to nearly all data types in regulated industries, particularly healthcare and financial services.
The SEC and FINRA, along with Sarbanes-Oxley, require different types of electronic communications to be organized, indexed in a taxonomy schema, and then archived and easily discoverable over defined time periods.
In-Memory
In-memory database technology, the next
major innovation in the world of business
intelligence and social media analytics,
is a game changer that can deliver the competitive advantage every CEO wants today.
In-memory technologies and built-in analytics are beginning to play major roles in
social analytics. The inherent business value
of in-memory technology revolves around
the ability to make real-time decisions based
on accurate information about seminal business processes such as social media.
The ability to know and understand the
customer experience is paramount in the new
millennium as organizations strive to improve
customer service, keep customers loyal, and
gain greater insights into customer purchasing
patterns. This has become even more important as a result of social media and social networks, which are now the new word-of-mouth platforms. In-memory promises not only to provide real-time data from transactional systems but also to allow organizations
to harvest and manage unstructured data
from the social media sphere.
services, employees, and partners. The majority of products work at multiple levels and drill down into conversations. Often, analytic results are provided in customizable charts and dashboards that are easy to visualize and interpret and can be shared on enterprise collaborative platforms
for decision makers. Some social media analytic platforms integrate easily with existing
analytic platforms and business processes to
help you act on social media insights, which can lead to improved customer satisfaction and enhanced brand reputation, and can even enable your organization to anticipate new opportunities or resolve problems.
base indicated that they were not using third-party BI tools for social media analytics.
What's Ahead
The 2012 social media and BI survey data
still provide a relevant picture of the state of
social media analytics. A majority of organizations will leverage legacy business intelligence vendors with familiar semantic layers
to perform rudimentary social media data
analysis. The big issue is that line-of-business managers will not wait for nonagile IT
departments to collect, harvest, stage/build,
and perform analytics on new social media
data marts or data warehouses.
New bleeding-edge social media analytical platforms are addressing the needs of
line-of-business professionals in real time.
They are also leveraging the economics of utility computing and the cloud to bring cost-effective analytical platforms to nearly all organizations. These highly integrated platforms
include simple social media listening tools,
along with embedded analytics and predictive
analytics that incorporate content and sometimes advertising abilities to meet the needs of
modern digital marketers. There are also other
new vendors that specialize in collecting and
delivering raw social media for those organizations which are building their own in-house
social media analytics platforms.
Traditionally, marketing had four Ps. Today, marketing has five Ps: product, place, position, price, and people, because in this millennium, the social media network is the new platform for word-of-mouth marketing.
Peter J. Auditore was most recently head of the SAP Business Influencer Group. He is a veteran of four technology startups: Zona Research (co-founder);
Hummingbird (VP, marketing, Americas);
Survey.com (president); and Exigen Group
(VP, corporate communications).
sponsored content
CONSOLIDATING EVERYTHING,
SLOW AND COSTLY
Providing analytics with the data required has always been difficult, with
data integration long considered the biggest
bottleneck in any analytics or BI project.
No longer is consolidating all analytics
data into a data warehouse the answer.
When you need to integrate data from
new sources to perform a wider, more
far-reaching analysis, does it make sense
to create yet another silo that physically
consolidates other data silos?
Or is it better to federate these silos
using data virtualization?
DATA VIRTUALIZATION
TO THE RESCUE
Cisco's Data Virtualization Suite addresses your difficult analytic data challenges.
Rapid Data Gathering Accelerates Analytics Impact: Cisco's nimble data discovery and access tools make it faster and easier to gather the data sets each new analytic project requires.
Data Discovery Addresses Data Proliferation: Data discovery automates entity and relationship identification, accelerating data modeling so your analysts can better understand and leverage your distributed data assets.
Query Optimization for Timely Business Insight: Optimization algorithms and techniques deliver the timely information your analytics require.
Data Federation Provides the Complete Picture: Virtual data integration in
CONCLUSION
The business value of analytics has never
been greater. But data volumes and variety
impact the velocity of analytic success.
Data virtualization helps overcome data
challenges to fulfill critical analytic data needs significantly faster with far fewer resources
than other data integration techniques.
Empower your people with instant access to
all the data they want, the way they want it
Respond faster to your changing analytics
and business intelligence needs
Reduce complexity and save money
Better analysis equals business advantage.
So take advantage of data virtualization.
LEARN MORE
To learn more about Ciscos data
virtualization offerings for big data
analytics, visit www.compositesw.com
Big Data Is
Transforming
the Practice
of Data
Integration
By Stephen Swoyer
tured sources, such as graph and network databases, along with human-readable sources, including JSON, XML, and text documents); and a host of so-called unstructured file types: documents, emails, audio and video recordings, etc. (The term unstructured is misleading: Syntax is structure; semantics is structure. Understood in this context, most so-called unstructured artifacts, including emails, tweets, PDF files, and even audio and video files, have structure. Much of the work of the next decade will focus on automating the profiling, preparation, analysis, and, yes, integration of unstructured artifacts.)
If all of this multistructured information is to be analyzed, it needs to be prepared; however, the tools or techniques required to prepare multistructured data for analysis far outstrip the capabilities of the handiest tools (e.g., ETL) in the data integration toolset. For one thing, multistructured information can't efficiently, or, more to the point, cost-effectively, be loaded into a data warehouse or OLTP database. The warehouse, for example, is a schema-mandatory platform; it needs to store and manage information in terms of facts or dimensions.
Alternatives to Hadoop
But while big data is often discussed
through the prism of Hadoop, owing to the
popularity and prominence of that platform,
alternatives abound. Among NoSQL platforms, for example, there's Apache Cassandra, which is able to host and run Hadoop MapReduce workloads, and which, unlike Hadoop, is fault-tolerant. There's also Spanner, Google's successor to BigTable. Google runs its F1 DBMS, a SQL- and ACID-compliant database platform, on top of Spanner, which has already garnered the sobriquet NewSQL. (And F1, unlike Hadoop, can be used as a streaming database. Here and elsewhere, Hadoop's file-based architecture is a significant constraint.) Remember, a primary contributor to Hadoop's success is its cost: as an MPP storage and compute platform, Hadoop is significantly less expensive than
What's Ahead
Big data, along with related technologies such as Hadoop and other NoSQL platforms, is just one of several destabilizing forces on the IT horizon, however. Other technologies are changing the practice of data integration, such as the shift to the cloud and the emergence of data virtualization.
Cloud will change how we consume and interact with (and, for that matter, what we expect of) applications and services. From a data integration perspective, cloud, like big data, entails its own set of technological, methodological, and conceptual challenges. Traditional data integration evolved
in a client-server context; it emphasizes
direct connectivity between resources, e.g., a requesting client and a providing server. The conceptual model for cloud, on the other hand, is that of representational state transfer, or REST. In place of client-server's emphasis on direct, stateful connectivity between resources, REST emphasizes abstract, stateless connectivity. It prescribes the use of new and nontraditional APIs or interfaces. Traditional data integration makes use of tools such as ODBC, JDBC, or SQL to query for and return a subset of source data. REST components, on the other hand, structure and transfer information in the form of files, e.g., HTML, XML, or JSON documents, that are representations of a subset of source data. For this
reason, data integration in the context of the
cloud entails new constraints, makes use of
new tools, and will require the development
of new practices and techniques.
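The difference can be sketched in Python: rather than issuing a SQL query through ODBC or JDBC and receiving rows, a REST client receives a document that represents a subset of the source data and works with that representation locally. The payload below is an invented example:

```python
import json

# A REST response body: a JSON representation of a subset of source data
response_body = """
{
  "customers": [
    {"id": 101, "name": "ACME", "region": "west"},
    {"id": 102, "name": "Initech", "region": "east"}
  ]
}
"""

# Instead of SELECT name FROM customers WHERE region = 'west' against
# a live connection, the client parses the transferred representation
doc = json.loads(response_body)
west = [c["name"] for c in doc["customers"] if c["region"] == "west"]
print(west)  # ['ACME']
```

The filtering that a SQL engine would perform server-side happens here against a stateless, transferred document, which is exactly the shift in constraints the REST model introduces.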
That said, it doesn't mean throwing out existing best practices: If you want to run sales analytics on data in your Salesforce.com cloud, you've either got to load it into an existing, on-premises repository or, alternatively, expose it to a cloud analytics provider. In the former case, you're going to have to extract your data from Salesforce, prepare it, and load it into the analytic repository of your
industry directory
that make Big Data available where and when needed across
301-770-2888
sales@appfluent.com
ATTUNITY
www.appfluent.com
www.attunity.com
SEE OUR AD ON
PAGE 49
codeFutures
CodeFutures is the provider of dbShards, the Big Data platform that
analytics and BI needs, and save 50–75% over data replication and
consolidation.
CODEFUTURES CORPORATION
11001 West 120th Avenue, Suite 400
Broomfield, CO 80021
COMPOSITE SOFTWARE
SEE OUR AD ON
PAGE 35
(303) 625-4084
sales@codefutures.com
www.dbshards.com
www.compositesw.com
DATAMENTORS
2319-104 Oak Myrtle Lane
SEE OUR AD ON
PAGE 47
DATAWATCH CORPORATION
Email: info@datamentors.com
www.DataMentors.com
www.datawatch.com
SEE OUR AD ON
PAGE 11
testing and less time waiting for new data, increasing utilization of
expensive test infrastructure. Analysts and managers make better
decisions with fresh data in data marts and warehouses. Leading global
organizations use Delphix to dramatically reduce the time, cost, and risk
DELPHIX
275 Middlefield Road
DENODO TECHNOLOGIES
sales@delphix.com
info@denodo.com
www.delphix.com
www.denodo.com
SEE OUR AD ON
PAGE 37
www.empolis.com
www.hitsw.com
Big Data across disparate data sets. Proven for more than 10 years,
Deutsche Telekom, and more than a dozen federal agencies rely on its
processes, and drive better outcomes faster. They leverage the platform
HPCC Systems was built for small development teams and offers a single
data-driven advantage.
KAPOW SOFTWARE
260 Sheridan Avenue, Suite 420
Palo Alto, CA 94306
LEXISNEXIS
Phone: 877.316.9669
Email: marketing@kapowsoftware.com
www.hpccsystems.com
www.kapowsoftware.com
www.lexisnexis.com/risk
SEE OUR AD ON
PAGE 43
Since 1988, Objectivity, Inc. has been the Enterprise NoSQL leader,
helping customers harness the power of Big Data. Our leading edge
rich mobile apps, and inside Microsoft Office applications. Big data
analyze big data visually without writing code and apply advanced
microstrategy.com/bigdatabook.
MICROSTRATEGY
1850 Towers Crescent Plaza
SEE OUR AD ON
COVER 4
OBJECTIVITY, INC.
3099 North First Street, Suite 200
SEE OUR AD ON
PAGE 7
Phone: 888.537.8135
408-992-7100
Email: info@microstrategy.com
info@objectivity.com
www.microstrategy.com/bigdatabook
www.objectivity.com
faster and more reliable for over 2,000 customers worldwide. Our
business information.
PERCONA
PROGRESS DATADIRECT
www.percona.com
www.datadirect.com
SEE OUR AD ON
PAGE 41
for real-time Big Data applications. Splice Machine provides all the
of the system goes down, the rest of the system is not affected. Data
high throughput.
SPLICE MACHINE
SEE OUR AD ON
COVER 2
TRANSLATTICE
+1 408 749-8478
info@splicemachine.com
info@TransLattice.com
www.splicemachine.com
www.TransLattice.com
SEE OUR AD ON
PAGE 9
Each issue of DBTA features original and valuable content, providing you with clarity, perspective, and objectivity in a complex and exciting world where data assets hold the key to organizational competitiveness.
Don't miss an issue!
Subscribe FREE* today!
*Print edition free to qualified U.S. subscribers.
Big Data
technologies,
including
Hadoop, NoSQL,
and in-memory
databases
Solving
complex data
and application
integration
challenges
Increasing
efficiency
through cloud
technologies
and services
Tools and
techniques
reshaping the
world of business
intelligence
New
approaches
for agile data
warehousing
Key strategies
for increasing
database
performance
and availability