BIG DATA QUARTERLY
Summer 2017
www.dbta.com

ENTERPRISE ARCHITECTURE: TECHNOLOGIES AND SKILLS THAT BUILD THE FOUNDATION FOR DATA MANAGEMENT
THE DEMANDS FOR faster development cycles, better software and services that are more secure and always-on, and smarter decisions fueled by real-time data are creating a strong force in the big data world today. As a result, big data technologies to support software development and data management are constantly evolving. And in this issue of Big Data Quarterly, the rapidly changing enterprise requirements and IT trends are explored from a range of vantage points in interviews, articles, and columns.

The rise of security issues posed by cloud, as well as mobility and big data, is considered in an interview with Steve Grobman, CTO of McAfee. There are tremendous efficiencies and cost benefits that organizations can achieve by moving to cloud-based architectures. However, organizations must also be aware that the value of data in these cloud environments makes it a sought-after target for cybercriminals, he observes. And, if such a breach occurs in a multitenant cloud, the impact could be severe.

Moreover, with the growing use of cloud technology in the enterprise, there is a keen awareness that the loss of services even for a short time in a cloud environment can wreak havoc, a point noted by a number of BDQ writers in this issue. To help alleviate the risk of a major disruption, Navisite's Michael Corey and VMware's Don Sullivan suggest there needs to be a stronger embrace of hybrid approaches, in addition to an understanding of the features and capabilities being delivered by a cloud provider.

The expansion of cloud technologies is also driving the use of DevOps, an approach that helps increase the speed and alignment of software development and IT operations. Classical DevOps was about creating velocity, but performance needs to be a first-level player in DevOps for big data, notes Pepperdata CEO Ash Munshi in an interview.

The use of containers can help organizations gain greater agility, since the approach supports multi-cloud deployments while also allowing them to deploy software more easily and better utilize their resources, adds MapR's Jim Scott. Microservices, a technology being used more frequently with containers, is another approach that helps organizations achieve much-needed agility, notes Unisphere Research analyst Joe McKendrick in his cover article on the technology changes taking place in enterprise architecture.

With this issue, we also welcome Software AG's Bart Schouw as the new author of the IoT Insider column. In his piece, Schouw highlights the security advantages of blockchain for connected devices. Blockchain was originally intended for data integrity, but why not use it for device integrity?

And there are many other great articles in this issue on the changes taking place in the world of big data. To stay on top of the latest big data developments, research reports, and news, visit www.dbta.com/bigdataquarterly.
Steve Grobman, CTO for McAfee

When you look at the data security forecasts that came out of the recent McAfee Labs 2017 Threat Predictions report, was there anything that stood out to you?
We are seeing IT aggressively moving to cloud-based architectures in order to improve efficiency and decrease costs. There are tremendous benefits from doing so, and many of those will be security benefits as well, in that cloud providers can inherently invest in building a strong security architecture. But we also need to recognize that, given the value of the data that will be held in these cloud environments, the benefit to bad actors in breaching those environments will be very high.

How so?
Whether that is from a low-level perspective like virtualization technologies, or the orchestration capabilities that tie it all together, or even the bridging technologies that allow cloud architectures to interoperate with traditional environments—all of those will be targeted. One of the most profound areas from my perspective is that when a multitenant cloud system is breached, the impact can be much more severe than breaching a single company's application or data architecture. The reason for that is that the bad actor could either steal or corrupt many parties' data versus just a single organization's. I think we will start to see issues related to the cloud becoming much more common.

It is possible for a data breach of any type to cause damage even if there is no financial opportunity.
You are right that using data as a weapon, as opposed to just monetizing it by stealing it and selling it, is key. That can benefit a cybercriminal in many different ways, and one of the easiest is simply by extortion—threatening a company that if it does not pay a ransom in bitcoin, all the email archives of its top executives, with information about salaries and off-color comments, will be released. There is the potential to use data to threaten harm and extort companies. But then, the other part of that is an offshoot of what we saw during the presidential election cycle.

Please explain.
Compromised data can be augmented with fabricated data to make things even worse. For example, in a corporate environment, if you had your CEO's email stolen, a bad actor could release the legitimate stolen data to establish credibility but then add fabricated data to do even more harm. Think of a scenario where the bad actor's objective is to make money by manipulating the stock price of an organization. Data stolen from a key executive, which when vetted and evaluated will be found to be legitimate, can be interlaced with fabricated data that makes it appear as though there were scandals, or corruption, or illegal activity.

What needs to be done?
It is very important to make the point to the general public that we need to be very suspicious of data that is identified as coming from a data leak. With the media continuously reporting the content of breached data, the general public is being conditioned to essentially trust this information. Think of something as mundane as the Ashley Madison breach—there was nothing to prevent the hackers from adding names to those lists.
Whether for a connected car, home, factory, or business—you name it—end-to-end security is crucial to thwarting cybersecurity threats and keeping users safe.

OVER THE NEXT 6 years, the Internet of Things (IoT) market is expected to reach $883.55 billion, as connected devices continue to pour into just about every aspect of our lives. For enterprises, the IoT is helping to transform products into connected services, capable of creating recurring revenue streams, reducing costs, and enhancing customers' experiences.

Despite these benefits, the IoT brings a flood of security risks. Vulnerability to hackers, privacy concerns, and sheer uncertainty about what these devices are capable of doing are just a few of the IoT's inherent implications. Organizations and their networks are unprepared for the tsunami of devices on the way. According to AT&T's Cybersecurity Insights Report, 85% of enterprises are in the process of deploying, or intend to deploy, IoT devices. Yet just 10% are confident in their ability to secure those devices against hackers. In order to reap the benefits of the IoT, securing these devices and the streams of data that flow between them must be top-of-mind.

The Connected Car: The Epitome of IoT Hopes … And Fears
Looking at the wealth of devices entering the market, the connected car is the most talked about, and perhaps the most controversial. A "data center on wheels," the connected car is emblematic of everyone's best hopes and worst fears for the IoT.

On one hand, connected cars promise to promote safer and more efficient driving through technologies such as collision avoidance systems, remote diagnostics, predictive maintenance tools, and on-board GPS. Connected transportation systems and "vehicle-to-vehicle" (V2V) communication technologies tout further benefits, enabling cars and trucks to "talk" to one another to steer drivers clear of accidents and other hazards. In fact, the National Highway Traffic Safety Administration (NHTSA) estimates that V2V technology could prevent more than half a million accidents and save more than 1,000 lives each year in the U.S. And, while we are still a few years away from fully autonomous or driverless vehicles zipping down our highways, the possibilities of safer roads, less traffic, and reduced emissions are extremely enticing. Connected, highly automated cars also open up vistas for whole new experiences, such as immersive entertainment and collaboration, adding to the value.

On the other hand, connected cars pose threats not only to privacy (given all the data they collect) but also to safety, when not properly secured. Case in point: 2 years ago, a pair of hackers demonstrated the potential dangers of a cybersecurity attack on a connected car by remotely hijacking and crashing a Jeep over the internet. The incident led to the recall of 1.4 million Chrysler vehicles. Considering this dramatic demonstration, it is no surprise that consumers have been hesitant to "turn on" the various connected devices in their connected cars. A 2016 Spireon survey revealed that, despite interest in connected cars, 54% of participants said they have not actually used connected car features. Although their concerns are slowly but surely diminishing (willingness to pay for connected services went from 21% in 2014 to 32% in 2015), auto manufacturers are missing out on the opportunity to capitalize on a $155 billion market and the recurring revenue that stems from subscriptions to different connected services.

Shaun Kirby is director, Automotive & Connected Car, at Cisco.
Figure 2: Simple federation (A) versus advanced data virtualization approach (B)
It is also straightforward to create different logical views over the same physical data, adapted to the needs of each type of user. Furthermore, data virtualization provides a single entry point to apply global security and governance policies across all the underlying systems.

Nevertheless, to realize the full potential of logical architectures, it is crucial that the data virtualization system include query optimization techniques specifically designed for combining large distributed datasets. In turn, many of the data federation/virtualization systems available in the market reuse the query optimizers of conventional databases, with only slight adaptations. This is the case with the data federation extensions recently introduced by some database and BI tool vendors. However, those optimizers cannot apply the most effective optimization techniques for logical architectures, and the associated performance penalty can be very significant.

More precisely, there are two types of capabilities needed to achieve the best performance in these scenarios:

1. Applying automatic optimizations to minimize network traffic, pushing down as much processing as possible to the data sources.
2. Using parallel in-memory computation to perform at the DV layer the post-processing operations that cannot be pushed down to the data sources.

The following example illustrates how both capabilities work; the figure shows a simplified logical data warehouse scenario:

• An enterprise data warehouse (EDW) contains sales from the current year (290 million rows), and a Hadoop system contains the sales data from previous years (3 billion rows). Sales data include, among other information, the customer ID associated with each sale.
• A CRM database contains customer data (5 million rows, one row for each customer). The information for each customer includes the customer's country of origin.

Parts A and B of the figure show two alternative strategies to calculate a report of the total amount of sales by customer country in the last 2 years. As depicted, the report needs sales data from both the current and previous years and the country of origin of each customer. Therefore, it needs to combine data from the three data sources. Data federation tools, using extensions of conventional query optimizers, would compute this report using Strategy A, while DV tools with optimizers designed for logical architectures would use Strategy B. In fact, the most sophisticated DV tools would consider additional strategies beyond Strategy B and choose the best one using cost information.

In Strategy A, the federation tool pushes down the filtering conditions to the data sources and retrieves the data required to calculate the report. Since the report includes one
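The pushdown capability described above can be sketched in miniature. The following is a toy Python illustration, not any vendor's implementation: tiny in-memory lists stand in for the EDW, Hadoop, and CRM systems of the example, and the aggregation is performed "inside" each sales source so that only one row per customer, rather than one row per sale, reaches the DV layer.

```python
from collections import Counter, defaultdict

# Hypothetical in-memory stand-ins for the three systems in the example:
# an EDW with current-year sales, a Hadoop store with prior-year sales,
# and a CRM with one row per customer.
edw_sales = [("c1", 100), ("c2", 50), ("c1", 25)]   # (customer_id, amount)
hadoop_sales = [("c1", 10), ("c3", 70)]
crm = {"c1": "US", "c2": "DE", "c3": "US"}          # customer_id -> country

def pushed_down_aggregate(sales):
    """Partial aggregation executed inside each data source (Strategy B):
    only one row per customer crosses the network, not one row per sale."""
    totals = Counter()
    for customer_id, amount in sales:
        totals[customer_id] += amount
    return totals

# DV layer: combine the small partial aggregates, then join with the CRM.
per_customer = pushed_down_aggregate(edw_sales) + pushed_down_aggregate(hadoop_sales)
by_country = defaultdict(int)
for customer_id, total in per_customer.items():
    by_country[crm[customer_id]] += total

print(dict(by_country))  # → {'US': 205, 'DE': 50}
```

At real scale the same idea means shipping roughly 5 million partially aggregated rows instead of 3.29 billion raw sales rows, which is the performance gap between Strategy A and Strategy B.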
By Joe McKendrick

WHAT ARE THE enabling technologies that make enterprise architecture what it is today? There are a range of new-generation technologies and approaches shaping today's data environments. The key is putting them all together to help enterprise architecture fit into the enterprise's vision of itself as a data-driven organization. Tools and technologies emerging within today's data-driven enterprise include cloud, data lakes, real-time analytics, microservices, containers, Spark, Hadoop, and open source trends.
CLOUD

Cloud computing, in its current form, has been on the scene for close to a decade. It has only been within the past 2–3 years, however, that it has hit its stride as the solution of choice for data environments. "The acceleration to the cloud has passed the point of no return," said Matthew Glickman, vice president of product management for Snowflake. "More and more companies, regardless of scale, are all in the cloud." Organizations are embracing cloud "to reach new levels of agility, increase the speed of innovation, and improve time-to-market rates," said Mat Keep, director of product and market analysis for MongoDB. "We estimate that the majority of our deployments today are in the cloud, and we're seeing those numbers increase."

There are a range of benefits enterprises are already seeing from cloud, including the ability to "scale applications to new geographies, decrease investments in local data center resources, and improve the ability to deliver apps quickly—all while reducing application and infrastructure provisioning," said Keep. For the most part, startups "will never have their own data centers, opting instead to be cloud natives," according to Joe Pasqua, executive vice president of products for MarkLogic.

Benefits also include "agility, in which, for example, public clouds allow for quick spin-up or spin-down of infrastructure," as well as "scale, in which public clouds allow for nearly-unlimited storage and compute, enabling customers to burst data and/or analytics into a cloud on an as-needed basis," said Jack Norris, senior vice president of data and applications for MapR. Finally, there are cost savings, in which "public clouds allow for a pay-as-you-go model, where customers are charged based on resources used."

For smaller operations, cloud is the de facto platform, as pointed out by Eric Mizell, vice president of global solution engineering for Kinetica. "Most startups are 100% cloud, as it's easier to spin up and down instances versus standing up servers in an office."

At the same time, Mizell sees movement even among the largest data centers "away from traditional datacenters for most workloads." That is the case, he says, because "it is now essential to have global collection and processing zones in the cloud for easier and faster data handling around the world. They say that data has gravity, and what is collected in the cloud stays in the cloud." Moreover, the infrastructure behind the cloud keeps getting faster and more powerful.

Areas where cloud is gaining the most traction include "newer digital business projects that provide responsive and personalized customer and employee-centric experiences using mobile, web, and IoT applications," said Ravi Mayuram, senior vice president of products and engineering for Couchbase. "We see these new systems being built across many industries, including ecommerce, travel and hospitality, digital health, digital media, financial services, and gaming." While tech and media were the early cloud adopters, other industries are now joining the cloud movement, Glickman agreed.

Be careful not to associate cloud exclusively with "public" cloud services, Norris cautioned. There's a key role for on-premises data centers as well. "Cloud is less about which sites to deploy to, and more about taking advantage of all physical sites available," Norris said. "Hybrid models, where services or resources are managed in some combination of on-premises and public cloud, are quite prevalent."

Interestingly, the cloud "is already starting to be seen as the more secure place to operate your business," Glickman said. "Regulators might soon begin rewarding their constituents who operate in the cloud since they can provide greater transparency to their respective businesses." Ultimately, he added, "cloud adoption will reach its point of full-on adoption once everyone stops talking about cloud adoption."

DATA LAKES

Industry experts are bullish on the concept of the data lake. As Syed Mahmood, director of product marketing at Hortonworks, pointed out, "The data lake is a natural extension of a company's decision to embark on its big data journey." However, experts disagree about whether Spark or Hadoop is being used to support these environments.

The urgency of the data lake concept is acute. "The need to bring data from different systems together into a centralized repository for analytics and reporting is nothing new, but with data volumes exploding, and much of that data now being semi-structured and unstructured, traditional enterprise data warehouses are buckling under the load," said Keep. "Data lakes augment, rather than replace, the enterprise data warehouse." He noted that in building data lakes, Hadoop isn't the only solution available, and it likely introduces complexity. "If organizations go the Hadoop route, they need to consider how they will integrate the analytics created in the data lake with the operational systems that need to consume those analytics in real time. This demands the integration of a highly scalable, highly flexible operational database layer."

While Hadoop has made the data lake possible, it also introduces challenges, such as "the potential to become a data dump, security issues, lack of skill sets, and slow performance, causing smaller or less agile companies to either not try or give up on Hadoop," said Keep. He noted that he has seen many companies add "a fast data layer on top of Hadoop to help increase its value." According to Keep, "Spark offers new life to the data lake concept. It brings performance and machine-learning algorithms that enable the desired data munging businesses want. It also plays well in the cloud by enabling data in cloud storage to be processed faster than ever before."

Still, some experts caution against diving too deep into a data lake. "Unfortunately, many companies have seen their data lakes turn into data swamps," said Norris. With respect to Hadoop and Spark, "we see two types of customer adoption patterns," he said. "The first group started with Hadoop and then adopted Spark and are using both technologies. The second group adopted Spark initially and use Spark independent of Hadoop." Spark's streaming analytics, he added, benefits significantly from running on a data platform that is not limited by Hadoop's batch constraints.

REAL TIME

What are the best technologies for enabling real-time analytics? For Dinesh Nirmal, vice president of analytics development at IBM, the answer is Spark. "Spark radically simplifies the analysis of large datasets, enabling even those without advanced data science degrees to access information faster and more reliably than ever before," he explained.

Apache Spark is appealing for real-time environments "because users can compute analytics very quickly, which is especially important in today's highly responsive customer-facing applications," said Mayuram. He pointed to another real-time enabler, Apache Kafka, which provides a "standard way to move data from an application context into a broker, so your web application team doesn't need to worry about how to make it available to downstream consumers—their responsibility ends at Kafka. Likewise, different application teams can build analytics on the website data by consuming it from Kafka—no prearrangement required."

A notable benefit of both Kafka and Spark, Mayuram continued, "is the ability to support real-time data streaming, which significantly reduces the traditional time lag between when data enters the system and when the results of ETL and analytical processes are available."

Ultimately, for the success of analytics and real-time solutions, data needs to be trusted. "Most analytics technologies fall down in this area," said Pasqua. He noted that "the goal of many real-time analytic processes is to determine as much as possible about an individual entity as opposed to a population. While many people think about analytics in terms of statistics over large groups of data, in real-time analytics, you often want to be able to scope your analysis to a very fine target."

MICROSERVICES AND CONTAINERS

Containers and microservices play a key role in helping to achieve agility in hybrid cloud or on-premises environments, industry observers agree. "Containers and microservices were born out of the cloud environment and are critical components to help developers be more agile," said Jason McGee, IBM fellow, vice president, and CTO for IBM Cloud Platform. "It's all about enabling developers to progress and iterate quickly. Developers have to spend a lot of time setting up the environments that support their application, installing and configuring software, setting up infrastructure, and moving applications between development, test, and production systems. Containers solve this challenge by standardizing how developers package their applications and dependencies, making it super simple to create, move, and maintain applications and allowing more time for what developers really want to do, which is create."

Keep agreed that containers provide much-needed application portability, "making it simpler to move services between on-prem and cloud environments, facilitated increasingly by the public cloud vendors rolling out container services."

For their part, microservices contribute to agility "by enabling the formation of smaller teams that do not have to coordinate as much with the larger organization," McGee continued. Keep added that "the large, monolithic code bases that traditionally power enterprise applications make it difficult to quickly launch new services. In the last few years, microservices—often enabled by containers—have come to the forefront of the conversation. Containers work very well in a microservices environment as they isolate services to an individual container. Updating a service becomes a simple process to automate and manage, and changing one service will not impact other services."
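The broker-based decoupling Mayuram describes for Kafka can be sketched with a toy in-memory broker. This is a stand-in for the pattern, not for Kafka or its API: producers append to a named topic, and each consumer group reads from its own offset with no prearrangement between teams.

```python
from collections import defaultdict

class ToyBroker:
    """Minimal stand-in for a message broker such as Kafka: producers append
    to a named topic; each consumer group tracks its own read offset."""
    def __init__(self):
        self.topics = defaultdict(list)     # topic -> list of messages
        self.offsets = defaultdict(int)     # (group, topic) -> next index to read

    def publish(self, topic, message):
        self.topics[topic].append(message)

    def poll(self, group, topic):
        """Return every message this consumer group has not yet seen."""
        start = self.offsets[(group, topic)]
        messages = self.topics[topic][start:]
        self.offsets[(group, topic)] = len(self.topics[topic])
        return messages

broker = ToyBroker()

# The web application team only knows how to publish page views...
broker.publish("pageviews", {"user": "u1", "page": "/home"})
broker.publish("pageviews", {"user": "u2", "page": "/pricing"})

# ...while an analytics team consumes the same topic, no prearrangement required.
views = broker.poll("analytics", "pageviews")
print(len(views))  # → 2
```

Because each group keeps its own offset, a second team (say, a fraud team) could poll the same topic independently and see all the messages, which is the decoupling the quote describes.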
Containers and microservices may go together, but they are not joined at the hip. "Just to be clear, containers are not required for microservices, nor are microservices required for containers," said Mayuram. "While it's correct that both containers and microservices are frequently used together in today's modern web, mobile, and IoT applications, they are not a requirement for each other."

Flexibility and adaptability are critical to container and microservices success. "Choose a database that meets the requirements of microservices and continuous delivery," Keep said. "When you build a new service that changes the data model, you shouldn't have to update all of the existing records, something that can take weeks for a relational database." Instead, Keep noted, it is important "to ensure that you can quickly iterate and model data against an ever-changing microservices landscape, resulting in faster time to market and greater agility."

One risk is the distributed nature of microservices, Keep said. "There are more potential failure points. Microservices should be designed with redundancy in mind." Automation is also essential to these environments, he added. "With a small number of services, it is not difficult to manage tasks manually. As the number of services grows, productivity can stall if there is not an automated process in place to handle the growing complexity." Finally, he advises, "learn from the experiences of others."

SPARK VERSUS HADOOP

While Hadoop has emerged as a popular open source framework in recent years, another contender, Apache Spark, is stealing its thunder. "Our customers, especially those who are building newer big data projects, tend to choose Spark over Hadoop for big data processing," said Couchbase's Mayuram. "Spark performs better, is easier to manage, and provides additional functionality like machine learning, which tends to make it much more attractive than Hadoop for big data processing."

Hadoop is dying in the enterprise, Glickman agreed. "Hadoop-based projects are slowly failing and will eventually be replaced with cloud-based services that are better suited to the tasks Hadoop tried to solve on-premises. Apache Spark, on the other hand, is thriving. By being data-source agnostic by design, Spark never had a tight coupling to Hadoop, or more precisely, HDFS."

Some industry observers, however, believe Spark and Hadoop can coexist and deliver impressive synergies. "We don't view this as a Spark-versus-Hadoop debate," said Mahmood. "We believe that analysts and data scientists require a centralized platform to develop predictive applications. Apache Hadoop provides this foundational platform for big data processing, with HDFS for storage and YARN for compute management. We believe that Apache Spark is more effective when it operates as part of a Hadoop platform. With the burden of the platform being taken care of by Hadoop, data scientists can be more productive by simply focusing on building predictive applications."

OPEN SOURCE

Open source is also gaining traction, and, in particular, a number of key Apache projects are getting a foothold in the enterprise. "We often see different technologies being brought in to address application development, data management, and operational challenges," said Mayuram. Some of the more common Apache projects that Couchbase sees within enterprise customers are Spark, Kafka, ActiveMQ, Flume, Arrow, TomEE, Web Server, Cordova, Axis, ZooKeeper, Mesos, Groovy, Commons, OpenJPA, ServiceMix, Zeppelin, and Lucene.

Mahmood sees another solution, Apache Ranger, also gaining traction among enterprises that "are increasingly concerned about providing secure and authorized access to data such that it can be widely used across the organization, while also keeping sensitive information safe. Apache Ranger is being used by some of the largest companies across industries to provide a framework for authorization, auditing, and encryption and key management capabilities across big data infrastructure." Other open source tools include Apache Atlas, which addresses data management and governance, and Apache Zeppelin, which ensures that "access to data is democratized and citizen data scientists can use a web-based tool to explore data, create models and interact with machine learning models," Mahmood stated.

BLOCKCHAIN

And, finally, there is an increasing role for blockchain—the global, distributed database—in today's enterprise environments. While the direction and impact of this technology are not yet clear, blockchain promises to disrupt many data management approaches. "Blockchain technology excels at building trust between groups of inherently untrusting legal entities," said Jerry Cuomo, IBM fellow and vice president of blockchain technologies. "If everyone trusts each other, like in a private enterprise, we really don't need a blockchain. However, every enterprise has business-to-business relationships where value is exchanged." For example, he noted, in a supply chain, partners, suppliers, and shippers manage the exchange of goods across enterprises. "This is where blockchain shines."
TO STREAM OR NOT TO STREAM—JUST FLIP A SWITCH
formance is an important piece of the puzzle, and that’s where Employ machine learning and other real-time approaches.
database technology converges with the drive to real time. Behind just about every analytics-driven interaction is an algo-
Here are the key elements to consider in moving to a fast or rithm that employs techniques to gather data and do some type
streaming data environment: of pattern matching to measure preferences or predict future
Mind your storage. Fast data requires a technology compo- outcomes. Machine learning approaches enable these systems to
nent that is essential: abundant and responsive storage. This is adjust software to data streams without time-consuming manual
where data managers and their business counterparts need to intervention.
understand when and where data pulsing through their orga- Look to the cloud. Today’s cloud services support many
nizations need only to be read once and discarded, or stored of the components required for fast or streaming data—from
for historical purposes. Many forms of data—such as constant machine-learning algorithms to in-memory technologies. Most
streams of normal readings from sensors—simply aren't important enough to invest in archival storage.

Consider alternative databases. Much of the data that is being sought across enterprises these days is the unstructured, non-relational variety—video, graphical, log data, and so forth. Relational data systems tend to be slower than necessary for the tasks that employ unstructured data streams. NoSQL databases, for example, have lighter-weight footprints and can process these data streams at faster rates than established relational database environments.

Employ analytics close to the data. It may also be helpful to use data analytics that are embedded with database solutions for many basic queries. This enables faster response times, versus routing data and queries through networks and centralized algorithms that may drag on performance and increase wait times.

Examine in-memory options. The delivery of highly intelligent, interactive experiences requires that back-end systems and applications operate at peak performance. That requires movement and delivery of data at blazing speeds, recognizing that every nanosecond counts in a user interaction. In-memory technologies—which can support entire datasets in memory—can deliver this speed.

Most respondents in the OpsClarity survey (68%) cite the use of either public cloud or hybrid deployments as the preferred mechanism for hosting their streaming data pipelines.

Pump up your skills base. The next-generation approaches required for delivering fast or streaming data and analytics also call for new types of skills in these areas. Data professionals need greater familiarity with new tools and frameworks, including Apache Spark and Apache Kafka. Organizations must increase their levels of training for current data management staffs, as well as seek out these skills in the market.

Look at data lifecycle management. It's important to be able to filter the data that is required for eventual long-term storage from the data that is only valuable in the moment. Otherwise, the amount of data that would need to be stored would be overwhelming—and mostly unnecessary. A way to address potential storage overload is data lifecycle management, in which certain types of data are either eliminated or moved to low-cost storage vehicles, such as tape, after a predetermined amount of time.

—Joe McKendrick
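The lifecycle policy described above—eliminating data or moving it to low-cost storage after a predetermined amount of time—can be sketched in a few lines of Python. This is a minimal illustration only; the tier names and age thresholds are invented for the example, not prescribed by the article:

```python
from datetime import datetime, timedelta

# Hypothetical age thresholds per storage tier; real policies depend
# on the organization's retention and compliance requirements.
TIERS = [
    (timedelta(days=30), "hot"),    # recent data stays on fast storage
    (timedelta(days=365), "cold"),  # older data moves to low-cost storage
]

def assign_tier(created: datetime, now: datetime) -> str:
    """Return the storage tier for a record based on its age."""
    age = now - created
    for max_age, tier in TIERS:
        if age <= max_age:
            return tier
    # Past all thresholds: candidate for tape or outright elimination.
    return "archive-or-delete"

now = datetime(2017, 7, 1)
print(assign_tier(datetime(2017, 6, 20), now))  # hot
print(assign_tier(datetime(2016, 12, 1), now))  # cold
print(assign_tier(datetime(2014, 1, 1), now))   # archive-or-delete
```

A real implementation would run such a classifier on a schedule and trigger the actual moves; the value is in making the retention rules explicit and testable.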
There are many possible journeys on the Road to Big Data. In this article we will look at handling streaming data and how to deliver effective analytics on high-speed data. Most importantly, we want to provide you a path to doing this successfully in a short amount of time.

USE CASES

Many common big data use cases rest on streaming data. Successful B2C companies need to interact with, and make decisions with, their customers in real time. Fast analytics on fast data underpins everything from understanding your customer's next best move to delivering targeted promotions over multiple channels to handling inquiries promptly.

In industries as diverse as banking, transportation, building maintenance, and particle physics, predictive maintenance solutions rely on rapid analysis of sensor data. Let's look at what that takes. And the fraud detection that we all experience on credit card transactions as well as ecommerce sites is only possible with immediate analytics on transactions as they stream in.

MORE DATA—BETTER RESULTS

But it takes more than streaming data to bring these use cases to life. Building the right predictive analytics requires a rich collection of historical data against which to build and test different models. Adding more data is almost always more effective than building better algorithms on less data, which means access not just to that streaming data, but also to the full richness of your new and existing datasets.

THREE STEPS

We can break this down into three simple steps:
1. Establishing a data lake that can host all the non-streaming data you need, along with the analytics to capitalize on it.
2. Capturing your data stream(s), establishing real-time analytics, and depositing that data into your data lake for future use.
3. Integrating analytics to act upon both your data stream(s) and your data lake.
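The interplay of steps 1 and 2 can be prototyped without any infrastructure at all. The sketch below is a toy illustration in plain Python, with invented numbers and a simple three-sigma rule standing in for a real fraud model: a baseline is built from historical transactions (the data lake side), and each "streaming" transaction is scored against it as it arrives:

```python
import statistics

# Historical transactions (the data lake side), used to build a baseline.
historical = [12.0, 15.5, 14.2, 13.8, 16.1, 15.0, 14.7, 13.2]
mean = statistics.mean(historical)
stdev = statistics.stdev(historical)

def flag_transaction(amount: float, k: float = 3.0) -> bool:
    """Flag a streaming transaction that deviates more than k standard
    deviations from the historical baseline."""
    return abs(amount - mean) > k * stdev

# Simulated data stream (step 2): score each event as it arrives.
stream = [14.9, 15.3, 250.0, 13.1]
flags = [flag_transaction(amt) for amt in stream]
print(flags)  # only the 250.0 transaction stands out
```

In a real pipeline the stream would arrive through a broker such as Kafka and the baseline would be rebuilt continuously from the lake, but the shape of the computation—historical context plus per-event scoring—is the same.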
A QUICK PROOF OF CONCEPT

Seeing is believing. Building a fast data environment need not be a time-consuming project. With the right examples, and the right experimental data to work with, it's possible to spin up a cloud environment and start getting to work. Oracle's Big Data Cloud Platform offers a comprehensive environment that lets you quickly:
• Load data into an object store—the "new data lake"—and do basic batch analyses
• Set up a Kafka stream to ingest and analyze fast, streaming data
• Integrate contextual data in your data lake with streaming data using the power of Kafka, Spark and Hadoop

THE NEW DATA LAKE SOLUTION

Big Data Cloud Service—Compute Edition is the right platform to get started with. It offers a simple, developer-friendly interface, can provision a new cluster in minutes, and encrypts data-at-rest and data-in-motion to keep it secure. Event Hub Cloud Service delivers the power of Kafka as a managed platform to handle streaming data. And Object Storage is the foundation for the most cost-effective and flexible data lake.

OBJECT STORAGE—THE BEST DATA LAKE PLATFORM

Historically (if that's the right term for something that is not yet a teenager), data lakes have been based on Hadoop and HDFS. But in the cloud there's a better option. Object storage automatically replicates and distributes data across multiple data centers to increase availability and data integrity. With object storage you can:
• Detach compute from storage to allow each to grow independently
• Persist data in a lower-cost store that offers greater durability
• Maintain a centralized, multi-tenant platform that has the flexibility to handle new workloads, new types of data, and new software frameworks as the complexity of use cases increases.

Hadoop HDFS' strategy of intrinsically tying storage and compute is increasingly becoming an inefficient use of resources when it comes to enterprise data lakes. Think of object store as the lowest tier in your storage hierarchy. Object Store allows you to decouple storage from compute, giving organizations more flexibility, durability, and cost savings. Our guidance is to store everything in Object Store and read only the data you need into the application tier on demand. At the end of the day, the cost of copying this data as needed is small compared with the savings and the increased flexibility.

CALL TO ACTION

The best way to learn is by doing. Visit oracle.com/bigdatajourney for step-by-step instructions on how to get started using sample data and a real use case. You'll also find instructions on how to redeem $300 in free cloud credits towards Oracle Cloud Services necessary to build "The New Data Lake."

ORACLE
www.oracle.com
ACCORDING TO Unisphere Research, over the past 5 years, storing data in the cloud has become an increasingly important feature of the overall data management infrastructure, and the amount of data stored in the cloud continues to expand at a healthy rate.

The growing assortment of private cloud, virtual private cloud, and public infrastructure-as-a-service offerings means IT professionals must start thinking about how to be successful in this increasingly hybrid context, as well as about strategies for managing what are becoming highly complex environments.

Kong Yang, head geek at SolarWinds, a provider of IT management software, believes the rise of the mobile workforce and the pressure to implement new technologies mean that modern IT professionals must be able to quickly evolve beyond the confines of on-premises deployment and shift into the realm of hybrid IT. Here, Yang reflects on some of the ways that IT professionals can begin that journey.

"Embracing monitoring as a discipline is of great importance to successfully implementing and maintaining hybrid IT."

What is changing in IT environments now with respect to cloud and on-premises deployments?

As the technology industry continues to transform, IT environments are becoming increasingly hybrid. A hybrid IT environment encompasses a mix of cloud services and on-premises deployments, and has positive and negative effects on IT professionals and organizations in general.

For example, hybrid IT gives organizations the opportunity to consider a workload's resource, security, and performance needs before determining whether it's a better fit for the cloud or if it should remain on-premises. Public cloud vendors supply IT organizations with the services necessary to implement hybrid IT on an as-needed basis; this ultimately gives organizations opportunities to choose services and scale as they are needed.

While convenient, affordable, and full of choices, hybrid IT also creates a host of problems for IT professionals, as they have to manage mission-critical layers of their application services across networks, systems, and services that they neither own nor control completely. This decreases their visibility into performance and challenges their authority to identify and resolve problems such as downtime and outages. With deployments that were previously onsite spread across cloud service providers, IT administrators need to monitor their environments more efficiently and effectively than ever, and develop new skills to succeed.

What do these changes mean for IT professionals?

With these changes, IT professionals must continue to develop new skills to keep pace and avoid being "left behind." IT professionals can no longer have a single area of expertise; as their IT environments become increasingly de-siloed, their areas of expertise must also extend beyond their usual discipline.

Discerning what can be moved outside the data center to best realize the benefits of hybrid IT requires a deep understanding of cloud services and how they integrate with on-premises deployments. IT professionals must learn to use this knowledge to determine what services and applications are best suited for on-premises, as the decision to migrate a portion of existing IT services to the cloud should not be taken lightly. IT professionals' ultimate value will be in balancing the cloud's benefits with performance, cost, governance, and security objectives.

What are the skills needed to succeed in a hybrid context?

An IT professional operating in a hybrid IT environment must surpass traditional roles and develop a keen understanding of enterprise networks, data centers, and application delivery; these skills are actually a mix of adapting existing skills and acquiring new ones. They must hone their skills of managing infrastructure services
The need for speed and agility is among the key drivers of the growing DevOps movement, which seeks to better align software development and IT operations. Yet, challenges still exist. As organizations embrace new models such as DevOps, a requirement for greater visibility and operational efficiency is also driving tool consolidation.

[Infographic: 42% are deploying and updating apps more frequently than in the past, and 44% of respondents see their workloads increasing (Source: Chef Survey 2017: Community Metrics). Organizations are using between 4 and 10 tools to manage their growing portfolios of custom apps (Source: Quali's 2016 DevOps Survey, March 2017).]
Getting Real Business Value
INSIGHTS
"AI has the potential to benefit businesses of every size, but right off the bat it's important to understand that generating ROI from AI is not easy."
TODAY'S HEADLINES ARE filled with news about artificial intelligence (AI), proclaiming variously that robots will take our jobs, cure cancer, or change industries in ways unseen since the industrial revolution. One thing is clear to those of us watching closely, however: It's not all hype. In 2016 alone, the quantity of AI startup acquisitions was remarkable, but most of these massive investments were made by an elite corps of companies, such as Amazon, Google, Apple, Facebook, and a few others.

The fact that these heavy hitters are leading the charge makes sense. AI investments, deployments, and resources are largely siloed within a small Silicon Valley circle, and the high cost of development or acquisition, combined with the fact that there is currently only a small cadre of truly talented AI experts, means that only a small number of companies have the resources to deliver AI innovation at scale. But what does that mean for everyone else? Does this signify that only the giants will profit from AI, while the rest of us languish?

The answer is an emphatic no. AI has the potential to benefit businesses of every size, but right off the bat, it's important to understand that generating ROI from AI is not easy. There is still a gap between research into AI and delivering actual tangible business results in the real world; in fact, a recent Forrester survey found that while 58% of companies are researching AI, only 12% are actually using AI systems in their businesses.

Not everyone will be able to develop their own AI solution in-house, so to catch up with the tech giants, many smaller companies will seek technology partners to set up the least expensive, most effective way to harness AI and machine learning. To that end, here are the three keys that will help you achieve the most success in setting up your AI strategy.

Roman Stanek is CEO of GoodData.
DESPITE THE LARGE investments that organizations are making in big data applications, difficulties still persist for developers and operators who need to find efficient ways to adjust and correct their application's code. To address these challenges, Pepperdata has introduced a new product based on the Dr. Elephant project that gives developers an understanding of bottlenecks and provides suggestions on how to fix them throughout the big data DevOps lifecycle.

Pepperdata CEO Ash Munshi recently discussed the need for DevOps for big data, and the role of Dr. Elephant, which was open sourced in 2016 by LinkedIn and is available under the Apache v2 License.

"Taking information about how things are actually performing and providing feedback to earlier parts of the DevOps chain is vital for big data."

What is happening in the big data space now?

We are seeing more and more customers going to production with big data. Our company tripled its growth last year and we have a nice vector planned for this year, as well. Our customers are proof that the technology and the solutions are leaving the lab and actually becoming business-critical.

What is changing as customers go into production with big data?

When we think about big data going into production, there are three big components. The first is making sure that things are reliable; the second is that they scale; and the third is that they perform. It is the performance aspect that we focus on as a company. The reason that performance is so hard for big data is that you are dealing with hundreds and thousands of computers, you are dealing with datasets that are usually two orders of magnitude larger than what classic IT has dealt with, and you are dealing with data that changes rapidly. And then, you are dealing with a lot of people doing things simultaneously. They are doing interactive work and decision support all on the same machines. That combination of variables is very hard to get your hands around, and the performance implications are even more difficult to understand. That is really why performance is such a big deal for big data. We like to say that performance can mean the difference between business-critical and "business-useless" for big data systems.

How is DevOps for big data different?

Classical DevOps was all about creating velocity between business requirements and needs—the developers writing the code, the systems that actually embody the code and solve the business problems—and it was all around processes like Agile, continuous integration, and continuous delivery. That is all well and good, but for big data, there is another big component, which is this performance aspect. We believe that performance needs to be a first-level player in DevOps for big data.

What does this involve?

Taking information about how things are actually performing and providing feedback to earlier parts of the DevOps chain is vital for big data. In particular, if you collect information about resource utilization, contention for the resources, the applications, and whether the places where they are deployed match or don't match, and give that back to the people who do the release part of it—and can also say that the developers made a set of changes but they might be detrimental to performance—that is an important aspect of feedback. In addition, going back to the developers and saying that they might want to change their algorithm because it is not using the cluster efficiently or it is taking up too many resources, and then going back to the people who actually provisioned the cluster and saying that the assumptions they made around the number of users, data volume, and the workload are not actually resulting in the response times they are expecting—these are all important feedback loops back into the DevOps chain, and they are vital for big data. That is really our fundamental thesis.

The products that we have today—the Cluster Analyzer, the Capacity Optimizer, and the Policy Enforcer—provide that type of feedback to the operators. The Cluster Analyzer gathers all of the data and answers questions about what resource is being used for what and how they are correlated. The Capacity Optimizer takes automated action and says there are additional resources available now to run more jobs, or run them quicker if more resources are allocated to them—so it does automated analysis using machine learning to use the resources better. And then the Policy Enforcer guarantees
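The core of the feedback loop Munshi describes—noticing that a new set of changes may be detrimental to performance and reporting that back up the chain—can be sketched in a few lines. The metric names, numbers, and 10% tolerance below are hypothetical, chosen purely for illustration; Pepperdata's actual products work on live cluster telemetry:

```python
# Hypothetical per-release resource metrics, as might be gathered by a
# cluster monitoring agent (names and values are illustrative only).
baseline = {"cpu_hours": 120.0, "peak_mem_gb": 64.0, "runtime_min": 42.0}
candidate = {"cpu_hours": 180.0, "peak_mem_gb": 66.0, "runtime_min": 58.0}

def performance_regressions(base: dict, cand: dict, tolerance: float = 0.10) -> list:
    """Return the metrics where the candidate release exceeds the
    baseline by more than the allowed tolerance (10% by default)."""
    return [
        metric
        for metric, base_value in base.items()
        if cand[metric] > base_value * (1 + tolerance)
    ]

# Feedback to the release step: these changes may hurt performance.
print(performance_regressions(baseline, candidate))
# ['cpu_hours', 'runtime_min']
```

Wired into a CI pipeline, a check like this turns "the developers made a set of changes but they might be detrimental to performance" from an after-the-fact discovery into a gate on the release itself.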
Another type of data to consider is textual data. Examples are product reviews, Facebook posts, Twitter tweets, book recommendations, complaints, and legislation. Textual data is difficult to process analytically since it is unstructured and cannot be directly represented in a matrix format. Moreover, this data depends upon linguistic structure and is typically quite "noisy" due to grammatical or spelling errors, synonyms, and homographs. However, this type of data can contain very relevant information for an analytical modeling exercise. Just as with network data, it is important to find ways to featurize text documents and combine them with other structured data. A popular way of doing this is by using a document term matrix indicating what terms appear, and how frequently, in which documents. Such a matrix will be large and sparse. Dimension reduction will thus be very important, making it necessary to represent every term in lowercase; remove terms that are uninformative, such as stop words and articles; use synonym lists to map synonymous terms to one single term; stem all terms to their root; and remove terms that only occur in a single document.

Even after these activities have been performed, the number of dimensions may still be too large for practical analysis. Singular value decomposition (SVD) offers a more advanced way to do dimension reduction. SVD works in a way that is similar to principal component analysis (PCA) and summarizes the document term matrix into a set of singular vectors, also called latent concepts, which are linear combinations of the original terms. These reduced dimensions can then be added as new features to an existing, structured dataset.

Besides textual data, other types of unstructured data, such as audio, images, videos, fingerprint, GPS, and RFID data, can be considered. To successfully leverage these types of data in analytical models, it is critical to think carefully about creative ways of featurizing them. When doing so, it is recommended that any accompanying metadata be taken into account. For example, in fraud detection, not only the image itself may be relevant but also who took it, where, and at what time.

The bottom line is that the best way to boost the performance and ROI of analytical models is by investing in data first. And remember that alternative data sources can contain valuable information about the behavior of customers.