
White Paper

An Overview of Big Data Problems and Solutions

Understanding Big Data: Storage, Analytics and Learning

MomentumSI 2011


Table of Contents
Overview
    How Much Data is Big Data?
Big Databases
    Introducing NoSQL Data Stores
    As-A-Service Data Stores
Big Analytics
    Column-Oriented Storage
    Simultaneous Load and Query
    Large Blocks and Compressed Data
    Massively Parallel / Share-Nothing Architecture
    MapReduce and Hadoop Solutions
Big Learning
    Big Learning Technology
Our Solutions


Overview
This paper provides an overview of the exponential growth of data in organizations and the need for businesses to extract value from it. Big Data represents the next phase of data management, both as an opportunity and as a delivery capability. Organizations will be challenged to adopt new techniques to capture data, analyze it, and mine the gems within it.

How Much Data is Big Data?


An interesting question is: how much data is Big Data? Although there is no single right answer, there are some commonly held views. First, it's worth noting that what was big just a few years ago is no longer considered big today. For example, at the time of this writing, a general-purpose 1 terabyte hard drive costs about $70 online. Increases in storage density, I/O operations and throughput have made it practical to store massive amounts of information on cheap disks.

Although views differ, most people agree that Big Data starts in the tens of terabytes but becomes more interesting and appropriate as the data approaches hundreds of terabytes or even thousands (that is, a petabyte). Some have pointed out that they're not sure they have enough data to warrant a Big Data approach. Yet many Big Data technologies are so disruptive that they are being applied to small and medium data sets as well. In some ways, the term "Big Data" is unfortunate in that it implies the techniques only work with the largest data sets. The truth is that it depends on which aspects and technologies you're working with. Some solutions may be overkill for the problem at hand; as a rule, however, MomentumSI sees many Big Data techniques being used at small- and medium-sized shops.


Big Databases
Big Data deals with more than just the amount of storage. One of the key considerations for a Big Data database is the ability to on-board data at a much more rapid pace than its predecessors could. This capability is often called "streaming" because the data arrives in a fluid, often non-stop motion. Consider the case presented by Associate Professor Deb Roy, head of the MIT Media Lab's Cognitive Machines research group:
Roy is recording nearly all of his new son's waking hours in an ambitious attempt to use these data to unravel the mystery of how humans naturally acquire language within the context of their primary social setting. He will pay particular attention to the role of physical and social context in how his son, 9 months old, learns early words and early grammatical constructions. Roy's vast recording and analysis effort, known as The Human Speechome Project (speech + home), will yield some 400,000 hours of audio and video data over three years. Roy and his wife have already gathered more than 300 gigabytes per day of compressed data by recording an average of 12-14 hours a day.

Obviously, capturing so much real-time information at such a rapid pace requires a rethinking of both approach and technology.

The data storage requirements of the Human Speechome Project present challenges that cannot be easily addressed with conventional storage technologies. Basic requirements include high-performance reads/writes in excess of 160 gigabits/second, massive shared volumes in excess of several hundred terabytes, and smooth scalability from an initial 50 terabytes to capacity well in excess of a petabyte. Additional requirements include 100 percent data redundancy, file access by computers running multiple operating systems, a fully virtualized storage fabric, and affordability using low cost, high capacity SATA hard drives.


This is the new world of Big Data. Individuals and organizations are taking advantage of cheap storage and commodity computing cycles to solve problems that would not have been within their reach only a few years ago. The need to capture information in rapid succession has driven the requirements for Big Data databases, and those requirements in turn have produced an array of new databases, many of which do not conform to the classic SQL standard and are dubbed "NoSQL."

Introducing NoSQL Data Stores


The need to capture new kinds of information, and at rates previously unheard of, drove a series of exciting introductions. Recent advances include:

- Document-Oriented Databases focus on capturing data even when you don't have a predefined schematic view of what is going to be needed. The system therefore needs to let you easily grow the document structure by adding new leaves (or extensions); a minimal sketch of this schema-free style follows this list. These systems often scale by using read-only replicas and auto-sharding. Examples include MongoDB and CouchDB.

- Key/Value Stores focus on caching small pieces of information in either memory or disk-backed memory. Because the focus is really caching, little emphasis is placed on query languages. Examples include Memcached, Redis and Project Voldemort.

- Peer Data Stores focus on scaling data across multiple servers (or peers) in order to alleviate potential read or write bottlenecks on any one server. In addition, by creating replicas of the data on multiple servers, this method provides built-in redundancy that increases the availability of the system. A key goal of the Peer Data Store is to provide near-linear horizontal scaling. Examples include Cassandra, Riak and HBase.

Many of the databases that were developed for special use cases chose not to implement the SQL interface and collectively became known as "NoSQL" solutions. However, many of them have since implemented a SQL access layer on top of their solutions to make them easier for developers to use. In some cases, though, such as Key/Value stores, it simply didn't make sense to force SQL onto the model.
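To make the document-store idea concrete, here is a minimal sketch using the 2011-era MongoDB Java driver. The host, database, collection and field names are illustrative placeholders, not part of any particular deployment; the point is that the second document can carry a field the first one never declared.

    import com.mongodb.BasicDBObject;
    import com.mongodb.DB;
    import com.mongodb.DBCollection;
    import com.mongodb.Mongo;

    public class DocumentStoreSketch {
        public static void main(String[] args) throws Exception {
            // Assumes a MongoDB instance is listening on the default local port.
            Mongo mongo = new Mongo("localhost", 27017);
            DB db = mongo.getDB("demo");
            DBCollection customers = db.getCollection("customers");

            // First document: just a name and an address.
            customers.insert(new BasicDBObject("name", "Bill Smith")
                    .append("address", "9900 Main Street"));

            // Later document: a new "phone" leaf appears with no schema migration.
            customers.insert(new BasicDBObject("name", "John Carter")
                    .append("address", "8804 Vine Street")
                    .append("phone", "555-0100"));

            mongo.close();
        }
    }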


As-A-Service Data Stores


In addition to the data stores mentioned above, there is another movement toward delivering software and data in an on-demand, over-the-network model called "as a Service." This model is enabled by cloud computing, where resources are opaquely provisioned based on user demand. Popular options include:

- BLOB Storage: BLOBs (Binary Large Objects) are continuous streams of bits ranging from a few bits to gigabytes in length; they are typically things like images, videos, software files or documents. Amazon provides a BLOB system called S3 (Simple Storage Service), while Microsoft offers a similar service called Azure Blob Storage. Similar features are now being found in private clouds: Eucalyptus provides Walrus, while OpenStack offers OpenStack Object Storage.

- Relational Database as a Service: Since so many companies have their databases rooted in the relational world, there is tremendous pressure to preserve the model their systems currently use and their staff is productive with. The focus of Relational Database as a Service is to offer developers the model they are used to, but to make the access, operation and scaling of the database much easier than before. This is accomplished by a service that gives developers an on-demand model while also providing automated scaling, backup and recovery. Examples include Amazon's Relational Database Service, Microsoft SQL Azure Database and Salesforce's Database.com.

In addition to the categories mentioned, other popular services include Amazon SimpleDB and Google's BigTable. This software falls into the Peer Data Store category, but in this case it is delivered in the as-a-Service model. It is worth noting that both offerings are considered significant advances in the way they store and process large volumes of data. They serve as the inspiration for many of the open-source databases that have been available under traditional licenses, but that are now being introduced into the as-a-Service world to compete with the large Internet companies.
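As a concrete illustration of BLOB storage, here is a minimal sketch that stores one object in S3 using the AWS SDK for Java. The credentials, bucket name, key and file path are all placeholders for this example.

    import java.io.File;

    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3Client;

    public class BlobStoreSketch {
        public static void main(String[] args) {
            // Placeholder credentials; real code would load these securely.
            AmazonS3 s3 = new AmazonS3Client(
                    new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));

            // Store a local file as a BLOB under a bucket/key pair.
            s3.putObject("example-bucket", "videos/day-001.mp4",
                    new File("/data/day-001.mp4"));

            // The object can later be retrieved by the same bucket and key:
            // S3Object blob = s3.getObject("example-bucket", "videos/day-001.mp4");
        }
    }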


Big Database Organizational Call-To-Action


The efficient use of data has always been at the heart of competitive advantage. Today, organizations must revisit their data collection strategy. Although the technology enables cost-effective and simplified on-boarding and processing, this doesn't mean collecting data for the sake of collecting data; it means experimenting with large data sets to understand whether a business advantage can be established. In addition, companies should review the current use of databases in their organization. The modern data stores available today have significant cost advantages over their predecessors.

Big Analytics
Big Analytics refers to the need to analyze our large data stores using new tools and techniques. This kind of analytics is driven by a person using a report-writing tool; the distinction matters because other types of analytics (like Big Learning) are now driven by automated routines that use machine learning to make recommendations and the like. It wasn't so long ago that running data analysis routines across tens or hundreds of gigabytes was a real challenge. As the size of the data continues to grow, so does the need for new techniques to solve the problems. Over the past few years we've seen two trends emerge:

- Use of proprietary MPP (massively parallel processing) techniques
- Use of MapReduce (a standardized MPP technique)

Both are viable techniques for analyzing large data sets, as both focus on decomposing very large problems into smaller units. Recently, however, there has been growing interest in using MapReduce as a common foundation for batch jobs, data analytics, data mining and more. First, it's important to understand the key characteristics of modern data analytics platforms:


Column-Oriented Storage
Huge efficiency gains are available when a hard drive can read contiguous data, and many analytics queries permit exactly this type of read. Relational databases have traditionally stored data in rows, for instance:
1, Bill Smith, 9900 Main Street
2, John Carter, 8804 Vine Street
3, Tim Jones, 505 1st Street

In a column-oriented database, the values of each column are serialized together:


1, 2, 3;
Bill Smith, John Carter, Tim Jones;
9900 Main Street, 8804 Vine Street, 505 1st Street;

In general, row-oriented storage is most common in relational database systems, while column-oriented storage is common in OLAP / data warehousing.
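A minimal sketch of the difference: the same three records in both layouts, with a scan that sums one column. The row layout must touch every record in full, while the column layout reads a single contiguous array. The names and values are simply the example data above.

    public class ColumnScanSketch {
        public static void main(String[] args) {
            // Row-oriented: each record keeps all of its fields together.
            Object[][] rows = {
                    {1, "Bill Smith", "9900 Main Street"},
                    {2, "John Carter", "8804 Vine Street"},
                    {3, "Tim Jones", "505 1st Street"}
            };
            long rowSum = 0;
            for (Object[] row : rows) {
                rowSum += (Integer) row[0]; // every full record is touched
            }

            // Column-oriented: each field is its own contiguous array.
            int[] ids = {1, 2, 3};
            long columnSum = 0;
            for (int id : ids) {
                columnSum += id; // only the ids array is read
            }

            System.out.println(rowSum + " == " + columnSum);
        }
    }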

Simultaneous Load and Query


Historically, organizations had a choice: do you want to load new data into the warehouse, or do you want to query the data that's already in there? Naturally, customers answered "both." Vendors have responded with what is often called "real-time analytics." Fundamentally, it enables incremental data changes to be applied on the fly without having to take down the system and execute a massive reload.

Large Blocks and Compressed Data


Since data analytics deals with very large data sets, it's necessary to manage the data very efficiently. Modern analytics solutions typically work on large blocks of data (tens or hundreds of megabytes) in a single pass. In addition, the data needs to be compressed, typically reducing its size by 3x to 10x. Naturally, the compression can't be allowed to negatively affect the actual data analysis routines.
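As a rough illustration of why block compression pays off, the following sketch compresses one large, highly repetitive block of column values with GZIP in a single pass. The data and the resulting ratio are illustrative, not a benchmark.

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.GZIPOutputStream;

    public class BlockCompressionSketch {
        public static void main(String[] args) throws IOException {
            // Build one large block of repetitive column values.
            StringBuilder block = new StringBuilder();
            for (int i = 0; i < 100000; i++) {
                block.append("9900 Main Street\n");
            }
            byte[] raw = block.toString().getBytes(StandardCharsets.UTF_8);

            // Compress the whole block in a single pass.
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            GZIPOutputStream gzip = new GZIPOutputStream(out);
            gzip.write(raw);
            gzip.close();

            System.out.printf("raw: %d bytes, compressed: %d bytes (%.1fx)%n",
                    raw.length, out.size(), (double) raw.length / out.size());
        }
    }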

Massively Parallel / Share-Nothing Architecture


Divide-and-conquer is a time-tested technique for breaking large problems into smaller ones. Massively parallel processing breaks a large problem into smaller ones and hands the local processing of each data partition to an independent server. This allows multiple servers to work on different parts of the problem in parallel, expediting the overall execution. A side benefit of this approach is the inherent resilience gained by not relying on a single system to do all the work: if designed with a Share-Nothing Architecture, an MPP system has no critical shared component that can become a single point of failure.
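A minimal sketch of the share-nothing idea: records are routed to workers by hashing their keys, so each worker owns its partition outright and no state is shared between them. The worker names here are placeholders.

    public class ShardRouter {
        private final String[] workers;

        public ShardRouter(String[] workers) {
            this.workers = workers;
        }

        // Deterministically map a record key to exactly one worker,
        // so every worker processes only its own partition of the data.
        public String workerFor(String key) {
            int bucket = Math.floorMod(key.hashCode(), workers.length);
            return workers[bucket];
        }

        public static void main(String[] args) {
            ShardRouter router = new ShardRouter(
                    new String[] {"node-1", "node-2", "node-3"});
            for (String key : new String[] {"bill", "john", "tim"}) {
                System.out.println(key + " -> " + router.workerFor(key));
            }
        }
    }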


MapReduce and Hadoop Solutions


MapReduce is a software technique introduced by Google to enable massively parallel processing of very large data sets across clusters of computers. Hadoop is an open-source implementation of the MapReduce technique, written in Java and available from Apache. Fundamentally, Hadoop spreads very large data sets over a distributed file system so that clusters of commodity computers can process the data. It is now common for data-intensive shops, including enterprise IT organizations, to either run their own Hadoop cluster or to leverage one from an external cloud provider (e.g., Amazon offers Elastic MapReduce).
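The canonical first Hadoop job is word count: mappers emit a (word, 1) pair for every token in their input split, and reducers sum the counts for each word. The sketch below uses the standard Hadoop Java MapReduce API; the input and output directories are supplied on the command line.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Mapper: emit (word, 1) for every token in the input split.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer tokens = new StringTokenizer(value.toString());
                while (tokens.hasMoreTokens()) {
                    word.set(tokens.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reducer: sum the counts emitted for each word.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }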

"Hadoop has been instrumental in enabling agile data analysis. In software development, agile practices are associated with faster product cycles, closer interaction between developers and consumers, and testing. Traditional data analysis has been hampered by extremely long turnaround times. If you start a calculation, it might not finish for hours, or even days. But Hadoop (and particularly Elastic MapReduce) make it easy to build clusters that can perform computations on large datasets quickly. Faster computations make it easier to test different assumptions, different datasets, and different algorithms. It's easier to consult with clients to figure out whether you're asking the right questions, and it's possible to pursue intriguing possibilities that you'd otherwise have to drop for lack of time." (An O'Reilly Radar report: "What is Data Science?")

Organizations that have committed to the MapReduce and Hadoop model are now interested in using their current clusters to solve Big Analytics problems. Success stories from companies like Yahoo! indicate that the size of the data that can be processed exceeds the petabyte mark, with the ability to run analytics routines on thousands of commodity servers simultaneously. In other words, it is a proven model with open industry support and a tremendous amount of momentum behind it. For this reason, virtually all of the major analytics vendors have had to revisit their execution models as customers question the use of specialized hardware, proprietary architectures and, often, exceedingly high maintenance and operations costs.
- HP/Vertica have announced the Vertica Connector for Hadoop
- EMC/Greenplum have Greenplum MapReduce
- IBM has InfoSphere BigInsights (for Hadoop)
- Aster Data has the Aster Hadoop Data Connector

In addition to the mainstream players, the Apache Software Foundation has released Hive, an open-source analytics tool designed specifically to run on Hadoop. Although early in its maturity cycle, Hive shows significant promise and is sure to spawn a number of complementary or even competitive projects, all aiming to bring a simple but powerful model of Big Analytics to the Hadoop world.

Big Analytics Organizational Call-To-Action


Over the last decade, we've witnessed a strong move from very large and expensive symmetric multiprocessing (SMP) boxes to more cost-effective MPP solutions. The new push is to avoid dedicated analytics equipment in favor of a more reusable and scalable clustered approach using Hadoop, on either bare metal or Infrastructure-as-a-Service. Organizations should revisit their strategy to understand the pros and cons of each model and determine the solution that best fits their needs.


Big Learning
By "Big Learning, were referring to the use of artificial intelligence techniques (with an emphasis on Machine Learning) on very large data sets. The scenarios for which mining approaches are used vary widely. The techniques are applied to virtually any field (business, engineering, core research, etc.) to identify better ways to solve a problem. Examples include: People who bought X are likely to also buy Y The characteristics of fraud are X: Does the transaction looks like X? (where X changes) A tumor looks like X: Is this a picture of a tumor?

Computing power has become significantly cheaper, and the ability to easily store and process data on large clusters is readily available. In addition, artificial intelligence routines have become more mainstream. Once, only the largest companies could afford the kinds of computers necessary to perform these complex calculations, and only the richest could afford to hire the data scientists needed to manage the process. Today, most students graduating with a computer science degree have the requisite skills to be productive using off-the-shelf software and a credit card for access to an online cloud computing cluster. It wasn't so long ago that most companies didn't have a warehousing/analytics group or even a database group. Now it's common to see organizations with a data mining or machine learning group. This function has moved from an interesting research area to a core driver of competitive business advantage.

Big Learning Technology


From a technology perspective, a variety of software systems and algorithms are used. Many are prepackaged into specialized solutions. For instance, there are several packages on the market that help companies determine whether someone is attempting to hack into their network (intrusion detection); most of the advanced systems use some kind of machine learning algorithm to identify these rogue patterns. Recommendation engines are also commonly found in bundled solutions. These offerings are an easy way to get started, but they often come up short when the required intelligence surpasses their built-in functionality or the sheer amount of data exceeds their ability to analyze it.


General-purpose machine learning routines are now becoming more widespread to accommodate large data sets and the need for custom routines. In these cases, companies will often use their existing Hadoop cluster to store and analyze large volumes of data, but will apply new machine learning routines against that data. This enables them to share the costs associated with Big Database and Big Analytics, from both an infrastructure and an operations perspective.

One of the more interesting machine learning projects is Apache Mahout. Like other Big Data solutions, it leverages Hadoop as the backbone for analysis. Currently, Mahout supports mainly four use cases (a sketch of the first follows the list):

- "Recommendation mining takes users' behavior and from that tries to find items users might like.
- Clustering takes e.g. text documents and groups them into groups of topically related documents.
- Classification learns from existing categorized documents what documents of a specific category look like and is able to assign unlabelled documents to the (hopefully) correct category.
- Frequent itemset mining takes a set of item groups (terms in a query session, shopping cart content) and identifies which individual items usually appear together." (Mahout.Apache.Org website)
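To show what recommendation mining looks like in practice, here is a minimal sketch using Mahout's Taste recommender API. It assumes a hypothetical ratings.csv file of userID,itemID,rating lines; the neighborhood size and user ID are arbitrary example values.

    import java.io.File;
    import java.util.List;

    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class RecommenderSketch {
        public static void main(String[] args) throws Exception {
            // Each line of ratings.csv: userID,itemID,rating
            DataModel model = new FileDataModel(new File("ratings.csv"));

            // "People like you": score user similarity from their ratings.
            UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
            UserNeighborhood neighborhood =
                    new NearestNUserNeighborhood(10, similarity, model);
            Recommender recommender =
                    new GenericUserBasedRecommender(model, neighborhood, similarity);

            // Top 3 items user 1 has not rated but is predicted to like.
            List<RecommendedItem> items = recommender.recommend(1, 3);
            for (RecommendedItem item : items) {
                System.out.println(item.getItemID() + " : " + item.getValue());
            }
        }
    }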

Big Learning Organizational Call-To-Action


Organizations should identify the areas where Big Learning will impact their business and create a plan to incrementally grow an internal capability in this field. Assume that large amounts of data will be collected and analyzed via commodity compute clusters. Build out the necessary computing environment to harvest the intelligence and train staff on the new model.


Our Solutions
MomentumSI helps organizations to take advantage of Big Data concepts and to grow their internal capabilities.

Hadoop Fast Start


Our Fast Start program is designed for organizations that are new to Hadoop and want to quickly implement a state-of-the-art cluster. Momentum offers our Tough Hadoop Distribution, which binds to popular private cloud (IaaS) technologies like Eucalyptus and VMware's vCloud Director.

Big Database Projects


Selecting and implementing the right Big Database can make the difference between success and failure. MomentumSI's consultants are experts in Big Database implementations and bring their consultative experience to your problem.

Big Analytics Projects


Momentum provides the consulting expertise to turn your Big Analytics problem into a manageable solution. Our solutions leverage Hadoop and industry-leading packaged products to deliver analytics on commodity compute stacks at a fraction of the cost of traditional infrastructures.

Big Learning Projects


Machine learning on Big Data represents the next wave of innovation. Momentum provides consulting expertise on the use of Apache Mahout on Hadoop.


About MomentumSI
MomentumSI is a leading IT services and solutions company focused on enterprise transformation. It helps organizations quickly and cost-efficiently adopt innovative, agile practices to align business needs with IT processes. MomentumSI specializes in helping companies incorporate disruptive technologies, including Cloud Computing, DevOps, BPM and SOA. Industries served include financial services, insurance, healthcare, pharmaceuticals, high-tech, retail and manufacturing. Founded in 1997, MomentumSI is a privately held company that operates globally, with headquarters in Austin, Texas and offices in San Francisco, Washington D.C., New York and Sydney. For more information, contact sales@momentumsi.com, call 1-888-886-8560, or visit our website at http://www.MomentumSI.com.

MomentumSI and Tough are brand names of Momentum Software, Inc. All other brand names and product names are trademarks or registered trademarks of their respective companies.
