MomentumSI 2011
Table of Contents
Overview
How Much Data is Big Data?
Overview
This paper provides an overview of the exponential growth of data in organizations and the need for businesses to extract value from it. Big Data represents the next phase of data management, both from an opportunity perspective and from a delivery capability perspective. Organizations will be challenged to use new techniques to capture data, analyze it, and mine it for the gems.
Big Databases
Big Data deals with more than just the amount of storage. One of the key considerations of a Big Data database is the ability to on-board data at a much more rapid pace than its predecessors could. This capability is often called streaming because the data arrives in a fluid, often non-stop flow. Consider the case presented by Associate Professor Deb Roy, head of the MIT Media Lab's Cognitive Machines research group:
Roy is recording nearly all of his new son's waking hours in an ambitious attempt to use these data to unravel the mystery of how humans naturally acquire language within the context of their primary social setting. He will pay particular attention to the role of physical and social context in how his son, 9 months old, learns early words and early grammatical constructions. Roy's vast recording and analysis effort, known as The Human Speechome Project (speech + home), will yield some 400,000 hours of audio and video data over three years. Roy and his wife have already gathered more than 300 gigabytes per day of compressed data by recording an average of 12-14 hours a day.
Obviously, capturing so much real-time information at such a rapid pace requires rethinking both the approach and the technologies.
The data storage requirements of the Human Speechome Project present challenges that cannot be easily addressed with conventional storage technologies. Basic requirements include high-performance reads/writes in excess of 160 gigabits/second, massive shared volumes in excess of several hundred terabytes, and smooth scalability from an initial 50 terabytes to capacity well in excess of a petabyte. Additional requirements include 100 percent data redundancy, file access by computers running multiple operating systems, a fully virtualized storage fabric, and affordability using low cost, high capacity SATA hard drives.
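A back-of-the-envelope calculation helps show why the quoted figures land in the hundreds of terabytes. The sketch below takes only the numbers quoted above and adds some simplifying assumptions of our own: a constant 300 GB per day (the quote's lower bound), 365-day years, decimal units, and 13 recording hours per day (the midpoint of the quoted 12-14 hours).

```python
# Rough sizing of the Human Speechome Project data, using figures from the quote.
# Assumptions (not from the source): constant daily volume, 365-day years,
# decimal units (1 TB = 1000 GB), and 13 recording hours per day.

GB_PER_DAY = 300          # "more than 300 gigabytes per day" (treated as exact)
YEARS = 3                 # "over three years"
RECORDING_HOURS = 13      # midpoint of "12-14 hours a day"


def total_terabytes(gb_per_day: float, years: float) -> float:
    """Total compressed volume in terabytes over the whole study."""
    return gb_per_day * 365 * years / 1000


def average_ingest_mbit_per_s(gb_per_day: float, hours_per_day: float) -> float:
    """Average sustained ingest rate in megabits per second while recording."""
    bits_per_day = gb_per_day * 1e9 * 8
    return bits_per_day / (hours_per_day * 3600) / 1e6


if __name__ == "__main__":
    print(f"Total volume: {total_terabytes(GB_PER_DAY, YEARS):.1f} TB")
    print(f"Average ingest: {average_ingest_mbit_per_s(GB_PER_DAY, RECORDING_HOURS):.1f} Mbit/s")
```

Under these assumptions the compressed total comes to roughly 328.5 TB, which is consistent with the requirement for shared volumes of several hundred terabytes.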
This is the new world of Big Data. Individuals and organizations are taking advantage of cheap storage and commodity computing cycles to solve problems that would not have been within their reach only a few years ago. The need to capture information in rapid succession has driven the requirements for Big Data databases. The various needs have driven an array of new databases, many of which do not conform to the classic SQL standard and are dubbed NoSQL.
Big Analytics
Big Analytics refers to the need to analyze our large data stores using new tools and techniques. This kind of analytics is driven by a person using a report-writing tool. The reason for pointing this out is that other types of analytics (like Big Mining) are now being driven by automated routines using machine learning to make recommendations and the like. It wasn't so long ago that running data analysis routines across tens or hundreds of gigabytes was a real challenge. As the size of the data continues to grow, so does the need to use new techniques to solve the problems. Over the past few years we've seen two trends emerge:
Use of proprietary MPP (massively parallel processing) techniques
Use of MapReduce (a standardized MPP technique)
Both solutions are viable techniques for analyzing large datasets, as they focus on decomposing very large problems into smaller units. Recently, however, there has been growing interest in using MapReduce as a common foundation for batch jobs, data analytics, data mining and more. First, it's important to understand the key characteristics of modern data analytics platforms:
Column-Oriented Storage
Huge efficiency gains are available when a hard drive is able to read contiguous data. Many analytics requests touch only a few columns across many records, so storing each column's values together enables this type of read, hence the associated gains. Relational databases have traditionally focused on storing data in rows, for instance:
1, Bill Smith, 9900 Main Street
2, John Carter, 8804 Vine Street
3, Tim Jones, 505 1st Street
In general, row-oriented storage is more commonly found in relational database systems, while column-oriented storage is common in OLAP / data warehousing.
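To make the contrast concrete, the sketch below lays out the three sample records both ways. Plain Python lists stand in for on-disk storage, and the helper names are hypothetical; it illustrates the layout idea only, not how any particular database implements it.

```python
# Row-oriented vs. column-oriented layout for the same three records.
# Plain Python lists stand in for on-disk storage; this is an illustration,
# not how any particular database implements it.

rows = [
    (1, "Bill Smith", "9900 Main Street"),
    (2, "John Carter", "8804 Vine Street"),
    (3, "Tim Jones", "505 1st Street"),
]

# Row-oriented: values of one record sit next to each other.
row_store = [value for record in rows for value in record]

# Column-oriented: values of one column sit next to each other.
ids, names, addresses = (list(column) for column in zip(*rows))
column_store = ids + names + addresses


def names_from_row_store(store, width=3, name_index=1):
    """Pick every name out of the row layout by striding over whole records."""
    return store[name_index::width]


def names_from_column_store(store, n_records=3, name_index=1):
    """Read the names as one contiguous slice of the column layout."""
    start = name_index * n_records
    return store[start:start + n_records]


if __name__ == "__main__":
    print(row_store)
    print(column_store)
    print(names_from_row_store(row_store))
    print(names_from_column_store(column_store))
```

Both helpers return the same names, but the column layout reads them as one contiguous slice, while the row layout has to skip over the other fields of every record. That skipping is the cost an analytics query pays when it needs only one column.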
Hadoop has been instrumental in enabling agile data analysis. In software development, agile practices are associated with faster product cycles, closer interaction between developers and consumers, and testing. Traditional data analysis has been hampered by extremely long turn-around times. If you start a calculation, it might not finish for hours, or even days. But Hadoop (and particularly Elastic MapReduce) make it easy to build clusters that can perform computations on long datasets quickly. Faster computations make it easier to test different assumptions, different datasets, and different algorithms. It's easier to consult with clients to figure out whether you're asking the right questions, and it's possible to pursue intriguing possibilities that you'd otherwise have to drop for lack of time.
An O'Reilly Radar Report: What is Data Science?
Organizations that have committed to the MapReduce and Hadoop model are now interested in using their current clusters to solve Big Analytics problems. Success stories from companies like Yahoo! indicate that the size of the data that can be processed exceeds the petabyte mark, with the ability to run analytics routines on thousands of commodity servers simultaneously. In other words, it's a proven model that has open industry support and a tremendous amount of momentum behind it. For this reason, virtually all of the major analytics vendors have had to readdress their execution model, as customers are questioning the use of specialized hardware, proprietary architectures and, often, exceedingly high maintenance and operations costs.
HP/Vertica have announced the Vertica Connector for Hadoop
EMC/Greenplum have Greenplum MapReduce
IBM has InfoSphere BigInsights (for Hadoop)
Aster Data has Aster Hadoop Data Connector
In addition to the mainstream players, the Apache Software Foundation has also released Hive, an open source analytics tool designed specifically to run on Hadoop. Although early in its maturity cycle, Hive shows significant promise and is likely to spawn a number of complementary or even competitive projects, all aiming to bring a simple but powerful model of Big Analytics to the Hadoop world.
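The decomposition idea behind MapReduce, which underlies the Hadoop-based tools discussed in this section, can be sketched in a few lines. The example below is a single-process word count with hypothetical function names; it is not Hadoop's API. A real framework would run the map and reduce phases in parallel across many machines and handle the grouping step itself.

```python
# A single-process sketch of the MapReduce pattern (word count).
# Hypothetical helper names; real frameworks such as Hadoop distribute the
# map and reduce phases across many machines.
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input split."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values by key, as the framework does between phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence over all input splits."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    splits = ["big data big analytics", "big learning"]
    print(word_count(splits))
```

Because each call to map_phase looks at only one input split and each call to reduce_phase looks at only one key, both phases can be spread over as many workers as there are splits or keys. This is what lets the model scale to large clusters.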
Big Learning
By "Big Learning," we're referring to the use of artificial intelligence techniques (with an emphasis on Machine Learning) on very large data sets. The scenarios for which mining approaches are used vary widely. The techniques are applied to virtually any field (business, engineering, core research, etc.) to identify better ways to solve a problem. Examples include:
People who bought X are likely to also buy Y
The characteristics of fraud are X: Does the transaction look like X? (where X changes)
A tumor looks like X: Is this a picture of a tumor?
Computing power has become significantly cheaper, and the ability to easily store and process data on large clusters is readily available. In addition, artificial intelligence routines have become more mainstream. Once, only the largest companies could afford the kinds of computers necessary to perform these complex calculations, and only the richest companies could afford to hire the data scientists needed to manage the process. Today, most students graduating with a computer science degree have the requisite skills to be productive using off-the-shelf software and a credit card to access an on-line cloud computing cluster. It wasn't so long ago that most companies didn't have a warehousing / analytics group or even a database group. Now, it's more common to see organizations with a data mining or machine learning group. This function has moved from an interesting research area to a core function for driving competitive business advantage.
General-purpose machine learning routines are now becoming more widespread to accommodate the large data sets and need for custom routines. In these cases, companies will often use their existing Hadoop cluster to store and analyze large volumes of data but will apply new machine learning routines against the data. This enables them to share the costs associated with Big Database and Big Analytics, both from an infrastructure and operations perspective.
One of the more interesting projects for applying learning algorithms is Apache Mahout. Like other Big Data solutions, it leverages Hadoop to provide the backbone for analysis.
Currently Mahout supports mainly four use cases:
Recommendation mining takes users' behavior and from that tries to find items users might like.
Clustering takes e.g. text documents and groups them into groups of topically related documents.
Classification learns from existing categorized documents what documents of a specific category look like and is able to assign unlabelled documents to the (hopefully) correct category.
Frequent itemset mining takes a set of item groups (terms in a query session, shopping cart content) and identifies which individual items usually appear together.
Mahout.Apache.Org Website
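The "items that usually appear together" idea behind frequent itemset mining and simple recommendations can be illustrated by counting how often pairs of items share a shopping cart. The sketch below is plain Python rather than Mahout's API, the function names are hypothetical, and the carts are made-up sample data. Production algorithms add support thresholds and scale the counting out across a cluster.

```python
# "People who bought X also bought Y" via pair co-occurrence counts.
# Plain Python, not Mahout's API; the carts are made-up sample data.
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, List, Set, Tuple


def pair_counts(carts: Iterable[Set[str]]) -> Counter:
    """Count how many carts contain each unordered pair of items."""
    counts: Counter = Counter()
    for cart in carts:
        for a, b in combinations(sorted(cart), 2):
            counts[(a, b)] += 1
    return counts


def also_bought(item: str, counts: Counter, top_n: int = 3) -> List[Tuple[str, int]]:
    """Rank the items that most often share a cart with `item`."""
    related: Dict[str, int] = {}
    for (a, b), n in counts.items():
        if a == item:
            related[b] = n
        elif b == item:
            related[a] = n
    # Sort by descending count, then by name so ties are deterministic.
    return sorted(related.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


if __name__ == "__main__":
    carts = [
        {"bread", "butter", "jam"},
        {"bread", "butter"},
        {"bread", "milk"},
        {"butter", "jam"},
    ]
    print(also_bought("bread", pair_counts(carts)))
```

With the sample carts, butter shares a cart with bread twice and so ranks first, ahead of jam and milk with one shared cart each.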
Our Solutions
MomentumSI helps organizations to take advantage of Big Data concepts and to grow their internal capabilities.
About MomentumSI
MomentumSI is a leading IT services and solutions company focused on enterprise transformation. It helps organizations quickly and cost-efficiently adopt innovative, agile practices to align business needs with IT processes. MomentumSI specializes in helping companies incorporate disruptive technologies, including Cloud Computing, DevOps, BPM and SOA. Industries served include financial services, insurance, healthcare, pharmaceuticals, high-tech, retail and manufacturing. Founded in 1997, MomentumSI is a privately held company that operates globally with headquarters in Austin, Texas and offices in San Francisco, Washington D.C., New York and Sydney. For more information, contact sales@momentumsi.com, call 1-888-886-8560, or visit our website at http://www.MomentumSI.com.
MomentumSI and Tough are brand names of Momentum Software, Inc. All other brand names and product names are trademarks or registered trademarks of their respective companies.