
IS470 Guided Research in Information Systems

Distributed Data Analysis Using Map-Reduce

Student: Shahfik Amasha (shahfik.2006@smu.edu.sg)
Faculty Supervisor: Asst Prof Jason Woodard (jwoodard@smu.edu.sg)
Abstract: Hadoop clusters are relatively easy to set up and manage, making them an invaluable tool for researchers who need to get up and running quickly on crunching large datasets. The Hadoop benchmarks show a linear trend in processing time across different data sizes for a fixed number of nodes, and a leveling off in throughput performance. This is useful for deciding the optimum number of nodes to provision in a cluster. Finally, there is huge potential for extending the project into interactive use cases and real-time data analytics.

1 Introduction
Increasingly, businesses are collecting data about their operations, often without even being aware of it. From customer sales data to employee performance data, the typical business is not taking full advantage of the deep insights that this data can provide. There is a silver lining, however, as businesses have recently begun to recognize the value of the data that has been sitting within their organizations. IDC estimates that the worldwide market for business analytics was worth $25 billion in 2009, a growth of 4 percent over 2008 (IBM, 2009). IT service providers attuned to their customers' needs are picking up on the industry's growing interest by providing solutions for the various functional areas of a business. Companies are applying analytics in functions such as supply chain, customer relationship management and pricing. Amazon is a good example of a company that has leveraged the power of data analytics to provide customized recommendations to its customers.

In the recent Global CIO Study 2009 by IBM, a survey of global CIOs, eighty-three percent of respondents identified business intelligence and analytics as a competitive advantage for their organizations. In addition, IBM CIO Pat Toole commented that CIOs are investing in business analytics capabilities to improve decision making, and that these capabilities can be key to entering new growth markets (IBM, 2009). In The McKinsey Quarterly (2007), the consulting firm identified analytics as one of the eight business technology trends to watch in the next decade. Davenport (2006) highlighted that as firms in many industries offer similar products using comparable technologies, business processes are among the last remaining points of differentiation, and analytics allows competitors to extract the most value from those processes and make the best decisions at every level of the firm. Now that businesses are waking up to this new reality, they are looking for solutions that could help them build this capability. MapReduce is one framework for doing analytics on large amounts of data, and Hadoop is an implementation of that framework. Other solutions include building data warehouses and implementing database clusters, both of which require a large capital investment and high operational costs. In this paper, we use Hadoop, an Apache Foundation project written in Java, for doing data analytics on a large dataset.

This paper builds on a collaboration started two years ago between the Singapore Management University School of Information Systems (SMU SIS) and a large taxi company. The goal was to use data analytics to gain better insights into the dynamics of a taxi network from the GPS location data that the company provides. Leveraging the ongoing efforts with the company, this project aims to use the MapReduce model to accelerate the study of these dynamics. The current process for discovering them is to query a large database, which is extremely slow due to the size of the dataset. The dataset stands at a couple of hundred gigabytes at the moment, and a typical query covers only a day's worth of data yet takes about half a day to complete.

MapReduce is one method for distributed data analysis, first made popular through its use by Google for crawling and indexing the web. In 2004, Google published a paper titled "MapReduce: Simplified Data Processing on Large Clusters", which sparked off a series of events that led to the creation of Hadoop. Hadoop forms the basis of the data analytics used in this project.

2 Hadoop
Hadoop has saved tremendous amounts of time and money in several documented use cases. The New York Times used 100 machines on the Amazon Elastic Compute Cloud to convert 4 terabytes of scanned archives into PDFs within 24 hours. In a contest of speed, Hadoop broke the world record for the fastest sort of a terabyte of data with a time of 209 seconds in 2008. In the following year, it took just 62 seconds to perform the same feat.

The Hadoop project consists of various sub-projects (Figure 1), each with a specific use of MapReduce. The use cases introduced later make use of the Hadoop Core sub-project, which allows a developer to create programs that do data crunching on the Hadoop cluster.

Figure 1: Hadoop sub-projects. Source: Hadoop: The Definitive Guide

The MapReduce API (Figure 2) sets up the mapping and reduction process for the developer. All the developer has to do is write code within the map and reduce functions to determine the operations that will be executed on the incoming data. This will become clearer in the use cases later.

Figure 2: The MapReduce API. Source: Hadoop: The Definitive Guide
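In essence, the developer supplies only two functions and the framework handles the rest. The shape of the API can be illustrated with a minimal, framework-free sketch (plain Python rather than Hadoop's Java API; `run_job` is an invented stand-in for the map, shuffle and reduce work Hadoop performs for you):

```python
from collections import defaultdict

def map_fn(_, line):
    """Emit one (word, 1) pair per word, in the style of a Hadoop Mapper."""
    for word in line.split():
        yield word, 1

def reduce_fn(key, values):
    """Sum the counts for one key, in the style of a Hadoop Reducer."""
    yield key, sum(values)

def run_job(records):
    """Simulate the map -> shuffle -> reduce pipeline in memory."""
    intermediate = defaultdict(list)
    for offset, line in enumerate(records):
        for k, v in map_fn(offset, line):
            intermediate[k].append(v)  # shuffle: group values by key
    return dict(kv for key in sorted(intermediate)
                for kv in reduce_fn(key, intermediate[key]))

print(run_job(["taxi free taxi", "free taxi"]))
```

Only `map_fn` and `reduce_fn` would be written by the developer; everything in `run_job` is what Hadoop distributes across the cluster.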


Once a program is developed and the dataset uploaded to the HDFS shared filesystem, the job is submitted to Hadoop (Figure 3), which manages its distribution across the cluster. In the diagram below, the client node is the developer's machine, the jobtracker node is the master node in the Hadoop cluster, and the tasktracker nodes are the slave nodes; a Hadoop cluster can have one or more slave nodes. Each tasktracker continuously sends heartbeats to the jobtracker to signal that it is still alive, and the jobtracker automatically stops assigning tasks to a tasktracker after a user-defined period of inactivity. While a tasktracker is working on a job, its heartbeats carry progress information. When a job finishes, the client is informed and the client JVM exits.

Figure 3: Job submission. Source: Hadoop: The Definitive Guide
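The heartbeat bookkeeping described above can be sketched as a simplified model (illustrative Python, not Hadoop's actual implementation; the `timeout_secs` value is an assumed stand-in for the user-defined inactivity period):

```python
class JobTracker:
    """Minimal model of the jobtracker's tasktracker-liveness bookkeeping."""

    def __init__(self, timeout_secs=600):  # assumed inactivity period
        self.timeout = timeout_secs
        self.last_heartbeat = {}  # tasktracker name -> last time seen

    def heartbeat(self, tracker, now):
        """A tasktracker reports in (its heartbeats also carry progress)."""
        self.last_heartbeat[tracker] = now

    def assignable(self, tracker, now):
        """Only trackers heard from within the timeout get new tasks."""
        last = self.last_heartbeat.get(tracker)
        return last is not None and now - last <= self.timeout

jt = JobTracker(timeout_secs=600)
jt.heartbeat("tt1", now=0)
print(jt.assignable("tt1", now=300))   # heartbeat 300 s ago: still assignable
print(jt.assignable("tt1", now=900.5)) # silent past the timeout: not assignable
```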

Hadoop provides an administration interface (Figure 4) to query the status and details of a job. The detailed view includes various performance indicators, such as bytes read from the HDFS shared filesystem. It also keeps a history of all the jobs that have been submitted previously.
Figure 4: The admin interface

3 Use Cases
Two use cases are presented. The first generates secondary data from the primary data source, and the second generates a frequency analysis of the GPS locations. These two use cases were chosen because their results require a sweep of the entire dataset, something Hadoop is particularly good at. This raises the question of when to use a traditional RDBMS and when to use MapReduce. Below is a summary of the characteristics of each approach.

            Traditional RDBMS             MapReduce
Data Size   Gigabytes                     Petabytes
Access      Interactive and batch         Batch
Updates     Read and write many times     Write once, read many times
Structure   Static schema                 Dynamic schema
Integrity   High                          Low
Scaling     Nonlinear                     Linear
Table 1: Comparison between RDBMS and MapReduce. Source: Hadoop: The Definitive Guide

In both of these use cases, we simply need to output a single file containing the results, which fits the write-once, read-many update characteristic. As described later, the raw data is cleaned and processed before being operated on, though this is not strictly necessary, since Hadoop supports a flexible dynamic schema and operates best on plain text files.

3.1 The dataset


As mentioned in the introduction, faculty at the SMU School of Information Systems (SIS) have been collaborating with the taxi company to analyze GPS traces and trip data from their fleet of about 15,000 taxis. This effort has resulted in a dataset of over 4 billion GPS observations from 150 million trips (about 300 GB in uncompressed form). A major bottleneck in the analysis is the time required to run algorithms over the entire dataset, which can take weeks on a single machine. As a result, published results have been limited to analyzing a day's or a week's worth of data at a time. We therefore explored a distributed systems approach to break this bottleneck, and chose Hadoop.
In the diagram below, Hermes-1 is a single machine and Beowulf-5 is a 6-node cluster (1 master and 5 slaves). They represent the resources available to this paper within the university; the use cases discussed here are executed on Beowulf-5. Cirrus-x is part of the Open Cirrus Cloud Computing Testbed initiative by the Infocomm Development Authority of Singapore (IDA), and Cirrus-15 and Cirrus-60 are theoretical projections for 15-node and 60-node clusters respectively. Distributing the computation using Hadoop should allow the analysis of larger subsets of the data in less time, potentially even the full year's worth of data in hours, or a day's worth in seconds.

Figure 5: Time required versus scale of analysis that can be done

The raw dataset provided by the taxi company is in the following format:

Date time, vehicle no., driver ID, long, lat, speed, status
E.g.:
01/03/2009 00:00:00, SH1234S, 1809481,103.94063, 1.32617, 0, PAYMENT

This dataset has been cleaned for errors and processed for anonymity; that preprocessing was done beforehand and is not part of this paper. The final records that the use cases operate on are as follows:

LogSerialNo,Datetime,vehicleID,driverID,long,lat,speed,status,week,DayOfWeek,day,hour
E.g.:
20090301000000000,2009/03/01 00:00:00,454,1809481,103.94063,1.32617,0,3,1520,0,01,00
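To make the record layout concrete, a line can be split into named fields like so (a hypothetical helper, not part of the project's actual code; the field names simply follow the layout above):

```python
# Field names in the order they appear in a cleaned record line.
FIELDS = ["LogSerialNo", "Datetime", "vehicleID", "driverID",
          "long", "lat", "speed", "status", "week", "DayOfWeek", "day", "hour"]

def parse_record(line):
    """Split one comma-separated log line into a field-name -> value dict."""
    values = line.strip().split(",")
    if len(values) != len(FIELDS):
        raise ValueError("malformed record: %r" % line)
    return dict(zip(FIELDS, values))

rec = parse_record(
    "20090301000000000,2009/03/01 00:00:00,454,1809481,"
    "103.94063,1.32617,0,3,1520,0,01,00")
print(rec["vehicleID"], rec["long"], rec["lat"])
```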

The hardware configuration is the same for each node in the Beowulf-5 cluster, as follows:

Item Description
Processor 2 x AMD Opteron 250
RAM 4 GB

3.2 Use Case 1


The first use case generates a file showing the start and end times of a particular taxi in a particular state. The algorithm is illustrated as follows:

Figure 6: Use case 1 algorithm

An input split is a section of the original dataset; in this use case, splits are 64 MB in size. Tasktrackers are fed splits, and the developer's program is executed on them.

1. Mapper. Each line in the file is read and presented as a <key, value> pair to the Mapper; here the key is the byte position of the start of the line and the value is the line itself. We parse the line, extract the vehicle ID and log serial number, and combine the two into a custom key. A <VehicleSnPair, value> pair is then written out, where the value is the original line.
2. The GroupComparator groups the records by vehicle ID. The output is the same key-value pair as above.
3. The SortComparator then sorts by log serial number within each vehicle ID. The output is the same key-value pair as above.
4. The output from the SortComparator would normally go straight to the Reducer, but in this use case a custom Partitioner is implemented to determine which machine each vehicle ID goes to.
5. The Reducer writes the relevant information to the output.
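Outside Hadoop, the grouping and sorting behaviour of steps 2 and 3 can be simulated in a few lines (an illustrative Python sketch of the composite-key "secondary sort" pattern; the tuples stand in for the custom VehicleSnPair key, and the sample data is invented):

```python
from itertools import groupby

def group_and_sort(records):
    """Return {vehicleID: [lines in log-serial-number order]}.

    Mimics the GroupComparator (group by vehicle ID) and the
    SortComparator (order by serial number within a vehicle).
    """
    ordered = sorted(records, key=lambda kv: (kv[0][0], kv[0][1]))
    return {vehicle: [line for _, line in grp]
            for vehicle, grp in groupby(ordered, key=lambda kv: kv[0][0])}

# Each intermediate record is keyed by a (vehicleID, logSerialNo) pair,
# mirroring the custom key written by the Mapper.
records = [
    (("SH1", 3), "line-c"),
    (("SH2", 1), "line-d"),
    (("SH1", 1), "line-a"),
    (("SH1", 2), "line-b"),
]

# Each vehicle's lines reach the "Reducer" in serial-number order.
print(group_and_sort(records))
```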

The algorithm is executed on input files of varying sizes, from 16 MB to 30 GB. The results are shown in two separate graphs below, from 16 MB to 1 GB (Figure 7) and from 1 GB to 30 GB (Figure 8). In Figure 7, there is an increasing marginal return as the file size grows from 16 MB to 64 MB, which is due to the 64 MB file split size. Beyond 64 MB, the time taken grows roughly linearly with file size, which is also observed in Figure 8. The jobs are executed on a 5-slave-node configuration (noted as n=5), and each job is executed three times with the average time taken.

[Figure 7 chart: "Avg time, n=5, size=16MB to 1024MB"; average completion times of 29.00, 37.67, 44.00, 45.00, 48.67, 61.00, 87.00 and 98.67 seconds across the file-size range.]
Figure 7: Average job completion time on 5 slave nodes for file sizes between 16MB to 1024MB
[Figure 8 chart: "Avg time, n=5, size=1GB to 30GB"; average completion times of 98.67, 325.00, 671.67, 1,022.33, 1,314.67, 1,717.00 and 2,792.67 seconds across the file-size range.]
Figure 8: Average job completion time on 5 slave nodes for file sizes between 1GB to 30GB

Figures 9 and 10 show the same job executed on 3 and 4 slave nodes, to understand how the number of nodes affects job completion time. Figure 9 shows the overall trend, which is relatively linear, while Figure 10 shows a smaller range of file sizes. There is no significant change in the pattern, other than the increased time taken with fewer slave nodes.

[Figure 9 chart: "Avg time, n={3,4,5}, size=16MB to 8096MB"; completion time in seconds versus file size in MB, one series per cluster size.]

Figure 9: Average job completion time over 3, 4 and 5 nodes for file sizes 16MB to 8096MB
[Figure 10 chart: "Avg time, n={3,4,5}, size=16MB to 1024MB"; completion time in seconds versus file size in MB, one series per cluster size.]

Figure 10: Average job completion time over 3, 4 and 5 nodes for file sizes 16MB to 1024MB

From the above graphs we note increasing marginal returns as file size increases. The point at which the increase stops tells us the optimum file size to work on: Figure 11 shows that from a file size of 4096 MB onwards, throughput is constant and there is no further gain from larger files. The results are limited, however, as we would normally expect to find a point of diminishing marginal returns. Given more time, that point could be located, though Figure 8 does show that performance scales almost linearly up to 30 GB.
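The throughput figures are simply file size divided by completion time. As a back-of-envelope check against the measured five-node times (pairing 98.67 s with the 1024 MB run from Figure 7, and, as an assumption, the largest time of 2,792.67 s with the 30 GB endpoint of Figure 8):

```python
# (size in MB, average completion time in seconds) on 5 slave nodes.
runs = [
    (1024, 98.67),          # 1 GB run, from Figure 7
    (30 * 1024, 2792.67),   # assumed: the 30 GB endpoint in Figure 8
]

for size_mb, secs in runs:
    # Both work out to roughly 10-11 MB/s, consistent with the
    # throughput plateau seen in Figure 11.
    print("%6d MB / %8.2f s = %5.2f MB/s" % (size_mb, secs, size_mb / secs))
```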

MB/s, n={3,4,5}, size=16MB to 8096MB


14.00
12.00
10.00
8.00
MB/s

6.00
4.00
2.00
0.00
0 1000 2000 3000 4000 5000 6000 7000 8000 9000
Size in MB

n(5) n(4) n(3)

Figure 11: Throughput over 3, 4 and 5 nodes for file sizes 16MB to 8096MB

As the scale of analysis required for gathering insights into the taxi data increases, Hadoop has been shown to scale linearly up to a month's worth of data. Further experiments are needed to determine whether Hadoop can scale to a year's worth of data, because patterns observed over that time period would prove useful for answering research questions.
3.3 Use Case 2
The second use case uses the GPS data to plot a colored frequency map of Singapore. The taxi company has divided Singapore into 86 zones, most of them bordering neighborhood estates. The colored maps help identify the zones with the highest frequency, where the color represents the number of times a particular record appears in the log file. The following illustration (Figure 12) shows a representation of the algorithm.

Figure 12: Use case 2 algorithm

Much simpler than the previous use case, this one consists of only a Mapper and a Reducer, though it additionally uses the Java Topology Suite (JTS) for simple GIS queries.

1. Mapper. It goes through each zone to check whether the current record falls inside it. The zone array consists of polygon definitions for each of the 86 zones (shown in Figure 13). If a matching zone is found, the Mapper writes a <key, value> pair where the key is the zone number; if no zone is found, possibly due to anomalous data, the key is -1.
2. Reducer. It simply counts the number of records for each zone.
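The Mapper's zone lookup can be sketched with a plain ray-casting point-in-polygon test (pure Python standing in for the Java Topology Suite calls; the two rectangular zones are invented for illustration and are not the real 86-zone definitions):

```python
def in_polygon(lon, lat, polygon):
    """Ray-casting test: is the point (lon, lat) inside the polygon?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > lat) != (y2 > lat):  # edge crosses the point's latitude
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

def zone_of(lon, lat, zones):
    """Mapper-style lookup: return the first matching zone number, or -1."""
    for zone_id, poly in zones.items():
        if in_polygon(lon, lat, poly):
            return zone_id
    return -1  # anomalous point: outside every zone

# Two invented rectangular zones, as (longitude, latitude) vertex lists.
ZONES = {
    1: [(103.90, 1.30), (103.95, 1.30), (103.95, 1.35), (103.90, 1.35)],
    2: [(103.95, 1.30), (104.00, 1.30), (104.00, 1.35), (103.95, 1.35)],
}

print(zone_of(103.94063, 1.32617, ZONES))  # the sample GPS fix: zone 1
print(zone_of(0.0, 0.0, ZONES))            # anomalous data: -1
```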

The output file is then fed into a JSP web application where it is parsed and colored using the Google Maps API (Figure 14).

Figure 13: Map of Singapore divided into the 86 zones

Figure 14: Map of Singapore with colors representing the frequency

This use case displays the potential for using Hadoop not simply as an end in itself, but also as a means to an end, where that end could be the visualization of data. Colored maps showing historical passenger demand could prove useful to taxi drivers, as they could influence behavior and thus increase the overall efficiency of the taxi system.

4 Alternatives
The alternatives to Hadoop include parallel databases such as those by Oracle and IBM. Increasingly, column-oriented databases are an excellent alternative for similar workload profiles, and various research papers have been published comparing these alternatives. Other MapReduce solutions are under research, such as Dryad by Microsoft Research and Clustera by the University of Wisconsin-Madison; these frameworks are still under heavy development.

5 Related Work
Prior work on MapReduce and Hadoop has mostly focused on its inner mechanisms, such as schedulers, performance debugging and file systems. In "Improving MapReduce Performance in Heterogeneous Environments", Zaharia, Konwinski, Joseph, Katz, & Stoica (2008) look at improving the scheduler for heterogeneous compute environments.

As mentioned in the previous section on comparisons between MapReduce and its alternatives, Pavlo, et al. (2009) compared two approaches to large-scale data analysis, the MapReduce model and parallel databases, testing Hadoop against two parallel DBMSs: Vertica and another system from a major relational database vendor.

Other work studying the use of Hadoop on real datasets includes Loebman, et al. (2009), who compared Hadoop and a commercial relational database for astrophysical simulations, and Cary, et al. (2009), who explored the use of MapReduce for solving spatial problems.

6 Conclusion
With the results from the use cases above, the taxi project now has the capability to analyze data spanning multiple months or years and receive results much more quickly than before. This saves faculty members' time by answering their questions faster, and could result in better-quality research and better funding. In addition, the findings and efficiencies could flow back to the taxi company and its drivers, reaping large economic and environmental benefits such as reduced fuel costs, increased driver revenue and lower carbon emissions.

The introduction made extensive reference to the business use of analytics and how it creates a new dimension of competition. Hadoop has democratized data analytics, but there is still some way to go before it is easy enough for businesses to take full advantage of its capabilities. With heavy development still continuing, there is certainly potential for Hadoop to be improved for business analytics.

7 Future Work
We have seen in the first use case that scaling with dataset size is largely linear past the input split size. But where does it end? Storage space limitations on the Beowulf-5 cluster forced us to stop at the 30 GB file size; with the dataset at a couple of hundred gigabytes, there is enough data to experiment on if those limitations are eased. One ongoing effort is to obtain approval to use the Open Cirrus cloud computing testbed for this project, enabling faster and larger analyses.

The second use case provides a glimpse of what could become an interactive visual representation of the data on a web front end, driven by a Hadoop back-end. Even though the typical job is batch in nature, higher-level projects such as HBase, Pig and Hive give developers the ability to drive real-time analytics while making their lives easier.

Bibliography
1. Abadi, D. J., Madden, S. R., & Hachem, N. (2008). Column-stores vs. row-stores: how different
are they really? SIGMOD '08: Proceedings of the 2008 ACM SIGMOD international conference on
Management of data, (pp. 967-980).
2. Cary, A., Sun, Z., Hristidis, V., & Rishe, N. (2009). Experiences on Processing Spatial Data with MapReduce. 21st International Conference on Scientific and Statistical Database Management (SSDBM), Lecture Notes in Computer Science (pp. 302-319). Springer.
3. Davenport, T. H. (2006). Competing on Analytics. Harvard Business Review , 84 (1), p98-107.
4. IBM. (2009, July 28). IBM to Acquire SPSS Inc. to Provide Clients Predictive Analytics Capabilities.
Retrieved Nov 2009, from IBM: http://www-03.ibm.com/press/us/en/pressrelease/27936.wss
5. IBM. (2009, Sep 10). New IBM Study Highlights Analytics As Top Priority For Today's CIO.
Retrieved November 2009, from IBM: http://www-
03.ibm.com/press/us/en/pressrelease/28314.wss
6. Loebman, S., Nunley, D., Kwon, Y., Howe, B., Balazinska, M., & Gardner., J. P. (2009). Analyzing
Massive Astrophysical Datasets: Can Pig/Hadoop or a Relational DBMS Help? Workshop on
Interfaces and Architectures for Scientific Data Storage.
7. Manyika, J. M., Roberts, R. P., & Sprague, K. L. (2007). Eight Business Technology Trends to
Watch. The McKinsey Quarterly , 1-11.
8. Pavlo, A., Paulson, E., Rasin, A., Abadi, D. J., DeWitt, D. J., Madden, S., et al. (2009). A
Comparison of Approaches to Large-Scale Data Analysis. SIGMOD '09: Proceedings of the 35th
SIGMOD international conference on Management of data , (pp. 165-178). New York.
9. Stonebraker, M., Abadi, D. J., Batkin, A., Chen, X., Cherniack, M., Ferreira, M., et al. (2005). C-Store: A Column-oriented DBMS. VLDB '05: Proceedings of the 31st international conference on Very Large Data Bases (pp. 553-564). VLDB Endowment.
10. Zaharia, M., Konwinski, A., Joseph, A. D., Katz, R., & Stoica, I. (2008). Improving MapReduce Performance in Heterogeneous Environments. 8th Symposium on Operating Systems Design and Implementation (pp. 29-42). USENIX Association.