
Hadoop on Windows Azure: Hive vs. JavaScript for Processing Big Data

For some time Microsoft didn't offer a solution for processing big data in cloud environments.
SQL Server is good for storage, but its ability to analyze terabytes of data is limited. Hadoop,
which was designed for this purpose, is written in Java and was not available to .NET developers.
So, Microsoft launched the Hadoop on Windows Azure service to make it possible to distribute
the load and speed up big data computations.
But it is hard to find guides explaining how to work with Hadoop on Windows Azure, so here we
present an overview of two out-of-the-box ways of processing big data with Hadoop on Windows
Azure and compare their performance.

By Sergey Klimov and Andrei Paleyes, Senior R&D Engineers at Altoros.

Contents
1. Introduction
2. Testing environment
3. The results of the research
3.1. Test results for an 8 MB HDFS block
3.2. Test results for a 64 MB HDFS block
3.3. Dependency between the block size and the number of Map tasks
3.4. Dependency between performance and the number of MapReduce tasks
3.5. Dependency between performance and the type of a query
4. Conclusion
5. About the authors:

Altoros Systems

1. Introduction
When the R&D department at Altoros Systems, Inc. started this research, we only had access to
a community technology preview (CTP) release of Apache Hadoop-based Service on Windows
Azure. To connect to the service, Microsoft provides a Web panel and Remote Desktop
Connection. We analyzed two ways of querying with Hadoop that were available from the Web
panel: HiveQL querying and a JavaScript implementation of MapReduce jobs.
A test data set was generated based on US Air Carrier Flight Delays information downloaded
from Windows Azure Marketplace. It was used to test how the system would handle big data.
We created the following eight types of queries in both languages and measured how fast they
were processed:

1. Count the number of flight delays by year.
2. Count the number of flight delays and display information by year and month.
3. Count the number of flight delays and display information by year, month, and day of month.
4. Calculate the average flight delay time by year.
5. Calculate the average flight delay and display information by year and month.
6. Calculate the average flight delay time and display information by year, month, and day of month.
7. Count the number of delays by the origin airport.
8. Count the number of delays by the destination airport.
This analysis presents the performance test results and shows how the throughput varies
depending on the block size. The research contains two tables and three diagrams that
demonstrate the findings.
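
To make the query types concrete, here is a minimal Python sketch of the first and fourth queries (count and average of delays by year) written as map and reduce steps. It is not the HiveQL or JavaScript code used in the research; the column layout (year, month, day of month, delay in minutes) and the sample rows are assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical CSV rows: year,month,day_of_month,delay_minutes
# (an assumed layout, not the real structure of the data set).
SAMPLE_ROWS = [
    "2010,1,15,12",
    "2010,1,16,30",
    "2010,2,3,18",
    "2011,7,4,45",
]


def map_delay_by_year(line):
    """Map step: emit a (year, delay_minutes) pair for one CSV line."""
    year, _month, _day, delay = line.split(",")
    return year, float(delay)


def shuffle(pairs):
    """Group mapped values by key, as the framework does between Map and Reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_count(values):
    """Reduce step for query 1: number of delays per key."""
    return len(values)


def reduce_average(values):
    """Reduce step for query 4: average delay per key."""
    return sum(values) / len(values)


groups = shuffle(map_delay_by_year(line) for line in SAMPLE_ROWS)
counts = {year: reduce_count(vals) for year, vals in groups.items()}
averages = {year: reduce_average(vals) for year, vals in groups.items()}
print(counts)
print(averages)
```

In this model, grouping by year and month (queries 2 and 5) would only change the emitted key to a (year, month) pair; the reduce steps stay the same.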

2. Testing environment
As a testing environment, we used a small Windows Azure cluster of three nodes. The capacity of
its physical CPU was divided among the three virtual machines that served as nodes. This could
introduce some errors into performance measurements, so we launched each query several times
and used the average value for our benchmark. The data we used for the tests consisted of five
CSV files of 1.83 GB each. In total, we processed 9.15 GB of data. The replication factor was
equal to three, which means that each data set had a synchronized replica on each node in the
cluster.
The speed of data processing varied depending on the block size; therefore, we compared
results achieved with 8 MB, 64 MB, and 256 MB blocks.

3. The results of the research


3.1. Test results for an 8 MB HDFS block
Total MapReduce CPU time spent (min:sec.msec) and number of Map/Reduce tasks

Query type | CPU time, HQL | CPU time, JavaScript | Map/Reduce tasks, HQL | Map/Reduce tasks, JavaScript
1. Count the number of flight delays by year | 7:21.635 | 106:33.722 | 38/10 | 1170/1
2. Count the number of flight delays and display information by year and month | 7:56.113 | 111:13.209 | 38/10 | 1170/1
3. Count the number of flight delays and display information by year, month, and day of month | 9:27.940 | 115:59.188 | 38/10 | 1170/1
4. Calculate the average flight delay time by year | 12:41.158 | 99:4.989 | 38/10 | 1170/1
5. Calculate the average flight delay and display information by year and month | 13:33.45 | 103:54.367 | 38/10 | 1170/1
6. Calculate the average flight delay time and display information by year, month, and day of month | 14:48.310 | 110:18.658 | 38/10 | 1170/1
7. Count the number of delays by the origin airport | 7:26.364 | 106:43.959 | 38/10 | 1170/1
8. Count the number of delays by the destination airport | 7:26.57 | 106:50.534 | 38/10 | 1170/1

Table 1

Brief summary
As can be seen from the table, the time spent on processing JavaScript queries didn't vary
significantly and ranged from 99 to 116 minutes. Although Hive queries were processed 7-15 times
faster than JavaScript implementations, their computation speed differed greatly depending on the
type of the query. We'll expand on this dependency in Figure 3, "Dependency between performance
and the type of a query."
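
The "7-15 times faster" figure can be checked directly against Table 1. The short Python sketch below converts the reported times to seconds and divides the JavaScript time by the Hive time for each query. It assumes that the part after the dot in min:sec.msec is a whole number of milliseconds (so "13:33.45" is read as 33 seconds and 45 milliseconds); this reading is an interpretation of the table, not something the source states.

```python
def to_seconds(stamp):
    """Convert 'min:sec.msec' to seconds.

    Assumption: the last field is a whole number of milliseconds.
    """
    minutes, rest = stamp.split(":")
    seconds, millis = rest.split(".")
    return int(minutes) * 60 + int(seconds) + int(millis) / 1000.0


# (HQL, JavaScript) CPU times for queries 1-8, copied from Table 1.
TABLE_1 = [
    ("7:21.635", "106:33.722"),
    ("7:56.113", "111:13.209"),
    ("9:27.940", "115:59.188"),
    ("12:41.158", "99:4.989"),
    ("13:33.45", "103:54.367"),
    ("14:48.310", "110:18.658"),
    ("7:26.364", "106:43.959"),
    ("7:26.57", "106:50.534"),
]

# Ratio of JavaScript CPU time to Hive CPU time for each query.
speedups = [to_seconds(js) / to_seconds(hql) for hql, js in TABLE_1]
for number, ratio in enumerate(speedups, start=1):
    print(f"Query {number}: Hive used {ratio:.1f}x less CPU time")
```

Under this reading, every ratio falls between 7 and 15, with the counting queries near the top of the range and the averaging queries near the bottom.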

3.2. Test results for a 64 MB HDFS block


Total MapReduce CPU time spent (min:sec.msec) and number of Map/Reduce tasks

Query type | CPU time, HQL | CPU time, JavaScript | Map/Reduce tasks, HQL | Map/Reduce tasks, JavaScript
1. Count the number of flight delays by year | 7:0.277 | 50:29.8 | 37/10 | 150/1
2. Count the number of flight delays and display information by year and month | 7:40.574 | 52:2.86 | 37/10 | 150/1
3. Count the number of flight delays and display information by year, month, and day of month | 9:9.143 | 55:56.7 | 37/10 | 150/1
4. Calculate the average flight delay time by year | 12:45.775 | 47:40.880 | 37/10 | 150/1
5. Calculate the average flight delay and display information by year and month | 13:21.515 | 50:54.123 | 37/10 | 150/1
6. Calculate the average flight delay time and display information by year, month, and day of month | 14:35.23 | 53:55.645 | 37/10 | 150/1
7. Count the number of delays by the origin airport | 7:17.265 | 49:54.347 | 37/10 | 150/1
8. Count the number of delays by the destination airport | 7:11.670 | 49:15.914 | 37/10 | 150/1

Table 2

Brief summary
As you can see, it took about seven minutes to process the first query created with Hive, while
processing the same query based on JavaScript took 50 minutes and 29 seconds. The rest of the
Hive queries were also processed several times faster than the queries based on JavaScript.
To provide a more detailed picture, we indicated the total number of Map and Reduce tasks in
both tables. The test results showed that the HDFS block size and the number of MapReduce tasks
affect computation speed. As you can see in Table 1, the Hive query generated 38 Map tasks and 10
Reduce tasks, but for the JavaScript implementations this ratio was 1170 Map tasks to one Reduce
task. Similar results were achieved for the 64 MB HDFS block. As you can see in Table 2, the
first Hive query produced 37 Map tasks and 10 Reduce tasks. The JavaScript query generated
150 Map tasks and one Reduce task.
This can be explained by the fact that JavaScript is not a native language for Hadoop, which was
written in Java. Hive features a task manager that analyzes the load, divides the data set into a
number of Map and Reduce tasks, and chooses a certain ratio of Map tasks to Reduce tasks to
ensure the fastest computation speed. Unfortunately, it is not yet clear how to optimize the
JavaScript code and configure the task manager so that it uses available resources more
efficiently.
The results of the JavaScript query are written to the outputFile of the runJs command (codeFile,
inputFiles, outputFile) using a single Reduce task, which may be a reason for such a great
difference in performance.
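
Why a single Reduce task can become a bottleneck is easier to see with a simplified model of how MapReduce distributes keys among reducers. The Python sketch below is an illustration under stated assumptions, not Hadoop's actual partitioner: it uses a CRC32 hash modulo the number of reducers (Hadoop's default HashPartitioner follows the same hash-modulo idea, using the key's Java hash code), and the year keys are made up.

```python
import zlib
from collections import Counter


def partition(key, num_reducers):
    """Pick a reducer for a key: stable hash modulo the reducer count (simplified model)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def reducer_load(keys, num_reducers):
    """Count how many distinct keys each reducer would receive."""
    return Counter(partition(key, num_reducers) for key in keys)


# Hypothetical grouping keys: one per year.
keys = [str(year) for year in range(1988, 2009)]

single = reducer_load(keys, 1)   # JavaScript jobs in the tests ran 1 Reduce task
spread = reducer_load(keys, 10)  # Hive jobs in the tests ran 10 Reduce tasks
print("1 reducer:", dict(single))
print("10 reducers:", dict(spread))
```

With one reducer, every key lands on the same task, so the whole reduce phase runs sequentially on one node. With ten reducers, the keys are spread over several tasks that can run in parallel.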

3.3. Dependency between the block size and the number of Map tasks
We have also analyzed how the size of a block in a distributed file system influenced the number
of Map tasks triggered in Hive and JavaScript queries.

For a 64 MB block, the HQL query ran 37 Map tasks and 10 Reduce tasks. When a JavaScript
query was processed, the task manager divided the total load into 150 Map tasks and a Reduce
task.

Referring to the table, we can conclude that the number of Reduce tasks does not depend on the
block size and is equal to 10 for Hive queries and to 1 for JavaScript queries.
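
The Map task counts reported for the JavaScript jobs are close to what a simple "one Map task per HDFS block" rule would predict. The following Python sketch is a back-of-the-envelope estimate, not a description of how the service actually computes input splits; it assumes that 1 GB equals 1024 MB, that blocks are counted per file, and that the rounded file size of 1.83 GB is accurate enough.

```python
import math

FILE_SIZE_GB = 1.83  # size of each CSV file, from the testing environment section
FILE_COUNT = 5       # number of CSV files in the data set


def estimated_map_tasks(block_size_mb, file_size_gb=FILE_SIZE_GB, file_count=FILE_COUNT):
    """Estimate Map tasks as one task per HDFS block, counted per file."""
    file_size_mb = file_size_gb * 1024  # assumption: 1 GB = 1024 MB
    blocks_per_file = math.ceil(file_size_mb / block_size_mb)
    return blocks_per_file * file_count


for block_size_mb in (8, 64, 256):
    tasks = estimated_map_tasks(block_size_mb)
    print(f"{block_size_mb} MB blocks: about {tasks} Map tasks")
```

For 64 MB blocks the estimate is 150, which matches the reported JavaScript figure, and for 8 MB blocks it is 1175, close to the reported 1170 (the difference is within the rounding of the file size). The Hive figures of 37 and 38 Map tasks do not follow this rule, which is consistent with the observation that Hive chooses the number of tasks on its own.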

3.4. Dependency between performance and the number of MapReduce tasks


We also analyzed how the number of Map and Reduce tasks influenced the speed of processing
Hive and JavaScript queries.

From this diagram, you can see that Hive queries were properly optimized and the block size had
almost no impact on execution time. In JavaScript, on the contrary, the processing speed
depended directly on the number of Map tasks.

3.5. Dependency between performance and the type of a query


Below you can see the diagram that shows how processing speed depends on the query type for
a 64 MB HDFS block.

Within each of the two groups of queries, 1-3 and 4-6, the difference was in the number of
grouping parameters. The first query counted flight delays by year. In the second query, we added
month as a grouping parameter, and in the third, day. The fourth query returned the average
flight delay time by year, which is a different arithmetic operation. The fifth and sixth queries
calculated the average flight delay times by year and month, and by year, month, and day,
respectively.
Judging by the diagram, additional grouping parameters had much greater influence on
JavaScript queries than the performed arithmetic operations. In the case of Hive, such operations
as transforming, converting, and computing data caused the processing speed to degrade
significantly, which can be seen from the difference in processing time between the first and the
fourth queries. The sixth query calculated average values and included three grouping
parameters, which resulted in the slowest processing speed for Hive.
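
These two effects can be quantified from the figures in Table 2. The Python sketch below compares query 1 with query 3 (two extra grouping parameters) and query 1 with query 4 (a count replaced by an average) for both languages. As an assumption, the part after the dot in min:sec.msec is read as a whole number of milliseconds.

```python
def to_seconds(stamp):
    """Convert 'min:sec.msec' to seconds.

    Assumption: the last field is a whole number of milliseconds.
    """
    minutes, rest = stamp.split(":")
    seconds, millis = rest.split(".")
    return int(minutes) * 60 + int(seconds) + int(millis) / 1000.0


# CPU times for queries 1, 3, and 4, copied from Table 2 (64 MB block).
HIVE = {1: "7:0.277", 3: "9:9.143", 4: "12:45.775"}
JAVASCRIPT = {1: "50:29.8", 3: "55:56.7", 4: "47:40.880"}


def percent_change(times, base, other):
    """Relative change in CPU time from query `base` to query `other`, in percent."""
    start = to_seconds(times[base])
    end = to_seconds(times[other])
    return (end - start) / start * 100.0


for name, times in (("Hive", HIVE), ("JavaScript", JAVASCRIPT)):
    grouping = percent_change(times, 1, 3)    # two extra grouping parameters
    arithmetic = percent_change(times, 1, 4)  # count replaced by average
    print(f"{name}: grouping {grouping:+.1f}%, arithmetic {arithmetic:+.1f}%")
```

With these figures, replacing the count with an average increases the Hive CPU time far more than adding grouping parameters does, while for JavaScript the averaging query is even slightly cheaper than the counting one and only the extra grouping parameters add time.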

4. Conclusion
When we started this analysis, only the community technology preview release was available.
Hadoop on Windows Azure had no documentation and there were no manuals showing how to
optimize JavaScript queries. On the other hand, HiveQL had been built on top of Apache Hadoop
long before Microsoft offered their solution. That is why Hive is much faster when performing
basic operations, such as selections, data manipulations (creating, updating, and deleting),
random data sampling, and statistics. However, you would still have to opt for JavaScript for
more complex algorithms, such as data mining or machine learning, since they cannot be
implemented with Hive.
In October 2012 Microsoft released a new CTP version of this service, which is now called
Windows Azure HDInsight. Some of the issues we mentioned before were fixed, since the
improvements included:
updated versions of Hive, Pig, Hadoop Common, etc.
an updated version of the Hive ODBC driver
a local version of the HDInsight community technology preview (for Windows Server)
guides and documentation describing how to use the service

Now Microsoft offers a browser-based console that serves as an interface for running
MapReduce jobs on Azure. The implementation of Hadoop on Windows Azure also simplifies
installation, configuration, and management of a cloud cluster. In addition, the updated platform
can be used with such tools as Excel, PowerPivot, Power View, SQL Server Analysis Services,
and Reporting Services. There is also the ODBC driver that connects Hive to Microsoft's products
and helps to deal with business intelligence tasks. Such a solution, which enables .NET
developers to process huge amounts of data fast, was long awaited.
Although this article describes the out-of-the-box querying methods, the .NET community is
contributing to the .NET SDK for Hadoop. Currently, the 0.1.0.0 version is available to the public
at CodePlex. This library already enables developers to implement MapReduce jobs using any of
the CLI languages (the solution comes with examples written in C# and F#) and provides tools
for building Hive queries using LINQ to Hive.
Therefore, soon .NET developers will be able to build native Hadoop-based applications,
employing other libraries that conform to the Common Language Infrastructure. This SDK will be
an even more efficient tool for in-depth data analysis, data mining, machine learning, and creating
recommendation systems with .NET.

5. About the authors:


Andrei Paleyes has 5+ years of experience in MS .NET-related technologies applied in large-scale
international projects. Having a master's degree in mathematics, he is interested in big data
analysis and implementation of mathematical methods used in data mining. He is a knowledge
discovery enthusiast and has presented a number of sessions on data science at local conferences.
Recently, Andrei participated in architecture development of analytical cloud-based platforms
for genome sequencing and energy consumption solutions.

Sergey Klimov is proficient in developing large-scale applications and corporate systems, as
well as process automation using MS .NET and cloud technologies. He has degrees
in software engineering and technical automatics. Sergey focuses on projects that require
processing large volumes of data using Hadoop and cloud technologies, in particular Windows
Azure.
Altoros is a global software delivery acceleration specialist that provides focused product
engineering to technology companies and start-ups. Founded in 2001 and headquartered in
Silicon Valley (Sunnyvale, California), Altoros has a sales office in Western Massachusetts,
branch offices in Norway, Denmark, Switzerland, and the UK, and software development centers in
Argentina and Eastern Europe (Minsk, Belarus). For more information, please visit
www.altoros.com
