III. SAWZALL

In this section, we briefly describe the Sawzall language, based on the literature [14].

A. Overview

Sawzall [14] is a language based on MapReduce, which is said to be used in Google for log analysis. The goal of Sawzall is to make it even easier to write MapReduce programs. The idea is to expose only the map portion to the programmers, providing built-in reducers as a part of the language runtime. The programmers are only responsible for the mappers. This is intended to make it possible for non-expert users to write scripts for themselves.

Sawzall is a typed language. It is specialized for mapping, i.e., it processes all the target entries one by one. It is similar to the AWK language [8], which processes text files line by line. There is, however, a fundamental difference. In AWK, there is a global state that can be shared and modified while each line is processed. In Sawzall, there is nothing shared; each entry is processed in a completely independent way.

Sawzall provides a set of aggregators to do the reducing job. Using the aggregators, programmers are freed from the burden of writing the reduction phase.

Sawzall implicitly assumes that all the structured data on the storage is serialized using Protocol Buffers, described below.

B. Protocol Buffers

Protocol Buffers [5] is a binary data representation of structured data used heavily in Google. It is language neutral and architecture neutral, i.e., binary data serialized with Protocol Buffers on one machine by one program is guaranteed to be readable no matter what machine and language is used by the reader.

C. Using Protocol Buffers in Sawzall

As mentioned above, Sawzall implicitly assumes that all the data in the storage is serialized in Protocol Buffers format. To parse the serialized data, the proto files are required. Sawzall provides the proto statement to refer to the proto files.

proto "p4stat.proto"

With this statement, data structures that correspond to the schema in the proto file are declared, and implicit data conversion functions between the structures and byte arrays are defined, so that the data structures can be used in the Sawzall program.

D. Special Variable input

Sawzall programs receive data through a special variable named input, just like AWK scripts get each input line stored in the special variable $0.

The input variable is typed as a byte array, often storing some data structure serialized with Protocol Buffers. To use the value as structured data, it has to be deserialized. In Sawzall, data type conversion happens implicitly upon assignment. In the following code snippet, the incoming data is deserialized into a data structure typed P4ChangelistStats.

log: P4ChangelistStats = input;

E. Tables and emit

Aggregators appear in Sawzall programs as special language structures called tables, defined with the keyword table. Here is a sample declaration of a table.

submitsthroughweek: table sum[minute: int] of count: int;

Note that sum is a predefined table type, meaning that the table sums up all the data it gets. The declaration above defines an array of sum-type tables of int.
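The implicit byte-array-to-structure conversion described above can be modeled with a small sketch. This is not SawzallClone's actual code; the two-field fixed binary layout stands in for a real Protocol Buffers message, purely to illustrate the eager deserialization that happens on assignment.

```python
import struct

# Hypothetical stand-in for a Protocol Buffers message: a fixed
# binary layout (4-byte change number, 4-byte minute-of-week).
# Illustrates Sawzall's implicit conversion when `input` (a byte
# array) is assigned to a typed variable.
class P4ChangelistStats:
    def __init__(self, raw: bytes):
        # Deserialize eagerly, as Sawzall does on assignment.
        self.change, self.minute = struct.unpack(">II", raw)

# In Sawzall:  log: P4ChangelistStats = input;
input_bytes = struct.pack(">II", 42, 137)
log = P4ChangelistStats(input_bytes)
print(log.change, log.minute)  # -> 42 137
```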
TABLE I
SAWZALL TABLES.

Table name   Description
collection   creates a collection of all the emitted data
maximum      outputs the largest items
sample       statistically samples a specified number of items
sum          sums up all the data
top          outputs the most frequent items
unique       statistically infers the number of unique items

[Figure residue: architecture diagram labels — Sawzall script, Compiler, Compiled Code, Mapper, Reducer, Mapper Library, Aggregators, Main, Hadoop, protoc, Protoc Library, schema in Protocol Buffers, Base64-encoded and transferred in configuration.]

To output to tables, emit statements are used. The data emitted to a table is handled by an aggregator that corresponds to the table type. Table I shows the principal tables in Sawzall.

emit submitsthroughweek[minute] <- 1;
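The table-and-emit model above can be sketched as follows. This is a conceptual model, not SawzallClone's implementation: each table type is an aggregator folding the values emitted into it, with one aggregate per index for indexed tables. Only sum and top are shown.

```python
from collections import Counter, defaultdict

# Conceptual model of Sawzall tables: an aggregator per table type,
# one cell per index. Not the actual SawzallClone aggregator code.
class SumTable:
    def __init__(self):
        self.cells = defaultdict(int)
    def emit(self, index, value):
        self.cells[index] += value      # `sum` adds everything it gets

class TopTable:
    def __init__(self, n):
        self.n, self.counts = n, Counter()
    def emit(self, index, value):
        self.counts[value] += 1         # `top` tracks most frequent items
    def result(self):
        return [v for v, _ in self.counts.most_common(self.n)]

# emit submitsthroughweek[minute] <- 1;
submitsthroughweek = SumTable()
for minute in [10, 10, 25]:
    submitsthroughweek.emit(minute, 1)
print(dict(submitsthroughweek.cells))  # -> {10: 2, 25: 1}
```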
F. An Example of Sawzall Script

The code shown in figure 6 is an example of a Sawzall script taken from [14], which analyzes web server logs to get a histogram of request frequency for each minute.

In line 1, a proto statement imports a proto file
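The per-minute histogram analysis described above can be approximated in plain Python. This is a rough model of the figure-6 computation, not the actual script: the record layout (a timestamp/URL pair) and the minute-of-week bucketing are invented for illustration; each mapper call emits (minute, 1) and a sum aggregator reduces.

```python
from collections import defaultdict

# Model of the figure-6 analysis: mappers emit (minute, 1) per log
# record; a sum-table aggregator accumulates the histogram.
# The (timestamp_seconds, url) record layout is hypothetical.
def mapper(record):
    ts, _url = record
    yield (ts // 60) % (7 * 24 * 60), 1   # bucket by minute of the week

def sum_reducer(pairs):
    table = defaultdict(int)
    for minute, count in pairs:
        table[minute] += count
    return dict(table)

logs = [(60, "/a"), (65, "/b"), (125, "/a")]
pairs = [kv for rec in logs for kv in mapper(rec)]
print(sum_reducer(pairs))  # -> {1: 2, 2: 1}
```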
WhenStatement =
  when ( {QuantifierDecl} BoolExpression ) Statement.
QuantifierDecl = var_name : quantifier Type ;.
quantifier = all | each | some.

Fig. 9. when statement in BNF.

def whenStmt = ("when" > "(" > whenVarDefs
    expr < ")" statement ) @ WhenStmt
def whenVarDefs = rep1(whenVarDef)
def whenVarDef = (varNames < ":" whenMod
    dataType < ";") @ WhenDefVar
def whenMod: WhenMod[Parser] = "some" @ SomeMod() |
    "each" @ EachMod() |
    "all" @ AllMod()

Fig. 10. when implementation with Parser Combinators.

Another reason is the Parser Combinators library that is included in the Scala [11] [6] standard library. Parser Combinators is a technique common in functional languages for describing parsers: parsing functions, each corresponding to a syntax component, are combined by combinator functions to compose a parsing function for the whole language. With Parser Combinators, parser functions can be described in a way similar to a description in BNF.

While the Parser Combinators technique is not specific to Scala and can be used in any language with first-class functions and operator overloading, Scala provides Parser Combinators as a part of the standard library that comes with the language.

To demonstrate parser combinators, the implementation of the when statement is shown in figure 10. Note that this is a plain Scala program, not parser generator source code. The when statement is a Sawzall-specific language construct that combines the while loop statement and the if condition statement. A BNF-based description is shown in figure 9. Although keywords, such as WhenStatement and whenStmt, differ, it is obvious that the Scala description resembles the BNF definition.

D. Compilation

We employed Java as the intermediate code and avoided compiling the Sawzall code directly into Java byte code. One of the reasons for this design decision is simply to make the implementation easy. Another reason is that, since Sawzall programs are usually not very big, the increased compilation time is negligible.

Figure 11 shows a part of the generated code for the word count code shown in figure 13. The compiled code implements the SCHelpers.Mapper interface, which is called from the mapper library of SawzallClone.

V. EVALUATION

We have evaluated SawzallClone from several points of view: compilation time, comparison with programs using the Hadoop native API, and comparison with szl. Furthermore, we compared parallel execution performance with a program written with the Hadoop native API.

A. Evaluation Environment

For the following evaluations, we used a 16-node PC cluster, each node with two sockets of four-core Xeon 5580 processors, 48 GB of memory, and SATA HDDs, connected with 10 Gbit Ethernet. We used Hadoop version 0.20.

B. Measurement of Compilation Time

We measured the time to compile the word count script shown in figure 13. The whole compilation took 1.2 seconds, including the Java compilation phase, which took 0.7 seconds. Given that a typical MapReduce computation lasts more than tens of minutes, the compilation time is negligible.

C. Comparison with Hadoop Native Program

Here, we compare SawzallClone with a Java program using the Hadoop API. We employed the word counting program for comparison. Figures 13 and 14 show the word counting programs in Sawzall and in Java with the Hadoop API, respectively. The former is just 6 lines, while the latter is about 60 lines. This demonstrates the efficiency of Sawzall for writing scripts.

We compared the execution speed using these programs. As the input we generated files with 69,906 lines, with 1 to 100 words in each line. The mapper is invoked on each line, which means that the loop in the script iterates once per word on each invocation. The evaluation was performed with just one worker and one reducer, since the goal of the experiment is to measure the overhead of Protocol Buffers. Note that the Sawzall version uses a Protocol Buffers encoded file, although this does not explicitly appear in the script, while the Java version uses plain text.

Figure 12 shows the result. The x-axis denotes the number of words in each line and the y-axis denotes the time spent. Each experiment was performed 5 times and the average time is plotted in the graph. The error bars in the graph denote the standard deviation of the values.

SawzallClone is substantially slower than the Java program. The overhead can be divided into Sawzall compiled code overhead and Protocol Buffers overhead. To separate these two, we modified the Java program so that it also uses Protocol Buffers. The result is also shown in figure 12. The difference between Sawzall and Java with Protocol Buffers is quite small, showing that the overhead mainly comes from Protocol Buffers.

D. Comparison with Szl

We compared SawzallClone with szl, the open source Sawzall implementation from Google. Since szl unfortunately does not support parallel execution so far, we compared them with sequential execution. For comparison, we employed the sample code from the paper [14], shown in figure 6. We generated log files with 0.5, 1, 1.5, 2, and 2.5 million records in Protocol Buffers format and ran SawzallClone and szl on them.^1

The result is shown in figure 15. While SawzallClone is substantially slower than szl, note that the difference is constant and does not depend on the number of records in the file. This means that the execution speed of SawzallClone is comparable with szl and that the difference comes from startup overhead, presumably due to Java VM initialization.

^1 The Protocol Buffers support in szl is intentionally turned off. We slightly modified the code to turn on the support for evaluation.
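The parser-combinator style shown for the when statement in figures 9 and 10 can be mimicked with a minimal sketch. This is plain Python rather than the Scala standard library combinators: each parser is a function from a token list to a (tree, remaining-tokens) pair or None, and the lit/seq/alt combinators (names invented here) compose them in a shape that mirrors the BNF.

```python
# Minimal parser-combinator sketch. Each parser: tokens -> (tree, rest)
# or None on failure; combinators compose parsers like BNF rules.
def lit(word):
    def p(toks):
        return (word, toks[1:]) if toks and toks[0] == word else None
    return p

def seq(*parsers):                      # sequencing: A B C
    def p(toks):
        out = []
        for q in parsers:
            r = q(toks)
            if r is None:
                return None
            tree, toks = r
            out.append(tree)
        return (out, toks)
    return p

def alt(*parsers):                      # alternation: A | B | C
    def p(toks):
        for q in parsers:
            r = q(toks)
            if r is not None:
                return r
        return None
    return p

# quantifier = all | each | some.
quantifier = alt(lit("all"), lit("each"), lit("some"))
print(quantifier(["each", "x"]))  # -> ('each', ['x'])
```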
[Excerpt of figure 13 (Sawzall word count):]
for (i: int = 0; i < len(words); i = i + 1) {
  emit t[words[i]] <- 1;
}

[Excerpt of figure 14 (Java word count with Hadoop API):]
public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  Job job = new Job(conf, "wordcount");
  ...
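The figure-13 loop excerpted above can be modeled in a few lines of Python. This is a sketch of the word-count mapper semantics, not SawzallClone's generated Java: the body runs once per input line and emits 1 per word into a sum table.

```python
from collections import defaultdict

# Model of the figure-13 loop: one mapper invocation per line,
# emitting 1 per word into a sum table.
def wordcount_mapper(line, table):
    words = line.split()
    for i in range(len(words)):      # for (i: int = 0; i < len(words); ...)
        table[words[i]] += 1         # emit t[words[i]] <- 1;

t = defaultdict(int)                 # the `t: table sum[...]` stand-in
for line in ["a b a", "b"]:
    wordcount_mapper(line, t)
print(dict(t))  # -> {'a': 2, 'b': 2}
```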
[Plot residue: x-axis "No. of Entries" (500000 to 2.5e+06), y-axis "Time [s]" (0 to 10), series szl and SawzallClone.]
Fig. 15. Comparison with szl.

[Plot residue: x-axis "# of workers" (1 to 16), y-axis normalized time (0 to 1.2).]
Fig. 17. Normalized Execution Time.
[Plot residue: x-axis "# of workers" (0 to 16), y-axis "Time [s]" (0 to 1000), series SawzallClone and Java.]
Fig. 16. Parallel Execution Comparison with Hadoop.

overhead by the language layer is small enough. We also compared it with Google's open source implementation, which is implemented in C++, and found that the execution speed is comparable.

The following are future work:

Compatibility Improvement: SawzallClone has some compatibility issues with szl, the open source implementation from Google, including the lack of function definition capability. We will fix these issues.

Adoption to other MapReduce systems: The system could be implemented on any MapReduce based system. We are developing a MapReduce system called SSS [12], based on a distributed key-value store, which is designed to be faster than Hadoop. We will adapt SawzallClone to that system.

Improve Aggregators: While Sawzall is designed to allow optimized aggregator implementations [14], SawzallClone just provides naively implemented ones. We will optimize the aggregator implementations.

Improve Language Constructs: Sawzall focuses on mappers only, making programs easy to understand and write. While this concept makes it a unique solution, there is apparently room for improvement. We will extend the language so that it can also handle reducers, not only mappers.

ACKNOWLEDGEMENT

This work was partly funded by the New Energy and Industrial Technology Development Organization (NEDO) Green-IT project.

REFERENCES

[1] Hadoop. http://hadoop.apache.org/.
[2] Hive. http://hive.apache.org/.
[3] jaql: Query Language for JavaScript(r) Object Notation. http://code.google.com/p/jaql/.
[4] Pig. http://pig.apache.org/.
[5] Protocol Buffers - Google's data interchange format. http://code.google.com/p/protobuf/.
[6] Scala. http://scala-lang.org/.
[7] Szl - a compiler and runtime for the Sawzall language. http://code.google.com/p/szl/.
[8] A. V. Aho, B. W. Kernighan, and P. J. Weinberger. The AWK Programming Language. Addison Wesley, 1988.
[9] S. Das, Y. Sismanis, K. S. Beyer, R. Gemulla, P. J. Haas, and J. McPherson. Ricardo: Integrating R and Hadoop. 2010.
[10] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI '04: Sixth Symposium on Operating System Design and Implementation, 2004.
[11] M. Odersky, L. Spoon, and B. Venners. Programming in Scala, Second Edition. 2010.
[12] H. Ogawa, H. Nakada, R. Takano, and T. Kudoh. SSS: An implementation of a key-value store based MapReduce framework. In Proceedings of the 2nd IEEE International Conference on Cloud Computing Technology and Science (accepted as a paper for the First International Workshop on Theory and Practice of MapReduce (MAPRED 2010)), pages 754-761, 2010.
[13] C. Olston, S. Chopra, and U. Srivastava. Generating example data for dataflow programs. In SIGMOD 2009, 2009.
[14] R. Pike, S. Dorward, R. Griesemer, and S. Quinlan. Interpreting the data: Parallel analysis with Sawzall. Scientific Programming Journal, Special Issue on Grids and Worldwide Computing Programming Models and Infrastructure, 13(4):227-298, 2005.
[15] Sun Microsystems, Inc. XDR: External Data Representation standard. RFC 1014, June 1987.
[16] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Antony, H. Liu, and R. Murthy. Hive - a petabyte scale data warehouse using Hadoop. In ICDE 2010: 26th IEEE International Conference on Data Engineering, 2010.