Sie sind auf Seite 1von 12

Pig Latin: A Not-So-Foreign Language for Data Processing

∗ † ‡
Christopher Olston Benjamin Reed Utkarsh Srivastava
Yahoo! Research Yahoo! Research Yahoo! Research
§ ¶
Ravi Kumar Andrew Tomkins
Yahoo! Research Yahoo! Research

ABSTRACT 1. INTRODUCTION
There is a growing need for ad-hoc analysis of extremely At a growing number of organizations, innovation revolves
large data sets, especially at internet companies where inno- around the collection and analysis of enormous data sets
vation critically depends on being able to analyze terabytes such as web crawls, search logs, and click streams. Inter-
of data collected every day. Parallel database products, e.g., net companies such as Amazon, Google, Microsoft, and Ya-
Teradata, offer a solution, but are usually prohibitively ex- hoo! are prime examples. Analysis of this data constitutes
pensive at this scale. Besides, many of the people who ana- the innermost loop of the product improvement cycle. For
lyze this data are entrenched procedural programmers, who example, the engineers who develop search engine ranking
find the declarative, SQL style to be unnatural. The success algorithms spend much of their time analyzing search logs
of the more procedural map-reduce programming model, and looking for exploitable trends.
its associated scalable implementations on commodity hard- The sheer size of these data sets dictates that it be stored
ware, is evidence of the above. However, the map-reduce and processed on highly parallel systems, such as shared-
paradigm is too low-level and rigid, and leads to a great deal nothing clusters. Parallel database products, e.g., Teradata,
of custom user code that is hard to maintain, and reuse. Oracle RAC, Netezza, offer a solution by providing a simple
We describe a new language called Pig Latin that we have SQL query interface and hiding the complexity of the phys-
designed to fit in a sweet spot between the declarative style ical cluster. These products however, can be prohibitively
of SQL, and the low-level, procedural style of map-reduce. expensive at web scale. Besides, they wrench programmers
The accompanying system, Pig, is fully implemented, and away from their preferred method of analyzing data, namely
compiles Pig Latin into physical plans that are executed writing imperative scripts or code, toward writing declara-
over Hadoop, an open-source, map-reduce implementation. tive queries in SQL, which they often find unnatural, and
We give a few examples of how engineers at Yahoo! are using overly restrictive.
Pig to dramatically reduce the time required for the develop- As evidence of the above, programmers have been flock-
ment and execution of their data analysis tasks, compared to ing to the more procedural map-reduce [4] programming
using Hadoop directly. We also report on a novel debugging model. A map-reduce program essentially performs a group-
environment that comes integrated with Pig, that can lead by-aggregation in parallel over a cluster of machines. The
to even higher productivity gains. Pig is an open-source, programmer provides a map function that dictates how the
Apache-incubator project, and available for general use. grouping is performed, and a reduce function that performs
the aggregation. What is appealing to programmers about
Categories and Subject Descriptors: this model is that there are only two high-level declarative
H.2.3 Database Management: Languages primitives (map and reduce) to enable parallel processing,
General Terms: Languages. but the rest of the code, i.e., the map and reduce functions,
can be written in any programming language of choice, and
∗olston@yahoo-inc.com without worrying about parallelism.
†breed@yahoo-inc.com Unfortunately, the map-reduce model has its own set of
limitations. Its one-input, two-stage data flow is extremely
‡utkarsh@yahoo-inc.com rigid. To perform tasks having a different data flow, e.g.,
§ravikuma@yahoo-inc.com joins or n stages, inelegant workarounds have to be devised.
¶atomkins@yahoo-inc.com Also, custom code has to be written for even the most com-
mon operations, e.g., projection and filtering. These factors
lead to code that is difficult to reuse and maintain, and in
which the semantics of the analysis task are obscured. More-
over, the opaque nature of the map and reduce functions
Permission to make digital or hard copies of all or part of this work for impedes the ability of the system to perform optimizations.
personal or classroom use is granted without fee provided that copies are We have developed a new language called Pig Latin that
not made or distributed for profit or commercial advantage and that copies
combines the best of both worlds: high-level declarative
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific querying in the spirit of SQL, and low-level, procedural pro-
permission and/or a fee. gramming à la map-reduce.
SIGMOD’08, June 9–12, 2008, Vancouver, BC, Canada.
Copyright 2008 ACM 978-1-60558-102-6/08/06 ...$5.00.
Example 1. Suppose we have a table urls: (url, describe the current implementation of Pig. We describe
category, pagerank). The following is a simple SQL query our novel debugging environment in Section 5, and outline
that finds, for each sufficiently large category, the average a few real usage scenarios of Pig in Section 6. Finally, we
pagerank of high-pagerank urls in that category. discuss related work in Section 7, and conclude.
SELECT category, AVG(pagerank)
FROM urls WHERE pagerank > 0.2
2. FEATURES AND MOTIVATION
GROUP BY category HAVING COUNT(*) > 106 The overarching design goal of Pig is to be appealing to
experienced programmers for performing ad-hoc analysis of
An equivalent Pig Latin program is the following. (Pig extremely large data sets. Consequently, Pig Latin has a
Latin is described in detail in Section 3; a detailed under- number of features that might seem surprising when viewed
standing of the language is not required to follow this exam- from a traditional database and SQL perspective. In this
ple.) section, we describe the features of Pig, and the rationale
behind them.
good_urls = FILTER urls BY pagerank > 0.2;
groups = GROUP good_urls BY category; 2.1 Dataflow Language
big_groups = FILTER groups BY COUNT(good_urls)>106 ; As seen in Example 1, in Pig Latin, a user specifies a se-
output = FOREACH big_groups GENERATE quence of steps where each step specifies only a single, high-
category, AVG(good_urls.pagerank); level data transformation. This is stylistically different from
the SQL approach where the user specifies a set of declara-
As evident from the above example, a Pig Latin program tive constraints that collectively define the result. While the
is a sequence of steps, much like in a programming language, SQL approach is good for non-programmers and/or small
each of which carries out a single data transformation. This data sets, experienced programmers who must manipulate
characteristic is immediately appealing to many program- large data sets often prefer the Pig Latin approach. As one
mers. At the same time, the transformations carried out in of our users says,
each step are fairly high-level, e.g., filtering, grouping, and
“I much prefer writing in Pig [Latin] versus SQL.
aggregation, much like in SQL. The use of such high-level
The step-by-step method of creating a program
primitives renders low-level manipulations (as required in
in Pig [Latin] is much cleaner and simpler to use
map-reduce) unnecessary.
than the single block method of SQL. It is eas-
In effect, writing a Pig Latin program is similar to specify-
ier to keep track of what your variables are, and
ing a query execution plan (i.e., a dataflow graph), thereby
where you are in the process of analyzing your
making it easier for programmers to understand and control
data.” – Jasmine Novak, Engineer, Yahoo!
how their data processing task is executed. To experienced
system programmers, this method is much more appealing Note that although Pig Latin programs supply an explicit
than encoding their task as an SQL query, and then coerc- sequence of operations, it is not necessary that the oper-
ing the system to choose the desired plan through optimizer ations be executed in that order. The use of high-level,
hints. (Automatic query optimization has its limits, espe- relational-algebra-style primitives, e.g., GROUP, FILTER, al-
cially with uncataloged data, prevalent user-defined func- lows traditional database optimizations to be carried out, in
tions, and parallel execution, which are all features of our cases where the system and/or user have enough confidence
target environment; see Section 2.) Even with a Pig Latin that query optimization can succeed.
optimizer in the future, users will still be able to request con- For example, suppose one is interested in the set of urls of
formation to the execution plan implied by their program. pages that are classified as spam, but have a high pagerank
Pig Latin has several other unconventional features that score. In Pig Latin, one can write:
are important for our setting of casual ad-hoc data analy-
sis by programmers. These features include support for a spam_urls = FILTER urls BY isSpam(url);
flexible, fully nested data model, extensive support for user- culprit_urls = FILTER spam_urls BY pagerank > 0.8;
defined functions, and the ability to operate over plain input where culprit_urls contains the final set of urls that the
files without any schema information. Pig Latin also comes user is interested in.
with a novel debugging environment that is especially use- The above Pig Latin fragment suggests that we first find
ful when dealing with enormous data sets. We elaborate on the spam urls through the function isSpam, and then filter
these features in Section 2. them by pagerank. However, this might not be the most
Pig Latin is fully implemented by our system, Pig, and is efficient method. In particular, isSpam might be an expen-
being used by programmers at Yahoo! for data analysis. Pig sive user-defined function that analyzes the url’s content for
Latin programs are currently compiled into (ensembles of) spaminess. Then, it will be much more efficient to filter the
map-reduce jobs that are executed using Hadoop, an open- urls by pagerank first, and invoke isSpam only on the pages
source, scalable implementation of map-reduce. Alternative that have high pagerank.
backends can also be plugged in. Pig is an open-source With Pig Latin, this optimization opportunity is available
project in the Apache incubator1 . to the system. On the other hand, if these filters were buried
The rest of the paper is organized as follows. In the next within an opaque map or reduce function, such reordering
section, we describe the various features of Pig, and the and optimization would effectively be impossible.
underlying motivation for each. In Section 3, we dive into
the Pig Latin data model and language. In Section 4, we 2.2 Quick Start and Interoperability
Pig is designed to support ad-hoc data analysis. If a user
1
http://incubator.apache.org/pig has a data file obtained, say, from a dump of the search
engine logs, she can run Pig Latin queries over it directly. • Data is often stored on disk in an inherently nested fash-
She need only provide a function that gives Pig the ability to ion. For example, a web crawler might output for each
parse the content of the file into tuples. There is no need to url, the set of outlinks from that url. Since Pig oper-
go through a time-consuming data import process prior to ates directly on files (Section 2.2), separating the data
running queries, as in conventional database management out into normalized form, and later recombining through
systems. Similarly, the output of a Pig program can be joins can be prohibitively expensive for web-scale data.
formatted in the manner of the user’s choosing, according • A nested data model also allows us to fulfill our goal of
to a user-provided function that converts tuples into a byte having an algebraic language (Section 2.1), where each
sequence. Hence it is easy to use the output of a Pig analysis step carries out only a single data transformation. For
session in a subsequent application, e.g., a visualization or example, each tuple output by our GROUP primitive has
spreadsheet application such as Excel. one non-atomic field: a nested set of tuples from the
It is important to keep in mind that Pig is but one of many input that belong to that group. The GROUP construct is
applications in the rich “data ecosystem” of a company like explained in detail in Section 3.5.
Yahoo! By operating over data residing in external files, and
not taking over control over the data, Pig readily interoper- • A nested data model allows programmers to easily write
ates with other applications in the ecosystem. a rich set of user-defined functions, as shown in the next
The reasons that conventional database systems do re- section.
quire importing data into system-managed tables are three- 2.4 UDFs as First-Class Citizens
fold: (1) to enable transactional consistency guarantees, (2)
to enable efficient point lookups (via physical tuple identi- A significant part of the analysis of search logs, crawl data,
fiers), and (3) to curate the data on behalf of the user, and click streams, etc., is custom processing. For example, a user
record the schema so that other users can make sense of the may be interested in performing natural language stemming
data. Pig only supports read-only data analysis workloads, of a search term, or figuring out whether a particular web
and those workloads tend to be scan-centric, so transactional page is spam, and countless other tasks.
consistency and index-based lookups are not required. Also, To accommodate specialized data processing tasks, Pig
in our environment users often analyze a temporary data set Latin has extensive support for user-defined functions
for a day or two, and then discard it, so data curating and (UDFs). Essentially all aspects of processing in Pig Latin in-
schema management can be overkill. cluding grouping, filtering, joining, and per-tuple processing
In Pig, stored schemas are strictly optional. Users may can be customized through the use of UDFs.
supply schema information on the fly, or perhaps not at all. The input and output of UDFs in Pig Latin follow our
Thus, in Example 1, if the user knows the the third field of flexible, fully nested data model. Consequently, a UDF to
the file that stores the urls table is pagerank but does not be used in Pig Latin can take non-atomic parameters as
want to provide the schema, the first line of the Pig Latin input, and also output non-atomic values. This flexibility is
program can be written as: often very useful as shown by the following example.
good_urls = FILTER urls BY $2 > 0.2; Example 2. Continuing with the setting of Example 1,
where $2 uses positional notation to refer to the third field. suppose we want to find for each category, the top 10 urls
according to pagerank. In Pig Latin, one can simply write:
2.3 Nested Data Model
groups = GROUP urls BY category;
Programmers often think in terms of nested data struc-
output = FOREACH groups GENERATE
tures. For example, to capture information about the posi-
category, top10(urls);
tional occurrences of terms in a collection of documents, a
programmer would not think twice about creating a struc- where top10() is a UDF that accepts a set of urls (for each
ture of the form Map< documentId, Set<positions> > for group at a time), and outputs a set containing the top 10
each term. urls by pagerank for that group.2 Note that our final output
Databases, on the other hand, allow only flat tables, i.e., in this case contains non-atomic fields: there is a tuple for
only atomic fields as columns, unless one is willing to violate each category, and one of the fields of the tuple is the set of
the First Normal Form (1NF) [7]. To capture the same in- the top 10 urls in that category.
formation about terms above, while conforming to 1NF, one
would need to normalize the data by creating two tables: Due to our flexible data model, the return type of a UDF
does not restrict the context in which it can be used. Pig
term_info: (termId, termString, ...) Latin has only one type of UDF that can be used in all the
position_info: (termId, documentId, position) constructs such as filtering, grouping, and per-tuple process-
The same positional occurence information can then be ing. This is in contrast to SQL, where only scalar functions
reconstructed by joining these two tables on termId and may be used in the SELECT clause, set-valued functions can
grouping on termId, documentId. only appear in the FROM clause, and aggregation functions
Pig Latin has a flexible, fully nested data model (described can only be applied in conjunction with a GROUP BY or a
in Section 3.1), and allows complex, non-atomic data types PARTITION BY.
such as set, map, and tuple to occur as fields of a table. Currently, Pig UDFs are written in Java. We are building
There are several reasons why a nested model is more ap- support for interfacing with UDFs written in arbitrary lan-
propriate for our setting than 1NF: 2
In practice, a user would probably write a more generic
• A nested data model is closer to how programmers think, function than top10(): one that takes k as a parameter to
and consequently much more natural to them than nor- find the top k tuples, and also the field according to which
malization. the top k must be found (pagerank in this example).
„  ff «
guages, including C/C++, Java, Perl and Python, so that (‘lakers’, 1) ˆ ˜
t = ‘alice’, , ‘age’ → 20
users can stick with their language of choice. (‘iPod’, 2)

2.5 Parallelism Required Let fields of tuple t be called f1, f2, f3


Since Pig Latin is geared toward processing web-scale
data, it does not make sense to consider non-parallel eval- Expression Type Example Value for t
uation. Consequently, we have only included in Pig Latin Constant ‘bob’ Independent of t
a small set of carefully chosen primitives that can be easily Field by position $0 ˆ ‘alice’ ˜
parallelized. Language primitives that do not lend them- Field by name f3  ‘age’ → 20 ff
selves to efficient parallel evaluation (e.g., non-equi-joins, (‘lakers’)
correlated subqueries) have been deliberately excluded. Projection f2.$0
(‘iPod’)
Such operations can of course, still be carried out by writ- Map Lookup f3#‘age’ 20
ing UDFs. However, since the language does not provide Function Evaluation SUM(f2.$1) 1 + 2 = 3
explicit primitives for such operations, users are aware of Conditional f3#‘age’>18?
how efficient their programs will be and whether they will ‘adult’
Expression ‘adult’:‘minor’
be parallelized.
‘lakers’, 1
Flattening FLATTEN(f2)
2.6 Debugging Environment ‘iPod’, 2
In any language, getting a data processing program right
Table 1: Expressions in Pig Latin.
usually takes many iterations, with the first few iterations
usually having some user-introduced erroneous processing.
At the scale of data that Pig is meant to process, a single The above example also demonstrates that tuples can be
iteration can take many minutes or hours (even with large- nested, e.g., the second tuple of the bag has a nested
scale parallel processing). Thus, the usual run-debug-run tuple as its second field.
cycle can be very slow and inefficient. • Map: A map is a collection of data items, where each
Pig comes with a novel interactive debugging environment item has an associated key through which it can be
that generates a concise example data table illustrating the looked up. As with bags, the schema of the constituent
output of each step of the user’s program. The example data data items is flexible, i.e., all the data items in the map
is carefully chosen to resemble the real data as far as possible need not be of the same type. However, the keys are re-
and also to fully illustrate the semantics of the program. quired to be data atoms, mainly for efficiency of lookups.
Moreover, the example data is automatically adjusted as The following is an example of a map:
the program evolves. 2 
(‘lakers’)
ff 3
This step-by-step example data can help in detecting er- 4 ‘fan of’ → (‘iPod’) 5
rors early (even before the first iteration of running the pro-
‘age’ → 20
gram on the full data), and also in pinpointing the step that
has errors. The details of our debugging environment are In the above map, the key ‘fan of’ is mapped to a bag
provided in Section 5. containing two tuples, and the key ‘age’ is mapped to
an atom 20.
3. PIG LATIN A map is especially useful to model data sets where
In this section, we describe the details of the Pig Latin schemas might change over time. For example, if web
language. We describe our data model in Section 3.1, and servers decide to include a new field while dumping logs,
the Pig Latin statements in the subsequent subsections. The that new field can simply be included as an additional
emphasis of this section is not on the syntactical details of key in a map, thus requiring no change to existing pro-
Pig Latin, but on how it meets the design goals and features grams, and also allowing access of the new field to new
laid out in Section 2. Also, this section only focusses on programs.
the language primitives, and not on how they can be imple- Table 1 shows the expression types in Pig Latin, and how
mented to execute in parallel over a cluster. Implementation they operate. (The flattening expression is explained in de-
is covered in Section 4. tail in Section 3.3.) It should be evident that our data model
is very flexible and permits arbitrary nesting. This flexibil-
3.1 Data Model ity allows us to achieve the aims outlined in Section 2.3,
Pig has a rich, yet simple data model consisting of the where we motivated our use of a nested data model. Next,
following four types: we describe the Pig Latin commands.
• Atom: An atom contains a simple atomic value such as
a string or a number, e.g., ‘alice’. 3.2 Specifying Input Data: LOAD
• Tuple: A tuple is a sequence of fields, each of which can The first step in a Pig Latin program is to specify what
be any of the data types, e.g., (‘alice’, ‘lakers’). the input data files are, and how the file contents are to be
deserialized, i.e., converted into Pig’s data model. An input
• Bag: A bag is a collection of tuples with possible dupli- file is assumed to contain a sequence of tuples, i.e., a bag.
cates. The schema of the constituent tuples is flexible, This step is carried out by the LOAD command. For example,
i.e., not all tuples in a bag need to have the same number
and type of fields, e.g., queries = LOAD ‘query_log.txt’
 ff USING myLoad()
(‘alice’, ‘lakers’)
` ´ AS (userId, queryString, timestamp);
‘alice’, (‘iPod’, ‘apple’)
Figure 1: Example of flattening in FOREACH.

The above command specifies the following: In general, the GENERATE clause can be followed by a list of
• The input file is query_log.txt. expressions that can be in any of the forms listed in Table 1.
Most of the expression forms shown are straightforward, and
• The input file should be converted into tuples by using have been included here only for completeness. The last
the custom myLoad deserializer. expression type, i.e., flattening, deserves some attention.
• The loaded tuples have three fields named userId, Often, we want to eliminate nesting in data. For example,
queryString, and timestamp. in Figure 1, we might want to flatten the set of possible
Both the USING clause (the custom deserializer) and the expansions of a query string into separate tuples, so that
AS clause (the schema information) are optional. If no de- they can be processed independently. We could also want
serializer is specified, a default one, that expects a plain to flatten the final result just prior to storing it. Nesting
text, tab-delimited file, is used. If no schema is specified, can be eliminated by the use of the FLATTEN keyword in the
fields must be referred to by position instead of by name GENERATE clause. Flattening operates on bags by extracting
(e.g., $0 for the first field). The ability to operate over plain the fields of the tuples in the bag, and making them fields
text files, and the ability to specify schema information on of the tuple being output by GENERATE, thus removing one
the fly or not at all, allows the user to get started quickly level of nesting. For example, the output of the following
(Section 2.2). To aid readability, it is desirable to include command is shown as the second step in Figure 1.
schemas while writing large Pig Latin programs.
expanded_queries = FOREACH queries GENERATE
The return value of a LOAD command is a handle to a
userId, FLATTEN(expandQuery(queryString));
bag which, in the above example, is assigned to the variable
queries. This variable can then be used as an input in sub-
sequent Pig Latin commands. Note that the LOAD command 3.4 Discarding Unwanted Data: FILTER
does not imply database-style loading into tables. Bag han- Another very common operation is to retain only some
dles in Pig Latin are only logical—the LOAD command merely subset of the data that is of interest, while discarding the
specifies what the input file is, and how it should be read. rest. This operation is done by the FILTER command. For
No data is actually read, and no processing carried out, until example, to get rid of bot traffic in the bag queries:
the user explicitly asks for output (see STORE command in real_queries = FILTER queries BY userId neq ‘bot’;
Section 3.8).
The operator neq in the above example is used to signify
string comparison, as opposed to numeric comparison which
3.3 Per-tuple Processing: FOREACH is specified via ==. Filtering conditions in Pig Latin can
Once input data file(s) have been specified through LOAD, involve a combination of expressions (Table 1), comparison
one can specify the processing that needs to be carried out operators such as ==, eq, !=, neq, and the logical connec-
on the data. One of the basic operations is that of applying tors AND, OR, and NOT.
some processing to every tuple of a data set. This is achieved Since arbitrary expressions are allowed, it follows that we
through the FOREACH command. For example, can use UDFs while filtering. Thus, in our less ideal world,
where bots don’t identify themselves, we can use a sophisti-
expanded_queries = FOREACH queries GENERATE cated UDF (isBot) to perform the filtering, e.g.,
userId, expandQuery(queryString);
real_queries =
The above command specifies that each tuple of the bag FILTER queries BY NOT isBot(userId);
queries (loaded in the previous section) should be processed
independently to produce an output tuple. The first field of 3.5 Getting Related Data Together: COGROUP
the output tuple is the userId field of the input tuple, and Per-tuple processing only takes us so far. It is often nec-
the second field of the output tuple is the result of applying essary to group together tuples from one or more data sets,
the UDF expandQuery to the queryString field of the input that are related in some way, so that they can subsequently
tuple. Suppose the UDF expandQuery generates a bag of be processed together. This grouping operation is done by
likely expansions of a given query string. Then an exam- the COGROUP command. For example, suppose we have two
ple transformation carried out by the above statement is as data sets that we have specified through a LOAD command:
shown in the first step of Figure 1.
Note that the semantics of the FOREACH command are such results: (queryString, url, position)
that there can be no dependence between the processing of revenue: (queryString, adSlot, amount)
different tuples of the input, thereby permitting an efficient results contains, for different query strings, the urls shown
parallel implementation. These semantics conform to our as search results, and the position at which they were shown.
goal of only having parallelizable operations (Section 2.5). revenue contains, for different query strings, and different
Figure 2: COGROUP versus JOIN.

advertisement slots, the average amount of revenue made by To specify the same operation in SQL, one would have
the advertisements for that query string at that slot. Then to join by queryString, then group by queryString, and
to group together all search result data and revenue data for then apply a custom aggregation function. But while doing
the same query string, we can write: the join, the system would compute the cross product of the
search and revenue information, which the custom aggre-
grouped_data = COGROUP results BY queryString, gation function would then have to undo. Thus, the whole
revenue BY queryString; process become quite inefficient, and the query becomes hard
to read and understand.
Figure 2 shows what a tuple in grouped_data looks like.
In general, the output of a COGROUP contains one tuple for To reiterate, the COGROUP statement illustrates a key dif-
each group. The first field of the tuple (named group) is the ference between Pig Latin and SQL. The COGROUP state-
group identifier (in this case, the value of the queryString ments conforms to our goal of having an algebraic language,
field). Each of the next fields is a bag, one for each input where each step carries out only a single transformation
being cogrouped, and is named the same as the alias of that (Section 2.1). COGROUP carries out only the operation of
input. The ith bag contains all tuples from the ith input grouping together tuples into nested bags. The user can
belonging to that group. As in the case of filtering, grouping subsequently choose to apply either an aggregation function
can also be performed according to arbitrary expressions on those tuples, or cross-product them to get the join result,
which may include UDFs. or process it in a custom way as in Example 3. In SQL,
The reader may wonder why a COGROUP primitive is needed grouping is available only bundled with either aggregation
at all, since a very similar primitive is provided by the fa- (group-by-aggregate queries), or with cross-producting (the
miliar, well-understood, JOIN operation in databases. For JOIN operation). Users find this additional flexibility of Pig
comparison, Figure 2 also shows the result of joining our Latin over SQL quite attractive, e.g.,
data sets on queryString. It is evident that JOIN is equiv-
alent to COGROUP, followed by taking a cross product of the “I frankly like pig much better than SQL in some
tuples in the nested bags. While joins are widely applicable, respects (group + optional flatten works better
certain custom processing might require access to the tuples for me, I love nested data structures).”
of the groups before the cross-product is taken, as shown by – Ted Dunning, Chief Scientist, Veoh Networks
the following example.
Note that it is our nested data model that allows us to
have COGROUP as an independent operation—the input tu-
Example 3. Suppose we were trying to attribute search ples are grouped together and put in nested bags. Such
revenue to search-result urls to figure out the monetary worth a primitive is not possible in SQL since the data model is
of each url. We might have a sophisticated model for doing flat. Of course, such a nested model raises serious concerns
so. To accomplish this task in Pig Latin, we can follow the about efficiency of implementation: since groups can be very
COGROUP with the following statement: large (bigger than main memory, perhaps), we might build
url_revenues = FOREACH grouped_data GENERATE up gigantic tuples, which have these enormous nested bags
FLATTEN(distributeRevenue(results, revenue)); within them. We address these efficiency concerns in our
implementation section (Section 4).
where distributeRevenue is a UDF that accepts search re-
sults and revenue information for a query string at a time, 3.5.1 Special Case of COGROUP: GROUP
and outputs a bag of urls and the revenue attributed to them. A common special case of COGROUP is when there is only
For example, distributeRevenue might attribute revenue one data set involved. In this case, we can use the alter-
from the top slot entirely to the first search result, while the native, more intuitive keyword GROUP. Continuing with our
revenue from the side slot may be attributed equally to all example, if we wanted to find the total revenue for each
the results. In this case, the output of the above statement query string, (a typical group-by-aggregate query), we can
for our example data is shown in Figure 2. write it as follows:
grouped_revenue = GROUP revenue BY queryString; These commands are used as one would expect. For exam-
query_revenues = FOREACH grouped_revenue GENERATE ple, continuing with the example of Section 3.5.1, to order
queryString, the query strings by their revenue:
SUM(revenue.amount) AS totalRevenue;
ordered_result = ORDER query_revenues BY
totalRevenue;
In the second statement above, revenue.amount refers
to a projection of the nested bag in the tuples of
grouped_revenue. Also, as in SQL, the AS clause is used 3.7 Nested Operations
to assign names to fields on the fly. Each of the Pig Latin processing commands described so
To group all tuples of a data set together (e.g., to com- far operate over one or more bags of tuples as input. As
pute the overall total revenue), one uses the syntax GROUP illustrated in the previous sections, these commands collec-
revenue ALL. tively form a very powerful language. When we have nested
bags within tuples, either as a result of (co)grouping, or due
3.5.2 JOIN in Pig Latin to the base data being nested, we might want to harness the
Not all users need the flexibility offered by COGROUP. In same power of Pig Latin to process even these nested bags.
many cases, all that is required is a regular equi-join. Thus, To allow such processing, Pig Latin allows some commands
Pig Latin provides a JOIN keyword for equi-joins. For ex- to be nested within a FOREACH command.
ample, For example, continuing with the data set of Section 3.5,
suppose we wanted to compute for each queryString, the
join_result = JOIN results BY queryString,
total revenue due to the ‘top’ ad slot, and also the overall
revenue BY queryString;
total revenue. This can be written in Pig Latin as follows:
It is easy to verify that JOIN is only a syntactic shortcut
grouped_revenue = GROUP revenue BY queryString;
for COGROUP followed by flattening. The above join command
query_revenues = FOREACH grouped_revenue{
is equivalent to:
top_slot = FILTER revenue BY
temp_var = COGROUP results BY queryString, adSlot eq ‘top’;
revenue BY queryString; GENERATE queryString,
join_result = FOREACH temp_var GENERATE SUM(top_slot.amount),
FLATTEN(results), FLATTEN(revenue); SUM(revenue.amount);
};

3.5.3 Map-Reduce in Pig Latin In the above Pig Latin fragment, revenue is first grouped
by queryString as before. Then each group is processed
With the GROUP and FOREACH statements, it is trivial to
by a FOREACH command, within which is a FILTER com-
express a map-reduce [4] program in Pig Latin. Converting
mand that operates on the nested bags on the tuples of
to our data-model terminology, a map function operates on
grouped_revenue. Finally the GENERATE statement within
one input tuple at a time, and outputs a bag of key-value
the FOREACH outputs the required fields.
pairs. The reduce function then operates on all values for a
At present, we only allow FILTER, ORDER, and DISTINCT
key at a time to produce the final result. In Pig Latin,
to be nested within FOREACH. In the future, as need arises,
map_result = FOREACH input GENERATE FLATTEN(map(*)); we might allow other constructs to be nested as well.
key_groups = GROUP map_result BY $0;
output = FOREACH key_groups GENERATE reduce(*);
3.8 Asking for Output: STORE
The user can ask for the result of a Pig Latin expression
The first line applies the map UDF to every tuple on the sequence to be materialized to a file, by issuing the STORE
input, and flattens the bag of key value pairs that it pro- command, e.g.,
duces. (We use the shorthand * as in SQL to denote that STORE query_revenues INTO ‘myoutput’
all the fields of the input tuples are passed to the map UDF.) USING myStore();
Assuming the first field of the map output to be the key, the
second statement groups by key. The third statement then The above command specifies that bag query_revenues
passes the bag of values for every key to the reduce UDF to should be serialized to the file myoutput using the custom
obtain the final result. serializer myStore. As with LOAD, the USING clause may be
omitted for a default serializer that writes plain text, tab-
3.6 Other Commands delimited files. Our system also comes with a built-in serial-
Pig Latin has a number of other commands that are very izer/deserializer that can load/store arbitrarily nested data.
similar to their SQL counterparts. These are:
1. UNION: Returns the union of two or more bags. 4. IMPLEMENTATION
Pig Latin is fully implemented by our system, Pig. Pig
2. CROSS: Returns the cross product of two or more bags. is architected to allow different systems to be plugged in as
3. ORDER: Orders a bag by the specified field(s). the execution platform for Pig Latin. Our current imple-
mentation uses Hadoop [10], an open-source, scalable imple-
4. DISTINCT: Eliminates duplicate tuples in a bag. This mentation of map-reduce [4], as the execution platform. Pig
command is just a shortcut for grouping the bag by all Latin programs are compiled into map-reduce jobs, and exe-
fields, and then projecting out the groups. cuted using Hadoop. Pig, along with its Hadoop compiler, is
an open-source project in the Apache incubator, and hence
available for general use.
We first describe how Pig builds a logical plan for a Pig
Latin program. We then describe our current compiler, that
compiles a logical plan into map-reduce jobs executed using
Hadoop. Last, we describe how our implementation avoids
large nested bags, and how it handles them if they do arise.

4.1 Building a Logical Plan Figure 3: Map-reduce compilation of Pig Latin.


As clients issue Pig Latin commands, the Pig interpreter
first parses it, and verifies that the input files and bags be-
ing referred to by the command are valid. For example, if Parallelism for LOAD is obtained since Pig operates over
the user enters c = COGROUP a BY . . ., b BY . . ., Pig veri- files residing in the Hadoop distributed file system. We also
fies that the bags a and b have already been defined. Pig automatically get parallelism for FILTER and FOREACH oper-
builds a logical plan for every bag that the user defines. ations since for a given map-reduce job, several map and re-
When a new bag is defined by a command, the logical plan duce instances are run in parallel. Parallelism for (CO)GROUP
for the new bag is constructed by combining the logical plans is achieved since the output from the multiple map instances
for the input bags, and the current command. Thus, in the is repartitioned in parallel to the multiple reduce instances.
above example, the logical plan for c consists of a cogroup The ORDER command is implemented by compiling into
command having the logical plans for a and b as inputs. two map-reduce jobs. The first job samples the input to
Note that no processing is carried out when the logical determine quantiles of the sort key. The second job range-
plans are constructed. Processing is triggered only when the partitions the input according to the quantiles (thereby en-
user invokes a STORE command on a bag. At that point, the suring roughly equal-sized partitions), followed by local sort-
logical plan for that bag is compiled into a physical plan, ing in the reduce phase, resulting in a globally sorted file.
and is executed. This lazy style of execution is beneficial The inflexibility of the map-reduce primitive results in
because it permits in-memory pipelining, and other opti- some overheads while compiling Pig Latin into map-reduce
mizations such as filter reordering across multiple Pig Latin jobs. For example, data must be materialized and replicated
commands. on the distributed file system between successive map-reduce
Pig is architected such that the parsing of Pig Latin and jobs. When dealing with multiple data sets, an additional
the logical plan construction is independent of the execu- field must be inserted in every tuple to indicate which data
tion platform. Only the compilation of the logical plan into set it came from. However, the Hadoop map-reduce im-
a physical plan depends on the specific execution platform plementation does provide many desired properties such as
chosen. Next, we describe the compilation into Hadoop parallelism, load-balancing, and fault-tolerance. Given the
map-reduce, the execution platform currently used by Pig. productivity gains to be had through Pig Latin, the asso-
ciated overhead is often acceptable. Besides, there is the
4.2 Map-Reduce Plan Compilation possibility of plugging in a different execution platform that
Compilation of a Pig Latin logical plan into map-reduce can implement Pig Latin operations without such overheads.
jobs is fairly simple. The map-reduce primitive essentially
provides the ability to do a large-scale group by, where the
map tasks assign keys for grouping, and the reduce tasks 4.3 Efficiency With Nested Bags
process a group at a time. Our compiler begins by converting Recall Section 3.5. Conceptually speaking, our (CO)GROUP
each (CO)GROUP command in the logical plan into a distinct command places tuples belonging to the same group into
map-reduce job with its own map and reduce functions. one or more nested bags. In many cases, the system can
The map function for (CO)GROUP command C initially just avoid actually materializing these bags, which is especially
assigns keys to tuples based on the BY clause(s) of C; the important when the bags are larger than one machine’s main
reduce function is initially a no-op. The map-reduce bound- memory.
ary is the cogroup command. The sequence of FILTER, and One common case is where the user applies a distribu-
FOREACH commands from the LOAD to the first COGROUP op- tive or algebraic [8] aggregation function over the result of
eration C1 , are pushed into the map function corresponding a (CO)GROUP operation. (Distributive is a special case of
to C1 (see Figure 3). The commands that intervene between algebraic, so we will only discuss algebraic functions.) An
subsequent COGROUP commands Ci and Ci+1 can be pushed algebraic function is one that can be structured as a tree
into either (a) the reduce function corresponding to Ci , or of subfunctions, with each leaf subfunction operating over a
(b) the map function corresponding to Ci+1 . Pig currently subset of the input data. If nodes in this tree achieve data
always follows option (a). Since grouping is often followed reduction, then the system can keep the amount of data
by aggregation, this approach reduces the amount of data materialized in any single location small. Examples of al-
that has to be materialized between map-reduce jobs. gebraic functions abound: COUNT, SUM, MIN, MAX, AVERAGE,
In the case of a COGROUP command with more than one VARIANCE, although some useful functions are not algebraic,
input data set, the map function appends an extra field to e.g., MEDIAN.
each tuple that identifies the data set from which the tuple When Pig compiles programs into Hadoop map-reduce
originated. The accompanying reduce function decodes this jobs, it uses Hadoop’s combiner feature to achieve a two-tier
information and uses it to insert the tuple into the appro- tree evaluation of algebraic functions. Pig provides a special
priate nested bag when cogrouped tuples are formed (recall API for algebraic user-defined functions, so that custom user
Figure 2). functions can take advantage of this important optimization.
Figure 4: Pig Pen screenshot; displayed program finds users who tend to visit high-pagerank pages.

Nevertheless, there still remain cases where (CO)GROUP is To avoid these problems successfully, the side data set must
followed by something other than an algebraic UDF, e.g., be tailored to the particular user program at hand. We refer
the program in Example 3.5, where distributeRevenue is to this dynamically-constructed side data set as a sandbox
not algebraic. To cope with these cases, our implementation data set; we briefly describe how it is created in Section 5.1.
allows for nested bags to spill to disk. Our disk-resident Pig Pen’s user interface consists of a two-panel window as
bag implementation comes with database-style external sort shown in Figure 4. The left-hand panel is where the user
algorithms to do operations such as sorting and duplicate enters her Pig Latin commands. The right-hand panel is
elimination of the nested bags (recall Section 3.7). populated automatically, and shows the effect of the user’s
program on the sandbox data set. In particular, the interme-
diate bag produced by each Pig Latin command is displayed.
5. DEBUGGING ENVIRONMENT Suppose we have two data sets: a log of page visits, vis-
The process of constructing a Pig Latin program is typ- its: (user, url, time), and a catalog of pages and their
ically an iterative one: The user makes an initial stab at pageranks, pages: (url, pagerank). The program shown
writing a program, submits it to the system for execution, in Figure 4 finds web surfers who tend to visit high-pagerank
and inspects the output to determine whether the program pages. The program joins the two data sets after first run-
had the intended effect. If not, the user revises the program ning the log entries through a UDF that converts urls to a
and repeats this process. If programs take a long time to canonical form. After the join, the program groups tuples
execute (e.g., because the data is large), this process can be by user, computes the average pagerank for each user, and
inefficient. then filters users by average pagerank.
To avoid this inefficiency, users often create a side data set The right-hand panel of Figure 4 shows a sandbox data
consisting of a small sample of the original one, for experi- set, and how it is transformed by each successive command.
mentation. Unfortunately this method does not always work The main semantics of each command are illustrated via the
well. As a simple example, suppose the program performs sandbox data set: We see that the JOIN command matches
an equijoin of tables A(x,y) and B(x,z) on attribute x. If visits tuples with pages tuples on url. We also see that
the original data contains many distinct values for x, then grouping by user creates one tuple per group, possibly con-
it is unlikely that a small sample of A and a small sample taining multiple nested tuples as in the case of Amy. Lastly
of B will contain any matching x values [3]. Hence the join we see that the FOREACH command eliminates the nesting via
over the sample data set may well produce an empty result, aggregation, and that the FILTER command eliminates Fred,
even if the program is correct. Similarly, a program with a whose average pagerank is too low.
selective filter executed on a sample data set may produce If one or more commands had been written incorrectly,
an empty result. In general it can be difficult to test the e.g., if the user had forgotten to include group following
semantics of a program over a sample data set. FOREACH, the problem would be apparent in the right-hand
Pig comes with a debugging environment called Pig Pen, panel. Similarly, if the program contains UDFs (as is com-
which creates a side data set automatically, and in a manner mon among real Pig users), the right-hand panel indicates
that avoids the problems outlined in the previous paragraph. whether the correct UDF is being applied, and whether it
is behaving as intended. The sandbox data set also helps ones are also easy to write in map-reduce) to more complex
users understand the schema at each step, which is espe- tasks that use the full flexibility and power of Pig Latin.
cially helpful in the presence of nested data.
In addition to helping the user spot bugs, this kind of in- Rollup aggregates: A common category of processing
terface facilitates writing a program in an incremental fash- done using Pig involves computing various kinds of rollup
ion: We write the first three commands and inspect the aggregates against user activity logs, web crawls, and other
right-hand panel to ensure that we are joining the tables data sets. One example is to calculate the frequency of
properly. Then we add the GROUP command and use the search terms aggregated over days, weeks, or months, and
right-hand panel to help understand the schema of its out- also over geographical location as implied by the IP ad-
put tuples. Then we proceed to add the FOREACH and FIL- dress. Some tasks require two or more successive aggre-
TER commands, one at a time, until we arrive at the desired gation passes, e.g., count the number of searches per user,
result. Once we are convinced the program is correct, we and then compute the average per-user count. Other tasks
submit it for execution over the real data. require a join followed by aggregation, e.g., match search
phrases with n-grams found in web page anchortext strings,
5.1 Generating a Sandbox Data Set and then count the number of matches per page. In such
Pig Pen’s sandbox data set generator takes as input a Pig cases, Pig orchestrates a sequence of multiple map-reduce
Latin program P consisting of a sequence of n commands, jobs on the user’s behalf.
where each command consumes one or more input bags and The primary reason for using Pig rather than a
produces an output bag. The output of the data set gen- database/OLAP system for these rollup analyses, is that the
erator is a set of example bags {B1 , B2 , . . . , Bn }, one corre- search logs are too big and continuous to be curated and
sponding to the output of each command in P , as illustrated loaded into databases, and hence are just present as (dis-
in Figure 4. The set of example bags is required to be consis- tributed) files. Pig provides an easy way to compute such
tent, meaning that the example output bag of each operator aggregates directly over these files, and also makes it easy to
is exactly the bag produced by executing the command over incorporate custom processing, such as IP-to-geo mapping
its example input bag(s). The example bags that correspond and n-gram extraction.
to the LOAD commands comprise the sandbox data set.
Temporal analysis: Temporal analysis of search logs
There are three primary objectives in selecting a sandbox
mainly involves studying how search query distributions
data set:
change over time. This task usually involves cogrouping
• Realism. The sandbox data set should be a subset of search queries of one period with those of another period
the actual data set, if possible. If not, then to the extent in the past, followed by custom processing of the queries in
possible the individual data values should be ones found each group. The COGROUP command is a big win here since
in the actual data set. it provides access to the set of search queries in each group
• Conciseness. The example bags should be as small as (recall Figure 2), making it easy and intuitive to write UDFs
possible. for custom processing. If a regular join were carried out in-
• Completeness. The example bags should collectively stead, a cross product of search queries in the same group
illustrate the key semantics of each command. would be returned, leading to problems as in Example 3.
As an example of what is meant by the completeness ob- Session analysis: In session analysis, web user sessions,
jective, the example bags before and after the GROUP com- i.e., sequences of page views and clicks made by users, are
mand in Figure 4 serve to illustrate the semantics of group- analyzed to calculate various metrics such as: how long is
ing by user, namely that input tuples about the same user the average user session, how many links does a user click
are placed into a single output tuple. As another exam- on before leaving a website, how do click patterns vary in
ple, the example bags before and after the FILTER com- the course of a day/week/month. This analysis task mainly
mand illustrate the semantics of filtering by average pager- consists of grouping the activity logs by user and/or web-
ank, namely that input tuples with low average pagerank site, ordering the activity in each group by timestamp, and
are not propagated to the output. then applying custom processing on this sequence. In this
The procedure used in Pig Pen to generate a sandbox scenario, our nested data model provides a natural abstrac-
database starts by taking small random samples of the base tion for representing and manipulating sessions, and nested
data. Then, Pig Pen synthesizes additional data tuples to declarative operations such as order-by (Section 3.7) help
improve completeness. (When possible, the synthetic tuples minimize custom code and avoid the need for out-of-core
are populated with real data values from the underlying do- algorithms inside UDFs in case of large groups.
main, to minimize the impact on realism.) A final pruning
pass eliminates redundant example tuples to improve con- 7. RELATED WORK
ciseness. The details of the algorithm are beyond the scope We have discussed the relationship of Pig Latin to SQL
of this paper. and map-reduce throughout the paper. In this section, we
compare Pig against other data processing languages and
6. USAGE SCENARIOS systems.
In this section, we describe a sample of data analysis tasks Pig is meant for offline, ad-hoc, scan-centric workloads.
that are being carried out at Yahoo! with Pig. Due to con- There is a family of recent, large-scale, distributed systems
fidentiality reasons, we describe these tasks only at a high- that provide database-like capabilites, but are geared toward
level, and are not able to provide real Pig Latin programs. transactional workloads and/or point lookups. Amazon’s
The use of Pig Latin ranges from group-by-aggregate and Dynamo [5], Google’s BigTable [2], and Yahoo!’s PNUTS [9]
rollup queries (which are easy to write in SQL, and simple are examples. Unlike Pig, these systems are not meant for
data analysis. BigTable does have hooks through which data User interfaces: The productivity obtained by Pig can
residing in BigTable can be analyzed using a map-reduce job, be enhanced through the right user interfaces. Pig Pen is
but in the end, the analysis engine is still map-reduce. a first step in that direction. Another idea worth explor-
A variant of map-reduce that deals well with joins has ing is to have a “boxes-and-arrows” GUI for specifying and
been proposed [14], but it still does not deal with multi- viewing Pig Latin programs, in addition to the current tex-
stage programs. tual language. A boxes-and-arrows paradigm illustrates the
Dryad [12] is a distributed platform that is being de- data flow structure of a program very explicitly, as reveals
veloped at Microsoft to provide large-scale, parallel, fault- dependencies among the various computation steps. The
tolerant execution of processing tasks. Dryad is more flexible UI should also promote sharing and collaboration, e.g., one
than map-reduce as it allows the execution of arbitrary com- should easily be able to link to a fragment of another user’s
putation that can be expressed as directed acyclic graphs. program, and incorporate UDFs provided by other users.
Map-reduce, on the other hand, can only execute a simple,
two-step chain of a map followed by a reduce. As men- External functions: Pig programs currently run as Java
tioned in Section 4, Pig Latin is independent of the choice map-reduce jobs, and consequently only support UDFs writ-
of the execution platform. Hence in principle, Pig Latin ten in Java. However, for quick, ad-hoc tasks, the average
can be compiled into Dryad jobs. As with map-reduce, the user wants to write UDFs in a scripting language such as Perl
Dryad layer is hard to program to. Hence Dryad has its own or Python, instead of in a full-blown programming language
high-level language called DryadLINQ [6]. Little is known like Java. Support for such functions requires a light-weight
publicly about the language except that it is “SQL-like.” serialization/deserialization layer with bindings in the lan-
Sawzall [13] is a scripting language used at Google on top guages that we wish to support. Pig can then serialize data
of map-reduce. Like map-reduce, a Sawzall program also using this layer and pass it to the external process that runs
has a fairly rigid structure consisting of a filtering phase the non-Java UDF, where it can be deserialized into that
(the map step) followed by an aggregation phase (the reduce language’s data structures. We are in the process of build-
step). Furthermore, only the filtering phase can be written ing such a layer and integrating it with Pig.
by the user, and only a pre-built set of aggregations are
available (new ones are non-trivial to add). While Pig Latin Unified environment: Pig Latin does not have control
has similar higher level primitives like filtering and aggrega- structures like loops and conditionals. If those are needed,
tion, an arbitrary number of them can be flexibly chained Pig Latin, just like SQL, can be embedded in Java with a
together in a Pig Latin program, and all primitives can use JDBC-style interface. However, as with SQL, this embed-
user-defined functions with equal ease. Further, Pig Latin ding can be awkward and difficult to work with. For ex-
has additional primitives such as cogrouping, that allow op- ample, all Pig Latin programs must be enclosed in strings,
erations such as joins (which require multiple programs in thus preventing static syntax checking. Results need to be
Sawzall) to be written in a single line in Pig Latin. converted back and forth between Java’s data model and
We are certainly not the first to consider nested data mod- that of Pig. Moreover, the user now has to deal with three
els. Nested data models have been explored before in the distinct environments: (a) the program in which the Pig
context of object-oriented databases [11]. The programming Latin commands are embedded, (b) the Pig environment
languages community has explored data-parallel languages that understands Pig Latin commands, and (c) the pro-
over nested data, e.g., NESL [1], to which Pig Latin bears gramming language environment used to write the UDFs.
some resemblance. Just as with map-reduce and Sawzall, We are exploring the idea of having a single, unified envi-
NESL lacks support for combining multiple data sets (e.g., ronment that supports development at all three levels. For
cogroup and join). this purpose, rather than designing yet another language,
As for our debugging environment, we are not aware of any we want to embed Pig Latin in established languages such
prior work on automatically generating intermediate exam- as Perl and Python3 by making the language parsers aware
ple data for the purpose of illustrating processing semantics. of Pig Latin and able to package executable units for remote
execution.
8. FUTURE WORK
Pig is a project under active development. We are con- 9. SUMMARY
tinually adding new features to improve the user experience We described a new data processing environment being
and yield even higher productivity gains. There are a num- deployed at Yahoo! called Pig, and its associated language,
ber of promising directions that are yet unexplored in the Pig Latin. Pig’s target demographic is experienced proce-
context of the Pig system. dural programmers who prefer map-reduce style program-
ming over the more declarative, SQL-style programming,
“Safe” optimizer: One of the arguments that we have put for stylistic reasons as well as the ability to control the ex-
forward in the favor of Pig Latin is that due to its procedural ecution plan. Pig aims for a sweet spot between these two
nature, users have tighter control over how their programs extremes, offering high-level data manipulation primitives
are executed. However, we do not want to ignore database- such as projection and join, but in a much less declarative
style optimization altogether since it can often lead to huge style than SQL.
improvements in performance. What is needed is a “safe” We also described a novel debugging environment we are
optimizer that will only perform optimizations that almost developing for Pig, called Pig Pen. In conjunction with
surely yield a performance benefit. In other cases, when the the step-by-step nature of our Pig Latin language, Pig Pen
performance benefit is uncertain, or depends on unknown
data characteristics, it should simply follow the Pig Latin 3
Such variants would obviously be called Erlpay and Ython-
sequence written by the user. pay.
makes it easy and fast for users to construct and debug their [3] S. Chaudhuri, R. Motwani, and V. Narasayya. On
programs in an incremental fashion. The user can write a random sampling over joins. In Proc. ACM SIGMOD,
prefix of their overall program, examine the output on the 1999.
sandbox data set, and iterate until the output matches what [4] J. Dean and S. Ghemawat. MapReduce: Simplified
they intended. At this point the user can “freeze” the pro- data processing on large clusters. In Proc. OSDI, 2004.
gram prefix, and begin to append further commands, with- [5] G. DeCandia et al. Dynamo: Amazon’s highly
out worrying about regressing on the progress made so far. available key-value store. In Proc. SOSP, 2007.
While Pig Pen is still in early stages of development, the [6] Dryad LINQ.
core Pig system is fully implemented and available as an http://research.microsoft.com/research/sv/DryadLINQ/,
open-source Apache incubator project. The Pig system com- 2007.
piles Pig Latin expressions into a sequence of map-reduce [7] R. Elmasri and S. Navathe. Fundamentals of Database
jobs, and orchestrates the execution of these jobs on Hadoop, Systems. Benjamin/Cummings, 1989.
an open-source scalable map-reduce implementation. Pig
[8] J. Gray et al. Data cube: A relational aggregation
has an active and growing user base inside Yahoo!, and with
operator generalizing group-by, cross-tab, and sub
our recent open-source release we are beginning to attract
totals. Data Min. Knowl. Discov., 1(1):29–53, 1997.
users in the broader community.
[9] C. S. Group. Community Systems Research at Yahoo!
SIGMOD Record, 36(3):47–54, September 2007.
Acknowledgments [10] Hadoop. http://lucene.apache.org/hadoop/, 2007.
We are grateful to the Hadoop and Pig engineering teams [11] R. Hull. A survey of theoretical research on typed
at Yahoo! for helping us make Pig real. In particular we complex database objects. In XP7.52 Workshop on
would like to thank Alan Gates and Olga Natkovich, the Database Theory, 1986.
Pig engineering leads, for their invaluable contributions. [12] M. Isard et al. Dryad: Distributed data-parallel
programs from sequential building blocks. In European
Conference on Computer Systems (EuroSys), pages
10. REFERENCES 59–72, Lisbon, Portugal, March 21-23 2007.
[1] G. E. Blelloch. Programming parallel algorithms. [13] R. Pike, S. Dorward, R. Griesemer, and S. Quinlan.
Communications of the ACM, 39(3):85–97, March Interpreting the data: Parallel analysis with Sawzall.
1996. Scientific Programming Journal, 13(4), 2005.
[2] F. Chang et al. Bigtable: A distributed storage system [14] H.-C. Yang, A. Dasdan, R.-L. Hsiao, and D. S. Parker.
for structured data. In Proc. OSDI, pages 205–218. Map-reduce-merge: Simplified relational data
USENIX Association, 2006. processing on large clusters. In Proc. ACM SIGMOD,
2007.

Das könnte Ihnen auch gefallen