
Technische Universität Berlin
Fakultät IV
Master Thesis Computer Science

Performance Evaluation of Dynamic Resource Allocation for Big Data Jobs

Jan Henning, TU Berlin, 336291
July 16, 2016

Supervisor: Prof. Anja Feldmann, Ph. D.
Advisors: Dr. Stefan Schmid, Carlo Fürst, Lalith Suresh, Niklas Semmler

Submission Date: July 16th, 2016

More and more computationally intensive applications are executed in virtualized cloud environments. Although this approach can be cost-efficient and potentially enables a high utilization of the available cloud resources, it can still lead to a job being impaired by other users in the same cluster. This has a negative impact on the speed and the predictability of the jobs. Other factors that can lead to fluctuating execution times are found in the underlying cluster architecture, for example faulty hardware or a congested network connecting the nodes within the cluster. This can cause problems in a variety of application scenarios: the job could, for instance, be part of a larger processing chain that stalls because of the delay in the job execution. Another example are scenarios in which the results of the application quickly lose their value, such as weather computations, and therefore have to be available by a certain point in time.

The goal of this thesis is to achieve predictable runtime behavior for Big Data jobs. "Predictable" in this context means that the owner of the job is provided with information about the progress of his running job and receives a prediction of whether it can meet a deadline he has set. A second goal of the thesis is to achieve this predictable runtime behavior with as little prior information about the behavior of the job as possible. While this approach can potentially lead to worse predictions, it should ultimately result in a more intelligent and more dynamic system. A further advantage is that the need for time-consuming trial runs in advance is eliminated, which is moreover the only viable option for jobs that are executed only once. In addition, the thesis proposes a method for dynamically increasing the resources available to a Big Data job in order to react to changing conditions in the cluster and to meet a deadline that would have been missed had the job simply continued as before.

The main contribution of this thesis towards solving the problem of predictable performance in cloud-based environments is the implementation of a Hadoop-based system that offers dynamic resource management in combination with deadline support. In a series of experiments in a small cloud environment, the accuracy of the runtime prediction is verified and the effectiveness of elastic resource management for a Big Data job is evaluated. The thesis concludes that even relatively simple algorithms provide, under good conditions, a prediction accuracy that is sufficient for most use cases, but that making an accurate prediction becomes harder as cluster resources become scarce or the job exhibits unpredictable execution patterns. Based on the experience gained from conducting the experiments, suggestions are therefore derived that can increase the prediction accuracy.

More and more computationally expensive applications are executed in virtualized cloud environments. While this approach can be cost-effective and potentially allows for high resource utilization, it can also lead to interference with other users (tenants) of the cloud environment or, in a narrower sense, the cluster. This can have a negative impact on the predictability of the execution times of the running jobs. Other factors are the underlying hardware and topology of the chosen cloud environment; for example, faulty hardware or a congested network can cause a significant slowdown of the job. This can cause problems in a variety of scenarios, for example if the Big Data job is part of a larger processing chain and the next part of it can only start once the results of the job are available. Another scenario would be jobs with hard deadline requirements, like weather predictions or statistical applications, whose results lose their value after a certain amount of time.

The main goal of this work is to achieve more predictable performance for Big Data jobs. Predictable here means that the tenant has information available about how his or her job progresses and whether the job can meet the given timing constraints. While striving for predictable performance, another goal is to have as little profiling information from previous job iterations available as possible. While this can lead to potentially less precise predictions, the resulting system should end up more universal and intelligent. It also eliminates the need for potentially time-consuming profiling runs, which additionally do not work if the job is singular and not recurring. In addition, the work tries to find methods to influence the job execution in specific ways, like assigning more cluster resources to the job, in order to help maintain the provided constraints.

The main contributions of this work towards solving the problem of predictable performance in cloud-based environments include the assessment of current approaches in the field, their implementation and integration into Hadoop for measurement purposes, and the execution of a series of experiments on real hardware that mimics a small cloud environment. The essential insights gained can be divided into the theoretical examination of different approaches in several scenarios and how they fit into existing solutions. However, the main conclusions are gathered from the second part: a series of experiments that were run in a reproducible manner, ranging from the as-is state to more sophisticated approaches for extrapolating the remaining time a job takes, in combination with elastic resource management. The work aims for a realistic evaluation of different approaches for predicting the (remaining) time a job takes and their respective accuracy. The other part of the conclusion is to measure and assess how elastic resource management can influence the actual runtime behavior of a job and how useful this technique is for keeping the job's timing constraints. The result of this work is that even with simple algorithms a runtime prediction can be good enough for most use cases, but it gets gradually harder the more adverse factors come into play. Proposals on how to increase the accuracy of the prediction are also discussed as a direct result of the conducted experiments.

Contents

1 Introduction
  1.1 Structure of the thesis
2 Background
  2.1 YARN
3 Overview of related work
  3.1 Academical Papers
    3.1.1 Kraken: Online and Elastic Resource Reservations for Multi-tenant Datacenters
    3.1.2 Longest Approximate Time to End: LATE
    3.1.3 How the LATE estimation works
    3.1.4 Jockey
    3.1.5 Quasar
    3.1.6 Effective Straggler Mitigation: Attack of the Clones
  3.2 Evaluated Implementations
    3.2.1 TEZ
4 Realizing elastic resource management within Big Data environments
  4.1 State of the Art
  4.2 Schematic overview
    4.2.1 Conceptual decisions
  4.3 Core Algorithms
    4.3.1 Scaling of parallel running tasks
    4.3.2 Calculating speedup using Amdahl's Law
    4.3.3 Computing of predicted job end time
  4.4 How to implement custom tasks: Usage of the Task Interface
  4.5 Challenges of elastic operation
    4.5.1 Implementing elastic resource management in Hadoop/YARN
  4.6 Data transfer between Reducers and Mappers
  4.7 Protocol and message format for out-of-band communication
    4.7.1 Task → Application Master
    4.7.2 Application Master → Task
5 Structure, scope and challenges of the performance evaluation
  5.1 Overview
  5.2 Scale of experiments and workload
    5.2.1 Job structure for conducted experiments
  5.3 Results of the measurements
    5.3.1 How to measure the job progress?
    5.3.2 Summary
  5.4 Cluster setup
  5.5 Workloads and input size
  5.6 Benchmarks
    5.6.1 Per-Task Measurements
    5.6.2 Per-Job Measurements
6 Summary of insights and conclusion
  6.1 Optimal conditions and limitations of elastic resource allocation and runtime predictions
  6.2 Technical and conceptual limitations of the elastic approach
7 Future Work
  7.1 Improving straggler mitigation
  7.2 Better identification of performance bottlenecks
  7.3 Employ more sophisticated methods for job finishing estimation
  7.4 Improving the proof of concept implementation
References

Chapter 1
Introduction
This thesis investigates and devises approaches that can be used to achieve predictable and scalable performance for Big Data applications running on computing clusters interconnected with widely used and cost-efficient networking equipment like Fast Ethernet or 10GbE. The term scalable in this context means that the central instance managing the job is able to identify bottlenecks and, in reaction to them, can affect the job execution by increasing its resources at runtime without the need to restart it. The main contribution of this thesis is the implementation of a system that has the ability to predict the runtime behavior of the job and react accordingly in order to meet a desired time constraint the job might have.

This work focuses on the Apache Hadoop framework as a testbed, where it is currently not possible to make exact predictions of the job progress or to add more resources to a running job in order to speed it up when necessary. Other Map/Reduce implementations like CouchDB [12] or Dynamo [4], which basically use Map/Reduce as a method to query databases instead of running generic, completely user-defined jobs, will not be taken into consideration here.

In order to evaluate the effectiveness of these algorithms, several experiments using Hadoop, a real-world, open source framework for distributed computing and storage, are conducted. These experiments are carried out on real hardware consisting of four compute nodes and make use of a newly designed and implemented central element, a custom Application Master, which uses an asynchronous messaging system to keep track of the job execution. In summary, the contributions of this work towards a more predictable and scalable job execution are:

- Assess the runtime behavior of the job and try to make a good prediction of when the job will finish in its current environment. One important goal here was that the prediction should be as accurate as possible without much prior knowledge of the runtime behavior of the job, e.g. without any form of profiling data available.

- In order to make educated guesses about the predicted end time of the job, the identification of especially slow or lagging tasks is vital. For the experiments conducted in this thesis the approach described in the LATE paper [22] is utilized to achieve that, since it showed good results in the course of the benchmarks.

- When the assessment of the job runtime (which is done at a specified sample rate throughout the whole job) shows that it will most likely miss its deadline, the system has to react to that. The approach chosen for this work is to increase the resources the job can use in order to decrease its runtime so that it can meet the provided deadline.

- The last part is the evaluation of the results and an assessment that reviews the different approaches that were chosen and their potential drawbacks. This also includes a discussion of the circumstances under which the chosen techniques do not work well or, on the contrary, of the scenarios in which the implemented and tested approaches can work well and help to achieve predictable, scalable performance for Big Data jobs that utilize Hadoop.

The bundled CD contains the latest iteration of the proof of concept implementation at the time of writing and its documentation as Javadoc.

1.1 Structure of the thesis


The thesis at hand is broken up into several chapters that each deal with one particular topic. In the background, covered in chapter 2, the current state of performance prediction in Big Data applications will be discussed and a brief overview of the related work this thesis is built upon will be given.

The following chapter 3, Related Work, will give a more in-depth review of the associated work and state which actual algorithms and ideas were used from it and how they were incorporated into the conducted benchmarks, as well as more general ideas on how to solve the problem of predictable runtime behavior. The actual environment where all of the experiments take place is discussed in chapter 5, Experiments. After the frame conditions are set, the actual workload and the artificially created scenarios, in which, for example, stragglers are introduced, will be discussed. The chapter aims to provide a general overview and will also discuss the actual results. Chapter 4 will give an overview of the technical details of the hard- and software stack used for running the benchmarks. This is important in order to be able to set the benchmarks in the right scope and judge their expressiveness for different environments. In particular, the architecture of the custom Application Master will be described here. This should give the reader an idea of what can and what cannot be done with the implementation created for this work and the experiments.

Following the measurements, the conclusions in chapter 6 attempt to draw more general deductions from the experiments and put them into a global perspective regarding the related work and other Big Data scenarios. Finally, the future work will be presented in chapter 7, where an outlook for additional improvements will be given. This includes possible enhancements to improve the accuracy of the execution time predictions presented here as well as supplementary improvements to the implementation.

Chapter 2
Background
This chapter describes the general background and scope of this thesis. It gives a detailed description of the problem this work tries to solve and of the current state of dynamic resource allocation in Big Data environments, and especially Hadoop, at the time of writing. One of the biggest problems when deploying Big Data jobs on a cluster using Hadoop is that after submitting the job the user does not have many possibilities to influence its runtime behavior through the Job() interface exposed by the Map/Reduce framework. So if the job straggles there is no possibility to increase its resources at runtime in order to speed up the execution. This can eventually lead to missed deadlines for time-critical applications. Although there has been an effort underway since 2013 to allow the dynamic adjustment of resources on a per-container basis (see YARN-1197 [17]), it has not been incorporated into the YARN upstream at the time of writing. Also, the Spark project has put effort into dynamic resource allocation and incorporated it into Spark with version 1.2, and most likely other projects will eventually deploy a similar solution. These efforts clearly show that there is a need for this kind of technology.

In addition to the ability to increase job resources at runtime there needs to be some sort of policy for when to do it. For example, the tenant could provide a deadline which the system tries to adhere to, using more resources, if available, when this deadline is violated. Hadoop and most other data processing systems, however, do not have a mechanism to employ a deadline and therefore cannot act upon it. The most important ingredient for a deadline-aware mechanism is the ability to predict at which point in time the job will most likely end. Without that prediction it would not be possible to determine whether the given deadline will be missed or not. With that in mind, the general goals of this work with regard to Hadoop can be summarized as follows:

1. Provide a mechanism to estimate the finishing time of the job that is as accurate as possible and robust enough to cope with stragglers.
2. Allow the Application Master to keep track of the deadline and detect a possible violation.
3. Deploy methods for increasing the resources of that job at runtime without the need to restart it and potentially lose some (or all, depending on the workload and the current state the job is in) of its progress.

At the time of writing the stable release of Hadoop/YARN does not provide any of these
possibilities and implementing them in order to be able to conduct several measurements
is one of the main goals of this work.

2.1 YARN
In this section the current capabilities of YARN (Yet Another Resource Negotiator) will be discussed, together with a more in-depth explanation of why the actual implementation that was done for the measurements interacts with it directly. Additionally, the general architecture of YARN will be discussed and what made it a sane choice as a foundation for conducting and assessing the following experiments.

YARN can be thought of as the core engine Hadoop utilizes since version 0.23. It forms the main component which is utilized by higher-level frameworks like Map/Reduce or Spark that make use of the underlying architecture of YARN. Its core idea is that every part that is responsible for a specific set of work runs as its own daemon on an arbitrary node and can be controlled separately in order to achieve high fault tolerance. For example, if one node daemon stops due to faulty hardware or a software bug, YARN can dynamically react to that and is still able to finish the job it was assigned without the need for interaction with the job owner.

The main task of YARN is therefore to keep track of the nodes, their execution units (containers) and the resource usage. It does not provide any high-level capabilities beyond these. Since it would be impractical to force the user to keep track of things like container handling and communication between the components, usually a high-level framework that sits on top of YARN, like Map/Reduce or Apache Spark, is utilized, which exposes a set of functionality to the user. To draw a comparison to general computing, YARN can be seen as an operating system that provides a set of low-level system calls, which higher-level applications use to expose an easier and more friendly interface to the user and to extend the functionality. The frameworks on top of YARN can therefore be seen as the applications that make use of the core features that YARN offers.



Figure 2.1: YARN architecture, source: http://hadoop.apache.org

The following list explains the most important components of YARN and what they do:

1. Resource Manager: The central element inside YARN that keeps track of the resource usage and is responsible for granting or denying resources a tenant asks for. It has a global view of the cluster and knows which resources are available on which node due to periodic communication with them.

2. Node Manager: The Node Manager, as the name suggests, manages one node of the cluster. Usually a node is a distinct machine inside the cluster that can work on one or more tasks that belong to a job. It has to communicate with the Resource Manager in order to ask for computational resources that are needed for a task assigned to the node it is responsible for.

3. Application Master: The Application Master runs on a node itself and controls the job execution. It can directly ask the Resource Manager for containers that can execute a sub-task and usually also keeps track of the job progress. It is essentially the central element of a job and handles the communication with the different components of YARN.
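To make this interplay more concrete, the following sketch shows how an Application Master typically registers with the Resource Manager and asks it for one additional container via the asynchronous YARN client API (AMRMClientAsync). This is an illustration only and not code from the thesis implementation; the class name, heartbeat interval and resource sizes are placeholders.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
    import org.apache.hadoop.yarn.client.api.async.AMRMClientAsync;

    public class ContainerRequestSketch {
        // Illustrative sketch: registers with the Resource Manager and asks for one
        // additional container. The supplied callback handler is notified once a
        // container has been allocated and can then launch a task inside it.
        static AMRMClientAsync<ContainerRequest> requestOneContainer(
                Configuration conf, AMRMClientAsync.CallbackHandler handler) throws Exception {
            AMRMClientAsync<ContainerRequest> rmClient =
                    AMRMClientAsync.createAMRMClientAsync(1000, handler); // heartbeat every second
            rmClient.init(conf);
            rmClient.start();
            rmClient.registerApplicationMaster("", 0, "");       // host, RPC port, tracking URL
            Resource capability = Resource.newInstance(1024, 1); // 1 GB of memory, 1 vcore
            rmClient.addContainerRequest(
                    new ContainerRequest(capability, null, null, Priority.newInstance(0)));
            return rmClient;
        }
    }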
In order to achieve predictable performance and enable the use of deadlines, YARN mainly lacks three important features at the time of writing. Firstly, there is no ability to increase or decrease the resources of a job at runtime. What YARN allows, though, is the termination of a container and the requesting of a new one with increased resources. This can lead to the loss of intermediate results computed by that container, however. Secondly, it lacks bandwidth awareness. In clusters interconnected by standard Ethernet hardware it is often the case that limited bandwidth or congested links lead to a bottleneck when it comes to fetching or transferring data. Without the ability to guarantee bandwidth constraints it can be impossible to make any sort of deadline projection. Lastly, there are no mechanisms to predict when the job will presumably end. Having a robust method for doing this is crucial for implementing deadlines and reacting to a possible deadline miss as early as possible during the job execution.


Chapter 3
Overview of related work
This chapter describes the existing academic work which proved useful and was taken into account for this thesis. The main topics these papers deal with are the identification of stragglers, which are basically tasks that run multiple times slower than expected, and the prediction of the approximate end time of the whole job. Some of this work also combines these topics in order to achieve more predictable performance in data centers, for example the approach that led to the development of Jockey [7]. In addition to the academic papers, other implementations that were evaluated will be described in this chapter. This will mainly focus on Tez [15], Mesos [9] and Spark [21], which, at least in some parts, tackle the same problems that are discussed in this thesis.

3.1 Academical Papers


3.1.1 Kraken: Online and Elastic Resource Reservations for
Multi-tenant Datacenters
This paper deals with predictable job run times and also set the general topic for this thesis. However, the Kraken algorithm focuses mainly (but not exclusively) on the network aspect of Big Data jobs, whereas the proof of concept implementation done for this thesis aims to achieve this goal by assigning more resources to the running job in order to meet a deadline given by the job owner. Still, the network part of the problem is also important, since especially the transition from the map phase to the reduce phase can be quite demanding on network resources and has the potential to slow a job down significantly, thereby making a prediction about its end time harder. The general idea of Kraken is to use an online approach, which means that no information has to be supplied beforehand by the tenant. This has the advantage that it can learn while the job is running and react dynamically to changing conditions within the cluster, like bandwidth becoming scarce. Also, for many jobs it is simply not known beforehand how they will behave and how resource-intensive their specific parts are. This is especially the case with non-recurring jobs that are executed just once or with greatly varying datasets. Kraken achieves this goal by letting the tenant dynamically request and update minimum resource constraints at runtime.

Albeit the proof of concept implementation contains testing code for dynamic bandwidth reservation via the cgroup net_cls [14] Linux interface in combination with tc [18], the measurements were done without it because of technical problems and time constraints. However, the main idea of using an online approach for solving the problem of predictable job run times was largely adopted from Kraken, which makes it a fundamental conceptual groundwork for this thesis.

3.1.2 Longest Approximate Time to End: LATE


The main idea of LATE [22] which turned out to be useful for achieving more predictable performance for Big Data jobs is its approach of taking the heterogeneity of the nodes that make up a cluster into account. This means that each individual node can have better or worse network connectivity to the rest of the cluster and/or can differ in computing resources like main memory or processing power. This is done by identifying potentially slow nodes and quantifying how slow they actually are compared to the other nodes. This stems from the assumption that a node that is just slightly slower than the average one can also be a worthwhile candidate for executing speculative tasks if a faster one is not available.

In addition, the estimated finishing time of a task is calculated and taken into consideration when an actual task for speculative execution has to be chosen. A basic example for this might be a task that suffers from a low progress score but should end quite soon according to the calculated estimated time to finish. Quite soon here means that it will presumably finish considerably sooner than a new, speculative task would. In that case it is more resource-efficient, and does not affect the overall runtime of the job negatively, to just let the task in question finish and to avoid spawning a speculative task at all.

Principle of operation of LATE

This section describes the inner workings of the LATE algorithm in more detail and states which ideas were extracted from it for the measurements that will be discussed in a later part of this work. In addition to the previous section, the general assumptions and ideas behind LATE can be summarized as follows:

- Speculative execution is difficult in heterogeneous environments.
- The current Hadoop scheduler does not scale well in heterogeneous clusters and can make job performance there even worse.

LATE proposes to mitigate these shortcomings by building on three main principles:

1. Prioritize tasks to speculate on.
2. Select appropriate nodes for the backup task.
3. Limit the number of speculative tasks to prevent system thrashing, which can occur if too many tasks are transferring the same file over and over again.

Figure 3.1: High-level overview of the LATE algorithm

Figure 3.1 shows a high-level description of the approach chosen by LATE. Since the algorithm employs a limit on the maximum number of speculative tasks that can be run at a time, this is checked first. When the limit is already reached, the algorithm returns immediately. When a speculative task can be launched, however, the next question is which task should receive a backup task. To answer it, the estimated time to finish is computed for each running task. How this is done is shown in figure 3.2 and will be covered separately following this section. The result of this step is a list in which all running tasks are ranked according to their finishing time, with later end times ranked higher than tasks that will complete sooner. Basically, tasks that will take the longest are most likely to receive a backup task. The next step avoids using speculative execution on tasks that are considered fast and have a high progress rating. The main reason behind this is that a fast progressing task with a late estimated end time most probably will not be sped up by speculative execution, because it is unlikely that the backup task will execute any faster, and it would thus waste resources. Finally, if the task is considered slow and there are still slots left for speculative tasks, the chosen task will receive a backup task. It has to be noted, however, that the LATE paper focuses on jobs where each task is quite short-lived and takes around 8 to 10 minutes to complete.

Current state of the Hadoop scheduler

To decide which task to run next, the Hadoop scheduler uses one of three categories. Their priorities are sorted in descending order, with the highest priority on top:

1. Try to run a previously failed task. This is done to detect tasks that fail repeatedly in order to locate possible bugs and to stop the whole job.
2. Start a non-running task. Tasks which have their data local to the given node get a higher priority and should run first.
3. If there are still free slots after 1) and 2) are satisfied, Hadoop will use the available slot for speculative execution.

The selection of a task for speculative execution is based on the task progress, which lies between 0 and 1. For map tasks, the metric to decide how far the task has progressed is the amount of input data read by it. For a reduce task, three phases are looked at:

1. Copy phase, i.e. the fetching of the outputs from the map tasks.
2. Sort phase, i.e. the sorting of the output by key.
3. Reduce phase, where finally the user-provided function is applied to the list of map outputs for each key.

For every phase the fraction of processed input data is used for measuring the progress, similar to the map phase. Hadoop then uses this calculated progress to identify stragglers. When the task progress is less than the average for its category minus 0.2 and the task has run for at least one minute, Hadoop marks it as a straggler. This has the conceptual weakness that tasks with a progress > 0.8 can never be selected for speculative execution, because progresses over 0.8 cannot be slower than the ones they are compared with, given the skew factor of 0.2 and the maximum task progress of 1.0. In other words, if a task begins to straggle after it has reached a progress of 80% it cannot be marked as a straggler by Hadoop anymore.
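Expressed as code, this stock heuristic boils down to a check along the following lines (a simplified sketch; the real scheduler operates on its internal progress bookkeeping):

    // Simplified sketch of Hadoop's built-in straggler heuristic: a task is a
    // speculation candidate if it trails the average progress of its category
    // by more than 0.2 and has been running for at least one minute.
    static boolean isStraggler(double taskProgress, double categoryAvgProgress, long runtimeMillis) {
        return taskProgress < categoryAvgProgress - 0.2 && runtimeMillis >= 60_000;
    }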



The scheduler used by Hadoop makes several direct assumptions:
1. Nodes all perform roughly the same.
2. Tasks progress at a constant rate.
3. No costs for launching a backup task on a node that would otherwise have an idle
slot.
4. Every task phase is weighted in the progress calculation by the same amount.
5. Tasks finish in waves, so a task with a low progress score is likely a straggler.
6. Tasks in the same category (map or reduce) require roughly the same amount of
time.
Many of these assumptions do not hold true in a virtualized cloud environment, though.

3.1.3 How the LATE estimation works


Figure 3.2: Estimation of finishing time per task

The general idea of the LATE scheduler is to speculatively execute the task that will finish farthest into the future and suffers from slow progress. To measure the progress of a task, a ProgressScore divided by T, the amount of time the task has been running, is used. The general steps taken for this are visualized in figure 3.2. The approximate time the task will take is calculated as (1 - ProgressScore) / ProgressRate. An example of a LATE approximation for a task that is done to 50% would be:

Algorithm 1 Example of LATE prediction

    progressScore_task = 3 / 6
    time = 60                                        ▷ time in seconds
    progressRate_task = (3 / 6) / time               ▷ result is 0.0083
    estimatedEnd = (1 - 3 / 6) / progressRate_task   ▷ result is 60.24 seconds

The approximation here makes sense since the task needed 60s to get halfway through.
So roughly another 60s are needed for completion if the underlying assumption that task
progress is linear holds true.
The actual algorithm also needs to take the following factors into account:

- SpeculativeCap: the maximum number of speculative tasks that can run at once.
- SlowTaskThreshold: a task's progress rate is compared to this threshold in order to decide whether it gets speculatively executed, i.e. whether it is "slow enough".
- SlowNodeThreshold: this value is used to decide whether or not to launch a speculative task on a given node. It is needed in order to avoid launching a backup task on a slow node.
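The following sketch summarizes how the estimation and the thresholds fit together. It illustrates the LATE selection logic as described above and is not the code used for the thesis measurements; the task representation and the way the threshold values are supplied are assumptions.

    import java.util.Comparator;
    import java.util.List;
    import java.util.Optional;

    class TaskInfo {
        double progressScore;   // between 0 and 1
        double elapsedSeconds;  // time the task has been running so far

        double progressRate()      { return progressScore / elapsedSeconds; }
        // LATE estimate: remaining work divided by the observed progress rate.
        double estimatedTimeLeft() { return (1.0 - progressScore) / progressRate(); }
    }

    class LateSelection {
        // Sketch only: picks the slow task that is expected to finish farthest in the
        // future. speculativeCap and slowRateCutoff correspond to SpeculativeCap and
        // the rate derived from SlowTaskThreshold.
        static Optional<TaskInfo> pickCandidate(List<TaskInfo> runningTasks,
                                                int speculativeTasksRunning,
                                                int speculativeCap,
                                                double slowRateCutoff) {
            if (speculativeTasksRunning >= speculativeCap) {
                return Optional.empty();                                           // cap reached
            }
            return runningTasks.stream()
                    .filter(t -> t.progressRate() < slowRateCutoff)                // only slow tasks
                    .max(Comparator.comparingDouble(TaskInfo::estimatedTimeLeft)); // latest finisher
        }
    }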

3.1.4 Jockey
The main goal of Jockey is quite similar to the one pursued in this thesis: making guarantees on job finishing times by providing a deadline for them. In their work the authors focused on the Cosmos ecosystem [8] developed by Microsoft and did not take Hadoop/YARN into account. The ideas presented there nevertheless proved useful in developing the proof of concept implementation for carrying out the practical experiments presented in the later parts of this thesis. In contrast to the work presented in this thesis, which uses a pure online approach, the team behind Jockey decided to use a mixture of an on- and offline approach.

The first step is to analyze the job, which is described in the SCOPE job description language [3], using a simulator developed by the Jockey team. Basically this simulator works with two sorts of input: sample data from previous runs and the high-level structure of the job, i.e. stages, tasks and their dependencies on each other. It also receives the resources that will be assigned to that job and computes an approximate finishing time for it. According to their work the predictions achieved with this simulator are quite close to the real finishing times, with an error margin of around 10% depending on the job size. This first offline step therefore gives an idea about the structure of the job and provides a starting point for the online resource management.



The second step deals with the dynamic resource management that is done while the job runs, i.e. in the online phase. This part is quite similar to the approach chosen in this work: monitoring the progress of the job and assigning more resources to it when the prediction hints that the job will most likely miss its provided deadline. However, the method chosen for job progress estimation differs from the LATE approach utilized in the thesis at hand. While also taking the progress of each task the job consists of and its runtime into account, a measure C(p, a) based on the simulator results is used. Basically it says that, based on the simulator results, the job needs n more minutes to complete with a progress p and a allocated tokens. A token can be thought of as a guaranteed share of cluster resources, and its quantity in terms of resources is defined by the administrator of the cluster.
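The resulting control decision can be pictured roughly as follows. This is purely an illustration of the idea behind C(p, a) and not Jockey's actual code; the table values and token levels are invented.

    class JockeyStyleLookup {
        // C(p, a): predicted remaining minutes for a discretized progress p (rows)
        // and an allocation of a tokens (columns). All values here are invented.
        static final int[]      TOKEN_LEVELS      = { 1, 2, 4 };
        static final double[][] REMAINING_MINUTES = {
            { 40.0, 22.0, 12.0 },   // p = 0.25
            { 27.0, 15.0,  8.0 },   // p = 0.50
            { 13.0,  7.0,  4.0 },   // p = 0.75
        };

        // Picks the smallest allocation whose predicted completion still meets the deadline.
        static int tokensNeeded(int progressIndex, double minutesToDeadline) {
            for (int i = 0; i < TOKEN_LEVELS.length; i++) {
                if (REMAINING_MINUTES[progressIndex][i] <= minutesToDeadline) {
                    return TOKEN_LEVELS[i];
                }
            }
            return TOKEN_LEVELS[TOKEN_LEVELS.length - 1]; // deadline cannot be met, use the maximum
        }
    }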
In their conclusion the Jockey team states that 99% of the Service Level Objectives were met during their test runs without too much over-provisioning of the underlying cluster resources. However, they also note that Jockey turned out to be more effective at slowing down a job that is running too far ahead of the provided deadline than at speeding up jobs that were falling behind. This is useful in order to achieve better cluster utilization but might be problematic when meeting a deadline is more important than utilizing the cluster as optimally as possible. The scenarios where this can occur are manifold, for example when the output of the job is part of a bigger pipeline and therefore slows down the whole chain.

The main differences between Jockey and the approach at hand are, in no specific order:

- The underlying system. In this work Hadoop was chosen since it is widely used and can be analyzed and modified further due to its open source nature, in contrast to Microsoft's in-house Cosmos solution.

- Due to the development of a custom Application Master and job model, the degree of control achieved for the measurements is much higher. One problem that Jockey was facing is getting an exact measure of the job progress, which turned out to be easier using an own implementation.

- In contrast to Jockey, a pure online approach was chosen for this work. In general, an offline approach works best for recurring jobs which share a common structure. However, if the nature of the jobs differs a lot, a pure online approach might yield better results and was therefore chosen for this work.

3.1.5 Quasar
Quasar [5] is a cluster management system that aims for increased resource utilization and high performance for the applications that run on that cluster. In doing so, Quasar does not rely on a fixed resource reservation on the underlying cluster, which can potentially lead to poor utilization of resources and decreases the overall flexibility. The main reason for that decision, from the point of view of the Quasar team, is the fact that the exact requirements and the runtime behavior of a job are often not known beforehand. This holds especially true if the cluster executes a lot of non-recurring jobs that differ significantly from each other, so performance evaluations from previous runs do not help much in understanding the upcoming job. The Jockey team proposed and developed their simulator for exactly that reason: understanding the job structure as well as possible in order to make educated guesses about the estimated resource requirements.
In contrast to Jockey, Quasar does not rely on a simulator but gives the tenant the possibility to express performance constraints for the job at hand. This assumes that the user running the job has a general idea about what kind of job it is and what its main requirement is in order to perform well. While that assumption might not hold every time, in a lot of cases it should. For example, Quasar allows the tenant to specify a constraint regarding queries per second for latency-critical workloads or, more related to the thesis work at hand, execution time when embedded into Hadoop. This first step falls into the offline category since it does not utilize any runtime information. The second step Quasar takes is conducted online and does limited (a few seconds up to a few minutes) profiling of the job and its dataset. This basic profiling information is combined with the offline constraints the user expressed in the first step and leads to a classification of the job. This classification basically tries to estimate how the job will perform if the number of nodes or the amount of resources a node can utilize is varied, and tries to find a near-optimal resource allocation for the job given its constraints. The last step the Quasar management system performs is the monitoring of the actual performance of the workload. For example, if some of the assigned resources are idle, Quasar adjusts the resource allocation at runtime and might free some of the unused nodes so they can be used for other jobs running on the cluster. However, in the worst case, when the classification proves incorrect at runtime and the user-provided constraint cannot be fulfilled, Quasar might reschedule the workload from scratch.

Since the implementation for the benchmarks conducted in this thesis aims for a pure online approach, the first step of Quasar was not considered for the practical implementation. However, the general idea of a coarse-grained classification that takes information from the job owner into account makes sense and helped to increase the accuracy of the runtime prediction of the custom Application Master that was implemented. Albeit simpler in practical terms and more restricted than the Quasar approach, the decision to give the custom Application Master the ability to "weight" each phase of a task made the finish time prediction using a slightly modified variant of LATE more accurate. For example, if we assume the job owner knows that each reduce task will spend most of its time in the network-limited fetch phase, he could weight that phase more heavily at the expense of the computational phase. The user thereby helps the implemented, modified LATE predictor to produce more accurate results. However, this would violate the strictly online approach, so weighting each phase is not a requirement and static default values which weight each phase equally are used instead. But if the user has at least approximate knowledge of the job structure, a more precise deadline prediction basically comes for free if he or she assigns a sane weighting.
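As an illustration of that idea, a weighted progress score for a reduce task could be computed along the following lines. This is a sketch only; the phase names follow the fetch/sort/reduce split used in this thesis, and the weights are user-supplied hints that default to equal values.

    class WeightedProgress {
        // Sketch: combines the per-phase completion (each between 0 and 1) of a reduce
        // task into one progress score, honoring optional per-phase weights.
        static double progressScore(double fetchDone, double sortDone, double reduceDone,
                                    double wFetch, double wSort, double wReduce) {
            double totalWeight = wFetch + wSort + wReduce;
            return (wFetch * fetchDone + wSort * sortDone + wReduce * reduceDone) / totalWeight;
        }
    }

A job owner who knows that the fetch phase dominates could, for example, call progressScore(f, s, r, 2.0, 1.0, 1.0), which counts the network-bound phase twice as much as the other two.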


3.1.6 Effective Straggler Mitigation: Attack of the Clones


This paper, written by Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker and Ion Stoica, proposes Dolly [2], an algorithm that works by creating full clones of small tasks and thereby avoids speculative execution altogether. Due to the fact that stragglers (the Dolly paper classifies tasks that are on average eight times slower than the median as such) negatively affect the overall performance of the job almost every time, there is also a decrease in the accuracy of the finishing time prediction of the job. Since the approach chosen in this thesis for achieving predictable performance relies strongly on a good runtime prediction, it was important to take a look at techniques to lessen the effect of stragglers. However, effective straggler mitigation is a vast and complex topic in itself and there are many different approaches to lessen their impact. Those range from replication-based algorithms which make use of statistical methods to determine which task gets replicated [19] up to more complex machine-learning-based ones as described in [20]. For that reason this thesis only shows the impact of stragglers and their negative effect without employing actual mechanisms to avoid them. However, adding a solid mechanism for straggler mitigation could help to improve the implementation done for this thesis and is left for future work. Of all the methods evaluated, Dolly seems to be the best fitting one for the scope of the experiments conducted for this thesis since it focuses on small jobs. Small, according to the Dolly team, means jobs that consist of fewer than ten tasks, which made up 80% of the Hadoop jobs at Facebook they analyzed. This is roughly comparable to the job size used for the conducted experiments presented in the corresponding chapter 5 later in this thesis.
In general there are two different kinds of approaches when dealing with straggler mitigation: blacklisting-based ones and approaches based on speculative execution. The first one works by blacklisting slow nodes and avoiding scheduling tasks on these nodes. The main problem with this approach is that stragglers can also occur on non-blacklisted machines for reasons not directly tied to faulty hardware, e.g. periodic maintenance operations, background services or contended I/O. Because the blacklisting approach does not get rid of the whole straggler problem, a second approach utilizing speculative execution was proposed. Basically, this works by collecting profiling data at runtime (online approach) or from previous job executions (offline approach) and using it to identify tasks inside the job that have not completed within a reasonable time frame. The main problem with this approach is that short jobs might not run long enough to collect a sufficient amount of profiling data, so stragglers cannot be identified fast enough to prevent an impact on the job completion time. This also became apparent in the conducted experiments, where one long straggler made the end time prediction basically unusable. In addition, a task might be identified as a straggler when it has already progressed so far that a backup task will not help. These limitations also became apparent during the benchmark phase conducted for this thesis work.

For Dolly another approach was chosen which mitigates the problem of gathering enough profiling data for small jobs. Instead of trying to predict stragglers at runtime, multiple clones of every task of a job are started "just in case". The results of the clone that finishes first are then used. This approach might look inefficient at first glance due to a) additional resources taken up by the clones and b) more contention on the cluster. However, the authors of Dolly state that according to their sample data taken from Facebook and Microsoft Bing, the smallest 90% of the jobs consume less than 6% of the available cluster resources. They make the point that consuming a few extra resources for these jobs in exchange for quite reliable straggler mitigation makes sense. But it also proves the point that the Dolly algorithm is best used for small jobs.
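The core idea of cloning a task and taking the first result can be illustrated with a few lines of code. This is a conceptual sketch using plain Java futures, not Dolly's implementation, and it ignores the scheduling and intermediate-data aspects.

    import java.util.concurrent.CompletableFuture;
    import java.util.function.Supplier;

    class CloneFirstResult {
        // Sketch: launches n identical clones of a task and completes with the result
        // of whichever clone finishes first; the remaining clones are simply ignored.
        static <T> CompletableFuture<Object> runClones(Supplier<T> task, int n) {
            CompletableFuture<?>[] clones = new CompletableFuture<?>[n];
            for (int i = 0; i < n; i++) {
                clones[i] = CompletableFuture.supplyAsync(task); // each clone runs independently
            }
            return CompletableFuture.anyOf(clones);
        }
    }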
The second problem, the additional cluster contention, is tackled with delay assignment. That solution builds upon the expectation that most clones will finish at about the same time, except for the stragglers. This assumption should hold true in general for homogeneous jobs and could also be observed in the conducted thesis benchmarks. The downstream clones, for example the reducers that follow the mappers in a Map/Reduce job, then wait for a certain time window to obtain an exclusive copy of the intermediate data. The main idea here is that the variable waiting time should avoid all upstream clones feeding the downstream clones at the same time, which could potentially lead to I/O contention.

In conclusion, Dolly seems to be a viable yet not over-complex approach that would fit the job structure used for this thesis. The implementation presented in the Dolly paper states a speed-up for small jobs between 34% and 46% while using just 5% extra cluster resources.

3.2 Evaluated Implementations


3.2.1 TEZ
Apache TEZ is another framework (or, more precisely, an execution engine) that also sits directly on top of YARN. The main goal of TEZ is to enable more flexibility than the Map/Reduce framework and therefore allow better tailoring towards specific workloads. With the gained flexibility the user can do a more thorough optimization than Map/Reduce would allow and could therefore benefit from improved performance.

The main idea of TEZ is to express the job as a dataflow diagram, which is basically a directed acyclic graph (DAG). Figure 3.3 shows an example of how TEZ uses such a graph to represent a generic Map/Reduce job. Essentially, each vertex can have an input and an output which can be connected together via the edges. This approach allows a highly customizable data flow.

Figure 3.3: Example usage of a DAG in TEZ, source: https://tez.apache.org/

The primary reason for evaluating TEZ was that it allowed more flexibility than the Map/Reduce implementation distributed with Hadoop. This includes adding vertices at runtime, which could have been potentially useful for doing measurements that are meant to evaluate predictions of the presumed end time of a job. Additionally, it might also have yielded the possibility to add a deadline mechanism to it. However, upon closer assessment the codebase of TEZ seemed too complex to modify in the time frame available for a master thesis and lacked some functionality needed for evaluating different approaches that can eventually lead to more predictable performance for typical big data workloads. The two most important shortcomings were the lack of a deadline mechanism and the missing fine-grained control over the underlying YARN containers, which is an essential building block for the approach presented in the course of this thesis. These difficulties finally led to the decision to implement a custom Application Master directly on top of YARN, which allowed complete control over the container execution at the cost of a higher implementation effort. A more thorough description of TEZ is given in the paper Apache Tez: A Unifying Framework for Modeling and Building Data Processing Applications, which was published at SIGMOD '15 [15].


Chapter 4
Realizing elastic resource management within Big Data environments
This chapter explains how the dynamic resource management mentioned in the previous chapters is implemented in practice. Since the implementation was mainly done as a proof of concept with its own set of limitations, a remark will be added where needed. This should mainly give an idea about what the implementation can and cannot do in its current state and should also provide a starting point for eventual future work.

The organization of this chapter is as follows: it begins with an overview of the state of the art of YARN, followed by a schematic overview with a simple example workload in order to give a general idea about what is actually happening. After that, the steps the Application Master needs to take for this workload will be described in more detail, especially which monitoring and communication facilities had to be added for this. Additionally, the core algorithms used for elastic task/container management will be discussed in the form of simplified pseudo code. The following part will discuss the messages exchanged between the Application Master and its containers and how they are used to calculate the job progress and extrapolate the predicted end time of the job. The last part serves as a retrospective of the general approach taken and what can possibly be improved to increase accuracy and performance.

4.1 State of the Art


YARN does not provide elastic resource management or deadline-based scheduling natively. YARN's main task is to provide resource management via the Resource Manager and the Node Manager(s) by giving out containers, which can be seen as execution units that handle exactly one task within a job. Because of that it makes sense to implement more advanced techniques like deadline awareness within the Application Master, which acts as the central control element of a job. The capabilities that have to be implemented in order to achieve elastic resource management and deadline awareness within Hadoop, and that are missing at the time of writing, can be summarized as follows:


- Requesting additional containers at runtime without breaking the job or falsifying the results. While the underlying YARN architecture allows this, the bundled Map/Reduce Application Master does not make use of this feature in order to dynamically speed up the job execution.

- Adding an end time prediction for the whole job. This kind of functionality lies outside the scope of YARN, which does not know anything about the job besides the number of tasks running at the moment and the resources that are occupied and free. The logical solution here is to let the Application Master handle this functionality, too.

- Allowing the Application Master to react to a violated deadline. As stated in the previous point, YARN itself does not know anything about the semantics of the job that is running. This also means that it has no ability to do any sort of deadline-aware, intelligent scheduling.

With these limitations of YARN in mind it became obvious that a custom Application Master with its own task structure had to be implemented in order to show that elastic resource management with deadline constraints can work on top of a real-world big data system like Hadoop.

4.2 Schematic overview


The main functionality of the implementation is to allow the tenant to provide a deadline for his job and let the Application Master adjust the execution in such a way that this deadline can be met, by dynamically adjusting the resources the job has at its disposal at runtime. One of the design goals here was that the user should not have to touch the job again at this point and that the Application Master should automatically decide what to do in order to adhere to the deadline. The main technique to achieve this is to constantly gather information on how each task in the job progresses and to extrapolate that to the whole job. The extrapolation is done with a modified version of the LATE algorithm described in the last chapter. Since information is distributed between all nodes in the cluster frequently at runtime, an asynchronous, message-based approach was chosen. The management component of the Application Master will be described in more detail later in this chapter.
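The resulting control loop of the Application Master can be summarized as in the following sketch. The abstract methods are placeholders for the facilities described in the remainder of this chapter; the sample rate and the scaling policy are simplified.

    // Simplified sketch of the control loop: at a fixed sample rate the predicted end
    // time is recomputed and, on a predicted deadline miss, more parallel containers
    // are requested. The abstract methods stand in for the mechanisms described below.
    abstract class DeadlineMonitor {
        abstract boolean jobFinished();
        abstract long    predictJobEndTime();           // modified LATE extrapolation
        abstract boolean canScaleUp();                  // does the cluster still have free resources?
        abstract void    increaseParallelContainers();  // ask YARN for additional containers

        void run(long deadlineMillis, long sampleIntervalMillis) throws InterruptedException {
            while (!jobFinished()) {
                if (predictJobEndTime() > deadlineMillis && canScaleUp()) {
                    increaseParallelContainers();
                }
                Thread.sleep(sampleIntervalMillis);
            }
        }
    }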

4.2.1 Conceptual decisions


In general the implementation orients itself towards the Map/Reduce paradigm, which means that the tenant has to provide at least an implementation for a Mapper and a Reducer. Because of this, the job execution is also strictly split into two parts: the map phase and the reduce phase. Strictly split here means that the reducing can only start when the mapping has finished successfully. This approach was chosen to allow for a more straightforward concept when, for example, the number of reducers has to be increased at runtime in order to hit the deadline.

Figure 4.1: Architecture overview of the implementation used for the measurements

Figure 4.1 above gives a general overview of the architecture used for the measurements. In general, the communication between all involved entities, i.e. the Application Master, the Reducers and the Mappers, is bi-directional. This is necessary in order to give additional information to them and to be able to share the data about partitioning and progress that is needed for dynamic resource management. The general steps are numbered in chronological order and describe a word count job:
1. The first step is the partitioning of the input data and is done by the Application
Master. For example, if the job owner configured the Application Master so that each
mapper should take 200 MB of data from the input file, this could, depending on the
size of the input file, yield 40 input partitions which can then be freely distributed to
any available mapper (a small sketch of such a partitioning follows after this list).

2. After the partitioning is done, the Application Master starts an arbitrary number of
mappers. The important point here is that every mapper has to know exactly which
part of the input data it needs to process. Since this does not change during the
lifetime of one mapping task, this information is passed directly to it upon starting.

3. In the next step the mapper fetches the input data from the HDFS distributed file
system and processes it. The container terminates after finishing its task since the
intermediate result is written to its local disk and can be safely pulled by a reducer
later.

4. After the map phase has finished, one or more reducers are started by the Application
Master. Since the keys for every reducer are unique and the Application Master keeps
track of the key-reducer assignment, it can use the communication channel to the
reducers to tell them where to fetch the data from. For the reduce step, the general
assumptions made for the mapping also hold true. If one decides on two initial
reducers and assigns the ranges a-m to the first and n-z to the second, it is no longer
possible to spawn an additional reducer. Choosing a finer-grained approach is
beneficial here, depending on the input data. For the proof of concept implementation,
26 groups were used, where each group corresponds to one letter of the alphabet.

5. When the reducers have received their set of unique keys they can sort and reduce
the data and obtain a valid, intermediate result. Because every set of keys a reducer
might receive is unique, there are no conflicts when it comes to the validity of the
final result.

6. The last step is that every reducer writes its partial results to a single file in HDFS.
In the case of word count the result is a file which holds every key, or word in that
case, and the absolute number of occurrences for it. This is also the final result of
the job and can then be fetched by the tenant from HDFS.
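To make step 1 more concrete, the following is a minimal Java sketch of how input partitioning by a fixed chunk size could look; the class and method names are hypothetical and not taken from the proof of concept code.

import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper illustrating step 1: splitting the input into fixed-size chunks. */
final class InputPartitioner {

    /** One chunk of the input file, described by byte offset and length. */
    static final class InputPartition {
        final long offset;
        final long length;

        InputPartition(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }
    }

    /** Splits a file of totalBytes into partitions of at most chunkBytes (e.g. 200 MB). */
    static List<InputPartition> partition(long totalBytes, long chunkBytes) {
        List<InputPartition> partitions = new ArrayList<InputPartition>();
        for (long offset = 0; offset < totalBytes; offset += chunkBytes) {
            partitions.add(new InputPartition(offset, Math.min(chunkBytes, totalBytes - offset)));
        }
        return partitions;
    }
}

With a 200 MB chunk size, an 8 GB input file would yield the 40 partitions mentioned in the example above.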

4.3 Core Algorithms


This section describes the main algorithms used inside the Application Master that are
required for elastic resource management and the prediction of the presumed end time of
the job. These algorithms will be presented in the form of pseudocode and an explanation
in text form. Language-specific details were left out on purpose since the main goal of
this section is to give a general idea of how the implemented system works.

4.3.1 Scaling of parallel running tasks


In order to adjust the number of running containers the Application Master has to keep
track of the state the containers are in. This is mainly done via the StateTracker class,
which atomically keeps track of the following properties:

Running containers: How many containers are currently executed in parallel?

Pending container requests: How many container requests were issued to YARN and
have not yet been served?

Completed containers: How many containers have finished their task and written back
their results?

Parallel containers: States how many containers may run in parallel at the same time.

These numbers are kept separately for the map and the reduce phase in order to avoid
confusion and to enable a strict separation of the two phases. Due to the asynchronous
nature of the system, the StateTracker class also has to make sure that writes to each of
these counters happen atomically, to avoid errors that would otherwise be hard to track
down. With these numbers in place, the actual increase of parallel running tasks simply
happens by increasing the number of parallel containers upon a deadline violation. The
algorithm for increasing the number of running tasks looks like this:

Algorithm 2 Adjustment of parallel running tasks

if parallelContainers > (runningContainers + pendingRequests) then
    freeSlots ← parallelContainers − (runningContainers + pendingRequests)
    if containersLeft < freeSlots then
        freeSlots ← containersLeft
    end if
    for i = 0; i < freeSlots; ++i do
        requestContainer()
    end for
else
    return
end if

The algorithm in itself is relatively simple and works by computing the number of free
container slots the system has in its current state. What is important is that the number
of pending requests is also taken into consideration, since there is a noticeable delay
between issuing a container request and actually getting it granted by YARN.
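As an illustration of this bookkeeping, the following is a minimal Java sketch of a state tracker and the adjustment step from Algorithm 2, assuming AtomicInteger counters and a hypothetical requestContainer callback; it is not the actual StateTracker implementation.

import java.util.concurrent.atomic.AtomicInteger;

/** Minimal sketch of the bookkeeping behind Algorithm 2 (hypothetical, simplified). */
final class StateTrackerSketch {
    final AtomicInteger runningContainers   = new AtomicInteger();
    final AtomicInteger pendingRequests     = new AtomicInteger();
    final AtomicInteger completedContainers = new AtomicInteger();
    volatile int parallelContainers;   // how many containers may run at the same time

    /** Requests additional containers up to the configured degree of parallelism. */
    void adjustParallelTasks(int containersLeft, Runnable requestContainer) {
        int inFlight = runningContainers.get() + pendingRequests.get();
        if (parallelContainers <= inFlight) {
            return; // the parallelism budget is already used up
        }
        int freeSlots = Math.min(parallelContainers - inFlight, containersLeft);
        for (int i = 0; i < freeSlots; i++) {
            pendingRequests.incrementAndGet();
            requestContainer.run(); // stands in for the actual YARN container request
        }
    }
}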

4.3.2 Calculating speedup using Amdahl's Law


Since the general idea of predicting the finishing time of the job is to use the LATE
algorithm to calculate the end time of a task and combine that with the speedup factor
provided by Amdahl's Law, it is useful to describe its application in the system in more
detail. As a short reminder, the form of Amdahl's Law chosen for the implementation is
the one where the parallel fraction of a task is given instead of the serial one:

$\frac{1}{(1-P) + \frac{P}{S}}$

with $S$ being the number of parallel tasks and $P$ the fraction of the task that can be
parallelized. The main question here is how to decide which fraction of the job is
parallelizable. Since one main goal of the Big Data approach is to divide one big problem
into many smaller ones and distribute these to other nodes in the cluster, it could be
assumed that a large fraction of the job is parallelizable.

However, there are a few caveats here: First of all, some operations Hadoop performs do
not happen purely in parallel. For example, the Resource Manager, as the one single
element that keeps track of the resource allocations within the cluster, needs some kind
of locking to make sure that container requests are served atomically, to avoid leaving
the system in an unstable state. Another factor is the underlying distributed file system,
HDFS, which has to make sure that if two tasks are writing to the same file these writes
also happen atomically in a specific order to avoid file corruption. These are general
problems that will have an impact on every job that uses Hadoop.

However, there are also factors that depend on the job. Since there is a certain overhead
when a new container is requested, there will be inevitable pauses while a task is waiting
for its container. Naturally, this becomes worse if the job is made up of a lot of small
tasks: the system overhead caused by the Resource Manager becomes higher and more
time passes in which no actual work is done. Also, I/O-intensive tasks that do a lot of
reading and writing from/to HDFS may cause more waiting time across the job due to
locking mechanisms and generally more congestion of the underlying storage medium.
For the example word count job the proof of concept implementation yields solid
prediction results when a parallel factor of 0.7 is used. This number was determined from
the knowledge that the job only causes moderate load on HDFS and the fact that only
the reduce phase is comprised of very small tasks. In addition, experiments were
conducted which showed that a value of 1.0 is by far too optimistic and yielded
unrealistic results, similar to values below 0.5.
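To illustrate how this factor behaves numerically, here is a small Java sketch of the speedup formula above; the class is purely illustrative and not part of the proof of concept code.

/** Illustrative helper for the Amdahl speedup formula above. */
final class AmdahlSketch {

    /** Returns 1 / ((1 - p) + p / s) for parallel fraction p and s parallel tasks. */
    static double speedup(double p, int s) {
        return 1.0 / ((1.0 - p) + p / s);
    }

    public static void main(String[] args) {
        // With the parallel fraction of 0.7 used for the word count job, doubling the
        // parallelism from 2 to 4 tasks only raises the speedup from about 1.54 to
        // about 2.11 -- far from linear scaling.
        System.out.println(speedup(0.7, 2)); // ~1.538
        System.out.println(speedup(0.7, 4)); // ~2.105
    }
}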

4.3.3 Computing the predicted job end time


The algorithm described in this subsection is used for determining the projected finishing
time of the job. This is important for detecting a potential violation of the user-provided
deadline as early as possible. The algorithm makes use of a combination of the LATE
system proposed in [22] and Amdahl's Law [1] and is defined in the LateCalculator class
of the proof of concept implementation. However, a few simplifications were made in
that implementation; they will be addressed in the textual description of the actual
algorithm when necessary and are also covered in Chapter 7, future work. The following
pseudo code describes each component used to achieve the end time prediction:


Algorithm 3 Calculation of average task runtime

numResults = 0
sum = 0
for taskData : allTasks do
    if isFinished(taskData) then
        ++numResults
        sum += endTime(taskData)
    else if isRunning(taskData) and progressScore(taskData) > 0.0 then
        ++numResults
        sum += predictedTaskEnd(taskData)
    end if
end for
if numResults == 0 then
    return 0
end if
return sum / numResults

This part of the algorithm calculates the average runtime over all tasks. The basic idea
is that the actual time a task took is used when it has already finished; otherwise the
predicted end time for that individual task, based on the LATE algorithm, is used. The
following helper methods are used by the procedure described above, whose main task is
to calculate the average runtime over all tasks, whether they are finished or still in
progress.

Algorithm 4 Helper methods used for determining average task runtime

procedure predictedTaskEnd
    return ((1.0 − progressScore) / progressRate) + taskRuntime
end procedure

procedure calculateProgressScore
    if fetchEnded then
        if computeEnded then
            return 1.0
        end if
        return 0.2 + (0.8 · computeProgress)
    end if
    return 0
end procedure

procedure calculateProgressRate
    return calculateProgressScore() / taskRuntime
end procedure

Here, the progress is a floating point number ranging from 0.0 to 1.0, where 0.0 means
there is no progress yet and 1.0 means the task has finished. The method presented here
uses a simplified approach compared to the one presented in the LATE paper, due to
timing constraints. Practically this means that static values are used to measure the
progress of the fetch phase and the write-back phase of an individual task. However, the
compute progress is determined quite accurately by using the progress information sent
from the task to the Application Master. Benchmark-wise, this does not cause many
inaccuracies for normal tasks, since the fraction of time they need per phase is known
and the task structure never changes significantly. However, it should be noted that in a
real, releasable system the Application Master should also make use of detailed
information sent by the tasks about their fetch and write-back progress. This would
allow for more precise end time predictions if the job structure is not known and would
also simplify the detection of straggling tasks.
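To tie the pieces together, the following hedged Java sketch shows one plausible way the average task runtime from Algorithm 3 and the Amdahl speedup from Section 4.3.2 could be combined into a job end time estimate; the class and the exact combination are illustrative only and do not mirror the LateCalculator class.

/** Illustrative combination of the LATE-style task estimate and Amdahl's Law. */
final class JobEndPredictorSketch {
    private final double parallelFraction; // e.g. 0.7 for the word count job

    JobEndPredictorSketch(double parallelFraction) {
        this.parallelFraction = parallelFraction;
    }

    /**
     * Estimates the remaining job time from the average task runtime (Algorithm 3),
     * the number of tasks still to run and the number of containers running in parallel.
     */
    double predictedRemainingSeconds(double avgTaskRuntime, int tasksLeft, int parallelTasks) {
        double serialRemaining = avgTaskRuntime * tasksLeft;       // as if run one by one
        double speedup = 1.0 / ((1.0 - parallelFraction)
                + parallelFraction / Math.max(1, parallelTasks));  // Amdahl's Law
        return serialRemaining / speedup;
    }

    /** Predicted absolute end time, given how long the job has been running already. */
    double predictedEndTime(double jobRuntime, double avgTaskRuntime,
                            int tasksLeft, int parallelTasks) {
        return jobRuntime + predictedRemainingSeconds(avgTaskRuntime, tasksLeft, parallelTasks);
    }
}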

4.4 How to implement custom tasks: Usage of the Task Interface
Since the basic concepts and the algorithms used by the implemented system were
discussed in the previous sections, one might ask how to actually use the elastic resource
management and the deadline awareness. This section answers that question and gives a
brief overview of the interface that has to be used in order to allow other kinds of tasks,
such as sorting tasks, to make use of the capabilities of the enhanced Application Master.
However, it has to be noted that for the conducted measurements only a word count task
that utilizes the implemented task interface was used. The main reasons for that were
time constraints and the decision to focus on the capabilities of the system. Another
aspect is that the interface and the underlying abstract classes can be seen as a proof of
concept and are still quite rough around the edges. This means that the effort necessary
to implement one's own task is considerable, since the actual mechanisms like fetching
data from the mappers need to be implemented from the ground up. For the measured
example task SCP was used, but for other use cases a different technique might be more
beneficial.

Figure 4.2: High-level overview of the task interface



The diagram in Figure 4.2 shows a high-level overview of the interface. At its base lies
the Task interface, which can be seen as a contract that needs to be fulfilled if a task
wants to make use of the extended capabilities of the Application Master. The minimal
functionality such a task should have is formulated as follows in Java:
public interface Task {
    void notifyAM(ToApplicationMasterMsg msg);
    List<FromApplicationMasterMsg> fromAM();
    void tearDown();
}
Listing 4.1: Task Interface

This interface makes sure that each task can send information to the Application Master
via notifyAM(), receive messages from it by calling fromAM() and can close possible
sockets with tearDown(). Because it would be unreasonable to require that the user of
this interface implements the underlying functionality (like socket communication and
message handling) all by himself, the abstract class TaskBase was created. This class can
be thought of as the lowest common denominator and contains the code that handles
the messaging between the Application Master and its tasks. Internally, this is done via
the TaskCommChannel and the TaskMessageBus classes, which provide a message queue
where the messages that came in via the communication socket are stored. From there
they can be pumped periodically without a message getting lost.
the task can be further specialized via abstract classes.
1
2

abstract class MapTask extends TaskBase {


public final TaskType taskType = TaskType . MAP ;

public MapTask ( InetSocketAddress amAddr ) {


super ( amAddr );
}

4
5
6
7
8
9
10
11

public abstract byte [] fetch () ;


public abstract void compute ( byte [] buffer ) ;
public abstract void writeBack () ;
Listing 4.2: Abstract map task

Listing 4.2 shows the abstract class that should be extended when implementing a custom
mapper. The fetch() method is expected to return the assigned part of the input file,
usually read from HDFS, as a byte array. This data is then fed into the compute()
method, which is supposed to do the actual computation. The last method, writeBack(),
is then used to write the results back to the local disk of the node the mapper runs on.


abstract class ReduceTask extends TaskBase {

    public final TaskType taskType = TaskType.REDUCE;

    public ReduceTask(InetSocketAddress amAddr) {
        super(amAddr);
    }

    public abstract void fetch();
    public abstract void reduce();
    public abstract void writeBack();
}
Listing 4.3: Abstract reduce task

Listing 4.3 above shows the abstract class that needs to be implemented for the reducer
that matches the mapper in the job. The general approach is the same: fetch() is used to
pull the data from the mappers, reduce() does the computational part of the task, i.e.
reduces it, and writeBack() finally writes the result back to HDFS.

Note that these abstractions for a map and a reduce task might not be sufficient or
optimal for every kind of job. For example, forcing the mapper to read the whole input
chunk into a buffer might cause problems in the future if the size of the input chunk
exceeds the amount of main memory available to that task. Allowing buffered reading at
this point might be the better choice in the long run. However, for the word count job
used for the performance measurements these abstract classes worked out well. Their
complete implementation can be found in the MapTaskWordCount and
ReduceTaskWorkdCount classes. Possible improvements in regard to interface and
usability are discussed separately in Chapter 7 as part of the future work.
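For orientation, a heavily simplified, hypothetical mapper built on the abstract MapTask class from Listing 4.2 could look as follows; it assumes that TaskBase provides the messaging plumbing described above, stubs out the HDFS access, and is not the actual MapTaskWordCount implementation.

import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, stripped-down word count mapper built on the abstract MapTask class. */
class SimpleWordCountMapper extends MapTask {
    private final Map<String, Integer> counts = new HashMap<String, Integer>();

    SimpleWordCountMapper(InetSocketAddress amAddr) {
        super(amAddr);
    }

    @Override
    public byte[] fetch() {
        // The real task would read the assigned partition from HDFS;
        // here a constant stands in for the fetched chunk.
        return "to be or not to be".getBytes(StandardCharsets.UTF_8);
    }

    @Override
    public void compute(byte[] buffer) {
        for (String word : new String(buffer, StandardCharsets.UTF_8).split("\\s+")) {
            Integer old = counts.get(word);
            counts.put(word, old == null ? 1 : old + 1);
        }
    }

    @Override
    public void writeBack() {
        // The real implementation writes the counts to the local disk so a reducer
        // can pull them later; printing keeps the sketch self-contained.
        System.out.println(counts);
    }
}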

4.5 Challenges of elastic operation


4.5.1 Implementing elastic resource management in
Hadoop/YARN
Since the design decision was made to implement the proof of concept testbed on top of
YARN in order to achieve a higher degree of control, a few difficulties were faced in the
form of technical limitations of the existing system. First of all, Hadoop does not have a
notion of bandwidth in its resource system. The only resources a unit of execution
(called a container in the Hadoop/YARN context) has are the number of available CPU
cores and the usable main memory. So the first step was to add bandwidth as a resource
to YARN, which affected basically every core component of Hadoop: the Resource
Manager, the Node Managers and the container abstraction itself. In addition, the
protocol used by Hadoop to communicate between the components had to be made
aware of bandwidth.



The next challenge was to implement a convenient possibility to add containers to a job
at runtime in order to increase its parallelism and overall resource usage. The stock
Map/Reduce Application Master that ships with the Hadoop distribution did not have
this capability, though. This led to the implementation of a custom Application Master
which is able to increase the number of running containers for a job without
compromising the correctness of the result. To achieve that goal the task structure also
had to be modified and exchanged with an own implementation that went along with
the custom Application Master. This task structure mimics the one of the Hadoop
Map/Reduce implementation in the way that a task within the job is either a mapper or
a reducer, with the reducers pulling the intermediate results they should work on
directly via SCP from a finished mapper.

In order to allow a dynamic increase of running mappers and reducers a few concessions
had to be made, primarily that the tenant has to specify a maximum number of reducers
beforehand. This works by providing an upper limit on the number of tasks for each
phase and information about how many chunks of data one task should handle at
maximum. This was necessary because the vanilla Map/Reduce implementation does not
allow an exact specification of how many tasks a job should have but decides that more
intelligently based on factors like the HDFS block size and other settings made by the
cluster administrator. While this in itself is fine for normal day-to-day operation, it
turned out to be problematic for doing reproducible benchmarks which utilize
comparatively small jobs.

Another big challenge was the fact that Hadoop/YARN does not provide any kind of
deadline mechanism. In order to make meaningful use of a deadline there was also a
need for a solid mechanism to predict the approximate end time of the job in order to
react to potentially violated deadlines as fast as possible. This was implemented using a
slightly modified version of LATE which is combined with Amdahl's Law, similar to the
system Jockey proposes. However, this turned out to be problematic when there are
straggling tasks in the job, so the measurements assume that there are no stragglers and
solving this was left for future work. So the main challenges lay in making Hadoop
aware of a deadline, predicting the estimated finishing time of a job and reacting to that
accordingly by assigning more resources to the job in order to meet the deadline. The
remainder of this chapter explains in more detail what was done in order to achieve
these goals.

4.6 Data transfer between Reducers and Mappers


Since the main goal is to reuse as much of the existing infrastructure as possible, the
transfer of the intermediate results from a mapper to a reducer is realized via SCP. In
order to avoid duplicate names and to allow an exact correlation between input data and
the corresponding mapper, the proof of concept implementation internally uses a UUID
[10] for all messages and computed results. The general flow of communication can be
described as follows:



1. A mapper has finished its assigned task.
2. The Application Master is notified: it receives the path to the result file and the
hostname of the node where the mapper has finished.
3. The container the mapper ran in terminates and the data remains local on the node.
4. The reducer that got the assignment from the Application Master to work on this
specific part of the solution uses SCP, which is part of SSH, to pull the result from the
corresponding node (a minimal sketch of such a pull follows below).
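As an illustration of step 4, here is a hedged Java sketch of how a reducer could shell out to scp to pull an intermediate result; the host, paths and key handling are placeholders, error handling is reduced to the exit code, and this is not the code used in the implementation.

import java.io.File;
import java.io.IOException;

/** Illustrative SCP pull of an intermediate map result (placeholder host and paths). */
final class ScpFetchSketch {

    /** Copies remotePath from the given host into localDir and returns the local file. */
    static File pullResult(String host, String remotePath, File localDir)
            throws IOException, InterruptedException {
        File target = new File(localDir, new File(remotePath).getName());
        Process scp = new ProcessBuilder(
                "scp", host + ":" + remotePath, target.getAbsolutePath())
                .inheritIO()
                .start();
        if (scp.waitFor() != 0) {
            throw new IOException("scp from " + host + " failed");
        }
        return target;
    }
}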

4.7 Protocol and message format for out-of-band communication

The following section briefly describes the messages and their format that are used for
the communication between all participating components. The arrows in the following
subsections state the direction in which the communication takes place, e.g. Application
Master → Mapper means that the Application Master sends a message to a running
mapper. The format of the messages is given in JSON notation [6] as used in the actual
source code, where the classes used for messaging can be found in the messages package.
Note that there are messages that do not have a specific payload and act only as a
mechanism to report back to the Application Master.

4.7.1 Task → Application Master


Progress Messages

The two messages presented here are used to report the progress of either a map or a
reduce task to the Application Master and form the foundation for the job end time
prediction. The payload of such a message, the progress, is calculated within each task
as the fraction of its processed data and is then transferred as a floating point number
between 0.0 and 1.0. The Application Master then uses these progress reports to
calculate the progress rate using the LATE algorithm.


"ReducerProgressMessage": {
    "TYPE": "REDUCER_PROGRESS",
    "progress": 0.42
}
Listing 4.4: Reducer Progress Message


"MapperProgressMessage": {
    "TYPE": "MAPPER_PROGRESS",
    "progress": 0.23
}
Listing 4.5: Mapper Progress Message

The reasoning behind the decision to have two messages for the task progress is that it
simplified the implementation and makes it clear that the progress report came from a
specific kind of task: either a map or a reduce task.

Phase update messages

These messages are solely identified by their type and indicate that a task has begun or
finished a specific phase like fetch or compute. The Application Master saves the time
the job has been running upon receiving one of these messages, together with the
corresponding task. This data is then used to calculate the statistics for each task which
form the basis of the plots presented in Chapter 5. The following phase messages are
implemented in the messages.incoming package for map tasks: MAPPER_START_FETCH,
MAPPER_END_FETCH, MAPPER_START_COMPUTE and MAPPER_END_COMPUTE. Note that the
write-back phase to the local storage medium of the node after completing the
calculation is not reported here since it is masked by the compute phase. That means
that the resulting key/value pair is written out immediately in order to avoid keeping
the assigned chunk of the input file and the internal data structure (a HashMap) that
holds the key/value pairs in memory over the runtime of the task. Analogous to the
mapper phase update messages, the reducer messages look similar: REDUCER_START_FETCH,
REDUCER_END_FETCH, REDUCER_START_COMPUTE, REDUCER_END_COMPUTE, REDUCER_START_WRITE and
REDUCER_END_WRITE. However, the reduce tasks explicitly report the beginning and the
end of the write-back phase. The word count task implemented lets the reducers write
their results back to HDFS, which usually involves additional network traffic, so here it
made sense to report that to the Application Master.
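A minimal sketch of how the Application Master could record these phase messages, assuming a simple enum of phase types and a per-task map of job runtimes; the names and structure are illustrative only and not taken from the implementation.

import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative bookkeeping of phase update messages per task. */
final class PhaseLogSketch {

    enum Phase { MAPPER_START_FETCH, MAPPER_END_FETCH, MAPPER_START_COMPUTE, MAPPER_END_COMPUTE,
                 REDUCER_START_FETCH, REDUCER_END_FETCH, REDUCER_START_COMPUTE, REDUCER_END_COMPUTE,
                 REDUCER_START_WRITE, REDUCER_END_WRITE }

    private final long jobStartMillis = System.currentTimeMillis();
    private final Map<String, EnumMap<Phase, Long>> perTask =
            new ConcurrentHashMap<String, EnumMap<Phase, Long>>();

    /** Stores the job runtime (in seconds) at which the given task reported the phase. */
    void record(String taskId, Phase phase) {
        long secondsIntoJob = (System.currentTimeMillis() - jobStartMillis) / 1000;
        perTask.computeIfAbsent(taskId, id -> new EnumMap<Phase, Long>(Phase.class))
               .put(phase, secondsIntoJob);
    }

    /** Returns the recorded phase timestamps for one task, e.g. for plotting. */
    Map<Phase, Long> phasesOf(String taskId) {
        return perTask.getOrDefault(taskId, new EnumMap<Phase, Long>(Phase.class));
    }
}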

4.7.2 Application Master → Task


This section describes the message that the Application Master uses to influence the
behavior of a single task. For the current state of the proof of concept implementation
that means the ADJUST_BANDWIDTH message. However, the dynamic bandwidth
adjustment was not used during the experiments due to technical difficulties and timing
constraints. The mechanism to contact a task is in place, though, so the message is
described here nevertheless. The Java class that abstracts this message can be found in
the messages.outgoing package of the source code. Since tc in combination with net_cls
can only adjust the upstream for a given process id, the message only holds the
upstreamGuarantee field, which is given in kilobit per second.
"AdjustBandwidthMessage": {
    "TYPE": "ADJUST_BANDWIDTH",
    "upstreamGuarantee": 1000
}
Listing 4.6: Adjust Bandwidth Message

The example message above is intended to set the outgoing bandwidth of the specific
task to 1000 kbit per second, or 1 Mbit per second. However, it has to be noted that in
its current state the task needs to run with root privileges in order to adjust the
corresponding cgroup, which is a serious disadvantage here.


Chapter 5
Structure, scope and challenges of the performance evaluation

This chapter explains the technical scope and general structure as well as the
assumptions made for measuring the impact of elastic resource allocation using Hadoop
with the proof of concept Application Master. In addition, the challenges that became
apparent while conducting the experiments and the chosen solutions for them will be
explained. The last part of this chapter aims to extrapolate the validity of the conducted
experiments to a wider scope and assesses their practical use for larger, real-world
workloads.

5.1 Overview
In general, several steps were needed to get a consistent view of the effects that were
measured. These effects basically boil down to a task that straggles due to one or more
reasons, most prominently blockage in the compute phase and scarce bandwidth which
hinders the fetch phase. So in order to assess the behavior of the job as realistically as
possible, the following parts of the job execution were looked at:

1. The distribution of the fetch, compute and write-back phases for each individual task
that makes up the job. These measurements gave a good understanding of how the
actual structure of each task looks and made assumptions regarding the whole job
possible.

2. The overall behavior of the job. This gives a more general idea about the job
execution than just looking at a single task. These measurements include the number
of running tasks, a histogram of their runtime behavior and the progress of the map
and reduce phase.

5.2 Scale of experiments and workload


The testing environment consists of one physical machine that runs four virtual
machines on it. All of these VMs (nodes in Hadoop terminology) are directly connected
to each other, so no special cluster topology is used for testing. From the software side, a
normal, unmodified version of Hadoop is used. The Map/Reduce framework that is
bundled with Hadoop and sits on top of the YARN layer, however, is not used; instead
the implemented proof of concept Application Master is doing the main part of the
work. The four main reasons for that are:

1. There is no possibility to modify the resources of an execution unit in a Map/Reduce
task at runtime. An open task for that exists in the Hadoop JIRA (YARN-1197) which
was created in September 2013 with just minor progress made up until now. This led to
the conclusion that a proper implementation which correctly addresses all parts of
Hadoop, including the Map/Reduce framework, would be out of scope and is a big task
in itself.

2. The Map/Reduce framework within Hadoop is a large codebase that interacts with
every part of Hadoop and is a superset of YARN in its current form. Even with the first
point possible, the Map/Reduce framework would also need heavy modification, which is
likewise a large task and not feasible within a six-month scope for a single person.

3. No notion of bandwidth exists in Hadoop. The only known resources are the CPU
cores and memory available on a given node and as a reservation for a specific sub-task
within a job. Running directly on top of YARN simplified solving this constraint since
the Hadoop Map/Reduce scheduler did not need to be made bandwidth aware, which
would also have led to significant modifications beyond the scope of this work.

4. The number of tasks (map and reduce) of a job that gets allocated by the Hadoop
scheduler is not necessarily known and depends upon certain factors, mainly the HDFS
block size. If there is no exact control over the number of tasks, the calculation of the
job progress can become difficult.

Since one of the main goals of this work is to show the potential advantages of modifying
the resources a job has in order to meet a certain deadline, the approach of implementing
a custom Application Master on top of YARN was chosen.

5.2.1 Job structure for conducted experiments


The technical limitations mentioned beforehand led to a number of design decisions when
it comes to the structure of the jobs measured.

The Hadoop Map/Reduce framework is not used. This also means that generic Hadoop
Map/Reduce jobs will not run in cooperation with the proof of concept Application
Master used for the experiments in this work.

The actual increasing or decreasing of resources has to be done in a way that would not
be acceptable in a real environment. For example, an increase of the bandwidth for a
given task would be handled completely out of band, using tc in conjunction with
cgroups on the operating system level outside of Hadoop. In reality this would limit the
support to a recent Linux platform that supports these underlying technologies.

At the current point this means a dedicated Application Master had to be implemented
for each job, along with the workloads (or in other words, the actual algorithms used by
the mappers and reducers) that should take place in the mapping, shuffle and reduce
phase. In later iterations of the experiments a more generic and user-friendly Application
Master could be implemented, but this is left for future work.

5.3 Results of the measurements


This section discusses the experiments and measurements done and gives a breakdown
of their respective results.

5.3.1 How to measure the job progress?


In order to make predictions about a job and whether or not it can hit a given timing
constraint, we have to measure the progress of the job in some way. In general, the job
progress is measured based on the number of completed sub-tasks. For example, if we
have 50 sub-tasks and 25 are completed, we assume the whole job has a progress of 50%.
Since we need to implement the jobs used for the experiments ourselves, we have direct
control over how the job is mapped and reduced, without the Hadoop Map/Reduce
scheduler getting in the way. One serious drawback of that approach, however, is the
fact that we do not have a view inside a sub-task: it is only known whether the task is
running or finished. If the sub-task time fluctuates heavily due to bad mapping
decisions, a congested network or problems with the underlying hardware, this approach
might prove too imprecise for certain experiments. However, since we have direct control
over the Application Master and the workload, sub-tasks can signal their individual
progress (when the nature of the sub-tasks makes that possible) back to the Application
Master.
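A minimal Java sketch of this completed-task metric and the optional per-task refinement described above; it is an illustration, not the code used for the experiments.

/** Illustrative job progress metric based on completed sub-tasks. */
final class JobProgressSketch {

    /** Coarse progress: completed tasks divided by total tasks (0.0 to 1.0). */
    static double coarseProgress(int completedTasks, int totalTasks) {
        return totalTasks == 0 ? 0.0 : (double) completedTasks / totalTasks;
    }

    /**
     * Refined progress: running tasks contribute their individually reported
     * progress scores (each between 0.0 and 1.0) in addition to completed ones.
     */
    static double refinedProgress(int completedTasks, double[] runningTaskScores, int totalTasks) {
        if (totalTasks == 0) {
            return 0.0;
        }
        double sum = completedTasks;
        for (double score : runningTaskScores) {
            sum += score;
        }
        return sum / totalTasks;
    }
}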

5.3.2 Summary
The previous sections explained the restrictions and the scope of the experiments that
will be carried out and analyzed throughout this work. The section at hand gives a
summary of the main points to make clear what to expect from this work and what not.

To make reproducible measurements possible and to keep the scope of the work realistic,
it was decided to realize each job with a self-implemented Application Master directly on
top of YARN.

In consequence, the results of this work cannot be used to run generic Map/Reduce jobs
designed for the Hadoop Map/Reduce framework. This would possibly be future work
closely related to YARN-1197. In other words, custom jobs that mimic Map/Reduce
behavior are used for the experiments instead of Map/Reduce jobs running within the
provided Map/Reduce framework of Hadoop.

The nature of the Map/Reduce job used for the experiments solely depends on how it is
implemented for the experiments and how the enhanced possibilities of YARN are used.
This means that jobs with multiple map and reduce phases or even an iterative
Map/Reduce task would be possible. However, this also needs to be implemented from
scratch on top of YARN and would not be generic by any means.

The actual goal of this work is to show how changing the resources of a job and/or its
sub-tasks can influence its runtime in a way that the job can meet certain timing
constraints provided by the tenant. The goal is not to provide a re-usable, releasable
solution but to show that the general idea of speeding up a job at runtime under certain
circumstances can yield more predictable performance in an existing framework like
Hadoop.

These points sum up the general scope of the experiments conducted and should give a
better idea of what to expect from this work and how the results could be used to
implement a more generic solution inside Hadoop.

The remainder of this chapter discusses the results obtained on the actual (albeit small)
cluster. First of all, the frame conditions of the experiments will be described. These
cover important aspects like the number of nodes, the interconnect between them and
their hardware. Also, the nature of the input samples and the workloads will be
discussed. In the last part the results of the measurements will be discussed and
compared to runs that do not utilize elastic resource allocation.

5.4 Cluster setup


The following experiments were conducted on the Bigfoot cluster at TU Berlin. For the
benchmarks a small cluster with four computing nodes interconnected with Fast
Ethernet links was set up and all of the following experiments were conducted there.
Each node runs the stable version of Ubuntu 12.04, has 8 gigabytes of RAM and 4 CPU
cores available. The transfer of intermediate results from a mapper to a reducer takes
place via SCP.

5.5 Workloads and input size


The main workload for the experiments was a self-implemented word count that uses 18
gigabytes of input data generated from a recent HTML dump of Wikipedia. Word count
was chosen because it is a quite common example application for Big Data and the
algorithm is quite simple. Also, the dump guarantees that there are measurable fetch
times from HDFS [16] and from a mapper to a reducer. In addition, input data for that
kind of workload is easily obtainable and can be adjusted to the needed sizes without too
much effort.

5.6 Benchmarks
This section discusses the benchmarks that were obtained using the implementations
done in the course of this work, primarily the custom Application Master. The order in
which these benchmarks are presented goes from a per-task view up to measurements
that evaluate the behavior of the whole job. The observations of the whole job are
further divided into measurements that at first focus on the prediction of the job
runtime under different conditions like appearing stragglers. After that, the influence of
a dynamic resource increase is taken into account. The main goals of this section can be
summarized as follows:

Evaluate and assess how accurate the runtime predictions for a single task are.

Extrapolate these predictions to the whole job and show how accurate the deployed
algorithms are there.

Show how the chosen approaches cope with changing conditions like stragglers or slow
nodes.

Conclude from that how different strategies to keep the job within the provided deadline
bounds work out.

5.6.1 Per-Task Measurements


These measurements focus solely on a single task inside a larger job. In order to make
educated guesses about the runtime behavior of the whole job we need to be as accurate
as possible when it comes to the predicted finishing time of the tasks that make up the
job.

Durations of phases for a single task

Figure 5.1 shows the per-task behavior of a small sample job. The x-axis shows the
container id associated with each task and the y-axis shows the overall runtime of the
job, which is about 1200 seconds. The map phase ends after container 17 and the reduce
phase begins with container 18. What we can basically deduce from the graph is the
following:


Figure 5.1: Exemplary runtime behavior of a map/reduce job

1. The reading of the input chunk by a mapper is quite rapid and considerably faster
than the computation. The write-back time for the mapper is masked by the
computation, e.g. the mapper processes one line of the input file and writes the line
directly after that to the opened file stream.

2. The reducers take some time to fetch the intermediate results via SCP from the
mappers. However, the computation takes the most time for the reducer.

3. The writing to HDFS by the reducer is quite fast since the results are much smaller
than the input data.

This leads to the conclusion that the default 1/3, 1/3, 1/3 weighting that the LATE
paper [22] proposes does not fit well with that task structure. In the case of the mapper,
a weighting of around 20% for read and 80% for compute might be more appropriate.
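A hedged Java sketch of what such a configurable weighting could look like, generalizing the fixed 0.2/0.8 split used in the proof of concept; the class and its fields are illustrative only.

/** Illustrative progress score with configurable per-phase weights (should sum to 1.0). */
final class WeightedProgressSketch {
    private final double fetchWeight;
    private final double computeWeight;
    private final double writeWeight;

    WeightedProgressSketch(double fetchWeight, double computeWeight, double writeWeight) {
        this.fetchWeight = fetchWeight;
        this.computeWeight = computeWeight;
        this.writeWeight = writeWeight;
    }

    /** Each phase progress is between 0.0 and 1.0; the result is the overall task progress. */
    double progressScore(double fetchProgress, double computeProgress, double writeProgress) {
        return fetchWeight * fetchProgress
                + computeWeight * computeProgress
                + writeWeight * writeProgress;
    }
}

// Example: the 20%/80% split suggested above for mappers (no separate write-back phase):
// new WeightedProgressSketch(0.2, 0.8, 0.0).progressScore(1.0, 0.5, 0.0) yields 0.6.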

Runtime Predictions

After showing the behavior of a single task under normal circumstances without any
form of runtime prediction, this information will now be used to approximate the actual
point in time at which the task will most likely finish. In order to do that, a prediction
will be made for different types of tasks, e.g. those that are running normally and those
that are exposed to different kinds of resource shortages in bandwidth and computing
power. In addition, an approach will be presented to mitigate the shortcomings of the
default weighting of task phases discussed in the per-task measurements section.

Regular task without straggling

Figure 5.2 shows a more detailed view of a single task; it is basically a magnified view of
a randomly picked reduce task. This graph should primarily give an idea of how the time
distribution for each part of the task looks. However, it is clearly visible here that the
standard distribution of 1/3 of the time to each phase, as assumed by LATE, does not
hold true for this specific task.

Figure 5.2: Behavior for one reduce task

The incorrect weighting for that task would have a negative impact on the finishing time
prediction. However, since the exact behavior of one task is mostly unknown in general,
there is no possibility to improve on that without profiling information or an online
learning approach. An example of how this can work in combination with an online
approach was given in [7]. The explanation for these unevenly distributed times lies in
the nature of a reduce task within a word count job: The fetch time is directly related to
the network speed and according to [22] and [11] often makes up a large fraction of the
time a reduce task needs. This observation is similar to what was seen during the
experiments and should hold true for different kinds of jobs, too. For reducing a map
result of a word count job the computing time also scales nearly linearly with the input
size, so a possibility here might be to take the input size into account when weighting
each phase. The last phase, the writing back of the result to the underlying distributed
file system HDFS, is quite short, though. This is due to the fact that in the case of word
count the result is usually much smaller in size, which explains the relatively short
write-back time for this task.

Figure 5.3: Inaccurate assumption due to inappropriate task phase weighting

Figure 5.3 shows a wrong prediction for a short map task caused by incorrect weighting
assumptions. Here, a relatively small task was chosen (35 seconds runtime in total) to
better show the effects of the wrong prediction and make the individual sampling points
visible. The error happens because the fetch time from the local disk only makes up a
small fraction of the overall runtime of the task. Since the task reaches a progress of 1/3
very fast due to this, the overall prediction first becomes too low and does not recover in
time, which leads to a too high end time prediction.

5.6.2 Per-Job Measurements


This section assesses the behavior of a whole job that consists of several tasks. As with
the first series of per-task measurements, we begin without any prediction of the
finishing time. After outlining how the job behaves in general with and without
stragglers, we move on and evaluate different approaches for predicting the finishing
time of that job. Additionally, this section should give a good idea about how the
implemented system works and behaves under certain circumstances.

Finishing time prediction

The previous measurements in the per-task measurements section dealt with the
behavior of a single task and the prediction of its finishing time. This in itself is not very
helpful since the tenant usually does not care about single tasks but about the whole
job. However, in order to predict the timing properties of a job it is necessary to have a
good understanding of the behavior of the tasks it is composed of. The following
benchmarks show the accuracy of the runtime prediction, starting with a simple
approach and ending with a more refined strategy that makes use of the calculated
finishing times of individual tasks and a straightforward extrapolation that orients itself
on Amdahl's Law [1].

Taking LATE predictions into account

[Figure 5.4: two stacked plots over the current job runtime (s). The upper plot shows the predicted end time (s) together with the average container runtime; the lower plot shows the remaining and running tasks.]

Figure 5.4: Predicted finishing time of the job using Amdahl's Law

The graph in Figure 5.4 shows the prediction of the finishing time of the job using the
speedup calculated with Amdahl's Law, $\frac{1}{(1-P)+\frac{P}{S}}$, where $S$ is the
number of parallel tasks and $P$ the fraction of the job that can be parallelized. A more
thorough explanation of how the serial fraction was determined and how this was
realized in code can be found in Section 4.3.2 of Chapter 4. As the graph shows, this
apparently simple approach yields quite solid results and works well if the tasks all run
for roughly the same time. However, for this plot a smaller job which consists only of
mappers was chosen in order to be able to show the individual measuring points and the
general principle. Here it is observable that the red crosses, which are the prediction at a
given point in time into the job, correlate with the average runtime of each task. This is
the most basic example because each task takes almost exactly the same time, but it
shows that the general concept is working and the prediction correctly floats at around
200 seconds, which is quite close to the actual end time of the job. The plot below shows
how many tasks are running in parallel and how many are left, also with individual
measuring points. Since there is no kind of resource increase here, the green dots stay
static on the x-axis. Though rooted in traditional parallel computing, Amdahl's Law can
also be utilized to calculate the speedup of Big Data jobs when applied correctly, e.g.
with the right factor for the parallelizable fraction. However, with stragglers this
approach does not work very well if no techniques that take stragglers into account are
applied when predicting the finishing time of the job.

Impact of resource increase

In the last section different approaches were shown to estimate the finishing time of a
job, extrapolated from its individual tasks. Since the combination of Amdahl's Law and
the LATE calculation gives a good approximation, we want to put this information to
actual use. Because achieving predictable performance and a preferably optimal cluster
utilization is the main goal of this thesis, we have to react to unfavorable timing
predictions. For example, a deadline for the job was issued and our predictions say at
one point in time that we will miss this deadline if the job progresses at the same rate.
The following benchmarks deal with strategies that try to keep the job within its timing
constraints as closely as possible.

Increasing number of parallel running containers

[Figure 5.5: the same type of plot as Figure 5.4, showing the predicted end time and the remaining and running tasks over the current job runtime for a run in which one additional parallel task is added.]

Figure 5.5: Add one parallel task after around 50% job progress

This measurement shows a static increase of running containers in order to prove that
we can affect the runtime of a job inside Hadoop dynamically at run time. The number
of parallel running containers is increased by one at 50% overall progress and leads to a
faster job completion time. However, since this is just a static increase at a given
progress and no intelligent algorithm is used here, the usefulness of this static increase is
limited, but it can lead to more sophisticated approaches in order to achieve our goal of
more predictable performance. Also, for this simple measurement perfect parallelism was
assumed, which explains the drop in the prediction (the red crosses) to about half of the
previous value. This flaw will be addressed in the later measurements.

End time prediction without straggler

In this part the complete job with all its tasks as well as the map and the reduce phase
will be evaluated. It should give a solid idea about how the system behaves and what
kind of structure the example word count job has. At first the baseline will be
established, i.e. how the job behaves without stragglers and without dynamic resource
allocation. After the basic behavior is analyzed, an artificial straggler will be added to
the job in order to evaluate whether the system is capable of keeping its deadline despite
the straggler.

[Figure 5.6: job summary for 200 MB chunks, consisting of four stacked subplots over the job runtime (s): job progress, predicted end time (s) with the deadline, the number of running mappers and reducers, and a histogram of the job and task runtimes.]

Figure 5.6: Job behavior without resource management and no stragglers

Figure 5.6 shows a plot that describes the whole system when executing the example
word count job. The first subplot at the top shows the overall progress of the job,
divided into map and reduce progress. When both phases reach 1.0, or 100%, the job has
finished. The purple line shows the transition from the map to the reduce phase. In the
second subplot, the blue line marks the predicted end time of the job as calculated at the
specific time the job has been running, visible on the x-axis. The prediction algorithm
itself is described in more detail in the implementation part in Chapter 4. The next
subplot, running tasks, shows the number of mappers running in blue and the number of
reducers running in green. The fourth and last subplot basically shows the same as the
previous one, but this time in the form of a histogram. This helps to assess when each
task of the job starts and ends. It also gives insight into the parallelism the job utilizes
at any moment as well as the length of each task. Note that the meaning of each line
and curve in this type of plot stays the same throughout this chapter.

At first it is notable that the prediction stabilizes after the initial jitter as more progress
data becomes available. It is also visible that the prediction surpasses the deadline on
various occasions, which would normally trigger a resource increase. Since dynamic
resource management is disabled for this measurement in order to see how the job
behaves without any form of interference, this has no impact and the number of running
tasks stays the same. In this first experiment it is notable that the implemented finishing
time predictor stays relatively close at around 1000 seconds, with a spike between 600
and 800 seconds into the job. This spike happens because of the two map tasks that take
slightly longer than their counterparts. However, the prediction that the job will end at
slightly more than 1000 seconds held true in this case and the deadline would have been
missed by about one and a half minutes. This experiment showed, however, that an
extremely tight deadline (a deviation of 60 seconds or less) is not feasible for a job of this
size because Big Data systems are subject to a lot of factors that are not directly
controllable, like the scheduling of the underlying operating system, disk I/O and polling
rates within Hadoop, to name a few that can cause smaller differences here and there.
However, even with these factors the prediction stayed relatively close to the real end
time.


[Figure 5.7: the same job summary plot as Figure 5.6 (job progress, predicted end time with deadline, running mappers and reducers, task runtime histogram) for a run with dynamic resource management enabled.]

Figure 5.7: Job behavior with resource management and no straggling task

Plot 5.7 shows the same job as the previous one, but this time with dynamic resource
management activated. In the beginning there are again spikes due to the limited
progress data available when no task has finished yet. This spike, however, triggered the
dynamic resource management algorithm: one additional container for a map task is
requested and started. In this case it is one container because the system was configured
to increase the resources in a soft way, which means that one deadline violation results
in one additional container and a 30 second long waiting period before the next increase
happens, in order to give the system time to adapt. Depending on the use case it would
also be possible to request more additional containers. However, this also revealed a
weakness in the implementation: due to the lacking progress data in the beginning, a
resource increase was triggered. Here, it might be beneficial to allow an increase only if
enough data is available. In this case, however, the resource increase was beneficial and
the predicted end time stays well under the provided deadline in the map phase. Also,
the estimation stabilizes at around 750 seconds.

A notable characteristic of the job structure shown in Plots 5.6 and 5.7 is the uneven
distribution of task run times when it comes to the reducers. This can potentially make
the prediction inaccurate and is due to the fact that for the word count job the groups
are not equally large and the proof of concept implementation does not have a more
advanced scheduler in place in its current state due to timing constraints. Possible
approaches to solve this problem are discussed in Section 7.3 in the future work chapter.


[Figure 5.8: bar chart of the amount of data (in megabytes) processed by each of the 26 reducers.]

Figure 5.8: Data distribution to each reducer

Figure 5.8 shows the amount of data each reducer pulls in from the mappers. In total
there are 26 groups that have to be processed, each one corresponding to one letter of
the alphabet. Due to the nature of this grouping there are many small groups like y or q
and some larger ones like a and e. This holds mostly true for the English language [13]
and is also the case with the sample data taken from the English Wikipedia. Due to
timing constraints, the Application Master implemented as a proof of concept does not
employ a more sophisticated scheduler that could lessen this effect by assigning more
than one small group to a reduce task in order to distribute the work more evenly, which
explains the uneven running times across the reduce tasks.

End time prediction with a straggler

In addition, an artificial straggler was added to the job execution to show its impact.

[Figure 5.9: job summary for 200 MB chunks (job progress, predicted end time with deadline, running mappers and reducers, task runtime histogram) for a run with a straggling task and deadline violations ignored.]

Figure 5.9: Ignoring deadline violation

The prediction then rises due to the straggling task, which can be identified in the task
histogram at the bottom showing the duration of each task and their respective start
and end points. The straggling task is the wine-red mapping task which took about 5
times longer than the average map task. The red line shows the deadline provided by
the user, which sits at 1000 seconds in this example. Note that the elastic resource
allocation was disabled for this first experiment since it was conducted to show how a
straggling task without some form of straggler mitigation can have a negative effect on
the overall job performance. Also, the number of parallel running tasks was set to two
here. This can be seen in the next subplot that shows the running tasks at any given
point in time, with the blue ones being the map tasks and the green ones the reduce
tasks. The small downward spikes here are due to the fact that YARN needs a little bit
of time to actually issue a requested container to the Application Master.

[Figure 5.10: the same job summary plot as Figure 5.9, this time with the dynamic resource allocation reacting to the deadline violations.]

Figure 5.10: Reacting on deadline violation with straggler

The last figure in this chapter, 5.10, shows the same job as in Plot 5.9. In this plot,
however, the difference is that the dynamic resource allocation implemented in the proof
of concept is now activated. The expectation here is that as soon as a deadline violation
is detected, the resources for the job are increased. Increase here means that another
task is spawned that should speed up the job enough so that it can meet its deadline.
Since the system takes some seconds to actually spawn a new task and adjust the
deadline prediction accordingly, the current implementation has a threshold of 30
seconds between each increase. The plots and lines have the same meaning as in Figure
5.9. The first violation of the deadline occurs at around 3 minutes into the job and
therefore one more task is started in parallel. This can be tracked via the histogram,
which shows 3 parallel tasks at this point. However, the negative impact of the straggler
is still in effect, so the algorithm decides that two more increases are necessary: one
more task every 30 seconds. With five parallel mappers the prediction stays slightly
under the deadline, which would have been violated as the plot without elastic resource
management shows. However, the prediction only goes down slowly because the
straggling task still has its part in the calculation. With a backup task for the straggling
one and a replacement in the end time prediction, i.e. eliminating the straggler from the
calculation and using the backup task instead, the prediction would be more realistic
here. However, due to timing constraints this kind of speculative execution was not
implemented. But even without replacing the straggling task, the dynamic resource
increase still managed to hold the deadline. When the reduce phase starts (marked by
the purple line), the prediction spikes due to a long task started right at the beginning.
This spike in the prediction violated the deadline and triggered a resource increase in the
reduce phase. The algorithm here is the same as for the mappers, so the resources are
increased within the 30 second threshold as long as the prediction says the deadline will
be missed. For this experiment a fine-grained approach was chosen to decide how many
additional tasks should be spawned: per deadline violation within the threshold, one
additional task is executed. This can be risky for short jobs due to slow convergence
below the deadline, but for the example job it turned out to be enough: the end time
prediction after the first wave of resource increases remains at around 800 seconds (with
a few little outliers), which corresponds to the actual finishing time of the job.


Chapter 6
Summary of insights and conclusion
This chapter draws conclusions about the effectiveness of elastic resource allocation in
different scenarios, based on the conducted experiments and the previous work in the
thesis.

6.1 Optimal conditions and limitations of elastic resource allocation and run time predictions
In general, the experiments showed that the more deterministic and evenly distributed the input data going to the mappers and reducers is, the better the predictions of the job finishing time become. Equally well connected nodes (bandwidth-wise) with the same computing and storage capabilities also help in getting more exact predictions with a passable amount of effort. Adjusting the prediction algorithm so that it can keep track of largely heterogeneous clusters might be possible, but it would also increase the overall complexity of the resulting system. In addition to knowing how the nodes in the cluster are connected to each other and having a good idea about their individual computing capabilities beyond what YARN exposes to the tenant in its current state (for example, the number of CPU cores might not be sufficient since this information does not provide hints about the actual speed, just about the exploitable parallelism), it is also important to have a good idea about the structure of the job. This essentially means knowing roughly how big the computational effort for each task is in relation to its input size, how much load the job puts on the cluster, and approximately how the reduce phase compares to the map phase in terms of run time.
In the case of the word count job used for benchmarking the proof of concept implementation, the amount of work each mapper has to do is roughly the same, since the computational effort for splitting sentences into words and grouping them scales roughly linearly with the size of the input data. The picture is very different, however, when it comes to the workload of a reducer. A reducer that needs to process a large group, for example words starting with 'e' when processing an English text, will take considerably longer than a reducer that receives a small input group like 'y'. As the histogram of the experiment showed, the time each reduce task took is distributed unevenly. If this is known beforehand for the given job, it can also help to achieve better prediction results by tuning the fraction of time the reduce phase will presumably take accordingly.
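
As a purely illustrative sketch of how such a tunable phase fraction could enter a progress based end time estimate (the method, the parameter names and the example weight are assumptions and not the prediction code of the proof of concept, which is described in Chapter 4):

// Sketch: extrapolating the job end time from phase progress with a tunable
// map/reduce weight. All names and the example weight are illustrative assumptions.
public final class PhaseWeightedEstimate {

    /**
     * @param elapsedMs      time since job start
     * @param mapProgress    fraction of map work finished, 0.0 .. 1.0
     * @param reduceProgress fraction of reduce work finished, 0.0 .. 1.0
     * @param mapWeight      presumed share of the total run time spent in the map phase
     */
    public static long predictEndTimeMs(long elapsedMs, double mapProgress,
                                        double reduceProgress, double mapWeight) {
        double reduceWeight = 1.0 - mapWeight;
        // Overall job progress as a weighted sum of the two phases.
        double jobProgress = mapWeight * mapProgress + reduceWeight * reduceProgress;
        if (jobProgress <= 0.0) {
            return Long.MAX_VALUE; // nothing finished yet, no sensible estimate possible
        }
        // Assume the observed progress rate stays constant for the rest of the job.
        return (long) (elapsedMs / jobProgress);
    }
}

For an English word count whose reduce phase is expected to take roughly a third of the run time, the estimate would be called with a mapWeight of about 0.66.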
Obviously, the absence of stragglers also helps when trying to estimate the end time of the whole job. As shown in the measurements, even a single straggler can have a severe, negative impact on the prediction. Since straggler mitigation is a large field of research in itself, this work only covers the basics here, and the proof of concept implementation in its current state does not handle stragglers sufficiently well to completely nullify their impact, which may be hard anyway.

6.2 Technical and conceptual limitations of the elastic approach
The actual implementation showed that doing elastic resource allocation on top of an existing system that does not support it is hard. Extending the Hadoop Job interface to support it should be possible, but it would certainly take a lot of time, as existing tickets from the project show. This observation also led to the implementation of a custom Application Master that sits directly on top of YARN and can only handle tasks implemented against the task interface developed during the course of this work. One reason for this is the inherent complexity of existing and widely used systems like Hadoop, which currently contains around 2 million lines of code (including HDFS), making it hard to extend for elastic operation. Another aspect is that the tenant usually does not implement a custom Application Master for running normal jobs but uses the bundled Application Master intended for generic Map/Reduce jobs. When changing the existing system so that it can handle dynamic resource allocations upon deadline violations, it is quite likely that old, existing jobs will not continue to work without modifications, which also poses an obstacle.
Apart from the technical difficulties there are also conceptual challenges, most of which were covered in this work. These include how to deal with stragglers, how to make a solid end time prediction, how much offline data the tenant should have to provide in order to make educated assumptions about the job behavior, what should be done in case of a deadline violation, and how to deal with heterogeneous clusters that might be under changing load. The implementation done for this thesis makes use of specific methods proposed in academic work like LATE and acts upon potentially violated deadlines. However, even these relatively small-scale experiments showed that under certain circumstances it is hard to make educated guesses about the end time of the job and about the amount by which to increase the resources in a certain scenario. The conclusion to draw from this is that in an ideal world without stragglers, with a perfectly homogeneous environment and a general idea about the job structure, quite good results are possible. In reality, however, the more factors come into play that cannot be controlled directly, the harder it gets to achieve predictable performance for Map/Reduce and Big Data jobs in general.


Chapter 7
Future Work
This chapter gives an overview of future work and potential improvements to the implementation used for the performance measurements. It also discusses existing theoretical work that was not incorporated into the implementation but could provide potential improvements in terms of methodology or results. In addition, conceptual improvements that became apparent during the course of this work are discussed.

7.1 Improving straggler mitigation


One of the main problems in trying to achieve predictable performance for Big Data jobs is that some tasks of a job, or even a whole node, perform far slower than the average ones. This makes an educated guess about the job finishing time hard and can therefore lead to missed deadlines or to over-provisioning of resources. Better approaches for softening the effect of stragglers, employing a more aggressive form of speculative execution, can potentially help here. In its current state the implementation used for conducting the experiments does not handle extremely slow tasks very well: the calculated finishing time of the job will be way off, and it becomes harder and harder to react to that by just increasing the number of running containers. Generally speaking, at the current state the smaller the impact of stragglers on the running job, the more accurate and sane the predicted end time of the job will be. However, since implementing an intelligent algorithm for the efficient mitigation of stragglers is out of scope for this thesis, this open topic is left for future work. Implementing a more intelligent system that reduces stragglers and thereby improves job latency is discussed in the paper [19] from the 2015 ACM SIGMETRICS Workshop and could be used as a starting point for lessening the impact of stragglers on the job end time prediction.
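
As a starting point for such an extension, the following sketch shows a LATE-style heuristic that selects a backup candidate based on the estimated time left per task. It is an illustrative assumption, not the proof of concept code and not the method from [19]; the TaskStats type, the method names and the slowness factor are invented for the example.

import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Sketch of a LATE-style speculation heuristic: pick the running task with the
// largest estimated time to completion and, if it is clearly slower than the rest,
// launch a backup copy for it.
public final class SpeculationHeuristic {

    /** Minimal view of a running task's progress, as the Application Master sees it. */
    public static final class TaskStats {
        final String taskId;
        final double progress;       // 0.0 .. 1.0
        final long runningTimeMs;

        TaskStats(String taskId, double progress, long runningTimeMs) {
            this.taskId = taskId;
            this.progress = progress;
            this.runningTimeMs = runningTimeMs;
        }

        /** Estimated remaining time, assuming the observed progress rate stays constant. */
        long estimatedTimeLeftMs() {
            if (progress <= 0.0) return Long.MAX_VALUE;
            double ratePerMs = progress / runningTimeMs;
            return (long) ((1.0 - progress) / ratePerMs);
        }
    }

    /** Returns the task that should get a backup copy, if any. */
    public static Optional<TaskStats> pickBackupCandidate(List<TaskStats> running,
                                                          double slownessFactor) {
        if (running.size() < 2) {
            return Optional.empty();
        }
        double avgTimeLeft = running.stream()
                .mapToLong(TaskStats::estimatedTimeLeftMs)
                .average()
                .orElse(0.0);
        return running.stream()
                .max(Comparator.comparingLong(TaskStats::estimatedTimeLeftMs))
                // Only speculate when the slowest task is clearly behind the others.
                .filter(t -> t.estimatedTimeLeftMs() > slownessFactor * avgTimeLeft);
    }
}

A backup copy of the returned task could then be scheduled on a node that is currently not running a slow task, and the straggler could be dropped from the end time prediction as discussed in Section 5.6.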



7.2 Better identification of performance bottlenecks


One important part of trying to guarantee that a given deadline is met is to react properly if the job begins to slow down at some point in time. The approach taken in this thesis focused on increasing the number of simultaneously running containers, which turned out to work fine when there are not too many stragglers and/or slow nodes involved. However, this approach is too coarse-grained when the job slows down for a specific reason, for example because bandwidth within the cluster has become scarce or because one node has a busy processor due to other software running outside the control of the Hadoop instance. In order to react more intelligently to these kinds of conditions, it would be beneficial to implement an algorithm that tries to estimate why certain tasks are running slower than their counterparts. For example, if the fetch phase for one reducer suddenly takes considerably longer to complete, the algorithm could deduce that there might be a bandwidth problem. However, that could also indicate a problem with the underlying storage of a certain node, so a more elaborate algorithm would be necessary here.
If such an algorithm were in place and provided relatively accurate guesses about the actual reasons why the job slowed down, the Application Master could react to them more efficiently. For example, it could avoid spawning network-heavy backup tasks on a node that presumably suffers from low network bandwidth, while still assigning computationally heavy tasks to it, which do not put a lot of stress on the network. In addition, it could request more resources of the specific type that is lacking in order to speed up the container execution without wasting other resources. Proper knowledge of exactly which resources are missing, and therefore why the job is slowing down, could lead to more efficient cluster utilization while still being able to meet a given deadline.
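
A very rough sketch of what such a classification could look like, based only on per-phase timings, is shown below. The phase breakdown, the categories and the threshold factor are illustrative assumptions and not part of the proof of concept implementation.

// Rough sketch of a heuristic that guesses why a task is slow by comparing its
// per-phase timings against the average of its peers.
public final class BottleneckClassifier {

    public enum Bottleneck { NETWORK, CPU, STORAGE, UNKNOWN }

    /** Per-task timing breakdown as it could be reported to the Application Master. */
    public static final class TaskPhaseTimes {
        final long fetchMs;    // time spent fetching input over the network
        final long computeMs;  // time spent in user code
        final long spillMs;    // time spent writing intermediate data to disk

        public TaskPhaseTimes(long fetchMs, long computeMs, long spillMs) {
            this.fetchMs = fetchMs;
            this.computeMs = computeMs;
            this.spillMs = spillMs;
        }
    }

    /** Compare a slow task against the average of its peers and guess the bottleneck. */
    public static Bottleneck classify(TaskPhaseTimes slow, TaskPhaseTimes peerAverage,
                                      double factor) {
        if (slow.fetchMs > factor * peerAverage.fetchMs) {
            return Bottleneck.NETWORK;   // fetch phase much longer: likely scarce bandwidth
        }
        if (slow.spillMs > factor * peerAverage.spillMs) {
            return Bottleneck.STORAGE;   // spilling much longer: likely a slow or busy disk
        }
        if (slow.computeMs > factor * peerAverage.computeMs) {
            return Bottleneck.CPU;       // pure compute much longer: likely a busy processor
        }
        return Bottleneck.UNKNOWN;
    }
}

Based on the returned category, the Application Master could, for example, avoid placing network-heavy backup tasks on a node whose tasks are classified as network-bound.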

7.3 Employ more sophisticated methods for job finishing estimation
For the practical experiments a slightly modified version of LATE, coupled with Amdahl's Law, was used to extrapolate the estimated finishing time of a single task up to the whole job. The approach is described in more detail in Chapter 4. If the predicted end time violates the given deadline, the resources allocated to the job are increased. However, if the tasks of a job vary greatly in their individual running times, as shown for the reducers in Figures 5.9 and 5.10, the approach using Amdahl's Law does not work very well and quickly becomes inaccurate. Future work would be to find a more flexible method for predicting the end time of a job, enabling the Application Master to react more efficiently (e.g., without wasting too many resources or missing the deadline) to changed job conditions. This has proven to be a hard problem, mainly because the behavior of the tasks that have not yet started is unknown and other factors like straggling nodes complicate it further. This was also observed by the Jockey team [7] and became apparent during the proof of concept implementation. In general, a pure online approach is difficult because too many unknown factors play a role here, such as unevenly distributed task sizes, stragglers, scarce bandwidth, or the fact that the actual relationship between the size of the input data and the computational effort needed to process it is often unknown. Because of these problems, an online approach combined with previously collected profiling data seems like a more accurate solution, but the pure online approach should not be completely written off, since the estimate could certainly be improved by collecting more detailed information during job run time.
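
For reference, the textbook form of Amdahl's Law that underlies this kind of extrapolation is

S(n) = \frac{1}{(1 - p) + p / n}

where p is the fraction of the job that can run in parallel and n is the number of parallel tasks; the exact combination with the LATE-based per-task estimate used in this thesis is described in Chapter 4, so this is only the generic formula and not the thesis-specific variant. It also makes the limitation visible: when stragglers or uneven task sizes effectively reduce p, adding more containers (increasing n) yields quickly diminishing returns, which is consistent with the difficulties described above.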

7.4 Improving the proof of concept implementation


As previously mentioned in Section 4.4, the proof of concept implementation is quite specific to the word count job used in the experiments. While it was designed to be as task independent as possible, there are still limitations in this regard. Additionally, there are open issues regarding communication overhead, e.g. the control messages that are sent between the Application Master and its tasks. The following list provides a starting point for improving specific aspects of the proof of concept implementation and for making it more general, so that it becomes easier for a potential user to deploy his or her own task that can make use of deadline-aware, elastic resource management:

- Improve the abstract classes MapTaskWordCount and ReduceTaskWordCount. The provided methods currently lack the versatility to be used for completely different kinds of jobs. As an example, the fetch() method prototype for the mapper class forces the task to read all of its input at once into a byte array. This may cause problems for tasks whose input data exceeds the amount of memory the task can use. A solution that supports buffered reading would largely solve this problem and improve the applicability of the task interface (see the sketch after this list).

- Add more implicit functionality to the TaskBase abstract class. In its current state the user has to implement the actual data transfer between the mappers and the reducers by hand. This is acceptable for measurements whose goal is to show that the system works conceptually, but it is not feasible for day-to-day use with different kinds of tasks. Here it might be beneficial to take the same approach as Hadoop: let the user define the format of the input data through a generic class similar to the TextInputFormat that comes bundled with Hadoop, and then use a readily implemented, largely operating system agnostic method for data transfer. This would make it a lot easier to write custom tasks, since it frees the user from the burden of implementing their own data transfer.

- Reduce communication overhead. In its current state the proof of concept implementation opens a new TCP connection for every message. While this works, it is very inefficient and could cause problems for jobs that involve a large number of tasks. The alternative would be to keep the socket open on both sides with a reasonable keepalive time.

- Add the ability to track jobs more easily. While the resource usage of the job can be monitored with the tools Hadoop already provides, like the web interface, it would be beneficial to also be able to monitor the current end time prediction for the job and to keep track of the number of resource increases. However, the only possibility the proof of concept implementation offers at the moment is the log output that the Application Master generates, so the user needs to open a remote connection via SSH to the node it runs on and follow the log file with tools like tail or less. The ideal solution would be to extend the web interface Hadoop already has and add all necessary information to it.

- Enable the bandwidth control mechanisms. The underlying code, e.g. the message from the Application Master that the tasks understand and the operating system dependent code that actually sets the bandwidth limits, is in place, but it is not wired up yet. There is also the conceptual problem that limiting the bandwidth of one task is not sufficient for actual bandwidth guarantees: other processes would also have to be prevented from using an amount of bandwidth that potentially breaks the guarantee. For future work this feature of the proof of concept implementation therefore has to be planned out more carefully.
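
Regarding the first item in the list above, the following is a sketch of what a stream-based replacement for the byte-array fetch() could look like. The interface name and the signatures are assumptions for illustration; the existing task interface differs from this.

import java.io.IOException;
import java.io.InputStream;

// Sketch of a stream-based input contract for mapper tasks, replacing a fetch()
// method that returns the whole input as a byte array.
public interface BufferedInputTask {

    /**
     * Called by the task runtime with a stream over the task's input split.
     * The implementation reads the data in chunks instead of loading it all
     * into memory at once, so input larger than the task's memory limit works.
     */
    void processInput(InputStream input) throws IOException;
}

// Example word-count-style implementation that reads the input in fixed-size chunks.
class ChunkedWordCountMapper implements BufferedInputTask {

    private static final int BUFFER_SIZE = 64 * 1024; // 64 KiB read buffer

    @Override
    public void processInput(InputStream input) throws IOException {
        byte[] buffer = new byte[BUFFER_SIZE];
        int read;
        while ((read = input.read(buffer)) != -1) {
            // Hand each chunk to the actual mapping logic; splitting words across
            // chunk boundaries would still need to be handled by the real task.
            handleChunk(buffer, read);
        }
    }

    private void handleChunk(byte[] data, int length) {
        // ... tokenize the chunk and update word counts (omitted in this sketch) ...
    }
}

A buffered contract like this keeps the memory footprint of a mapper bounded regardless of the size of its input split.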


References
[1] G. M. Amdahl. Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities. Technical report, IBM Sunnyvale, California, 1967.

[2] G. Ananthanarayanan, A. Ghodsi, S. Shenker, and I. Stoica. Effective Straggler Mitigation: Attack of the Clones. Technical report, University of California, Berkeley and KTH/Sweden, 12 2010.

[3] R. Chaiken, B. Jenkins, P.-A. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets. Technical report, Microsoft Research, 2008.

[4] G. DeCandia, D. Hastorun, M. Jampani, and G. K. et al. Dynamo: Amazon's Highly Available Key-value Store. Technical report, Amazon.com, 2007.

[5] C. Delimitrou and C. Kozyrakis. Quasar: Resource-Efficient and QoS-Aware Cluster Management. Technical report, Stanford University, 12 2009.

[6] ECMA International. ECMA-404: The JSON Data Interchange Format. Technical report, ECMA International, CH-1204 Geneva, 2013.

[7] A. D. Ferguson, P. Bodik, S. Kandula, E. Boutin, and R. Fonseca. Jockey: Guaranteed Job Latency in Data Parallel Clusters. Technical report, Microsoft Research, 2011.

[8] P. Helland. Cosmos: Big Data and Big Challenges. Technical report, Microsoft Research, 2010.

[9] B. Hindman, A. Konwinski, and M. Z. et al. Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center. Technical report, University of California, Berkeley, 12 2009.

[10] Oracle Inc. Oracle documents: Class UUID. https://docs.oracle.com/javase/7/docs/api/java/util/UUID.html, 2016. [Online; accessed 15-July-2016].

[11] V. Jalaparti, P. Bodik, I. Menache, and S. R. et al. Network-Aware Scheduling for Data-Parallel Jobs: Plan When You Can. Technical report, University of Illinois and Microsoft Research, 2015.

[12] M. Leben. CouchDB - relaxed web application development. Technical report, Hasso-Plattner-Institut, University of Potsdam, 2013.

[13] M. L. Moreno. Frequency Analysis in Light of Language Innovation. Technical report, Department of Mathematics, UC San Diego, 2005.

[14] Red Hat. Red Hat Resource Management Guide. https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/ch01.html, 2016. [Online; accessed 15-July-2016].

[15] B. Saha, H. Shah, S. Seth, and C. C. et al. Apache Tez: A Unifying Framework for Modeling and Building Data Processing Applications. Technical report, Hortonworks, Santa Clara, CA, 2015.

[16] K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The Hadoop Distributed File System. Technical report, Yahoo!, Sunnyvale, California, 2007.

[17] W. Tan. YARN-1197: Support changing resources of an allocated container. https://issues.apache.org/jira/browse/YARN-1197, 2016. [Online; accessed 15-July-2016].

[18] The Linux Documentation Project. Introduction to Linux Traffic Control. http://tldp.org/HOWTO/Traffic-Control-HOWTO/intro.html, 2016. [Online; accessed 15-July-2016].

[19] D. Wang, G. Joshi, and G. Wornell. Using Straggler Replication to Reduce Latency in Large-scale Parallel Computing. Technical report, Two Sigma Investments and EECS Dept., MIT, 2015.

[20] N. J. Yadwadkar and W. Choi. Proactive Straggler Avoidance using Machine Learning. Technical report, University of California, Berkeley, 2015.

[21] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster Computing with Working Sets. Technical report, University of California, Berkeley, 12 2009.

[22] M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz, and I. Stoica. Improving MapReduce Performance in Heterogeneous Environments. Technical report, University of California, Berkeley, 12 2008.

List of Figures
2.1   YARN architecture, source: http://hadoop.apache.org . . . . . . . . . 11
3.1   High Level overview of LATE algorithm . . . . . . . . . . . . . . . . . . 15
3.2   Estimation of finishing time per task . . . . . . . . . . . . . . . . . . . . 17
3.3   Example usage of a DAG in TEZ, source: https://tez.apache.org/ . . . . 23
4.1   Architecture overview of implementation used for measurements . . . . . 26
4.2   High Level overview of the task interface . . . . . . . . . . . . . . . . . . 31
5.1   Exemplary runtime behavior of a map/reduce job . . . . . . . . . . . . . 43
5.2   Behavior for one reduce task . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.3   Inaccurate assumption due to inappropriate task phase weighting . . . . 45
5.4   Predicted finishing time of the job using Amdahl's Law . . . . . . . . . . 46
5.5   Add one parallel task after around 50% job progress . . . . . . . . . . . . 47
5.6   Job behavior without resource management and no stragglers . . . . . . 48
5.7   Job behavior with resource management and no straggling task . . . . . 50
5.8   Data distribution to each reducer . . . . . . . . . . . . . . . . . . . . . . 51
5.9   Ignoring deadline violation . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.10  Reacting on deadline violation with straggler . . . . . . . . . . . . . . . . 53