The remainder of the paper is organized as follows. We discuss related work in Section II. We formulate the MSJO problem and analyze its complexity in Section III. In Section IV, we present a set of algorithms. We evaluate our algorithms in Section V. In Section VI, we show an implementation of our scheme in Hadoop and our evaluation on Amazon EC2. Finally, Section VII concludes the paper.

II. RELATED WORK

Due to the wide usage of MapReduce systems, there is a wealth of studies on understanding MapReduce performance, and many have developed various improvement schemes. From a systems point of view, there are many valuable advances in improving data shuffling [7], in-network aggregation [8], etc. From an algorithmic point of view, people are looking into MapReduce job/task scheduling with various considerations in different scenarios. Quincy [9] and delay scheduling [10] are proposed to provide fair scheduling for MapReduce systems. Omega [11] is proposed to support cooperation of multiple schedulers in large compute clusters. Zheng et al. [12] propose a MapReduce scheduler with provable efficiency on total flow time. Data locality has also been considered: Wang et al. [13] investigate map task scheduling under heavy traffic, and Tan et al. [14] propose a stochastic optimization framework to optimize the scheduling of reduce tasks.

The two works most closely related to ours are [3][4]. In [3], Chang et al. propose scheduling algorithms for fast completion time. In [4], Chen et al. investigate precedence constraints between map tasks and reduce tasks in a MapReduce job, and propose an 8-approximation algorithm. However, they assume that tasks are assigned to processors/servers in advance. As shown in Fig. 1, scheduling jobs without considering server assignment may result in suboptimal solutions. We fill this gap in this paper.

General scheduling on parallel machines has decades of study, with many works offering inspiring ideas and analytical techniques [15][16]. For minimizing total weighted job completion time under precedence constraints, the best known result is a 4-approximation algorithm in [17]. However, these works focus on the general case.

III. MAPREDUCE SERVER-JOB ORGANIZER: THE PROBLEM AND COMPLEXITY ANALYSIS

A. Problem Formulation

Let J be a set of MapReduce jobs and M a set of identical machines. Let the release time of job j be r_j. The release time is the time a job enters the system; note that it differs from the job start time, since the scheduler may start a job later than its release time. Let T_j^(M) and T_j^(R) be the sets of map tasks and reduce tasks of job j, and let T be the set of all tasks of J. For each task u ∈ T, let p_u be its processing time. We assume that a task cannot be preempted. We also assume that, for any job j, the processing times of its map tasks are smaller than those of its reduce tasks. We admit that this is a key assumption for our bounding development, yet it holds in current practice: every map task simply scatters a chunk of data, while the reduce tasks need to gather, reorganize, and process the data produced by the map tasks. To make the situation worse, the number of reduce tasks is always configured to be much smaller than the number of map tasks. As a result, the processing times of reduce tasks are much longer than those of map tasks. We also validate this assumption in our experiment.

Let d_uv be the delay between a map task u ∈ T_j^(M) and a reduce task v ∈ T_j^(R) (e.g., introduced by the shuffle phase). Let S_u be the start time of task u, and let S be the set of S_u, u ∈ T. Let C_j be the completion time of job j, i.e., the time when all its reduce tasks v ∈ T_j^(R) finish, and let C be the set of C_j, j ∈ J. There is a weight w_j associated with job j, and our objective is to find a feasible schedule that minimizes the total weighted job completion time Σ_{j∈J} w_j C_j subject to the following constraints for every job j:

    S_u ≥ r_j,                  ∀u ∈ T_j^(M)                     (1)
    S_v ≥ S_u + d_uv + p_u,     ∀u ∈ T_j^(M), ∀v ∈ T_j^(R)       (2)
    C_j ≥ S_v + p_v,            ∀v ∈ T_j^(R)                     (3)

Notation   Definition                       Notation   Definition
J          Set of all jobs                  C_j        Completion time of job j
T          Set of all tasks in J            w_j        Weight of job j
|T|        Number of tasks in T             r_j        Release time of job j
M          Set of machines                  T_j^(M)    Map task set of job j
|M|        Number of machines in M          T_j^(R)    Reduce task set of job j
S          Set of task start times          p_u        Processing time of task u
C          Set of job completion times      d_uv       Delay between tasks u and v
S_u        Start time of task u             M_u        Middle finish time of task u

B. Problem Complexity

Theorem 1: MSJO is NP-complete.

Proof: It is easy to verify that the total weighted job completion time of a given schedule can be computed in polynomial time, so MSJO is in NP. To show that MSJO is NP-complete, we reduce a known job scheduling problem (problem SS13 in [18]) to it. Problem SS13 is proven NP-complete and can be stated as follows: given a set J of jobs and a set M of identical machines, every job j has a weight w_j and can be processed without interruption on any machine with processing time p_j. Let C_j be the completion time of job j in a feasible schedule. It is NP-complete to determine a feasible schedule where Σ_{j∈J} w_j C_j is minimized.

Given any instance (J, M) of problem SS13, we construct an instance (J^M, M^M) of MSJO as follows. M and M^M are the same. For every job j ∈ J, there is a job j^M ∈ J^M with the same weight; j^M has one map task with processing time 0 and one reduce task with processing time p_j. The release time of j^M is 0. Thus, if MSJO could be solved optimally by a polynomial-time algorithm, problem SS13 could be solved by the same algorithm. Because problem SS13 is NP-complete, MSJO is NP-complete.

IV. ALGORITHM DEVELOPMENT AND THEORETICAL ANALYSIS

We outline our approach, described in the next three subsections: (1) We introduce a linear programming relaxation to give a lower bound on the optimal solution of MSJO; this LP-based solution may not be feasible. (2) Although there is a polynomial-time algorithm for solving this LP relaxation in theory, its high complexity makes it impractical.
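The formulation of constraints (1)-(3) above can be illustrated with a small checker: given candidate start times, verify the constraints and compute the objective Σ_j w_j C_j. This is our own illustrative sketch, not code from the paper; the data layout (jobs as dicts with 'maps'/'reduces' task-id lists) is an assumption.

```python
# Sketch (not from the paper): validate a candidate MSJO schedule against
# constraints (1)-(3) and compute its total weighted job completion time.

def weighted_completion_time(jobs, S, p, d):
    """jobs: list of dicts with keys 'r' (release time), 'w' (weight),
    'maps' and 'reduces' (task ids); S: task -> start time; p: task ->
    processing time; d: (u, v) -> shuffle delay between map u and reduce v."""
    total = 0.0
    for job in jobs:
        # Constraint (1): map tasks start no earlier than the release time.
        for u in job['maps']:
            assert S[u] >= job['r']
        # Constraint (2): each reduce task waits for every map task plus delay.
        for u in job['maps']:
            for v in job['reduces']:
                assert S[v] >= S[u] + p[u] + d[(u, v)]
        # Constraint (3): the job completes when its last reduce task finishes.
        C_j = max(S[v] + p[v] for v in job['reduces'])
        total += job['w'] * C_j
    return total

# Tiny example: one job with one map task (id 0) and one reduce task (id 1).
job = {'r': 0.0, 'w': 2.0, 'maps': [0], 'reduces': [1]}
S = {0: 0.0, 1: 5.0}          # reduce starts after map + shuffle delay
p = {0: 3.0, 1: 4.0}
d = {(0, 1): 1.0}
print(weighted_completion_time([job], S, p, d))  # 2.0 * (5.0 + 4.0) = 18.0
```

The assertions fail for an infeasible schedule; a real scheduler would search over S rather than merely check it.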
    M_(i) Σ_{k=1}^{i-1} p_(k)  ≥  Σ_{k=1}^{i-1} M_(k) p_(k)  ≥  (1 / (2|M|)) ( Σ_{k=1}^{i-1} p_(k) )^2

Finally, we obtain Equation 5 by eliminating Σ_{k=1}^{i-1} p_(k) from both outer sides.

B. Conditional LP Relaxation and Constraint Generation Algorithm

Note that there are an exponential number of constraints in Equation 4 due to the exponentially many choices of B. In theory, CLS-LPP can be solved in polynomial time using the ellipsoid method with a separation oracle [16]. However, the complexity associated with the ellipsoid method makes it impractical to solve large problems with thousands of decision variables. We derive a new LP-relaxation problem which uses only a small subset of the constraints in Equation 4. The optimal result of this new LP-relaxation problem also leads to the 3-approximation algorithm developed in Section IV-C. Because this new problem is built by iteratively adding constraints based on checking a certain property of its solution, we call it the Conditional LP-relaxation problem (CND-LPP).

Given a solution of CLS-LPP, if we schedule tasks in non-decreasing order of their middle finish times, Property 1 gives us a basic relation between the total processing time of previously scheduled tasks and the middle finish time of the unscheduled tasks. We will use this property to prove the theoretical bound of our algorithm MarS. Note that this property holds for any solution satisfying the constraints in Equation 4. If we can build a CND-LPP whose optimal solution also has this property, MarS has the same theoretical bound based on this CND-LPP.
Recall that the intuition behind Equation 4 is to describe a polyhedron in which the task start times of any feasible schedule lie. We denote this polyhedron as P^L. Given a specific CLS-LPP, its constraints define a polyhedron, denoted as P^CLS. All precedence constraints in Equations 1-3 define another polyhedron, denoted as P^PC. We know that P^CLS = P^L ∩ P^PC (see Fig. 2). The objective function of the problem defines a hyperplane, and solving CLS-LPP amounts to searching for the point in P^CLS with the smallest distance to this hyperplane. Instead of finding the optimal point in P^CLS (point A in Fig. 2), we search for a solution in P^PC (point B in Fig. 2) which has Property 1. We start with an initial CND-LPP that contains only the precedence constraints in Equations 1-3. By iteratively adding well-chosen constraints from Equation 4, we approach the desired solution.

To check whether an optimal solution satisfies Property 1, we can check whether the constraints in Equation 6 are satisfied by this optimal solution for every task σ(i). We formally define these constraints as follows:

Definition: The Performance Guarantee Constraint for a given order σ and index i, denoted as PGC(σ, i), is defined as follows:

    M_{σ(i)}^{LP} ≥ (1 / (2|M|)) Σ_{k=1}^{i-1} p_{σ(k)}    (6)

Thus, we can build a feasible solution by adding this offset to S_{σ(k)}^{NV} for k ∈ [i−1 . . . |T|]. Finally, after all PGCs are checked, we check whether (S^LP, C^LP) reaches the stop threshold ε. In practice, COGE can produce a satisfactory result in fewer than 10 iterations.

Algorithm 1 COGE()
Input: 1) Job set J; 2) Machine set M; 3) Stop threshold ε
Output: A solution (S^LP, C^LP).
 1: Build initial CND-LPP with Equations 1-3;
 2: repeat
 3:   Solve CND-LPP with a linear programming solver;
 4:   Let (S^LP, C^LP) be the optimal solution of the current CND-LPP;
 5:   Let the violated-PGC count vn be 0;
 6:   (S^NV, C^NV) ← (S^LP, C^LP);
 7:   Sort tasks by middle finish time according to S^LP to obtain σ;
 8:   for all i ∈ [1 . . . |T|] do
 9:     if PGC(σ, i) is satisfied by S^LP then
10:       continue;
11:     Add PGC(σ, i) to CND-LPP;
12:     vn = vn + 1;
13:     offset = ViolationOffset(PGC(σ, i), S^LP);
14:     for all k ∈ [i−1 . . . |T|] do
15:       S_{σ(k)}^{NV} = S_{σ(k)}^{NV} + offset;
16:     Update C^NV according to S^NV;
17:   if Σ_{j∈J} w_j C_j^{LP} ≥ (1 − ε) Σ_{j∈J} w_j C_j^{NV} then

Our algorithm MarS is shown in Algorithm 2. We first produce σ based on S^LP. Then we schedule tasks from σ(1)
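The PGC test in the inner loop of COGE can be sketched as follows. This is our own illustrative sketch, not the paper's implementation; it assumes PGC(σ, i) has the form M_{σ(i)}^{LP} ≥ (1/(2|M|)) Σ_{k=1}^{i-1} p_{σ(k)} suggested by Property 1, with σ the non-decreasing order of middle finish times.

```python
# Sketch (assumed form of PGC, not from the paper): find which PGC(sigma, i)
# constraints the current LP solution violates.

def violated_pgcs(M_lp, p, num_machines):
    """M_lp: task -> middle finish time in the current CND-LPP solution;
    p: task -> processing time; returns the 1-based indices i where
    PGC(sigma, i) is violated, with sigma = tasks sorted by middle
    finish time (non-decreasing)."""
    sigma = sorted(M_lp, key=lambda u: M_lp[u])
    violated = []
    prefix = 0.0  # running sum of p over sigma(1) .. sigma(i-1)
    for i, u in enumerate(sigma, start=1):
        if M_lp[u] < prefix / (2 * num_machines):
            violated.append(i)  # this PGC would be added to the CND-LPP
        prefix += p[u]
    return violated

# Example with 2 machines: task 'c' finishes "too early" relative to the
# total processing time scheduled before it.
M_lp = {'a': 1.0, 'b': 2.0, 'c': 2.1}
p = {'a': 4.0, 'b': 6.0, 'c': 1.0}
print(violated_pgcs(M_lp, p, 2))  # [3]: 2.1 < (4 + 6) / (2 * 2) = 2.5
```

In COGE each violated index would trigger adding the corresponding linear constraint to the LP before resolving it.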
Next, we start our formal proof. For a scheduling problem, we can build a precedence graph G = (V, L) to describe all precedence constraints with delay. V is the set of tasks. For two tasks u and v, a directed link (u, v) ∈ L with length d_uv indicates that the execution of v must wait at least a time interval d_uv after u finishes. For our problem, we introduce a dummy initial task t_I to represent the start point of the schedule. Then, the constraints S_u ≥ r_j, u ∈ T_j^(M) in Equation 1 can be expressed as precedence constraints with delay: S_u ≥ S_{t_I} + d_{t_I u} + p_{t_I}, u ∈ T_j^(M), where S_{t_I} = 0, d_{t_I u} = r_j, and p_{t_I} = 0. In the remaining part of this proof, we speak only of precedence constraints with delay.

Given a schedule result, the delay between u and v may be longer than d_uv. We say a link (u, v) ∈ L is tight in this schedule result if S_v^H = S_u^H + d_uv + p_u. In a schedule result, there may be a tight-link path {u_1 → u_2 → ... → u_s} where link (u_k, u_{k+1}) ∈ L is tight for all k ∈ [1 . . . s−1], s ≥ 2. Fig. 3 demonstrates a three-node tight-link path in a schedule result. If u_s = u, we call {u_1 → u_2 → ... → u_s} a tight-link path of task u. Among all tight-link paths of task u, there is one with the maximal number of nodes; we call it the longest tight-link path of task u, denoted LTLP(u). In our problem, for a map task u ∈ T_j^(M), LTLP(u) can be empty or

Because all links in a tight-link path are tight, we also have:

    S_{u_s}^H − S_{u_1}^H = Σ_{k=1}^{s−1} (d_{u_k u_{k+1}} + p_{u_k})    (9)

Based on Equations 8 and 9, we analyze the lengths of idle time intervals on machine m. After scheduling σ(i) to machine h, there are idle time intervals before S_{σ(i)}^H. Considering a task u right after an idle time interval, we find LTLP(u) = {u_1 → u_2 → ... → u_s}. Because u_1 is the first task in LTLP(u_s), there must be a task v scheduled on machine m where S_v ≤ S_{u_1} ≤ S_v + p_v (see task v_3 in Fig. 3) and M_v^{LP} ≤ M_{u_1}^{LP}; otherwise, u_1 could start earlier. With Equations 8 and 9, we have:

    S_{u_s} − (S_v + p_v) ≤ M_{u_s}^{LP} − M_v^{LP}

The left side of this inequality is the maximum length of the idle time interval between u and v. We repeat this development for the idle time interval before v. Finally, we end up with t_I, whose middle finish time is 0. By adding all these inequalities together, we obtain Equation 7.

Lemma 2 gives an upper bound for the total idle time intervals in the scheduling of task σ(i). Note that this result only holds
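The tight-link definition can be turned into a small path-walking sketch: starting from a task u, repeatedly step to a predecessor whose link is tight. This is our own illustration; the graph encoding (a preds map) and the rule for choosing among several tight predecessors are assumptions, not the paper's.

```python
# Sketch (not from the paper): walk back along tight links from task u.
# A link (v, u) is tight when S[u] == S[v] + p[v] + d[(v, u)].

def tight_link_path(u, S, p, d, preds):
    """preds: task -> list of predecessors v with a precedence link (v, u)
    of delay d[(v, u)]. Returns one maximal tight-link path ending at u,
    e.g. [u1, ..., us] with us == u."""
    path = [u]
    while True:
        tight = [v for v in preds.get(path[0], ())
                 if S[path[0]] == S[v] + p[v] + d[(v, path[0])]]
        if not tight:
            return path
        # Follow the tight predecessor with the smallest start time; any
        # tight predecessor extends the path (this choice rule is assumed).
        path.insert(0, min(tight, key=lambda v: S[v]))

# Three-task chain where both links are tight (cf. the three-node
# tight-link path in Fig. 3):
S = {'u1': 0.0, 'u2': 3.0, 'u3': 7.0}
p = {'u1': 2.0, 'u2': 3.0, 'u3': 1.0}
d = {('u1', 'u2'): 1.0, ('u2', 'u3'): 1.0}
preds = {'u2': ['u1'], 'u3': ['u2']}
print(tight_link_path('u3', S, p, d, preds))  # ['u1', 'u2', 'u3']
```

Termination follows because the precedence graph is acyclic, so the walk always reaches a task with no tight predecessor (ultimately the dummy task t_I).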
Theorem 3: Our algorithm is a 3-approximation algorithm for MSJO.

Fig. 6: Performance comparison for the most randomized scenario. All job parameters are in random categories. Machine number is 50.
Fig. 7: Performance comparison for the uniform task number scenario. Other job parameters are in random categories. Machine number is 50.
Fig. 8: Performance comparison for the uniform task processing time scenario. Other job parameters are in random categories. Machine number is 50.
Fig. 9: Performance comparison for the most randomized scenario. All job parameters are in random categories. Machine number is 100.
Fig. 10: Impact of prediction error for the most randomized scenario. Job number is 100. Machine number is 100.
Fig. 11: Experiment results for different job set sizes.
while the iteration number decreases rapidly. Thus, we choose ε = 0.3. H-MARES also needs a stop threshold in solving its LP-relaxation problem. To offer a fair comparison, we choose ε = 0.3 for H-MARES instead of the 0.5 used in the original paper [4].

In our simulation, we assume that the task processing times and the delays between map tasks and reduce tasks are known to the scheduler. There are many studies on accurate processing-time prediction [19][20], and a trace study [21] shows that the majority of map tasks and reduce tasks are highly recurrent, which makes prediction feasible. We leave this as future work.

B. Simulation Result

We first discuss the most randomized scenario, where all parameters are generated randomly. Based on this scenario, we evaluate the impact of different job parameters.

Performance in the most randomized scenario. Fig. 6 shows results where all job parameters are in the random category. We see that MarS consistently offers efficient solutions as the job number changes. In theory, we proved that MarS is a 3-approximation of the theoretical lower bound, meaning that the TWJCT ratios of MarS are at most 3. In practice, MarS shows an increase of less than 0.4 in TWJCT ratio compared with the theoretical lower bound. More specifically, starting at 1.32, MarS increases to 1.39 when the job number is 20 and stays at about 1.38 with a variance of less than 0.02 as the job number further grows. This is because the LP relaxation always produces an optimized task schedule order. H-MARES shows stable performance after the job number reaches 20. However, we see a constant performance difference between H-MARES and MarS, which we attribute to the improvement from joint scheduling.

We see that MarS outperforms the other algorithms by over 40% in most cases. The only exception happens when the job number is 10, where MarS is at 1.32 while H-MARES is at 1.58, HUWF at 1.77, and HJWF at 1.86. After the job number rises to 20, H-MARES increases to 1.96, and HUWF and HJWF jump to 2.17 and 2.25, respectively. This is because when there are fewer jobs, the machines are not extensively loaded, so different scheduling algorithms achieve similar performance. However, when the job number increases and the machines are fully utilized, the algorithms' results diverge. We also notice that H-MARES outperforms HUWF and HJWF in all cases. This is because workload-based allocation performs well, and H-MARES benefits from the optimized task order generated by the LP relaxation conditions.

Impact of task number in a job. Next, we examine the effect of the number of tasks in a job. We generate jobs in the uniform task number category while other job parameters are in the random category. The results are shown in Fig. 7. Compared with the results in Fig. 6, all algorithms achieve better performance. MarS stays at about 1.15, which is 0.17 lower in terms of TWJCT ratio. H-MARES gradually increases from 1.48 to 1.88. HUWF and HJWF still have larger performance variance, but the TWJCT ratio of both algorithms decreases by 0.2 on average. There is also only a tiny performance difference between HUWF and HJWF.

Impact of task processing time. We generate jobs in the uniform task processing time category while other job parameters are in the random category. MarS is very close to the theoretical lower bound. Its maximum distance to the lower bound is 0.08 when the job number is 10. As the job number rises, this distance decreases to less than 0.02. The gap between
Fig. 13: Comparison of task processing times in the experiment (CDF over map and reduce tasks, in the training data and in the MarS result).

C. Experiment Result

Performance of different algorithms. The result is shown in Fig. 11. In job set 1, we see that MarS outperforms the other algorithms: MarS is 0.416 above the lower bound, while H-MARES, HUWF, and HJWF are 0.539, 0.81, and 0.672 above it, respectively. In job set 2, MarS still outperforms the rest of the algorithms. Compared with the results on job set 1, the TWJCT ratios of all algorithms increase; the same trend is reflected in the simulation results. We notice that HJWF suffers more performance degradation than the other algorithms. The reason may be that in job set 2 some jobs process data as large as 9.98 GB; they are relatively big compared to the jobs in job set 1. HJWF schedules big jobs first because their weights are big. However, while these jobs have big weights, their unit weights are small because they take a long time to process their data. As a result, small jobs with big unit weights are delayed. By considering the relation between weight and processing time, MarS, H-MARES, and HUWF do not suffer from this mixture of different job sizes.

Processing times of map tasks and reduce tasks. We show the processing times of map tasks and reduce tasks. The training data and the MarS results are shown in Fig. 13; both contain 950 tasks. In the training data, there is a clear processing-time difference between reduce tasks and map tasks: the processing times of map tasks stay around 80 seconds, while most reduce tasks take over 150 seconds. We also see gaps between the training data and the MarS result. The main reason is that the training data is produced by running jobs in turn, so the cluster is not fully utilized, while MarS schedules multiple jobs simultaneously to fully utilize the cluster. Intensive utilization of the cluster introduces costs from competition for resources such as disk I/O, network bandwidth, etc.

VII. CONCLUSION

In this paper, we studied MapReduce job scheduling with consideration of server assignment. We showed that without such joint consideration, there can be great performance loss. We formulated a MapReduce server-job organizer problem. This problem is NP-complete, and we developed a 3-approximation algorithm, MarS. We evaluated our algorithm through extensive simulations. The results show that MarS can outperform state-of-the-art strategies by as much as 40% in terms of total weighted job completion time. We also implemented a prototype of MarS in Hadoop and tested it with experiments on Amazon EC2. The experiment results confirm the advantage of our algorithm.

REFERENCES

[1] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Commun. ACM, vol. 51, pp. 107-113, Jan. 2008.
[2] D. DeWitt and M. Stonebraker, "MapReduce: A major step backwards," 2008. http://www.databasecolumn.com/2008/01/mapreduce-a-major-step-back.html.
[3] H. Chang, M. Kodialam, R. Kompella, T. V. Lakshman, M. Lee, and S. Mukherjee, "Scheduling in mapreduce-like systems for fast completion time," in Proc. IEEE INFOCOM, 2011.
[4] F. Chen, M. Kodialam, and T. Lakshman, "Joint scheduling of processing and shuffle phases in mapreduce systems," in Proc. IEEE INFOCOM, 2012.
[5] Apache Hadoop, 2013. http://hadoop.apache.org/.
[6] Amazon EC2, 2013. http://aws.amazon.com/cn/ec2/.
[7] J. Zhang, H. Zhou, R. Chen, X. Fan, Z. Guo, H. Lin, J. Y. Li, W. Lin, J. Zhou, and L. Zhou, "Optimizing data shuffling in data-parallel computation by understanding user-defined functions," in Proc. USENIX NSDI, 2012.
[8] P. Costa, A. Donnelly, A. Rowstron, and G. O'Shea, "Camdoop: exploiting in-network aggregation for big data applications," in Proc. USENIX NSDI, 2012.
[9] M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg, "Quincy: fair scheduling for distributed computing clusters," in Proc. ACM SOSP, 2009.
[10] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica, "Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling," in Proc. ACM EuroSys, 2010.
[11] M. Schwarzkopf, A. Konwinski, M. Abd-El-Malek, and J. Wilkes, "Omega: flexible, scalable schedulers for large compute clusters," in Proc. ACM EuroSys, 2013.
[12] Y. Zheng, N. B. Shroff, and P. Sinha, "A new analytical technique for designing provably efficient mapreduce schedulers," in Proc. IEEE INFOCOM, 2013.
[13] W. Wang, K. Zhu, L. Ying, J. Tan, and L. Zhang, "Map task scheduling in mapreduce with data locality: Throughput and heavy-traffic optimality," in Proc. IEEE INFOCOM, 2013.
[14] J. Tan, S. Meng, X. Meng, and L. Zhang, "Improving reducetask data locality for sequential mapreduce jobs," in Proc. IEEE INFOCOM, 2013.
[15] B. W. Lampson, "A scheduling philosophy for multiprocessing systems," Commun. ACM, vol. 11, pp. 347-360, May 1968.
[16] A. S. Schulz, Polytopes and Scheduling. PhD thesis, Technical University of Berlin, 1996.
[17] M. Queyranne and A. S. Schulz, "Approximation bounds for a general class of precedence constrained parallel machine scheduling problems," SIAM Journal on Computing, vol. 35, no. 5, pp. 1241-1253, 2006.
[18] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., 1990.
[19] A. Verma, L. Cherkasova, and R. Campbell, "Aria: automatic resource inference and allocation for mapreduce environments," in Proc. ACM ICAC, 2011.
[20] Y. Yuan, H. Wang, D. Wang, and J. Liu, "On interference-aware provisioning for cloud-based big data processing," in Proc. IEEE/ACM IWQoS, 2013.
[21] E. Bortnikov, A. Frank, E. Hillel, and S. Rao, "Predicting execution bottlenecks in map-reduce clusters," in Proc. USENIX HotCloud, 2012.
[22] Y. Yuan, D. Wang, and J. Liu, "MarS: Exploring performance bounded scheduler for mapreduce jobs (MSJO package)," 2011. http://www4.comp.polyu.edu.hk/~csyiyuan/projects/MarS/MarS.html.