
Efficient Provisioning of Bursty Scientific Workloads on the Cloud Using Adaptive Elasticity Control

Ahmed Ali-Eldin, Johan Tordsson, and Erik Elmroth
Department of Computing Science
Umeå University, Sweden

Maria Kihl
Lund Center for Control of Complex
Engineering Systems
Lund University, Sweden


Motivation & Problem definition

The cloud elasticity problem

How much capacity to (de)allocate to a
cloud service (and when)?
Bursty and unknown workload
Increase ability to meet SLAs
Reduce resource usage
One of the limitations identified by Truong
et al. [1] to the wide adoption of the

Problem Description
Prediction of load/signal/future is not a new problem
Studied extensively within many disciplines
Time series analysis
Control theory
Stock markets
Biology, etc.
Multiple solutions proposed to prediction problem
Neural networks
Fuzzy logic
Adaptive control
Kriging models
<your favorite machine learning technique>
However, solution must be suitable for our problem

Vary capacity allocated to a service

According to current and future load

Fulfill QoS requirements to meet SLAs
Without costly over-provisioning
Avoid oscillations or behavioral changes
Tens of thousands of servers + even more VMs
Adaptive to changing workloads
PID controllers are reliable for certain load patterns,
but unstable once the load or system dynamics change
Limited look-ahead control accurate but too slow
Can take 30 min to control 15 servers and 60 VMs

Key to adoption

Our approach:
Adaptive Hybrid control
Closed loop control
Adaptive control:

Adjust error signal by gain parameter
Error signal is the difference between current and
desired output
Change signal adjustments with load dynamics
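The closed-loop idea above can be sketched in a few lines. This is a minimal illustration, not the controller from the talk: the function name `proportional_adjustment` and the sign convention (error = desired minus current) are assumptions made for the example.

```python
def proportional_adjustment(desired: float, current: float, gain: float) -> float:
    """Return the control adjustment: the error signal scaled by a gain.

    Assumption: the error is taken as desired minus current output, so a
    positive result means "add capacity" and a negative one "remove capacity".
    """
    error = desired - current  # error signal
    return gain * error        # proportional (P) adjustment


# An adaptive controller would recompute the gain as the load dynamics change;
# here two fixed gains are compared for the same error signal.
cautious = proportional_adjustment(desired=12.0, current=10.0, gain=0.5)
aggressive = proportional_adjustment(desired=12.0, current=10.0, gain=1.0)
```

With a gain of 0.5 only half of the observed gap is corrected in one step, while a gain of 1.0 closes it completely.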

Hybrid control, a controller that combines

Reactive control (step controller)

Proactive control (proportional, P-controller)

Initial model and assumptions

Service with homogeneous requests
Short requests that take one time unit (or
less) to serve
VM startup time is negligible
Delayed requests are dropped
VM capacity constant
Infrastructure modeled as G/G/N queue
N (#VMs) varies over time
Perfect load balancing assumed

A. Ali-Eldin, J. Tordsson, and E. Elmroth. An adaptive hybrid elasticity controller for cloud
infrastructures. In NOMS 2012, IEEE/IFIP Network Operations and Management Symposium. IEEE,

Model and assumptions


Homogeneous requests
Short requests that take one time unit
(or less)
Machine startup time is negligible
Delayed requests are dropped
Constant machine capacity
Infrastructure modeled as G/G/N queue

N (#VMs) varies over time

Perfect load balancing assumed
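A rough discrete-time sketch of these assumptions follows. It is a simplification rather than a full G/G/N queueing model: each request is assumed to take exactly one time unit, capacity is pooled across VMs (perfect load balancing), and whatever cannot be served in its arrival slot is dropped. The function name `simulate` and its parameters are hypothetical.

```python
def simulate(load, vms, vm_capacity):
    """Serve a per-time-unit load with a time-varying number of VMs.

    load        -- requests arriving in each time unit
    vms         -- number of VMs (N) running in each time unit
    vm_capacity -- constant number of requests one VM serves per time unit
    Returns (served, dropped) lists, one entry per time unit.
    """
    served, dropped = [], []
    for arrivals, n in zip(load, vms):
        capacity = n * vm_capacity        # pooled capacity of all N VMs
        done = min(arrivals, capacity)    # short requests finish within the slot
        served.append(done)
        dropped.append(arrivals - done)   # delayed requests are dropped
    return served, dropped


served, dropped = simulate(load=[900, 1200, 400], vms=[2, 2, 1], vm_capacity=500)
```

In the second time unit two VMs offer 1000 requests of capacity against 1200 arrivals, so 200 requests are dropped.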

Our approach (cont.)

Adaptive control (cont.)

How to estimate change in workload?
load change

Gain parameter

Average capacity in last time window

Window size changes dynamically
Smaller upon prediction errors

A tolerance level decides how often the window is resized

Two gain parameter alternatives studied

1. Periodical rate of change:
   - P = Load change / avg. rate in last time window
   - Denoted P_1 henceforth
2. Ratio of load change over average system rate:
   - P = Load change / avg. rate over all time
   - Denoted P_2 henceforth
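The two gain alternatives differ only in the averaging horizon. The sketch below assumes that "rate" is a per-time-unit series of observed system rates and that the window length is given as a number of samples; the names `gain_p1` and `gain_p2` are illustrative.

```python
from statistics import mean


def gain_p1(load_change, rates, window):
    """P_1: load change over the average rate in the last time window."""
    if window <= 0:
        raise ValueError("window must be positive")
    return load_change / mean(rates[-window:])


def gain_p2(load_change, rates):
    """P_2: load change over the average rate across the whole history."""
    return load_change / mean(rates)


rates = [100, 100, 200, 400]   # hypothetical rate history
p1 = gain_p1(150, rates, window=2)
p2 = gain_p2(150, rates)
```

With this history, P_1 uses only the last two samples (average 300) while P_2 uses all four (average 200), so P_1 tracks recent behaviour and P_2 smooths over the full trace.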

Hybrid control (cont.)

All in all, 9 approaches for

scale up (U) and scale down (D)

Reactively (R) and/or Proactively (P)

UR combined with:

UP combined with:

URP combined with:


Notation in the following:

Scale up: reactive + proactive
Scale down: proactive
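The nine combinations follow from pairing three scale-up modes with three scale-down modes. The snippet below enumerates them; the hyphenated label format is an assumption modelled on the "UR-DR" label used in the results.

```python
from itertools import product

# R = reactive, P = proactive, RP = reactive + proactive
MODES = ("R", "P", "RP")

# Pair every scale-up (U) mode with every scale-down (D) mode.
combinations = [f"U{up}-D{down}" for up, down in product(MODES, MODES)]
```

Under this labelling, "scale up: reactive + proactive, scale down: proactive" would be written URP-DP.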

Performance Evaluation
Simulation-based evaluations
3 aspects studied

1. Best combination of reactive and proactive

2. Controller stability w.r.t. workload size
3. Comparison with state-of-the-art controller
   - Regression control [Iqbal et al., FGCS 2011]

Performance metrics

- Over-provisioning: VMs allocated but not needed
- Under-provisioning: VMs needed, but failed to allocate (SLA violation)
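One possible way to compute the two metrics from a simulation run is sketched below. The normalisation (percent of the total number of VMs needed over the run) is an assumption; the slides do not state how the percentages are derived.

```python
def provisioning_metrics(allocated, needed):
    """Return (over %, under %) for per-time-unit VM counts.

    Over-provisioning counts VMs allocated but not needed; under-provisioning
    counts VMs needed but not allocated. Both are expressed relative to the
    total number of VMs needed (an assumed normalisation).
    """
    over = sum(max(a - n, 0) for a, n in zip(allocated, needed))
    under = sum(max(n - a, 0) for a, n in zip(allocated, needed))
    total = sum(needed)
    return 100.0 * over / total, 100.0 * under / total


over_pct, under_pct = provisioning_metrics(allocated=[4, 5, 5, 3], needed=[4, 6, 4, 2])
```

Here 2 surplus VM-slots and 1 missing VM-slot against 16 needed give 12.5% over-provisioning and 6.25% under-provisioning.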

Studied workload
FIFA98 traces

~3 month Web server traces (bursty)

Grouped requests per second of arrival
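Grouping a trace by second of arrival can be sketched as follows. The input is assumed to be a list of non-negative arrival timestamps in seconds; the actual format of the FIFA98 logs is not described in the slides, so parsing is left out.

```python
from collections import Counter


def requests_per_second(timestamps):
    """Count requests in each one-second bucket between first and last arrival."""
    if not timestamps:
        return []
    counts = Counter(int(t) for t in timestamps)   # truncate to whole seconds
    first, last = min(counts), max(counts)
    # Seconds without any arrival are reported as 0 so the series has no gaps.
    return [counts.get(second, 0) for second in range(first, last + 1)]


series = requests_per_second([10.1, 10.7, 11.0, 13.2])
```

The result is a load series with one entry per second, which is the form a controller operating in fixed time units needs.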

Best controller combination

Scaled FIFA traces x 50

Reasonable given Internet growth from 1998 to today

Assume that 1 VM handles 500 requests/s

Reasonable for DB-backend Web servers
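The arithmetic behind this setup is simple enough to show directly. The sketch assumes the 500-request figure is per second and that the required VM count is the scaled load divided by the VM capacity, rounded up; `vms_needed` is a hypothetical helper.

```python
def vms_needed(requests_per_sec, scale=50, vm_capacity=500):
    """Scale the trace and convert each load sample into a required VM count."""
    # -(-x // y) is integer ceiling division, avoiding float rounding issues.
    return [-(-(r * scale) // vm_capacity) for r in requests_per_sec]


needed = vms_needed([10, 25, 0, 101])
```

For example, 25 requests/s in the original trace become 1250 requests/s after scaling, which requires 3 VMs at 500 requests/s each.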

Studied (for the sake of completeness) all 9

combinations of reactive + proactive controllers

Some make no sense & indeed performed poorly:

Reactive scale down causes oscillations and a lot of
under-provisioning (SLA violations)
Pure proactive scale up tends to skew and cause
Other approaches more promising:
Reactive scale up
Fast reaction to load increases, no skew
Proactive scale-down
Keep VMs for a while (just in case) once they are allocated
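The behaviour described above (scale up at once, release VMs only gradually) can be illustrated with a toy loop. This is not the controller from the talk: the gradual release, in which a fixed fraction `gain` of the surplus is freed per step, is a stand-in for the proactive scale-down, and VM startup is assumed to be instantaneous.

```python
def hybrid_allocation(needed, gain=0.5):
    """Toy hybrid policy: immediate scale-up, gain-limited scale-down."""
    allocated = []
    current = 0
    for n in needed:
        if n > current:
            current = n                        # reactive scale-up: match demand now
        else:
            surplus = current - n
            current -= int(gain * surplus)     # release only part of the surplus
        allocated.append(current)
    return allocated


allocation = hybrid_allocation(needed=[2, 6, 3, 3, 3], gain=0.5)
```

After the burst to 6 VMs the allocation decays step by step instead of dropping straight to 3, trading some over-provisioning for protection against a second burst.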

Best combination (cont.)
Baseline: UR-DR

1.63% under-provisioning
1.4% over-provisioning

Best combination (cont.)

0.41% under-provisioning (1.63% for UR-DR)

9.44% over-provisioning (1.4% for UR-DR)

Best combination (cont.)

0.18% under-provisioning (1.63% for UR-DR)

14.33% over-provisioning (1.4% for UR-DR)

Stability w.r.t. workload size

Multiplied FIFA traces by X = 10, 20, ..., 60

Assume that 1 VM handles 10*X requests/s
Studied UR-DR, UR-DP_1, UR-DP_2



Reactive stable (no surprise)

Proactive controller prediction quality varies with workload
Error in over-provisioning grows slower than workload size

Comparison with regression

Regression-based control:

Scale up: reactively, Scale down: regression

2nd order regression based on full workload history
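To make the comparison concrete, a second-order least-squares fit over the full history can be written without external libraries. This is a generic quadratic regression, not the implementation of Iqbal et al.; time is assumed to be indexed 0, 1, 2, ... and the function names are illustrative.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_quadratic(ys):
    """Least-squares fit of y = a*t^2 + b*t + c over t = 0 .. len(ys)-1."""
    n = len(ys)
    if n < 3:
        raise ValueError("need at least three samples")
    s = [sum(t ** k for t in range(n)) for k in range(5)]            # sums of t^k
    r = [sum(y * t ** k for t, y in enumerate(ys)) for k in range(3)]  # sums of y*t^k
    m = [[s[4], s[3], s[2]], [s[3], s[2], s[1]], [s[2], s[1], s[0]]]   # normal equations
    rhs = [r[2], r[1], r[0]]
    d = det3(m)
    coeffs = []
    for col in range(3):              # Cramer's rule, one unknown per column
        mc = [row[:] for row in m]
        for i in range(3):
            mc[i][col] = rhs[i]
        coeffs.append(det3(mc) / d)
    return tuple(coeffs)              # (a, b, c)


def predict_next(ys):
    """Extrapolate the fitted curve one time unit past the history."""
    a, b, c = fit_quadratic(ys)
    t = len(ys)
    return a * t * t + b * t + c


history = [t * t + 2 * t + 3 for t in range(6)]   # synthetic, exactly quadratic
forecast = predict_next(history)
```

Because every prediction refits the whole history, the work per control decision grows with the length of the trace.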

Evaluation on selected (nasty) part of FIFA trace

2.99% under-provisioning, 19.57% over-provisioning
2.24% under-provisioning, 47% over-provisioning
1.51% under-provisioning, 32.24% over-provisioning
1.07% under-provisioning, 39.75% over-provisioning

Controller performance (execution time)

Regression: 0.98s on average, up to 6.5s observed

Our approach: 0.6 ms on average

P-control promising approach to cloud elasticity

Accurate predictions
Controller execution time in ms
Copes with changes in workload dynamics

No one-size-fits-all controller

Tradeoff between over- and under-provisioning

Costs for SLA violation (under-provisioning) and
resource wastage (over-provisioning) decide
which strategy to use
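The trade-off can be made explicit with a simple cost comparison. The sketch below uses the under- and over-provisioning percentages reported in the "Best combination" results; only the UR-DR baseline is named there, so the other two entries carry placeholder labels. The linear cost model and the cost weights are assumptions.

```python
# (under-provisioning %, over-provisioning %) as reported in the results
RESULTS = {
    "UR-DR (baseline)": (1.63, 1.4),
    "second configuration": (0.41, 9.44),   # placeholder label
    "third configuration": (0.18, 14.33),   # placeholder label
}


def cheapest(results, cost_under, cost_over):
    """Pick the configuration with the lowest weighted provisioning cost."""
    def total_cost(name):
        under, over = results[name]
        return cost_under * under + cost_over * over
    return min(results, key=total_cost)


equal_costs = cheapest(RESULTS, cost_under=1, cost_over=1)
costly_sla = cheapest(RESULTS, cost_under=100, cost_over=1)
```

When SLA violations and wasted resources cost the same, the reactive baseline comes out cheapest; once a violation is weighted 100 times higher, the configuration with the least under-provisioning wins despite its larger surplus.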