May 2009
Nearly every large enterprise IT shop has virtualized some portion of its infrastructure. With
hardware cost savings approaching 30% annually, IT executives are applauding the initial
success. But can they do better?
While new virtualization platforms proved to be stable, and consolidation of physical resources
was achieved during the initial deployments, the reality of the situation is that VMware
infrastructure remains bloated. Systar has spoken with hundreds of IT executives, virtualization
architects and capacity managers at leading enterprises who have all admitted to only achieving
post-virtualization capacity utilization rates of 10-20%, while their objectives were to safely
reach 50-60%. Reaching the >50% objective cannot be accomplished with default VMware
settings, and requires a new understanding of virtualized capacity.
In this paper, Systar explores five practices for optimizing the utilization of virtualized capacity
without increasing risk to service quality. The practices are:
• Manage capacity at the cluster level
• Size objects correctly
• Place workloads effectively
• Optimize DRS
• Optimize HA
As companies prepare to expand their virtualized environments, Systar sees an opportunity for
IT organizations to improve their understanding of virtualized capacity, reduce new hardware
spending significantly, and meet their utilization objectives safely. By applying the practices
discussed in this paper, Systar sees the initial round of applause transforming into a standing
ovation as virtualization expands across the enterprise while safely meeting the >50% objective.
The vast majority of today’s VMware infrastructure capacity is bloated. Where physical capacity
was over-provisioned on average by a factor of 10, consolidation to virtualized environments
has reduced the footprint of IT infrastructure, but not optimized computing capacity. On
average, virtualized capacity is over-provisioned by a factor of 4 including peak headroom –
representing millions of dollars in over-spending as corporate IT budgets continue to tighten.
Virtualization changes many IT functions but it changes capacity management more than most.
Where capacity in the physical world often focused on a single machine hosting a single
application, VMware clusters – made up of ESX Hosts and Virtual Machines – are the new
“computer” and capacity must be managed accordingly. VMware’s CTO, Steven Herrod,
recently pointed to this notion when he commented that “virtualization is the mainframe for the
21st century”.2
1. The Capacity Planning Software Market: Sustaining Application Performance, by Evelyn Hubbert and Jean-Pierre
Garbani with Thomas Mendel, Ph.D.
2. VMware's vSphere Introduction, Conference Call (2009)
Let’s explore some of the multidimensional aspects of cluster capacity while discussing practices
that can be applied to improve their management.
During the initial rollout of VMware, most attention is typically placed on sizing VMs correctly
and placing workloads carefully, but the approach taken to achieve this is often rudimentary. We
have spoken with many organizations that ignore proper VM sizing and workload placement
methods in favor of placing common Operating Systems on the same machine and using a core-
to-VM ratio rule.
As deployments expand across the enterprise and DRS and HA enabled clusters enter into the
equation, managing capacity of the cluster becomes paramount. This sentiment is echoed in
Gartner, Inc.’s Data Center Conference Survey4, which states “In the long term, Gartner believes
that capacity-planning tools and processes will have to shift their orientation to focusing less on
a single VM or physical server to assist with the sizing of resource pools and clusters.”
3. Whitespace: capacity on a host that cannot be utilized due to alignment of resource requirements, or resources
that cannot be used because they are too small to support a whole VM.
4. Data Center Conference Survey: Addressing the Operational Challenges of Virtual Server Management, by Cameron
Haight (February 2008)
Next, we will look at Entitlements5. The general measure of the capacity and health of a cluster
is the ability of the cluster to deliver the entitled resources to all VMs. If VM Entitlements total
more than the total capacity of the cluster, the cluster is undersized or improperly balanced.
Within the cluster itself, a good measure of a host’s ability to provide expected capacity is the
measure of its total Entitlements vs. total capacity. Unlike Reservations6, the cluster and its
hosts will not identify a violation when Entitlements exceed capacity.
By understanding the sum of all VM Entitlements on a host and within a cluster, VMware
architects and administrators will have a clear picture of the resources being made available to
meet demands on their capacity.
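As a rough sketch of the health check described above (the names and numbers here are hypothetical, not Systar's or VMware's API), summed Entitlements can simply be compared against host capacity:

```python
# Hypothetical sketch: compare the sum of VM Entitlements on a host
# against the host's total capacity. Units are MHz for CPU, MB for memory.

def entitlement_headroom(vm_entitlements, host_capacity):
    """Return remaining capacity per resource (negative means over-committed).

    vm_entitlements: list of {"cpu_mhz": ..., "mem_mb": ...} dicts, one per VM
    host_capacity:   {"cpu_mhz": ..., "mem_mb": ...}
    """
    total_cpu = sum(vm["cpu_mhz"] for vm in vm_entitlements)
    total_mem = sum(vm["mem_mb"] for vm in vm_entitlements)
    return {
        "cpu_mhz": host_capacity["cpu_mhz"] - total_cpu,
        "mem_mb": host_capacity["mem_mb"] - total_mem,
    }

vms = [{"cpu_mhz": 2000, "mem_mb": 4096}, {"cpu_mhz": 3000, "mem_mb": 8192}]
host = {"cpu_mhz": 12000, "mem_mb": 32768}
print(entitlement_headroom(vms, host))  # positive values mean spare capacity
```

As the text notes, the cluster raises no alarm when this goes negative, so the check has to be made deliberately.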
Another practice to consider is the calculation of target headroom7 within a cluster plus its high
availability (HA) failover capacity. The sum of these elements can be used to establish an
“effective capacity” of the cluster. The image below describes the effective capacity of the
cluster in terms of the percentage of CPU and memory available.
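A minimal sketch of the "effective capacity" idea, expressed as a percentage of raw cluster capacity (the component percentages below are illustrative assumptions, not fixed rules):

```python
# Hypothetical sketch: effective capacity as the fraction of raw cluster
# capacity left after setting aside headroom and HA failover space.

def effective_capacity_pct(spike_headroom_pct, whitespace_pct, ha_failover_pct):
    """Percentage of raw cluster capacity that is safely usable."""
    reserved = spike_headroom_pct + whitespace_pct + ha_failover_pct
    return max(0.0, 100.0 - reserved)

# e.g., 20% demand-spike headroom, 5% whitespace (large hosts),
# and ~12.5% HA space for a 1-host failover in an 8-host cluster
print(effective_capacity_pct(20.0, 5.0, 12.5))  # 62.5
```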
5. Entitlements are the computed result of configurations, reservations, limits, and shares used to establish the
resource allocation given to each VM for its operation. The Entitlement will always fall between the Reservation and
the Limit, based upon its Share.
6. A Reservation is the amount of vCPU and memory (in absolute units) that a VM is guaranteed should it need it.
7. Target headroom = demand-spike headroom (which depends on workload profiles and risk tolerance) + whitespace
(5% for large hosts, 10% for smaller ones); then add in HA space (15% per host in an 8-host cluster).
When calculating the number of VMs that can be added to a cluster, it is important to first
define differing VM template sizes (e.g., small, medium and large). The definition can start with
a simple average VM reservation for each resource per template size. For a more accurate
picture of where VMs should be placed within a cluster, you can expand the calculation to
consider maximum and minimum VM sizes and workload type such as sustained or peaky.
The diagram below provides a calculated assessment of how many medium-sized VMs can be
added to each cluster on the basis of CPU and memory requirements for that VM template.
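The template-based calculation can be sketched as follows (the template sizes are hypothetical; the point is that the tighter of the CPU and memory constraints decides the answer):

```python
# Hypothetical sketch: how many VMs of a given template fit in a cluster's
# free capacity, taking the tighter of the CPU and memory constraints.

def vms_that_fit(free_cpu_mhz, free_mem_mb, template):
    """template: per-VM reservation, e.g. {"cpu_mhz": 1000, "mem_mb": 2048}"""
    by_cpu = free_cpu_mhz // template["cpu_mhz"]
    by_mem = free_mem_mb // template["mem_mb"]
    return int(min(by_cpu, by_mem))  # the binding resource decides

medium = {"cpu_mhz": 1000, "mem_mb": 2048}
print(vms_that_fit(9000, 16384, medium))  # memory-bound: 8 VMs
```

Expanding this with maximum/minimum VM sizes and workload type, as the text suggests, refines the answer further.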
Sizing Objects Correctly
The next practice we will explore is the importance of sizing objects correctly. The key to
successful VMware capacity optimization is to set the Reservation property for each VM. The
Reservation determines the amount of resources that a VM can receive before it begins
competing with other VMs for the remaining shared resources. The Reservation also
determines the size of the VMKernel swap file for the VM’s memory and impacts HA and DRS
calculations.
Reservations are used by VMware Admission Control to prevent placing too many VMs on a host
based on resources. A VM can only be powered on if there are adequate unreserved resources
available on the host to satisfy that VM's Reservation requirement. If all VMs have the default
Reservation setting of zero, then there is no effective Admission Control and VMs can be loaded
onto hosts without constraint.
When HA is enabled, Reservations are used to calculate the amount of space (slots) needed to
meet the Failover Level Policy in effect. HA will calculate the maximum of all Reservations and
then based on the Failover Level Policy in effect, it will set aside space to provide the designated
number of VMs with sufficient capacity to operate if trouble strikes. Without Reservations in
place, HA must use an input parameter that defaults to 256MHz/256MB slot sizes, which is not
optimal.
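The slot-size behavior described above can be sketched roughly as follows (a simplified model, not VMware's actual algorithm; the 256MHz/256MB default is taken from the text):

```python
# Hypothetical sketch of the slot-based HA calculation described above:
# the slot size is driven by the largest Reservation in the cluster,
# falling back to the 256MHz/256MB default when no Reservations are set.

def ha_slot_size(reservations):
    """reservations: list of (cpu_mhz, mem_mb) tuples per VM; may be zeros."""
    cpu = max((r[0] for r in reservations), default=0) or 256  # default slot
    mem = max((r[1] for r in reservations), default=0) or 256
    return cpu, mem

print(ha_slot_size([(0, 0), (0, 0)]))            # (256, 256): default slots
print(ha_slot_size([(500, 1024), (2000, 512)]))  # (2000, 1024)
```

The second example shows why one oversized Reservation inflates the slot size, and with it the capacity HA sets aside, for the whole cluster.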
Reservations are also used as one of the triggers for DRS VM migrations at Periodic Invocation
time (along with load balancing and other mandatory moves). If the sum of the Reservations on
a host exceeds its capacity then a VM is selected for migration to correct the situation.
Now that we have established the importance of Reservations when managing capacity, we will
provide some guidance on selecting its correct size. From our experience in assisting large
organizations with their VMware environments, a common practice is to set the Reservation in
the range of the 45th to 60th percentile for the VM’s historical resource utilization. Beyond this
setting, allow the VM to compete with other VMs based on Shares and Entitlements during
periods of greater load.
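The percentile guidance above can be sketched with a simple nearest-rank calculation over historical samples (the sample values are hypothetical):

```python
# Hypothetical sketch: pick a Reservation from historical utilization
# samples at a chosen percentile (the 45th-60th range suggested above).

def reservation_at_percentile(samples, pct):
    """Nearest-rank percentile of utilization samples (e.g., CPU MHz)."""
    ordered = sorted(samples)
    rank = max(0, int(round(pct / 100.0 * len(ordered))) - 1)
    return ordered[rank]

history = [300, 450, 500, 520, 610, 700, 800, 900, 950, 1200]  # MHz samples
print(reservation_at_percentile(history, 50))  # a mid-range Reservation
```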
VM Workload Profiles
Figure 3. Systar’s OmniVision Workload Profile Reports display minimum (light blue), average
(blue) and peak (orange) resource usage hour-by-hour for CPU, I/O and Memory. The measures
are calculated from data collected every 15 seconds. Profiles are available for daily, multi-day,
weekly and monthly views. Average and maximum usage profiles can be used to accurately
assign VMware Reservations.
Once the reservations have been set, there is an opportunity to set Shares that represent the
VM’s business priority. Shares will determine resource Entitlement which the VMKernel will try
to provide when needed (e.g., handling peak workload periods).
Placing Workloads
The third practice to optimize your VMware capacity is placing workloads effectively. Capacity is
determined to some extent by how well a workload behaves with other workloads in the shared
resource environment.
If a number of workloads peak at similar times within an ESX Host, the result may be degraded
service or restricted capacity for other workloads needing access to the remaining resources.
Multiple peaking workloads can be more troublesome if their behavior pattern is unpredictable,
making them more challenging to manage. Peaky workloads like user-generated transactions
must be studied more carefully to see which match up best for resource sharing. On the other
hand, sustained workloads such as batch can be safely stacked, resulting in high capacity
utilization, because their peak resource requirements are well known. It would make sense, for
example, to place transaction workloads that peak during working hours with batch processes
that run throughout the night to provide a balanced use of the resources available. To
accomplish this type of safe-stacking requires a keen understanding of workload behaviors,
including average and peak usage, over a period of time. Most experts would request
monitoring of the workload behaviors for a minimum of one month. Of course, placement
accuracy will increase by gathering additional workload data points over an extended period of
time.
A good rule is to produce a stacked chart of all workloads targeted for a resource (e.g., CPU,
memory) belonging to the ESX Host. Flatter curves or sets of bars over time indicate a better
workload fit. A highly variable curve or set of bars means that the peaks may be coinciding.
Coinciding workloads increase the risk of resource contention, or wasted resources, and the
need to separate those workloads on different hosts or clusters.
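The stacked-chart rule above can be approximated numerically: sum the per-VM usage at each point in time and score how flat the resulting curve is. This sketch (workload series are hypothetical) uses the coefficient of variation as the flatness score:

```python
# Hypothetical sketch: sum per-VM usage over time and use the coefficient
# of variation of the stacked total as a "flatness" score; lower means
# the workloads fit together better (peaks do not coincide).

def stacked_flatness(series_per_vm):
    """series_per_vm: list of equal-length usage series, one per VM."""
    totals = [sum(point) for point in zip(*series_per_vm)]
    mean = sum(totals) / len(totals)
    var = sum((t - mean) ** 2 for t in totals) / len(totals)
    return (var ** 0.5) / mean  # coefficient of variation of the stack

day_vm = [10, 40, 40, 10]    # transactions peaking during working hours
night_vm = [40, 10, 10, 40]  # batch peaking overnight
print(stacked_flatness([day_vm, night_vm]))  # 0.0: perfectly complementary
```

Two workloads that peak together would score much higher, flagging a candidate for separation onto different hosts.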
Another benefit of careful workload placement is to minimize DRS migrations. Although DRS
can automatically place a new VM and load balance existing ones, it only considers overall
resource utilization of the hosts and not workload profiles. Therefore, if a VM is migrated to a
host where workloads peak shortly after the move, DRS will trigger another migration. For
example, the figure below shows a stacked bar chart of VM Memory Resource use on an ESX
host. Each color represents memory use of a different VM. As you can see, memory use
declines on the Host from 2am to 1pm and then peaks. DRS could easily migrate a memory
intensive VM to this Host at 11am, but would then have to migrate it once again at 2pm if
memory contention reaches an unacceptable level. When workloads like this are not placed
effectively, it can result in continually wandering VMs.
DRS migrations via vMotion are very efficient but still require some overhead. DRS migrations
should not be approached haphazardly (e.g., stacking VMs at random and letting DRS determine
how to best balance the workloads). Peaky workloads that are not placed effectively may result
in excessive migrations, causing what has come to be known as "vMotion sickness". In general,
setting anti-affinity rules within VMware is not recommended, but they can be very helpful for
workloads that are variable and may peak simultaneously.
Matching workload patterns is one of many considerations when stacking VMs on a host.
Affinity rules, geographical constraints, organizational alignment, compliance issues and other
factors will play into best practice guidelines for where workloads are permitted to be placed.
Optimizing DRS
As a reminder from Part I of this white paper series, Distributed Resource Scheduling (DRS)
provides a watchful eye over VMs in clustered environments. With an intention to provide each
VM its required resources, DRS observes resource utilization on each host within a cluster.
When unsatisfactory conditions are observed within one host, DRS assesses other hosts within
the cluster where conditions may be more attractive. If DRS finds a suitable location, it then
facilitates a VM move known as a VM migration.
Our fourth practice points to the need to optimize DRS. Not all application environments need
or are suited to its workload balancing features. For some sets of applications (VMs) it makes
sense not to use DRS. For example, horizontally scaled applications like web servers are already
load balanced.
In other instances, the cluster may be hosting hundreds of tier 2, non-critical business
applications that roughly demonstrate the same resource consumption. This environment may
be best suited to setting DRS to “Auto” and the default level to “Aggressive” for all VMs. Auto
settings allow for the initial placement of a VM inside the cluster to be automated and the
automatic execution of migration recommendations. Aggressive migration thresholds will
trigger movements that promise even a slight improvement in the cluster’s load balance.
Where sets of workloads take on the opposite profile of the example above - becoming less
homogeneous and more critical in nature - you will want to consider less aggressive migration
thresholds and either partially-automated or manual placement and migration settings.
The vCenter screen shot below shows a DRS enabled cluster with 4 ESX Hosts, of which only 2
are active. These systems are hosting 36 VMs and show 190 migrations. In this instance,
workloads are clearly not balanced properly and excessive migrations are occurring.
Optimizing HA
Our final practice is centered on high availability (HA) within DRS-enabled clusters. HA is a very
cost-effective capability built into the DRS-enabled cluster. However, there is a tradeoff and its
cost is twofold:
• HA’s strict Admission Control is very conservative and wastes a great deal of capacity
• It is difficult to understand whether an application can be restarted in a given cluster
state
The current method of calculating VMware’s HA failover capacity is complicated (and too
lengthy to share here). However, in short, HA’s strict Admission Control uses the maximum
Reservation size in the cluster as a slot size for all calculations. In fact, many users report seeing
a message that there are “Insufficient resources to satisfy configured failover level for HA”,
when attempting to configure their HA environment.
Many sites we talk to limit resource utilization far below what they might need to restart the
critical VMs on a host. And many of these same sites do not set their HA restart priority. We
recommend setting and optimizing the restart priority around two points: minimizing capacity
loss, and ensuring critical VMs restart immediately while low priority VMs restart when possible.
• Turn off strict HA Admission Control; VMware admits this is a very conservative
approach (i.e., it wastes a lot of capacity).
• Set the restart priority of all VMs very carefully, usually according to the policies defined
in your DR plan.
• Take the maximum, across hosts, of the sum of the Reservations for your High Priority VMs.
Subtract that sum, plus an additional 5% for overhead, from the aggregate capacity. From
the remainder, subtract your required headroom, which is based on the peakiness of the
workloads and your risk tolerance. Manage the cluster to the remaining “effective
capacity”.
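The steps above can be turned into a small worked example (all figures are illustrative assumptions, not recommendations):

```python
# Hypothetical worked example of the effective-capacity steps above.

def manageable_capacity(aggregate_mhz, hipri_res_per_host, headroom_pct):
    """Capacity to manage the cluster to, after the HA set-aside and headroom.

    hipri_res_per_host: sum of high-priority VM Reservations on each host.
    """
    worst_host = max(hipri_res_per_host)  # most capacity to re-home on a failure
    set_aside = worst_host * 1.05         # plus 5% for overhead
    remainder = aggregate_mhz - set_aside
    return remainder - aggregate_mhz * headroom_pct / 100.0

# 8 hosts x 12 GHz each, per-host high-priority Reservations, 15% headroom
print(manageable_capacity(96000, [8000, 6000, 7000, 5000], 15.0))
```

The same arithmetic applies per resource (CPU MHz and memory MB); the cluster is then managed to whichever figure is the tighter constraint.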
VMware is planning to change its HA approach in ESX 4, and once we have sufficient experience
with that release, this section of the paper will be revised.
Summary
As your VMware environment continues to expand and the pressure to reduce costs continues
to increase, applying the five practices recommended in this paper will provide greater control
over new virtualization spending and improve the quality of services delivered. Systar is
confident that by following these practices, your organization will be able to safely maximize the
utilization of your VMware capacity above the 50% mark.
8. Business Value of Virtualization: Realizing the Benefit of Integrated Solutions, July 2008
The concepts in this paper apply to most virtual environments – however, we use VMware VI 3.x to
illustrate our points.
• Admission Control – if on, will not allow a new VM to be powered on if there are not enough
unreserved resources available on the host for the VM's specified Reservation.
• Configured size – the number of vCPUs and the number of MB of memory that represent the size of
the physical machine the VM is presented with.
• Entitlement - the computed result of configurations, reservations, limits, and shares used to establish
the resource allocation given to each VM for its operation. The Entitlement will always fall between
the Reservation and the Limit, based upon its Share.
• Effective capacity – the amount of that capacity that can be used given workload mix, high availability
requirements and white space.
• Host – a server running ESX that supports multiple workloads (VMs).
• Limit – this property serves as a hard cap on resource allocation for a VM. If no Limit is specified,
then the configured size is the Limit.
• Reservation – the amount of CPU MHz and MB of memory (in absolute units) that a VM is guaranteed
should it need it.
• Shared resources – CPU and memory resources that are actively managed by the hypervisor.
• Shares – relative units that determine a VM’s priority among sibling VMs, used to determine resource
allocation under contention.
• White space – capacity on a host that cannot be utilized due to alignment of resource requirements,
or resources that cannot be used since they are too small to support a whole VM.
United Kingdom
Ground Floor Left
3 Dyer’s Buildings
London EC1N 2JT
Tel. +44 2072 692 799
Fax +44 2072 429 400
info-uk@systar.com
Systar, BusinessBridge, OmniVision, BusinessVision, ServiceVision, WideVision and Systar’s logo are registered
trademarks of Systar. All other brand names, product names and trademarks are the property of their respective
owners. Copyright 2009.