resource allocation allotted to a specific resource pool; explains potential issues that arise when
assigning resource reservations and workarounds for over allocation
Architecture
VMware ESX Server provides a virtualized computing environment that, unlike VMware Server or VMware
Workstation, does not rely on an underlying operating system to communicate with the server hardware –
instead, ESX Server is installed directly on the server hardware. Virtual Machines (VMs) are then installed and
managed on top of the ESX Server software layer.
Since virtualization components are not hosted within the confines of a host operating system, the ESX
Server architecture has been described as “unhosted,” “native,” or “hostless.” Hosted and unhosted
architectures are compared in Figure 1.
Figure 1: Comparing the native ESX Server architecture with a typical hosted architecture
The ESX Server architecture provides shorter, more efficient computational and I/O paths for VMs and their
applications, reducing virtualization overhead and improving application performance. The unhosted
architecture also enables ESX Server to provide more granular and enforceable policies for hardware
allocation and VM prioritization – an important differentiator for ESX Server over a hosted architecture. In a
hosted virtualization environment, the host OS governs the execution of VM threads and typically limits the
granularity of prioritization to categories such as “high,” “low,” or “normal.”
Furthermore, in the unhosted architecture of ESX Server, VM processes do not contend with the many and
various processes that consume the resources allocated to a host OS.
In short, the unhosted architecture of ESX Server provides a lightweight, single-purpose virtualization
environment that allows enforceable hardware allocation and prioritization policies. The single-purpose
micro-kernel, called the VMkernel, translates into higher performance and flexibility. Also, because the VMkernel
uses only drivers ported and rigorously validated by both HP and VMware, the micro-kernel provides
exceptional stability.
Hardware performance differences
Performance and resource utilization for a particular operating system instance and application differ when
running in virtualized and unvirtualized environments, as discussed below.
Driver translation
While quantifying performance differentials is complex, CPU and memory overheads in a VM tend, in
general, to be lower than overheads for network or disk traffic. This is because CPU and memory operations
do not require the same degree of translation as data flowing between virtual and physical device drivers.
With the introduction of ESX Server 3.0, many of the physical
device drivers have been incorporated into the kernel to further improve VM performance. In general, the
performance of a primarily CPU-intensive application in a VM is likely to be closer to its performance on a
physical server than that of an application that is more network- or disk-intensive.
However, translation between virtual and physical devices does consume some additional CPU resources
on top of those required for application and guest OS processing. This translation results in a higher
percentage of CPU utilization being needed for each request processed in a VM when compared to a
physical server running the same application.
World switching
The world switching process enables the sharing of physical system resources by preempting the currently
running VM, capturing and saving the instantaneous execution state of that VM, and initiating CPU
execution for another VM.
Although world switching allows VMs to share physical system resources, it introduces a small amount of
additional overhead for running VMs; however, the benefits of virtualization strongly outweigh this cost.
Service console
ESX Server is often thought of as Linux or Linux-based – a misconception that might stem from the service
console. To rebut this misconception, consider the following:
• The VMkernel, responsible for the creation and execution of virtual machines, is a single-purpose, micro-
kernel.
• The VMkernel cannot boot itself and has no user interface. It relies on a privileged, modified Red Hat
Enterprise Linux installation to provide ancillary services like a boot loader and user interface.
• It is important to understand that the VMkernel – not the Linux kernel – is the governing authority in an ESX
Server deployment. It is the VMkernel that creates, monitors, and defines the virtualization components; it
alone makes the final decision on execution allocation – even the Linux service console is subject to
the scheduling decisions of the VMkernel.
Boot process
The boot process helps explain the relationship between Linux, the service console, and ESX Server.
During boot, the bootloader (GRUB) loads a Linux kernel. Since certain PCI devices are masked by the
GRUB configuration, the Linux kernel only loads drivers for visible devices.
After most Linux services have been loaded, the Linux kernel loads the vmnixmod module, which loads the
VMkernel logger, which, in turn, loads the VMkernel itself. During its loading process, the VMkernel assumes
nearly all hardware interrupts and, effectively, takes over server hardware that was not allocated to the
Linux service console kernel. At this point, with the VMkernel owning most of the server hardware, it is free to
schedule VM execution and distribute physical resources between VMs.
The final component loaded is the VMkernel core dump process, which is designed to capture the state of
the VMkernel in the event of a kernel crash.
Service console overview
With the advent of ESX Server 3.0, the service console is now executed as a VM. Drivers for service console
devices, such as the NIC and storage, are loaded in the VMkernel which allows service console access to
the configured hardware through the kernel itself. Although the service console accesses most devices
through kernel modules, its access to USB devices and the floppy drive is direct.
The VMware host agent which runs within the service console provides access for the Virtual Infrastructure
client. Additionally, a web access client is available and is powered by a Tomcat web service which runs
within the confines of the service console. The web access client allows users to perform many
management tasks through a web-based interface. Furthermore, the Secure Shell (SSH) service within the
service console provides secure access for command-line management of the ESX Server. While these interfaces might
appear identical to any Linux installation, the service console includes packages that allow both the
command line and the web interfaces to pass commands and configuration data to the VMkernel.
Figure 2 shows how the service console integrates into the ESX Server architecture.
Figure 2: The relationship between ESX Server and the service console
The service console, which uses a uni-processor 2.4.21 Linux kernel, is scheduled only on physical CPU0. By
default, the VMkernel also reserves a minimum of 8% of CPU0 for the service console through the same
guarantee mechanism used for VMs, which are free to consume remaining CPU0 resources. This CPU
allocation, in most cases, ensures that the service console remains responsive, even if other, busy VMs are
consuming all other available physical resources.
Although the service console is not responsible for scheduling VMs, there is a correlation between the
responsiveness of the service console and the responsiveness of VMs. This is due, in part, to the fact that ESX Server
transmits the keyboard, video, and mouse access of a VM to a VMware Remote Console session through
the service console network connection.
Because of this relationship, if the service console should become unresponsive or unable to perform the
supporting processes (such as updating /proc nodes or maintaining the VMkernel logger), the virtualized
environment may exhibit symptoms of this contention, ranging from slow remote console access to
a VMkernel crash. To combat this, consider increasing the memory allocation and/or the minimum CPU
guarantee for the service console. Note, however, that this discussion addresses an extreme case; in most
cases, the default allocations should provide stable and responsive operation.
The service console also provides an execution environment for management and backup agents; the
loads generated by these additional processes further justify an increase in memory and CPU allocations
over the default values.
Note:
HP Systems Insight Manager (SIM) and other hardware monitoring agents run
in the service console, not in VMs.
ESX Server does not support the running of unqualified packages within the
service console environment.
Access to floppy drives, serial port devices, parallel port devices, and CD-ROM drives – even from
within a VM – is proxied through the service console. This delegation of slower-access devices allows the
VMkernel to focus on high-speed, low-latency devices like hard disks.
Note:
Virtual SMP is licensed separately; this license is required to expose up to
four processors within a single VM. The license can be purchased
separately or as part of the Virtual Infrastructure Node bundle.
Although it may be tempting to use Virtual SMP by default when creating new VMs, it should be used
carefully – especially when running on systems which offer few execution cores, for instance a dual
processor system populated with single core processors. A VM is never allocated a portion of a core; during
its allocated unit of CPU time, a VM’s access is exclusive. As a result, when a VM using Virtual SMP is
deployed on a physical server with only two cores (as in a dual-processor, single-core server without Hyper-
Threading Technology), both cores are allocated to this VM during the scheduled period; no CPU resources
are available for other VMs or the service console. The corollary also applies: when any other VM or service
console process is scheduled for execution on either one of the two execution cores, processes on the
Virtual SMP-enabled VM cannot execute. This phenomenon is known as processor fragmentation and is
shown in Figure 3.
Figure 3: Both physical processor cores have been allocated to a Virtual SMP-enabled VM, leaving no CPU resources
available for other VMs
Processor fragmentation is often the reason for poor performance on servers with only two execution cores.
With dual-processor VMs on servers with only two cores, there is nearly 100% contention for CPU resources
when the system is under load. Now, if the goal were to run only a single VM with Virtual SMP on a platform
with two execution cores, the performance impact of this contention would be less noticeable; however, it
is far more common to deploy multiple VMs on such a platform, making contention a significant issue.
When using Virtual SMP, it is recommended that the physical server provide more processor cores than are
configured in any single VM.
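The co-scheduling constraint behind processor fragmentation can be sketched as follows. This is an illustrative model, not VMware code; the function name and core counts are hypothetical:

```python
# Illustrative sketch (not VMware code) of the co-scheduling constraint
# behind processor fragmentation on a two-core host.

def schedulable(vm_vcpus, free_cores):
    """A VM can execute only if a free core exists for each of its
    virtual processors; partial cores are never allocated."""
    return free_cores >= vm_vcpus

total_cores = 2  # dual-processor, single-core server

# While a 2-vCPU (Virtual SMP) VM holds its time slice, it owns both
# cores exclusively, so nothing else can run:
print(schedulable(1, total_cores - 2))  # False

# Conversely, while any 1-vCPU VM (or the service console) occupies a
# core, the 2-vCPU VM cannot be co-scheduled:
print(schedulable(2, total_cores - 1))  # False
```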
New technologies, such as Hyper-Threading Technology and dual-core processors, change this behavior
slightly and are examined more closely in later sections of this white paper.
Resource virtualization
In discussing how ESX Server presents virtual abstractions of hardware and schedules the execution of VMs,
it is helpful to discuss the four primary resource groups (CPU, memory, network, and disk) independently.
CPU
The primary concepts of CPU virtualization are as follows:
• A physical processor package
• A virtual processor
• A logical processor capable of executing a thread
The physical processor is a familiar concept; it has a clock speed, a cache size, and a manufacturer; and
you can hold it in your hand.
Introduced with newer technologies such as Hyper-Threading Technology and dual-core processors, the
logical processor is slightly more abstract, and may best be explained by examples such as those shown in
Figure 4.
Figure 4: Representations of physical processors
In the context of ESX Server, not all logical processors are equal. For example, a processor with Hyper-
Threading Technology includes two instruction pipelines; however, only one of these can access the
execution pipeline at any given moment. Contrast this with a dual-core processor where both instruction
pipelines have access to their own execution pipelines. As such, when discussing virtual processors, it might
be helpful to refer specifically to an execution core to avoid confusion with the non-executing second
instruction pipeline in a processor with Hyper-Threading Technology.
By far the most abstract of the concepts of CPU virtualization is the virtual processor, which is best defined
as a period of time allocated for exclusive execution on a processor core. When a VM is powered on and
its virtual processor is scheduled to execute (or multiple virtual processors if using Virtual SMP), a slice of time
on one logical execution core (or multiple logical execution cores if using Virtual SMP) within the physical
processors is assigned to the virtual processor(s) within the VM. Since this concept is purely abstract,
consider the following simplified examples:
• A physical processor with a single logical core and a single VM with a single virtual processor
Ignoring all virtualization overheads and execution cycles for system services, the virtual processor in this
scenario receives 100% of the execution time. If the logical core of the physical processor is a 3.0 GHz
CPU, the virtual processor receives all three billion cycles of the CPU clock.
• A single processor with a single logical core and two VMs, each with one virtual processor
Ignoring all virtualization overheads and execution cycles for system services and assuming equal priority
for both VMs, each virtual processor receives, over some period of time, 50% of the execution time.
It is important to understand that when one VM is executing, that VM has exclusive access to the
allocated logical execution core. In other words, only one virtual processor can execute within a single
logical processor at any given moment. The x86 architecture does not allow two virtual processors to
have simultaneous access to a single execution pipeline. Thus, in this example, if one VM is executing,
the other is not. The non-executing VM is not idle; rather, it has been preempted.
Figure 5 shows a number of scenarios featuring a single virtual processor.
Figure 5: Showing scenarios with one virtual processor per physical processor core
As can be inferred from the above discussion, when the number of virtual processors increases, the period
of time for execution may become shorter. Similarly, as the ratio of virtual processors to logical execution
cores increases, the contention for physical resources may increase. One possible result – depending on
the amount of idleness within the VMs – is that the number of computational cycles available to a VM may
be less.
The qualifications in the previous paragraph – “may become shorter,” “may increase,” and “may be less” –
are rooted in the manner in which ESX Server treats an “idle” VM. When an operating system is not
consuming resources for system-sustaining processes or in support of an application, it is idle; indeed, most
operating systems spend considerable amounts of time in this state. While idle, the operating system issues
instructions to the CPU indicating that no work is to be done.
ESX Server is capable of recognizing this idle loop – unique to each operating system – and automatically
gives priority to VMs using CPU cycles to perform non-idle operations, giving rise to the qualifications stated
above. For example, as the ratio of virtual processors to logical processors increases, the contention for
physical resources may increase unless there are idle VMs.
Reacting to an idle VM
Consider the following to illustrate how idle VMs may affect the scheduling of virtual processors. In this
scenario, there is a single processor with a single logical core supporting two VMs, each with one virtual
processor. If one VM is performing CPU-intensive operations that entirely consume the cycles allotted to it
and the other VM is completely idle, ESX Server recognizes this disparity and effectively increases the
percentage of cycles allocated to the busy VM. Note that the busy VM never receives 100% of the CPU
cycles; some cycles are allocated to – and consumed by – the idle VM to advance its clock and
ensure that it has the opportunity to become busy.
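The idle-aware redistribution described above might be modeled roughly as follows. This is an assumed simplification, not the actual ESX Server algorithm, and the 2% housekeeping allocation is an invented figure:

```python
# Assumed model (not the actual ESX Server algorithm) of idle-aware
# CPU allocation on one core: cycles an idle VM does not use flow to
# busy VMs, but the idle VM keeps a small slice to advance its clock.

MIN_IDLE_FRACTION = 0.02  # hypothetical housekeeping allocation

def allocate(demands):
    """demands: {vm: fraction of its nominal share it wants, 0.0-1.0}.
    Returns each VM's effective fraction of the core."""
    nominal = 1.0 / len(demands)  # equal shares to start
    alloc, surplus = {}, 0.0
    for vm, demand in demands.items():
        used = min(max(nominal * demand, MIN_IDLE_FRACTION), nominal)
        alloc[vm] = used
        surplus += nominal - used
    busy = [vm for vm, d in demands.items() if d >= 1.0]
    for vm in busy:                    # redistribute unused cycles
        alloc[vm] += surplus / len(busy)
    return alloc

print(allocate({"busy": 1.0, "idle": 0.0}))
# the busy VM receives ~98% of the core; the idle VM keeps ~2%
```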
Updating the VM clock
As mentioned earlier, when a VM is not scheduled for execution within a processor core, time does not pass
in that VM. As a result, the measurement of time within that VM is incorrect – unless the VMware Tools
package is installed.
This package includes a component that, when enabled, updates the clock within the VM to ensure more
accurate timekeeping. This updating does not provide for a real-time measurement; however, the
accuracy of the updated time should be sufficient for most application purposes.
IMPORTANT:
VMware strongly recommends that time measured within a VM not be
used for benchmarking purposes.
Any application running in a VM that measures performance with respect to
time – for example, requests per second, transactions per second, or response
time – has temporal components that should be considered unreliable.
Default settings
To this point, the discussion has assumed that none of the CPU resource management features of ESX Server
have been changed from their default values, which allow a virtual processor to consume up to 100% of an
execution core or as little as 0%. The default configuration allows ESX Server to dynamically change which
logical processor clock cycles are used to fulfill the time allotted to a virtual processor; in other words, ESX
Server can move a virtual processor between logical cores or, even, physical processors in response to
shifting loads within the physical host.
While unique resource management settings can be configured for each VM, VMware recommends
leaving these values set to their default values and allowing the VMkernel to make these decisions with
maximum flexibility.
To illustrate the performance impact of cache fragmentation, consider an ESX Server environment freshly
booted with no VMs powered on. When the first VM is powered on, the cache hit ratio for its processor is
initially zero but begins to increase; after some time, this VM, running alone on the processor, might achieve
a hit ratio that is high enough to improve performance. When a second VM is powered on and scheduled
to execute on the same processor, this VM begins to populate processor cache with its own data and
processes, replacing the cached data and processes from the first VM. When the first VM is next scheduled
for execution, the cache hit ratio will be lower than previously achieved.
While the hit ratio will improve over time, it is likely to be lower during initial execution cycles (as shown in
Figure 6), forcing VMs to execute from main memory. Because access to main memory is much slower than
servicing requests from processor cache, the VM runs more slowly until the hit ratio improves.
Figure 6: Simulated impact of cache fragmentation on the CPU cache hit ratio, showing the ratio dropping to zero
each time a world switch (indicated by a red line) occurs
Note:
Unlike processor registers, processor cache is not restored or saved when
switching between executions of virtual machines.
The impact of cache fragmentation is intensified in a higher-density deployment. With more VMs running
per processor, each VM runs for a shorter period of time, which may limit a VM’s ability to fully populate and
realize the benefits of processor cache. As density increases, the following conditions occur:
• There are more VMs to push data out of cache
• The length of time between executions for each VM increases
These conditions combine to reduce the amount of data that remains cached between executions.
To combat this performance degradation, a larger processor cache may provide enough storage to retain a
significant amount of cached data between executions, improving the cache hit ratios for all VMs running
on the processor.
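The effect of world switches on cache hit ratios can be approximated with a toy simulation. All parameters here are invented for illustration; real cache behavior depends on workloads, cache size, and scheduling frequency:

```python
# Toy simulation of cache warm-up and eviction across world switches.
# The warm-up rate and peak ratio are invented for illustration only.

def simulate(slices, warmup=0.25, peak=1.0):
    """slices: VM ids in scheduling order on one core. A switch to a
    different VM evicts the incoming VM's previously cached data."""
    ratio, history, prev = {}, [], None
    for vm in slices:
        if prev is not None and vm != prev:
            ratio[vm] = 0.0                 # cache was overwritten
        r = ratio.get(vm, 0.0)
        r += warmup * (peak - r)            # re-warm toward the peak
        ratio[vm] = r
        history.append((vm, r))
        prev = vm
    return history

solo = simulate(["A", "A", "A"])
mixed = simulate(["A", "B", "A", "B"])
print(solo[-1])   # ('A', 0.578125): the ratio climbs while A runs alone
print(mixed[-1])  # ('B', 0.25): each switch restarts the warm-up at zero
```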
The VMkernel recognizes that performance can be maximized by continuing to run a virtual processor on the same logical core
(which is likely to contain cached pages for the particular VM). As a result, the VMkernel is prepared to
accept a temporary increase in contention within one logical core before migrating a virtual processor to a
different core.
The decision to migrate or leave a virtual processor in place is governed by the potential penalty imposed
by the migration. If the contention within a logical core causes a virtual processor to delay an execution
request by a period that exceeds the migration penalty, the VMkernel recognizes that cache relevance
has been outweighed by the contention and will migrate the virtual processor to a logical core able to
serve the request more quickly.
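The migration decision reduces to a simple comparison, sketched here with hypothetical microsecond figures; the real VMkernel heuristics weigh more factors than this:

```python
# The trade-off described above as a simple predicate. The microsecond
# figures are hypothetical; real VMkernel heuristics are more involved.

def should_migrate(expected_wait_us, migration_penalty_us):
    """Migrate only when waiting on the current (cache-warm) core costs
    more than re-warming the cache on another core."""
    return expected_wait_us > migration_penalty_us

print(should_migrate(40, 100))   # False: mild contention, stay for cache affinity
print(should_migrate(250, 100))  # True: contention outweighs cache relevance
```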
To retain the ability to migrate virtual processors, VMware recommends that users allow ESX Server to
determine processor affinity to virtual machines. Many variables govern scheduling decisions made by the
VMkernel to guarantee the best possible performance from a particular physical configuration. Specifying
processor affinity reduces the flexibility available to the VMkernel to make optimizations. Thus, VMware
discourages forced association of a virtual machine to a specific processing core.
On the other hand, if only one virtual processor were scheduled to run within the hyperthreaded physical
processor, the VMkernel would account for this exclusive access through its internal accounting
capabilities. In this case, the virtual processor would be charged more for its exclusive consumption of the
physical processor. The rationale behind this extra charge is that the single virtual processor consumes the
full physical package, whereas two virtual processors within the two logical cores of a hyperthreaded
physical processor are each utilizing half of the physical processor. Other than this, the concepts of resource
management in ESX Server apply to servers with hyperthreaded processors in exactly the same manner as
servers with single-core processors.
ESX Server, however, can also use this secondary core to address processor fragmentation by scheduling
the two virtual processors of a Virtual SMP-enabled VM to use both cores of a hyperthreaded processor.
However, since hyperthreading may cause contention between these two cores, overall performance
depends on the nature of the particular application. If the application uses only a single virtual processor,
leaving the second processor largely unused, hyperthreading gives ESX Server the flexibility to avoid the
effects of processor fragmentation without significantly impacting application performance. If, however,
simultaneous, parallel execution of the two virtual processor threads is required, poor application
performance is likely.
Like most other parameters governing scheduling decisions, it is possible to update the policy for scheduling
virtual processors in a hyperthreaded environment. For more information on these policies and how to
change them, type man hyperthreading at the service console command prompt.
Note that the default setting for hyperthreading scheduling policy for a virtual processor is any, which
places no restrictions on the allocation of cores between virtual processors. This setting allows the VMkernel to
use both cores within each hyperthreaded physical processor for the scheduling and execution of any
virtual processor within the system.
Setting hyperthreading sharing policy to none causes the particular VM to effectively ignore the fact that
the physical processor is hyperthreaded and continue to consume physical packages as though they
contained only a single logical core. Since this policy is set per VM, no virtual processor associated with this
VM will share a physical package with any other; the other logical core within the package will remain
unused.
VMs with more than one virtual processor can also use the internal setting for hyperthreading sharing policy.
This allows the virtual processors of a single VM to share cores within a physical package; however, these
virtual processors will not share cores with virtual processors associated with any other VM.
As with any parameters that can alter scheduling decisions made by the VMkernel, VMware strongly
recommends accepting the default values for the hyperthreading sharing policy.
5. The process by which one VM is unscheduled and another scheduled to execute is known as a world switch. This process involves
capturing one VM's processor registers and writing them to memory, then reading the registers for the other VM from main
memory and, finally, writing these registers to the processor.
Figure 9: Showing a NUMA implementation with two nodes
HP ProLiant servers with AMD Opteron processors are NUMA systems: that is, the system BIOS creates a
System Resource Allocation Table (SRAT) that presents the nodes and proximity domains to ESX Server. ESX
Server is NUMA-aware and uses the contents of the SRAT to make decisions on how to optimally schedule
VMs and allocate memory.
On NUMA systems, ESX Server attempts to schedule a VM thread to execute in core(s) that are in the same
proximity domain as the memory associated with that VM. ESX Server also attempts to maintain physical
memory for a particular VM within a single NUMA node. If a VM needs more memory than is available in
a NUMA node, ESX Server allocates additional memory from the nearest proximity domain.
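The local-node-first placement described above might be sketched as follows. This is an assumed model; the node sizes, distances, and function name are illustrative, not ESX Server internals:

```python
# Assumed model of local-first NUMA memory placement: satisfy a VM's
# request from its home node, spilling overflow to the nearest
# proximity domains in order.

def allocate_numa(request_mb, home, free_mb, neighbors):
    """neighbors[home]: other nodes ordered nearest-first.
    Returns {node: MB allocated} or raises if memory is exhausted."""
    taken, remaining = {}, request_mb
    for node in [home] + neighbors[home]:
        grab = min(remaining, free_mb.get(node, 0))
        if grab:
            taken[node] = grab
            remaining -= grab
        if remaining == 0:
            break
    if remaining:
        raise MemoryError("insufficient physical memory across all nodes")
    return taken

# Two-node system: a VM homed on node 0 wants 3 GB, but only 2 GB is
# free locally; the overflow comes from the neighboring node.
print(allocate_numa(3072, 0, {0: 2048, 1: 4096}, {0: [1], 1: [0]}))
# {0: 2048, 1: 1024}
```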
The single-core Opteron processor creates a unique NUMA architecture where each NUMA node has only
one processor core. While this is perfectly valid within NUMA specifications, it is more typical to deploy
multiple cores in a single NUMA node. Within the ESX Server context, the only scenario that is affected by
the unique Opteron NUMA presentation involves Virtual SMP.
The Virtual SMP code of ESX Server has been written to take advantage of the NUMA architecture of
Opteron processors. When a VM is allocated two virtual processors, ESX Server attempts to schedule both
threads to execute within a single NUMA node; however, on single-core Opteron processors, the SRAT
dictates that each node contains only one core, making it impossible for ESX Server to execute both virtual
processor threads within a single proximity domain.
Dual-core Opteron processors provide two execution cores per NUMA node, allowing ESX Server to
schedule dual-processor VMs within the confines of a single NUMA node. As long as the number of virtual
processors in a VM remains at or below the number of execution cores in a proximity domain, ESX Server
NUMA optimizations should be in effect.
Memory
Note:
An outstanding resource for detailed information on memory virtualization is
available at the ESX Server command line. Issuing the command man mem
displays a comprehensive guide on ESX Server memory virtualization.
burdened with memory requirements per VM running on a host. Individual VM process threads are handled
directly by the VMkernel, thereby eliminating the need for additional service console memory per VM.
However, if you intend to run additional agents – for hardware monitoring and/or backup – in the service
console, it may be prudent to allocate more memory than the defaults allow.
Memory management
ESX Server provides advanced memory management features that help ensure the flexible and efficient
use of system memory resources. For example, ESX Server systems support VM memory allocations that are
greater than the amount of physical memory available – overcommitting – as well as background memory
page sharing and ballooning.
When attempting to power on a VM, an ESX Server host first verifies that there is enough free physical
memory to meet the guaranteed minimum needed to support this VM. Once this admission control check
has passed, the VMkernel creates and presents the virtual memory space.
While virtual memory space is created and completely addressed as the VM is powered on, physical
memory is not allocated entirely at this time; instead, the VMkernel allocates physical memory to the VM as
needed. In every case, VMs are granted uncontested allocations of physical memory up to their
guaranteed minimums; because of admission control, these allocations are known to be present and
available in the server.
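The admission-control check can be expressed as a one-line comparison. This is a simplification that ignores service console memory and per-VM virtualization overheads, which the real check also accounts for:

```python
# The admission-control check as arithmetic: admit a new VM only if its
# guaranteed minimum fits alongside the minimums already promised.
# Simplified sketch; overheads and console memory are ignored.

def can_power_on(new_min_mb, running_min_mb, host_mb):
    """True if the host can still honor every guaranteed minimum."""
    return sum(running_min_mb) + new_min_mb <= host_mb

running = [256, 256, 512]                # minimums of running VMs, in MB
print(can_power_on(512, running, 1536))  # True: guarantees total 1536 MB
print(can_power_on(768, running, 1536))  # False: would break a guarantee
```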
If the entire physical memory pool is already in active use when a VM requests the memory due to it
under its guaranteed minimum, the VMkernel makes physical memory available by decreasing the
physical memory allocation to another VM deployed on the same host. The VMkernel relies on its own swap
file to accommodate the increased physical memory demand.
Consider the following example where two VMs, VMA and VMB, are each guaranteed a minimum of 256
MB and a maximum of 512 MB of RAM, and two additional VMs, VMC and VMD are each guaranteed a
minimum of 512 MB and a maximum of 1024 MB of RAM. Ignoring all service console allocations and
virtualization overheads for a moment, assume that the server has 1.5 GB of RAM and that VMA and VMB
are each actively using 512 MB of physical memory while VMC and VMD are each actively using only 256
MB. Admission control allows all of these VMs to run since the 1.5 GB of physical memory can
accommodate the guaranteed minimums. While all physical memory is consumed in this example (as
shown in Figure 7), some machines are not actively using their guaranteed minimum or maximum
allocations.
Figure 7: In this example, VMC and VMD are using less than their guaranteed minimum memory allocations; all physical
memory is consumed
Now, to continue with this example, VMC and VMD each request an additional 256MB of memory, which is
guaranteed and must be granted. To accommodate these additional allocations, ESX Server reclaims
physical memory from VMA and VMB (as shown in Figure 8), both of which are operating above their
minimum guaranteed allocations. If VMA and VMB have equivalent memory shares, each should relinquish
roughly the same amount of memory.
Figure 8: In this continuing example, VMC and VMD are allocated memory that has been reclaimed from VMA and
VMB
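The arithmetic of this example can be worked through in code, using the figures from the text. The equal-split reclamation assumes VMA and VMB hold equivalent memory shares, as the example stipulates:

```python
# The four-VM example in code (figures from the text). Reclamation is
# split equally between donors, per the equal-shares assumption.

vms = {  # name: [guaranteed_min_mb, max_mb, in_use_mb]
    "VMA": [256, 512, 512],
    "VMB": [256, 512, 512],
    "VMC": [512, 1024, 256],
    "VMD": [512, 1024, 256],
}

def grant_guaranteed(vms, requesters, amount_mb):
    """Give each requester `amount_mb` of guaranteed memory, reclaiming
    it in equal parts from VMs running above their minimums."""
    donors = [n for n, (mn, _, use) in vms.items()
              if use > mn and n not in requesters]
    per_donor = amount_mb * len(requesters) // len(donors)
    for n in donors:
        vms[n][2] -= per_donor
    for n in requesters:
        vms[n][2] += amount_mb

# VMC and VMD each request a further guaranteed 256 MB.
grant_guaranteed(vms, ("VMC", "VMD"), 256)
print({n: v[2] for n, v in vms.items()})
# {'VMA': 256, 'VMB': 256, 'VMC': 512, 'VMD': 512}: total stays 1536 MB
```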
However, the applications running within VMA and VMB are not aware of memory guarantees and will
continue to address what they perceive to be their full memory ranges; as a result, the VMkernel must
address the deficits.
Note:
In the case of a balloon driver-induced swap within a VM, memory is
swapped to the VM’s – rather than the VMkernel’s – swapfile.
Since the guest operating system is able to make intelligent decisions about which pages are appropriate
to swap and which are not, ESX Server uses the balloon driver to force the guest operating system to apply
this intelligence to reduce the physical memory used by its processes. At the same time, ESX Server is able to
identify the memory pages consumed by the balloon driver. These consumed pages are useless to the VM
but, to ESX Server, they represent physical memory that is essentially free to commit to other VMs. The
balloon driver inflates only enough to reduce the VM's physical memory utilization to its guaranteed
minimum memory allocation.
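The inflation target implied by this behavior is simple arithmetic. This is a sketch of the stated policy, not the actual balloon driver logic:

```python
# The balloon's inflation target: grow just enough to bring the VM's
# physical memory use back down to its guaranteed minimum.

def balloon_target_mb(in_use_mb, guaranteed_min_mb):
    """MB the balloon driver must claim inside the guest."""
    return max(0, in_use_mb - guaranteed_min_mb)

print(balloon_target_mb(512, 256))  # 256: reclaim memory above the minimum
print(balloon_target_mb(200, 256))  # 0: already at or below the minimum
```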
3 × 576 MB = 1728 MB
Initially, this physical host would not have enough memory available to power-on a fourth VM. However,
through background memory page sharing, ESX Server may eventually find the necessary 256 MB of
redundant memory pages.
Note:
The amount of redundant memory reclaimed on a system is highly dependent
on the nature of the specific environment. The opportunity to share memory
dramatically increases in an environment where VMs are executing the same
OS. Metrics appearing in this example are only used for illustrative purposes
and do not represent actual physical memory that may be reclaimed.
With this newly reclaimed memory, ESX Server has enough free RAM to power on a fourth VM. Unless the first
three VMs attempt to update their shared memory pages, this system will continue to function as expected
with no discernible manifestation of either the memory page sharing or memory overcommitment.
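The arithmetic of this example can be sketched as follows (the 256 MB of sharable pages is the illustrative figure from the text; actual savings are workload-dependent):

```python
# Three VMs of 576 MB each appear to require 1728 MB of physical memory.
vms = 3
per_vm_mb = 576
apparent_mb = vms * per_vm_mb        # 3 x 576 MB = 1728 MB

# Transparent page sharing collapses redundant pages across the VMs;
# if 256 MB of pages turn out to be identical, that memory is freed.
shared_mb = 256                      # illustrative, workload-dependent
actual_mb = apparent_mb - shared_mb  # headroom to power on a fourth VM
```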
However, if activity in the three original VMs should increase to the extent that each VM needs its own
distinct page, ESX Server can accommodate this shortage of physical memory through the use of a
swapfile.
Based on the share-based memory allocation policy, ESX Server reclaims physical memory by moving the
memory contents of a VM to disk. Ordinarily, this would be a very risky operation since there is no reliable,
programmatic method for the VMkernel to distinguish VM pages that are optimal for swapping to disk
from pages that should never be swapped to disk (for example, it would be inappropriate to swap the
guest kernel’s pages). ESX Server solves this problem through the use of the balloon driver.
It is important to note, though, that this overcommit scenario, with its use of the ESX Server swapfile, can
have serious implications for the performance of the applications running in a VM. In environments where
performance is, in any way, a concern, avoid memory overcommitment. If possible, configure the physical
ESX Server platform with enough physical memory to accommodate all hosted VMs.
Network
Network virtualization in ESX Server is centered on the concept of a virtual switch, which is a software
representation of a 1016-port, full-duplex Ethernet switch. The virtual switch is the conduit between the
network interfaces of two VMs, or between a VM and the physical network. VM network interfaces connect
to virtual switches; virtual switches connect to both physical and virtual network interfaces, as shown in
Figure 10.
Figure 10: Showing a virtual switch providing connectivity between VMs and the network
When an application running in a VM attempts to send a packet, the request is handled by the operating
system and pushed to the network interface card device driver through the network stack. Inside the VM,
however, the network interface driver is actually the driver for the abstracted instance of the network
resource; the pathway through this virtual interface is not directly to the physical network interface but,
instead, passes the packet to the virtual switch components of the VMkernel. Once the packet has been
passed to the virtual switch, the VMkernel forwards the packet to the appropriate destination – either out of
the physical interface or to the virtual interface of another VM connected to the same virtual switch. Since
the virtual switch is implemented entirely in software, switch speed is a function of server processor power.
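Conceptually, the forwarding decision reduces to a destination-MAC lookup, sketched below (illustrative only; this is not actual VMkernel logic, and the MAC addresses are invented):

```python
def forward(frame_dst_mac, mac_table, uplink="physical-uplink"):
    """A virtual switch delivers a frame to the local virtual NIC that
    owns the destination MAC, or out the physical uplink otherwise."""
    return mac_table.get(frame_dst_mac, uplink)

# Two VMs connected to the same virtual switch; any frame addressed
# elsewhere leaves the host through the physical interface.
table = {"00:50:56:aa:aa:aa": "vmA-vnic",
         "00:50:56:bb:bb:bb": "vmB-vnic"}
```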
How to dedicate a physical NIC to a VM
A common question is, “How do I dedicate a physical NIC to a VM?”
By connecting a single virtual network adapter and a single physical network interface to a virtual switch, a
single VM obtains exclusive use of the physical interface. In this way, a physical NIC can be dedicated to a
VM. However, if a second VM is connected to the same virtual switch, both VMs will pass traffic through the
same physical interface.
A more complete response to the question of dedicating a physical NIC to a VM would be that a physical
network interface is not dedicated to a VM; instead, ESX Server is configured, through virtual switches, to
bridge only a single virtual network adapter through an individual physical interface.
Configuring virtual switches
Since it is possible to connect either more than one virtual adapter to a virtual switch or more than one
physical adapter to a single virtual switch (as shown in Figure 11), consider the multiplexing operation of a
virtual switch in each of these scenarios.
First, consider a virtual switch connected to two virtual network adapters deployed in two different VMs. Just
as if this were a physical switch with two servers connected, these VMs can communicate with one another
via the virtual switch. This scenario can be scaled up to the limits of the virtual switch, a 1016-port device (32
ports by default), allowing up to 1016 VMs attached to the same virtual switch to communicate within a
physical host. In this environment, with no physical adapter connected to the virtual switch, network traffic is
utterly isolated from the physical network segment.
If this example is modified by connecting 1016 VMs and one physical adapter to the virtual switch, all 1016
VMs can communicate with the physical network via the single physical interface. Interestingly, in ESX
Server, physical adapters connected to virtual switches do not count against the number of ports available
on a virtual switch.
Note that, in the example, each VM has only a single network interface connected to the virtual switch;
there is no reason for a VM to have more than one virtual network interface connected to a single virtual
switch. In fact, ESX Server does not allow a VM to be configured with more than one virtual network
adapter connected to a single virtual switch.
Virtual network adapters, virtual switches, and the connections between these devices, are VMkernel
processes whose speeds are dictated by server CPU speed; as purely software processes within the
VMkernel, these devices are operational as long as the VMkernel is operational. Unlike physical network
components, virtual network devices cannot fail or reach the physical limitations of media throughput. As a
result, there is no need to use multiple virtual adapters to address fault tolerance or performance concerns
– not always the case for physical adapters.
When interfacing with the physical world, however, virtual switches can be connected to multiple physical
network adapters, as shown in Figure 12.
When multiple physical network interfaces are attached to a single virtual switch, ESX Server and the
VMkernel recognize this as an attempt to address the fault-tolerance and performance concerns of
physical networks and automatically create a bonded team of physical network interfaces. This bonded
team is able to send and receive higher rates of data and, should a link in the bonded team fail, the
remaining member(s) of the team continue to provide network access. In other words, the connection of
multiple physical adapters to the same virtual switch creates a fault-tolerant NIC team for all VMs
communicating through this virtual switch. There is no need for a driver or special configuration settings
within the VMs.
The fault-tolerance delivered by the NIC team is completely transparent to the guest operating system.
Indeed, even if the guest operating system does not support NIC teaming or fault-tolerant network
connections, the VMkernel and the virtual switch deliver this functionality through the abstracted network
service exposed to the VM.
Load distribution
ESX Server does not distribute the frames that make up a single TCP session across multiple links in a bond.
This means that a session with a single source and single destination never consumes more bandwidth than
is provided by a single network interface. However, with IP-based load balancing, multiple sessions to
multiple destinations can together consume more total bandwidth than any single physical link provides.
The only scenario in which
the same frame is sent over more than one interface occurs when no network links within the bond can be
verified as functional.
The network load sharing capability of ESX Server can be configured to employ load sharing policies based
on either Layer 2 (based on the source MAC address) or Layer 3 (based on a combination of the source
and destination IP addresses). By default, ESX Server uses the Layer 2 policy, which does not require any
configuration on the external, physical switch.
With MAC-based teaming, because the only consideration in determining which link to use is the source
MAC address, a VM always transmits frames over the same physical NIC within a bond. However, the IP-based
load distribution algorithm typically results in a more evenly balanced utilization of all physical links in a
bond.
If ESX Server is configured to use the IP-based load-distribution algorithm, the external, physical switch must
be configured to communicate using the IEEE 802.3ad specification. Because the MAC address of the VM
will appear on each of the physical switch ports through which it transmits, this configuration is
likely to confuse the switch unless 802.3ad is enabled. The load-distribution algorithm also handles inbound
and outbound traffic differently.
Figure 13 compares MAC-based load balancing with IP-based load balancing.
Distributing outbound traffic
With the IP-address-based load-distribution algorithm enabled, outbound non-IP frames sent from a
virtual machine are distributed among the network interfaces within a single bond in a round-robin
fashion. IP-based traffic is, by default, distributed among the member interfaces of a bond
based on the destination IP address within the packet. This algorithm has the following strengths and
weaknesses:
• It prevents out-of-order TCP segments and provides, in most cases, reasonable distribution of the load.
• 802.3ad-capable network switches may be required as there have been reports that this algorithm
confuses non-802.3ad switches.
• It is not well-suited for environments where a VM communicates with a single host – in this environment, all
traffic is destined for the same IP address and would not be distributed.
• With this algorithm, all traffic between each pair of hosts traverses only one link per pair of hosts until a
failover occurs.
• The transmit NIC to be used by a VM for a network transaction is chosen based on whether the
combination of destination and source IP addresses is even or odd. Specifically, the IP-based algorithm
uses an exclusive-or (XOR) of the last bytes in both the source and destination IP addresses to determine
which physical link should be used for a source-destination pair. By considering both the source and
destination IP addresses when selecting the bond member for a particular host, it is possible for two VMs
within the same ESX Server host, using the same network bond, to select different physical interfaces,
even when communicating with the same remote host.
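The selection rule in the last point can be sketched as follows (a hypothetical model of the described XOR behavior; the addresses are invented and the real VMkernel implementation is not public):

```python
import ipaddress

def select_link(src_ip, dst_ip, num_links):
    """Pick a bond member by XOR-ing the last byte of the source and
    destination IP addresses (a sketch of the IP-based policy)."""
    src_last = int(ipaddress.ip_address(src_ip)) & 0xFF
    dst_last = int(ipaddress.ip_address(dst_ip)) & 0xFF
    return (src_last ^ dst_last) % num_links

# Two VMs on the same host may select different physical links in a
# two-NIC bond, even when talking to the same remote host.
link_a = select_link("10.0.0.11", "10.0.0.50", 2)
link_b = select_link("10.0.0.12", "10.0.0.50", 2)
```

Because the result is deterministic for a given address pair, all traffic between one source and one destination always traverses the same link until a failover occurs, matching the behavior described above.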
It is important to note that the load distribution algorithm is not bound on a per-VM basis. In other words,
the path selection for load distribution supports different physical paths to the same destination on a
multi-homed virtual adapter.
Eliminating the switch as a single point of failure
ESX Server allows a single team of bonded NICs to be connected to multiple physical switches, eliminating
the switch as a single point of failure. This feature requires the beacon monitoring feature of both the
physical switch and ESX Server NIC team to be enabled.
Beacon monitoring allows ESX Server to test the links in a bond by sending a packet from one adapter to
the other adapters within a virtual switch across the physical links. For more information on the beacon
monitoring feature, see the ESX Server administration guide.
Another effective way to improve VM performance is by deploying the VMware vmxnet device driver. By
default, VMs are created with a highly-compatible virtual network adapter – the device reports itself as an
AMD PCNet PCI Ethernet adapter (vlance). This device is used as the default because of its near-universal
compatibility – there are DOS drivers for this adapter, as well as Linux, Netware, and all versions of Windows.
While this virtual adapter reports link speeds of 10Mbps with only a half-duplex interface, the actual
throughput can be much closer to the capabilities of the physical interface.
If the vlance adapter is not delivering acceptable throughput or if the physical host is suffering from
excessive CPU utilization, higher throughput may be possible by changing to the vmxnet adapter, which is a
highly-tuned virtual network adapter for VMs. The vmxnet driver is installed as a component of the VMware
Tools package, and must be supported by the operating system running in the virtual machine. For a list of
supported operating systems, please see the System Compatibility Guide for ESX Server 3.
Another key to maximizing the performance of physical network adapters is the manual configuration of
the speed and duplex settings of both the physical network adapters in an ESX Server and the physical
switches to which the ESX Server is connected. VMware Knowledge Base article #813 details the settings
and steps necessary to force the speed and duplex of most physical network adapters.
In most cases, ESX Server is configured to dedicate a physical network adapter to the service console for
management and administration. There are, however, scenarios where it may be necessary to have the
service console use a network adapter that is allocated to the ESX Server VMkernel. Such scenarios are
usually introduced by dense server blade configurations that have only two physical NICs and cannot
spare an entire physical interface for the service console. In this case, the service console can access the
same virtual networking resources (virtual switches and network adapters). This is achieved by correctly
configuring a single virtual switch to handle a combination of interfaces such as the service console,
VMotion and VM networks. Although consolidating the interfaces is not a recommended best practice, it is
possible.
To handle multiple source MAC addresses, the physical network interface of the server is put into
promiscuous mode. This causes its physical MAC address to be masked; all packets transmitted on the
network segment are presented to the VMkernel virtual switch interface. Any packets destined for a VM are
forwarded to the virtual network adapter through the virtual switch interface. Packets not destined for a VM
are immediately discarded. Similarly, network nodes perceive packets from a VM to have been transmitted
by the VM; the role of the physical interface is undetectable – the physical network interface has become
similar to a port on a switch, an identity-less conduit.
VLANs
ESX Server and virtual switches also support IEEE 802.1q VLAN tagging. To increase isolation or improve the
security of network traffic, ESX Server allows VMs to fully leverage existing VLAN capabilities and even
extends this functionality by implementing VLAN tagging within virtual switches.
VLAN tagging allows traffic to be isolated within the confines of a switched network. Traditionally, VLAN
tagging is performed by a physical switch, based on the physical port on which a packet arrives at the
switch. In an environment with no virtualized server instances, this approach provides complete isolation
within broadcast domains. However, when virtualization is introduced, port-based tagging at the physical
switch does not provide VLAN isolation between VMs that share the same physical network connection.
To address the scenario where broadcast-domain isolation is required between two VMs sharing the same
physical network, virtual switches support the creation of port groups that can provide VLAN tagging
isolation between VMs within the confines of a virtual switch. Port groups aggregate multiple ports under a
common configuration and provide a stable anchor point for virtual machines connecting to labeled
networks. Each port group is identified by a network label, which is unique to the current host, and can
optionally have a VLAN tagging ID.
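Conceptually, a port group binds a host-unique network label to an optional VLAN ID, and frames from a connected VM are tagged accordingly, sketched below (the labels, IDs, and frame representation are invented for illustration):

```python
# Port groups map a host-unique network label to an optional VLAN ID.
port_groups = {"Production Network": 100,
               "Management Network": 200,
               "Untagged Network": None}

def tag_frame(frame, port_group):
    """Attach an 802.1q VLAN tag to an outbound frame if the VM's port
    group carries a VLAN ID; otherwise pass the frame through untagged."""
    vlan = port_groups[port_group]
    return dict(frame, vlan=vlan) if vlan is not None else frame

tagged = tag_frame({"dst": "00:50:56:aa:bb:cc"}, "Production Network")
```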
Considerations when configuring virtual switches
• When initially configuring your virtual switches on ESX Server, invest in creating a naming convention that
provides meaningful names for these switches beyond the context of a single server. For example,
VMotion requires that both the source and destination ESX Server have the same network names; for this
reason, virtual switch names like “Second Network” may not translate from server to server as easily as
more definitive designations like, “Production Network” or “Management Network.”
• ESX Server supports a maximum of 20 physical NICs, whether 100 Mbps or 1 Gbps
• A virtual switch provides up to 1016 ports for virtual network adapter connections; the default is 32.
However, physical connections do not consume ports on virtual switches. For example, if four physical
network cards are connected to a single virtual switch, that switch still has all 1016 ports available for VMs.
• When using VLAN tagging within a virtual switch, you should configure the VM’s network adapter to
connect to the name of the port group, rather than the name of the virtual switch. Note that the
external, physical switch port to which ESX Server connects should be set to VLAN trunking mode to
allow the port to receive packets bound for multiple broadcast domains.
• A virtual switch may connect to multiple virtual network adapters (multiple VMs), but a VM can have no
more than one connection to any virtual switch
• A physical adapter may not connect to more than one virtual switch, but a virtual switch may connect to
multiple physical network adapters. When multiple physical adapters are connected to the same virtual
switch, they are automatically teamed and bonded.
• If you are implementing VMotion within your ESX Server environment, reserve or assign a Gigabit NIC for
VMotion to ensure the quickest possible migration.
Note:
VMware only supports VMotion over Gigabit Ethernet; VMotion over a 10/100
Mbps network is not supported.
Storage
Storage virtualization is probably the most complex component within an ESX Server environment. Some of
this complexity can be attributed to the robust, feature-rich Storage Area Network (SAN) devices deployed
to provide storage, but much is due to the fact that SANs and servers are often managed independently,
sometimes by entirely different organizations. As a result, this white paper discusses storage virtualization
from the following two perspectives:
• How the SAN (iSCSI, fibre channel) sees ESX Server
• How ESX Server sees the SAN (iSCSI and fibre channel)
Presenting both perspectives should help both SAN and server administrators better communicate their
unique requirements in an ESX Server deployment.
Architecture
Figure 14 presents an overview of virtual storage.
Figure 14: A virtual storage solution with three VMs accessing a single VMFS volume
ESX Server storage virtualization allows VMs to access underlying physical storage as though it were JBOD
SCSI within the VM – regardless of the physical storage topology or protocol. In other words, a VM accesses
physical storage by issuing read and write commands to what appears to be a local SCSI controller with a
locally-attached SCSI drive. Either an LSILogic or BusLogic SCSI controller driver is loaded in the VM so that
the guest operating system can access storage exactly as if this were a physical environment.
When an application within the VM issues a file read or write request to the operating system, the operating
system performs a file-to-block conversion and passes the request to the driver. However, the driver in an ESX
Server environment does not talk directly to the hardware; instead, it passes the block read/write request to
the VMkernel, where the physical device driver resides, and the request is then forwarded through the actual
physical hardware device to the storage controller. In previous versions of ESX, the physical device drivers
were not loaded in the kernel, which created an extra leg in the journey from the VM to the physical storage.
The integration of the drivers into the kernel in ESX Server 3 removes this extra translation layer and
improves I/O performance.
The storage controller may be a locally-attached RAID controller or a remote, multi-pathed SAN device –
the physical storage infrastructure is completely hidden from the virtual machine. To the SAN, however, the
converse is true: VMs are completely hidden from the physical storage infrastructure. The storage controller
sees I/O requests that appear to originate from an ESX Server; all storage bus traffic from VMs on a
particular physical host appears to originate from a single source.
There are two ways to make blocks of storage accessible to a VM:
• Using an encapsulated, VMware File System (VMFS)-hosted VM disk file
• Using a raw LUN formatted with the operating system’s native file system
VMFS
The vast majority of (unclustered) VMs use encapsulated disk files stored on a VMFS volume.
Note:
VMFS is a high-performance file system that stores large, monolithic virtual disk
files and is tuned for this task alone.
To understand why VMFS is used requires an understanding of VM disk files. Perhaps the closest analogy to a
VM disk file is an .ISO image of a CD-ROM disk, which is a single, large file containing a file system with many
individual files. Through the virtualization layer, the storage blocks within this single, large file are presented
to the VM as a SCSI disk drive, made possible by the file and block translations described above. To the VM,
this file is a hard disk, with physical geometry, files, and a file system; to the storage controller, this is a range
of blocks.
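The layered translation can be sketched as flat offset arithmetic (a simplification; real VMFS allocation is block-based, and the offsets here are invented):

```python
SECTOR = 512  # bytes per SCSI logical block

def guest_lba_to_lun_offset(guest_lba, vmdk_start_offset):
    """A guest's SCSI read of logical block N becomes a read at byte
    offset N*512 inside the .vmdk file, which VMFS in turn locates at
    some offset within the LUN (assuming a flat, unfragmented file)."""
    return vmdk_start_offset + guest_lba * SECTOR

# Block 100 of the virtual disk, with the .vmdk starting 1 MiB into the LUN:
lun_offset = guest_lba_to_lun_offset(100, 1 << 20)
```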
A VM disk file is, for all intents and purposes, the hard drive of a VM. This file contains the operating system,
applications, data, and all the settings associated with a typical/conventional hard drive. If an
administrator were to delete a VM disk file, it would be analogous to throwing a physical hard drive in the
trash – the data, the operating system, the applications, the settings, and even blocks of storage would be
lost. By the same token, if an administrator were to copy a VM disk file, an exact duplicate of the VM’s hard
drive would be created for use as a backup or for cloning the particular configuration.
Unlike Windows and Linux operating systems, ESX Server does not lock a LUN when it is mounted – a simple
fact that is the source of both power and potential confusion in an ESX Server environment. When
configuring a switched SAN topology, it is critical to use zoning, selective storage presentation, or LUN
masking to limit the number of physical servers (non-ESX Server) that can see a particular LUN. Without
limiting which physical – Windows or Linux – servers can see a LUN, locking and LUN contention will quickly
cause data to become inaccessible or inconsistent between nodes.
VMFS is inherently a distributed file system, allowing more than one ESX Server to view the same LUN. Unlike
Windows/NTFS or Linux/ext3, ESX Server/VMFS supports simultaneous access by multiple hosts. This means
that while numerous ESX Server instances may view the contents of a VMFS LUN, only one ESX Server may
open a file at any given moment. To an ESX Server and VMFS, when a VM is powered on, the VM disk file is
locked.
While VMotion is described in detail later in the document, it might be helpful to explain now that, in a
VMotion operation, the VM disk file remains in place on the SAN, in the same LUN; file ownership is simply
transferred between ESX Server hosts that have access to the same LUN.
The distributed nature of VMFS means that, when configuring the SAN to which ESX Server is attached,
zoning should be configured to allow multiple ESX Servers to access the same LUN where the VMFS partition
resides. This may be out of the ordinary for the SAN administrator.
Figure 15 shows a typical SAN solution.
Figure 15: A virtual storage solution with six VMs accessing LUNs on a SAN array
Remember that the VMFS volume will host multiple VMs, which has two effects on LUN performance:
• Since a single VMFS volume may have multiple ESX Servers and each ESX Server may have multiple VMs
within the same partition, the I/O loads on a VMFS-formatted LUN can be significantly higher than the
loads on a single-host, single-operating system LUN.
• Since many VM disk files are likely to be stored within a single VMFS volume, the importance for fault
tolerance on this LUN is amplified. Always employ at least the level of fault tolerance used for physical
machines.
Fault tolerance becomes even more of a concern if a larger VMFS volume is created from multiple,
smaller VMFS extents within ESX Server. Should any one extent fail, all data within that extent would be
lost, whereas information on the remaining extents would remain available. Therefore, measures like RAID
technology and stand-by drives should be considered standard as part of any VMFS LUN.
From a pure performance perspective, tuning an array for a particular application may not be as effective
with VMs as with physical machines. Since VM storage is abstracted from the VM and, typically,
encapsulated in a virtual machine disk file within a VMFS volume, it is probable that the same parameters
that enhanced database performance in an NTFS partition will not deliver the same gains in a virtualized
environment. As a result, at this time there are no recommended application-specific tuning parameters for
a VMFS formatted LUN.
Tuning VM storage
While it may be possible to perform some storage performance tuning in an ESX Server environment, you
should consider some potential trade-offs.
Storage performance tuning generally involves a low-level understanding of how an application accesses
disks and how to configure placement, allocation units, and caches within an array to optimize the
performance of this application. What is not always considered is that enhancing the performance of one
application may, in practice, degrade the performance of many other applications.
Understanding this tuning trade-off is especially important in an ESX Server environment where dissimilar
applications may access the same groups of spindles. If an array hosting the virtual disks for several file and
print server VMs were tuned to optimize Microsoft SQL Server performance, for example, the performance
of the file and print servers would probably be degraded. It is also possible that, since the array is tuned for
SQL Server traffic – and is therefore less efficient when handling file and print traffic – SQL Server
performance could be degraded while the array struggles with the suboptimal file and print workload.
What may ultimately determine the degree to which storage is tuned for VM application performance is
the trade-off in flexibility. For the majority of deployments, the flexible, on-demand capability to create and
move VMs is one of the most powerful features of an ESX Server environment. To some extent, creating LUNs
that are tuned for specific applications restricts this flexibility.
Other design considerations
• When designing LUN schemes and storage layouts for a virtualized environment, you should consider the
requirements of VMotion, which needs all VM disk files (or the raw device mapping file) to be visible on
the SAN to both the source and destination servers.
• According to the “VirtualCenter Technical Best Practices” white paper, available at
http://www.vmware.com/pdf/vc_technical_best.pdf, there should be no more than 16 ESX Server hosts
connected to a single VMFS volume.
• In a larger deployment, it may not be practical to expose all VMs to all hosts; as a result, care should be
taken to ensure that VM disk files or disk mappings are accessible to the appropriate ESX Server hosts.
Figure 16: A VM and physical machine clustered with a raw LUN
Using its capability to attach a raw device as a local storage device, a VM can hold or host data within
native operating system file systems, such as NTFS or ext3.
Raw device mapping
Prior to the release of ESX Server 2.5, the use of raw devices meant that many of the flexible aspects of
VMFS and VM disk files were not available. However, a feature called Raw Device Mapping (RDM)
addresses this shortcoming by allowing a VM to attach to a raw device as though it were a VMFS-hosted
file. With RDM, raw devices can deliver many of the same features previously reserved for VM disk files –
particularly VMotion and .redo logs.
Note:
.redo logs for VM disk files used in undoable disks and VM snapshots are
available only when raw device mapping is in virtual compatibility mode.
Raw device mapping relies on a VMFS-hosted pointer – or proxy – file to redirect requests from the VMFS file
system to the raw LUN.
For example, consider the following VMFS directory:
[root@System1 root]# ls -la /vmfs/demo/
total 25975808
drwxrwxrwt 1 root root 512 Aug 19 20:12 .
drwxrwxrwt 1 root root 512 Aug 22 19:10 ..
-rw------- 1 root root 4194304512 Aug 25 15:13 W2K-SQL.vmdk
-rw------- 1 root root 18210038272 Aug 19 20:12 W2K-SQLDATA.vmdk
-rw------- 1 root root 4194304512 Aug 25 15:13 WNT-BDC.vmdk
In the above example, the VMFS volume demo contains both VM disk files and VM raw device mappings.
W2K-SQLDATA.vmdk is the raw device mapping that points the physical host to the appropriate LUN.
Note that the raw device mapping appears to be exactly like a VM disk file, even appearing to have a file
size equivalent to that of the LUN to which the mapping refers. Since the map file is accessible through
VMFS, it appears to all physical hosts that can see the VMFS volume. When a VM attempts to access its
raw-device-mapped storage, the VMkernel resolves the SAN target through the data stored in the mapping
file, which enables per-host resolution of the raw device it proxies.
Consider a second example with two physical hosts; on each host is one node of a two-node cluster. Each
node – NodeA and NodeB – references a shared data disk that is a raw device. The VM configuration file
for NodeA references the shared disk as /vmfs/demo/data_disk.vmdk; NodeB shares this apparently
identical reference to /vmfs/demo/data_disk.vmdk for the shared data drive. However, because of
physical and configuration differences between the two systems, the physical SAN paths to the VMFS
volume demo and the physical SAN paths to the LUN referenced by the mapping file data_disk.vmdk are
different.
For the server hosting NodeA, the physical SAN address for the demo LUN is vmhba1:0:1:2; for the server
hosting NodeB, the physical SAN address for the demo LUN is vmhba2:0:1:2. Similarly, the paths to the LUN
referred to by raw device mapping might be different. Without raw device mapping, only the physical,
static SAN path is used to access the raw LUN. Since the two physical hosts access the LUN over different
physical SAN paths, the VM configuration files would have to be updated to resolve the change in SAN
without a raw device map.
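The per-host resolution performed through the mapping file can be sketched as a lookup (the vmhba addresses echo the example's demo LUN paths; the data structure and host names are purely illustrative):

```python
# Each host resolves the same mapping file to its own physical SAN path.
rdm_paths = {
    ("NodeA-host", "/vmfs/demo/data_disk.vmdk"): "vmhba1:0:1:2",
    ("NodeB-host", "/vmfs/demo/data_disk.vmdk"): "vmhba2:0:1:2",
}

def resolve_raw_device(host, mapping_file):
    """Both cluster nodes reference the identical mapping file, but the
    VMkernel on each host resolves it to a host-specific SAN path."""
    return rdm_paths[(host, mapping_file)]
```

Because the path lives behind the mapping file rather than in the VM configuration, a VM can move between hosts without any configuration change, which is what makes VMotion possible for raw devices.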
By removing the limitations of static SAN path definitions, raw device maps enable VMotion operations with
VMs that use raw devices. Additional functionality is enabled by raw device mappings; now, all raw device
access for a mapped LUN is proxied through a VMFS volume. As a result, the raw device may have access
to many of the features of the VMFS file system, depending on the mapping mode used.
There are two raw device mapping modes – virtual compatibility mode and physical compatibility mode.
• Virtual compatibility mode allows a mapped raw device to inherit nearly all of the features of a VM disk
file – such as file locking, file permissions, and .redo logs.
• Physical compatibility mode allows nearly every SCSI command to be passed directly to the storage
controller. This means that SAN-based replication tools, such as HP StorageWorks Business Copy or
Continuous Access, should work within a VM that is presented storage through a raw device map in
physical compatibility mode. This mode should allow SAN management applications to communicate
directly with storage controllers for monitoring and configuration.
Check with the storage vendor to determine if the appropriate storage management software has been
tested and is supported for running in a VM.
Testing has shown no performance difference between VMs accessing storage as encapsulated disk files
and those accessing storage as raw volumes; however, from an administrative perspective, the use of raw
volumes requires more coordination between SAN and server administrators. VMFS does not require the
strict SAN zoning needed to support raw devices with non-distributed file systems.
From a functional perspective, with the introduction of RDM, many of the differences between VMFS and
raw devices have been resolved. As such, unless there is an application requirement or architectural
justification for using raw devices, the use of VM disk files in a VMFS volume is preferable due to their
flexibility and ease of management. For example, with raw storage devices an administrator must create a
LUN on the SAN whenever a new VM is to be created; on the other hand, when creating a VM using a VM
disk file within a VMFS volume, no SAN administration is required since the LUN already exists.
Planning partitions
Before installing ESX Server, VMware strongly recommends that you consider your partitioning needs.
Repartitioning an ESX Server requires some Linux expertise; it is easier to plan an appropriate installation
rather than having to repartition later. Table 2 shows the recommended partitioning for a typical scenario.
See the Installation and Upgrade Guide for more detailed information.
Table 2: Default storage configuration and partitioning for a VMFS volume on internal drives
Note:
If your ESX Server host has no network storage and only one local disk, you must
create two more required partitions on the local disk (for a total of five
required partitions):
• vmkcore: a vmkcore partition is required to store core dumps for
troubleshooting. VMware does not support ESX Server host configurations
without a vmkcore partition.
• vmfs3: a vmfs3 partition is required to store your virtual machines.
These vmfs3 and vmkcore partitions are required on a local disk only if the ESX
Server host has no network storage.
The /var partition can be particularly important as a log file repository. By having the /var mount point
reference a partition that is separate from the root partition, the root partition is less likely to become full. If
the root partition on the service console becomes completely full, the system may become unstable.
Implementing boot-from-SAN
The distributed nature of the VMFS file system can only be leveraged in a shared storage environment;
currently, a SAN (iSCSI or fibre channel) or NAS is the only form of shared storage certified for use
with ESX Server. As a result, most ESX Server deployments are attached to a SAN.
ESX Server supports boot-from-SAN, wherein the boot partitions for the Linux-based service console are
placed on the SAN (iSCSI or fibre channel); NAS does not support boot-from-SAN. In this boot-from-SAN
environment, there is no need for local drives within the physical host.
Unlike the VMFS volumes used for storing VM disk files, the partitions for booting ESX Server, which use the
standard Linux ext3 file system, should not be zoned for access by more than one system. In other words, in
the zoning configuration within your SAN, VMFS volumes may be exposed to many hosts; however, boot
partitions – /boot, / (root), swap and any other service console partitions you may have created – should
only be exposed to a single host.
The configuration of ESX Server to boot from SAN should be performed at installation time. If you are
installing from the product CD, you must select either the bootfromsan or bootfromsan-text option.
Note:
ESX Server multipathing even allows fault-tolerant SAN access to non-VMFS
volumes, if there are raw devices within the VMs.
ESX Server identifies storage entities through a hierarchical naming convention that references the following
elements: controller, target, LUN, and partition. This convention provides unique references to VMFS volumes,
such as vmhba2:0:1:2.
This example corresponds to the partition accessed through HBA vmhba2, target 0, LUN 1, and partition 2.
When ESX Server scans the SAN, each HBA reports all LUNs visible on the storage network; each LUN reports
an ID that uniquely identifies it to all nodes on the storage network. After detecting the same unique LUN ID
reported by the storage network, the VMkernel automatically enables multiple, redundant paths to this LUN,
known as multipathing.
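As an illustration, the naming convention can be decomposed with a short parser. The function and field names below are assumptions made for this sketch, not part of any VMware interface:

```python
def parse_storage_address(address):
    """Split an ESX storage address such as 'vmhba2:0:1:2' into its
    controller (HBA), target, LUN, and partition components."""
    adapter, target, lun, partition = address.split(":")
    return {
        "adapter": adapter,           # HBA through which the LUN is reached
        "target": int(target),        # SCSI target on that adapter
        "lun": int(lun),              # logical unit number
        "partition": int(partition),  # partition on the LUN
    }

# The example from the text: partition 2 on LUN 1, target 0, via vmhba2.
assert parse_storage_address("vmhba2:0:1:2") == {
    "adapter": "vmhba2", "target": 0, "lun": 1, "partition": 2,
}
```

This mirrors how two hosts can refer to the same LUN through different adapters (vmhba1 versus vmhba2) while the target, LUN, and partition fields match.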
ESX Server uses a single storage path for a particular LUN until the LUN becomes unavailable over this path.
After noting the path failure, ESX Server switches to an operational path.
Fail-back
For fail-back after all paths are restored, two policies are available to govern the ESX Server response: fixed
and Most-Recently Used (MRU). These policies can be configured through the Storage Management
Options in the web interface or from the command line.
• The fixed policy dictates that access to a particular LUN should always use the specified path, if available.
Should the specified, preferred path become unavailable, the VMkernel uses an alternate path to access
data and partitions on the LUN. ESX Server periodically attempts to initialize the failed SAN path; when the
preferred path is restored, the VMkernel reverts to this path for access to the LUN.
• The MRU policy does not place a preference on SAN paths; instead, the VMkernel accesses LUNs over
any available path. In the event of a failure, the VMkernel maintains LUN connectivity by switching to a
healthy SAN path. The LUN will continue to be accessed over this path, regardless of the state of the
previously-failed path; ESX Server does not attempt to initialize and restore any particular path.
Note:
The concept of a preferred path applies only when the failover policy is fixed;
with the MRU policy, the preferred path specification is ignored.
Application of the path policy is dictated, to a large extent, by the particular storage array deployed.
• For the active-passive SAN controllers found in HP StorageWorks EVA3000, EVA5000 and MSA-series arrays,
avoid the fixed policy; only use MRU.
• For the newer HP StorageWorks EVA4000, EVA6000, and EVA8000 arrays and all members of the HP
StorageWorks XP disk array family, which are all true active-active storage controller platforms, either
policy – fixed or MRU – can be used.
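The behavioral difference between the two policies can be sketched as a simplified path-selection model. The function, its parameters, and the path names are illustrative assumptions, not VMkernel internals:

```python
def select_path(policy, preferred, last_used, available):
    """Pick the SAN path to use for a LUN under the 'fixed' or 'mru' policy.

    policy    -- "fixed" or "mru"
    preferred -- the administrator-specified preferred path (fixed only)
    last_used -- the path used for the previous I/O
    available -- set of currently healthy paths
    """
    if policy == "fixed" and preferred in available:
        return preferred            # fixed always reverts to the preferred path
    if last_used in available:
        return last_used            # otherwise keep using the current path
    return sorted(available)[0]     # fail over to any healthy path

paths = {"vmhba1:0:1", "vmhba2:0:1"}

# Fixed: once the preferred path is restored, traffic reverts to it.
assert select_path("fixed", "vmhba1:0:1", "vmhba2:0:1", paths) == "vmhba1:0:1"

# MRU: the previously failed path is ignored; the most recently used path stays active.
assert select_path("mru", "vmhba1:0:1", "vmhba2:0:1", paths) == "vmhba2:0:1"
```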
Since the physical storage mechanism is masked by the VMkernel, VMs are unaware of the underlying
infrastructure hosting their data. As a result, multipathing, multipathing policy, and path failover are all
irrelevant within a VM.
Resource Management
ESX Server 3.0 allows organizations to pool computing resources and then logically and dynamically
allocate guaranteed resources as appropriate, whether to organizations, individuals, or job functions. For
the following sections, it is helpful to think in terms of resource providers and resource consumers.
Clusters
VirtualCenter allows users to create clusters, which can be viewed as logical containers within which
computing resources are grouped. Each cluster can be configured to support VMware DRS and
VMware HA, which are discussed later in this section. Clusters are consumers of host resources and are
providers to resource pools and VMs.
Figure 17: Host systems aggregated into a single resource pool
Resource Pools
Resource Pools are used to hierarchically divide CPU and memory resources within a designated cluster.
Each individual host and each DRS cluster has a root resource pool which aggregates the resources of that
individual host or cluster. Child resource pools can be created from the root resource pool. Each child
owns a portion of the parent's resources and can, in turn, provide a hierarchy of child pools. Resource pools
can contain both child resource pools and virtual machines. Within each pool, users can specify
reservations, limits, and shares, which are then available to the child resource pools or VMs. For a detailed
discussion on the benefits, usage and resource pool best practices, please refer to the VMware “Resource
Management Guide.”
Resource Allocation
ESX Server provides powerful, flexible hardware allocation policies to enforce Quality of Service (QoS) or
performance requirements, allowing users to define limits and reservations for CPU and memory allocations
within each VM. These dynamic resource management policies make it possible to reserve CPU resources
for a particular VM – even while the VM is operational. For example, administrators could improve the
potential performance of one VM by specifying a reservation of 100% of CPU resources; at the same time,
other VMs in the same physical host could be constrained to a limit of 25%.
Allocations can be absolute or share-based. In addition, the allocations can be made to a resource pool
or to an individual VM.
Absolute allocation
It is possible to set a limit and reservation for each VM on a physical host. If, for example, a VM has been
allocated 25% of a CPU, VMkernel gives the VM at least 25% of CPU regardless of the demands of other
VMs, unless the VM with the reservation is idle6. Likewise, a resource pool can be guaranteed a reservation
of 25% of the CPU resources of a cluster. Therefore, each VM within that resource pool will further divide the
compute resources available to that pool. Regardless of reservations, idle VMs are preempted in favor of
VMs requesting resources.
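The interplay of reservation, limit, and idle preemption described above can be sketched as a deliberately simplified model. The function and its semantics are illustrative assumptions, not the actual VMkernel scheduler:

```python
def cpu_grant(demand, reservation, limit):
    """CPU fraction granted to a VM under absolute allocation -- a
    deliberately simplified model of scheduler entitlement.

    demand      -- fraction of a CPU the VM is trying to consume
    reservation -- guaranteed minimum while the VM is busy
    limit       -- hard cap on consumption
    """
    if demand == 0:  # idle VMs are preempted despite their reservations
        return 0.0
    # A busy VM is entitled to at least its reservation, capped at its limit.
    return min(max(demand, reservation), limit)

# A busy VM with a 25% reservation is entitled to at least 25%.
assert cpu_grant(demand=0.10, reservation=0.25, limit=1.0) == 0.25
# A VM constrained to a 25% limit never exceeds it, however busy.
assert cpu_grant(demand=0.60, reservation=0.0, limit=0.25) == 0.25
# An idle VM yields its reservation to other VMs.
assert cpu_grant(demand=0.0, reservation=0.25, limit=1.0) == 0.0
```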
Share-based allocation
In addition to the absolute allocation of resources for an individual, busy VM (with limit and reservation
guarantees), share-based allocation provides a mechanism for the relative distribution of server resources
between VMs. This concept applies to resource pools as well.
Each VM is assigned a certain number of per-resource, per-VM shares. For example, if two VMs have an
equal number of CPU shares, VMkernel ensures that they receive an equal number of CPU cycles (assuming
that neither reservations nor limits are violated for either VM and that neither VM is idle). If one VM has twice
as many shares as another, the VM with the larger share receives twice as many CPU cycles (again,
assuming that minimum or maximum guarantees are not violated for either VM and that neither VM is idle).
Consider a cluster that contains two resource pools, each with an equivalent number of CPU shares. The
VMkernel will guarantee that each pool within that cluster is provided an equal number of CPU cycles. If
one resource pool has twice as many shares as the other, the resource pool with the superior share count
will receive twice the number of CPU cycles allotted to that particular cluster.
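The proportional arithmetic described above can be sketched as follows. This deliberately simplified model assumes every VM or pool is busy and ignores reservations and limits:

```python
def divide_cycles(total_cycles, shares):
    """Divide CPU cycles among busy VMs (or resource pools) in proportion
    to their shares -- a simplified model that assumes no VM is idle and
    that no reservation or limit is violated."""
    total_shares = sum(shares.values())
    return {name: total_cycles * s / total_shares for name, s in shares.items()}

# Two VMs with equal shares split the CPU evenly.
assert divide_cycles(1000, {"vm1": 1000, "vm2": 1000}) == {"vm1": 500.0, "vm2": 500.0}

# A VM with twice the shares receives twice the cycles.
assert divide_cycles(1200, {"vm1": 2000, "vm2": 1000}) == {"vm1": 800.0, "vm2": 400.0}
```

The same arithmetic applies one level up: two resource pools with equal shares split the cycles of their cluster evenly.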
6 An idle VM is only attempting to execute instructions that constitute the idle loop process.
Allocating shares for other resources
Shares can be defined and allocated per-resource – for CPU, memory, or disk – for each VM. Note that the
application of the relative share allocation policy for other resources differs slightly from CPU:
• Memory
The share allocation policy for memory defines the relative extent to which memory is reclaimed from a
VM if memory overcommitment should occur. A VM with a larger allocation of shares retains a
proportionally larger allocation of physical memory when VMkernel needs to usurp memory from VMs.
• Disk
For disk accesses, the proportional share algorithm allows proportional prioritization for each VM’s disk
access.
There are no shares associated with network traffic; instead of using shares, network resources are
constrained either by traffic shaping or limiting outbound bandwidth.
Best practices
VMware publishes best practices guides for many components of the virtualized environment. These
include the following:
VirtualCenter & Templates http://www.vmware.com/pdf/vc_2_templates_usage_best_practices_wp.pdf
VMware VirtualCenter
VirtualCenter is a centralized management application that supports the hierarchical and logical
organization and viewing of physical ESX Server resources and associated VMs.
VirtualCenter 2.0 allows users to view the following key items:
• All running VMs
• The current state and utilization of each VM
• All ESX Server physical hosts
• The current state and utilization of each ESX Server physical host
• Historical performance and utilization data for each VM
• Historical performance and utilization data for each ESX Server physical host
• VM configuration
• Cluster (DRS and HA) and resource pool configuration
Architecture
VirtualCenter is based on a client – server – agent architecture, with each managed host requiring a
management agent license7.
When an ESX Server physical host connects to VirtualCenter, VirtualCenter automatically installs an agent,
which communicates status as well as command and control functions between the VirtualCenter server
and ESX Server.
VirtualCenter server8 is a Windows service that may run either on a physical server or inside a VM. Each
VirtualCenter server should be able to manage between 50 and 100 physical hosts and between 1000 and
2000 VMs, depending on the configuration of the server running the VirtualCenter server service.
The Virtual Infrastructure Client application acts as the user interface for VirtualCenter. It does not require a
license, and many clients may access the same VirtualCenter server simultaneously. The Virtual Infrastructure
Client also acts as the main interface to ESX Server 3; this is convenient for environments that have a small
number of ESX Server hosts, in which it is feasible to manage these hosts directly.
Note:
VirtualCenter requires an ODBC-compliant database for its datastore. This
database holds historical performance data and VM configuration
information.
Template
A template is analogous to a “golden master” server image and represents a ready-to-provision server
installation that helps eliminate the redundant tasks associated with provisioning a new server. For instance,
a template can be built by creating a VM and installing an operating system, all of the required patches
and service packs, and standard security and management applications as well as any common
configuration parameters. The VM's network identity is then reset using a tool such as SysPrep, and the VM is
powered off. VirtualCenter can then be used to create a template from this VM. New VMs can be
deployed and customized using a wizard-driven interface or an XML-formatted file containing the desired
customizations.
Templates are not required to be stored within the VirtualCenter server filesystem; they can also be stored
on NAS shared storage or on a VMFS3 datastore.
7 The management agent is licensed based on the number of physical processors present in the platform to be managed.
8 This server is also a separately licensed component of the Virtual Infrastructure, though, unlike other separately licensed products, the
license for VirtualCenter Management Server is not included within the Virtual Infrastructure Node bundle.
Cloning
VirtualCenter can also clone a VM to achieve the rapid deployment and replication of a server
configuration.
Both of these deployment options support the thorough customization of a new VM before it is powered on.
Consider your particular environment before selecting the approach that best meets your needs.
When configuring the database connection for VirtualCenter, configure the ODBC client to use a
System DSN with SQL Authentication.
– VirtualCenter 2.0 does not support Windows Authentication to the database servers.
Compatibility
Refer to Table 3 for compatibility between current and previous versions of VirtualCenter and ESX Server.
VMotion
With the release of VMotion, VMware introduced a unique, new technology that allows a VM to move
between physical platforms while the VM is running. VMotion can address a wide range of IT challenges –
from accommodating scheduled downtime to building an Adaptive Enterprise.
Architecture
VMotion relies on several of the underlying components of ESX Server virtualization, most notably the VMFS
file system.
As described earlier, VMFS is a distributed file system that locks VM disk files at the file level, a unique locking
mechanism that allows multiple ESX Server instances to share a particular VMFS volume. This mechanism
ensures that only one physical host at a time can access a given disk file and power on the associated VM.
To support the rapid movement of VMs between physical machines, it is imperative that the large amount
of data associated with each VM does not move – moving the many gigabytes of disk storage associated
with a typical VM would take a significant length of time. As a result, instead of moving the disk storage,
VMotion and VirtualCenter simply change the owner of the VM disk file, allowing the VM to migrate to a
different physical host (as shown in Figure 17).
Figure 17: Migrating a VM from one physical host to another without moving the VM disk file
The new physical host also requires access to the memory contents and CPU state information of the VM to
be migrated. However, unlike the disk-bound data, there is no shared medium for memory and CPU
resources; the memory contents and CPU state must be migrated by copying the data over a network
connection.
When initiating a migration, VMotion takes a snapshot of source server memory, then sends a copy of these
memory pages – unencrypted – to the destination server. During this copying process, execution continues
on the source server; as a result, memory contents on this server are likely to change. ESX Server tracks these
changes and, when copying is complete, sends the destination server a map indicating which memory
pages have changed.
At this point, the CPU state is sent to the destination server, the file lock is changed, and the destination
server opens the VM file and assumes execution of the VM. Accesses to any of the changed memory
pages are served from the source server until all memory changes have been communicated to the
destination server via background processes.
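The memory-transfer sequence described above can be sketched as a toy simulation. All names and data structures are illustrative assumptions; the real VMkernel performs this transfer at the page level with iterative refinement:

```python
def vmotion_precopy(memory, dirty_pages, send):
    """Sketch of the VMotion memory transfer described above: copy every
    page while the VM keeps running, then ship a map of the pages that
    changed during the copy. Illustrative only.

    memory      -- dict of page number -> contents on the source host
    dirty_pages -- set of page numbers the VM wrote during the copy
    send        -- callable that ships data to the destination host
    """
    # Phase 1: snapshot and copy every page while execution continues.
    for page, contents in memory.items():
        send(("page", page, contents))
    # Phase 2: tell the destination which pages are stale; accesses to
    # them are served from the source until the background resend finishes.
    send(("dirty-map", sorted(dirty_pages)))
    return len(memory), len(dirty_pages)

sent = []
copied, dirty = vmotion_precopy({0: b"a", 1: b"b", 2: b"c"}, {1}, sent.append)
assert (copied, dirty) == (3, 1)
assert sent[-1] == ("dirty-map", [1])  # page 1 changed mid-copy
```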
These network-intensive operations justify the deployment of a Gigabit network interface to minimize
latency between source and destination servers and maximize the rate at which memory pages can be
moved between these servers. Moreover, since these memory pages are not encrypted, security needs
may justify the deployment of a dedicated network interface.
• Both must have access to the VMFS SAN-based partition that holds either the VM disk file or the VM disk
raw device map file.
Since the VM disk file or raw device map file is not moved during the VMotion operation, both servers
must be able to access this partition. A VMotion operation cannot involve moving the disk file from one
LUN (either local or SAN-based) to another9.
Ensure that the SAN is configured to expose the LUN to the HBAs of both the source and destination
servers.
• Both must have identical virtual switches defined and available for all virtual network adapters within the
VM.
Assuming that the VM to be migrated is using a network interface to perform meaningful activity,
VirtualCenter must ensure that this connection is still available after the migration. Before initiating a
VMotion operation, VirtualCenter examines and compares virtual switch definitions and configurations on
both source and destination servers to ensure that they are identical.
Note that VirtualCenter does not attempt to validate the defined connectivity; it assumes that the IT
staff followed good practice during configuration.
For example, if the VM connects to a virtual switch named devnet on the source server, the destination
server must also have a virtual switch named devnet. If the appropriate virtual switch exists on the
destination server, VirtualCenter assumes that the networks are identical and provides the same
functional connectivity. As a result, if the virtual switch on the source server connects to a development
network and the virtual switch on the destination server connects to a production network, the VMotion
operation still continues; however, it is likely that the application within the migrated VM will not be able to
access the appropriate network resources.
To facilitate this process, you should take care to be consistent and use meaningful names when
configuring and creating virtual switches.
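The name-only comparison that VirtualCenter performs can be sketched as a simple check; the function and its inputs are illustrative assumptions:

```python
def missing_vswitches(vm_networks, destination_switches):
    """Return the virtual-switch names a VM uses that do not exist on the
    destination host -- mirroring the name-only check described above."""
    return sorted(set(vm_networks) - set(destination_switches))

# 'devnet' exists on both hosts, so the migration proceeds -- even if the
# two 'devnet' switches are wired to entirely different physical networks.
assert missing_vswitches(["devnet"], ["devnet", "prodnet"]) == []

# A missing switch name blocks the VMotion operation.
assert missing_vswitches(["devnet"], ["prodnet"]) == ["devnet"]
```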
• Both must have compatible processors.
Since the CPU and execution states are moved from one server to the other, it is critical that
both processors implement the same instruction sets in exactly the same manner. If not, unsupported or
altered instruction execution will have unknown and potentially catastrophic effects on the migrated VM.
While this safeguard seems straightforward, the enhancements regularly implemented by Intel and AMD
mean that compatibility is not always clear, particularly when these enhancements occur within a
particular processor family.
VMware Knowledge Base article #1377 provides an overview of the challenges faced when migrating a
VM between a physical host that supports the SSE3 instruction set and one that does not. For example,
VMotion reports an incompatibility between HP ProLiant BL20p G2 and G3 server blades.
If an incompatibility is reported, the Knowledge Base article cited above describes an unsupported method
for overcoming this safeguard.
• The destination server must have enough free memory to support the minimum guarantee for the VM to
be moved.
In practice, this statement could apply to all server resources: if a physical resource allocation
guaranteed to the VM on the source machine cannot be met on the destination machine, the VMotion
operation fails. In this case, the VM continues to run on the source server.
Clustered VMs unsupported
Currently, clustered VMs are not supported for VMotion operations.
VMotion requires that VMs access VMFS volumes using the public bus access mode; however, because of
their shared storage requirements, ESX Server requires clustered VMs to use the shared mode. These two
access modes are incompatible.
In order to migrate a clustered VM node from one physical host to another, you must take down one node
and perform a “cold migration.” After the migration is complete, bring the cluster node back up and rejoin
the cluster. With the cluster complete, repeat the process with the other node, if desired.
9 To perform a migration that requires the disk file to be moved, the VM must be powered off (or suspended) and “cold migrated.”
For more information
For access to VMware product guides, see http://www.vmware.com/support/pubs
For detailed information on planning, deploying, or managing a virtual infrastructure on
ProLiant, see http://h71019.www7.hp.com/ActiveAnswers/cache/71086-0-0-0-121.html
Copyright © 2006 VMware, Inc. All rights reserved. Protected by one or more of U.S. Patent
Nos. 6,397,242, 6,496,847, 6,704,925, 6,711,672, 6,725,289, 6,735,601, 6,785,886, 6,789,156,
6,795,966, 6,880,022 6,961,941, 6,961,806 and 6,944,699; patents pending. VMware, the
VMware “boxes” logo and design, Virtual SMP and VMotion are registered trademarks or
trademarks of VMware, Inc. in the United States and/or other jurisdictions. Microsoft,
Windows and Windows NT are registered trademarks of Microsoft Corporation. Linux is a
registered trademark of Linus Torvalds. All other marks and names mentioned herein may
be trademarks of their respective companies.