
VMware vSphere 5 Design Best Practice Guide

www.viKernel.com, Gareth Hogarth


Advisory Information:
This document has been put together using VMware vSphere 5 best practice white papers; in most cases the information is simply
copied and pasted. These are my notes used in preparation for the VCAP5-DCD exam. Please note that I may have intentionally
omitted information from the white papers, keeping only the pertinent information relating to areas that I feel should provide the
appropriate knowledge of vSphere components. Use at your own risk.

This material is copyrighted by VMware. The information is publicly available from pubs.vmware.com.

Purpose:
The purpose of this document is to provide the necessary supporting knowledge for the VCAP5-DCD exam. The information
provided is not intended as a comprehensive study guide, but can be used to assist you with some of the topics highlighted in the
exam blueprint.

Version: 1.0
Date: 14/05/2014
Author: Gareth Hogarth
Description: vSphere 5.0, 5.1 specific content.



Table of Contents
1. VMware vSphere VMFS Technical Overview and Best Practices
2. VMware Fault Tolerance Recommendations and Considerations on VMware vSphere
3. Networking Best Practices
4. VMware vSphere High Availability 5.0 Deployment Best Practices
5. vSphere ESXi vCenter Server 5.0 Availability Guide (High Level)
6. Best Practices for Running VMware vSphere on iSCSI
7. Best Practices for Running VMware vSphere on Network Attached Storage
8. VMware vSphere 5.0 Upgrade Best Practices
9. Best Practices for Performance Tuning of Latency-Sensitive Workloads in vSphere VMs
10. Performance Best Practices for VMware vSphere 5.0
11. VMware vSphere Distributed Switch Best Practices
12. VMware Network I/O Control: Architecture, Performance and Best Practices
13. Storage I/O Control Technical Overview and Considerations for Deployment

1. VMware vSphere VMFS Technical Overview and Best Practices

VMFS5
Provides Distributed Infrastructure Services for Multiple vSphere Hosts
VMFS enables virtual disk files to be shared by as many as 32 vSphere hosts. Furthermore, it manages storage access for
multiple vSphere hosts and enables them to read and write to the same storage pool at the same time.
Facilitates Dynamic Growth
Provides Intelligent Cluster Volume Management
Optimizes Storage Utilization
Enables High Availability with Lower Management Overhead
Simplifies Disaster Recovery

Best Practices for Deployment and Use of VMFS

Topics Addressed:
How Large a LUN?

The best way to configure a LUN for a given VMFS volume is to size for throughput first and capacity second. That is, you
should aggregate the total I/O throughput for all applications or virtual machines that might run on a given shared pool of
storage; then make sure you have provisioned enough back-end disk spindles (disk array cache) and appropriate storage
service to meet the requirements.
Because there is no single correct answer to the question of how large your LUNs should be for a VMFS volume, the more
important question to ask is, "How long would it take to restore the virtual machines on this datastore if it were to
fail?" The recovery time objective (RTO) is now the major consideration when deciding how large to make a VMFS
datastore. This equates to how long it would take an administrator to restore all of the virtual machines residing on a single
VMFS volume if there were a failure that caused data loss.
The main concern now is how long it would take to recover from a catastrophic storage failure. Another important question
to ask is, "How does one determine whether a certain datastore is overprovisioned or underprovisioned?"
vSphere Storage DRS, introduced in vSphere 5.0, can also be a useful feature to leverage for load balancing virtual machines
across multiple datastores, from both a capacity and a performance perspective.

Isolation or Consolidation

The basic answer depends on the nature of the I/O access patterns of that virtual machine. If you have a very heavy I/O-
generating application, in many cases VMware vSphere Storage I/O Control can assist in managing fairness of I/O resources
among virtual machines. Another consideration in addressing the noisy neighbor problem is that it might be worth the
potentially inefficient use of resources to allocate a single LUN to a single virtual machine. This can be accomplished using
either an RDM or a VMFS volume that is dedicated to a single virtual machine. These two types of volumes perform
similarly (within 5 percent of each other), with varying read and write sizes and I/O access patterns.

Isolated Storage Resources

One school of thought suggests limiting the access of a single LUN to a single virtual machine. In the physical world, this is
quite common. When using RDMs, such isolation is implicit, because each RDM volume is mapped to a single virtual
machine.
The downside to this approach is that as you scale the virtual environment, you soon reach the upper limit of 256 LUNs per
host.

Consolidated Pools of Storage

The consolidation school wants to gain additional management productivity and resource utilization by pooling the storage
resource and sharing it, with many virtual machines running on several vSphere hosts. Dividing this shared resource among
many virtual machines enables better flexibility as well as easier provisioning and ongoing management of the storage
resources for the virtual environment.
Compared to strict isolation, consolidation normally offers better utilization of storage resources. The cost is additional
resource contention, which under some circumstances can lead to reduction in virtual machine I/O performance. However,
vSphere offers Storage I/O Control and vSphere Storage DRS to mitigate these risks.

Best Practice: Mix Consolidation with Some Isolation

In general, use vSphere Storage DRS to detect and mitigate storage latency and capacity bottlenecks by load balancing
virtual machines across multiple VMFS volumes. Additionally, vSphere Storage I/O Control can be leveraged to ensure
fairness of I/O resource distribution among many virtual machines sharing the same VMFS datastore.
Because workloads can vary significantly, there is no exact formula that determines the limits of performance and
scalability regarding the number of virtual machines per LUN. These limits also depend on the number of vSphere hosts
sharing concurrent access to a given VMFS volume. The key is to remember the upper limit of 256 LUNs per vSphere host
and consider that this number can diminish the consolidation ratio if you take the concept of one LUN per virtual machine
too far.

Use of RDMs or VMFS

An RDM file is a special file in a VMFS volume that manages metadata for its mapped device.
Employing RDMs provides the advantages of direct access to a physical device while keeping some advantages of a virtual
disk in the VMFS file system. In effect, the RDM merges VMFS manageability with raw device access.

An RDM is a symbolic link from a VMFS volume to a raw volume.

Using RDMs, you can do the following:

Use vMotion to migrate virtual machines using raw volumes.
Add raw volumes to virtual machines using the vSphere Client.
Use file system features such as distributed file locking, permissions and naming.

RDMs have the following two compatibility modes:


Virtual compatibility mode enables a mapping to act exactly like a virtual disk file, including the use of virtual machine
snapshots.
Physical compatibility mode enables direct access of the SCSI device, for those applications needing lower level control.
vMotion, vSphere DRS and vSphere HA are all supported for RDMs that are in both physical and virtual compatibility modes.

Why Use VMFS?

For most applications, VMFS is the clear choice. It provides the automated file system capabilities that make it easy to
provision and manage storage for virtual machines running on a cluster of vSphere hosts. VMFS has an automated
hierarchical file system structure with user-friendly file-naming access.
It enables a higher disk utilization rate by facilitating the process of provisioning the virtual disks from a shared pool of
clustered storage.
As you scale the number of vSphere hosts and the total capacity of shared storage, VMFS greatly simplifies the process. It
also enables a larger pool of storage than might be addressed via RDMs. Because the number of LUNs that a given cluster of
vSphere hosts can discover is currently capped at 256, you can reach this number rather quickly if mapping a set of LUNs to
every virtual machine running on the vSphere host cluster.
Using RDMs usually requires more frequent and varied dependence on the storage administration team, because each LUN
must be sized for the needs of each specific virtual machine to which it is mapped.
With VMFS, however, you can carve out many smaller VMDKs for virtual machines from a single VMFS volume. This enables
the partitioning of a larger VMFS volume (a single LUN) into several smaller virtual disks, which enables a centralized
management utility (vCenter) to be used as a control point, as illustrated in the example below.
With RDMs, there is no way to break up the LUN and address it as anything more than a single disk for a given virtual
machine
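
As a simple illustration of carving smaller virtual disks out of a shared VMFS volume, the vmkfstools utility in the ESXi shell can
create a virtual disk of a given size and format. This is only a sketch; the datastore, directory and file names below are hypothetical.

   # Create a directory for the virtual machine and a 20GB thin-provisioned disk on the shared VMFS volume
   mkdir /vmfs/volumes/datastore1/vm01
   vmkfstools -c 20G -d thin /vmfs/volumes/datastore1/vm01/vm01_1.vmdk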

Why Use RDMs?


Even with all the advantages of VMFS, there still are some cases where it makes more sense to use RDM storage access. The
following scenarios call for raw disk mapping:

Migrating an existing application from a physical environment to virtualization


Using Microsoft Cluster Service (MSCS) for clustering in a virtual environment
Implementing N-Port ID Virtualization (NPIV)
Separating heavy I/O workloads from the shared pool of storage

RDM Scenario 1: Migrating an Existing Application to a Virtual Server


Figure 4 shows a typical migration from a physical server to a virtual one. Before migration, the application running on the physical
server has two disks (LUNs) associated with it. One disk is for the OS and application files; a second disk is for the application data.
To begin, use VMware vCenter Converter to build the virtual machine and to load the OS and application data into the new
virtual machine.

Next, remove access to the data disk from the physical machine and make sure the disk is properly zoned and accessible from the
vSphere host. Then create an RDM for the new virtual machine pointing to the data disk. This enables the contents of the existing
data disk to be accessed just as they are, without the need to copy them to a new location.


RDM Scenario 2: Using Microsoft Cluster Service in a Virtual Environment

Another common use of RDMs is for MSCS configurations.


When and How to Use Disk Spanning
It is generally best to begin with a single LUN in a VMFS volume. To increase the size of that resource pool, you can provide
additional capacity by either 1) adding a new VMFS extent to the VMFS volume or 2) increasing the size of the VMFS volume on an
underlying LUN that has been expanded in the array (via a dynamic expansion within the storage array). Adding a new extent to the
existing VMFS volume will result in the existing VMFS volume spanning across more than one LUN. However, until the initial
capacity is filled, that additional allocation of capacity is not yet put to use.
Expanding the VMFS volume on an existing, larger LUN will also increase the size of the VMFS volume, but it should not be confused
with spanning.
From a management perspective, it is preferable to host your VMFS volume on a single large LUN with a single extent. Using multiple
LUNs to back multiple extents of a VMFS volume entails presenting every LUN to each of the vSphere hosts sharing the datastore. Although
multiple extents might have been required prior to the release of vSphere 5 and VMFS5 to produce VMFS volumes larger than 2TB,
VMFS5 now supports single-extent volumes up to 64TB.
Gaining Additional Throughput and Storage Capacity
Additional capacity with disk spanning does not necessarily increase I/O throughput capacity for that VMFS volume. It does,
however, result in increased storage capacity.
Suggestions for Rescanning
In prior versions of vSphere, it was recommended that before adding a new VMFS extent to a VMFS volume, you make sure a rescan
of the SAN is executed for all nodes in the cluster that share the common pool of storage. However, in more recent versions of
vSphere, there is an automatic rescan that is triggered when the target detects a new LUN, so that each vSphere host updates its
shared storage information when a change is made on that shared storage resource. This auto rescan is the default setting in
vSphere and is configured to occur every 300 seconds.
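
If a manual rescan is still needed (for example, immediately after presenting a new LUN), it can be triggered from the vSphere
Client (Rescan All on the Storage Adapters view) or from the ESXi shell. The following is a sketch assuming an ESXi 5.x host;
verify the exact command options against your build's esxcli reference.

   # Rescan all storage adapters on this host for new devices and VMFS volumes
   esxcli storage core adapter rescan --all
   # Refresh VMFS volume information on the host
   vmkfstools -V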


2. VMware Fault Tolerance Recommendations and Considerations on VMware vSphere

VMware High Availability Features Timeline


VMware Fault Tolerance (FT)

VMware FT is a feature available with VMware vSphere 4 (i.e., ESX 4 and vCenter Server 4) that allows a virtual machine to
continue running even when the underlying physical server fails.
It is a software solution that runs on commodity hardware and does not require any modifications to the guest operating system or
applications running inside the virtual machine
Overview
When VMware FT is enabled on a virtual machine (called the Primary VM), a copy of the Primary VM (called the Secondary VM) is
automatically created on another host, chosen by VMware Distributed Resource Scheduler (DRS).
If VMware DRS is not enabled, the target host is chosen from the list of available hosts. VMware FT then runs the Primary and
Secondary VMs in lockstep with each other, essentially mirroring the execution state of the Primary VM to the Secondary VM. In
the event of a hardware failure that causes the Primary VM to fail, the Secondary VM immediately picks up where the Primary VM
left off, and continues to run without any loss of network connections, transactions, or data.
VMware FT keeps the Primary and Secondary VMs in lockstep using VMware vLockstep technology. vLockstep technology ensures
that the Primary and Secondary VMs execute the same x86 instructions in an identical sequence. Here, the Primary VM captures all
nondeterministic events and sends them across a VMware FT logging network to the Secondary VM
As both the Primary and Secondary VMs execute the same instruction sequence, both initiate I/O operations. However, the outputs
of the Primary VM are the only ones that take effect: disk writes are committed, network packets are transmitted, and so on. All
outputs of the Secondary VM are suppressed by ESX. Thus, only a single virtual machine instance appears to the outside world.

Transparent Failover
Along with keeping the Primary and Secondary VMs in sync, VMware Fault Tolerance must rapidly detect and respond to hardware
failures of the physical machines running the Primary or the Secondary VM. When vLockstep technology is initiated, the ESX
hypervisor starts sending heartbeats over the FT logging network between the ESX hosts where the Primary and Secondary VMs
reside. This allows VMware FT to detect immediately if a host fails and execute a transparent failover where the remaining VMware
FT virtual machine continues running the protected workload without interruption.
Consider a VMware HA cluster of three ESX hosts, two of which are running a Primary and Secondary VM.
If the host running the Primary VM fails, the Secondary VM is immediately activated to replace the Primary VM. A new Secondary VM
is created and fault tolerance is re-established in a short period of time. Unlike the initial creation of the Secondary VM where DRS
chooses the target ESX host, for failovers VMware HA chooses the target ESX host for the new Secondary VM. Users experience no
interruption in service and no loss of data during the transparent failover.
Lifecycle of a fault-tolerant virtual machine
Turning on and enabling VMware FT for a virtual machine affects the virtual machine's lifecycle, but it is entirely transparent to the
end-user client and does not disrupt client connections or the client's workload. The following steps outline the lifecycle of a
VMware FT virtual machine:
1. Administrator selects a virtual machine in either the powered-on or off state and turns on VMware FT.
2. The virtual machine becomes the Primary VM and a Secondary VM is automatically created and assigned to an ESX host,
sharing the same disk as the ESX host running the Primary VM.
3. If the Primary VM is already powered-on when VMware FT is turned on, its active state is immediately migrated using a
special form of VMotion to the Secondary VM on an automatically chosen ESX host. If the Primary VM is powered-off then the
migration of its active state to the Secondary VM occurs right after the Primary VM is powered on.
4. The Secondary VM stays synchronized with the Primary VM through VMware vLockstep technology.
5. If the ESX host running the Primary VM goes down, the Secondary VM will immediately go live and become the Primary VM.
6. VMware HA automatically starts a new Secondary VM on another available host to restore protection.
7. The Secondary VM is powered off when the Primary VM powers off or when VMware FT is disabled. The Secondary VM is
removed altogether when VMware FT is turned off.


Requirements:
Cluster and Host Requirements

VMware FT can only be used in a VMware HA cluster.


Ensure that all ESX hosts in the VMware HA cluster have identical ESX versions and patch levels. vLockstep technology only
works between Primary and Secondary VMs on hosts running identical versions of ESX. Please see the section on Patching
hosts running VMware FT virtual machines for recommendations on how to upgrade hosts that are running FT virtual
machines.
ESX host processors must be VMware FT capable and belong to the same processor model family. VMware FT-capable
processors required changes in both the performance counter architecture and virtualization hardware assists of both AMD
and Intel (AMD Opteron processors based on the Barcelona, Budapest and Shanghai processor families, and Intel Xeon
processors based on the Penryn and Nehalem micro-architectures and their successors).
VMware FT does not disable AMD's Rapid Virtualization Indexing (i.e., nested page tables) or Intel's Extended Page Tables
for the ESX host, but these features are automatically disabled for the virtual machine when turning on VMware FT. However, virtual
machines without FT enabled can still take advantage of these hardware-assisted virtualization features.
VMware FT is supported on ESX hosts which have hyper-threading enabled or disabled. Hyper-threading does not have to
be disabled on these systems for VMware FT to work.


Storage Requirements

Shared storage is required: Fibre Channel, iSCSI, or NAS.


Turning on VMware FT for a virtual machine first requires the virtual machine's virtual disk (VMDK) files to be eager-zeroed
and thick-provisioned. Thin-provisioned or lazy-zeroed disks can therefore be converted during off-peak times through two
methods (see the example after this list): use the vmkfstools --diskformat eagerzeroedthick option in the vSphere CLI when the
virtual machine is powered off (see the vSphere Command-Line Interface Installation and Reference Guide for details:
http://www.vmware.com/pdf/vsphere4/r40/vsp_40_vcli.pdf), or set the cbtmotion.forceEagerZeroedThick = true flag in the
.vmx file before powering on the virtual machine and then use VMware Storage vMotion to do the conversion.
Backup solutions within the guest operating system for file- or disk-level backups are supported. However, these
applications may lead to the saturation of the VMware FT logging network if heavy read access is performed.
Saturation of the FT logging network could occur for any disk-intensive workload.
Do not run many VMware FT virtual machines with high disk reads and high network inputs on the same ESX host.
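
As a sketch of the off-peak conversion described above, a lazy-zeroed or thin disk can be cloned to the eager-zeroed thick format
with vmkfstools while the virtual machine is powered off. The paths and file names are hypothetical; the remote vSphere CLI
version of vmkfstools takes equivalent options plus --server/--username connection parameters.

   # Clone the existing disk to a new eager-zeroed thick disk, then re-point the virtual machine at the new VMDK
   vmkfstools -i /vmfs/volumes/datastore1/vm01/vm01.vmdk /vmfs/volumes/datastore1/vm01/vm01-ezt.vmdk -d eagerzeroedthick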

Networking Recommendations
At a minimum, use 1 GbE NICs for the VMware FT logging network. Use 10 GbE NICs for increased bandwidth of FT logging
traffic.
Ensure that the networking latency between ESX hosts is low. Sub-millisecond latency is recommended for the FT logging
network. Use vmkping to measure the latency (see the example after this list).
VMware vSwitch settings on the hosts should also be uniform, such as using the same VLAN for VMware FT logging, to
make these hosts available for placement of Secondary VMs. Consider using a VMware vNetwork Distributed Switch to
avoid inconsistencies in the vSwitch settings.
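
A quick latency check between hosts can be run from the ESXi shell with vmkping. The IP address below is a placeholder for the
remote host's FT logging VMkernel address.

   # Ping the FT logging VMkernel address of the peer host; round-trip times should be sub-millisecond
   vmkping 192.168.20.12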
Baseline Recommendation:

Preferably, each host has separate 1 GbE NICs for FT logging traffic and VMotion. The reason for recommending separate NICs is that
the creation of the Secondary VM is done by migrating the Primary VM with VMotion. This can produce significant traffic on the
VMotion NIC and could affect VMware FT logging traffic if the NICs are shared.

In addition, it is preferable that the VMware FT logging NIC has redundancy, so that no unnecessary failovers occur if a
single NIC is lost.
As described in the steps below, the VMware FT logging NIC and VMotion NIC can be configured so that they will
automatically share the remaining NIC if one or the other NIC fails.


1. Create a vSwitch that is connected to at least two physical NICs.
2. Create a VMkernel connection (displayed as a VMkernel Port in the vSphere Client) for vMotion and another one for FT traffic.
3. Make sure that different IP addresses are set for the two VMkernel connections.
4. Assign the NIC teaming properties to ensure that vMotion and FT use different NICs as the active NIC (an example of scripting this follows):
a. For vMotion: Set NIC A as active and NIC B as standby.
b. For FT: Set NIC B as active and NIC A as standby.
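
One possible way to apply the active/standby assignment above is with esxcli on each host, assuming a standard vSwitch and the
hypothetical port group names "vMotion" and "FT Logging", with vmnic0 as NIC A and vmnic1 as NIC B; verify the exact esxcli
namespace against your ESXi version.

   # vMotion: NIC A active, NIC B standby
   esxcli network vswitch standard portgroup policy failover set -p "vMotion" -a vmnic0 -s vmnic1
   # FT logging: NIC B active, NIC A standby
   esxcli network vswitch standard portgroup policy failover set -p "FT Logging" -a vmnic1 -s vmnic0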

Not supported:
Source port ID or source MAC address based load balancing policies do not distribute FT logging traffic. However, if there are
multiple VMware FT host pairs, some load balancing is possible with an IP-hash load balancing scheme, though IP-hash may require
physical switch changes such as ether-channel setup. VMware FT will not automatically change any vSwitch settings.
VMware FT Usage Scenarios
VMware FT can be used to protect mission-critical workloads, while VMware HA protects the other workloads by restarting the
virtual machine in the event of a virtual machine or ESX host failure.
Running VMware FT and VMware HA virtual machines on the same ESX host is fully supported. VMware HA also helps protect

VMware FT virtual machines in the unlikely case where the ESX hosts running the Primary and Secondary VMs both fail. In that case,
VMware HA will trigger the restart of the Primary VM as well as re-spawn a new Secondary VM onto another host. Note that if the
guest operating system in the Primary VM fails, such as resulting from a blue screen in Windows, the Secondary VM will experience
the same failure. The VMware HA feature called VM Monitoring will detect this Primary VM failure through VMware Tools
heartbeats and VMware HA will automatically restart the failed Primary VM and re-spawn a new Secondary VM.
VMware FT on-demand
The process of turning on VMware FT for a virtual machine takes on the order of minutes. Turning off VMware FT occurs in seconds.
This allows virtual machines to be turned on and off on-demand when needed. Turning on and off VMware FT can also be
automated by scheduling the task for certain times using the vSphere CLI.
During critical times in your datacenter, such as the last three days of the quarter when any outage can be disastrous, VMware FT
on-demand can be scheduled to protect virtual machines for the critical 72 or 96 hours when protection is vital.
When the critical period ends VMware FT is turned off again, and the resources used for the Secondary VM are no longer allocated.
Patching hosts running VMware FT virtual machines
When ESX hosts are running VMware FT virtual machines, the ESX hosts running the Primary and Secondary VMs must be running
the same ESX version and patch level. This requirement must be carefully considered when updating the ESX hosts. The following
two approaches are recommended for patching ESX hosts with FT virtual machines.
The first approach is suggested for environments where disabling VMware FT for virtual machines can be tolerated for the amount
of time required to update all ESX hosts in the cluster
For each virtual machine protected by VMware FT in the cluster, right-click the virtual machine, highlight Fault Tolerance and select
Disable Fault Tolerance (note: turning off VMware FT would work but turning it back on later would take longer).
After updating all hosts in the cluster to the same version and patch level right-click each virtual machine you wish to protect with
VMware FT, highlight Fault Tolerance, and select Enable Fault Tolerance.
Please note that the performance data of the Secondary VM will be lost when you turn off VMware FT for the virtual machine. This
data is not lost when you disable VMware FT.
Recommendations for Reliability
Removing single points of failure from your environment is the most important practice in increasing reliability. Reduce single points
of failure by implementing multiple NICs, multiple HBAs, multiple power supplies, storage RAID, etc. Fully-redundant NIC teaming
and storage multi-pathing are recommended to improve reliability.
VMware FT does attempt a failover if the Primary VM loses all paths to fibre channel storage and the Secondary VM still has
connection to fibre channel storage, but customers should not rely on this. Instead they should implement fully-redundant NIC
teaming and storage multi-pathing


Other recommendations to improve reliability include:

Ensuring VMotion and VMware FT logging NICs use a private network.


Using vNetwork Distributed Switches for all networks and hosts.
Minimizing VMotion migrations of the Primary or Secondary VMs to reduce network and compute resources required by
VMware FT. The administrator may also prefer to keep the Primary and Secondary VMs on specific hosts.
Ensuring that ESX hosts deliver consistent CPU cycles by making the power management usage consistent among hosts.
When using network-attached storage (NAS), ensure that the NAS device itself has sufficient resources

Uniformity of Hosts:
The ESX hosts in your cluster should be as uniform as possible, as described in the Cluster and Host Requirements
section. For better performance, the hosts running the Primary and Secondary VMs should operate at roughly the same processor
frequencies in order to ensure the highest level of fault tolerance. Processor speed differences greater than 400 MHz in frequency
may become problematic for CPU-bound workloads.
CPU frequency scaling may cause the Secondary VM to run slower than the Primary VM and will cause the Primary VM to slow
down.
It is therefore recommended that BIOS-based power management features be used consistently across hosts and that certain
settings should be avoided on hosts with VMware FT virtual machines.
VMware Distributed Power Management (DPM) will not recommend a host for power off unless it can successfully recommend
VMotion migrations of all virtual machines off that host.
Because VMware DRS is disabled for VMware FT virtual machines and they cannot be migrated by vMotion recommendations, VMware DPM
will not recommend powering off any host with running VMware FT virtual machines.
However, VMware DPM can still be enabled on a VMware HA cluster running VMware FT virtual machines and will simply provide
power on or off recommendations for hosts not running VMware FT virtual machines.
Placement of Fault Tolerant Virtual Machines
VMware FT creates Secondary VMs and places them onto another ESX host. If VMware DRS is enabled, DRS decides the target host
for the Secondary VM when VMware FT is turned on. If DRS is not enabled, the target host is chosen from the list of available hosts.

After a failover, VMware HA decides the target host for the new Secondary VM. When enabling VMware FT for many virtual
machines, you may want to avoid the situation where many Primary and Secondary VMs are placed on the same host. The number
of fault tolerant virtual machines that you can safely run on each host cannot be stated precisely because the number is based on
the ESX host size, the virtual machine size, and workload factors, all of which can vary widely. VMware does expect the number of
supportable VMware FT VMs running on a host to be bound by the saturation of the VMware FT logging network
Given this, it is recommended that no more than four Primary and Secondary VMs be placed onto the same ESX host. For running
more than four VMware FT virtual machines on a host, refer to the following:
As described in the section on VMware vLockstep technology, the VMware FT logging network traffic depends on the amount of
nondeterministic events and external inputs that are recorded at the Primary VM. Since the bulk of this traffic usually consists of
incoming network packets and disk reads, one could calculate the amount of networking bandwidth required for VMware FT logging
using the following:
VMware FT logging bandwidth ~= (Avg disk reads (MB/s) x 8 + Avg network input (Mbps)) x 1.2 [20% headroom]
The above calculation reserves an additional 20 percent of networking bandwidth on top of the disk and network inputs to the
virtual machine. This 20 percent headroom is recommended for transmitting nondeterministic CPU events and for TCP/IP overhead.
You can measure the characteristics of your workload through the vSphere Client. Click the Performance tab of the virtual machine
to see disk and network I/O.
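
As a hypothetical worked example of this calculation, a Primary VM averaging 10MB/s of disk reads and 40Mbps of incoming
network traffic would require roughly the following FT logging bandwidth:

   FT logging bandwidth ~= (10 x 8 + 40) x 1.2 = 144Mbps

Four such Primary VMs on one host would therefore consume roughly 576Mbps of logging bandwidth, well over half the capacity of a
single 1GbE FT logging NIC, which illustrates why the supportable number of FT virtual machines per host is bound by saturation of
the logging network.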


When running multiple VMware FT virtual machines on the same ESX host, mix Primary and Secondary VMs together. The bulk of
the VMware FT logging traffic flows from the Primary VM to the Secondary VM. Much less traffic flows from the Secondary VM to
the Primary VM. Therefore, the bandwidth of the VMware FT logging NICs will be better utilized if each host has a mix of Primary
and Secondary VMs, rather than all Primary VMs or all Secondary VMs. Also, the Secondary VM does not perform any I/O to the
virtual machine network and disk. So, the utilization of the virtual machine network and disk will also be more balanced if a host has
a mix of Primary and Secondary VMs.

Timekeeping Recommendations
In order to avoid time-mismatch issues in a virtual machine after a VMware FT failover, perform the following steps:
1. Synchronize the guest operating system time with a time source, which will depend on whether the guest is Windows or
Linux.
2. Synchronize the time of each ESX server host with a network time protocol (NTP) server.


Windows guest operating system time synch
For Windows Server 2003 guest operating systems, synchronize time with the appropriate domain controllers within their Microsoft
Active Directory (AD) domain. In turn, each domain controller should sync its clock with the primary domain controller emulator
(PDC Emulator) of the domain. All PDC Emulators should be time synchronized with the PDC Emulator of the root forest domain.
Finally the PDC Emulator of the root forest domain should be time synchronized with a stratum 1 time source such as an NTP time
server or a hardware atomic clock. If AD is not being used in your environment, synchronize time directly with the NTP time server
or another reliable external time source. Please refer to your Windows documentation for details.
Linux guest operating system time synchronization

For Linux guest operating systems, synchronize time with an NTP server by performing the following steps (a generic example follows):
1. Open the VMware Tools Properties dialog box from within the guest. Under Miscellaneous Options, make sure the "Time
synchronization between the virtual machine and the ESX Server" option is not checked.
2. Synchronize time with an NTP time server. Please refer to Installing and Configuring Linux Guest Operating Systems for
configuration details: http://www.vmware.com/resources/techresources/1076
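
A minimal sketch for an ntpd-based Linux guest follows. The NTP server name is a placeholder, and package names and service
commands vary by distribution; consult your distribution's documentation.

   # /etc/ntp.conf - point the guest at a reliable time source
   server ntp.example.com iburst

   # Restart the NTP daemon and verify synchronization
   service ntpd restart
   ntpq -p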

If your guest operating system is very time-sensitive, then synchronize the guest operating system directly with the NTP server. The
method to do this varies depending on the guest operating system. Please consult your guest operating system documentation for
details.

VMware FT Application Recommendations
Here are a few example recommendations for protecting applications with FT.

Example 1: High availability for a multi-tiered SAP application
SAP NetWeaver 7.0 is a service-oriented application and integration platform that serves as the foundation for all other SAP
applications. Within this multi-tiered SAP NetWeaver 7.0 application, the ABAP SAP Central Services (ASCS) instance is a single point
of failure. (ABAP stands for Advanced Business Application Programming.) ASCS is a group of two servers: the Message Server and
the Enqueue Server.
The Message Server handles all communications in the SAP system. Message Server failures cause internal communications
between SAP dispatchers to fail. Other problems include failures in user logon and in batch job scheduling.
The Enqueue Server manages the logical locks for SAP documents and objects during transactions. Enqueue Server failures result in
automatic rollbacks of all transactions holding locks, and SAP updates that are requesting locks will be aborted.
Since the ASCS is a single point of failure, it requires a high availability solution. For moderate use cases of client connections, a
single vCPU virtual machine running ASCS will suffice. Running these services on a single vCPU virtual machine on another host will
allow it to be protected with VMware FT



ESX #1: Virtual machine with two vCPUs running the database and SAP Central Instance (minus the Message and Enqueue Servers).
Note: This host is also running an SAP-specific load driver benchmark called the Sales and Distribution (SD) Benchmark. This
benchmark was used to validate continuous transaction execution with VMware FT during host failover.
ESX #2: Virtual machine with one vCPU running ASCS (i.e., the Message and Enqueue Servers). This virtual machine has VMware FT
turned on and acts as the Primary VM.
ESX #3: Virtual machine with one vCPU acting as the Secondary VM for the ASCS.
Upon failure of either ESX #2 or #3, VMware FT allows the virtual machine on the other host to immediately take over execution.

Thus, the ASCS services will not lose any data and will not experience any interruption in service. This can be tested by manually
checking lock integrity via SAP transaction SM12, the SAP lock management transaction. If ESX #1 fails, the database (protected via
VMware HA) will temporarily go down but will not force a client disconnection for users logged onto separate dialog instance virtual
machines (not shown above). The client will only experience a pause until the database comes back online either when the host is
rebooted or when the database virtual machine is rebooted on another host through VMware HA.
Example 2: High availability for the Blackberry Enterprise Server
The Blackberry Enterprise Server (BES) 4.1.6 for Microsoft Exchange enables push-based access in delivering Exchange email,
calendar, contacts, scheduling, instant messaging, and other Web services to Blackberry devices. Running BES in a single vCPU virtual
machine can support up to 200 users that receive an average of 100-200 email messages per day. Unless there is a failover
mechanism in place, the loss of BES due to hardware failure will result in the disruption of Blackberry users' ability to sync with
Exchange. VMware FT can be turned on for the BES virtual machine as shown in Figure 7 to provide continuous availability that can
survive ESX host failures.



ESX #1: Virtual machine with two vCPUs running the database and Microsoft Exchange server.
ESX #2: Virtual machine with one vCPU running BES 4.1.6. This virtual machine has VMware FT turned on and acts as the Primary
VM.
ESX #3: Virtual machine with one vCPU acting as the Secondary VM for BES 4.1.6.
A failure of either ESX #2 or #3 results in no loss of email delivery to the Blackberry device. VMware FT ensures that the BES
workload is uninterrupted. Currently there are a number of different methods to protect BES from failure, ranging from simple
backup plans to having offline stand-by servers prepared. However, VMware FT is the only software solution to offer uninterrupted
protection for BES service while remaining cost-effective and user-friendly.
Summary of Performance Recommendations

For each virtual machine there are two VMware FT-related actions that can be taken: turning FT on/off and
enabling/disabling FT.
Turning on FT prepares the virtual machine for VMware FT by prompting for the removal of unsupported devices, disabling
unsupported features, and setting the virtual machine's memory reservation to be equal to its memory size (thus avoiding
ballooning or swapping).
Enabling FT performs the actual creation of the Secondary VM by live-migrating the Primary VM.
Note: Turning on VMware FT for a powered-on virtual machine will also automatically Enable FT for that virtual machine.
Each of these operations has performance implications.

Do not turn on VMware FT for a virtual machine unless you will be using (i.e., Enabling) VMware FT for that machine.
Turning on VMware FT automatically disables, for that specific virtual machine, some features that can help performance,
such as hardware virtual MMU (if the processor supports it).
Enabling VMware FT for a virtual machine uses additional resources (for example, the Secondary VM uses as much CPU and
memory as the Primary VM). Therefore make sure you are prepared to devote the resources required before enabling
VMware FT.
The live migration that takes place when VMware FT is enabled can briefly saturate the VMotion network link and can also
cause spikes in CPU utilization.
If the VMotion network link is also being used for other operations, such as VMware FT logging, the performance of those
other operations can be impacted. For this reason, it is best to have separate and dedicated NICs for FT logging traffic and
also for VMotion, especially when multiple VMware FT virtual machines reside on the same host.
Because this potentially resource-intensive live migration takes place each time FT is enabled, it is recommended that
VMware FT not be frequently enabled and disabled.
Because VMware FT logging traffic is asymmetric (the majority of the traffic flows from Primary to Secondary VM),
congestion on the logging NIC can be avoided by distributing primaries onto multiple hosts. For example, on a cluster with
two ESX hosts and two virtual machines with VMware FT enabled, placing one of the Primary VMs on each of the hosts
allows the network bandwidth to be utilized bi-directionally.
VMware FT virtual machines that receive large amounts of network traffic or perform lots of disk reads can create
significant bandwidth on the VMware FT logging NIC. This is true of machines that routinely do these things as well as
machines doing them only intermittently, such as during a backup operation. To avoid saturating the network link used for
logging traffic, limit the number of VMware FT virtual machines on each host or limit disk read bandwidth and network
receive bandwidth of those virtual machines.
Make sure the VMware FT logging traffic is carried by at least a 1 GbE-rated NIC (which should in turn be connected to at
least 1 GbE-rated infrastructure).
Avoid placing more than four VMware FT-enabled virtual machines on a single host. In addition to reducing the possibility
of saturating the network link used for logging traffic, this also limits the number of live migrations needed to create new
Secondary VMs in the event of a host failure.
If the Secondary VM lags too far behind the Primary VM (which can happen when the Primary VM is CPU bound and the
Secondary VM is not getting enough CPU cycles), the hypervisor may slow down execution on the Primary VM to allow the
Secondary VM to catch up. This can be avoided by making sure the hosts on which the Primary and Secondary VMs run are
relatively closely matched with similar CPU make, model, and frequency. It is recommended to disable certain power
management settings that do not allow for adjustments based on workload. As another alternative, enabling CPU
reservations for the Primary VM (which will be duplicated for the Secondary VM) will help ensure that the Secondary VM
gets CPU cycles when it requires them.
Though timer interrupt rates do not significantly affect VMware FT performance, high timer interrupt rates create
additional network traffic on the FT logging NIC. Therefore, if possible, reduce timer interrupt rates as described in the
Guest Operating System CPU Considerations section of Performance Best Practices for VMware vSphere 4.


Fault Tolerance Host Networking Configuration Example
This example describes the host network configuration for Fault Tolerance in a typical deployment with four 1Gb NICs. This is one
possible deployment that ensures adequate service to each of the traffic types identified in the example and could be considered a
best practice configuration.
Fault Tolerance provides full uptime during the course of a physical host failure due to power outage, system panic, or similar
reasons. Network or storage path failures or any other physical server components that do not impact the host running state may
not initiate a Fault Tolerance failover to the Secondary VM. Therefore, customers are strongly encouraged to use appropriate
redundancy (for example, NIC teaming) to reduce that chance of losing virtual machine connectivity to infrastructure components
like networks or storage arrays.
NIC Teaming policies are configured on the vSwitch (vSS) Port Groups (or Distributed Virtual Port Groups for vDS) and govern how
the vSwitch will handle and distribute traffic over the physical NICs (vmnics) from virtual machines and vmkernel ports. A unique
Port Group is typically used for each traffic type with each traffic type typically assigned to a different VLAN.

Host Networking Configuration Guidelines


The following guidelines allow you to configure your host's networking to support Fault Tolerance with different combinations of
traffic types (for example, NFS) and numbers of physical NICs.

Distribute each NIC team over two physical switches, ensuring L2 domain continuity for each VLAN between the two
physical switches.
Use deterministic teaming policies to ensure particular traffic types have an affinity to a particular NIC (active/standby) or
set of NICs (for example, originating virtual port-id).
Where active/standby policies are used, pair traffic types to minimize impact in a failover situation where both traffic types
will share a vmnic.
Where active/standby policies are used, configure all the active adapters for a particular traffic type (for example, FT
Logging) to the same physical switch. This minimizes the number of network hops and lessens the possibility of
oversubscribing the switch to switch links.

Configuration Example with Four 1Gb NICs


Figure 3-2 depicts the network configuration for a single ESXi host with four 1Gb NICs supporting Fault Tolerance. Other hosts in the
FT cluster would be configured similarly.
This example uses four port groups configured as follows:

VLAN A: Virtual Machine Network Port Group - active on vmnic2 (to physical switch #1); standby on vmnic0 (to physical switch #2).
VLAN B: Management Network Port Group - active on vmnic0 (to physical switch #2); standby on vmnic2 (to physical switch #1).
VLAN C: vMotion Port Group - active on vmnic1 (to physical switch #2); standby on vmnic3 (to physical switch #1).
VLAN D: FT Logging Port Group - active on vmnic3 (to physical switch #1); standby on vmnic1 (to physical switch #2).

vMotion and FT Logging can share the same VLAN (configure the same VLAN number in both port groups), but require their own
unique IP addresses residing in different IP subnets. However, separate VLANs might be preferred if Quality of Service (QoS)
restrictions are in effect on the physical network with VLAN based QoS. QoS is of particular use where competing traffic comes into
play, for example, where multiple physical switch hops are used or when a failover occurs and multiple traffic types compete for
network resources.

3. Networking Best Practices



Extract from Page 88 - http://pubs.vmware.com/vsphere-50/topic/com.vmware.ICbase/PDF/vsphere-esxi-vcenter-server-50-networking-guide.pdf

Separate network services from one another to achieve greater security and better performance.
Put a set of virtual machines on a separate physical NIC. This separation allows for a portion of the total networking
workload to be shared evenly across multiple CPUs. The isolated virtual machines can then better serve traffic from a Web
client, for example:
Keep the vMotion connection on a separate network devoted to vMotion. When migration with vMotion occurs, the
contents of the guest operating system's memory are transmitted over the network. You can do this either by using VLANs to
segment a single physical network or by using separate physical networks (the latter is preferable).
When using pass-through devices with a Linux kernel version 2.6.20 or earlier, avoid MSI and MSI-X modes because these
modes have significant performance impact.
To physically separate network services and to dedicate a particular set of NICs to a specific network service, create a
vSphere standard switch or vSphere distributed switch for each service. If this is not possible, separate network services on
a single switch by attaching them to port groups with different VLAN IDs. In either case, confirm with your network
administrator that the networks or VLANs you choose are isolated in the rest of your environment and that no routers
connect them.
You can add and remove network adapters from a standard or distributed switch without affecting the virtual machines or
the network service that is running behind that switch. If you remove all the running hardware, the virtual machines can
still communicate among themselves. If you leave one network adapter intact, all the virtual machines can still connect with
the physical network.
To protect your most sensitive virtual machines, deploy firewalls in virtual machines that route between virtual networks
with uplinks to physical networks and pure virtual networks with no uplinks.
For best performance, use vmxnet3 virtual NICs.
Every physical network adapter connected to the same vSphere standard switch or vSphere distributed switch should also
be connected to the same physical network.
Configure all VMkernel network adapters to the same MTU. When several VMkernel network adapters are connected to
vSphere distributed switches but have different MTUs configured, you might experience network connectivity problems
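
To verify and align VMkernel adapter MTUs, the following esxcli commands can be run on each ESXi 5.x host. The vmk1 adapter
and the 9000-byte MTU are example values; confirm that the physical switches and the distributed switch are configured for the
same MTU end to end.

   # List VMkernel adapters and their current MTU values
   esxcli network ip interface list
   # Set vmk1 to a 9000-byte MTU
   esxcli network ip interface set --interface-name=vmk1 --mtu=9000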


4. VMware vSphere High Availability 5.0 Deployment best Practices

vSphere makes it possible to reduce both planned and unplanned downtime. With the revolutionary VMware vSphere vMotion
capabilities in vSphere, it is possible to perform planned maintenance with zero application downtime.
VMware vSphere High Availability (HA) specifically reduces unplanned downtime by leveraging multiple VMware vSphere ESXi hosts
configured as a cluster, to provide rapid recovery from outages and cost-effective high availability for applications running in virtual
machines.
vSphere HA provides for application availability in the following ways:

It reacts to hardware failure and network disruptions by restarting virtual machines on active hosts within the cluster.
It detects operating system (OS) failures by continuously monitoring a virtual machine and restarting it as required.
It provides a mechanism to react to application failures.
It provides the infrastructure to protect all workloads within the cluster, in contrast to other clustering solutions.

Users can combine HA with VMware vSphere Distributed Resource Scheduler (DRS) to protect against failures and to provide
load balancing across the hosts within a cluster.

Design Principles for High Availability


Host Selection
Overall vSphere availability starts with proper host selection. This includes items such as redundant power supplies, error-correcting
memory, remote monitoring and notification and so on. Consideration should also be given to removing single points of failure in
host location. This includes distributing hosts across multiple racks or blade chassis to ensure that rack or chassis failure cannot
impact an entire cluster.
When deploying a vSphere HA cluster, it is a best practice to build the cluster out of identical server hardware. The use of identical
hardware provides a number of key advantages, such as the following ones:

Simplifies configuration and management of the servers using Host Profiles


Increases ability to handle server failures and reduces resource fragmentation. The use of drastically different hardware
leads to an unbalanced cluster, as described in the Admission Control section. By default, vSphere HA prepares for the
worst-case scenario in which the largest host in the cluster fails. To handle the worst case, more resources across all hosts
must be reserved, making them essentially unusable.


Additionally, care should be taken to remove any inconsistencies that would prevent a virtual machine from being started on any
cluster host. Inconsistencies such as the mounting of datastores to a subset of the cluster hosts or the implementation of vSphere
DRS-required virtual machine-to-host affinity rules are scenarios to consider carefully. The avoidance of these conditions will
increase the portability of the virtual machine and provide a higher level of availability.

The overall size of a cluster is another important factor to consider. Smaller-sized clusters require a larger relative percentage of the
available cluster resources to be set aside as reserve capacity to handle failures adequately.
For example, to ensure that a cluster of three nodes can tolerate a single host failure, about 33 percent of the cluster resources are
reserved for failover. A 10-node cluster requires that only 10 percent be reserved.
In contrast, as cluster size increases, so does the management complexity of the cluster. However, this increase in management
complexity is overshadowed by the benefits a large cluster can provide.

Host Versioning
An ideal configuration is one in which all the hosts contained within the cluster use the latest version of ESXi. When adding a host to
vSphere 5.0 clusters, it is always a best practice to upgrade the host to ESXi 5.0 and to avoid using clusters with mixed-host versions.
Mixed clusters are supported but not recommended, because there are some differences in vSphere HA performance between host
versions and these differences can introduce operational variances in a cluster.
These differences arise from the fact that earlier host versions do not offer the same capabilities as later versions. For example,
VMware ESX 3.5 hosts do not support certain properties present within ESX 4.0 and greater. These properties were added to ESX
4.0 to inform vSphere HA of conditions warranting a restart of a virtual machine. As a result, HA will not restart virtual machines that
crash while running on ESX 3.5 hosts but will restart such a virtual machine if it was running on an ESX 4.0 or later host.
The following apply if using a vSphere HA-enabled cluster that includes hosts with differing versions:

Users should be aware of the general limitations of using a mixed cluster, as previously mentioned.
Users should also know that ESX/ESXi 3.5 hosts within a 5.0 cluster must include a patch to address an issue involving file locks.
For ESX 3.5 hosts, users must apply the ESX350-201012401-SG patch. For ESXi 3.5, they must apply the ESXe350-201012401-I-BG patch.
Prerequisite patches must be applied before applying these patches. HA will not enable an ESX/ESXi
3.5 host to be added to the cluster if it does not meet the patch requirements.
Users should avoid deploying mixed clusters if VMware vSphere Storage vMotion or VMware vSphere Storage DRS is
required. The vSphere 5.0 Availability Guide has more information on this topic

VMware vCenter Server Availability Considerations


VMware vCenter Server is the management focal point for any vSphere environment. Although vSphere HA will continue to protect
any environment without vCenter Server, the ability to manage the environment is severely impacted without it.
It is highly recommended that users protect their vCenter Server instance as well as possible. The following methods can help to
accomplish this:

Use of VMware vCenter Server Heartbeat, a specially designed high-availability solution for vCenter Server
Use of vSphere HA, useful in environments in which the vCenter Server instance is virtualized, such as when using the
VMware vCenter Server Appliance


It is extremely critical when using ESXi Auto Deploy that both the Auto Deploy service and the vCenter Server instance used are
highly available. In the event of a loss of the vCenter Server instance, Auto Deploy hosts might not be able to reboot successfully in
certain situations. However, it bears repeating here that if vSphere HA is used to make vCenter Server highly available, the vCenter
Server virtual machine must be configured with a restart priority of high.
Additionally, this virtual machine should be configured to run on two or more hosts that are not managed by Auto Deploy. This can
be done by using a DRS virtual machine-to-host "must run on" rule or by deploying the virtual machine on a datastore accessible to
only these hosts. Because Auto Deploy depends upon the availability of vCenter Server in certain circumstances, this ensures that
the vCenter Server virtual machine is able to come online. This does not require that vSphere DRS be enabled if users employ DRS
rules, because these rules will remain in effect after DRS has been disabled.
Networking Design Considerations
General Networking Guidelines

If the physical network switches that connect the servers support PortFast (or an equivalent setting), this should be
enabled. If this feature is not enabled, it can take a while for a host to regain network connectivity after booting due to the
execution of lengthy spanning tree algorithms. While this execution is occurring, virtual machines cannot run on the host
and HA will report the host as isolated or dead. Isolation will be reported if the host and an FDM master can access the
host's heartbeat datastores.

Host monitoring should be disabled when performing any network maintenance that might disable all heartbeat paths
(including storage heartbeats) between the hosts within the cluster, because this might trigger an isolation response.

With vSphere HA 5.0, all dependencies on DNS have been removed.


Users should employ consistent port group names and network labels on VLANs for public networks
If users employ inconsistent names for the original server and the failover server, virtual machines are disconnected from
their networks after failover. Network labels are used by virtual machines to reestablish network connectivity upon restart.
Use of a documented naming scheme is highly recommended. Issues with port naming can be completely mitigated by use
of a VMware vSphere Distributed Switch.
Configure the management networks so that the vSphere HA agent on a host in the cluster can reach the agents on any of
the other hosts using one of the management networks. Without such a configuration, a network partition condition can
occur after a master host is elected.
Configure the fewest possible number of hardware segments between the servers in a cluster. This limits single points of
failure. Additionally, routes with too many hops can cause networking packet delays for heartbeats and increase the
possible points of failure.
In environments where both IPv4 and IPv6 protocols are used, the user should configure the distributed switches on all
hosts to enable access to both networks. This prevents network partition issues due to the loss of a single IP networking
stack or host failure.
Ensure that TCP/UDP port 8182 is open on all network switches and firewalls that are used by the hosts for interhost
communication. vSphere HA will open these ports automatically when enabled and close them when disabled. User action
is required only if there are firewalls in place between hosts within the cluster, as in a stretched cluster configuration.
Configure redundant management networking from ESXi hosts to network switching hardware if possible along with
heartbeat datastores. Using network adaptor teaming will enhance overall network availability.

Configuration of hosts with management networks on different subnets as part of the same cluster is supported. One or
more isolation addresses for each subnet should be configured accordingly. Refer to the Host Isolation section for more
details.
The management network supports the use of jumbo frames as long as the MTU values and physical network switch
configurations are set correctly. Ensure that the network supports jumbo frames end to end.


Setting Up Redundancy for vSphere HA Networking
Networking redundancy between cluster hosts is absolutely critical for vSphere HA reliability. Redundant management networking
enables the reliable detection of failures.
NOTE: Because this document is primarily focused on vSphere 5.0, its use of the term management network refers to the
VMkernel network selected for use as a management network. Refer to the vSphere Availability Guide for information regarding the
service console network when using VMware ESX 4.1, ESX 4.0, or ESX 3.5x.
Network Adaptor Teaming and Management Networks
Using a team of two network adaptors connected to separate physical switches can improve the reliability of the management
network. The cluster is more resilient to failures because the hosts are connected to each other through two network adaptors and
through two separate switches and thus they have two independent paths for cluster communication.
To configure a network adaptor team for the management network, it is recommended to configure the vNICs in the distributed
switch configuration for the ESXi host in an active/standby configuration. This is illustrated in the following example:
Requirements:
• Two physical network adaptors
• VLAN trunking
• Two physical switches

The distributed switch should be configured as follows:
• Load balancing set to route based on the originating virtual port ID (default)
• Failback set to No
• vSwitch0: Two physical network adaptors (for example, vmnic0 and vmnic2)
• Two port groups (for example, vMotion and management)

In this example, the management network runs on vSwitch0 as active on vmnic0 and as standby on vmnic2. The vMotion network runs on vSwitch0 as active on vmnic2 and as standby on vmnic0.
It is recommended to use NIC ports from different physical NICs and it is preferable that the NICs are different makes and models.
Failback is set to no because in the case of physical switch failure and restart, ESXi might falsely determine that the switch is back
online when its ports first come online. However, the switch itself might not be forwarding any packets until it is fully online.
Therefore, when failback is set to no and an issue arises, both the management network and vMotion network will be running on
the same network adaptor and will continue running until the user manually intervenes.
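The active/standby layout described above can be summarized in a short sketch. This is a purely illustrative Python model, not a vSphere API call; the port group and vmnic names simply follow the example in the text.

# Illustrative model of the active/standby teaming described above.
# Not a vSphere API call; port group and vmnic names follow the example text.
vswitch0 = {
    "load_balancing": "route based on originating virtual port ID",  # default
    "failback": False,  # a recovering switch might not yet be forwarding packets
    "port_groups": {
        "Management Network": {"active": ["vmnic0"], "standby": ["vmnic2"]},
        "vMotion":            {"active": ["vmnic2"], "standby": ["vmnic0"]},
    },
}

for name, team in vswitch0["port_groups"].items():
    print(f"{name}: active={team['active']}, standby={team['standby']}")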
Management Network Changes in a vSphere HA Cluster
vSphere HA uses the management network as its primary communication path. As a result, it is critical that proper precautions are
taken whenever a maintenance action will affect the management network.
As a general rule, whenever maintenance is to be performed on the management network, the host-monitoring functionality of
vSphere HA should be disabled. This will prevent HA from determining that the maintenance action is a failure and from
consequently triggering the isolation responses.
If there are changes involving the management network, it is advisable to reconfigure HA on all hosts in the cluster after the
maintenance action is completed. This ensures that any pertinent changes are recognized by HA. Changes that cause a loss of
management network connectivity are grounds for performing a reconfiguration of HA. An example of this is the addition or deletion
of networks used for management network traffic when the host is not in maintenance mode.
Storage Design Considerations
Best practices for storage design reduce the likelihood of hosts losing connectivity to the storage used by the virtual machines and that used by vSphere HA for heartbeating. To maintain a constant connection between an ESXi host and its storage, ESXi supports
multipathing, a technique that enables users to employ more than one physical path to transfer data between the host and an
external storage device.
In case of a failure of any element in the SAN, such as an adapter, switch or cable, ESXi can move to another physical path that does
not use the failed component.
In addition to path failover, multipathing provides load balancing, which is the process of distributing I/O loads across multiple
physical paths. Load balancing reduces or removes potential bottlenecks.
Storage Heartbeats
A new feature of vSphere HA in vSphere 5.0 makes it possible to use storage subsystems as a means of communication between the
hosts of a cluster. Storage heartbeats are used when the management network is unavailable to enable a slave HA agent to
communicate with a master HA agent.
The feature also makes it possible to distinguish accurately between the different failure scenarios of dead, isolated or partitioned
hosts.
• Storage heartbeats enable detection of cluster partition scenarios that are not supported with previous versions of vSphere. This results in a more coordinated failover when host isolation occurs.
• By default, vCenter Server automatically selects two datastores to use for storage heartbeats. It is intended to select datastores that are connected to the highest number of hosts. The algorithm is designed to select datastores that are backed by different LUNs or NFS servers. A preference is given to VMware vSphere VMFS-formatted datastores over NFS-hosted datastores.
• vCenter Server selects the heartbeat datastores when HA is enabled, when a datastore is added to or removed from a host, and when the accessibility of a datastore changes. Users can, however, configure vSphere HA to give preference to a subset of the datastores mounted by the hosts in the cluster. Alternately, they can require that HA choose only from a subset of these.
• VMware recommends that users employ the default setting unless there are datastores in the cluster that are more highly available than others. If there are some more highly available datastores, VMware recommends that users configure vSphere HA to give preference to these.
• VMware does not recommend restricting vSphere HA to using only a subset of the datastores, because this setting restricts the system's ability to respond when a host loses connectivity to one of its configured heartbeat datastores.
NOTE: vSphere HA datastore heartbeating is very lightweight and will not impact in any way the use of the datastores by virtual machines.
• Although users can increase the number of heartbeat datastores chosen for each host to four, increasing the number does not make the cluster significantly more tolerant of failures. (See the vSphere Metro Storage Cluster white paper for details about heartbeat datastore recommendations specific to stretched clusters.)
• Environments that provide only network-based storage must work optimally with the network architecture to realize fully the potential of the storage heartbeat feature. If the storage network traffic and the management network traffic flow through the same network components, disruptions in network service might disrupt both. It is recommended that these networks be separated as much as possible or that datastores with a different failure domain be used for heartbeating.
• In cases where converged networking is used, VMware recommends that users leave heartbeating enabled. This is because even with converged networking, failures can occur that disrupt only the management network traffic. For example, the VLAN tags for the management network might be incorrectly changed without impacting those used for storage traffic.
• It is also recommended that all hosts within a cluster have access to the same datastores. This promotes virtual machine portability because the virtual machines can then run on any of the hosts within the cluster. Such a configuration is also beneficial because it maximizes the chance that an isolated or partitioned host can communicate with a master during a network partition or isolation event.
• If network partitions or isolations are anticipated within the environment, users should ensure that a minimum of two shared datastores is provisioned to all hosts in the cluster.
Cluster Configuration Considerations
Host Isolation
One key mechanism within vSphere HA is the ability for a host to detect when it has become network-isolated from the rest of the
cluster. With this information, vSphere is able to take administrator-specified action with respect to running virtual machines on the
host that has been isolated.
Depending on network layout and specific business needs, the administrator might wish to tune the vSphere HA response to an
isolated host to favor rapid failover or to leave the virtual machine running so clients can continue to access it. The following section
explains how a vSphere HA node detects when it has been isolated from the rest of the cluster, and the response options available
to that node after that determination has been made.
Host Isolation Detection
Host isolation detection happens at the individual host level. Isolation fundamentally means a host is no longer able to communicate
over the management network. To determine if it is network-isolated, the host attempts to ping its configured isolation addresses.
The isolation address used should always be reachable by the host under normal situations, because after five seconds have elapsed
with no response from the isolation addresses, the host then declares itself isolated.
The default isolation address is the gateway specified for the management network. Advanced settings can be used to modify the
isolation addresses used for your particular environment. The option das.isolationaddress[X] (where X is 0-9) is used to configure
multiple isolation addresses. Additionally, das.usedefaultisolationaddress is used to indicate whether the default isolation address
(the default gateway) should be used to determine if the host is network-isolated. If the default gateway is not able to receive ICMP
ping packets, you must set this option to false.
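As an illustration of how these options fit together, the following is a minimal sketch; the das.* keys are the options named above, while the gateway addresses and the dictionary structure are assumptions made purely for the example.

# Minimal sketch: HA advanced options for isolation detection.
# The das.* keys are the options discussed above; the IP addresses are
# hypothetical gateways, one reachable address per management subnet.
ha_isolation_options = {
    "das.isolationaddress0": "10.10.1.1",        # example gateway on subnet A
    "das.isolationaddress1": "10.10.2.1",        # example gateway on subnet B
    "das.usedefaultisolationaddress": "false",   # default gateway does not answer ICMP
}

for key, value in sorted(ha_isolation_options.items()):
    print(f"{key} = {value}")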
Host Isolation Response
Tuning the host isolation response is typically based on whether loss of connectivity to a host via the management network would
typically also indicate that clients accessing the virtual machine would also be affected. In this case it is likely that administrators
would want the virtual machines shut down so other hosts with operational networks can start them up. If failures of the
management network are not likely correlated with failures of the virtual machine network, where the loss of the management
network simply results in the inability to manage the virtual machines on the isolated host, it is often preferable to leave the virtual
machines running while the management network connectivity is restored.
The Host Isolation Response setting provides a means to set the action preferred for the powered-on virtual machines maintained by a host when that host has declared itself isolated. There are three possible isolation response values that can be configured and applied to a cluster or individually to a specific virtual machine:
• Leave Powered On
• Power Off
• Shut Down
Leave Powered On
With this option, virtual machines hosted on an isolated host are left powered on. In situations where a host loses all management
network access, a virtual machine might still have the ability to access the storage subsystem and the virtual machine network. By
selecting this option, the user enables the virtual machine to continue to function if this were to occur. This is the default isolation
response setting in vSphere HA 5.0.
Power Off
When this isolation response option is used, the virtual machines on the isolated host are immediately stopped. This is similar to
removing the power from a physical host. This can induce inconsistency with the file system of the OS used in the virtual machine.
The advantage of this action is that vSphere HA will attempt to restart the virtual machine more quickly than when using the Shut
Down option.
Shut Down
Through the use of the VMware Tools package installed within the guest OS of a virtual machine, this option attempts to shut down the guest OS gracefully before powering off the virtual machine. This is more desirable than using the Power Off option because it provides the OS with time to commit any outstanding I/O activity to disk.
HA will wait for a default time period of 300 seconds (five minutes) for this graceful shutdown to occur. If the OS is not gracefully
shut down by this time, it will initiate a power off of the virtual machine.
Changing the das.isolationshutdowntimeout attribute will modify this timeout if it is determined that more time is required to shut
down an OS gracefully. The Shut Down option requires that the VMware Tools package be installed in the guest OS. Otherwise, it is
equivalent to the Power Off setting.
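For reference, a one-line sketch of the advanced option involved; the 600-second value is only an example.

# Illustrative: allow ten minutes for graceful shutdown instead of the default five.
ha_options = {"das.isolationshutdowntimeout": 600}  # value in seconds, example only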
In environments that use only network-based storage protocols, such as iSCSI and NFS, and those that share physical network
components between the management and storage traffic, the recommended isolation response is Power Off. With these
environments, it is likely that a network outage causing a host to become isolated will also affect the host's ability to communicate
to the datastores. This situation might be problematic if both instances of the virtual machine retain access to the virtual machine
network. The Power Off isolation response recommendation reduces the impact of this issue by having the isolated HA agent power
off the virtual machines on the isolated host.
The following table lists the recommended isolation policy for converged network configurations:
Host Monitoring
The host monitoring setting determines whether vSphere HA restarts virtual machines on other hosts in the cluster after a host
isolation, a host failure or after they should crash for some other reason. This setting does not impact the VM/application
monitoring feature. If host monitoring is disabled, isolated hosts won't apply the configured isolation response, and vSphere HA
won't restart virtual machines that fail for any reason. Disabling host monitoring also impacts VMware vSphere Fault Tolerance (FT)
because it controls whether HA will restart an FT secondary virtual machine after a failure event.
Cluster Partitions
A cluster partition is a situation where a subset of hosts within the cluster loses the ability to communicate with the rest of the hosts
in the cluster but can still communicate with each other.
This can occur for various reasons, but the most common cause is the use of a stretched cluster configuration. A stretched cluster is
defined as a cluster that spans multiple sites within a metropolitan area.
When a cluster partition occurs, one subset of hosts is still able to communicate to a master node. The other subset of hosts cannot.
For this reason, the second subset will go through an election process and elect a new master node. Therefore, it is possible to have
multiple master nodes in a cluster partition scenario, with one per partition. This situation will last only as long as the partition
exists. After the network issue causing the partition is resolved, the master nodes will be able to communicate and discover multiple
master roles. Anytime multiple master nodes exist and can communicate with each other over the management network, all but one
will abdicate. Robust management network architecture helps to avoid cluster partition situations.
Additionally, if a network partition occurs, users should ensure that each host retains access to its heartbeat datastores, and that the
masters are able to access the heartbeat datastores used by the slave hosts.
vSphere Metro Storage Cluster Considerations
VMware vSphere Metro Storage Clusters (vMSC), or stretched clusters as they are often called, are environments that span multiple
sites within a metropolitan area (typically up to 100km). Storage systems in these environments typically enable a seamless failover
between sites. Because this is a complex environment, a paper specific to the vMSC has been produced. Download it here:
http://www.vmware.com/resources/techresources/10299
Auto Deploy Considerations
Auto Deploy utilizes a PXE boot infrastructure to provision a host automatically. No host-state information is stored on the host
itself.
The best practices recommendation from VMware staff for environments using Auto Deploy is as follows:
• Deploy vCenter Server Heartbeat. vCenter Server Heartbeat delivers high availability for vCenter Server, protecting the virtual and cloud infrastructure from application-, configuration-, OS- or hardware-related outages. (EOA)
• Avoid using Auto Deploy in stretched cluster environments, because this complicates the environment.
• Deploy vCenter Server in a virtual machine. Run the vCenter Server virtual machine in a vSphere HA-enabled cluster and configure the virtual machine with a vSphere HA restart priority of high. Perform one of the following actions:
  o Include two or more hosts in the cluster that are not managed by Auto Deploy and pin the vCenter Server virtual machine to these hosts by using a rule (vSphere DRS required virtual machine-to-host rule). Users can set up the rule and then disable DRS if they do not wish to use DRS in the cluster.
  o Deploy vCenter Server and Auto Deploy in a separate management environment, that is, on hosts managed by a different vCenter Server.
Virtual Machine and Application Health Monitoring
These features enable the vSphere HA agent on a host to detect heartbeat information on a virtual machine through VMware Tools
or an agent running within the virtual machine that is monitoring the application health.
After the loss of a defined number of VMware Tools heartbeats on the virtual machine, vSphere HA will reset the virtual machine.
Virtual machine and application monitoring are not dependent on the virtual machine protection state attribute as reported by the
vSphere Client.
This attribute signifies that vSphere HA detects that the preferred state of the virtual machine is to be powered on. For this reason,
HA will attempt to restart the virtual machine assuming that there is nothing restricting the restart. Conditions that might restrict
this action include insufficient resources available and a disabled virtual machine restart priority. This functionality is not available
when the vSphere HA agent on a host is in the uninitialized state, as would occur immediately after the vSphere HA agent has been
installed on the host or when the host is not available. Additionally, the number of missed heartbeats is reset after the vSphere HA agent on the host restarts (which should occur rarely, if at all) or after vSphere HA is reconfigured on the host.
Because virtual machines exist only for the purposes of hosting an application, it is highly recommended that virtual machine health
monitoring be enabled. All virtual machines must have the VMware Tools package installed within the guest OS. NOTE: Guest OS
sleep states are not currently supported by virtual machine monitoring and can trigger an unnecessary restart of the virtual
machine.
vSphere HA and vSphere FT
Often vSphere HA is used in conjunction with vSphere FT to provide protection for extremely critical virtual machines where any loss of service is intolerable.
vSphere HA detects the use of FT to ensure proper operation. This section describes some of the unique behavior specific to vSphere
FT with vSphere HA. Additional vSphere FT best practices can be found in the vSphere 5.0 Availability Guide.
Host Partitions
vSphere HA will restart a secondary virtual machine of a vSphere FT virtual machine pair when the primary virtual machine is running in the same partition as the master HA agent that is responsible for the virtual machine. If this condition is not met, the secondary virtual machine in 5.0 cannot be restarted until the partition ends.
Host Isolation
Host isolation responses are not performed on virtual machines enabled with vSphere FT. The rationale is that the primary and
secondary FT virtual machine pairs are already communicating via the FT logging network. So they either continue to function and
have network connectivity or they have lost network and they are not heartbeating over the FT logging network, in which case one
of them will then take over as a primary FT virtual machine. Because vSphere HA does not offer better protection than that, it
bypasses FT virtual machines when initiating host isolation response.
Ensure that the FT logging network that is used is implemented with redundancy to provide greater resiliency to failures for FT.
Admission Control
vCenter Server uses HA admission control to ensure that sufficient resources in the cluster are reserved for virtual machine recovery
in the event of host failure.
Admission control will prevent the following if there is encroachment on resources reserved for virtual machines restarted due to failure:
• The power-on of new virtual machines
• Changes of virtual machine memory or CPU reservations
• A vMotion instance of a virtual machine introduced into the cluster from another cluster
This mechanism is highly recommended to guarantee the availability of virtual machines. With vSphere 5.0, HA offers the following configuration options for choosing the user's admission control strategy:
Host Failures Cluster Tolerates (default):
HA ensures that a specified number of hosts can fail and that sufficient resources remain in the cluster to fail over all the virtual machines from those hosts. HA uses a concept called slots to calculate available resources and required resources for failing over virtual machines from a failed host. Under some configurations, this policy might be too conservative in its reservations. The slot size can be controlled using several advanced configuration options. In addition, an advanced option can be used to specify the default slot size value for CPU. This value is used when no CPU reservation has been specified for a virtual machine. The value was changed in vSphere 5.0 from 256MHz to 32MHz. When no memory reservation is specified for a virtual machine, the largest memory overhead for any virtual machine in the cluster will be used as the default slot size value for memory. See the vSphere Availability Guide for more information on slot-size calculation and tuning.
Percentage of Cluster Resources Reserved as failover spare capacity:
vSphere HA ensures that a specified percentage of memory and CPU resources is reserved for failover. This policy is recommended for situations where the user must host virtual machines with significantly different CPU and memory reservations in the same cluster, or where hosts have different CPU and memory capacities (vSphere 5.0 adds the ability to specify different
percentages for memory and CPU through the vSphere Client). A key difference between this policy and the Host Failures Cluster
Tolerates policy is that with this option the capacity set aside for failures can be fragmented across hosts.
Specify a Failover Host:
vSphere HA designates a specific host or hosts as a failover host(s). When a host fails, HA attempts to restart its virtual machines on
the specified failover host(s). The ability to specify more than one failover host is a new feature in vSphere HA 5.0. When a host is
designated as a failover host, HA admission control does not enable the powering on of virtual machines on that host, and DRS will
not migrate virtual machines to the failover host. It effectively becomes a hot standby.
With each of the three admission control policies there is a chance in specific scenarios that, at the time of failing over a virtual
machine, there might be insufficient contiguous capacity available on a single host to power on a given virtual machine. Although these are corner-case scenarios, this has been taken into account, and HA will request vSphere DRS, if it is enabled, to attempt to
defragment the capacity in such situations.
Further, if a host had been put into standby and vSphere DPM is enabled, it will attempt to power up a host if defragmentation is not
sufficient.
The best practices recommendation from VMware staff for admission control is as follows:
Select the Percentage of Cluster Resources Reserved policy for admission control. This policy offers the most flexibility in terms of
host and virtual machine sizing and is sufficient for most situations. When configuring this policy, the user should choose a
percentage for CPU and memory that reflects the number of host failures they wish to support.
For example, if the user wants vSphere HA to set aside capacity for two host failures and there are 10 hosts of equal capacity in the
cluster, then they should specify 20 percent (2/10). If there are not equal capacity hosts, then the user should specify a percentage
that equals the capacity of the two largest hosts as a percentage of the cluster capacity.
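The arithmetic behind this recommendation can be sketched in a few lines of Python; the host capacities below are hypothetical and only illustrate the two cases described above.

# Percentage of Cluster Resources Reserved: how much to set aside for N host failures.
# Host capacities (GHz or GB; the unit does not matter) are hypothetical examples.

def reserve_percentage(host_capacities, failures_to_tolerate):
    """Reserve enough capacity to cover the largest N hosts."""
    largest = sorted(host_capacities, reverse=True)[:failures_to_tolerate]
    return 100.0 * sum(largest) / sum(host_capacities)

equal_hosts = [10] * 10                                 # 10 equally sized hosts
print(reserve_percentage(equal_hosts, 2))               # 20.0 -> the 2/10 example above

unequal_hosts = [16, 16, 10, 10, 10, 8, 8, 8, 8, 8]
print(round(reserve_percentage(unequal_hosts, 2), 1))   # driven by the two largest hosts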
• If the Host Failures Cluster Tolerates policy is used, attempt to keep virtual machine resource reservations similar across all configured virtual machines. Host Failures Cluster Tolerates uses a notion of slot sizes to calculate the amount of capacity needed as a reserve for each virtual machine. The slot size is based on the largest reserved memory and CPU needed for any virtual machine. Mixing virtual machines of greatly different CPU and memory requirements will cause the slot size calculation to default to the largest possible virtual machine, limiting consolidation. See the vSphere 5.0 Availability Guide for more information on slot-size calculation and overriding slot-size calculation in cases where it is necessary to configure different-sized virtual machines in the same cluster.
• If the Failover Host policy is used, decide how many host failures to support, and then specify this number of hosts as failover hosts. Ensure that all cluster hosts are sized equally. If unequally sized hosts are used with the Host Failures Cluster Tolerates policy, vSphere HA will reserve excess capacity to handle failures of the largest N hosts, where N is the number of host failures specified. With the Percentage of Cluster Resources Reserved policy, unequally sized hosts will require that the user increase the percentages to reserve enough capacity for the planned number of host failures. Finally, with the Specify a Failover Host policy, users must specify failover hosts that are as large as the largest nonfailover hosts in the cluster. This ensures that there is adequate capacity in case of failures.
HA added a capability in vSphere 4.1 to balance virtual machine loading on failover, thereby reducing the issue of resource imbalance in a cluster after a failover. With this capability, there is less likelihood of vMotion instances after a failover. Also in vSphere 4.1, HA invokes vSphere DRS to create more contiguous capacity on hosts. This increases the chance for larger virtual machines to be restarted if some virtual machines cannot be restarted because of resource fragmentation. This does not guarantee enough contiguous resources to restart all the failed virtual machines. It simply means that vSphere will make the best effort to restart all virtual machines with the host resources remaining after a failure.
The admission control policy is evaluated against the current state of the cluster, not the normal state of the cluster. The normal
state means that all hosts are connected and healthy. Admission control does not take into account resources of hosts that are
disconnected or in maintenance mode. Only healthy and connected hosts (including standby hosts, if vSphere DPM is enabled) can provide resources that are reserved for tolerating host failures.
Affinity Rules
A virtual machine-host affinity rule specifies that the members of a selected virtual machine DRS group should or must run on the members of a specific host DRS group. Unlike a virtual machine-virtual machine affinity rule, which specifies affinity (or anti-affinity) between individual virtual machines, a virtual machine-host affinity rule specifies an affinity relationship between a group of virtual machines and a group of hosts. There are required rules (designated by the term "must") and preferred rules (designated by the term "should"). See the vSphere Resource Management Guide for more details on setting up virtual machine-host affinity rules.
When restarting virtual machines after a failure, HA ignores the preferential virtual machine-host rules but follows the required rules. If HA violates any preferential rule, DRS will attempt to correct it after the failover is complete by migrating virtual machines. Additionally, vSphere DRS might be required to migrate other virtual machines to make space on the preferred hosts.
If required rules are specified, vSphere HA will restart virtual machines on an ESXi host in the same host DRS group only. If no available hosts are in the host DRS group or the hosts are resource constrained, the restart will fail.
Any required rules defined when DRS is enabled are enforced even if DRS is subsequently disabled. So to remove the effect of such a rule, it must be explicitly disabled.
Limit the use of required virtual machine-host affinity rules to situations where they are necessary, because such rules can restrict HA target host selection when restarting a virtual machine after a failure.
Log Files
In the latest version of HA, the changes in the architecture enabled changes in how logging is performed. Previous versions of HA
stored the operational logging information across several distinct log files. In vSphere HA 5.0, this information is consolidated into a
single operational log file. This log file utilizes a circular log rotation mechanism, resulting in multiple files, with each file containing a
part of the overall retained log history.
To improve the ability of the VMware support staff to diagnose problems, VMware recommends configuring logging to retain approximately one week of history. The following table provides recommended log capacities for several sample cluster configurations.
The preceding recommendations are sufficient for most environments. If the user notices that the HA log history does not span one week after implementing the recommended settings in the preceding table, they should consider increasing the capacity beyond what is noted.
Increasing the log capacity for HA involves specifying the number of log rotations that are preserved and the size of each log file in the rotation. For log capacities up to 30MB, use a 1MB file size; for log capacities greater than 30MB, use a 5MB file size.
1. The default log settings are sufficient for ESXi hosts that are logging to persistent storage.
2. The default log setting is sufficient for ESXi 5.0 hosts if the following conditions are met: (i) they are not managed by Auto Deploy
and (ii) they are configured with the default log location in a scratch directory on a vSphere VMFS partition.
NOTE: The name of the vSphere HA logger is Fault Domain Manager (FDM).
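A minimal sketch of the rotation arithmetic described above, assuming that log capacity is simply file size multiplied by the number of retained rotations. The maxFileNum key is named in the recommendations that follow; the maxFileSize key is an assumption added here for illustration.

# Illustrative helper: translate a target HA log capacity into a file size and
# rotation count, following the 1MB/5MB guidance above.

def fdm_log_settings(capacity_mb):
    file_size_mb = 1 if capacity_mb <= 30 else 5
    rotations = -(-capacity_mb // file_size_mb)  # ceiling division
    return {
        "das.config.log.maxFileNum": rotations,
        "das.config.log.maxFileSize": file_size_mb * 1024 * 1024,  # bytes (assumed key)
    }

print(fdm_log_settings(20))    # 1MB files, 20 rotations
print(fdm_log_settings(100))   # 5MB files, 20 rotations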
General Logging Recommendations for All ESX Versions
• Ensure that the location where the log files will be stored has sufficient space available.
• For ESXi hosts, ensure that logging is being done to a persistent location.
• When changing the directory path, ensure that it is present on all hosts in the cluster and is mapped to a different directory for each host.
• Configure each HA cluster separately.
• In vSphere 5.0, if a cluster contains 5.0 and earlier host versions, setting the das.config.log.maxFileNum advanced option will cause the 5.0 hosts to maintain two copies of the log files: one maintained by the 5.0 logging mechanism discussed in the ESXi 5.0 documentation (see the following) and one maintained by the pre-5.0 logging mechanism, which is configured using the advanced options previously discussed. In vSphere 5.0U1, this issue has been resolved. In this version, to maintain two sets of log files, the new HA advanced configuration option das.config.log.outputToFiles must be set to true, and das.config.log.maxFileNum must be set to a value greater than two.
• After changing the advanced options, reconfigure HA on each host in the cluster. The log values users configure in this manner will be preserved across vCenter Server updates. However, applying an update that includes a new version of the HA agent will require HA to be reconfigured on each host for the configured values to be reapplied.
5. vSphere ESXi vCenter Server 5.0 Availability Guide (High Level)
Business Continuity and Minimizing Downtime
vSphere makes it possible for organizations to dramatically reduce planned downtime. Because workloads in a vSphere environment
can be dynamically moved to different physical servers without downtime or service interruption, server maintenance can be
performed without requiring application and service downtime. With vSphere, organizations can:
• Eliminate downtime for common maintenance operations.
• Eliminate planned maintenance windows.
• Perform maintenance at any time without disrupting users and services.
The vSphere vMotion and Storage vMotion functionality in vSphere makes it possible for organizations to reduce planned downtime
because workloads in a VMware environment can be dynamically moved to different physical servers or to different underlying
storage without service interruption.
Preventing Unplanned Downtime
Key availability capabilities are built into vSphere:
• Shared storage. Eliminate single points of failure by storing virtual machine files on shared storage, such as Fibre Channel or iSCSI SAN, or NAS. The use of SAN mirroring and replication features can keep updated copies of virtual disks at disaster recovery sites.
• Network interface teaming. Provides tolerance of individual network card failures.
• Storage multipathing. Tolerates storage path failures.
vSphere HA Provides Rapid Recovery from Outages
Unlike other clustering solutions, vSphere HA provides the infrastructure to protect all workloads with the infrastructure:
• You do not need to install special software within the application or virtual machine. All workloads are protected by vSphere HA. After vSphere HA is configured, no actions are required to protect new virtual machines. They are automatically protected.
• You can combine vSphere HA with vSphere Distributed Resource Scheduler (DRS) to protect against failures and to provide load balancing across the hosts within a cluster.
Minimal setup
After a vSphere HA cluster is set up, all virtual machines in the cluster get failover support without additional configuration.
Reduced hardware cost and setup
The virtual machine acts as a portable container for the applications and it can be moved among hosts. Administrators avoid
duplicate configurations on multiple machines. When you use vSphere HA, you must have sufficient resources to fail over the
number of hosts you want to protect with vSphere HA. However, the vCenter Server system automatically manages resources and
configures clusters.
Increased application availability
Any application running inside a virtual machine has access to increased availability. Because the virtual machine can recover from
hardware failure, all applications that start at boot have increased availability without increased computing needs, even if the
application is not itself a clustered application. By monitoring and responding to VMware Tools heartbeats and restarting
nonresponsive virtual machines, it protects against guest operating system crashes.
DRS and vMotion integration
If a host fails and virtual machines are restarted on other hosts, DRS can provide migration recommendations or migrate virtual
machines for balanced resource allocation. If one or both of the source and destination hosts of a migration fail, vSphere HA can
help recover from that failure.
vSphere Fault Tolerance Provides Continuous Availability
vSphere HA provides a base level of protection for your virtual machines by restarting virtual machines in the event of a host failure.
vSphere Fault Tolerance provides a higher level of availability, allowing users to protect any virtual machine from a host failure with
no loss of data, transactions, or connections.
How vSphere HA Works
When you create a vSphere HA cluster, a single host is automatically elected as the master host. The master host communicates
with vCenter Server and monitors the state of all protected virtual machines and of the slave hosts.
The master host must distinguish between a failed host and one that is in a network partition or that has become network isolated.
The master host uses datastore heartbeating to determine the type of failure.
Master and Slave Hosts
When you add a host to a vSphere HA cluster, an agent is uploaded to the host and configured to communicate with other agents in
the cluster. Each host in the cluster functions as a master host or a slave host.
When vSphere HA is enabled for a cluster, all active hosts (those not in standby or maintenance mode, or not disconnected) participate in an election to choose the cluster's master host. The host that mounts the greatest number of datastores has an advantage in the election.
Only one master host exists per cluster and all other hosts are slave hosts. If the master host fails, is shut down, or is removed from the cluster, a new election is held.
The master host in a cluster has a number of responsibilities:
• Monitoring the state of slave hosts. If a slave host fails or becomes unreachable, the master host identifies which virtual machines need to be restarted.
• Monitoring the power state of all protected virtual machines. If one virtual machine fails, the master host ensures that it is restarted. Using a local placement engine, the master host also determines where the restart should be done.
• Managing the lists of cluster hosts and protected virtual machines.
• Acting as the vCenter Server management interface to the cluster and reporting the cluster health state.
The slave hosts primarily contribute to the cluster by running virtual machines locally, monitoring their runtime states, and reporting
state updates to the master host. A master host can also run and monitor virtual machines. Both slave hosts and master hosts
implement the VM and Application Monitoring features.
One of the functions performed by the master host is virtual machine protection. When a virtual machine is protected, vSphere HA
guarantees that it attempts to power it back on after a failure.
A master host commits to protecting a virtual machine when it observes that the power state of the virtual machine changes from
powered off to powered on in response to a user action. If a failover occurs, the master host must restart the virtual machines that
are protected and for which it is responsible. This responsibility is assigned to the master host that has exclusively locked a system-
defined file on the datastore that contains a virtual machine's configuration file.
NOTE If you disconnect a host from a cluster, all of the virtual machines registered to that host are unprotected by vSphere HA.
Host Failure Types and Detection
In a vSphere HA cluster, three types of host failure are detected:
• A host stops functioning (that is, fails).
• A host becomes network isolated.
• A host loses network connectivity with the master host.
The master host monitors the liveness of the slave hosts in the cluster. This communication is done through the exchange of
network heartbeats every second.
When the master host stops receiving these heartbeats from a slave host, it checks for host liveness before declaring the host to
have failed. The liveness check that the master host performs is to determine whether the slave host is exchanging heartbeats with
one of the datastores.
Also, the master host checks whether the host responds to ICMP pings sent to its management IP addresses.
If a master host is unable to communicate directly with the agent on a slave host, the slave host does not respond to ICMP pings, and the agent is not issuing heartbeats, the slave host is considered to have failed.
The host's virtual machines are restarted on alternate hosts.
If such a slave host is exchanging heartbeats with a datastore, the master host assumes that it is in a network partition or is network isolated and so continues to monitor the host and its virtual machines.
Host network isolation occurs when a host is still running, but it can no longer observe traffic from vSphere HA agents on the
management network. If a host stops observing this traffic, it attempts to ping the cluster isolation addresses. If this also fails, the
host declares itself as isolated from the network.

The master host monitors the virtual machines that are running on an isolated host and if it observes that they power off, and the
master host is responsible for the virtual machines, it restarts them.
NOTE If you ensure that the network infrastructure is sufficiently redundant and that at least one network path is available at all
times, host network isolation should be a rare occurrence.
Network Partitions
Datastore Heartbeating
When the master host in a vSphere HA cluster cannot communicate with a slave host over the management network, the master
host uses datastore heartbeating to determine whether the slave host has failed, is in a network partition, or is network isolated. If
the slave host has stopped datastore heartbeating, it is considered to have failed and its virtual machines are restarted elsewhere.
You can use the advanced attribute das.heartbeatdsperhost to change the number of heartbeat datastores selected by vCenter
Server for each host. The default is two and the maximum valid value is five.
vSphere HA creates a directory at the root of each datastore that is used for both datastore heartbeating and for persisting the set of
protected virtual machines. The name of the directory is .vSphere-HA. Do not delete or modify the files stored in this directory,
because this can have an impact on operations.
vSphere HA Security
vSphere HA uses TCP and UDP port 8182 for agent-to-agent communication. The firewall ports open and close automatically to
ensure they are open only when needed.
vSphere HA stores configuration information on the local storage or on ramdisk if there is no local datastore. These files are
protected using file system permissions and they are accessible only to the root user.
For ESXi 5.x hosts, vSphere HA writes to syslog only by default, so logs are placed where syslog is configured to put them. The log file names for vSphere HA are prepended with fdm (fault domain manager), which is the vSphere HA service.
All communication between vCenter Server and the vSphere HA agent is done over SSL.
vSphere HA requires that each host have a verified SSL certificate. Each host generates a self-signed certificate when it is booted for
the first time. This certificate can then be regenerated or replaced with one issued by an authority. If the certificate is replaced,
vSphere HA needs to be reconfigured on the host. If a host becomes disconnected from vCenter Server after its certificate is updated
and the ESXi or ESX Host agent is restarted, then vSphere HA is automatically reconfigured when the host is reconnected to vCenter
Server. If the disconnection does not occur (for example, because vCenter Server host SSL certificate verification is disabled at the time), verify the new certificate and reconfigure vSphere HA on the host.
Using vSphere HA and DRS Together
Using vSphere HA with Distributed Resource Scheduler (DRS) combines automatic failover with load balancing.
When vSphere HA performs failover and restarts virtual machines on different hosts, its first priority is the immediate availability of
all virtual machines. After the virtual machines have been restarted, those hosts on which they were powered on might be heavily
loaded, while other hosts are comparatively lightly loaded.
In a cluster using DRS and vSphere HA with admission control turned on, virtual machines might not be evacuated from hosts
entering maintenance mode. This behavior occurs because of the resources reserved for restarting virtual machines in the event of a
failure. You must manually migrate the virtual machines off of the hosts using vMotion.
In some scenarios, vSphere HA might not be able to fail over virtual machines because of resource constraints. This can occur for several reasons:
• HA admission control is disabled and Distributed Power Management (DPM) is enabled. This can result in DPM consolidating virtual machines onto fewer hosts and placing the empty hosts in standby mode, leaving insufficient powered-on capacity to perform a failover.
• VM-Host affinity (required) rules might limit the hosts on which certain virtual machines can be placed.
• There might be sufficient aggregate resources, but these can be fragmented across multiple hosts so that they cannot be used by virtual machines for failover.
In such cases, vSphere HA can use DRS to try to adjust the cluster (for example, by bringing hosts out of standby mode or migrating
virtual machines to defragment the cluster resources) so that HA can perform the failovers.
If DPM is in manual mode, you might need to confirm host power-on recommendations. Similarly, if DRS is in manual mode, you
might need to confirm migration recommendations.
If you are using VM-Host affinity rules that are required, be aware that these rules cannot be violated. vSphere HA does not perform
a failover if doing so would violate such a rule.
vSphere HA Admission Control
vCenter Server uses admission control to ensure that sufficient resources are available in a cluster to provide failover protection and
to ensure that virtual machine resource reservations are respected.
Three types of admission control are available:
• Host: Ensures that a host has sufficient resources to satisfy the reservations of all virtual machines running on it.
• Resource Pool: Ensures that a resource pool has sufficient resources to satisfy the reservations, shares, and limits of all virtual machines associated with it.
• vSphere HA: Ensures that sufficient resources in the cluster are reserved for virtual machine recovery in the event of host failure.
Admission control imposes constraints on resource usage and any action that would violate these constraints is not permitted.
Examples of actions that could be disallowed include the following:
• Powering on a virtual machine.
• Migrating a virtual machine onto a host or into a cluster or resource pool.
• Increasing the CPU or memory reservation of a virtual machine.
Of the three types of admission control, only vSphere HA admission control can be disabled. However, without it there is no
assurance that the expected number of virtual machines can be restarted after a failure. VMware recommends that you do not
disable admission control, but you might need to do so temporarily, for the following reasons:
• If you need to violate the failover constraints when there are not enough resources to support them, for example, if you are placing hosts in standby mode to test them for use with Distributed Power Management (DPM).
• If an automated process needs to take actions that might temporarily violate the failover constraints (for example, as part of an upgrade directed by vSphere Update Manager).
• If you need to perform testing or maintenance operations.
NOTE When vSphere HA admission control is disabled, vSphere HA ensures that there are at least two powered-on hosts in the
cluster even if DPM is enabled and can consolidate all virtual machines onto a single host. This is to ensure that failover is possible.
Host Failures Cluster Tolerates Admission Control Policy
You can configure vSphere HA to tolerate a specified number of host failures. With the Host Failures Cluster Tolerates admission
control policy, vSphere HA ensures that a specified number of hosts can fail and sufficient resources remain in the cluster to fail over
all the virtual machines from those hosts.
With the Host Failures Cluster Tolerates policy, vSphere HA performs admission control in the following way:
1. Calculates the slot size. A slot is a logical representation of memory and CPU resources. By default, it is sized to satisfy the requirements for any powered-on virtual machine in the cluster.
2. Determines how many slots each host in the cluster can hold.
3. Determines the Current Failover Capacity of the cluster. This is the number of hosts that can fail and still leave enough slots to satisfy all of the powered-on virtual machines.
4. Determines whether the Current Failover Capacity is less than the Configured Failover Capacity (provided by the user). If it is, admission control disallows the operation.
Slot Size Calculation
Slot size is comprised of two components, CPU and memory.
vSphere HA calculates the CPU component by obtaining the CPU reservation of each powered-on virtual machine and selecting the largest value. If you have not specified a CPU reservation for a virtual machine, it is assigned a default value of 32MHz. (You can change this value by using the das.vmcpuminmhz advanced attribute.)
vSphere HA calculates the memory component by obtaining the memory reservation, plus memory overhead, of each powered-on virtual machine and selecting the largest value. There is no default value for the memory reservation.
If your cluster contains any virtual machines that have much larger reservations than the others, they will distort slot size calculation. To avoid this, you can specify an upper bound for the CPU or memory component of the slot size by using the das.slotcpuinmhz or das.slotmeminmb advanced attributes, respectively.
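A brief sketch of how these defaults and upper bounds shape the slot size, using made-up reservations; the das.* names are those referenced above, and the cap values are assumed only for the example.

# Illustrative slot-size calculation with the default and the caps described above.
# Reservations are hypothetical; CPU values in MHz, memory values in MB.
vm_cpu_reservations = [0, 2000, 500]       # 0 -> the 32MHz default applies
vm_mem_reservations = [1024, 8192, 512]    # reservation plus memory overhead

das_vmcpuminmhz = 32       # default CPU value for VMs with no reservation
das_slotcpuinmhz = 4000    # optional upper bound (assumed set for this example)
das_slotmeminmb = 4096     # optional upper bound (assumed set for this example)

cpu_slot = min(max(r or das_vmcpuminmhz for r in vm_cpu_reservations), das_slotcpuinmhz)
mem_slot = min(max(vm_mem_reservations), das_slotmeminmb)
print(cpu_slot, mem_slot)  # 2000 4096 -> the 8192MB reservation is capped at 4096MB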
Using Slots to Compute the Current Failover Capacity
After the slot size is calculated, vSphere HA determines each host's CPU and memory resources that are available for virtual
machines. These amounts are those contained in the host's root resource pool, not the total physical resources of the host. The
resource data for a host that is used by vSphere HA can be found by using the vSphere Client to connect to the host directly, and
then navigating to the Resource tab for the host. If all hosts in your cluster are the same, this data can be obtained by dividing the
cluster-level figures by the number of hosts. Resources being used for virtualization purposes are not included. Only hosts that are
connected, not in maintenance mode, and that have no vSphere HA errors are considered.
The maximum number of slots that each host can support is then determined. To do this, the host's CPU resource amount is divided
by the CPU component of the slot size and the result is rounded down. The same calculation is made for the host's memory resource
amount. These two numbers are compared and the smaller number is the number of slots that the host can support.
The Current Failover Capacity is computed by determining how many hosts (starting from the largest) can fail and still leave enough
slots to satisfy the requirements of all powered-on virtual machines.
Admission Control Using Host Failures Cluster Tolerates Policy
The way that slot size is calculated and used with this admission control policy is shown in an example. Make the following
assumptions about a cluster:
• The cluster is comprised of three hosts, each with a different amount of available CPU and memory resources. The first host (H1) has 9GHz of available CPU resources and 9GB of available memory, while Host 2 (H2) has 9GHz and 6GB and Host 3 (H3) has 6GHz and 6GB.
• There are five powered-on virtual machines in the cluster with differing CPU and memory requirements. VM1 needs 2GHz of CPU resources and 1GB of memory, while VM2 needs 2GHz and 1GB, VM3 needs 1GHz and 2GB, VM4 needs 1GHz and 1GB, and VM5 needs 1GHz and 1GB.
• The Host Failures Cluster Tolerates setting is one.
1. Slot size is calculated by comparing both the CPU and memory requirements of the virtual machines and selecting the largest.
The largest CPU requirement (shared by VM1 and VM2) is 2GHz, while the largest memory requirement (for VM3) is 2GB. Based on
this, the slot size is 2GHz CPU and 2GB memory.
2. Maximum number of slots that each host can support is determined. H1 can support four slots. H2 can support three slots (which
is the smaller of 9GHz/2GHz and 6GB/2GB) and H3 can also support three slots.
3. Current Failover Capacity is computed. The largest host is H1 and if it fails, six slots remain in the cluster, which is sufficient for all
five of the powered-on virtual machines. If both H1 and H2 fail, only three slots remain, which is insufficient. Therefore, the Current
Failover Capacity is one.
The cluster has one available slot (the six slots on H2 and H3 minus the five used slots).
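The same worked example can be reproduced with a short Python sketch; the numbers are exactly those given above.

# Reproducing the slot-size example above (CPU in GHz, memory in GB).
hosts = {"H1": (9, 9), "H2": (9, 6), "H3": (6, 6)}
vms = {"VM1": (2, 1), "VM2": (2, 1), "VM3": (1, 2), "VM4": (1, 1), "VM5": (1, 1)}

# 1. Slot size: largest CPU requirement and largest memory requirement.
slot_cpu = max(cpu for cpu, _ in vms.values())   # 2GHz
slot_mem = max(mem for _, mem in vms.values())   # 2GB

# 2. Slots per host: round down and take the smaller of the CPU and memory counts.
slots = {h: min(cpu // slot_cpu, mem // slot_mem) for h, (cpu, mem) in hosts.items()}
print(slots)  # {'H1': 4, 'H2': 3, 'H3': 3}

# 3. Current Failover Capacity: fail the largest hosts first while enough slots remain.
remaining = sorted(slots.values(), reverse=True)
needed = len(vms)  # five powered-on virtual machines, one slot each
capacity = 0
while remaining and sum(remaining[1:]) >= needed:
    remaining.pop(0)  # remove the largest remaining host
    capacity += 1
print(capacity)  # 1

# Available slots once the largest host is set aside: (3 + 3) - 5 = 1.
print(sum(sorted(slots.values(), reverse=True)[1:]) - needed)  # 1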
Percentage of Cluster Resources Reserved Admission Control Policy
You can configure vSphere HA to perform admission control by reserving a specific percentage of cluster CPU and memory resources
for recovery from host failures.
With the Percentage of Cluster Resources Reserved admission control policy, vSphere HA ensures that a specified percentage of
aggregate CPU and memory resources are reserved for failover.
With the Percentage of Cluster Resources Reserved policy, vSphere HA enforces admission control as follows:
1. Calculates the total resource requirements for all powered-on virtual machines in the cluster.
2. Calculates the total host resources available for virtual machines.
3. Calculates the Current CPU Failover Capacity and Current Memory Failover Capacity for the cluster.
4. Determines whether either the Current CPU Failover Capacity or Current Memory Failover Capacity is less than the corresponding Configured Failover Capacity (provided by the user). If so, admission control disallows the operation.
vSphere HA uses the actual reservations of the virtual machines. If a virtual machine does not have reservations, meaning that the
reservation is 0, a default of 0MB memory and 32MHz CPU is applied.
NOTE The Percentage of Cluster Resources Reserved admission control policy also checks that there are at least two vSphere HA-
enabled hosts in the cluster (excluding hosts that are entering maintenance mode). If there is only one vSphere HA-enabled host, an
operation is not allowed, even if there is a sufficient percentage of resources available. The reason for this extra check is that
vSphere HA cannot perform failover if there is only a single host in the cluster.
Computing the Current Failover Capacity
The total resource requirements for the powered-on virtual machines comprise two components, CPU and memory. vSphere HA calculates these values as follows:
• The CPU component is calculated by summing the CPU reservations of the powered-on virtual machines. If you have not specified a CPU reservation for a virtual machine, it is assigned a default value of 32MHz (this value can be changed using the das.vmcpuminmhz advanced attribute).
• The memory component is calculated by summing the memory reservation (plus memory overhead) of each powered-on virtual machine.
The total host resources available for virtual machines is calculated by adding the hosts' CPU and memory resources. These amounts are those contained in the host's root resource pool, not the total physical resources of the host. Resources being used for virtualization purposes are not included. Only hosts that are connected, not in maintenance mode, and have no vSphere HA errors are considered.
The Current CPU Failover Capacity is computed by subtracting the total CPU resource requirements from the total host CPU resources and dividing the result by the total host CPU resources. The Current Memory Failover Capacity is calculated similarly.
Admission Control Using Percentage of Cluster Resources Reserved Policy
The way that Current Failover Capacity is calculated and used with this admission control policy is shown with an example. Make the following assumptions about a cluster:
• The cluster is comprised of three hosts, each with a different amount of available CPU and memory resources. The first host (H1) has 9GHz of available CPU resources and 9GB of available memory, while Host 2 (H2) has 9GHz and 6GB and Host 3 (H3) has 6GHz and 6GB.
• There are five powered-on virtual machines in the cluster with differing CPU and memory requirements. VM1 needs 2GHz of CPU resources and 1GB of memory, while VM2 needs 2GHz and 1GB, VM3 needs 1GHz and 2GB, VM4 needs 1GHz and 1GB, and VM5 needs 1GHz and 1GB.
• The Configured Failover Capacity is set to 25%.
The total resource requirements for the powered-on virtual machines is 7GHz and 6GB. The total host resources available for virtual
machines is 24GHz and 21GB. Based on this, the Current CPU Failover Capacity is 70% ((24GHz - 7GHz)/24GHz). Similarly, the Current
Memory Failover Capacity is 71% ((21GB-6GB)/21GB).
Because the cluster's Configured Failover Capacity is set to 25%, 45% (70-25) of the cluster's total CPU resources and 46% (71-25) of
the cluster's memory resources are still available to power on additional virtual machines.
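The arithmetic above can be captured in a few lines of Python. This is purely an illustrative sketch of the Percentage of Cluster Resources Reserved calculation using the example figures; the variable names are not part of any VMware API.

hosts = [(9, 9), (9, 6), (6, 6)]                 # (CPU GHz, memory GB) per host
vms = [(2, 1), (2, 1), (1, 2), (1, 1), (1, 1)]   # (CPU GHz, memory GB) reservations per powered-on VM
configured_failover_pct = 25                     # Configured Failover Capacity

total_cpu = sum(cpu for cpu, _ in hosts)         # 24 GHz
total_mem = sum(mem for _, mem in hosts)         # 21 GB
needed_cpu = sum(cpu for cpu, _ in vms)          # 7 GHz
needed_mem = sum(mem for _, mem in vms)          # 6 GB

current_cpu_failover = (total_cpu - needed_cpu) / total_cpu * 100   # ~70%
current_mem_failover = (total_mem - needed_mem) / total_mem * 100   # ~71%

# Admission control disallows an operation if either value drops below the configured capacity.
allowed = (current_cpu_failover >= configured_failover_pct and
           current_mem_failover >= configured_failover_pct)
print(int(current_cpu_failover), int(current_mem_failover), allowed)   # 70 71 True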

Specify Failover Hosts Admission Control Policy
You can configure vSphere HA to designate specific hosts as the failover hosts.
With the Specify Failover Hosts admission control policy, when a host fails, vSphere HA attempts to restart its virtual machines on
one of the specified failover hosts. If this is not possible, for example the failover hosts have failed or have insufficient resources,
then vSphere HA attempts to restart those virtual machines on other hosts in the cluster.
To ensure that spare capacity is available on a failover host, you are prevented from powering on virtual machines or using vMotion
to migrate virtual machines to a failover host. Also, DRS does not use a failover host for load balancing.

NOTE If you use the Specify Failover Hosts admission control policy and designate multiple failover hosts, DRS does not load balance
failover hosts and VM-VM affinity rules are not supported.
The Current Failover Hosts appear in the vSphere HA section of the cluster's Summary tab in the vSphere Client. The status icon next
to each host can be green, yellow, or red.
Green. The host is connected, not in maintenance mode, and has no vSphere HA errors. No powered-on virtual machines
reside on the host.
Yellow. The host is connected, not in maintenance mode, and has no vSphere HA errors. However, powered-on virtual
machines reside on the host.
Red. The host is disconnected, in maintenance mode, or has vSphere HA errors.
Choosing an Admission Control Policy

You should choose a vSphere HA admission control policy based on your availability needs and the characteristics of your cluster.
When choosing an admission control policy, you should consider a number of factors.

Avoiding Resource Fragmentation
Resource fragmentation occurs when there are enough resources in aggregate for a virtual machine to be failed over. However,
those resources are located on multiple hosts and are unusable because a virtual machine can run on one ESXi host at a time

The Host Failures Cluster Tolerates policy avoids resource fragmentation by defining a slot as the maximum virtual machine
reservation.
The Percentage of Cluster Resources policy does not address the problem of resource fragmentation.
With the Specify Failover Hosts policy, resources are not fragmented because hosts are reserved for failover.

Flexibility of Failover Resource Reservation


Admission control policies differ in the granularity of control they give you when reserving cluster resources for failover protection.
The Host Failures Cluster Tolerates policy allows you to set the failover level as a number of hosts. The Percentage of Cluster
Resources policy allows you to designate up to 100% of cluster CPU or memory resources for failover. The Specify Failover Hosts
policy allows you to specify a set of failover hosts.
Heterogeneity of Cluster
Clusters can be heterogeneous in terms of virtual machine resource reservations and host total resource capacities. In a
heterogeneous cluster, the Host Failures Cluster Tolerates policy can be too conservative because it only considers the largest virtual
machine reservations when defining slot size and assumes the largest hosts fail when computing the Current Failover Capacity. The
other two admission control policies are not affected by cluster heterogeneity.
NOTE vSphere HA includes the resource usage of Fault Tolerance Secondary VMs when it performs admission control calculations.
For the Host Failures Cluster Tolerates policy, a Secondary VM is assigned a slot, and for the Percentage of Cluster Resources policy,
the Secondary VM's resource usage is accounted for when computing the usable capacity of the cluster.
vSphere HA Checklist

All hosts must be licensed for vSphere HA.

NOTE ESX/ESXi 3.5 hosts are supported by vSphere HA but must include a patch to address an issue involving file locks. For ESX 3.5
hosts, you must apply the patch ESX350-201012401-SG, while for ESXi 3.5 you must apply the patch ESXe350-201012401-I-BG.
Prerequisite patches need to be applied before applying these patches.

You need at least two hosts in the cluster.


All hosts need to be configured with static IP addresses. If you are using DHCP, you must ensure that the address for each
host persists across reboots.
To ensure that any virtual machine can run on any host in the cluster, all hosts should have access to the same virtual
machine networks and datastores. Similarly, virtual machines must be located on shared, not local, storage otherwise they

cannot be failed over in the case of a host failure.


NOTE vSphere HA uses datastore heartbeating to distinguish between partitioned, isolated, and failed hosts. Accordingly, you
must ensure that datastores reserved for vSphere HA are readily available at all times.

For VM Monitoring to work, VMware tools must be installed


Host certificate checking should be enabled
vSphere HA supports both IPv4 and IPv6. A cluster that mixes the use of both of these protocol versions, however, is more
likely to result in a network partition.


Enabling or Disabling Admission Control
You can enable or disable admission control for the vSphere HA cluster.
Enable: Disallow VM power on operations that violate availability constraints
Enables admission control and enforces availability constraints and preserves failover capacity. Any operation on a virtual machine
that decreases the unreserved resources in the cluster and violates availability constraints is not permitted.
Disable: Allow VM power on operations that violate availability constraints
Disables admission control. Virtual machines can, for example, be powered on even if that causes insufficient failover capacity.
When you do this, no warnings are presented, and the cluster does not turn red. If a cluster has insufficient failover capacity,
vSphere HA can still perform failovers and it uses the VM Restart Priority setting to determine which virtual machines to power on
first.
vSphere HA provides three policies for enforcing admission control, if it is enabled.

Host failures the cluster tolerates


Percentage of cluster resources reserved as failover spare capacity
Specify failover hosts

Virtual Machine Options


Default virtual machine settings control the order in which virtual machines are restarted (VM restart priority) and how vSphere HA
responds if hosts lose network connectivity with other hosts (host isolation response).
VM Restart Priority Setting
VM restart priority determines the relative order in which virtual machines are restarted after a host failure. Such virtual machines
are restarted sequentially on new hosts, with the highest priority virtual machines first and continuing to those with lower priority
until all virtual machines are restarted or no more cluster resources are available
The values for this setting are: Disabled, Low, Medium (the default), and High. If you select Disabled, vSphere HA is disabled for the
virtual machine, which means that it is not restarted on other ESXi hosts if its host fails.
The Disabled setting does not affect virtual machine monitoring, which means that if a virtual machine fails on a host that is
functioning properly, that virtual machine is reset on that same host.
The restart priority settings for virtual machines vary depending on user needs. VMware recommends that you assign higher restart
priority to the virtual machines that provide the most important services.
For example, in the case of a multitier application you might rank assignments according to functions hosted on the virtual
machines.

High. Database servers that will provide data for applications.

Medium. Application servers that consume data in the database and provide results on web pages.
Low. Web servers that receive user requests, pass queries to application servers, and return results to
users.

Host Isolation Response Setting


Host isolation response determines what happens when a host in a vSphere HA cluster loses its management network connections
but continues to run. Host isolation responses require that Host Monitoring Status is enabled. If Host Monitoring Status is disabled,
host isolation responses are also suspended.
A host determines that it is isolated when it is unable to communicate with the agents running on the other hosts and it is unable to
ping its isolation addresses.
When this occurs, the host executes its isolation response. The responses are:

Leave powered on (the default)


Power off
Shut down


You can customize this property for individual virtual machines.
To use the Shut down VM setting, you must install VMware Tools in the guest operating system of the virtual machine.

Virtual machines that are in the process of shutting down will take longer to fail over while the shutdown completes. Virtual machines that have not shut down within 300 seconds, or within the time specified by the das.isolationshutdowntimeout advanced attribute, are powered off.
NOTE After you create a vSphere HA cluster, you can override the default cluster settings for Restart Priority and Isolation Response
for specific virtual machines. Such overrides are useful for virtual machines that are used for special tasks. For example, virtual
machines that provide infrastructure services like DNS or DHCP might need to be powered on before other virtual machines in the
cluster.
If a host has its isolation response disabled (that is, it leaves virtual machines powered on when isolated) and the host loses access
to both the management and storage networks, a "split brain" situation can arise. In this case, the isolated host loses the disk locks
and the virtual machines are failed over to another host even though the original instances of the virtual machines remain running
on the isolated host.
When the host comes out of isolation, there will be two copies of the virtual machines, although the copy on the originally isolated
host does not have access to the vmdk files and data corruption is prevented. In the vSphere Client, the virtual machines appear to
be flipping back and forth between the two hosts.
To recover from this situation, ESXi generates a question on the virtual machine that has lost the disk locks for when the host comes
out of isolation and realizes that it cannot reacquire the disk locks. vSphere HA automatically answers this question and this allows
the virtual machine instance that has lost the disk locks to power off, leaving just the instance that has the disk locks.
VM and Application Monitoring
VM Monitoring restarts individual virtual machines if their VMware Tools heartbeats are not received within a set time. Similarly,
Application Monitoring can restart a virtual machine if the heartbeats for an application it is running are not received. You can
enable these features and configure the sensitivity with which vSphere HA monitors non-responsiveness.
When you enable VM Monitoring, the VM Monitoring service (using VMware Tools) evaluates whether each virtual machine in the
cluster is running by checking for regular heartbeats and I/O activity from the VMware Tools process running inside the guest. If no
heartbeats or I/O activity are received, this is most likely because the guest operating system has failed or VMware Tools is not being
allocated any time to complete tasks.
In such a case, the VM Monitoring service determines that the virtual machine has failed and the virtual machine is rebooted to
restore service.
Occasionally, virtual machines or applications that are still functioning properly stop sending heartbeats. To avoid unnecessary

resets, the VM Monitoring service also monitors a virtual machine's I/O activity.
If no heartbeats are received within the failure interval, the I/O stats interval (a cluster-level attribute) is checked.
The I/O stats interval determines if any disk or network activity has occurred for the virtual machine during the previous two
minutes (120 seconds). If not, the virtual machine is reset. This default value (120 seconds) can be changed using the advanced
attribute das.iostatsinterval.
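The decision logic described above can be sketched as follows. This is an illustration only, not actual FDM/VMkernel code; das.iostatsinterval is the real advanced attribute, but the function and its parameters are hypothetical.

def should_reset_vm(seconds_since_heartbeat, failure_interval,
                    seconds_since_last_io, iostats_interval=120):
    # Reset the VM only if VMware Tools heartbeats have been missing for the
    # failure interval AND there has been no disk/network I/O within
    # das.iostatsinterval (default 120 seconds). A value of 0 disables the I/O check.
    if seconds_since_heartbeat < failure_interval:
        return False                   # heartbeats still arriving
    if iostats_interval == 0:
        return True                    # I/O check disabled, rely on heartbeats only
    return seconds_since_last_io >= iostats_interval

print(should_reset_vm(40, 30, 10))    # False - recent I/O, so no reset despite missed heartbeats
print(should_reset_vm(40, 30, 300))   # True  - no heartbeats and no recent I/O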
To enable Application Monitoring, you must first obtain the appropriate SDK (or be using an application that supports VMware
Application Monitoring) and use it to set up customized heartbeats for the applications you want to monitor. After you have done
this, Application Monitoring works much the same way that VM Monitoring does.
vSphere HA Advanced Attributes
das.isolationaddress[...]: Sets the address to ping to determine if a host is isolated from the network. This address is pinged only when heartbeats are not received from any other host in the cluster. If not specified, the default gateway of the management network is used. This default gateway has to be a reliable address that is available, so that the host can determine if it is isolated from the network. You can specify multiple isolation addresses (up to 10) for the cluster: das.isolationaddressX, where X = 1-10. Typically you should specify one per management network. Specifying too many addresses makes isolation detection take too long.

das.usedefaultisolationaddress: By default, vSphere HA uses the default gateway of the console network as an isolation address. This attribute specifies whether or not this default is used (true|false).

das.isolationshutdowntimeout: The period of time the system waits for a virtual machine to shut down before powering it off. This applies only if the host's isolation response is Shut down VM. The default value is 300 seconds.

das.slotmeminmb: Defines the maximum bound on the memory slot size. If this option is used, the slot size is the smaller of this value or the maximum memory reservation plus memory overhead of any powered-on virtual machine in the cluster.

das.slotcpuinmhz: Defines the maximum bound on the CPU slot size. If this option is used, the slot size is the smaller of this value or the maximum CPU reservation of any powered-on virtual machine in the cluster.

das.vmmemoryminmb: Defines the default memory resource value assigned to a virtual machine if its memory reservation is not specified or zero. This is used for the Host Failures Cluster Tolerates admission control policy. If no value is specified, the default is 0MB.

das.vmcpuminmhz: Defines the default CPU resource value assigned to a virtual machine if its CPU reservation is not specified or zero. This is used for the Host Failures Cluster Tolerates admission control policy. If no value is specified, the default is 32MHz.

das.iostatsinterval: Changes the default I/O stats interval for VM Monitoring sensitivity. The default is 120 (seconds). It can be set to any value greater than or equal to 0. Setting it to 0 disables the check.

das.ignoreinsufficienthbdatastore: Disables configuration issues created if the host does not have sufficient heartbeat datastores for vSphere HA. The default value is false.

das.heartbeatdsperhost: Changes the number of heartbeat datastores required. Valid values range from 2 to 5; the default is 2.

NOTE: If you change the value of any of the following advanced attributes, you must disable and then re-enable vSphere HA before
your changes take effect.

das.isolationaddress[...]
das.usedefaultisolationaddress
das.isolationshutdowntimeout
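These attributes are set as cluster-level advanced options. The following is a hedged pyVmomi sketch, assuming an already-connected session and an existing vim.ClusterComputeResource object named cluster; the attribute values shown are examples only.

from pyVmomi import vim

spec = vim.cluster.ConfigSpecEx()
spec.dasConfig = vim.cluster.DasConfigInfo()
spec.dasConfig.option = [
    # Example isolation address on the management network (illustrative value).
    vim.option.OptionValue(key="das.isolationaddress1", value="10.0.0.254"),
    # Stop using the default gateway as the isolation address.
    vim.option.OptionValue(key="das.usedefaultisolationaddress", value="false"),
]

# modify=True merges these options into the existing cluster configuration.
task = cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)

# For the attributes listed above, vSphere HA must then be disabled and re-enabled
# on the cluster before the change takes effect.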

Options No Longer Supported

das.defaultfailoverhost
das.failureDetectionTime
das.failureDetectionInterval

Best Practices for vSphere HA Clusters


Setting Alarms to Monitor Cluster Changes
When vSphere HA or Fault Tolerance take action to maintain availability, for example, a virtual machine failover, you can be notified
about such changes. Configure alarms in vCenter Server to be triggered when these actions occur, and have alerts, such as emails,
sent to a specified set of administrators.
Several default vSphere HA alarms are available.

Insufficient failover resources (a cluster alarm)


Cannot find master (a cluster alarm)
Failover in progress (a cluster alarm)
Host HA status (a host alarm)
VM monitoring error (a virtual machine alarm)
VM monitoring action (a virtual machine alarm)
Failover failed (a virtual machine alarm)


Monitoring Cluster Validity
A valid cluster is one in which the admission control policy has not been violated.
A cluster enabled for vSphere HA becomes invalid (red) when the number of virtual machines powered on exceeds the failover
requirements, that is, the current failover capacity is smaller than configured failover capacity. If admission control is disabled,
clusters do not become invalid.
Admission Control Best Practices
The following recommendations are best practices for vSphere HA admission control

Select the Percentage of Cluster Resources Reserved admission control policy. This policy offers the most flexibility in terms
of host and virtual machine sizing. In most cases, a calculation of 1/N, where N is the number of total nodes in the cluster,
yields adequate sparing.

Ensure that you size all cluster hosts equally. An unbalanced cluster results in excess capacity being reserved to handle
failure of the largest possible node.
Try to keep virtual machine sizing requirements similar across all configured virtual machines. The Host Failures Cluster
Tolerates admission control policy uses slot sizes to calculate the amount of capacity needed to reserve for each virtual
machine. The slot size is based on the largest reserved memory and CPU needed for any virtual machine. When you mix
virtual machines of different CPU and memory requirements, the slot size calculation defaults to the largest possible, which
limits consolidation. (A sketch of this slot-size behavior follows the list.)
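The sketch below illustrates how a single large reservation inflates the slot size under the Host Failures Cluster Tolerates policy. It is a simplified illustration: the real calculation also adds per-virtual-machine memory overhead and honors the das.slotcpuinmhz/das.slotmeminmb caps.

def slot_size(vm_reservations, default_cpu_mhz=32, default_mem_mb=0):
    # Slot size is driven by the largest CPU and memory reservation of any powered-on VM.
    cpu_slot = max((vm.get("cpu_mhz", 0) or default_cpu_mhz) for vm in vm_reservations)
    mem_slot = max((vm.get("mem_mb", 0) or default_mem_mb) for vm in vm_reservations)
    return cpu_slot, mem_slot

vms = [
    {"cpu_mhz": 0,    "mem_mb": 0},     # no reservation: defaults of 32MHz / 0MB apply
    {"cpu_mhz": 2000, "mem_mb": 1024},  # one large VM inflates the slot size for everyone
    {"cpu_mhz": 500,  "mem_mb": 512},
]
print(slot_size(vms))                   # (2000, 1024)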

Using Auto Deploy with vSphere HA


You can use vSphere HA and Auto Deploy together to improve the availability of your virtual machines. Auto Deploy provisions hosts
when they power up and you can also configure it to install the vSphere HA agent on such hosts during the boot process. To have
Auto Deploy install the vSphere HA agent, the image profile you assign to the host must include the vmware-fdm VIB.
Best Practices for Networking
Network Configuration and Maintenance

When making changes to the networks that your clustered ESXi hosts are on, VMware recommends that you suspend the
Host Monitoring feature. Changing your network hardware or networking settings can interrupt the heartbeats that
vSphere HA uses to detect host failures, and this might result in unwanted attempts to fail over virtual machines.
When you change the networking configuration on the ESXi hosts themselves, for example, adding port groups, or
removing vSwitches, VMware recommends that in addition to suspending Host Monitoring, you place the hosts on which
the changes are being made into maintenance mode. When the host comes out of maintenance mode, it is reconfigured,
which causes the network information to be reinspected for the running host. If not put into maintenance mode, the
vSphere HA agent runs using the old network configuration information.


Networks Used for vSphere HA Communications
To identify which network operations might disrupt the functioning of vSphere HA, you should know which management networks
are being used for heart beating and other vSphere HA communications.

On legacy ESX hosts in the cluster, vSphere HA communications travel over all networks that are designated as service
console networks. VMkernel networks are not used by these hosts for vSphere HA communications.
On ESXi hosts in the cluster, vSphere HA communications, by default, travel over VMkernel networks, except those marked
for use with vMotion. If there is only one VMkernel network, vSphere HA shares it with vMotion, if necessary. With ESXi 4.x and ESXi 5.x hosts, you must also explicitly enable the Management traffic checkbox for vSphere HA to use this network.

NOTE VMware recommends that you do not configure hosts with multiple vmkNICs on the same subnet. If this is done, be aware
that vSphere HA sends packets using any pNIC that is associated with a given subnet if at least one vNIC for that subnet has been
configured for management traffic.
Network Isolation Addresses
A network isolation address is an IP address that is pinged to determine whether a host is isolated from the network. This address is
pinged only when a host has stopped receiving heartbeats from all other hosts in the cluster. If a host can ping its network isolation
address, the host is not network isolated, and the other hosts in the cluster have failed. However, if the host cannot ping its isolation
address, it is likely that the host has become isolated from the network and no failover action is taken.
By default, the network isolation address is the default gateway for the host. Only one default gateway is specified, regardless of
how many management networks have been defined. You should use the das.isolationaddress[...] advanced attribute to add
isolation addresses for additional networks.
Other Networking Considerations

Configuring Switches. If the physical network switches that connect your servers support the Port Fast (or an equivalent)

setting, enable it. This setting prevents a host from incorrectly determining that a network is isolated during the execution
of lengthy spanning tree algorithms.
Port Group Names and Network Labels. Use consistent port group names and network labels on VLANs for public networks.
Port group names are used to reconfigure access to the network by virtual machines. If you use inconsistent names
between the original server and the failover server, virtual machines are disconnected from their networks after failover.
Network labels are used by virtual machines to reestablish network connectivity upon restart.
Configure the management networks so that the vSphere HA agent on a host in the cluster can reach the agents on any of
the other hosts using one of the management networks. If you do not set up such a configuration, a network partition
condition can occur after a master host is elected.

Network Path Redundancy


Network path redundancy between cluster nodes is important for vSphere HA reliability. A single management network ends up
being a single point of failure and can result in failovers even though only the network has failed.
You can implement network redundancy at the NIC level with NIC teaming, or at the management network level. In most
implementations, NIC teaming provides sufficient redundancy, but you can use or add management network redundancy if required.
Redundant management networking allows the reliable detection of failures and prevents isolation conditions from occurring,
because heartbeats can be sent over multiple networks.
Configure the fewest possible number of hardware segments between the servers in a cluster, the goal being to limit single points of failure. Additionally, routes with too many hops can cause networking packet delays for heartbeats and increase the possible points of failure.
Network Redundancy Using NIC Teaming
Using a team of two NICs connected to separate physical switches improves the reliability of a management network. Because
servers connected through two NICs (and through separate switches) have two independent paths for sending and receiving
heartbeats, the cluster is more resilient. To configure a NIC team for the management network, configure the vNICs in vSwitch
configuration for Active or Standby configuration. The recommended parameter settings for the vNICs are:
Default load balancing = route based on originating port ID
Failback = No
After you have added a NIC to a host in your vSphere HA cluster, you must reconfigure vSphere HA on that host.
Network Redundancy Using a Secondary Network
As an alternative to NIC teaming for providing redundancy for heartbeats, you can create a secondary management network
connection, which is attached to a separate virtual switch. The primary management network connection is used for network and
management purposes. When the secondary management network connection is created, vSphere HA sends heartbeats over both
the primary and secondary management network connections. If one path fails, vSphere HA can still send and receive heartbeats
over the other path.
6. Best Practices for Running VMware vSphere on iSCSI

iSCSI considerations

For datacenters with centralized storage, iSCSI offers customers many benefits. It is comparatively inexpensive and it is based on
familiar SCSI and TCP/IP standards. In comparison to FC and Fibre Channel over Ethernet (FCoE) SAN deployments, iSCSI requires less
hardware, it uses lower-cost hardware, and more IT staff members might be familiar with the technology. These factors contribute
to lower-cost implementations.
One major difference between iSCSI and FC relates to I/O congestion. When an iSCSI path is overloaded, the TCP/IP protocol drops
packets and requires them to be resent. FC communication over a dedicated path has a built-in pause mechanism when congestion
occurs

When a network path carrying iSCSI storage traffic is oversubscribed, a bad situation quickly grows worse and performance further
degrades as dropped packets must be resent. There can be multiple reasons for an iSCSI path being overloaded, ranging from
oversubscription (too much traffic), to network switches that have a low port buffer.
Another consideration is the network bandwidth. Network bandwidth is dependent on the Ethernet standards used (1Gb or 10Gb).
There are other mechanisms such as port aggregation and bonding links that deliver greater network bandwidth.
When implementing software iSCSI that uses network interface cards rather than dedicated iSCSI adapters, gigabit Ethernet
interfaces are required. These interfaces tend to consume a significant amount of CPU resources.
One way of overcoming this demand for CPU resources is to use a feature called a TOE (TCP/IP offload engine). TOEs shift TCP packet
processing tasks from the server CPU to specialized TCP processors on the network adaptor or storage device
iSCSI was considered a technology that did not work well over most shared wide-area networks. It has prevalently been approached
as a local area network technology. However, this is changing. For synchronous replication writes (in the case of high availability) or remote data writes, iSCSI might not be a good fit: the added latency increases data transfer delays and might impact application performance. Asynchronous replication, which is not latency sensitive, is an ideal fit for iSCSI. VMware vCenter Site Recovery Manager may build upon iSCSI asynchronous storage replication for simple, reliable site disaster protection.
iSCSI Architecture
iSCSI initiators must manage multiple, parallel communication links to multiple targets. Similarly, iSCSI targets must manage
multiple, parallel communications links to multiple initiators. Several identifiers exist in iSCSI to make this happen, including iSCSI
Name, ISID (iSCSI session identifiers), TSID (target session identifier), CID (iSCSI connection identifier) and iSCSI portals.
iSCSI Names
iSCSI nodes have globally unique names that do not change when Ethernet adapters or IP addresses change. iSCSI supports two
name formats as well as aliases. The first name format is the Extended Unique Identifier (EUI). An example of an EUI name might be
eui.02004567A425678D.
The second name format is the iSCSI Qualified Name (IQN). An example of an IQN name might be iqn.1998-01.com.vmware:tm-pod04-esx01-6129571c.
iSCSI Initiators and Targets
A storage network consists of two types of equipment: initiators and targets. Initiators, such as hosts, are data consumers. Targets,
such as disk arrays or tape libraries, are data providers. In the context of vSphere, iSCSI initiators fall into three distinct categories.
They can be software, hardware dependent or hardware independent.
Software iSCSI Adapter
A software iSCSI adapter is VMware code built into the VMkernel. It enables your host to connect to the iSCSI storage device through
standard network adaptors. The software iSCSI adapter handles iSCSI processing while communicating with the network adaptor.
With the software iSCSI adapter, you can use iSCSI technology without purchasing specialized hardware.
Dependent Hardware iSCSI Adapter
This hardware iSCSI adapter depends on VMware networking and iSCSI configuration and management interfaces provided by
VMware. This type of adapter can be a card that presents a standard network adaptor and iSCSI offload functionality for the same
port. The iSCSI offload functionality depends on the host's network configuration to obtain the IP and MAC addresses, as well as
other parameters used for iSCSI sessions. An example of a dependent adapter is the iSCSI licensed Broadcom 5709 NIC.
Independent Hardware iSCSI Adapter
This type of adapter implements its own networking and iSCSI configuration and management interfaces. An example of an
independent hardware iSCSI adapter is a card that presents either iSCSI offload functionality only or iSCSI offload functionality and

standard NIC functionality. The iSCSI offload functionality has independent configuration management that assigns the IP address,
MAC address, and other parameters used for the iSCSI sessions. An example of an independent hardware adapter is the QLogic
QLA4052 adapter.
iSCSI Portals
iSCSI nodes keep track of connections via portals, enabling separation between names and IP addresses. A portal manages an IP
address and a TCP port number. Therefore, from an architectural perspective, sessions can be made up of multiple logical
connections, and portals track connections via TCP/IP port/address


iSCSI Implementation Options
With the hardware-initiator iSCSI implementation, the iSCSI HBA provides the translation from SCSI commands to an encapsulated
format that can be sent over the network. A TCP offload engine (TOE) does this translation on the adapter.
The software-initiator iSCSI implementation leverages the VMkernel to perform the SCSI to IP translation and requires extra CPU
cycles to perform this work. As mentioned previously, most enterprise-level networking chip sets offer TCP offload or checksum
offloads, which vastly reduce CPU overhead.
Mixing iSCSI Options
Having both software iSCSI and hardware iSCSI enabled on the same host is supported. However, use of both software and hardware
adapters on the same vSphere host to access the same target is not supported. One cannot have the host access the same target via
hardware-dependent, hardware-independent, and software iSCSI adapters for multipathing purposes.

Networking Settings
Network design is key to making sure iSCSI works. In a production environment, gigabit Ethernet is essential for software iSCSI.
Hardware iSCSI, in a VMware Infrastructure environment, is implemented with dedicated HBAs.
iSCSI should be considered a local-area technology, not a wide-area technology, because of latency issues and security concerns. You
should also segregate iSCSI traffic from general traffic. Layer-2 VLANs are a particularly good way to implement this segregation.
Beware of oversubscription. Oversubscription occurs when more users are connected to a system than can be fully supported at the
same time. Networks and servers are almost always designed with some amount of oversubscription, assuming that users do not all
need the service simultaneously. If they do, delays are certain and outages are possible. Oversubscription is permissible on general-
purpose LANs, but you should not use an oversubscribed configuration for iSCSI.
Best practice is to have a dedicated LAN for iSCSI traffic and not share the network with other network traffic. It is also best practice
not to oversubscribe the dedicated LAN.
Finally, because iSCSI leverages the IP network, VMkernel NICs can be placed into teaming configurations. Alternatively, a VMware

recommendation is to use port binding rather than NIC teaming. Port binding will be explained in detail later in this paper but suffice
to say that with port binding, iSCSI can leverage VMkernel multipath capabilities such as failover on SCSI errors and Round Robin
path policy for performance.
In the interest of completeness, both methods will be discussed. However, port binding is the recommended best practice.
VMkernel Network Configuration
A VMkernel network is required for IP storage and thus is required for iSCSI. A best practice would be to keep the iSCSI traffic
separate from other networks, including the management and virtual machine networks.
IPv6 Supportability Statements
At the time of this writing, there is no IPv6 support for either hardware iSCSI or software iSCSI adapters in vSphere 5.1.
Throughput Options
There are a number of options available to improve iSCSI performance.
1. 10GbE. This is an obvious option to begin with. If you can provide a larger pipe, the likelihood is that you will achieve greater throughput. Of course, if there is not enough I/O to fill a 1GbE connection, then a larger connection isn't going to help you. But let's assume that there are enough virtual machines and enough datastores for 10GbE to be beneficial.
2. Jumbo frames. This feature can deliver additional throughput by increasing the size of the payload in each frame from a default
MTU of 1,500 to an MTU of 9,000. However, great care and consideration must be used if you decide to implement it. All devices
sitting in the I/O path (iSCSI target, physical switches, network interface cards and VMkernel ports) must be able to implement
jumbo frames for this option to provide the full benefits. For example, if the MTU is not correctly set on the switches, the datastores
might mount but I/O will fail. A common issue with jumbo-frame configurations is that the MTU value on the switch isn't set
correctly. In most cases, this must be higher than that of the hosts and storage, which are typically set to 9,000. Switches must be set
higher, to 9,198 or 9,216 for example, to account for IP overhead. Refer to switch-vendor documentation as well as storage-vendor
documentation before attempting to configure jumbo frames.
3. Round Robin path policy. Round Robin uses automatic path selection, rotating through all available paths and enabling the
distribution of load across the configured paths. This path policy can help improve I/O throughput. For active/passive storage arrays,
only the paths to the active controller will be used in the Round Robin policy. For active/active storage arrays, all paths will be used
in the Round Robin policy. For ALUA arrays (Asymmetric Logical Unit Assignment), Round Robin uses only the active/optimized (AO)
paths. These are the paths to the disk through the managing controller. Active/nonoptimized (ANO) paths to the disk through the
nonmanaging controller are not used.
Not all arrays support the Round Robin path policy. Refer to your storage-array vendor's documentation for recommendations on
using this Path Selection Policy (PSP).
Minimizing Latency
Because iSCSI on VMware uses TCP/IP to transfer I/O, latency can be a concern. To decrease latency, one should always try to
minimize the number of hops between the storage and the vSphere host. Ideally, one would not route traffic between the vSphere
host and the storage array, and both would coexist on the same subnet.
NOTE: If iSCSI port bindings are implemented for the purposes of multipathing, you cannot route your iSCSI traffic.

Routing
A vSphere host has a single routing table for all of its VMkernel Ethernet interfaces. This imposes some limits on network
communication. Consider a configuration that uses two Ethernet adapters with one VMkernel TCP/IP stack. One adapter is on the
10.17.1.1/24 IP network and the other on the 192.168.1.1/24 network. Assume that 10.17.1.253 is the address of the default
gateway. The VMkernel can communicate with any servers reachable by routers that use the 10.17.1.253 gateway. It might not be
able to talk to all servers on the 192.168 network unless both networks are on the same broadcast domain.

The VMkernel TCP/IP Routing Table


Another consequence of the single routing table affects one approach you might otherwise consider for balancing I/O. Consider a
configuration in which you want to connect to iSCSI storage and also want to enable NFS mounts. It might seem that you can use
one Ethernet adapter for iSCSI and a separate Ethernet adapter for NFS traffic to spread the I/O load. This approach does not work
because of the way the VMkernel TCP/IP stack handles entries in the routing table.
For example, you might assign an IP address of 10.16.156.66 to the VMkernel adapter you want to use for NFS. The routing table
then contains an entry for the 10.16.156.x network for this adapter. If you then set up a second adapter for iSCSI and assign it an IP
address of 10.16.156.25, the routing table contains a new entry for the 10.16.156.x network for the second adapter. However, when
the TCP/IP stack reads the routing table, it never reaches the second entry, because the first entry satisfies all routes to both
adapters. Therefore, no traffic ever goes out on the iSCSI network, and all IP storage traffic goes out on the NFS network.
The fact that all 10.16.156.x traffic is routed on the NFS network causes two types of problems. First, you do not see any traffic on
the second Ethernet adapter. Second, if you try to add trusted IP addresses both to iSCSI arrays and NFS servers, traffic to one or the
other comes from the wrong IP address.
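The effect described above behaves like a first-match lookup against same-prefix entries, which the following standalone Python sketch illustrates (this models the observable behavior only, not the actual VMkernel routing code):

import ipaddress

routing_table = [
    ("10.16.156.0/24", "vmk1 (NFS)"),    # entry created first, when the NFS VMkernel port was added
    ("10.16.156.0/24", "vmk2 (iSCSI)"),  # entry created second, never reached
]

def pick_interface(dest_ip):
    for network, interface in routing_table:
        if ipaddress.ip_address(dest_ip) in ipaddress.ip_network(network):
            return interface             # first matching entry wins
    return "default gateway"

print(pick_interface("10.16.156.50"))    # vmk1 (NFS): all 10.16.156.x traffic leaves via the NFS vmknic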
Using Static Routes
As mentioned before, for vSphere hosts, the management network is on a VMkernel port and therefore uses the default VMkernel
gateway. Only one VMkernel default gateway can be configured on a vSphere host. You can, however, add static routes to additional
gateways/routers from the command line.
Availability Options Multipathing or NIC Teaming
NIC Teaming for Availability
A best practice for iSCSI is to avoid the vSphere feature called teaming (on the network interface cards) and instead use port binding.
Port binding introduces multipathing for availability of access to the iSCSI targets and LUNs. If for some reason this is not suitable
(for instance, you wish to route traffic between the iSCSI initiator and target), then teaming might be an alternative.
If you plan to use teaming to increase the availability of your network access to the iSCSI storage array, you must turn off port
security on the switch for the two ports on which the virtual IP address is shared.
The purpose of this port security setting is to prevent spoofing of IP addresses.
Thus many network administrators enable this setting. However, if you do not change it, the port security setting prevents failover
of the virtual IP from one switch port to another and teaming cannot fail over from one path to another. For most LAN switches, the
port security is enabled on a port level and thus can be set on or off for each port.
iSCSI Multipathing via Port Binding for Availability
Another way to achieve availability is to create a multipath configuration. This method is preferred over NIC teaming, because it fails over I/O to alternate paths based on SCSI sense codes and not just on network failures. Also, port binding gives administrators the opportunity to load-balance I/O over multiple paths to the storage device.


Error Correction Digests
iSCSI header and data digests check the end-to-end, noncryptographic data integrity beyond the integrity checks that other
networking layers provide, such as TCP and Ethernet. They check the entire communication path, including all elements that can
change the network-level traffic, such as routers, switches and proxies.
Enabling header and data digests does require additional processing for both the initiator and the target and can affect throughput
and CPU use performance.

Some systems can offload the iSCSI digest calculations to the network processor, thus reducing the impact on performance.
Flow Control
The general consensus from our storage partners is that hardware-based flow control is recommended for all network interfaces
and switches.
Security Considerations
Private Network
iSCSI storage traffic is transmitted in an unencrypted format across the LAN. Therefore, it is considered best practice to use iSCSI on
trusted networks only and to isolate the traffic on separate physical switches or to leverage a private VLAN. All iSCSI-array vendors
agree that it is good practice to isolate iSCSI traffic for security reasons. This would mean isolating the iSCSI traffic on its own
separate physical switches or leveraging a dedicated VLAN (IEEE 802.1Q).
Encryption
iSCSI supports several types of security. IPSec (Internet Protocol Security) is a developing standard for security at the network or
packet-processing layer of network communication. IKE (Internet Key Exchange) is an IPSec standard protocol used to ensure
security for VPNs. However, at the time of this writing IPSec was not supported on vSphere hosts.
Authentication
There are also a number of authentication methods supported with iSCSI.

Kerberos (not supported vSphere 5.1)


SRP (Secure Remote Password) (not supported vSphere 5.1)
SPKM1/2 (Simple Public-Key Mechanism) (not supported vSphere 5.1)
CHAP (Challenge Handshake Authentication Protocol) (Supported)


At the time of this writing (vSphere 5.1), a vSphere host does not support Kerberos, SRP or public-key authentication methods for
iSCSI.
The only authentication protocol supported is CHAP. CHAP verifies identity using a hashed transmission.
The target initiates the challenge. Both parties know the secret key. It periodically repeats the challenge to guard against replay
attacks. CHAP is a one-way protocol, but it might be implemented in two directions to provide security for both ends. The iSCSI
specification defines the CHAP security method as the only must-support protocol. The VMware implementation uses this security
option. Initially, VMware supported only unidirectional CHAP, but bidirectional CHAP is now supported.
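As background on how CHAP's hashed transmission works, the following is a generic CHAP-with-MD5 illustration in the style of RFC 1994; it is not VMware-specific code, and the secret and challenge values are invented.

import hashlib
import os

secret = b"example-shared-secret"   # known to both initiator and target, never sent on the wire
identifier = bytes([1])             # CHAP identifier chosen by the target
challenge = os.urandom(16)          # random challenge sent by the target

# Initiator's response: hash of identifier + secret + challenge.
response = hashlib.md5(identifier + secret + challenge).digest()

# The target computes the same digest and compares; a match verifies the initiator's identity.
expected = hashlib.md5(identifier + secret + challenge).digest()
print(response == expected)         # True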
iSCSI Datastore Provisioning Steps
1. Create a new VMkernel port group for IP storage on an already existing virtual switch (vSwitch) or on a new vSwitch when it is configured. The vSwitch can be a vSphere Standard Switch (VSS) or a VMware vSphere Distributed Switch.
2. Ensure that the iSCSI initiator on the vSphere host(s) is enabled.
3. Ensure that the iSCSI storage is configured to export a LUN accessible to the vSphere host iSCSI initiators on a trusted network. (A pyVmomi sketch of the host-side steps follows this list.)
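A hedged pyVmomi sketch of the host-side portion of these steps is shown below. It assumes host is an existing vim.HostSystem obtained from a connected session; the portal address is an example only, and your array documentation remains the authority for target configuration.

from pyVmomi import vim

storage = host.configManager.storageSystem

# Step 2: make sure the software iSCSI initiator is enabled on the host.
storage.UpdateSoftwareInternetScsiEnabled(True)

# Locate the software iSCSI HBA created by the enablement.
hbas = host.config.storageDevice.hostBusAdapter
sw_iscsi = next(h for h in hbas if isinstance(h, vim.host.InternetScsiHba))

# Step 3 (host side): point the initiator at the array's portal on the trusted network.
target = vim.host.InternetScsiHba.SendTarget(address="192.168.10.20", port=3260)
storage.AddInternetScsiSendTargets(iScsiHbaDevice=sw_iscsi.device, targets=[target])

# Rescan so any exported LUNs become visible for datastore creation.
storage.RescanAllHba()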

Why Use iSCSI Multipathing?


The primary use case of this feature is to create a multipath configuration with storage that presents only a single storage portal,
such as the DELL EqualLogic and the HP LeftHand.
Without iSCSI multipathing, this type of storage would have one path only between the VMware ESX host and each volume. iSCSI
multipathing enables us to multipath to this type of clustered storage.
Another benefit is the ability to use alternate VMkernel networks outside of the vSphere host management network. This means

that if the management network suffers an outage, you continue to have iSCSI connectivity via the VMkernel ports participating in
the iSCSI bindings.
NOTE: VMware considers the implementation of iSCSI multipathing, rather than NIC teaming, a best practice.

Software iSCSI Multipathing Configuration Steps
For port binding to work correctly, the initiator must be able to reach the target directly on the same subnet; iSCSI port binding in vSphere 5.0 does not support routing.
For example, if the VMkernel ports are placed on the same VLAN as the target (VLAN 74 in the source configuration), they can reach the iSCSI target without the need for a router. This is an important point and requires further elaboration because it causes some confusion. If port binding is not implemented and a standard VMkernel port is used, the initiator can reach the targets through a routed network.
This is supported and works well. It is only when iSCSI binding is implemented that a direct, non-routed network between the initiators and targets is required. In other words, initiators and targets must be on the same subnet.


There is another important point to note when it comes to the configuration of iSCSI port bindings. On VMware standard switches
that contain multiple vmnic uplinks, each VMkernel (vmk) port used for iSCSI bindings must be associated with a single vmnic uplink.
The other uplink(s) on the vSwitch must be placed into an unused state. This is only a requirement when there are multiple vmnic
uplinks on the same vSwitch. If you are using multiple VSSs with their own vmnic uplinks, then this is not an issue.
Continuing with the network configuration, a second VMkernel (vmk) port is created. Now there are two vmk ports, labeled iSCSI1
and iSCSI2. These will be used for the iSCSI port binding/multipathing configuration. The next step is to configure the bindings and
iSCSI targets. This is done in the properties of the software iSCSI adapter. Since vSphere 5.0, there is a new Network Configuration
tab in the Software iSCSI Initiator Properties window. This is where the VMkernel ports used for binding to the iSCSI adapter are
added.



After selecting the VMkernel adapters for use with the software iSCSI adapter, the Port Group Policy tab will tell you whether or not
these adapters are compliant for binding. If you have more than one active uplink on a vSwitch that has multiple vmnic uplinks, the
vmk interfaces will not show up as compliant. Only one uplink should be active. All other uplinks should be placed into an unused
state.



Interoperability Considerations

Storage I/O Control


Storage I/O Control (SIOC) prevents a single virtual machine residing on one vSphere host from consuming more than its share of
bandwidth on a datastore that it shares with other virtual machines residing on other vSphere hosts.
Historically, the disk shares feature could be set up on a per-vSphere-host basis. This works well for all virtual machines residing on the
same vSphere host sharing the same datastore built on a local disk. However, this cannot be used as a fairness mechanism for virtual
machines from different vSphere hosts sharing the same datastore.
This is what SIOC does. SIOC modifies the I/O queues on various vSphere hosts to ensure that virtual machines with a higher priority
get more queue entries than those virtual machines with a lower priority, enabling these higher-priority virtual machines to send
more I/O than their lower-priority counterparts.
SIOC is a congestion-driven feature. When latency remains below a specific latency value, SIOC is dormant. It is triggered only when
the latency value on the datastore rises above a predefined threshold.
SIOC is recommended if you have a group of virtual machines sharing the same datastore spread across multiple vSphere hosts and
you want to prevent the impact of a single virtual machine's I/O on the I/O (and thus performance) of other virtual machines. With
SIOC you can set shares to reflect the priority of virtual machines, but you can also implement an IOPS limit per virtual machine. This
means that you can limit the impact, in number of IOPS, which a single virtual machine can have on a shared datastore.
SIOC is available in the VMware vSphere Enterprise Plus Edition.
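Shares and per-disk IOPS limits are applied on the virtual disk itself. The following pyVmomi sketch is illustrative only and assumes vm is an existing vim.VirtualMachine; the 1,000-IOPS limit and share values are arbitrary examples.

from pyVmomi import vim

# Take the VM's first virtual disk (for illustration).
disk = next(d for d in vm.config.hardware.device
            if isinstance(d, vim.vm.device.VirtualDisk))

alloc = vim.StorageResourceManager.IOAllocationInfo()
alloc.shares = vim.SharesInfo(level=vim.SharesInfo.Level.high, shares=2000)
alloc.limit = 1000                      # cap this disk at 1,000 IOPS

disk.storageIOAllocation = alloc
change = vim.vm.device.VirtualDeviceSpec(
    operation=vim.vm.device.VirtualDeviceSpec.Operation.edit, device=disk)

task = vm.ReconfigVM_Task(vim.vm.ConfigSpec(deviceChange=[change]))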
Network I/O Control
The Network I/O Control (NIOC) feature ensures that when the same network interface cards are used for multiple traffic types,
other traffic types on the same network interface cards do not impact iSCSI traffic. It works by setting priority and bandwidth using
priority tags in TCP/IP packets. With 10GbE networks, this feature can be very useful, because there is one pipe that is shared with
multiple other traffic types. With 1GbE networks, you have probably dedicated the pipe solely to iSCSI traffic. Note that NIOC is congestion driven: it takes effect only when there are different traffic types competing for bandwidth and the performance of
one traffic type is likely to be impacted.
Whereas SIOC assists in dealing with the noisy-neighbor problem from a datastore-sharing perspective, NIOC assists in dealing with

the noisy-neighbor problem from a network perspective.


Using NIOC, one can also set the priority levels of different virtual machine traffic. If certain virtual machine traffic is important to
you, these virtual machines can be grouped into one virtual machine port group and lower-priority virtual machines can be placed
into another virtual machine port group. NIOC can now be used to prioritize virtual machine traffic and ensure that the high-priority
virtual machines get more bandwidth when there is competition for bandwidth on the pipe.
SIOC and NIOC can coexist and in fact complement each other.
NIOC is available in the vSphere Enterprise Plus Edition.

vSphere Storage DRS
VMware vSphere Storage DRS, introduced with vSphere 5.0, fully supports VMFS datastores on iSCSI. When you enable vSphere
Storage DRS on a datastore cluster (group of datastores), it automatically configures balancing based on space usage.
The threshold is set to 80 percent but can be modified. This means that if 80 percent or more of the space on a particular datastore
is utilized, vSphere Storage DRS will try to move virtual machines to other datastores in the datastore cluster using VMware vSphere
Storage vMotion to bring this usage value back down below 80 percent.
If the cluster is set to the automatic mode of operation, vSphere Storage DRS uses vSphere Storage vMotion to automatically
migrate virtual machines to other datastores in the datastore cluster if the threshold is exceeded.
If the cluster is set to manual, the administrator is given a set of recommendations to apply. vSphere Storage DRS will provide the
best recommendations to balance the space usage of the datastores. After you apply the recommendations, vSphere Storage
vMotion, as seen before, moves one or more virtual machines between datastores in the same datastore cluster.
Another feature of vSphere Storage DRS is that it can balance virtual machines across datastores in the datastore cluster based on
I/O metrics, and specifically latency.
vSphere Storage DRS uses SIOC to evaluate datastore capabilities and capture latency information regarding all the datastores in the
datastore cluster. As mentioned earlier, the purpose of SIOC is to ensure that no single virtual machine uses all the bandwidth of a
particular datastore. It achieves this by modifying the queue depth for the datastores on each vSphere host.
In vSphere Storage DRS, its implementation is different. SIOC, on behalf of vSphere Storage DRS, checks the capabilities of the
datastores in a datastore cluster by injecting various I/O loads. After this information is normalized, vSphere Storage DRS can
determine the types of workloads that a datastore can handle. This information is used in initial placement and load-balancing
decisions.
vSphere Storage DRS continuously uses SIOC to monitor how long it takes an I/O to do a round trip. This is the latency. This
information about the datastore is passed back to vSphere Storage DRS. If the latency value for a particular datastore is above the
threshold value (the default is 15 milliseconds) for a significant percentage of time over an observation period (the default is 16
hours), vSphere Storage DRS tries to rebalance the virtual machines across the datastores in the datastore cluster so that the latency
value returns below the threshold. This might involve one or more vSphere Storage vMotion operations. In fact, even if vSphere
Storage DRS is unable to bring the latency below the defined threshold value, it might still move virtual machines between
datastores to balance the latency.
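The latency trigger can be sketched as a simple threshold test over SIOC's observed latencies. This is illustrative only; the 15-millisecond threshold and 16-hour observation period are the documented defaults, but the "significant percentage" fraction used here is an assumption.

DEFAULT_THRESHOLD_MS = 15
SIGNIFICANT_FRACTION = 0.5              # assumed for illustration, not a documented value

def needs_rebalance(latency_samples_ms, threshold_ms=DEFAULT_THRESHOLD_MS,
                    significant_fraction=SIGNIFICANT_FRACTION):
    # latency_samples_ms: SIOC-observed datastore latencies over the observation period.
    over = sum(1 for s in latency_samples_ms if s > threshold_ms)
    return over / len(latency_samples_ms) >= significant_fraction

print(needs_rebalance([12, 14, 13, 16, 11, 12]))   # False - mostly below the threshold
print(needs_rebalance([22, 25, 14, 30, 19, 27]))   # True  - sustained latency, rebalance recommended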
When evaluating vSphere Storage DRS, VMware makes the same best practice recommendation it made when vSphere Storage DRS was first introduced: run vSphere Storage DRS in manual mode first and monitor the recommendations that it surfaces, ensuring that they make sense. After a period of time, if the recommendations make sense and you have built a comfort level using vSphere Storage DRS, consider switching it to automated mode.
There are a number of considerations when using vSphere Storage DRS with certain array features. Check your storage vendor's recommendations for using vSphere Storage DRS. There might be specific interactions with some advanced features on the array that
you want to be aware of. VMware has already produced a very detailed white paper regarding the use of vSphere Storage DRS with
array features such as tiered storage, thin provisioning and deduplication. More details regarding vSphere Storage DRS
interoperability with storage-array features can be found in the VMware vSphere Storage DRS Interoperability white paper.

vSphere Storage APIs Array Integration


This API enables the vSphere host to offload certain storage operations to the storage array rather than consuming resources on the
vSphere host by doing the same operations.
For block storage arrays, no additional VIBs need to be installed on the vSphere host. All software necessary to use vSphere Storage
APIs Array Integration is preinstalled on the hosts.
The first primitive to discuss is Extended Copy (XCOPY), which enables the vSphere host to offload a clone operation or template
deployments to the storage array.
NOTE: This primitive also supports vSphere Storage vMotion.
The next primitive is called Write Same. When creating VMDKs on block datastores, one of the options is to create an Eager Zeroed
Thick (EZT) VMDK, which means zeroes get written to all blocks that make up that VMDK. With the Write Same primitive, the act of
writing zeroes is offloaded to the array. This means that we don't have to send lots of zeroes across the wire, which speeds up the
process. In fact, for some arrays this is simply a metadata update, which means a very fast zeroing operation.
Our final primitive is Atomic Test & Set (ATS). ATS is a block primitive that replaces SCSI reservations when metadata updates are
done on VMFS volumes.
Thin provisioning (TP) primitives were introduced in vSphere 5.0, with features such as the raising of an alarm when a TP volume reaches 75 percent of capacity at the back end, TP-Stun and, of course, the UNMAP primitive.
vSphere Storage DRS leverages the 75-percent-of-capacity event. After the alarm is triggered, vSphere Storage DRS no longer considers those datastores as destinations for initial placement or ongoing load balancing of virtual machines.
The vSphere Storage APIs Array Integration primitive TP-Stun was introduced to detect out-of-space conditions on SCSI LUNs. If a
datastore reaches full capacity and has no additional free space, any virtual machines that require additional space will be stunned.
Virtual machines that do not require additional space continue to work normally. After the additional space has been added to the
datastore, the suspended virtual machines can be resumed.
Finally, the UNMAP primitive is used as a way to reclaim dead space on a VMFS datastore built on thin-provisioned LUNs. A detailed
explanation of vSphere Storage APIs Array Integration can be found in the white paper, VMware vSphere Storage APIs Array
Integration (VAAI).
NOTE: At the time of this writing, there was no support from vSphere Storage APIs Array Integration for storage appliances.
Support from vSphere Storage APIs Array Integration is available only on physical storage arrays.


vSphere Storage vMotion
The only other considerations with vSphere Storage vMotion are relevant to both block operations and NAS. This is the configuration
maximum. At the time of this writing, the maximum number of concurrent vSphere Storage vMotion operations per vSphere host is
two and the maximum number of vSphere Storage vMotion operations per datastore is eight. This is to prevent any single datastore
from being unnecessarily impacted.
vSphere Storage vMotion operations
vSphere Storage vMotion has gone through quite a few architectural changes over the years. The latest version in vSphere 5.x uses a
mirror driver to split writes to the source and destination datastores after a migration is initiated. This means speedier migrations
because there is only a single copy operation now required, unlike the recursive copy process used in previous versions that
leveraged Change Block Tracking (CBT).
One consideration that has been called out already is that vSphere Storage vMotion operations of virtual machines between
datastores cannot be offloaded to the array without support from vSphere Storage APIs Array Integration. In those cases, the
software data mover does all vSphere Storage vMotion operations.


NOTE: A new enhancement in vSphere 5.1 enables up to four VMDKs belonging to the same virtual machine to be migrated in
parallel, as long as the VMDKs reside on unique datastores.
Sizing Considerations
Recommended Volume Size
When creating this paper, we asked a number of our storage partners if there was a volume size that worked well for iSCSI. All
partners said that there was no performance gain or degradation depending on the volume size and that customers might build
iSCSI volumes of any size, so long as it was below the array vendor's supported maximum. The datastore sizes vary greatly from
customer to customer.
Dell recommends starting with a datastore that is between 500GB and 750GB for its Compellent range of arrays. However,
because VMFS datastores can be easily extended on the fly, its general recommendation is to start with smaller and more
manageable datastore sizes initially and expand them as needed. This seems like good advice.
Sizing of volumes is typically proportional to the number of virtual machines you attempt to deploy, in addition to
snapshots/changed blocks created for backup purposes. Another consideration is that many arrays now have deduplication and
compression features, which will also reduce capacity requirements. A final consideration is Recovery Point Objective (RPO) and
Recovery Time Objective (RTO). These determine how fast you can restore your datastore with your current backup platform.
Recommended Block Size
This parameter is not tunable, for the most part. Some vendors have it hard set to 4KB and others have it hard set to 8KB. Block sizes
are typically a multiple of 4KB. These align nicely with the 4KB grain size used in the VMDK format of VMware. For those vendors
who have it set to 8KB, the recommendation is to format the volumes in the guest operating system (OS) to a matching 8KB block
size for optimal performance. In this area, it is best to speak to your storage-array vendor to get vendor-specific advice.
Maximum Number of Virtual Machines per Datastore
The number of virtual machines that can run on a single datastore is directly proportional to the infrastructure and the workloads
running in the virtual machines. For example, one might be able to run many hundreds of low-I/O virtual machines but only a few
very intensive I/O virtual machines on the same datastore. Network congestion is an important factor. Users might consider using
the Round Robin path policy on all storage devices to achieve optimal performance and load balancing. In fact, since vSphere 5.1
EMC now has the Round Robin path policy associated with its SATP (Storage Array Type Plug-in) in the VMkernel, so that when an
EMC storage device is discovered, it will automatically use Round Robin.
The other major factor is related to the backup and recovery Service-Level Agreement (SLA). If you have one datastore with many
virtual machines, there is a question of how long you are willing to wait while service is restored in the event of a failure. This is
becoming the major topic in the debate over how many virtual machines per datastore is optimal.
The snapshot technology used by the backup product is an important question, specifically whether it uses array-based snapshots
or virtual machine snapshots. Performance is an important consideration if virtual machine snapshots are used to concurrently
capture point-in-time copies of virtual machines. In many cases, array-based snapshots have less impact on the datastores and are
more scalable when it comes to backups. There might be some array-based limitations to look at also. For instance, the number of
snapshot copies of a virtual machine that a customer wants to maintain might exceed the number of snapshot copies an array can
support. This varies from vendor to vendor. Check this configuration maximum with your storage-array vendor.
KB article 1015180 includes further details regarding snapshots and their usage. As shown in KB article 1025279, virtual machines
can support up to 32 snapshots in a chain, but VMware recommends that you use only two or three snapshots in a chain and also
that you use no single snapshot for more than 24 to 72 hours.
Booting a vSphere Host from Software iSCSI
VMware introduced support for iSCSI with ESX 3.x. However, ESX could boot only from an iSCSI LUN if a hardware iSCSI adapter was
used. Hosts could not boot via the software iSCSI initiator of VMware. In vSphere 4.1, VMware introduced support making it possible
to boot the host from an iSCSI LUN via the software iSCSI adapter.

NOTE: Support was introduced for VMware ESXi only, and not classic ESX.
Not all of our storage partners support iSCSI Boot Firmware Table (iBFT) boot from SAN. Refer to the partner's own documentation
for clarification.
Why Boot from SAN?
It quickly became clear that there was a need to boot via software iSCSI. Partners of VMware were developing blade chassis
containing blade servers, storage and network interconnects in a single rack. The blades were typically diskless, with no local
storage. The requirement was to have the blade servers boot off of an iSCSI LUN using network interface cards with iSCSI
capabilities, rather than using dedicated hardware iSCSI initiators.
Compatible Network Interface Card
Much of the configuration for booting via software iSCSI is done via the BIOS settings of the network interface cards and the host.
Check the VMware Hardware Compatibility List (HCL) to ensure that the network interface card is compatible. This is important, but
a word of caution is necessary. If you select a particular network interface card and you see iSCSI as a feature, you might assume that
you can use it to boot a vSphere host from an iSCSI LUN. This is not the case.
To see if a particular network interface card is supported for iSCSI boot, set the I/O device type to Network (not iSCSI) in the HCL and
then check the footnotes. If the footnotes state that iBFT is supported, then this card can be used for boot from iSCSI.
Advanced Settings
There are a number of tunable parameters available when using iSCSI datastores. Before drilling into these advanced settings in
more detail, you should understand that the recommended values for some of these settings might (and probably will) vary from
storage-array vendor to storage-array vendor.
LoginTimeout
When iSCSI establishes a session between initiator and target, it must log in to the target. It will try to log in for a period of
LoginTimeout seconds. If that is exceeded, the login fails.
LogoutTimeout
When iSCSI finishes a session between initiator and target, it must log out of the target. It will try to log out for a period of
LogoutTimeout seconds. If that is exceeded, the logout fails.
RecoveryTimeout
The other options relate to how a dead path is determined. RecoveryTimeout is used to determine how long we should wait, in
seconds, after PDUs are no longer being sent or received before placing a once-active path into a dead state. Realistically it's a bit
longer than that, because other considerations are taken into account as well.
NoopInterval and NoopTimeout
The noop settings are used to determine if a path is dead when it is not the active path. iSCSI will passively discover if this path is
dead by using the noop timeout. This test is carried out on nonactive paths every NoopInterval seconds. If a response isn't received
by NoopTimeout, measured in seconds, the path is marked as dead.
Unless faster failover times are desirable, it is not required to change these parameters from their default settings. Use caution
when modifying these parameters, because if paths fail too quickly and then recover, you might have LUNs/devices moving
ownership unnecessarily between targets, and that can lead to path thrashing.
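Where your storage vendor does recommend different values, the per-adapter iSCSI parameters can be inspected and changed from the ESXi 5.x command line as well as from the vSphere Client. This is a sketch only: the vmhba name and the value are placeholders, the key names reported by the get command may differ slightly from the friendly names above (for example, the noop settings appear as NoopOut parameters), and not every key is settable in every release:
~# esxcli iscsi adapter param get -A vmhba37
~# esxcli iscsi adapter param set -A vmhba37 -k RecoveryTimeout -v 25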
QFullSampleSize and QFullThreshold
Some of our storage partners require the use of the parameters QFullSampleSize and QFullThreshold to enable the adaptive queue-
depth algorithm of VMware. With the algorithm enabled, no additional I/O throttling is required on the vSphere hosts. Refer to your
storage-array vendor's documentation to see if this is applicable to your storage.

Disk.DiskMaxIOSize
To improve the performance of virtual machines that generate large I/O sizes, administrators can consider setting the advanced
parameter Disk.DiskMaxIOSize. Some of our partners suggest setting this to 128KB to enhance storage performance. However, it
would be best to understand the I/O size that the virtual machine is generating before setting this parameter. A different size might
be more suitable to your application.
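As an illustration of how this would be applied on an ESXi 5.x host, the commands below first display and then change the setting. The value is specified in KB, so the 128KB suggestion above corresponds to a value of 128; treat it as an example only and validate the figure against your application's actual I/O sizes:
~# esxcli system settings advanced list -o /Disk/DiskMaxIOSize
~# esxcli system settings advanced set -o /Disk/DiskMaxIOSize -i 128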
DelayedAck
A host receiving a stream of TCP data segments, as in the case of iSCSI, can increase efficiency in both the network and the hosts by
sending fewer than one acknowledgment (ack) segment per data segment received. This is known as a delayed ack. The common
practice is to send an ack for every other full-sized data segment and not to delay the ack for a segment by more than a specified
threshold. This threshold varies between 100 and 500 milliseconds. vSphere hosts, as do most other servers, use a delayed ack
because of its benefits.
Some arrays, however, take the very conservative approach of retransmitting only one lost data segment at a time and waiting for
the host's ack before retransmitting the next one. This approach slows read performance to a halt in a congested network and might
require the delayed ack feature to be disabled on the vSphere host. More details can be found in KB article 1002598.
Additional Considerations
Disk Alignment
This is not a recommendation specific to iSCSI, because misaligned partitions can have an adverse effect on the performance of all block
storage. Nevertheless, to account for every contingency, it should be considered a best practice to have the partitions of the guest OS
running in the virtual machine aligned to the storage.
Microsoft Clustering Support
With the release of vSphere 5.1, VMware supports as many as five nodes in a Microsoft Cluster. However, at the time of this writing,
VMware does not support the cluster quorum disk over the iSCSI protocol.
In-Guest iSCSI Support
A number of in-guest iSCSI software solutions are available. The iSCSI driver of Microsoft is one commonly seen running in a virtual
machine when the guest OS is a version of Microsoft Windows. The support statement for this driver can be found in KB article
1010547, which states that if you encounter connectivity issues using a third-party software iSCSI initiator to the third-party storage
device, engage the third-party vendors for assistance. If the third-party vendors determine that the issue is due to a lack of network
connectivity to the virtual machine, contact VMware for troubleshooting assistance.

All Paths Down and Permanent Device Loss
All Paths Down (APD) can occur on a vSphere host when a storage device is removed in an uncontrolled manner or if the device fails
and the VMkernel core storage stack cannot detect how long the loss of device access will last. One possible scenario for an APD
condition is an FC switch failure that brings down all the storage paths, or, in the case of an iSCSI array, a network connectivity issue
that similarly brings down all the storage paths.
A new condition known as Permanent Device Loss (PDL) was introduced in vSphere 5.0. The PDL condition enabled the vSphere host
to take specific actions when it detected that the device loss was permanent. The vSphere host can be informed of a PDL situation
by specific SCSI sense codes sent by the target array.
In vSphere 5.1, VMware introduced a PDL detection method for those iSCSI arrays that present only one LUN for each target. These
arrays were problematic, because after LUN access was lost, the target also was lost. Therefore, the vSphere host had no way of
obtaining any SCSI sense codes.
vSphere 5.1 extends PDL detection to those arrays that have only a single LUN per target. With vSphere 5.1, for those iSCSI arrays
that have a single LUN per target, an attempt is made to log in again to the target after a dropped session. If there is a PDL condition,
the storage system rejects the effort to access the device. Depending on how the array rejects the efforts to access the LUN, the
vSphere host can determine whether the device has been lost permanently (PDL) or is temporarily unreachable.
Round Robin Path Policy Setting IOPS=1
A number of our partners have documented that if using the Round Robin path policy, best results can be achieved with an IOPS=1
setting. This might well be true in very small environments where there are a small number of virtual machines and a small number
of datastores. However, as the environment scales to a greater number of virtual machines and a greater number of
datastores, VMware considers the default settings associated with the Round Robin path policy to be sufficient. Consult your
storage-array vendor for advice on this setting.
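For reference, both the path-selection policy and the Round Robin IOPS value are set per device from the command line. This is a sketch only: the device identifier is a placeholder, and the IOPS=1 value should be applied only where your array vendor explicitly recommends it:
~# esxcli storage nmp device set -d naa.60060160xxxxxxxxxxxxxxxx -P VMW_PSP_RR
~# esxcli storage nmp psp roundrobin deviceconfig set -d naa.60060160xxxxxxxxxxxxxxxx -t iops -I 1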
Data Center Bridging (DCB) Support
Our storage partner Dell now supports iSCSI over DCB under the PVSP (Partner Verified and Supported Products) program of
VMware. This is for the Dell EqualLogic (EQL) array only with certain Converged Network Adapters (CNAs) and only on vSphere
version 5.1. See KB article 2044431 for further details.


7. Best Practices for running VMware vSphere on Network Attached Storage

Background
VMware introduced the support of IP based storage in release 3 of the ESX server. Prior to that release, the only option for shared
storage pools was Fibre Channel (FC). With VI3, both iSCSI and NFS storage were introduced as storage resources that could be
shared across a cluster of ESX servers.
The addition of new choices has led to a number of people asking, "What is the best storage protocol choice on which to deploy a
virtualization project?" The answer to that question has been the subject of much debate, and there seems to be no single
correct answer.
The considerations for this choice tend to hinge on the issues of cost, performance, availability, and ease of manageability. However,
an additional factor should also be the legacy environment and the storage administrator's familiarity with one protocol vs. the other,
based on what is already installed.
The bottom line is, rather than ask which storage protocol to deploy virtualization on, the questions should be, "Which
virtualization solution enables one to leverage multiple storage protocols for their virtualization environment?" and "Which will
give them the best ability to move virtual machines from one storage pool to another, regardless of what storage protocol it uses,
without downtime or application disruption?" Once those questions are considered, the clear answer is VMware vSphere.
However, to investigate the options a bit further, performance of FC is perceived as being a bit more industrial strength than IP
based storage. However, for most virtualization environments, NFS and iSCSI provide suitable I/O performance. The comparison has
been the subject of many papers and projects. One posted on VMTN is located at:
http://www.vmware.com/files/pdf/storage_protocol_perf.pdf.

The general conclusion reached by the above paper is that for most workloads, the performance is similar with a slight increase in
ESX Server CPU overhead per transaction for NFS and a bit more for software iSCSI. For most virtualization environments, the end
user might not even be able to detect the performance delta from one virtual machine running on IP based storage vs. another on
FC storage.
The more important consideration that often leads people to choose NFS storage for their virtualization environment is the ease of
provisioning and maintaining NFS shared storage pools. NFS storage is often less costly than FC storage to set up and maintain. For
this reason, NFS tends to be the choice taken by small to medium businesses that are deploying virtualization, as well as the choice
for deployment of virtual desktop infrastructures. This paper will investigate the trade offs and considerations in more detail.
Overview of the Steps to Provision NFS Datastores

Before NFS storage can be addressed by an ESX server, the following requirements need to be addressed:

Have a virtual switch configured for IP based storage.
The ESX host needs to be configured to enable its NFS client.
The NFS storage server needs to have been configured to export a mount point that is accessible to the ESX server on a
trusted network.


For more details on NFS storage options and setup, consult the best practices for VMware provided by the storage vendor.
EMC with VMware vSphere 4 Applied Best Practices
NetApp and VMware vSphere Storage Best Practices
Regarding item one above, to configure the vSwitch for IP storage access you will need to create a new vSwitch under the ESX server
configuration, networking tab in vCenter. Indicating that it is a VMkernel type connection will automatically add a VMkernel port to the
vSwitch. You will need to populate the network access information.
Regarding item two above, to configure the ESX host for running its NFS client, you'll need to open a firewall port for the NFS client.
To do this, select the configuration tab for the ESX server in VirtualCenter and click on Security Profile (listed under software
options) and then check the box for NFS Client listed under the remote access choices in the Firewall Properties screen.
With these items addressed, an NFS datastore can now be added to the ESX server following the same process used to configure a
datastore for block based (FC or iSCSI) datastores.

On the ESX Server configuration tab in VMware VirtualCenter, select storage (listed under hardware options) and then click
the add button.
On the screen for select storage type, select Network File System and in the next screen enter the IP address of the NFS
server, mount point for the specific destination on that server and the desired name for that new datastore.
If everything is completed correctly, the new NFS datastore will show up in the refreshed list of datastores available for that
ESX server.

The main differences in provisioning an NFS datastores compared to block based storage datastores are:

For NFS there are fewer screens to navigate through but more data entry required than block based storage.
The NFS device needs to be specified via an IP address and folder (mount point) on that filer, rather than a pick list of
options to choose from.
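On ESXi 5.x hosts the same provisioning can also be scripted from the command line, which is useful when many hosts must mount the same export identically. This is a sketch only; the server name, export path and datastore name are placeholders, and the firewall ruleset name can be confirmed first with esxcli network firewall ruleset list:
~# esxcli network firewall ruleset set -r nfsClient -e true
~# esxcli storage nfs add -H nfs01.example.com -s /vols/datastore01 -v NFS-DS01
~# esxcli storage nfs list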

Issues to Consider for High Availability


To achieve high availability, the LAN on which the NFS traffic will run needs to be designed with availability, downtime avoidance,
isolation, and no single point of failure in mind.
Multiple administrators need to be involved in designing for high availability: both the virtual administrator and the network
administrator. If done correctly, the failover capabilities of an IP based storage network can be as robust as those of an FC storage
network.
Terminology
First, it is important to define a few terms that often cause confusion in the discussion of IP based storage networking. Some
common terms and their definitions are as follows:
NIC/adapter/port/link - End points of a network connection.
Teamed/trunked/bonded/bundled ports - Pairing of two connections that are treated as one connection by a network switch or
server. The result of this pairing is also referred to as an ether-channel. This pairing of connections is defined as Link Aggregation
in the 802.3 networking specification.
Cross stack ether channel - A pairing of ports that can span across two physical LAN switches managed as one logical switch. This is
only an option with a limited number of switches that are available today.
IP hash - Method of switching to an alternate path based on a hash of the IP address of both end points for multiple connections.
Virtual IP (VIF) - An interface used by the NAS device to present the same IP address out of two ports from that single array.

Avoiding single points of failure at the NIC, switch and filer levels
The first level of High Availability (HA) is to avoid a single point of failure being a NIC card in an ESX server, or the cable between the
NIC card and the switch. This is achieved by having two NICs connected to the same LAN switch, configured as a team at the
switch, and having IP hash failover enabled at the ESX server.



The second level of HA is to avoid a single point of failure being a loss of the switch to which the ESX connects. With this solution,
one has four potential NIC cards in the ESX server configured with IP hash failover and two pairs going to separate LAN switches
with each pair configured as teamed at the respective LAN switches.


The third level of HA protects against loss of a filer (or NAS head) becoming unavailable. With storage vendors that provide clustered
NAS heads that can take over for another in the event of a failure, one can configure the LAN such that downtime can be avoided in
the event of losing a single filer, or NAS head.
An even higher level of performance and HA can build on the previous HA level with the addition of Cross Stack Ether-channel
capable switches. With certain network switches, it is possible to team ports across two separate physical switches that are
managed as one logical switch. This provides additional resilience as well as some performance optimization, in that one can get HA
with fewer NICs, or have more paths available across which one can distribute load sharing.



Caveat: NIC teaming provides failover but not load-balanced performance (in the common case of a single NAS datastore)
It is also important to understand that there is only one active pipe for the connection between the ESX server and a single storage
target (LUN or mount point). This means that although there may be alternate connections available for failover, the bandwidth for a
single datastore and the underlying storage is limited to what a single connection can provide. To leverage more of the available
bandwidth, an ESX server needs multiple connections from server to storage targets. One would need to configure multiple datastores,
with each datastore using separate connections between the server and the storage. This is where one often runs into the
distinction between load balancing and load sharing. The configuration of traffic spread across two or more datastores configured
on separate connections between the ESX server and the storage array is load sharing.
Security Considerations
The VMware vSphere implementation of NFS supports NFS version 3 over TCP. There is currently no support for NFS version 2, UDP, or
CIFS/SMB. Kerberos is also not supported in ESX Server 4, and as such traffic is not encrypted. Storage traffic is transmitted as
clear text across the LAN. Therefore, it is considered best practice to use NFS storage on trusted networks only, and to isolate the
traffic on separate physical switches or leverage a private VLAN.
Another security concern is that the ESX server must mount the NFS server with root access. This raises some concerns about
hackers gaining access to the NFS server. To address the concern, it is best practice to use either a dedicated LAN or a VLAN to
provide protection and isolation.
Additional Attributes of NFS Storage
There are several additional options to consider when using NFS as a shared storage pool for virtualization. Some additional
considerations are thin provisioning, de-duplication, and the ease-of-backup-and-restore of virtual machines, virtual disks, and even
files on a virtual disk via array based snapshots.
Thin Provisioning
Virtual disks (VMDKs) created on NFS datastores are in thin provisioned format by default. This capability offers better disk
utilization of the underlying storage capacity in that it removes what is often considered wasted disk space. For the purpose of this
paper, VMware will define wasted disk space as allocated but not used. The thin-provisioning technology removes a significant
amount of wasted disk space.
On NFS datastores, the default virtual disk format is thin. As such, less storage space is allocated on the volume than would be needed
for the same set of virtual disks provisioned in thick format.
De-duplication
Some NAS storage vendors offer data de-duplication features that can greatly reduce the amount of storage space required. It is
important to distinguish between in-place de-duplication and de-duplication for backup streams. Both offer significant savings in
space requirements, but in-place de-duplication seems to be far more significant for virtualization environments. Some customers
have been able to reduce their storage needs by up to 75 percent of their previous storage footprint with the use of in place de-
duplication technology.

Summary of Best Practices
Networking Settings
To isolate storage traffic from other networking traffic, it is considered best practice to use either dedicated switches or VLANs for
your NFS and iSCSI ESX server traffic. The minimum NIC speed should be 1GbE. In VMware vSphere, use of 10GbE is supported. It is
best to look at the VMware HCL to confirm which models are supported.
It is important to not over-subscribe the network connection between the LAN switch and the storage array. The retransmitting of
dropped packets can further degrade the performance of an already heavily congested network fabric.
Datastore Settings
The default setting for the maximum number of mount points/datastores an ESX server can concurrently mount is eight, although
the limit can be increased to 64 in the existing release. If you increase the maximum number of NFS mounts above the default setting of
eight, make sure to also increase Net.TcpipHeapSize. If 32 mount points are used, increase Net.TcpipHeapSize to 30MB.

TCP/IP Heap Size


The safest way to calculate the tcpip heap size given the number of NFS volumes configured is to linearly scale the default values.
For 8 NFS volumes the default min/max sizes of the tcpip heap are respectively 6MB/30MB. This means the tcpip heap size for a host
configured with 64 NFS volumes should have the min/max tcpip heap sizes set to 48MB/240MB.
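As an illustration, on an ESXi 5.x host these values can be set with the advanced options shown below. The figures are examples only, matching the 32-mount-point guidance above, and it is assumed that the minimum and maximum heap sizes map to Net.TcpipHeapSize and Net.TcpipHeapMax respectively; scale them per the calculation above, note that the maximum accepted values differ between vSphere releases, and remember that a host reboot is required before heap changes take effect:
~# esxcli system settings advanced set -o /NFS/MaxVolumes -i 32
~# esxcli system settings advanced set -o /Net/TcpipHeapSize -i 30
~# esxcli system settings advanced set -o /Net/TcpipHeapMax -i 120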
Filer Settings
In a VMware cluster, it is important to make sure to mount datastores the same way on all ESX servers: same host
(hostname/FQDN/IP), export and datastore name. Also make sure NFS server settings are persistent on the NFS filer (use FilerView,
exportfs -p, or edit /etc/exports).
ESX Server Advanced Settings and Timeout settings
When configuring NIC teaming, it is considered best practice to set the failback option to No. If there is some
intermittent behavior in the network, this will prevent flip-flopping of the NIC cards being used.
When setting up VMware HA, it is a good starting point to also set the following ESX server timeouts and settings under the ESX
server advanced setting tab.
NFS.HeartbeatFrequency = 12
NFS.HeartbeatTimeout = 5
NFS.HeartbeatMaxFailures = 10
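These values can also be applied from the ESXi command line rather than through the advanced settings dialog. This is a sketch only, assuming the standard /NFS advanced option paths; keep the values identical on every host in the cluster:
~# esxcli system settings advanced set -o /NFS/HeartbeatFrequency -i 12
~# esxcli system settings advanced set -o /NFS/HeartbeatTimeout -i 5
~# esxcli system settings advanced set -o /NFS/HeartbeatMaxFailures -i 10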


NFS Heartbeats
NFS heartbeats are used to determine if an NFS volume is still available. NFS heartbeats are actually GETATTR requests on the root
file handle of the NFS volume. There is a system world that runs every NFS.HeartbeatFrequency seconds to check if it needs to issue
heartbeat requests for any of the NFS volumes. If a volume is marked available, a heartbeat will only be issued if it has been more than
NFS.HeartbeatDelta seconds since a successful GETATTR (not necessarily a heartbeat GETATTR) for that volume was issued. The NFS
heartbeat world will always issue heartbeats for NFS volumes that are marked unavailable. Here is the formula to calculate how long
it can take ESX to mark an NFS volume as unavailable:
RoundUp(NFS.HeartbeatDelta, NFS.HeartbeatFrequency) + (NFS.HeartbeatFrequency * (NFS.HeartbeatMaxFailures - 1)) +
NFS.HeartbeatTimeout
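As a worked example only, using the recommended values above (HeartbeatFrequency of 12, HeartbeatTimeout of 5 and HeartbeatMaxFailures of 10) and assuming NFS.HeartbeatDelta is left at a value no larger than the heartbeat frequency (so the RoundUp term evaluates to 12), the calculation becomes 12 + (12 * (10 - 1)) + 5 = 12 + 108 + 5, or roughly 125 seconds before ESX marks the volume as unavailable.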
Once a volume is back up it can take NFS.HeartbeatFrequency seconds before ESX marks the volume as available. See Appendix 2 for
more details on these settings.
Previously thought to be Best Practices
Some early adopters of VI3 on NFS created some best practices that are no longer viewed favorably. They are:
Turning off NFS locking within the ESX server
Not placing virtual machine swap space on NFS storage.

Both of these have been debunked, and the following section provides more details as to why they are no longer considered best
practices.


NFS Locking
NFS locking on ESX does not use the NLM protocol. VMware has established its own locking protocol. These NFS locks are
implemented by creating lock files on the NFS server. Lock files are named .lck-<fileid>, where <fileid> is the value of the fileid
field returned from a GETATTR request for the file being locked. Once a lock file is created, VMware periodically (every
NFS.DiskFileLockUpdateFreq seconds) sends updates to the lock file
to let other ESX hosts know that the lock is still active. The lock file updates generate small (84 byte) WRITE requests to the NFS

server. Changing any of the NFS locking parameters will change how long it takes to recover stale locks. The following formula can be
used to calculate how long it takes to recover a stale NFS lock:
(NFS.DiskFileLockUpdateFreq * NFS.LockRenewMaxFailureNumber) + NFS.LockUpdateTimeout
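As a worked example, and assuming the typical default values of 10 seconds for NFS.DiskFileLockUpdateFreq, 3 for NFS.LockRenewMaxFailureNumber and 5 seconds for NFS.LockUpdateTimeout, recovering a stale lock takes approximately (10 * 3) + 5 = 35 seconds; increasing any of these values lengthens that window accordingly.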
If any of these parameters are modified, it's very important that all ESX hosts in the cluster use identical settings. Having inconsistent
NFS lock settings across ESX hosts can result in data corruption!
In vSphere the option to change the NFS.Lockdisable setting has been removed. This was done to remove the temptation to disable
the VMware locking mechanism for NFS. So it is no longer an option to turn it off in vSphere.
Virtual Machine Swap Space Location
Keeping the virtual machine swap space on the NFS datastore is now considered to be the best practice.
NFS Advanced Options

NFS.DiskFileLockUpdateFreq
Time between updates to the NFS lock file on the NFS server. Increasing this value will increase the time it takes to recover
stale NFS locks. (See NFS Locking)
NFS.LockUpdateTimeout
Amount of time VMware waits before aborting a lock update request. (See NFS Locking)
NFS.LockRenewMaxFailureNumber
Number of lock update failures that must occur before VMware marks the lock as stale. (See NFS Locking)
NFS.HeartbeatFrequency
How often the NFS heartbeat world runs to see if any NFS volumes need a heartbeat request. (See NFS Heartbeats)
NFS.HeartbeatTimeout
Amount of time VMware waits before aborting a heartbeat request. (See NFS Heartbeats)
NFS.HeartbeatDelta
Amount of time after a successful GETATTR request before the heartbeat world will issue a heartbeat request for a volume.
If an NFS volume is in an unavailable state, an update will be sent every time the heartbeat world runs
(NFS.HeartbeatFrequency seconds). (See NFS Heartbeats)
NFS.HeartbeatMaxFailures Number of consecutive heartbeat requests that must fail before VMware marks a server as
unavailable. (See NFS Heartbeats)
NFS.MaxVolumes Maximum number of NFS volumes that can be mounted. The TCP/IP heap must be increased to
accommodate the number of NFS volumes configured. (See TCP/IP Heap Size)
NFS.SendBufferSize This is the size of the send buffer for NFS sockets. This value was chosen based on internal
performance testing. Customers should not need to adjust this value.
NFS.ReceiveBufferSize This is the size of the receive buffer for NFS sockets. This value was chosen based on internal
performance testing. Customers should not need to adjust this value.
NFS.VolumeRemountFrequency This determines how often VMware will try to mount an NFS volume that was initially
unmountable. Once a volume is mounted, it never needs to be remounted. The volume may be marked unavailable if
VMware loses connectivity to the NFS server, but it will still remain mounted.

8. VMware vSphere 5.0 Upgrade Best Practices


VMware vSphere 5.0 What's New

Industry's largest virtual machines VMware can support even the largest applications with the introduction of virtual
machines that can grow to as many as 32 vCPUs and can use up to 1TB of memory. This enhancement is 4x bigger than the
previous release. vSphere can now support business-critical applications of any size and dimension.
vSphere High Availability (VMware HA) New architecture ensures the most simplified setup and the best guarantees for
the availability of business-critical applications. Setup of the most widely used VMware HA technology in the industry has
never been easier. VMware HA can now be set up in just minutes.
VMware vSphere Auto Deploy In minutes, you can deploy more vSphere hosts running the ESXi hypervisor architecture
on the fly. After it is running, Auto Deploy simplifies patching by enabling you to do a one-time patch of the source ESXi

image and then push the updated image out to your ESXi hosts, as opposed to the traditional method of having to apply the
same patch to each host individually.
Profile-Driven Storage You can reduce the steps in the selection of storage resources by grouping storage according to a
user-defined policy.
vSphere Storage DRS Automated load balancing now analyzes storage characteristics to determine the best place for a
given virtual machine's data to live when it is created and then used over time.
vSphere Web Client This rich browser-based client provides full virtual machine administration, and now has
multiplatform support and optimized client/server communication, which delivers faster response and a more efficient user
experience that helps take care of business needs faster.
VMware vCenter Appliance (VCSA) This preinstalled VMware vCenter Server virtual appliance simplifies the
deployment and configuration of vCenter Server, slipstreams future upgrades and patching, and reduces the time and cost
associated with managing vCenter Server. (Upgrading to the VMware vCenter Appliance from the installable vCenter Server
is not supported.)
Licensing Reporting Manager With the new vSphere vRAM licensing introduced with vSphere 5.0, vCenter Server is
enabled to show not only installed licenses but the vRAM license memory pooling and its real-time utilization. This allows
administrators to see the benefits of vRAM pooling and how to size as the business grows.



Upgrading to VMware vCenter Server 5.0
The first step in any vSphere migration project should always be the upgrade of vCenter Server. Your vCenter Server must be running
at version 5.0 in order to manage an ESXi 5.0 host.
Upgrading vCenter Server 5.0 involves upgrading the vCenter Server machine, its accompanying database, and any configured plug-
ins, including VMware vSphere Update Manager and VMware vCenter Orchestrator.
As of vSphere 4.1, vCenter Server requires a 64-bit server running a 64-bit operating system (OS). If you are currently running
vCenter Server on a 32-bit OS, you must migrate to the 64-bit architecture first. With the 64-bit vCenter Server, you also must use a
64-bit database source name (DSN) for the vCenter database.
Planning the Upgrade
It is recommended that you create an inventory of the current components and that you validate compatibility with the
requirements of vCenter Server 5.0.
Requirements
These are supported minimums. Scaling and sizing of vCenter Server and components should be based on the size of the current
virtual environment and anticipated growth.

Processor: Two CPUs 2.0GHz or higher Intel or AMD x86 processors, with processor requirements higher if the database
runs on the same machine
Memory: 4GB RAM, with RAM requirements higher if your database runs on the same machine
Disk storage: 4GB, with disk requirements higher if your database runs on the same machine
Networking: 1Gb recommended
OS: 64-bit
Supported database platform

Upgrade Process
The following diagram depicts possible upgrade scenarios



NOTE: With the release of vSphere 5.0, vCenter Server is also offered as a Linux-based appliance, referred to as the vCenter Server
Appliance (VCSA), which can be deployed in minutes. Due to the architectural differences between the installable vCenter and the
new VCSA, there is no migration path or database conversion tool to migrate to the VCSA. You must deploy a new VCSA and attach
all the infrastructure components before recreating and attaching inventory objects.
We will explore the three most common scenarios:

vCenter 4.0 and Update Manager 4.0, and a 32-bit OS with a local database
vCenter 4.1 and Update Manager 4.1, a 64-bit OS with a local database, and the requirement to migrate to a remote
database
vCenter 4.1, a 64-bit OS with a remote database, and a separate Update Manager server


Backing Up Your vCenter Configuration

Before starting the upgrade procedure, it is recommended to back up your current vCenter Server to ensure that you can restore to
the previous configuration in the case of an unsuccessful upgrade. It is important to realize that there are multiple objects that must
be backed up to provide the ability to roll back:

SSL certificates
vpxd.cfg
Database


Depending on the type of platform used to host your vCenter Server, it might be possible to simply create a clone or snapshot of
your vCenter Server and database to allow for a simple and effective rollback scenario.
In most cases, however, it is recommended that you back up each of the aforementioned items separately to allow for a more
granular recovery when required, following the database software vendor's best practices and documentation.
The vCenter configuration file vpxd.cfg and the SSL certificates can be simply backed up by copying them to a different location. It is
recommended that you copy them to a location external to the vCenter Server. The SSL certificates are located in a folder named
SSL under the following folders; vpxd.cfg can be found in the root of these folders:
Windows 2003: %ALLUSERSPROFILE%\Application Data\VMware\VMware VirtualCenter\
Windows 2008: %systemdrive%\ProgramData\VMware\VMware VirtualCenter\
It is important to also document any changes made to the vCenter configuration and to your database configuration settings, such as
the database DSN, user name and password. Before any upgrade is undertaken, it is recommended that you back up your database
and vCenter Server.

Host Agents
It is recommended that you validate that the current configuration meets the vCenter Server requirements. This can be done
manually or by using the Agent Pre-Upgrade Checker, which is provided with the vCenter Server installation media.
The Agent Pre-Upgrade Checker will investigate each of the ESX/ESXi hosts in the environment, and will report whether or not the agent
on the host can be updated.
Upgrading a 32-Bit vCenter 4.0 OS with a Local Database
This scenario will describe an upgrade of vCenter Server 4.0 with a local database running on a 32-bit version of a Microsoft
Windows 2003 OS. As vCenter 5.0 is a 64-bit platform, an in-place upgrade is not possible. A VMware Data Migration Tool
included with the vCenter Server media can be utilized to migrate data and settings from the old 32-bit OS to the new 64-bit OS.
The Data Migration Tool should be unzipped in both the source and destination vCenter Server.
Backup Configuration Using the Data Migration Tool
Stop the following services on the source vCenter Server:

VMware vSphere Update Manager service


vCenter Management Web Services
vCenter Server service


Open a Command Prompt and go to the location from which datamigration.zip was extracted.
Type backup.bat.
Decide whether the host patches should be backed up or not. We recommend excluding the ESX patches to minimize the amount of
data stored, and downloading new patches afterward.
Installing vCenter Using Data Provided by the Data Migration Tool

Copy the contents of the source vCenter Server's datamigration folder to the new vCenter Server.
Open up a Command Prompt and go to the folder containing the datamigration tools that you just copied.
Run install.bat.


Using the Data Migration Tool, you can easily migrate the vCenter Server 4.0 32-bit OS using Microsoft SQL Server 2005 Express to a
64-bit OS. As with any tool, there are some caveats. We have listed the most accessed VMware knowledge base articles regarding
the Data Migration Tool for your convenience as follows:
Backing up the vCenter Server 4.x bundle using the Data Migration Tool fails with the error: Object reference not set to an
instance of an object (http://kb.vmware.com/kb/1036228)
Data Migration Tool fails with the error: RESTORE cannot process database VIM_VCDB because it is in use by this session
(http://kb.vmware.com/kb/2001184)
vCenter Server 4.1 Data Migration Tool fails with the error: HResult 0x2, Level 16, State 1 (http://kb.vmware.com/kb/1024490)
Using the Data Migration Tool to upgrade from vCenter Server 4.0 to vCenter Server 4.1 fails (http://kb.vmware.com/kb/1024380)
When upgrading to vCenter Server 4.1, running install.bat of the Data Migration Tool fails (http://kb.vmware.com/kb/1029663)



Upgrading a 64-Bit vCenter 4.1 Server with a Remote Database
Of the three scenarios, this is the most straightforward, but we still suggest that you back up your current vCenter configuration and
database to provide a rollback scenario.

Insert the VMware vCenter Server 5.0 CD. Select vCenter Server and click Install.
Select the appropriate language and click OK.
Install .NET Framework 3.5 SP1 by clicking Install.
The installer should now detect that vCenter is already installed. Upgrade the current installation by clicking Next.




Upgrading a 64-Bit vCenter 4.1 Server with a Local Database to a Remote Database
When upgrading your environment from vCenter Server 4.1 to vCenter Server 5.0, it might also be the right time to make
adjustments to your design decisions. One of those changes might be the location of the vCenter Server database, where instead of
using a local Microsoft SQL Server Express 2005 database, a remote SQL server is used. In this scenario, we will primarily focus on
how to migrate the database. The upgrade of vCenter Server 4.1 can be done in two different ways, which we will briefly explain at
the end of the migration workflow section.
If vCenter Server is currently installed as a virtual machine, we recommended that you create a new virtual machine for vCenter
Server 5.0. That way, in case a rollback is required, the vCenter Server 4.1 virtual machine can be powered on with a minimal impact
on your management environment.

Download the Microsoft SQL Server Management Studio Express and install it on your vCenter Server (Guide assumes you
are using SQL Express).
Stop the service named VMware VirtualCenter Server.
Start the Microsoft SQL Server Management Studio Express application and log in to the local SQL instance.
Right-click your vCenter Server Database VIM_VCDB and click Back Up under Tasks.


Copy this database from the selected location to your new Microsoft SQL Database Server.
Create a new database on your destination Microsoft SQL Server 2008.

Open Microsoft SQL Server Management Studio Express.


Log in to the local Microsoft SQL Server instance.
Right-click Databases and select New Database.
Give the new database a name and select an appropriate owner.

Use the database calculator to identify the initial size of the database. Leave this set to the default and click OK.
Now that the database has been created, the old database must be restored to this newly created database.
Open Microsoft SQL Server Management Studio Express.
Log in to the local Microsoft SQL Server instance.
Unfold Databases.
Right-click the newly created database and select Restore Database.
Select From device and select the correct backup file.
Ensure that the correct database is selected to restore.
Select Overwrite the existing database (WITH REPLACE).

If you want to reuse your current environment, go to the existing vCenter Server and recreate the system DSN. If you prefer to use a
new vCenter Server instead, go to the new vCenter Server and create a new system DSN.

Open the ODBC Data Source Administrator.


Click the System DSN tab.
Remove the listed VMware VirtualCenter system DSN entry.
Add a new system DSN using the Microsoft SQL Server Native Client. If this option is not available, download it here:
http://www.microsoft.com/downloads/en/details.aspx?FamilyId=C6C3E9EF-BA29- 4A43-8D69-
A2BED18FE73C&displaylang=en.

If the current vCenter Server environment is reused, take the following steps. If a new vCenter Server is used, skip this step. We have
tested the upgrade without uninstalling vCenter Server. Although it was successful, we recommend removing it every time to
prevent any unexpected behavior or results.
Uninstall vCenter Server.
Reboot the vCenter Server host.
In both cases, vCenter Server must be reinstalled.
Install vCenter Server.

In the installation wizard, select the newly created DSN that connects to your SQL2008 database. Select the Do not
overwrite, leave my existing database in place option.


Ensure that the authentication type used in SQL2008 is the same as that used on SQLExpress2005.


Reset the permissions of the vCenter account that connects to the database as the database owner
(dbo) user of the MSDB system database. Details regarding this migration procedure can also be found in VMware
knowledge base article 1028601 (http://kb.vmware.com/kb/1028601), Migrating the vCenter Server 4.x database from SQL
Express 2005 to SQL Server 2008.


Upgrading to VMware ESXi 5.0
Following the vCenter Server upgrade, you are ready to begin upgrading your ESXi hosts. You can upgrade your ESX/ESXi 4.x hosts to
ESXi 5.0 using either the ESXi Installer or vSphere Update Manager. Each method has a unique set of advantages and disadvantages.


Choosing an Upgrade Path

The two upgrade methods work equally well, but there are specific requirements that must be met before a host can be upgraded to
ESXi 5.0. The following chart takes into account the various upgrade requirements and can be used as a guide to help determine
both your upgrade eligibility and your upgrade path.


Verifying Hardware Compatibility
ESXi 5.0 supports only 64-bit servers. Supported servers are listed on the vSphere Hardware Compatibility List (HCL). When verifying
hardware compatibility, it's also important to consider firmware versions. VMware will often annotate firmware requirements in the
footnotes of the HCL.

Verifying ESX/ESXi Host Version


Only hosts running ESX/ESXi 4.x can be directly upgraded to ESXi 5.0. Hosts running older releases must first be upgraded to
ESX/ESXi 4.x. While planning your ESXi 5.0 upgrade, evaluate the benefit of upgrading older servers against the benefit of replacing
them with new hardware.
Boot-Disk Free-Space Requirements
Upgrading from ESXi 4.x
When upgrading from ESXi 4.x, using either the ESXi Installer or Update Manager, a minimum of 50MB of free space is required in
the host's local VMware vSphere VMFS (VMFS) datastore. This space is used to temporarily store the host configuration.
Upgrading from ESX 4.x
When upgrading from ESX 4.x, the free-space requirements vary depending on whether you are using the ESXi Installer or Update
Manager.
ESXi Installer
When using the ESXi Installer, a minimum of 50MB of free space is required in the host's local VMFS datastore. This space is used to
temporarily store the host configuration.
VMware vSphere Update Manager
When using Update Manager, in addition to having 50MB of free space on the local VMFS datastore, there is an additional
requirement of 350MB free space in the /boot partition. This space is used as a temporary staging area where Update Manager
will copy the ESXi 5.0 image and required upgrade scripts.
NOTE: Due to differences in the boot disk partition layout between ESX 3.5 and ESX 4.x, ESX 4.x hosts upgraded from ESX 3.x might
not have the required 350MB of free space and therefore cannot be upgraded to ESXi 5.0 using Update Manager. In this case, use the
ESXi Installer to perform the upgrade.



Disk Partitioning Requirements
Upgrading an existing ESX/ESXi 4.x host to ESXi 5.0 modifies the host's boot disk. As such, a successful upgrade is highly dependent
on having a supported boot disk partition layout.

Disk Partitioning Requirements for ESXi
ESXi 5.0 uses the same boot disk layout as ESXi 4.x. Therefore, in most cases the boot disk partition table does not require
modification as part of the 5.0 upgrade. One notable exception is with an ESXi 3.5 host that is upgraded to ESXi 4.x and then
immediately upgraded to ESXi 5.0. In ESXi 3.5, the boot banks are 48MB. In ESXi 4.x, the size of the boot banks changed to 250MB.
When a host is upgraded from ESXi 3.5 to ESXi 4.x, only one of the two boot banks is resized. This results in a situation where a host

will have one boot bank at 250MB and the other at 48MB, a condition referred to as having lopsided boot banks. An ESXi host with
lopsided boot banks must have a new partition table written to the disk during the upgrade. Update Manager cannot be used to
upgrade a host with lopsided boot banks. The ESXi Installer must be used instead.
Disk Partitioning Requirements for ESX
When upgrading an ESX 4.x host to ESXi 5.0, the ESX boot disk partition table is modified to support the dual- image bank
architecture used by ESXi. The VMFS-3 partition is the only partition that is retained. All other partitions on the disk are destroyed.
Limitations of an Upgraded ESXi 5.0 Host
There are some side effects associated with upgrading an ESX host to ESXi 5.0 as compared to performing a fresh installation. These
include the following:

Upgraded hosts retain the legacy MSDOS-based partition label and are still limited to a physical disk that is less than 2TB in
size. Installing ESXi on a disk larger than 2TB requires a fresh install.
Upgraded hosts do not have a dedicated scratch partition. Instead, a scratch directory is created and mounted off a VMFS
volume. Aside from the scratch partition, all other disk partitions, such as the boot banks, locker and vmkcore, are identical
to that of a freshly installed ESXi 5.0 host.
The existing VMFS partition is not upgraded from VMFS-3 to VMFS-5. You can manually upgrade the VMFS partition after
the upgrade. ESXi 5.0 is compatible with VMFS-3 partitions, so upgrading to VMFS-5 is required only to enable new vSphere
5.0 features.
For hosts in which the VMFS partition is on a separate disk from the boot drive, the VMFS partition is left intact and the
entire boot disk is overwritten. Any extra data on the disk is erased.


Preserving the ESX/ESXi Host Configuration
During the upgrade, most of the ESX/ESXi host configuration is retained. However, not all of the host settings are preserved. The
following list highlights key configuration settings that are not carried forward during an upgrade:

The service console port group


Local users and groups on the ESX/ESXi host
NIS settings
Rulesets and custom firewall rules
Any data in custom disk partitions
Any custom or third-party scripts/agents running in the ESX service console
SSH configurations for ESX hosts (SSH settings are kept for ESXi hosts)


Third-Party Software Packages

Some customers run optional third-party software components on their ESX/ESXi 4.x hosts. When upgrading, if third-party
components are detected, you are warned that they will be lost during the upgrade.
If a host being upgraded contains third-party software components, such as CIM providers or nonstandard device drivers, either
these components can be reinstalled after the upgrade or you can use vSphere 5.0 Image Builder CLI to create a customized ESXi
installation image with these packages bundled.

VMware ESXi Upgrade Best Practices


Using vMotion/Storage vMotion
Virtual machines cannot be running on the ESX/ESXi host while it is upgraded. To avoid virtual machine downtime, use vMotion and
Storage vMotion to migrate virtual machines and their related data files off the host prior to upgrading. If virtual machines are not
migrated off the hosts, they must be shut down for the duration of the upgrade. If you don't have a license for vMotion or Storage
vMotion, leverage the vCenter 60-day trial period to access these features for the duration of the upgrade.
Placing ESX Hosts into Clusters and Enabling HA/DRS
Placing ESX hosts into a DRS-enabled HA cluster will facilitate migrating virtual machines off the host and ensure continued
availability of your virtual machines. When running virtual machines on a DRS-enabled HA cluster, virtual machines on shared
storage will automatically be migrated off the host when it is placed into maintenance mode. In addition, DRS will ensure that the
cluster workload remains balanced as you roll the upgrade through the hosts in the vSphere cluster. Again, if you don't have a license
for HA/DRS, leverage the vCenter 60-day trial period to access these features for the duration of the upgrade.
Watching Out for Local Storage
Virtual machines running on local storage cannot be accessed by other ESXi hosts in your datacenter. They therefore cannot be
resumed or taken over by another host in the rare event that you encounter a problem during the upgrade. If a problem develops,
all virtual machines on local datastores will be down until the problem is resolved and the host is restored. If the problem is severe
and you must resort to reinstalling ESX/ESXi, you are at risk of losing all your local virtual machines. To avoid unnecessary virtual
machine downtime and eliminate the risk of unwanted virtual machine deletion, migrate local virtual machines and their data files
off the host and onto shared storage using vMotion/Storage vMotion. Again, leverage the vCenter 60-day trial period to enable
vMotion and Storage vMotion if not already available.
Backing Up Your Host Configuration Before Upgrading
Prior to beginning a host upgrade, it's always a good idea to back up the host configuration. The steps to backing up the host
configuration differ depending on whether the host is running ESX or ESXi.
Backing Up Your ESX Host Configuration:
Before you upgrade an ESX host, back up the host's configuration and local VMFS volumes. This backup ensures that you will not
lose data during the upgrade.
Procedure

Back up the /etc/passwd, /etc/groups, /etc/shadow and /etc/gshadow files.
The /etc/shadow and /etc/gshadow files might not be present on all installations.
Back up any custom scripts.
Back up your .vmx files.
Back up local images, such as templates, exported virtual machines and .iso files.


Backing Up Your ESXi Host Configuration:
Procedure
Install the vSphere CLI.

In the vSphere CLI, run the vicfg-cfgbackup command with the -s flag to save the host configuration to a specified backup filename.
~# vicfg-cfgbackup --server <ESXi-host-ip> --portnumber <port_number> --protocol <protocol_type> --username <username>
--password <password> -s <backup-filename>
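Should a rollback be required after the upgrade, the same utility can restore the saved configuration. This is a sketch using the same placeholders as above and assumes the -l (load) option of vicfg-cfgbackup, which writes a previously saved configuration back to the host (the host typically must be in maintenance mode and will reboot to apply the restored configuration):
~# vicfg-cfgbackup --server <ESXi-host-ip> --username <username> --password <password> -l <backup-filename>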
In addition, it's a good idea to document the host configuration and to have this information available in the event that problems
arise during the host upgrade.

Summary of Upgrade Requirements and Recommendations


The following list provides a summary of the upgrade requirements and recommendations:

Verify that your hardware is supported with ESXi 5.0 by using the vSphere 5.0 Hardware Compatibility List (HCL) at
http://www.vmware.com/resources/compatibility/search.php.
Consider phasing out the older servers and refreshing your hardware in conjunction with an ESXi 5.0 upgrade.
Back up your host before attempting an upgrade. The upgrade process modifies the ESX/ESXi host's boot disk partition
table, preventing automated rollback.
Verify that the boot disk partition table meets the upgrade requirements, particularly regarding the size of the /boot
partition and the location of the VMFS partition (the VMFS partition can be preserved only when it is physically located
beyond the 1GB mark, that is, after the ESX boot partition, which is partition 4, and after the extended disk partition on
the disk (8192 + 1835008 sectors).
Use Image Builder CLI to add optional third-party software components, such as CIM providers and device drivers, to your
ESXi 5.0 installation image.
Move virtual machines on local storage over to shared storage, where they can be kept highly available using vMotion and
Storage vMotion together with VMware HA and DRS.
If the host was upgraded from ESXi 3.5, watch out for lopsided boot banks. Upgrade hosts with lopsided boot banks using
the ESXi Installer.
If the ESXi Installer does not provide an option to upgrade, verify that the required disk space is available (350MB in /boot,
50MB in VMFS).

Upgrading to ESXi 5.0 Using Update Manager


Requirements As a reminder, the following requirements must be met to perform an upgrade using Update Manager:

Perform a full backup of the ESX/ESXi host.


Ensure that you have 50MB of free space on the boot disk VMFS datastore.
Ensure that you have 350MB free on the ESX host's /boot partition (ESX only).
Ensure that the VMFS partition begins beyond the 1GB mark (starts after sector 1843200).
Ensure that the host was not recently upgraded from ESXi 3.5 (ESXi only).
Use vMotion/Storage vMotion to migrate all virtual machines off the host (alternatively, power the virtual machines down).

Uploading the ESXi Installation ISO


Start the upgrade by uploading the ESXi 5.0 installation image into Update Manager. From the Update Manager screen, choose the
ESXi Images tab and click the link to Import ESXi Image... . Follow the wizard to import the ESXi 5.0 Image.
Creating an Upgrade Baseline
Create an upgrade baseline using the uploaded ESXi 5.0 image. From the Update Manager screen, choose the Baselines and Groups
tab. From the Baselines section on the left, choose Create... to create a new baseline. Follow the wizard to create a new baseline.
Attaching the Baseline to Your Cluster/Host
Attach the upgrade baseline to your host or cluster. From the vCenter Hosts and Clusters view, select the Update Manager tab and
choose Attach... . Select the upgrade baseline created previously. If you have any other upgrade baselines attached, remove them.
Scanning the Cluster/Host
Scan your hosts to ensure that the host requirements are met and you are ready to upgrade. From the vCenter Hosts and Clusters
view, select the host/cluster, select the Update Manager tab and select Scan... . Wait for the scan to complete.
If the hosts return a status of Non-Compliant, you are ready to proceed with upgrading the host.
If a host returns a status of Incompatible with the reason being an invalid boot disk, you cannot use Update Manager to upgrade. Try
using the ESXi Installer.

If a host returns a status of Incompatible with the reason being that optional third-party software was detected, you can proceed
with the upgrade and reinstall the optional software packages afterward or you can proactively add the optional packages to the
ESXi installation image using Image Builder CLI.
Remediating Your Host
After the scan completes and your host is flagged as Non-Compliant, you are ready to perform the upgrade. From the Hosts and
Clusters view, select the host/cluster, select the Update Manager tab and select Remediate. You will get a pop-up asking if you want
to install patches, upgrade, or do both. Choose the upgrade option and follow the wizard to complete the remediation.
Assuming that DRS is enabled and running in fully automated mode, Update Manager will proceed to place the host into
maintenance mode (if not already in maintenance mode) and perform the upgrade. If DRS is not enabled, you must evacuate the
virtual machines off the host and put it into maintenance mode before remediating.
After the upgrade, the host will reboot and Update Manager will take it out of maintenance mode and return the host into
operation.
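If you prefer to script this workflow, the Update Manager PowerCLI cmdlets can drive the same attach/scan/remediate sequence. The sketch below is illustrative only and assumes the Update Manager PowerCLI plug-in is installed and that an upgrade baseline named "ESXi 5.0 Upgrade" and a cluster named "Prod-Cluster" already exist (both names are hypothetical):
Connect-VIServer -Server vcenter01.example.com
$cluster  = Get-Cluster -Name "Prod-Cluster"
$baseline = Get-Baseline -Name "ESXi 5.0 Upgrade"
Attach-Baseline -Baseline $baseline -Entity $cluster
Scan-Inventory -Entity $cluster
Remediate-Inventory -Entity $cluster -Baseline $baseline -Confirm:$false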
Using Update Manager to Upgrade an Entire Cluster
You can use Update Manager to remediate an individual host or an entire cluster. If you choose to remediate an entire cluster,
Update Manager will roll the upgrade through the cluster, upgrading each host in turn. You have flexibility in determining how
Update Manager will treat the virtual machines during the upgrade. You can choose to either power them off or use vMotion to
migrate them to another host. If you choose to power off the virtual machines, Update Manager will first power off all the virtual
machines in the cluster and then proceed to upgrade the entire cluster in parallel. If you choose to migrate the virtual machines,
Update Manager will evacuate as many hosts as it can (keeping within the HA admission control constraints) and upgrade the
evacuated hosts in parallel. Then, after they are upgraded, it will move on to the next set of hosts.
Rolling Back from a Failed Update Manager Upgrade
During the upgrade, the files on the boot disk are overwritten. This prevents any kind of automated rollback if problems arise. To
restore a host to its pre-upgrade state, reinstall the ESX/ESXi 4.x software and restore the host configuration from the backup.
Upgrading Using the ESXi Installer
Requirements
As a reminder, the following requirements must be met to perform an upgrade using the ESXi Installer:

Perform a full backup of the ESX/ESXi host.


Ensure that you have 50MB of free space on the boot disk VMFS datastore.
Ensure that the VMFS partition begins beyond the 1GB mark (starts after sector 1843200).
Use vMotion/Storage vMotion to migrate all virtual machines off the host (alternatively, power the virtual machines down).


Placing the Host into Maintenance Mode

Use vMotion/Storage vMotion to evacuate all virtual machines off the host and put the host into maintenance mode. If DRS is
enabled in fully automated mode, the virtual machines on shared storage will be automatically migrated when the host is put into
maintenance mode. Alternatively, you can power off any virtual machines running on the host.
Booting Off the ESXi 5.0 Installation Media
Connect to the host console and boot the host off the ESXi 5.0 installation media. From the boot menu, select the option to boot
from the ESXi Installer.
Selecting Option to Migrate and Preserving the VMFS Datastore
When an existing ESX/ESXi 4.x installation is detected, the ESXi Installer will prompt you either to migrate (upgrade) the host and
preserve the existing VMFS datastore, or to perform a fresh install (with options to preserve or overwrite the VMFS datastore). Select
the Migrate ESX, preserve VMFS datastore option.

Third-Party-Software Warning
If third-party software components are detected, a warning is displayed indicating that these components will be lost.
If the identified software components are required, ensure either that they are included with the ESXi installation media (use Image
Builder CLI to add third-party software packages to the install media) or that you reinstall them after the upgrade. Press Enter to
continue the install or Escape to cancel.
Confirming the Upgrade
The system is then scanned in preparation for the upgrade. When the scan completes, the user is asked to confirm the upgrade by
pressing the F11 key.
The ESXi Installer will then proceed to upgrade the host to ESXi 5.0. After the installation, the user will be asked to reboot the host.
Then reconnect the host and exit maintenance mode.

Post-Upgrade Considerations
Configuring the VMware ESXi 5.0 Dump Collector
A core dump is the state of working memory in the event of host failure. By default, an ESXi core dump is saved to the local boot
disk. Use the VMware ESXi Dump Collector to consolidate core dumps onto a network server to ensure that they are available for
use if debugging is required. You can install the ESXi Dump Collector on the vCenter Server or on a separate Windows server that has
a network connection to the vCenter Server. Refer to the vSphere Installation and Setup Guide for more information on setting up
the ESXi Dump Collector.
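After the Dump Collector service is installed, each ESXi 5.0 host can be pointed at it with esxcli. A minimal sketch, assuming the vmk0 management interface, a hypothetical collector address and the default port 6500:
# esxcli system coredump network set --interface-name vmk0 --server-ipv4 192.168.1.20 --server-port 6500
# esxcli system coredump network set --enable true
# esxcli system coredump network get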
Configuring the ESXi 5.0 Syslog Collector
Install the vSphere Syslog Collector to enable ESXi system logs to be directed to a network server rather than to the local disk. You
can install the Syslog Collector on the vCenter Server or on a separate Windows server that has a network connection to the vCenter
Server. Refer to the vSphere Installation and Setup Guide for more information on setting up the ESXi Syslog Collector.
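As with the Dump Collector, hosts can be redirected to the Syslog Collector from the command line. A minimal sketch, assuming a hypothetical collector address, the default UDP port 514, and that the syslog firewall ruleset needs to be opened:
# esxcli system syslog config set --loghost='udp://192.168.1.20:514'
# esxcli system syslog reload
# esxcli network firewall ruleset set --ruleset-id=syslog --enabled=true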
Configuring a Remote Management Host
Most ESXi host administration will be done through the vCenter Server, using the vSphere Client. There also will be occasions when
remote command-line access is beneficial, such as for scripting, troubleshooting and some advanced configuration tuning. ESXi
provides a rich set of APIs that are accessible using the VMware vSphere Command-Line Interface (vCLI) and the Windows-based
VMware vSphere PowerCLI.
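As a simple illustration of remote management with PowerCLI (the vCenter name is hypothetical), the following connects to vCenter and reports the version and build of each host, which is useful for confirming the upgrade:
Connect-VIServer -Server vcenter01.example.com
Get-VMHost | Select-Object Name, Version, Build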

Upgrading Virtual Machines
After you perform an upgrade, you must determine if you will also upgrade the virtual machines that reside on the upgraded hosts.
Upgrading virtual machines ensures that they remain compatible with the upgraded host software and can take advantage of new
features. Upgrading your virtual machines entails upgrading the version of VMware Tools as well as the virtual machine's virtual
hardware version.
VMware Tools
The first step in upgrading virtual machines is to upgrade VMware Tools.
vSphere 5.0 supports virtual machines running both VMware Tools version 4.x and 5.0. Running virtual machines with VMware Tools
version 5.0 on older ESX/ESXi 4.x hosts is also supported.
Therefore, virtual machines running VMware Tools 4.x or higher do not require upgrading following the ESXi host upgrade. However,
only the upgraded virtual machines will benefit from the new features and latest performance benefits associated with the most
recent version of VMware Tools.


Virtual Hardware
The second step in upgrading virtual machines is to upgrade the virtual hardware version. Before upgrading the virtual hardware,
you must first upgrade the VMware Tools. The hardware version of a virtual machine reflects the virtual machine's supported virtual
hardware features. These features correspond to the physical hardware available on the ESXi host on which you create the virtual
machine. Virtual hardware features include BIOS and EFI, available virtual PCI slots, maximum number of CPUs, maximum memory
configuration, and other characteristics typical to hardware. One important consideration when upgrading the virtual hardware is
that virtual machines running the latest virtual hardware version (version 8) can run only on ESXi 5.0 hosts. Do not upgrade the
virtual hardware for virtual machines running in a mixed cluster made up of ESX/ESXi 4.x hosts and ESXi 5.0 hosts. Only upgrade a
virtual machine's virtual hardware version after all the hosts in the cluster have been upgraded to ESXi 5.0. Upgrading the virtual
machine's virtual hardware version is a one-way operation. There is no option to reverse the upgrade after it is done.
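Where many virtual machines need upgrading outside of Update Manager, PowerCLI can be used to first update VMware Tools and then, once the VMs are powered off and every host in the cluster runs ESXi 5.0, raise the virtual hardware version. A sketch, assuming a hypothetical cluster name:
Get-Cluster "Prod-Cluster" | Get-VM | Update-Tools -NoReboot
Get-Cluster "Prod-Cluster" | Get-VM | Where-Object { $_.PowerState -eq "PoweredOff" } | Set-VM -Version v8 -Confirm:$false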


Orchestrated Upgrade of VMware Tools and Virtual Hardware

An orchestrated upgrade enables you to upgrade both the VMware Tools and the virtual hardware of the virtual machines in your
vSphere inventory at the same time. Use Update Manager to perform an orchestrated upgrade.
You can perform an orchestrated upgrade of virtual machines at the folder or datacenter level. Update Manager makes the process
of upgrading the virtual machines convenient by providing baseline groups. When you remediate a virtual machine against a
baseline group containing the VMware Tools Upgrade to Match Host baseline and the VM Hardware Upgrade to Match Host
baseline, Update Manager sequences the upgrade operations in the correct order. As a result, the guest operating system is in a
consistent state at the end of the upgrade.
Upgrading VMware vSphere VMFS
After you perform an ESX/ESXi upgrade, you might need to upgrade your VMFS to take advantage of the new features. vSphere 5.0
supports both VMFS version 3 and version 5, so it is not necessary to upgrade your VMFS volumes unless one needs to leverage new
5.0 features. However, VMFS-5 offers a variety of new features such as larger single-extent volume (approximately 60TB), larger
VMDKs with unified 1MB block size (2TB), smaller subblock (8KB) to reduce the amount of stranded/unused space, and an
improvement in performance and scalability via the implementation of the vSphere Storage API for Array Integration (VAAI)
primitive Atomic Test & Set (ATS) across all datastore operations. VMware recommends that customers move to VMFS-5 to benefit
from these features. A complete set of VMFS-5 enhancements can be found in the "What's New in vSphere 5.0 Storage" white paper.
Considerations: Upgrade to VMFS-5 or Create New VMFS-5
Although a VMFS-3 that is upgraded to VMFS-5 provides you with most of the same capabilities as a newly created VMFS-5, there
are some differences. Both upgraded and newly created VMFS-5 support single-extent volumes up to approximately 60TB and both

support VMDK sizes of 2TB, no matter what the VMFS file block size is. However, the additional differences, although minor, should
be considered when making a decision on upgrading to VMFS-5 or creating new VMFS-5 volumes.

VMFS-5 upgraded from VMFS-3 continues to use the previous file block size, which might be larger than the unified 1MB
file block size. This can lead to stranded/unused disk space when there are many small files on the datastore.
VMFS-5 upgraded from VMFS-3 continues to use 64KB subblocks, not new 8K subblocks. This can also lead to
stranded/unused disk space.
VMFS-5 upgraded from VMFS-3 continues to have a file limit of 30720 rather than the new file limit of >100000 for a newly
created VMFS-5. This has an impact on the scalability of the file system.

For these reasons, VMware recommends using newly created VMFS-5 volumes if you have the luxury of doing so. You can then
migrate the virtual machines from the original VMFS-3 to VMFS-5. If you do not have the available space to create new VMFS-5
volumes, upgrading VMFS-3 to VMFS-5 will still provide you with most of the benefits that come with a newly created VMFS-5.

Online Upgrade
If you do decide to upgrade VMFS-3 to VMFS-5, it is a simple, single-click operation. After you have upgraded the host to ESXi 5.0, go
to the Configuration tab > Storage view. Select the VMFS-3 datastore. Above the Datastore Details window, an option to Upgrade to
VMFS-5... will be displayed.
The upgrade process is online and non-disruptive. Virtual machines can continue to run on the datastore while it is being upgraded.
Upgrading VMFS is a one-way operation. There is no option to reverse the upgrade after it is done. Also, after a file system has been
upgraded, it will no longer be accessible by older ESX/ESXi 4.x hosts, so you must ensure that all hosts accessing the datastore are
running ESXi 5.0. In fact, there are checks built in to vSphere that will prevent you from upgrading to VMFS-5 if any of the hosts
accessing the datastore are running a version of ESX/ESXi that is older than 5.0.
As with any upgrade, VMware recommends that a backup of your virtual machines be made prior to upgrading your VMFS-3 to
VMFS-5.
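The same upgrade can also be performed from the command line of an ESXi 5.0 host. A minimal sketch, assuming the esxcli storage vmfs upgrade namespace is available on your ESXi 5.x build and a hypothetical datastore label of datastore1:
# esxcli storage vmfs upgrade -l datastore1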
After the VMFS-5 volume is in place, the size can be extended to approximately 60TB, even if it is a single extent, and 2TB virtual
machine disks (VMDKs) can be created, no matter what the underlying file block size is. These features are available out of the box,
without any additional configuration steps.
Refer to the vSphere Upgrade Guide for more information on features that require VMFS version 5, the differences between VMFS
versions 3 and 5, and how to upgrade.
The following table provides a matrix showing the supported VMware Tools, virtual hardware and VMFS versions in ESXi 5.0.


9. Best Practices for Performance Tuning of Latency-Sensitive Workloads in vSphere VMs

This summarizes our findings and recommends best practices to tune the different layers of an application's environment for similar
latency-sensitive workloads. By latency-sensitive, we mean workloads that require optimizing for a few microseconds to a few tens
of microseconds end-to-end latencies; we don't mean workloads in the hundreds of microseconds to tens of milliseconds end-to-end
latencies. In fact, many of the recommendations in this paper that can help with the microsecond level latency can actually end
up hurting the performance of applications that are tolerant of higher latency.
Please note that the exact benefits and effects of each of these configuration choices will be highly dependent upon the specific
applications and workloads, so we strongly recommend experimenting with the different configuration options with your workload
before deploying them in a production environment.
BIOS Settings
Most servers with new Intel and AMD processors provide power savings features that use several techniques to dynamically detect
the load on a system and put various components of the server, including the CPU, chipsets, and peripheral devices into low power
states when the system is mostly idle.
There are two parts to power management on ESXi platforms:
1. The BIOS settings for power management, which influence what the BIOS advertises to the OS/hypervisor about whether it
should be managing power states of the host or not.
2. The OS/hypervisor settings for power management, which influence the policies of what to do when it detects that the
system is idle.

For latency-sensitive applications, any form of power management adds latency to the path where an idle system (in one of several
power savings modes) responds to an external event. So our recommendation is to set the BIOS setting for power management to
static high, that is, no OS-controlled power management, effectively disabling any form of active power management. Note that
achieving the lowest possible latency and saving power on the hosts and running the hosts cooler are fundamentally at odds with
each other, so we recommend carefully evaluating the trade-offs of disabling any form of power management in order to achieve
the lowest possible latencies for your application's needs.
Servers with Intel Nehalem class and newer (Intel Xeon 55xx and newer) CPUs also offer two other power management options: C-
states and Intel Turbo Boost. Leaving C-states enabled can increase memory latency and is therefore not recommended for low-
latency workloads. Even the enhanced C-state known as C1E introduces longer latencies to wake up the CPUs from halt (idle) states
to full-power, so disabling C1E in the BIOS can further lower latencies. Intel Turbo Boost, on the other hand, will step up the internal
frequency of the processor should the workload demand more power, and should be left enabled for low-latency, high-performance
workloads. However, since Turbo Boost can over-clock portions of the CPU, it should be left disabled if the applications require
stable, predictable performance and low latency with minimal jitter.
How power management-related settings are changed depends on the OEM make and model of the server.
For example, for HP ProLiant servers:
Set the Power Regulator Mode to Static High Mode.
Disable Processor C-State Support.
Disable Processor C1E Support.
Disable QPI Power Management.
Enable Intel Turbo Boost.

For Dell PowerEdge servers:
Set the Power Management Mode to Maximum Performance.
Set the CPU Power and Performance Management Mode to Maximum Performance.
Processor Settings: set Turbo Mode to enabled.
Processor Settings: set C States to disabled.


NUMA
The high latency of accessing remote memory in NUMA (Non-Uniform Memory Access) architecture servers can add a non-trivial
amount of latency to application performance. ESXi uses a sophisticated, NUMA-aware scheduler to dynamically balance processor
load and memory locality.

For best performance of latency-sensitive applications in guest OSes, all vCPUs should be scheduled on the same NUMA node and all
VM memory should fit and be allocated out of the local physical memory attached to that NUMA node.
Processor affinity for vCPUs to be scheduled on specific NUMA nodes, as well as memory affinity for all VM memory to be allocated
from those NUMA nodes, can be set using the vSphere Client under VM Settings -> Options tab -> Advanced General -> Configuration
Parameters and adding entries for numa.nodeAffinity=0, 1, ..., where 0, 1, etc. are the processor socket numbers.
Note that when you constrain NUMA node affinities, you might interfere with the ability of the NUMA scheduler to rebalance virtual
machines across NUMA nodes for fairness. Specify NUMA node affinity only after you consider the rebalancing issues. Note also that
when a VM is migrated (for example, using vMotion) to another host with a different NUMA topology, these advanced settings may
not be optimal on the new host and could lead to sub-optimal performance of your application on the new host. You will need to re-
tune these advanced settings for the NUMA topology for the new host.
ESXi 5.0 and newer also support vNUMA, where the underlying physical host's NUMA architecture can be exposed to the guest OS by
providing certain ACPI BIOS tables for the guest OS to consume. Exposing the physical host's NUMA topology to the VM helps the
guest OS kernel make better scheduling and placement decisions for applications to minimize memory access latencies.
vNUMA is automatically enabled for VMs configured with more than 8 vCPUs that are wider than the number of cores per physical
NUMA node. For certain latency-sensitive workloads running on physical hosts with fewer than 8 cores per physical NUMA node,
enabling vNUMA may be beneficial. This is achieved by adding an entry for "numa.vcpu.min = N", where N is less than the number of
vCPUs in the VM, in the vSphere Client under VM Settings -> Options tab -> Advanced General -> Configuration Parameters.
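For example, the resulting advanced configuration entries (equivalently, lines in the VM's .vmx file) might look like the following for a 6-vCPU VM that should be constrained to NUMA node 0 and have vNUMA exposed (illustrative values only):
numa.nodeAffinity = "0"
numa.vcpu.min = "4"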
To learn more about this topic, please refer to the NUMA sections in the "vSphere Resource Management Guide" and the white
paper explaining the vSphere CPU Scheduler: http://www.vmware.com/files/pdf/techpaper/VMware-vSphere-CPU-Sched-Perf.pdf
Choice of Guest OS
Certain older guest OSes like RHEL5 incur higher virtualization overhead for various reasons, such as frequent accesses to virtual PCI
devices for interrupt handling, frequent accesses to the virtual APIC (Advanced Programmable Interrupt Controller) for interrupt
handling, high virtualization overhead when reading the current time, inefficient mechanisms to idle, and so on.
Moving to a more modern guest OS (like SLES11 SP1 or RHEL6 based on 2.6.32 Linux kernels, or Windows Server 2008 or newer)
minimizes these virtualization overheads significantly. For example, RHEL6 is based on a tickless kernel, which means that it
doesn't rely on high-frequency timer interrupts at all. For a mostly idle VM, this saves the power consumed when the guest wakes
up for periodic timer interrupts, finds out there is no real work to do, and goes back to an idle state.
Note, however, that tickless kernels like RHEL6 can incur higher overheads in certain latency-sensitive workloads because the kernel
programs one-shot timers every time it wakes up from idle to handle an interrupt, while the legacy periodic timers are pre-
programmed and don't have to be programmed every time the guest OS wakes up from idle. To override tickless mode and fall back
to the legacy periodic timer mode for such modern versions of Linux, pass the nohz=off kernel boot-time parameter to the guest OS.
These newer guest OSes also have better support for MSI-X (Message Signaled Interrupts), which are more efficient than legacy INT-x
style APIC-based interrupts for interrupt delivery and acknowledgement from the guest OSes.
Since there is a certain overhead when reading the current time, due to overhead in virtualizing various timer mechanisms, we
recommend minimizing the frequency of reading the current time (using gettimeofday() or currentTimeMillis() calls) in your guest
OS, either via the latency-sensitive application doing so directly, or via some other software component in the guest OS doing this.
The overhead in reading the current time was especially high in Linux versions older than RHEL 5.4, due to the underlying timer
device they relied on as their time source and the overhead in virtualizing them. Versions of Linux after RHEL 5.4 incur significantly
lower overhead when reading the current time.
To learn more about best practices for time keeping in Linux guests, please see the VMware KB 1006427:
http://kb.vmware.com/kb/1006427. To learn more about how timekeeping works in VMware VMs, please read
http://www.vmware.com/files/pdf/Timekeeping-In-VirtualMachines.pdf.
Physical NIC Settings
Most 1GbE or 10GbE NICs (Network Interface Cards) support a feature called interrupt moderation or interrupt throttling, which
coalesces interrupts from the NIC to the host so that the host doesn't get overwhelmed and spend all its CPU cycles processing
interrupts.
However, for latency-sensitive workloads, the time the NIC is delaying the delivery of an interrupt for a received packet or a packet
that has successfully been sent on the wire is the time that increases the latency of the workload.
Most NICs also provide a mechanism, usually via the ethtool command and/or module parameters, to disable interrupt moderation.
Our recommendation is to disable physical NIC interrupt moderation on the ESXi host as follows:
# esxcli system module parameters set -m ixgbe -p "InterruptThrottleRate=0"
This example applies to the Intel 10GbE driver called ixgbe. You can find the appropriate module parameter for your NIC by first
finding the driver using the ESXi command:
# esxcli network nic list
Then find the list of module parameters for the driver used:
# esxcli system module parameters list -m <driver>
Note that while disabling interrupt moderation on physical NICs is extremely helpful in reducing latency for latency-sensitive VMs, it
can lead to some performance penalties for other VMs on the ESXi host, as well as higher CPU utilization to handle the higher rate of
interrupts from the physical NIC.
Disabling physical NIC interrupt moderation can also defeat the benefits of Large Receive Offloads (LRO), since some physical NICs
(like Intel 10GbE NICs) that support LRO in hardware automatically disable it when interrupt moderation is disabled, and ESXi's
implementation of software LRO has fewer packets to coalesce into larger packets on every interrupt. LRO is an important offload
for driving high throughput for large-message transfers at reduced CPU cost, so this trade-off should be considered carefully.

Virtual NIC Settings
ESXi VMs can be configured to have one of the following types of virtual NICs (http://kb.vmware.com/kb/1001805): Vlance,
VMXNET, Flexible, E1000, VMXNET 2 (Enhanced), or VMXNET 3.
We recommend you choose VMXNET 3 virtual NICs for your latency-sensitive or otherwise performance-critical VMs. VMXNET 3 is
the latest generation of our paravirtualized NICs designed from the ground up for performance, and is not related to VMXNET or
VMXNET 2 in any way. It offers several advanced features including multi-queue support: Receive Side Scaling, IPv4/IPv6 offloads,
and MSI/MSI-X interrupt delivery. Modern enterprise Linux distributions based on 2.6.32 or newer kernels, like RHEL6 and SLES11
SP1, ship with out-of-the-box support for VMXNET 3 NICs.
VMXNET 3 by default also supports an adaptive interrupt coalescing algorithm, for the same reasons that physical NICs implement
interrupt moderation. This virtual interrupt coalescing helps drive high throughputs to VMs with multiple vCPUs with parallelized
workloads (for example, multiple threads), while at the same time striving to minimize the latency of virtual interrupt delivery.
However, if your workload is extremely sensitive to latency, then we recommend you disable virtual interrupt coalescing for
VMXNET 3 virtual NICs as follows.
To do so through the vSphere Client, go to VM Settings -> Options tab -> Advanced General -> Configuration Parameters and add an
entry for ethernetX.coalescingScheme with the value of disabled.
Please note that this new configuration option is only available in ESXi 5.0 and later. An alternative way to disable virtual interrupt
coalescing for all virtual NICs on the host (which affects all VMs, not just the latency-sensitive ones) is to set the advanced
networking performance option CoalesceDefaultOn to 0 (disabled) under Configuration > Advanced Settings > Net. See
http://communities.vmware.com/docs/DOC-10892 for details.
Another feature of VMXNET 3 that helps deliver high throughput with lower CPU utilization is Large Receive Offload (LRO), which
aggregates multiple received TCP segments into a larger TCP segment before delivering it up to the guest TCP stack. However, for
latency-sensitive applications that rely on TCP, the time spent aggregating smaller TCP segments into a larger one adds latency. It

can also affect TCP algorithms like delayed ACK, which now cause the TCP stack to delay an ACK until the two larger TCP segments
are received, also adding to end-to-end latency of the application.
Therefore, you should also consider disabling LRO if your latency-sensitive application relies on TCP. To do so for Linux guests, you
need to reload the vmxnet3 driver in the guest:
# modprobe -r vmxnet3
Add the following line in /etc/modprobe.conf (Linux version dependent):
options vmxnet3 disable_lro=1
Then reload the driver using:
# modprobe vmxnet3
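If you want to confirm that LRO is actually off inside the guest, ethtool can usually report the offload state; look for the large-receive-offload line in the output (the interface name is hypothetical and the label varies by ethtool version):
# ethtool -k eth0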
VM Settings
If your application is multi-threaded or consists of multiple processes that could benefit from using multiple CPUs, you can add more
virtual CPUs (vCPUs) to your VM. However, for latency-sensitive applications, you should not overcommit vCPUs as compared to the
number of pCPUs (processors) on your ESXi host. For example, if your host has 8 CPU cores, limit the number of vCPUs for your VM
to 7. This will ensure that the ESXi VMkernel scheduler has a better chance of placing your vCPUs on pCPUs that won't be
contended by other scheduling contexts, like vCPUs from other VMs or ESXi helper worlds.
If your application needs a large amount of physical memory when running unvirtualized, consider configuring your VM with a lot of
memory as well, but again, try to refrain from overcommitting the amount of physical memory in the system. You can look at the
memory statistics in the vSphere Client under the host's Resource Allocation tab under Memory -> Available Capacity to see how
much memory you can configure for the VM after all the virtualization overheads are accounted for.
If you want to ensure that the VMkernel does not deschedule your VM when the vCPU is idle (most systems generally have brief
periods of idle time, unless you're running an application which has a tight loop executing CPU instructions without taking a break or
yielding the CPU), you can add the following configuration option. Go to VM Settings -> Options tab -> Advanced General ->
Configuration Parameters and add monitor_control.halt_desched with the value of false.
Note that this option should be considered carefully, because this option will effectively force the vCPU to consume all of its
allocated pCPU time, such that when that vCPU in the VM idles, the VM Monitor will spin on the CPU without yielding the CPU to the
VMkernel scheduler, until the vCPU needs to run in the VM again. However, for extremely latency-sensitive VMs which cannot
tolerate the latency of being descheduled and scheduled, this option has been seen to help.
A slightly more power-conserving approach, which still results in lower latencies when the guest needs to be woken up soon after it
idles, is to use the following advanced configuration parameters (see also http://kb.vmware.com/kb/1018276):
For > 1 vCPU VMs, set monitor.idleLoopSpinBeforeHalt to true.
For 1 vCPU VMs, set monitor.idleLoopSpinBeforeHaltUP to true.
These options cause the VM Monitor to spin for a small period of time (by default 100 us, configurable through
monitor.idleLoopMinSpinUS) before yielding the CPU to the VMkernel scheduler, which may then idle the CPU if there is no other
work to do.

New in vSphere 5.5 is a VM option called Latency Sensitivity, which defaults to Normal. Setting this to High can yield significantly
lower latencies and jitter, as a result of the following mechanisms that take effect in ESXi:

Exclusive access to physical resources, including pCPUs dedicated to vCPUs with no contending threads for executing on
these pCPUs.
Full memory reservation eliminates ballooning or hypervisor swapping leading to more predictable performance with no
latency overheads due to such mechanisms.
Halting in the VM Monitor when the vCPU is idle, leading to faster vCPU wake-up from halt, and bypassing the VMkernel
scheduler for yielding the pCPU. This also conserves power as halting makes the pCPU enter a low power mode, compared
to spinning in the VM Monitor with the monitor_control.halt_desched=FALSE option.
Disabling interrupt coalescing and LRO automatically for VMXNET 3 virtual NICs.
Optimized interrupt delivery path for VM DirectPath I/O and SR-IOV passthrough devices, using heuristics to derive hints
from the guest OS about optimal placement of physical interrupt vectors on physical CPUs. To learn more about this topic,
please refer to the technical whitepaper: http://www.vmware.com/files/pdf/techpaper/latency-sensitive-perf-vsphere55.pdf

Polling Versus Interrupts


For applications or workloads that are allowed to use more CPU resources in order to achieve the lowest possible latency, polling in
the guest for I/O to be complete instead of relying on the device delivering an interrupt to the guest OS could help. Traditional
interrupt-based I/O processing incurs additional overheads at various levels, including interrupt handlers in the guest OS, accesses to
the interrupt subsystem (APIC, devices) that incurs emulation overhead, and deferred interrupt processing in guest OSes (Linux
bottom halves/NAPI poll, Windows DPC), which hurts latency to the applications.
With polling, the driver and/or the application in the guest OS will spin waiting for I/O to be available and can immediately indicate
the completed I/O up to the application waiting for it, thereby delivering lower latencies. However, this approach consumes more
CPU resources, and therefore more power, and hence should be considered carefully.
Note that this approach is different from what the idle=poll kernel parameter for Linux guests achieves. This approach requires
writing a poll-mode device driver for the I/O device involved in your low latency application, which constantly polls the device (for
example, looking at the receive ring for data to have been posted by the device) and indicates the data up the protocol stack
immediately to the latency-sensitive application waiting for the data.
Guest OS Tips and Tricks
If your application uses Java, then one of the most important optimizations we recommend is to configure both the guest OS and
Java to use large pages. Add the following command-line option when launching Java:
-XX:+UseLargePages
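A hypothetical launch command combining large pages with a fixed heap might look like the following; note that the guest OS must also have large/huge pages configured (for example via vm.nr_hugepages on Linux) for the flag to take effect:
java -XX:+UseLargePages -Xms4g -Xmx4g -jar latency-sensitive-app.jar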
For other important guidelines when tuning Java applications running in VMware VMs, please refer to
http://www.vmware.com/resources/techresources/1087.
Another source of latency for networking I/O can be guest firewall rules like Linux iptables. If your security policy for your VM can
allow for it, consider stopping the guest firewall.
Similarly, security infrastructure like SELinux can also add to application latency, since it intercepts every system call to do additional
security checks. Consider disabling SELinux if your security policy can allow for that.

10. Performance Best Practices for VMware vSphere 5.0


Validate Your Hardware
Before deploying a system we recommend the following:
Verify that all hardware in the system is on the hardware compatibility list for the specific version of VMware software you will
be running.
Make sure that your hardware meets the minimum configuration supported by the VMware software you will be running.
Test system memory for 72 hours, checking for hardware errors.

Hardware CPU Considerations
This section provides guidance regarding CPUs for use with vSphere 5.0.

General CPU Considerations
When selecting hardware, it is a good idea to consider CPU compatibility for VMware vMotion (which in turn affects DRS)
and VMware Fault Tolerance. See the VMware vMotion and Storage vMotion, VMware Distributed Resource
Scheduler (DRS), and VMware Fault Tolerance sections of the Performance Best Practices guide.

Hardware-Assisted Virtualization

Most recent processors from both Intel and AMD include hardware features to assist virtualization.
These features were released in two generations:
The first generation introduced CPU virtualization
The second generation added memory management unit (MMU) virtualization

For the best performance, make sure your system uses processors with second-generation hardware-assist features.

Hardware-Assisted CPU Virtualization (VT-x and AMD-V) The first generation of hardware virtualization assistance, VT-x
from Intel and AMD-V from AMD, became available in 2006. These technologies automatically trap sensitive events and
instructions, eliminating the overhead required to do so in software. This allows the use of a hardware virtualization (HV)
virtual machine monitor (VMM) as opposed to a binary translation (BT) VMM.

Hardware-Assisted MMU Virtualization (Intel EPT and AMD RVI) More recent processors also include second generation
hardware virtualization assistance that addresses the overheads due to memory management unit (MMU) virtualization by
providing hardware support to virtualize the MMU. ESXi supports this feature both in AMD processors, where it is called
rapid virtualization indexing (RVI) or nested page tables (NPT), and in Intel processors, where it is called extended page
tables (EPT).

Hardware-assisted MMU virtualization allows an additional level of page tables that map guest physical memory to host
physical memory addresses, eliminating the need for ESXi to maintain shadow page tables. This reduces memory
consumption and speeds up workloads that cause guest operating systems to frequently modify page tables. While
hardware-assisted MMU virtualization improves the performance of the vast majority of workloads, it does increase the
time required to service a TLB miss, thus potentially reducing the performance of workloads that stress the TLB.

Hardware-Assisted I/O MMU Virtualization (VT-d and AMD-Vi) An even newer processor feature is an I/O memory
management unit that remaps I/O DMA transfers and device interrupts. This can allow virtual machines to have direct
access to hardware I/O devices, such as network cards, storage controllers (HBAs) and GPUs. In AMD processors this feature
is called AMD I/O Virtualization (AMD-Vi or IOMMU) and in Intel processors the feature is called Intel Virtualization Technology for Directed
I/O (VT-d).

Hardware Storage Considerations

Storage performance is a vast topic that depends on workload, hardware, vendor, RAID level, cache size,
stripe size, and so on. Consult the appropriate documentation from VMware as well as the storage vendor.

Many workloads are very sensitive to the latency of I/O operations. It is therefore important to have storage
devices configured correctly.

VMware Storage vMotion performance is heavily dependent on the available storage infrastructure bandwidth.

Consider choosing storage hardware that supports VMware vStorage APIs for Array Integration (VAAI).
VAAI can improve storage scalability by offloading some operations to the storage hardware instead of
performing them in ESXi.

On SANs, VAAI offers the following features:

Hardware-accelerated cloning (sometimes called full copy or copy offload) frees resources on the host and can speed up
workloads that rely on cloning, such as Storage vMotion.
Block zeroing speeds up creation of eager-zeroed thick disks and can improve first-time write performance on lazy-zeroed
thick disks and on thin disks.
Scalable lock management (sometimes called atomic test and set, or ATS) can reduce locking-related overheads, speeding
up thin-disk expansion as well as many other administrative and file system-intensive tasks. This helps improve the
scalability of very large deployments by speeding up provisioning operations like boot storms, expansion of thin disks,

snapshots, and other tasks.


Thin provision UNMAP allows ESXi to return no-longer-needed thin-provisioned disk space to the storage hardware for
reuse.



On NAS devices, VAAI offers the following features:

Hardware-accelerated cloning (sometimes called full copy or copy offload) frees resources on the host and can speed
up workloads that rely on cloning. (Note that Storage vMotion does not make use of this feature on NAS devices.)

Space reservation allows ESXi to fully pre-allocate space for a virtual disk at the time the virtual disk is created. Thus, in
addition to the thin provisioning and eager-zeroed thick provisioning options that non-VAAI NAS devices support, VAAI NAS
devices also support lazy-zeroed thick provisioning.

Though the degree of improvement is dependent on the storage hardware, VAAI can reduce storage latency for several types of
storage operations, can reduce the ESXi host CPU utilization for storage operations, and can reduce storage network traffic.

Performance design for a storage network must take into account the physical constraints of the network, not logical
allocations. Using VLANs or VPNs does not provide a suitable solution to the problem of link oversubscription in shared
configurations. VLANs and other virtual partitioning of a network provide a way of logically configuring a network, but don't
change the physical capabilities of links and trunks between switches.

If you have heavy disk I/O loads, you might need to assign separate storage processors (SPs) to separate systems to handle
the amount of traffic bound for storage.

To optimize storage array performance, spread I/O loads over the available paths to the storage (that is, across multiple
host bus adapters (HBAs) and storage processors).

Make sure that end-to-end Fibre Channel speeds are consistent to help avoid performance problems. For more
information, see KB article 1006602.

Configure the maximum queue depth for Fibre Channel HBA cards (see the example after this list). For additional information see VMware KB article 1267.

Applications or systems that write large amounts of data to storage, such as data acquisition or transaction logging systems,
should not share Ethernet links to a storage device with other applications or systems. These types of applications perform
best with dedicated connections to storage devices.

For iSCSI and NFS, make sure that your network topology does not contain Ethernet bottlenecks, where multiple links are
routed through fewer links, potentially resulting in oversubscription and dropped network packets. Any time a number of
links transmitting near capacity are switched to a smaller number of links, such oversubscription is a possibility.

Recovering from these dropped network packets results in large performance degradation. In addition to time spent
determining that data was dropped, the retransmission uses network bandwidth that could otherwise be used for new
transactions. For iSCSI and NFS, if the network switch deployed for the data path supports VLAN, it might be beneficial to
create a VLAN just for the ESXi host's vmknic and the iSCSI/NFS server. This minimizes network interference from other
packet sources.

Be aware that with software-initiated iSCSI and NFS the network protocol processing takes place on the host system, and
thus these might require more CPU resources than other storage options.

Local storage performance might be improved with write-back cache. If your local storage has write-back cache installed,
make sure it's enabled and contains a functional battery module. For more information, see KB article 1006602.
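As an example of the Fibre Channel queue depth setting referenced above, the module parameter for a QLogic HBA might be set as follows; the module and parameter names are driver-specific and the value 64 is illustrative only, so consult KB 1267 for your HBA:
# esxcli system module parameters set -m qla2xxx -p ql2xmaxqdepth=64
# esxcli system module parameters list -m qla2xxx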

Hardware Networking Considerations

Before undertaking any network optimization effort, you should understand the physical aspects of the network. The following are
just a few aspects of the physical layout that merit close consideration:
Consider using server-class network interface cards (NICs) for the best performance.

Make sure the network infrastructure between the source and destination NICs doesn't introduce bottlenecks. For
example, if both NICs are 10 Gigabit, make sure all cables and switches are capable of the same speed and that the switches
are not configured to a lower speed.


For the best networking performance, we recommend the use of network adapters that support the following hardware features:
Checksum offload
TCP segmentation offload (TSO)
Ability to handle high-memory DMA (that is, 64-bit DMA addresses)
Ability to handle multiple Scatter Gather elements per Tx frame
Jumbo frames (JF)
Large receive offload (LRO)

On some 10 Gigabit Ethernet hardware network adapters, ESXi supports NetQueue, a technology that significantly improves
performance of 10Gigabit Ethernet network adapters in virtualized environments.

In addition to the PCI and PCI-X bus architectures, we now have the PCI Express (PCIe) architecture. Ideally single-port 10 Gigabit
Ethernet network adapters should use PCIe x8 (or higher) or PCI-X 266 and dual-port 10 Gigabit Ethernet network adapters should
use PCIe x16 (or higher). There should preferably be no bridge chip (e.g., PCI-X to PCIe or PCIe to PCI-X) in the path to the actual
Ethernet device (including any embedded bridge chip on the device itself), as these chips can reduce performance.

Multiple physical network adapters between a single virtual switch (vSwitch) and the physical network constitute a NIC team. NIC
teams can provide passive failover in the event of hardware failure or network outage and, in some configurations, can increase
performance by distributing the traffic across those physical network adapters.

Hardware BIOS Settings, General BIOS Settings

Make sure you are running the latest version of the BIOS available for your system.
Make sure the BIOS is set to enable all populated processor sockets and to enable all cores in each socket.
Enable Turbo Boost in the BIOS if your processors support it.
Make sure hyper-threading is enabled in the BIOS for processors that support it.
Some NUMA-capable systems provide an option in the BIOS to disable NUMA by enabling node interleaving. In most cases
you will get the best performance by disabling node interleaving (in other words, leaving NUMA enabled).
Make sure any hardware-assisted virtualization features (VT-x, AMD-V, EPT, RVI, and so on) are enabled in the BIOS.
Disable from within the BIOS any devices you won't be using. This might include, for example, unneeded serial, USB, or
network ports.

Cache prefetching mechanisms (sometimes called DPL Prefetch, Hardware Prefetcher, L2 Streaming Prefetch, or Adjacent
Cache Line Prefetch) usually help performance, especially when memory access patterns are regular. When running
applications that access memory randomly, however, disabling these mechanisms might result in improved performance.

If the BIOS allows the memory scrubbing rate to be configured, we recommend leaving it at the manufacturer's default
setting.


Power Management BIOS Settings

VMware ESXi includes a full range of host power management capabilities in the software that can save power when a host is not
fully utilized.
We recommend that you configure your BIOS settings to allow ESXi the most flexibility in using (or not using) the power
management features offered by your hardware, then make your power-management choices within ESXi.

In order to allow ESXi to control CPU power-saving features, set power management in the BIOS to OS Controlled Mode or
equivalent. Even if you don't intend to use these power-saving features, ESXi provides a convenient way to manage them.

Availability of the C1E halt state typically provides a reduction in power consumption with little or no impact on performance. When
Turbo Boost is enabled, the availability of C1E can sometimes even increase the performance of certain single-threaded
workloads. We therefore recommend that you enable C1E in BIOS.

However, for a very few workloads that are highly sensitive to I/O latency, especially those with low CPU utilization, C1E can reduce
performance. In these cases, you might obtain better performance by disabling C1E in BIOS, if that option is available.

C-states deeper than C1/C1E (i.e., C3, C6) allow further power savings, though with an increased chance of performance impacts. We
recommend, however, that you enable all C-states in BIOS, then use ESXi host power management to control their use.

ESXi and Virtual Machines

ESXi General Considerations

This subsection provides guidance regarding a number of general performance considerations in ESXi.

Plan your deployment by allocating enough resources for all the virtual machines you will run, as well as those needed by
ESXi itself.

Allocate to each virtual machine only as much virtual hardware as that virtual machine requires. Provisioning a virtual
machine with more resources than it requires can, in some cases, reduce the performance of that virtual machine as well as
other virtual machines sharing the same host.

Disconnect or disable any physical hardware devices that you will not be using. These might include devices such as:
o COM ports
o LPT ports
o USB controllers
o Floppy drives
o Optical drives (that is, CD or DVD drives)
o Network interfaces
o Storage controllers

Disabling hardware devices (typically done in BIOS) can free interrupt resources. Additionally, some devices, such as USB
controllers, operate on a polling scheme that consumes extra CPU resources. Lastly, some PCI devices reserve blocks of
memory, making that memory unavailable to ESXi.

Unused or unnecessary virtual hardware devices can impact performance and should be disabled. For example, Windows
guest operating systems poll optical drives (that is, CD or DVD drives) quite frequently. When virtual machines are
configured to use a physical drive, and multiple guest operating systems simultaneously try to access that drive,
performance could suffer. This can be reduced by configuring the virtual machines to use ISO images instead of physical
drives, and can be avoided entirely by disabling optical drives in virtual machines when the devices are not needed.

ESXi 5.0 introduces virtual hardware version 8. By creating virtual machines using this hardware version, or upgrading
existing virtual machines to this version, a number of additional capabilities become available. Some of these, such as
support for virtual machines with up to 1TB of RAM and up to 32 vCPUs, support for virtual NUMA, and support for 3D
graphics, can improve performance for some workloads. This hardware version is not compatible with versions of ESXi prior
to 5.0, however, and thus if a cluster of ESXi hosts will contain some hosts running pre-5.0 versions of ESXi, the virtual
machines running on hardware version 8 will be constrained to run only on the ESXi 5.0 hosts. This could limit vMotion
choices for Distributed Resource Scheduling (DRS) or Distributed Power Management (DPM).

ESXi CPU Considerations

CPU virtualization adds varying amounts of overhead depending on the percentage of the virtual machine's workload that can be
executed on the physical processor as is and the cost of virtualizing the remainder of the workload:

For many workloads, CPU virtualization adds only a very small amount of overhead, resulting in performance essentially
comparable to native.

Many workloads to which CPU virtualization does add overhead are not CPU-bound; that is, most of their time is spent
waiting for external events such as user interaction, device input, or data retrieval, rather than executing instructions.
Because otherwise-unused CPU cycles are available to absorb the virtualization overhead, these workloads will typically
have throughput similar to native, but potentially with a slight increase in latency.

For a small percentage of workloads, for which CPU virtualization adds overhead and which are CPU-bound, there might be
a noticeable degradation in both throughput and latency.

If an ESXi host becomes CPU saturated (that is, the virtual machines and other loads on the host demand all the CPU
resources the host has), latency sensitive workloads might not perform well. In this case you might want to reduce the CPU
load, for example by powering off some virtual machines or migrating them to a different host (or allowing DRS to migrate
them automatically).

It is a good idea to periodically monitor the CPU usage of the host. This can be done through the vSphere Client or by using
esxtop or resxtop. Below we describe how to interpret esxtop data:
o If the load average on the first line of the esxtop CPU panel is equal to or greater than 1, this indicates that the
system is overloaded.
o The usage percentage for the physical CPUs on the PCPU line can be another indication of a possibly overloaded
condition. In general, 80% usage is a reasonable ceiling and 90% should be a warning that the CPUs are
approaching an overloaded condition. However organizations will have varying standards regarding the desired
load percentage.

Configuring a virtual machine with more virtual CPUs (vCPUs) than its workload can use might cause slightly increased
resource usage, potentially impacting performance on very heavily loaded systems. Common examples of this include a
single-threaded workload running in a multiple-vCPU virtual machine or a multi-threaded workload in a virtual machine
with more vCPUs than the workload can effectively use.

Most guest operating systems execute an idle loop during periods of inactivity. Within this loop, most of these guest
operating systems halt by executing the HLT or MWAIT instructions. Some older guest operating systems (including
Windows 2000 (with certain HALs), Solaris 8 and 9, and MS-DOS), however, use busy-waiting within their idle loops. This
results in the consumption of resources that might otherwise be available for other uses (other virtual machines, the
VMkernel, and so on). ESXi automatically detects these loops and de-schedules the idle vCPU. Though this reduces the CPU
overhead, it can also reduce the performance of some I/O-heavy workloads. For additional information see VMware KB
articles 1077 and 2231.

The guest operating system's scheduler might migrate a single-threaded workload amongst multiple vCPUs, thereby losing
cache locality.



UP vs. SMP HALs/Kernels

NOTE When changing an existing virtual machine running Windows from multi-core to single-core the HAL usually remains SMP. For
best performance, the HAL should be manually changed back to UP.

Hyper-Threading

Hyper-threading technology (sometimes also called simultaneous multithreading, or SMT) allows a single physical processor core to
behave like two logical processors, essentially allowing two independent threads to run simultaneously.

Unlike having twice as many processor cores, which can roughly double performance, hyper-threading can provide anywhere from a
slight to a significant increase in system performance by keeping the processor pipeline busier.

If the hardware and BIOS support hyper-threading, ESXi automatically makes use of it. For the best performance we recommend
that you enable hyper-threading.

Be careful when using CPU affinity on systems with hyper-threading. Because the two logical processors share most of the processor
resources, pinning vCPUs, whether from different virtual machines or from a single SMP virtual machine, to both logical processors
on one core (CPUs 0 and 1, for example) could cause poor performance.

ESXi provides configuration parameters for controlling the scheduling of virtual machines on hyper-threaded systems (Edit virtual
machine settings > Resources tab > Advanced CPU). When choosing hyper-threaded core sharing choices, the Any option (which is
the default) is almost always preferred over None.


The None option indicates that when a vCPU from this virtual machine is assigned to a logical processor, no other vCPU, whether
from the same virtual machine or from a different virtual machine, should be assigned to the other logical processor that resides on
the same core. That is, each vCPU from this virtual machine should always get a whole core to itself and the other logical CPU on
that core should be placed in the halted state.

This option is like disabling hyper-threading for that one virtual machine. For nearly all workloads, custom hyper-threading settings
are not necessary. In cases of unusual workloads that interact badly with hyper-threading, however, choosing the None hyper-
threading option might help performance. For example, even though the ESXi scheduler tries to dynamically run higher-priority
virtual machines on a whole core for longer durations, you can further isolate a high-priority virtual machine from interference by
other virtual machines by setting its hyper-threading sharing property to None.

Non-Uniform Memory Access (NUMA)

By default, ESXi NUMA scheduling and related optimizations are enabled only on systems with a total of at least four CPU cores and
with at least two CPU cores per NUMA node.
On such systems, virtual machines can be separated into the following two categories:

Virtual machines with a number of vCPUs equal to or less than the number of cores in each physical NUMA node. These
virtual machines will be assigned to cores all within a single NUMA node and will be preferentially allocated memory local
to that NUMA node. This means that, subject to memory availability, all their memory accesses will be local to that NUMA
node, resulting in the lowest memory access latencies.
Virtual machines with more vCPUs than the number of cores in each physical NUMA node (called "wide" virtual machines).
These virtual machines will be assigned to two (or more) NUMA nodes and will be preferentially allocated memory local to
those NUMA nodes. Because vCPUs in these wide virtual machines might sometimes need to access memory outside their
own NUMA node, they might experience higher average memory access latencies than virtual machines that fit entirely
within a NUMA node (see the sketch below).
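
The following minimal sketch illustrates this classification; the core counts are hypothetical and the logic is only a simplified model of the placement decision, not the ESXi NUMA scheduler itself.

def classify_numa_fit(vcpus, cores_per_numa_node):
    # VMs that fit within one node get local memory and the lowest latencies.
    if vcpus <= cores_per_numa_node:
        return "fits in a single NUMA node (memory kept local)"
    # Otherwise the VM is "wide" and spans two or more NUMA nodes.
    return "wide virtual machine (may see higher average memory latency)"

print(classify_numa_fit(4, 6))   # fits in a single NUMA node
print(classify_numa_fit(8, 6))   # wide virtual machine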

Host Power Management in ESXi

Power Policy Options in ESXi
ESXi 5.0 offers the following power policy options:
High performance - This power policy maximizes performance, using no power management features.
Balanced - This power policy (the default in ESXi 5.0) is designed to reduce host power consumption while having little or no
impact on performance.
Low power - This power policy is designed to more aggressively reduce host power consumption at the risk of reduced
performance.
Custom - This power policy starts out the same as Balanced, but allows for the modification of individual parameters.

While the default power policy in ESX/ESXi 4.1 was High performance, in ESXi 5.0 the default is now Balanced. This power policy will
typically not impact the performance of CPU-intensive workloads. Rarely, however, the Balanced policy might slightly reduce the
performance of latency-sensitive workloads. In these cases, selecting the High performance power policy will provide the full
hardware performance.
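
The power policy can also be set programmatically. The following is a hedged pyVmomi sketch, assuming the HostPowerSystem managed object and its ConfigurePowerPolicy method behave as in the vSphere 5.0 API; the host name, credentials, and the "static" short name (assumed here to correspond to High performance) are placeholders and should be verified against the policies the host actually reports.

from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="esxi.example.com", user="root", pwd="password")  # placeholders
content = si.RetrieveContent()
host = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.HostSystem], True).view[0]

power = host.configManager.powerSystem
for policy in power.capability.availablePolicy:
    print(policy.key, policy.shortName)  # list what the host actually offers

# Assumed mapping: shortName "static" corresponds to the High performance policy.
wanted = next((p for p in power.capability.availablePolicy
               if p.shortName == "static"), None)
if wanted:
    power.ConfigurePowerPolicy(wanted.key)

Disconnect(si)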


ESXi Memory Considerations

Memory Overhead
Virtualization causes an increase in the amount of physical memory required due to the extra memory needed by ESXi for its own
code and for data structures. This additional memory requirement can be separated into two components:

1. A system-wide memory space overhead for the VMkernel and various host agents (hostd, vpxa, etc.).
2. An additional memory space overhead for each virtual machine.

The per-virtual-machine memory space overhead can be further divided into the following categories:

Memory reserved for the virtual machine executable (VMX) process. This is used for data structures needed to bootstrap
and support the guest (i.e., thread stacks, text, and heap).

Memory reserved for the virtual machine monitor (VMM). This is used for data structures required by the virtual hardware
(i.e., TLB, memory mappings, and CPU state).
Memory reserved for various virtual devices (i.e., mouse, keyboard, SVGA, USB, etc.)
Memory reserved for other subsystems, such as the kernel, management agents, etc.


The amounts of memory reserved for these purposes depend on a variety of factors, including the number of vCPUs, the configured
memory for the guest operating system, whether the guest operating system is 32-bit or 64-bit, and which features are enabled for
the virtual machine. For more information about these overheads, see vSphere Resource Management.

Memory Sizing

You should allocate enough memory to hold the working set of applications you will run in the virtual machine, thus
minimizing thrashing.
You should also avoid over-allocating memory. Allocating more memory than needed unnecessarily increases the virtual
machine memory overhead, thus consuming memory that could be used to support more virtual machines.


Memory Overcommit Techniques

ESXi uses five memory management mechanisms (page sharing, ballooning, memory compression, swap to host cache, and regular
swapping) to dynamically reduce the amount of machine physical memory required for each virtual machine.

Page Sharing: ESXi uses a proprietary technique to transparently and securely share memory pages between virtual
machines, thus eliminating redundant copies of memory pages. In most cases, page sharing is used by default regardless of
the memory demands on the host system. (The exception is when using large pages, as discussed in Large Memory Pages
for Hypervisor and Guest Operating System on page 28.)

Ballooning: If the virtual machine's memory usage approaches its memory target, ESXi will use ballooning to reduce that
virtual machine's memory demands. Using a VMware-supplied vmmemctl module installed in the guest operating system as
part of the VMware Tools suite, ESXi can cause the guest operating system to relinquish the memory pages it considers least
valuable. Ballooning provides performance closely matching that of a native system under similar memory constraints. To
use ballooning, the guest operating system must be configured with sufficient swap space.

Memory Compression: If the virtual machine's memory usage approaches the level at which host-level swapping will be
required, ESXi will use memory compression to reduce the number of memory pages it will need to swap out. Because the
decompression latency is much smaller than the swap-in latency, compressing memory pages has significantly less impact
on performance than swapping out those pages.

Swap to Host Cache: If memory compression doesn't keep the virtual machine's memory usage low enough, ESXi will next
forcibly reclaim memory using host-level swapping to a host cache (if one has been configured). Swap to host cache is a
new feature in ESXi 5.0 that allows users to configure a special swap cache on SSD storage. In most cases this host cache
(being on SSD) will be much faster than the regular swap files (typically on hard disk storage), significantly reducing access
latency. Thus, although some of the pages ESXi swaps out might be active, swap to host cache has a far lower performance
impact than regular host-level swapping.

Regular Swapping: If the host cache becomes full, or if a host cache has not been configured, ESXi will next reclaim memory
from the virtual machine by swapping out pages to a regular swap file. Like swap to host cache, some of the pages ESXi
swaps out might be active. Unlike swap to host cache, however, this mechanism can cause virtual machine performance to
degrade significantly due to its high access latency.

While ESXi uses page sharing, ballooning, memory compression, and swap to host cache to allow significant memory over-
commitment, usually with little or no impact on performance, you should avoid overcommitting memory to the point that active
memory pages are swapped out with regular host-level swapping.

In the vSphere Client, select the virtual machine in question, select the Performance tab, then look at the value of Memory Balloon
(Average). An absence of ballooning suggests that ESXi is not under heavy memory pressure and thus memory over commitment is
not affecting the performance of that virtual machine.

In the vSphere Client, select the virtual machine in question, select the Performance tab, then compare the values of Consumed
Memory and Active Memory. If consumed is higher than active, this suggests that the guest is currently getting all the memory it
requires for best performance.

In the vSphere Client, select the virtual machine in question, select the Performance tab, then look at the values of Swap-In and
Decompress. Swapping in and decompressing at the host level indicate more significant memory pressure.

Check for guest operating system swap activity within that virtual machine. This can indicate that ballooning might be starting to
impact performance, though swap activity can also be related to other issues entirely within the guest (or can be an indication that
the guest memory size is simply too small).
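
The checks above can be summarized as a simple triage, sketched below; the metric names, ordering, and thresholds are illustrative rather than exact vSphere counter names.

def memory_pressure(ballooned_mb, consumed_mb, active_mb, swap_in_kbps, decompress_kbps):
    # Host-level swap-in or decompression indicates the most significant pressure.
    if swap_in_kbps > 0 or decompress_kbps > 0:
        return "significant pressure: host-level swapping/decompression observed"
    # Ballooning alone suggests moderate pressure; check guest swap activity next.
    if ballooned_mb > 0:
        return "moderate pressure: ballooning active, check guest swap activity"
    # Consumed >= active suggests the guest is getting all the memory it requires.
    if consumed_mb >= active_mb:
        return "no pressure: guest appears to have the memory it requires"
    return "indeterminate: review guest-level metrics"

print(memory_pressure(0, 8192, 4096, 0, 0))  # no pressure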

Memory Swapping Optimizations

Because ESXi uses page sharing, ballooning, and memory compression to reduce the need for host-level memory swapping, don't
disable these techniques.

If you choose to overcommit memory with ESXi, be sure you have sufficient swap space on your ESXi system. At the time a virtual
machine is first powered on, ESXi creates a swap file for that virtual machine equal in size to the difference between the virtual
machine's configured memory size and its memory reservation. The available disk space must therefore be at least this large (plus
the space required for VMX swap, as described in Memory Overhead on page 25).
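
A quick sketch of the swap file sizing rule described above; the memory values (in MB) are hypothetical.

def vswp_size_mb(configured_mem_mb, reservation_mb):
    # ESXi creates a per-VM swap file equal to configured memory minus the reservation.
    return max(configured_mem_mb - reservation_mb, 0)

print(vswp_size_mb(16384, 4096))  # a 16 GB VM with a 4 GB reservation -> 12288 MB .vswp file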

You can optionally configure a special host cache on an SSD (if one is installed) to be used for the new swap to host cache feature.
NOTE Placing the regular swap file in SSD and using swap to host cache in SSD (as described above) are two different approaches to
improving host swapping performance. Because it is unusual to have enough SSD space for a host's entire swap file needs, we
recommend using local SSD for swap to host cache.

If you can't use SSD storage, place the regular swap file on the fastest available storage. This might be a Fibre Channel SAN array or a
fast local disk.
Placing swap files on local storage (whether SSD or hard drive) could potentially reduce vMotion performance. This is because if a
virtual machine has memory pages in a local swap file, they must be swapped in to memory before a vMotion operation on that
virtual machine can proceed.

Regardless of the storage type or location used for the regular swap file, for the best performance, and to avoid the possibility of
running out of space, swap files should not be placed on thin-provisioned storage.

Large Memory Pages for Hypervisor and Guest Operating System

In addition to the usual 4KB memory pages, ESXi also provides 2MB memory pages (commonly referred to as large pages). By
default ESXi assigns these 2MB machine memory pages to guest operating systems that request them, giving the guest operating
system the full advantage of using large pages. The use of large pages results in reduced memory management overhead and can
therefore increase hypervisor performance.

If an operating system or application can benefit from large pages on a native system, that operating system or application can
potentially achieve a similar performance improvement on a virtual machine backed with 2MB machine memory pages.

Use of large pages can also change page sharing behavior. While ESXi ordinarily uses page sharing regardless of memory demands, it
does not share large pages. Therefore with large pages, page sharing might not occur until memory over-commitment is high
enough to require the large pages to be broken into small pages.

ESXi Storage Considerations

VMware vStorage APIs for Array Integration (VAAI)

For the best storage performance, consider using VAAI-capable storage hardware. The performance gains from VAAI (described in
Hardware Storage Considerations on page 11) can be especially noticeable in VDI environments (where VAAI can improve boot-
storm and desktop workload performance), large data centers (where VAAI can improve the performance of mass virtual machine
provisioning and of thin-provisioned virtual disks), and in other large-scale deployments.

LUN Access Methods, Virtual Disk Modes, and Virtual Disk Types

You can use RDMs in virtual compatibility mode or physical compatibility mode:
Virtual mode specifies full virtualization of the mapped device, allowing the guest operating system to treat the RDM like
any other virtual disk file in a VMFS volume.

Physical mode specifies minimal SCSI virtualization of the mapped device, allowing the greatest flexibility for SAN
management software or other SCSI target-based software running in the virtual machine.

ESXi supports multiple virtual disk types:
Thick: Thick virtual disks, which have all their space allocated at creation time, are further divided into two types: eager zeroed and
lazy zeroed.

Eager-zeroed: An eager-zeroed thick disk has all space allocated and zeroed out at the time of creation. This increases the
time it takes to create the disk, but results in the best performance, even on the first write to each block.

Lazy-zeroed: A lazy-zeroed thick disk has all space allocated at the time of creation, but each block is zeroed only on first
write. This results in a shorter creation time, but reduced performance the first time a block is written to. Subsequent
writes, however, have the same performance as on eager-zeroed thick disks.

Thin: Space required for a thin-provisioned virtual disk is allocated and zeroed upon first write, as opposed to upon
creation. There is a higher I/O cost (similar to that of lazy-zeroed thick disks) during the first write to an unwritten file block,
but on subsequent writes thin-provisioned disks have the same performance as eager-zeroed thick disks.

Partition Alignment

The alignment of file system partitions can impact performance. VMware makes the following recommendations for VMFS
partitions:
Like other disk-based filesystems, VMFS filesystems suffer a performance penalty when the partition is unaligned. Using the vSphere
Client to create VMFS partitions avoids this problem since, beginning with ESXi 5.0, it automatically aligns VMFS3 or VMFS5
partitions along the 1MB boundary.

SAN Multipathing

By default, ESXi uses the Most Recently Used (MRU) path policy for devices on Active/Passive storage arrays. Do not use Fixed path
policy for Active/Passive storage arrays to avoid LUN path thrashing.

NOTE With some Active/Passive storage arrays that support ALUA (described below) ESXi can use Fixed path policy without risk of
LUN path thrashing.

By default, ESXi uses the Fixed path policy for devices on Active/Active storage arrays. When using this policy you can maximize the
utilization of your bandwidth to the storage array by designating preferred paths to each LUN through different storage controllers.
For more information, see the VMware SAN Configuration Guide.

In addition to the Fixed and MRU path policies, ESXi can also use the Round Robin path policy, which can improve storage
performance in some environments. Round Robin policy provides load balancing by cycling I/O requests through all Active paths,
sending a fixed (but configurable) number of I/O requests through each one in turn.

If your storage array supports ALUA (Asymmetric Logical Unit Access), enabling this feature on the array can improve storage
performance in some environments. ALUA, which is automatically detected by ESXi, allows the array itself to designate paths as
Active Optimized. When ALUA is combined with the Round Robin path policy, ESXi cycles I/O requests through these Active
Optimized paths.

Storage I/O Resource Allocation

VMware vSphere provides mechanisms to dynamically allocate storage I/O resources, allowing critical workloads to maintain their
performance even during peak load periods when there is contention for I/O resources. This allocation can be performed at the level
of the individual host or for an entire datastore.

The storage I/O resources available to an ESXi host can be proportionally allocated to the virtual machines running on that
host by using the vSphere Client to set disk shares for the virtual machines (select Edit virtual machine settings, choose the
Resources tab, select Disk, then change the Shares field).

The maximum storage I/O resources available to each virtual machine can be set using limits. These limits, set in I/O
operations per second (IOPS), can be used to provide strict isolation and control on certain workloads. By default, these are
set to unlimited. When set to any other value, ESXi enforces the limits even if the underlying datastores are not fully
utilized.

An entire datastore's I/O resources can be proportionally allocated to the virtual machines accessing that datastore using
Storage I/O Control (SIOC). When enabled, SIOC evaluates the disk share values set for all virtual machines accessing a
datastore and allocates that datastore's resources accordingly. SIOC can be enabled using the vSphere Client (select a
datastore, choose the Configuration tab, click Properties... (at the far right), then under Storage I/O Control add a
checkmark to the Enabled box).


With SIOC disabled (the default), all hosts accessing a datastore get an equal portion of that datastore's resources. Any shares values
determine only how each host's portion is divided amongst its virtual machines.

With SIOC enabled, the disk shares are evaluated globally and the portion of the datastore's resources each host receives depends
on the sum of the shares of the virtual machines running on that host relative to the sum of the shares of all the virtual machines
accessing that datastore.
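
The difference between the two modes can be sketched as follows; the VM names, hosts, and share values are hypothetical, and the real SIOC algorithm is considerably more involved.

# Each VM is mapped to (host, shares); the result is each VM's fraction of the
# datastore's I/O resources under contention.
vms = {"vm1": ("hostA", 1000), "vm2": ("hostA", 1000), "vm3": ("hostB", 1000)}

def with_sioc(vms):
    # Shares are evaluated globally across all VMs accessing the datastore.
    total = sum(shares for _, shares in vms.values())
    return {name: shares / total for name, (_, shares) in vms.items()}

def without_sioc(vms):
    # Each host first gets an equal portion; shares only divide that host's portion.
    hosts = {host for host, _ in vms.values()}
    per_host = 1.0 / len(hosts)
    result = {}
    for host in hosts:
        host_vms = {n: s for n, (h, s) in vms.items() if h == host}
        host_total = sum(host_vms.values())
        result.update({n: per_host * s / host_total for n, s in host_vms.items()})
    return result

print(with_sioc(vms))     # every VM gets 1/3 of the datastore's resources
print(without_sioc(vms))  # vm3, alone on hostB, gets 1/2; vm1 and vm2 get 1/4 each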

General ESXi Storage Recommendations

I/O latency statistics can be monitored using esxtop (or resxtop), which reports device latency, time spent in the kernel, and latency
seen by the guest operating system.

Make sure that the average latency for storage devices is not too high. This latency can be seen in esxtop (or resxtop) by looking at
the GAVG/cmd metric. A reasonable upper value for this metric depends on your storage subsystem. If you use SIOC, you can use
your SIOC setting as a guide: your GAVG/cmd value should be well below your SIOC setting. The default SIOC setting is 30 ms, but
if you have very fast storage (SSDs, for example) you might have reduced that value. For further information on average latency see
VMware KB article 1008205.
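
A simple illustration of this check follows; the "well below" margin of one half used here is an assumption, not a documented threshold.

def latency_within_bounds(gavg_ms, sioc_congestion_threshold_ms=30):
    # Flag device latency that approaches the SIOC congestion threshold.
    return gavg_ms < sioc_congestion_threshold_ms / 2  # assumed margin of one half

print(latency_within_bounds(12))  # True
print(latency_within_bounds(25))  # False: investigate the storage path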

You can adjust the maximum number of outstanding disk requests per VMFS volume, which can help equalize the bandwidth across
virtual machines using that volume. For further information see VMware KB article 1268.

If you will not be using Storage I/O Control and often observe QFULL/BUSY errors, enabling and configuring queue depth throttling
might improve storage performance. This feature can significantly reduce the number of commands returned from the array with a
QFULL/BUSY error. If any system accessing a particular LUN or storage array port has queue depth throttling enabled, all systems
(both ESX hosts and other systems) accessing that LUN or storage array port should use an adaptive queue depth algorithm. Queue
depth throttling is not compatible with Storage DRS. For more information about both QFULL/BUSY errors and this feature see KB
article 1008113.

Running Storage Latency Sensitive Applications

By default the ESXi storage stack is configured to drive high storage throughput at low CPU cost. While this default configuration
provides better scalability and higher consolidation ratios, it comes at the cost of potentially higher storage latency. Applications
that are highly sensitive to storage latency might therefore benefit from the following:

Adjust the host power management settings:

Some of the power management features in newer server hardware can increase storage latency. Disable them as follows:

Set the ESXi host power policy to High performance (as described in Host Power Management in ESXi on page 23;
this is the preferred method) or disable power management in the BIOS (as described in Power Management BIOS
Settings on page 14).
Disable C1E and other C-states in BIOS (as described in Power Management BIOS Settings on page 14).

Enable Turbo Boost in BIOS (as described in General BIOS Settings on page 14).


ESXi Networking Considerations

In a native environment, CPU utilization plays a significant role in network throughput. To process higher levels of throughput, more
CPU resources are needed. The effect of CPU resource availability on the network throughput of virtualized applications is even
more significant. Because insufficient CPU resources will limit maximum throughput, it is important to monitor the CPU utilization of
high-throughput workloads.

Use separate virtual switches, each connected to its own physical network adapter, to avoid contention between the VMkernel and
virtual machines, especially virtual machines running heavy networking workloads.

To establish a network connection between two virtual machines that reside on the same ESXi system, connect both virtual
machines to the same virtual switch. If the virtual machines are connected to different virtual switches, traffic will go over the
physical network and incur unnecessary CPU and network overhead.


Network I/O Control (NetIOC)

Network I/O Control (NetIOC) allows the allocation of network bandwidth to network resource pools. You can either select from
among seven predefined resource pools (Fault Tolerance traffic, iSCSI traffic, vMotion traffic, management traffic, vSphere
Replication (VR) traffic, NFS traffic, and virtual machine traffic) or you can create user-defined resource pools. Each resource pool is
associated with a portgroup and, optionally, assigned a specific 802.1p priority level.

Network bandwidth can be allocated to resource pools using either shares or limits:
Shares can be used to allocate to a resource pool a proportion of a network link's bandwidth equivalent to the ratio of its shares to
the total shares. If a resource pool doesn't use its full allocation, the unused bandwidth is available for use by other resource pools.

Limits can be used to set a resource pool's maximum bandwidth utilization (in Mbps) from a host through a specific virtual
distributed switch (vDS). These limits are enforced even if a vDS is not saturated, potentially limiting a resource pool's bandwidth
while simultaneously leaving some bandwidth unused. On the other hand, if a resource pool's bandwidth utilization is less than its
limit, the unused bandwidth is available to other resource pools.

NetIOC can guarantee bandwidth for specific needs and can prevent any one resource pool from impacting the others.
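
The interaction of shares and limits can be sketched as follows; the pool names, share values, limit, and 10Gb link speed are assumptions, and the sketch does not model redistribution of bandwidth left unused by a capped pool.

link_mbps = 10000
pools = {"virtual machine": {"shares": 100},
         "vMotion":         {"shares": 50},
         "NFS":             {"shares": 50, "limit_mbps": 2000}}

total_shares = sum(p["shares"] for p in pools.values())
for name, p in pools.items():
    by_shares = link_mbps * p["shares"] / total_shares               # proportional entitlement
    allocation = min(by_shares, p.get("limit_mbps", float("inf")))   # a limit applies even if the link is not saturated
    print(name, round(allocation), "Mbps")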

DirectPath I/O

vSphere DirectPath I/O leverages Intel VT-d and AMD-Vi hardware support (described in Hardware-Assisted I/O MMU Virtualization
(VT-d and AMD-Vi) on page 10) to allow guest operating systems to directly access hardware devices. In the case of networking,
DirectPath I/O allows the virtual machine to access a physical NIC directly rather than using an emulated device (E1000) or a para-
virtualized device (VMXNET, VMXNET3). While DirectPath I/O provides limited increases in throughput, it reduces CPU cost for
networking-intensive workloads.

DirectPath I/O is not compatible with certain core virtualization features, however. This list varies with the hardware on which ESXi is
running:

New for vSphere 5.0, when ESXi is running on certain configurations of the Cisco Unified Computing System (UCS) platform,
DirectPath I/O for networking is compatible with vMotion, physical NIC sharing, snapshots, and suspend/resume. It is not compatible
with Fault Tolerance, NetIOC, memory overcommit, VMCI, or VMSafe.

For server hardware other than the Cisco UCS platform, DirectPath I/O is not compatible with vMotion, physical NIC sharing,
snapshots, suspend/resume, Fault Tolerance, NetIOC, memory overcommit, or VMSafe.

Typical virtual machines and their workloads don't require the use of DirectPath I/O. For workloads that are very networking
intensive and don't need the core virtualization features mentioned above, however, DirectPath I/O might be useful to reduce CPU
usage.


SplitRx Mode

SplitRx mode, a new feature in ESXi 5.0, uses multiple physical CPUs to process network packets received in a single network queue.
This feature can significantly improve network performance for certain workloads.
These workloads include:

Multiple virtual machines on one ESXi host all receiving multicast traffic from the same source. (SplitRx mode will typically
improve throughput and CPU efficiency for these workloads.)

Traffic via the vNetwork Appliance (DVFilter) API between two virtual machines on the same ESXi host. (SplitRx mode will
typically improve throughput and maximum packet rates for these workloads.)

This feature, which is supported only for VMXNET3 virtual network adapters, is individually configured for each virtual NIC using the
ethernetX.emuRxMode variable in each virtual machine's .vmx file (where X is replaced with the network adapter's ID). The possible
values for this variable are:

ethernetX.emuRxMode = "0" This value disables splitRx mode for ethernetX.
ethernetX.emuRxMode = "1" This value enables splitRx mode for ethernetX.

To change this variable through the vSphere Client (a scripted pyVmomi alternative is sketched after these steps):
1. Select the virtual machine you wish to change, then click Edit virtual machine settings.
2. Under the Options tab, select General, then click Configuration Parameters.
3. Look for ethernetX.emuRxMode (where X is the number of the desired NIC). If the variable isn't present, click Add Row and enter
it as a new variable.
4. Click on the value to be changed and configure it as you wish. The change will not take effect until the virtual machine has been
restarted.
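
The same change can be scripted. Below is a hedged pyVmomi sketch that adds ethernet0.emuRxMode = "1" to a VM's advanced configuration; the vCenter address, credentials, VM name, and NIC index are placeholders, certificate handling is omitted, and the setting still only takes effect after the virtual machine is power-cycled.

from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcenter.example.com", user="administrator", pwd="password")  # placeholders
content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(content.rootFolder, [vim.VirtualMachine], True)
vm = next(v for v in view.view if v.name == "my-vm")  # hypothetical VM name

# Add (or overwrite) the advanced setting for the first virtual NIC.
spec = vim.vm.ConfigSpec(
    extraConfig=[vim.option.OptionValue(key="ethernet0.emuRxMode", value="1")])
task = vm.ReconfigVM_Task(spec=spec)  # returns a Task object; wait on it as needed

Disconnect(si)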

Running Network Latency Sensitive Applications

By default the ESXi network stack is configured to drive high network throughput at low CPU cost. While this default configuration
provides better scalability and higher consolidation ratios, it comes at the cost of potentially higher network latency. Applications
that are highly sensitive to network latency might therefore benefit from the following:

Use VMXNET3 virtual network adapters
Adjust the host power management settings (Maximum Performance, disable C1E and other C-States, Enable Turbo Boost
in BIOS)
Disable VMXNET3 virtual interrupt coalescing for the desired NIC. In some cases this can improve performance for latency-
sensitive applications. In other cases (most notably applications with high numbers of outstanding network requests) it
can reduce performance.

Guest Operating Systems

Install the latest version of VMware Tools in the guest operating system.
Disable screen savers and Window animations in virtual machines.
Schedule backups and virus scanning programs in virtual machines to run at off-peak hours.
For the most accurate timekeeping, consider configuring your guest operating system to use NTP, Windows Time Service,
the VMware Tools time-synchronization option, or another timekeeping utility suitable for your operating system.

We recommend, however, that within any particular virtual machine you use either the VMware Tools time-
synchronization option or another timekeeping utility, but not both.


Measuring Performance in Virtual Machines

Be careful when measuring performance from within virtual machines.
Timing numbers measured from within virtual machines can be inaccurate, especially when the processor is overcommitted.

NOTE One possible approach to this issue is to use a guest operating system that has good timekeeping behavior when run in a
virtual machine, such as a guest that uses the NO_HZ kernel configuration option (sometimes called tickless timer). More
information about this topic can be found in Timekeeping in VMware Virtual Machines
(http://www.vmware.com/files/pdf/Timekeeping-In-VirtualMachines.pdf).

Measuring performance from within virtual machines can fail to take into account resources used by ESXi for tasks it offloads from
the guest operating system, as well as resources consumed by virtualization overhead.

Guest Operating System CPU Considerations

Many operating systems keep time by counting timer interrupts. The timer interrupt rates vary between different operating systems
and versions. For example:

Unpatched 2.4 and earlier Linux kernels typically request timer interrupts at 100 Hz (that is, 100 interrupts per second),
though this can vary with version and distribution.
Linux kernels have used a variety of timer interrupt rates, including 100 Hz, 250 Hz, and 1000 Hz, again varying with version
and distribution.

The most recent 2.6 Linux kernels introduce the NO_HZ kernel configuration option (sometimes called tickless timer) that
uses a variable timer interrupt rate.

Microsoft Windows operating system timer interrupt rates are specific to the version of Microsoft Windows and the
Windows HAL that is installed. Windows systems typically use a base timer interrupt rate of 64 Hz or 100 Hz.

Running applications that make use of the Microsoft Windows multimedia timer functionality can increase the timer
interrupt rate. For example, some multimedia applications or Java applications increase the timer interrupt rate to
approximately 1000 Hz.
In addition to the timer interrupt rate, the total number of timer interrupts delivered to a virtual machine also depends on a
number of other factors:

Virtual machines running SMP HALs/kernels (even if they are running on a UP virtual machine) require more timer
interrupts than those running UP HALs/kernels.

The more vCPUs a virtual machine has, the more interrupts it requires. Delivering many virtual timer interrupts negatively
impacts virtual machine performance and increases host CPU consumption. If you have a choice, use guest operating
systems that require fewer timer interrupts. For example:

If you have a UP virtual machine use a UP HAL/kernel.

In some Linux versions, such as RHEL 5.1 and later, the divider=10 kernel boot parameter reduces the timer interrupt
rate to one tenth its default rate. See VMware KB article 1006427 for further information.

Kernels with tickless-timer support (NO_HZ kernels) do not schedule periodic timers to maintain system time. As a result,
these kernels reduce the overall average rate of virtual timer interrupts, thus improving system performance and scalability
on hosts running large numbers of virtual machines.


Virtual NUMA (vNUMA)

Virtual NUMA (vNUMA), a new feature in ESXi 5.0, exposes NUMA topology to the guest operating system, allowing NUMA-
aware guest operating systems and applications to make the most efficient use of the underlying hardware's NUMA
architecture.

Virtual NUMA, which requires virtual hardware version 8, can provide significant performance benefits, though the benefits
depend heavily on the level of NUMA optimization in the guest operating system and applications.

You can obtain the maximum performance benefits from vNUMA if your clusters are composed entirely of hosts with
matching NUMA architecture.

This is because the very first time a vNUMA-enabled virtual machine is powered on, its vNUMA topology is set based in part
on the NUMA topology of the underlying physical host on which it is running. Once a virtual machine's vNUMA topology is
initialized it doesn't change unless the number of vCPUs in that virtual machine is changed. This means that if a vNUMA
virtual machine is moved to a host with a different NUMA topology, the virtual machine's vNUMA topology might no longer
be optimal for the underlying physical NUMA topology, potentially resulting in reduced performance.

Size your virtual machines so they align with physical NUMA boundaries. For example, if you have a host system with six
cores per NUMA node, size your virtual machines with a multiple of six vCPUs (i.e., 6 vCPUs, 12 vCPUs, 18 vCPUs, 24 vCPUs,
and so on).
NOTE Some multi-core processors have NUMA node sizes that are different than the number of cores per socket. For
example, some 12-core processors have two six-core NUMA nodes per processor.
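
A minimal sketch of this sizing check, with an assumed six-core NUMA node:

def numa_aligned(vcpus, cores_per_numa_node):
    # Aligned if the VM fits in one node or spans whole nodes (a multiple of the node size).
    return vcpus <= cores_per_numa_node or vcpus % cores_per_numa_node == 0

for count in (4, 6, 8, 12):
    print(count, "vCPUs:", "aligned" if numa_aligned(count, 6) else "not aligned")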


Guest Operating System Storage Considerations

The default virtual storage adapter in ESXi 5.0 is either BusLogic Parallel, LSI Logic Parallel, or LSI Logic SAS, depending on
the guest operating system and the virtual hardware version. However, ESXi also includes a paravirtualized SCSI storage
adapter, PVSCSI (also called VMware Paravirtual). The PVSCSI adapter offers a significant reduction in CPU utilization as well
as potentially increased throughput compared to the default virtual storage adapters, and is thus the best choice for
environments with very I/O-intensive guest applications.

The depth of the queue of outstanding commands in the guest operating system SCSI driver can significantly impact disk
performance. A queue depth that is too small, for example, limits the disk bandwidth that can be pushed through the
virtual machine.

In some cases large I/O requests issued by applications in a virtual machine can be split by the guest storage driver.
Changing the guest operating system's registry settings to issue larger block sizes can eliminate this splitting, thus
enhancing performance. For additional information see VMware KB article 9645697.

Make sure the disk partitions within the guest are aligned.


Guest Operating System Networking Considerations

The default virtual network adapter emulated in a virtual machine is either an AMD PCnet32 device (vlance) or an Intel E1000 device
(E1000). VMware also offers the VMXNET family of paravirtualized network adapters, however, that provide better performance
than these default adapters and should be used for optimal performance within any guest operating system for which they are
available.

For the best performance, use the VMXNET3 paravirtualized network adapter for operating systems in which it is
supported. This requires that the virtual machine use virtual hardware version 7 or later, and that VMware Tools be
installed in the guest operating system.

The VMXNET3, Enhanced VMXNET, and E1000 devices support jumbo frames for better performance. (Note that the vlance
device does not support jumbo frames.) To enable jumbo frames, set the MTU size to 9000 in both the guest network driver
and the virtual switch configuration. The physical NICs at both ends and all the intermediate hops/routers/switches must
also support jumbo frames.

In ESXi, TCP Segmentation Offload (TSO) is enabled by default in the VMkernel, but is supported in virtual machines only
when they are using the VMXNET3 device, the Enhanced VMXNET device, or the E1000 device. TSO can improve
performance even if the underlying hardware does not support TSO.

In some cases, low receive throughput in a virtual machine can be caused by insufficient receive buffers in the receiver
network device. If the receive ring in the guest operating system's network driver overflows, packets will be dropped in the
VMkernel, degrading network throughput. A possible workaround is to increase the number of receive buffers, though this
might increase the host physical CPU workload.

For VMXNET, the default number of receive and transmit buffers is 100 each, with the maximum possible being 128. For
Enhanced VMXNET, the default numbers of receive and transmit buffers are 150 and 256, respectively, with the maximum
possible receive buffers being 512. You can alter these settings by changing the buffer size defaults in the .vmx
(configuration) files for the affected virtual machines. For additional information see VMware KB article 1010071.

Receive-side scaling (RSS) allows network packet receive processing to be scheduled in parallel on multiple CPUs. Without
RSS, receive interrupts can be handled on only one CPU at a time. With RSS, received packets from a single NIC can be
processed on multiple CPUs concurrently. This helps receive throughput in cases where a single CPU would otherwise be
saturated with receive processing and become a bottleneck. To prevent out-of-order packet delivery, RSS schedules all of a
flow's packets to the same CPU.


Virtual Infrastructure Management

Use resource settings (that is, Reservation, Shares, and Limits) only if needed in your environment.

If you expect frequent changes to the total available resources, use Shares, not Reservation, to allocate resources fairly across virtual
machines. If you use Shares and you subsequently upgrade the hardware, each virtual machine stays at the same relative priority
(keeps the same number of shares) even though each share represents a larger amount of memory or CPU.

Use Reservation to specify the minimum acceptable amount of CPU or memory, not the amount you would like to have available.
After all resource reservations have been met, ESXi allocates the remaining resources based on the number of shares and the
resource limits configured for your virtual machine.
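
The allocation order described above can be sketched as follows; the VM names and values are hypothetical, and the sketch ignores details such as redistributing capacity left unused by a limited virtual machine.

def allocate(capacity_mhz, vms):
    # Reservations are satisfied first; the remainder is divided by shares, capped by limits.
    remaining = capacity_mhz - sum(v["reservation"] for v in vms.values())
    total_shares = sum(v["shares"] for v in vms.values())
    allocation = {}
    for name, v in vms.items():
        extra = remaining * v["shares"] / total_shares
        allocation[name] = min(v["reservation"] + extra, v.get("limit", float("inf")))
    return allocation

vms = {"web": {"reservation": 1000, "shares": 2000},
       "db":  {"reservation": 2000, "shares": 4000, "limit": 5000}}
print(allocate(10000, vms))  # web ~3333 MHz, db capped at its 5000 MHz limit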

When specifying the reservations for virtual machines, always leave some headroom for memory virtualization overhead and
migration overhead. In a DRS-enabled cluster, reservations that fully commit the capacity of the cluster or of individual hosts in the
cluster can prevent DRS from migrating virtual machines between hosts. As you approach fully reserving all capacity in the system, it
also becomes increasingly difficult to make changes to reservations and to the resource pool hierarchy without violating admission
control.

VMware vCenter

This section lists VMware vCenter practices and configurations recommended for optimal performance. It also includes a few
features that are controlled or accessed through vCenter.

The performance of vCenter Server is dependent in large part on the number of managed entities (hosts and virtual machines) and
the number of connected VMware vSphere Clients. Exceeding the maximums specified in Configuration Maximums for VMware
vSphere 5.0, in addition to being unsupported, is thus likely to impact vCenter Server performance.

Whether run on virtual machines or physical systems, make sure you provide vCenter Server and the vCenter Server database with
sufficient CPU, memory, and storage resources for your deployment size.

To minimize the latency of vCenter operations, keep to a minimum the number of network hops between the vCenter Server system
and the vCenter Server database.

Although VMware vCenter Update Manager can be run on the same system and use the same database as vCenter Server, for
maximum performance, especially on heavily-loaded vCenter systems, consider running Update Manager on its own system and
providing it with a dedicated database.

Similarly, VMware vCenter Converter can be run on the same system as vCenter Server, but doing so might impact performance,
especially on heavily-loaded vCenter systems.


VMware vCenter Database Considerations

VMware vCenter Database Network and Storage Considerations

To minimize the latency of operations between vCenter Server and the database, keep to a minimum the number of network hops
between the vCenter Server system and the database system.

The hardware on which the vCenter database is stored, and the arrangement of the files on that hardware, can have a significant
effect on vCenter performance:

The vCenter database performs best when its files are placed on high-performance storage.
The database data files generate mostly random read I/O traffic, while the database transaction logs generate mostly
sequential write I/O traffic. For this reason, and because their traffic is often significant and simultaneous, vCenter performs
best when these two file types are placed on separate storage resources that share neither disks nor I/O bandwidth.

VMware vCenter Database Configuration and Maintenance

Configure the vCenter statistics level to a setting appropriate for your uses. This setting can range from 1 to 4, but a setting of 1 is
recommended for most situations. Higher settings can slow the vCenter Server system. You can also selectively disable statistics
rollups for particular collection levels.

To avoid frequent log file switches, ensure that your vCenter database logs are sized appropriately for your vCenter inventory. For
example, with a large vCenter inventory running with an Oracle database, the size of each redo log should be at least 512MB.

vCenter Server starts up with a database connection pool of 50 threads. This pool is then dynamically sized, growing adaptively as
needed based on the vCenter Server workload, and does not require modification. However, if a heavy workload is expected on the
vCenter Server, the size of this pool at startup can be increased, with the maximum being 128 threads. Note that this might result in
increased memory consumption by vCenter Server and slower vCenter Server startup.

Update statistics of the tables and indexes on a regular basis for better overall performance of the database.

As part of the regular database maintenance activity, check the fragmentation of the index objects and recreate indexes if needed
(i.e., if fragmentation is more than about 30%).

Microsoft SQL Server Database Recommendations

If you are using a Microsoft SQL Server database, the following points can improve vCenter Server performance:
Setting the transaction logs to Simple recovery mode significantly reduces the database logs' disk space usage as well as their
storage I/O load. If it isn't possible to set this to Simple, make sure to have a high-performance storage subsystem.

To further improve database performance for large inventories, place tempDB on a different disk than either the database data files
or the database transaction logs.

We recommend a fill factor of about 70% for the four VPX_HIST_STAT tables (vpx_hist_stat1, vpx_hist_stat2, vpx_hist_stat3, and
vpx_hist_stat4). If the fill factor is set too high, the server must take time splitting pages when they fill up. If the fill factor is set too
low, the database will be larger than necessary due to the unused space on each page, thus increasing the number of pages that
need to be read during normal operations.

Oracle Database Recommendations

If you are using an Oracle database, the following points can improve vCenter Server performance:
When using Automatic Memory Management (AMM) in Oracle 11g, or Automatic Shared Memory Management (ASMM) in Oracle
10g, allocate sufficient memory for the Oracle database.

Set appropriate PROCESSES or SESSIONS initialization parameters. Oracle creates a new server process for every new connection
that is made to it. The number of connections an application can make to the Oracle instance thus depends on how many processes
Oracle can create. PROCESSES and SESSIONS together determine how many simultaneous connections Oracle can accept. In large
vSphere environments (as defined in vSphere Installation and Setup for vSphere 5.0) we recommend setting PROCESSES to 800.

If database operations are slow, after checking that the statistics are up to date and the indexes are not fragmented, you should
move the indexes to separate tablespaces (i.e., place tables and primary key (PK) constraint index on one tablespace and the other
indexes (i.e., BTree) on another tablespace).

For large inventories (i.e., those that approach the limits for the number of hosts or virtual machines), increase the
db_writer_processes parameter to 4.

VMware vMotion and Storage vMotion



VMware vMotion

ESXi 5.0 introduces virtual hardware version 8. Because virtual machines running on hardware version 8 can't run on prior versions
of ESX/ESXi, such virtual machines can be moved using VMware vMotion only to other ESXi 5.0 hosts. ESXi 5.0 is also compatible
with virtual machines running on virtual hardware version 7 and earlier, however, and these machines can be moved using VMware
vMotion to ESX/ESXi 4.x hosts.

vMotion performance will increase as additional network bandwidth is made available to the vMotion network. Consider
provisioning 10Gb vMotion network interfaces for maximum vMotion performance. Multiple vMotion vmknics, a new feature in ESXi
5.0, can provide a further increase in network bandwidth available to vMotion. All vMotion vmknics on a host should share a single
vSwitch. Each vmknic's portgroup should be configured to leverage a different physical NIC as its active vmnic. In addition, all
vMotion vmknics should be on the same vMotion network.
While a vMotion operation is in progress, ESXi opportunistically reserves CPU resources on both the source and destination hosts in
order to ensure the ability to fully utilize the network bandwidth. ESXi will attempt to use the full available network bandwidth
regardless of the number of vMotion operations being performed. The amount of CPU reservation thus depends on the number of
vMotion NICs and their speeds; 10% of a processor core for each 1Gb network interface, 100% of a processor core for each 10Gb
network interface, and a minimum total reservation of 30% of a processor core. Therefore leaving some unreserved CPU capacity in
a cluster can help ensure that vMotion tasks get the resources required in order to fully utilize available network bandwidth.
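
The reservation rule can be sketched numerically as below; treating each NIC as simply 1Gb or 10Gb is a simplification of the description above.

def vmotion_cpu_reservation_cores(nic_speeds_gbps):
    # 10% of a core per 1Gb NIC, 100% of a core per 10Gb NIC, minimum 30% of a core in total.
    per_nic = sum(0.1 if speed <= 1 else 1.0 for speed in nic_speeds_gbps)
    return max(per_nic, 0.3)

print(vmotion_cpu_reservation_cores([1]))       # 0.3 (the minimum reservation applies)
print(vmotion_cpu_reservation_cores([10, 10]))  # 2.0 cores reserved on source and destination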

vMotion performance could be reduced if host-level swap files are placed on local storage (whether SSD or hard drive).

VMware Storage vMotion

VMware Storage vMotion performance depends strongly on the available storage infrastructure bandwidth between the ESXi host
where the virtual machine is running and both the source and destination data stores.
During a Storage vMotion operation the virtual disk to be moved is being read from the source data store and written to the
destination data store. At the same time the virtual machine continues to read from and write to the source data store while also
writing to the destination data store.
This additional traffic takes place on storage that might also have other I/O loads (from other virtual machines on the same ESXi host
or from other hosts) that can further reduce the available bandwidth.

Storage vMotion will have the highest performance during times of low storage activity (when available storage bandwidth is
highest) and when the workload in the virtual machine being moved is least active. During a Storage vMotion operation, the benefits
of moving to a faster data store will be seen only when the migration has completed. However, the impact of moving to a slower
data store will gradually be felt as the migration progresses.

Storage vMotion will often have significantly better performance on VAAI-capable storage arrays.

VMware Distributed Resource Scheduler (DRS)

Cluster Configuration Settings

When deciding which hosts to group into DRS clusters, try to choose hosts that are as homogeneous as possible in terms of CPU and
memory. This improves performance predictability and stability. When heterogeneous systems have compatible CPUs, but have
different CPU frequencies and/or amounts of memory, DRS generally prefers to locate virtual machines on the systems with more
memory and higher CPU frequencies (all other things being equal), since those systems have more capacity to accommodate peak
loads.

VMware vMotion is not supported across hosts with incompatible CPUs. Hence, with heterogeneous systems that have incompatible
CPUs, the opportunities DRS has to improve the load balance across the cluster are limited. You can also use Enhanced vMotion
Compatibility (EVC) to facilitate vMotion between different CPU generations.

The more vMotion compatible ESXi hosts DRS has available, the more choices it has to better balance the DRS cluster.

Virtual machines with smaller memory sizes and/or fewer vCPUs provide more opportunities for DRS to migrate them in order to
improve balance across the cluster. Virtual machines with larger memory sizes and/or more vCPUs add more constraints in migrating

the virtual machines. This is one more reason to configure virtual machines with only as many vCPUs and only as much virtual
memory as they need.

Have virtual machines in DRS automatic mode when possible, as they are considered for cluster load balancing migrations across the
ESXi hosts before the virtual machines that are not in automatic mode.

Powered-on virtual machines consume memory resources, and typically some CPU resources, even when idle. Thus even
idle virtual machines, though their utilization is usually small, can affect DRS decisions. For this and other reasons, a marginal
performance increase might be obtained by shutting down or suspending virtual machines that are not being used.

Resource pools help improve manageability and troubleshooting of performance problems. We recommend, however, that resource
pools and virtual machines not be made siblings in a hierarchy. Instead, each level should contain only resource pools or only virtual
machines.

DRS affinity rules can keep two or more virtual machines on the same ESXi host (VM/VM affinity) or make sure they are always on
different hosts (VM/VM anti-affinity). DRS affinity rules can also be used to make sure a group of virtual machines runs only on (or
has a preference for) a specific group of ESXi hosts (VM/Host affinity) or never runs on (or has a preference against) a specific
group of hosts (VM/Host anti-affinity).

In most cases leaving the affinity settings unchanged will provide the best results. In rare cases, however, specifying affinity rules can
help improve performance. To change affinity settings, select a cluster from within the vSphere Client, choose the Summary tab,
click Edit Settings, choose Rules, click Add, enter a name for the new rule, choose a rule type, and proceed through the GUI as
appropriate for the rule type you selected.

Besides the default setting, the affinity setting types are:
Keep Virtual Machines Together: This affinity type can improve performance due to lower latencies of communication
between machines.
Separate Virtual Machines: This affinity type can maintain maximal availability of the virtual machines. For instance, if they
are both web server front ends to the same application, you might want to make sure that they don't both go down at the
same time. Also, co-location of I/O-intensive virtual machines could end up saturating the host I/O capacity, leading to
performance degradation. DRS currently does not make virtual machine placement decisions based on their I/O resource
usage.
Virtual Machines to Hosts (including Must run on..., Should run on..., Must not run on..., and Should not run on...): These
affinity types can be useful for clusters with software licensing restrictions or specific availability zone requirements.

To allow DRS the maximum flexibility:

Place virtual machines on shared datastores accessible from all hosts in the cluster.
Make sure virtual machines are not connected to host devices that would prevent them from moving off of those hosts.

The drmdump files produced by DRS can be very useful in diagnosing potential DRS performance issues during a support call. For
particularly active clusters, or those with more than about 16 hosts, it can be helpful to keep more such files than can fit in the
default maximum drmdump directory size of 20MB. This maximum can be increased using the DumpSpace option, which can be set
using DRS Advanced Options.

Cluster Sizing and Resource Settings

Exceeding the maximum number of hosts, virtual machines, or resource pools for each DRS cluster specified in Configuration
Maximums for VMware vSphere 5.0 is not supported. Even if it seems to work, doing so could adversely affect vCenter Server or DRS
performance.

Carefully select the resource settings (that is, reservations, shares, and limits) for your virtual machines.
Setting reservations too high can leave few unreserved resources in the cluster, thus limiting the options DRS has to balance
load.
Setting limits too low could keep virtual machines from using extra resources available in the cluster to improve their
performance.

Use reservations to guarantee the minimum requirement a virtual machine needs, rather than what you might like it to get.

Note that shares take effect only when there is resource contention. Note also that additional resources reserved for virtual machine
memory overhead need to be accounted for when sizing resources in the cluster.

If the overall cluster capacity might not meet the needs of all virtual machines during peak hours, you can assign relatively higher
shares to virtual machines or resource pools hosting mission-critical applications to reduce the performance interference from less-
critical virtual machines.

If you will be using vMotion, it's a good practice to leave some unused CPU capacity in your cluster. As described in VMware
vMotion on page 51, when a vMotion operation is started, ESXi reserves some CPU resources for that operation.

DRS Performance Tuning

The migration threshold for fully automated DRS (cluster > DRS tab > Edit... > vSphere DRS) allows the administrator to control the
aggressiveness of the DRS algorithm.

The migration threshold should be set to more aggressive levels when the following conditions are satisfied:

If the hosts in the cluster are relatively homogeneous.
If the virtual machines' resource utilization does not vary much over time and you have relatively few constraints on where
a virtual machine can be placed.

The migration threshold should be set to more conservative levels in the converse situations.

NOTE If the most conservative threshold is chosen, DRS will only apply move recommendations that must be taken either to satisfy
hard constraints, such as affinity or anti-affinity rules, or to evacuate virtual machines from a host entering maintenance or standby
mode.

VMware Distributed Power Management (DPM)

VMware Distributed Power Management (DPM) conserves power by migrating virtual machines to fewer hosts when utilizations are
low. DPM is most appropriate for clusters in which composite virtual machine demand varies greatly over time; for example, clusters
in which overall demand is higher during the day and significantly lower at night. If demand is consistently high relative to overall
cluster capacity DPM will have little opportunity to put hosts into standby mode to save power.

Because DPM uses DRS, most DRS best practices (described in VMware Distributed Resource Scheduler (DRS) on page 52) are
relevant to DPM as well.

DPM considers historical demand in determining how much capacity to keep powered on and keeps some excess capacity available
for changes in demand. DPM will also power on additional hosts when needed for unexpected increases in the demand of existing
virtual machines or to allow new virtual machine admission.

The aggressiveness of the DPM algorithm can be tuned by adjusting the DPM Threshold in the cluster settings menu. This parameter
controls how far outside the target utilization range per-host resource utilization can be before DPM makes host power-on/power-
off recommendations. The default setting for the threshold is 3 (medium aggressiveness).

For datacenters that often have unexpected spikes in virtual machine resource demands, you can use the DPM advanced option
MinPoweredOnCpuCapacity (default 1 MHz) or MinPoweredOnMemCapacity (default 1 MB) to ensure that a minimum amount of
CPU or memory capacity is kept on in the cluster.

DPM can be disabled on individual hosts that are running mission-critical virtual machines, and the VM/Host affinity rules can be
used to ensure that these virtual machines are not migrated away from these hosts.

DPM can be enabled or disabled on a predetermined schedule using Scheduled Tasks in vCenter Server. When DPM is disabled, all
hosts in a cluster will be powered on. This might be useful, for example, to reduce the delay in responding to load spikes expected at
certain times of the day or to reduce the likelihood of some hosts being left in standby for extended periods.

In a cluster with VMware High Availability (HA) enabled, DRS/DPM maintains excess powered-on capacity to meet the High
Availability settings. The cluster might therefore not allow additional virtual machines to be powered on and/or some hosts might

not be powered down even when the cluster appears to be sufficiently idle. These factors should be considered when configuring
HA.

If VMware HA is enabled in a cluster, DPM always keeps a minimum of two hosts powered on. This is true even if HA admission
control is disabled or if no virtual machines are powered on.

VMware Storage Distributed Resource Scheduler (Storage DRS)

A new feature in vSphere 5.0, Storage Distributed Resource Scheduler (Storage DRS), provides I/O load balancing across datastores
within a datastore cluster (a new vCenter object). This load balancing can avoid storage performance bottlenecks or address them if
they occur.

When deciding which datastores to group into a datastore cluster, try to choose datastores that are as homogeneous as possible in
terms of host interface protocol (i.e., FCP, iSCSI, NFS), RAID level, and performance characteristics. We recommend not mixing SSD
and hard disks in the same datastore cluster.

While a datastore cluster can have as few as two datastores, the more datastores a datastore cluster has, the more flexibility
Storage DRS has to better balance that cluster's I/O load.

As you add workloads you should monitor datastore I/O latency in the performance chart for the datastore cluster, particularly
during peak hours. If most or all of the datastores in a datastore cluster consistently operate with latencies close to the congestion
threshold used by Storage I/O Control (set to 30ms by default, but sometimes tuned to reflect the needs of a particular deployment),
this might be an indication that there aren't enough spare I/O resources left in the datastore cluster. In this case, consider adding
more datastores to the datastore cluster or reducing the load on that datastore cluster.

NOTE Make sure, when adding more datastores to increase I/O resources in the datastore cluster, that your changes do actually add
resources, rather than simply creating additional ways to access the same underlying physical disks.
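The latency guidance above can be expressed as a simple monitoring check. The following is a minimal Python sketch, with assumed peak-hour latency samples and an assumed definition of "close to the threshold" (90 percent of it); it is not a Storage DRS algorithm, only an illustration of the reasoning.

# Hedged sketch: flag a datastore cluster whose members consistently run near the
# Storage I/O Control congestion threshold (30 ms by default, tunable per deployment).
# The latency samples and the 90% "near threshold" factor are assumptions.

CONGESTION_THRESHOLD_MS = 30.0
NEAR_FACTOR = 0.9

def needs_more_io_resources(datastore_latencies_ms, threshold=CONGESTION_THRESHOLD_MS):
    # True if most datastores sit near or above the threshold, suggesting the
    # cluster may need additional datastores or a lighter load.
    near = [ds for ds, lat in datastore_latencies_ms.items() if lat >= threshold * NEAR_FACTOR]
    return len(near) > len(datastore_latencies_ms) / 2

peak_hour_latency = {"ds-01": 29.4, "ds-02": 31.2, "ds-03": 27.8, "ds-04": 12.5}
print(needs_more_io_resources(peak_hour_latency))  # True: 3 of 4 are near or over 30 ms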

By default, Storage DRS affinity rules keep all of a virtual machine's virtual disks on the same datastore (using intra-VM affinity).
However, you can give Storage DRS more flexibility in I/O load balancing, potentially increasing performance, by overriding the
default intra-VM affinity rule. This can be done for either a specific virtual machine (from the vSphere Client, select Edit Settings >
Virtual Machine Settings, then deselect "Keep VMDKs together") or for the entire datastore cluster (from the vSphere Client, select
Home > Inventory > Datastores and Datastore Clusters, select a datastore cluster, select the Storage DRS tab, click Edit, select Virtual
Machine Settings, then deselect "Keep VMDKs together").

Inter-VM anti-affinity rules can be used to keep the virtual disks from two or more different virtual machines from being placed on
the same datastore, potentially improving performance in some situations. They can be used, for example, to separate the storage
I/O of multiple workloads that tend to have simultaneous but intermittent peak loads, preventing those peak loads from combining
to stress a single datastore.


VMware High Availability

VMware High Availability (HA) minimizes virtual machine downtime by monitoring hosts, virtual machines, or applications within
virtual machines, then, in the event a failure is detected, restarting virtual machines on alternate hosts.

When vSphere HA is enabled in a cluster, all active hosts (those not in standby mode, maintenance mode, or disconnected)
participate in an election to choose the master host for the cluster; all other hosts become slaves. The master has a number
of responsibilities, including monitoring the state of the hosts in the cluster, protecting the powered-on virtual machines,
initiating failover, and reporting cluster health state to vCenter Server. The master is elected based on the properties of the
hosts, with preference being given to the one connected to the greatest number of datastores. Serving in the role of master
will have little or no effect on a host's performance.

When the master host can't communicate with a slave host over the management network, the master uses datastore
heartbeating to determine the state of that slave host. By default, vSphere HA uses two datastores for heartbeating,
resulting in very low false failover rates. To reduce the chances of false failover even further (at the potential cost of a very
slight performance impact), you can use the advanced option das.heartbeatdsperhost to change the number of heartbeat
datastores (up to a maximum of five).

Enabling HA on a host reserves some host resources for HA agents, slightly reducing the available host capacity for
powering on virtual machines.

When HA is enabled, the vCenter Server reserves sufficient unused resources in the cluster to support the failover capacity
specified by the chosen admission control policy. This can reduce the number of virtual machines the cluster can support.



VMware Fault Tolerance

For each virtual machine there are two FT-related actions that can be taken: turning on or off FT and enabling or disabling FT.
Turning on FT prepares the virtual machine for FT by prompting for the removal of unsupported devices, disabling unsupported
features, and setting the virtual machine's memory reservation to be equal to its memory size (thus avoiding ballooning or
swapping). Enabling FT performs the actual creation of the secondary virtual machine by live-migrating the primary.

Each of these operations has performance implications.

Don't turn on FT for a virtual machine unless you will be using (i.e., enabling) FT for that machine. Turning on FT
automatically disables some features for the specific virtual machine that can help performance, such as hardware virtual
MMU (if the processor supports it).

Enabling FT for a virtual machine uses additional resources (for example, the secondary virtual machine uses as much CPU
and memory as the primary virtual machine). Therefore make sure you are prepared to devote the resources required
before enabling FT.

The live migration that takes place when FT is enabled can briefly saturate the vMotion network link and can also cause spikes in
CPU utilization.

If the vMotion network link is also being used for other operations, such as FT logging (transmission of all the primary
virtual machine's inputs, including incoming network traffic and disk reads, to the secondary host), the performance of those other
operations can be impacted. For this reason it is best to have separate and dedicated NICs (or use Network I/O Control,
described in "Network I/O Control (NetIOC)" on page 34) for FT logging traffic and vMotion, especially when multiple FT
virtual machines reside on the same host.

Because this potentially resource-intensive live migration takes place each time FT is enabled, we recommend that FT not
be frequently enabled and disabled.

FT-enabled virtual machines must use eager-zeroed thick-provisioned virtual disks. Thus when FT is enabled for a virtual
machine with thin-provisioned virtual disks or lazy-zeroed thick-provisioned virtual disks, these disks need to be converted.
This one-time conversion process uses fewer resources when the virtual machine is on storage hardware that supports
VAAI (described in "Hardware Storage Considerations" on page 11).

Because FT logging traffic is asymmetric (the majority of the traffic flows from primary to secondary), congestion on the logging NIC
can be reduced by distributing primaries onto multiple hosts. For example on a cluster with two ESXi hosts and two virtual machines
with FT enabled, placing one of the primary virtual machines on each of the hosts allows the network bandwidth to be utilized
bidirectionally.
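The placement idea above can be illustrated with a short Python sketch; it is only a round-robin illustration of alternating FT primaries across hosts, not DRS placement logic, and the host and virtual machine names are hypothetical.

# Hedged sketch: spread FT primary virtual machines across hosts in round-robin
# fashion so FT logging traffic (primary -> secondary) flows in both directions
# rather than saturating one direction of a single link. Names are hypothetical.

from itertools import cycle

def place_ft_primaries(ft_vms, hosts):
    host_cycle = cycle(hosts)
    return {vm: next(host_cycle) for vm in ft_vms}

placement = place_ft_primaries(["ft-vm-a", "ft-vm-b"], ["esxi-01", "esxi-02"])
print(placement)  # {'ft-vm-a': 'esxi-01', 'ft-vm-b': 'esxi-02'}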

FT virtual machines that receive large amounts of network traffic or perform lots of disk reads can generate significant bandwidth on
the NIC specified for the logging traffic. This is true of machines that routinely do these things as well as machines doing them only
intermittently, such as during a backup operation. To avoid saturating the network link used for logging traffic, limit the number of FT
virtual machines on each host or limit the disk read bandwidth and network receive bandwidth of those virtual machines.

Make sure the FT logging traffic is carried by at least a Gigabit-rated NIC (which should in turn be connected to at least Gigabit-rated
network infrastructure).

Avoid placing more than four FT-enabled virtual machines on a single host. In addition to reducing the possibility of saturating the
network link used for logging traffic, this also limits the number of simultaneous live-migrations needed to create new secondary
virtual machines in the event of a host failure.

If the secondary virtual machine lags too far behind the primary (which usually happens when the primary virtual machine is CPU
bound and the secondary virtual machine is not getting enough CPU cycles), the hypervisor might slow the primary to allow the
secondary to catch up. The following recommendations help avoid this situation:

Make sure the hosts on which the primary and secondary virtual machines run are relatively closely matched, with similar
CPU make, model, and frequency.

Make sure that power management scheme settings (both in the BIOS and in ESXi) that cause CPU frequency scaling are
consistent between the hosts on which the primary and secondary virtual machines run.

Enable CPU reservations for the primary virtual machine (which will be duplicated for the secondary virtual machine) to
ensure that the secondary gets CPU cycles when it requires them.

Though timer interrupt rates do not significantly affect FT performance, high timer interrupt rates create additional network traffic
on the FT logging NICs. Therefore, if possible, reduce timer interrupt rates as described in Guest Operating System CPU
Considerations on page 39.


VMware vCenter Update Manager

VMware vCenter Update Manager provides a patch management framework for VMware vSphere. It can be used to apply patches,
updates, and upgrades to VMware ESX and ESXi hosts, VMware Tools and virtual hardware, and so on.

Update Manager Setup and Configuration

When there are more than 300 virtual machines or more than 30 hosts, separate the Update Manager database from the
vCenter Server database.

When there are more than 1000 virtual machines or more than 100 hosts, separate the Update Manager server from the
vCenter Server and the Update Manager database from the vCenter Server database.

Allocate separate physical disks for the Update Manager patch store and the Update Manager database. To reduce network
latency and packet drops, keep to a minimum the number of network hops between the Update Manager server system
and the ESXi hosts.

In order to cache frequently used patch files in memory, make sure the Update Manager server host has at least 2GB of
RAM.
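The inventory-size thresholds above can be summarized in a small decision helper. The following is a minimal Python sketch; the thresholds come from the recommendations above, while the returned topology descriptions are informal labels rather than official deployment names.

# Hedged sketch: choose an Update Manager deployment model from inventory size,
# using the thresholds quoted above (300/1000 virtual machines, 30/100 hosts).

def update_manager_topology(num_vms, num_hosts):
    if num_vms > 1000 or num_hosts > 100:
        return "separate Update Manager server and separate Update Manager database"
    if num_vms > 300 or num_hosts > 30:
        return "Update Manager database separated from the vCenter Server database"
    return "Update Manager server and database co-located with vCenter Server"

print(update_manager_topology(num_vms=1200, num_hosts=80))
# separate Update Manager server and separate Update Manager database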

Update Manager General Recommendations

Compliance view latency increases linearly with the number of attached baselines. We
therefore recommend removing unused baselines, especially when the inventory size is large.

Upgrading VMware Tools is faster if the virtual machine is already powered on. Otherwise, Update Manager must power on
the virtual machine before the VMware Tools upgrade, which could increase the overall latency.

Upgrading virtual machine hardware is faster if the virtual machine is already powered off. Otherwise, Update Manager
must power off the virtual machine before upgrading the virtual hardware, which could increase the overall latency.

NOTE Because VMware Tools must be up to date before virtual hardware is upgraded, Update Manager might need to
upgrade VMware Tools before upgrading virtual hardware. In such cases the process is faster if the virtual machine is
already powered-on.


Update Manager Cluster Remediation



Limiting the remediation concurrency level (i.e., the maximum number of hosts that can be simultaneously updated) to half
the number of hosts in the cluster can reduce vMotion intensity, often resulting in better overall host remediation
performance. (This option can be set using the cluster remediate wizard.)

When all hosts in a cluster are ready to enter maintenance mode (that is, they have no virtual machines powered on),
concurrent host remediation will typically be faster than sequential host remediation.

Cluster remediation is most likely to succeed when the cluster is no more than 80% utilized. Thus for heavily-used clusters,
cluster remediation is best performed during off-peak periods, when utilization drops below 80%. If this is not possible, it is
best to suspend or power-off some virtual machines before the operation is begun.
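The concurrency and utilization guidance above can be combined into a simple pre-remediation check. The following is a minimal Python sketch; the host count and utilization figure are assumptions used only to show the arithmetic.

# Hedged sketch: plan a cluster remediation run using the guidance above, i.e.,
# cap concurrent host remediations at half the hosts and prefer to proceed only
# when cluster utilization is at or below 80%.

def plan_cluster_remediation(num_hosts, cluster_utilization_pct):
    concurrency_limit = max(1, num_hosts // 2)     # half the hosts, at least one
    ok_to_proceed = cluster_utilization_pct <= 80  # otherwise wait for an off-peak window
    return {"concurrency_limit": concurrency_limit, "ok_to_proceed": ok_to_proceed}

print(plan_cluster_remediation(num_hosts=8, cluster_utilization_pct=72))
# {'concurrency_limit': 4, 'ok_to_proceed': True}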


Update Manager Bandwidth Throttling

During remediation or staging operations, hosts download patches. On slow networks you can prevent network congestion
by configuring hosts to use bandwidth throttling. By allocating comparatively more bandwidth to some hosts, those hosts
can more quickly finish remediation or staging.

To ensure that network bandwidth is allocated as expected, the sum of the bandwidth allocated to multiple hosts on a
single network link should not exceed the bandwidth of that link. Otherwise, the hosts will attempt to utilize bandwidth up
to their allocation, resulting in bandwidth utilization that might not be proportional to the configured allocations.

Bandwidth throttling applies only to hosts that are downloading patches. If a host is not in the process of patch
downloading, any bandwidth throttling configuration on that host will not affect the bandwidth available in the network
link.
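A quick way to validate a throttling plan is to check that the per-host allocations on a shared link do not exceed the link capacity. The following is a minimal Python sketch; the link speed and per-host allocations are illustrative assumptions.

# Hedged sketch: confirm that the bandwidth throttles configured for hosts that
# download patches over the same network link do not add up to more than the link
# can carry; otherwise utilization may not be proportional to the allocations.

def allocations_fit_link(per_host_allocations_mbps, link_capacity_mbps):
    total = sum(per_host_allocations_mbps.values())
    return total <= link_capacity_mbps, total

fits, total = allocations_fit_link({"esxi-01": 400, "esxi-02": 400, "esxi-03": 300},
                                   link_capacity_mbps=1000)
print(fits, total)  # False 1100 -> the link is oversubscribed by 100Mbps
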
11. VMware vSphere Distributed Switch Best Practices

Design Considerations
The following three main aspects influence the design of a virtual network infrastructure:
1. Customer's infrastructure design goals
2. Customer's infrastructure component configurations
3. Virtual infrastructure traffic requirements

Let's take a look at each of these aspects in a little more detail.


Infrastructure Design Goals
Customers want their network infrastructure to be available 24/7, to be secure from any attacks, to perform efficiently throughout
day-to-day operations, and to be easy to maintain. In the case of a virtualized environment, these requirements become increasingly
demanding as growing numbers of business-critical applications run in a consolidated setting. These requirements on the
infrastructure translate into design decisions that should incorporate the following best practices for a virtual network
infrastructure:

Avoid any single point of failure in the network
Isolate each traffic type for increased resiliency and security
Make use of traffic management and optimization capabilities

Infrastructure Component Configurations

In every customer environment, the utilized compute and network infrastructures differ in terms of configuration, capacity and
feature capabilities. These different infrastructure component configurations influence the virtual network infrastructure design
decisions. The following are some of the configurations and features that administrators must look out for:

Server configuration: rack or blade servers


Network adapter configuration: 1GbE or 10GbE network adapters, number of available adaptors, and offload functions of these
adaptors, if any.
Physical network switch infrastructure capabilities: switch clustering


It is impossible to cover all the different virtual network infrastructure design deployments based on the various combinations of
type of servers, network adaptors and network switch capability parameters. In this paper, the following four commonly used
deployments that are based on standard rack server and blade server configurations are described:

Rack server with eight 1GbE network adaptors


Rack server with two 10GbE network adaptors
Blade server with two 10GbE network adapters
Blade server with hardware-assisted multiple logical Ethernet network adaptors


It is assumed that the network switch infrastructure has standard layer 2 switch features (high availability, redundant paths, fast
convergence, port security) available to provide reliable, secure and scalable connectivity to the server infrastructure.
Virtual Infrastructure Traffic
vSphere virtual network infrastructure carries different traffic types. To manage the virtual infrastructure traffic effectively, vSphere
and network administrators must understand the different traffic types and their characteristics. The following are the key traffic
types that flow in the vSphere infrastructure, along with their traffic characteristics:
Management traffic: This traffic flows through a vmknic and carries VMware ESXi host-to-VMware vCenter configuration and
management communication as well as ESXi host-to-ESXi host high availability (HA) related communication. This traffic has low
network utilization but has very high availability and security requirements.
VMware vSphere vMotion traffic: With advancement in vMotion technology, a single vMotion instance can consume almost a full
10Gb of bandwidth. A maximum of eight simultaneous vMotion instances can be performed on a 10Gb uplink; four simultaneous
vMotion instances are allowed on a 1Gb uplink. vMotion traffic has very high network utilization and can be bursty at times.
Customers must make sure that vMotion traffic doesn't impact other traffic types, because it might consume all available I/O
resources. Another property of vMotion traffic is that it is not sensitive to throttling and makes a very good candidate on which to
perform traffic management.
Fault-tolerant traffic: When VMware Fault Tolerance (FT) logging is enabled for a virtual machine, all the logging traffic is sent to the
secondary fault-tolerant virtual machine over a designated vmknic port. This process can require a considerable amount of
bandwidth at low latency because it replicates the I/O traffic and memory-state information to the secondary virtual machine.
iSCSI/NFS traffic: IP storage traffic is carried over vmknic ports. This traffic varies according to disk I/O requests. With end-to-end
jumbo frame configuration, more data is transferred with each Ethernet frame, decreasing the number of frames on the network.
This larger frame reduces the overhead on server/targets and improves the IP storage performance. On the other hand, congested
and lower-speed networks can cause latency issues that disrupt access to IP storage. It is recommended that users provide a high-
speed path for IP storage and avoid any congestion in the network infrastructure.
Virtual machine traffic: Depending on the workloads that are running on the guest virtual machine, the traffic patterns will vary from
low to high network utilization. Some of the applications running in virtual machines might be latency sensitive, as is the case with
VoIP workloads.
Table 1 summarizes the characteristics of each traffic type.


To understand the different traffic flows in the physical network infrastructure, network administrators use network traffic
management tools. These tools help monitor the physical infrastructure traffic but do not provide visibility into virtual infrastructure
traffic. With the release of vSphere 5, VDS now supports the NetFlow feature, which enables exporting the internal (virtual machine-
to-virtual machine) virtual infrastructure flow information to standard network management tools. Administrators now have the
required visibility into virtual infrastructure traffic. This helps administrators monitor the virtual network infrastructure traffic
through a familiar set of network management tools. Customers should make use of the network data collected from these tools
during the capacity planning or network design exercises.

Example Deployment Components
After looking at the different design considerations, this section provides a list of components that are used in an example
deployment. This example deployment helps illustrate some standard VDS design approaches. The following are some common
components in the virtual infrastructure. The list doesn't include storage components that are required to build the virtual
infrastructure. It is assumed that customers will deploy IP storage in this example deployment.
Hosts
Four ESXi hosts provide compute, memory and network resources according to the configuration of the hardware. Customers can
have different numbers of hosts in their environment, based on their needs. One VDS can span up to 350 hosts. This capability to
support large numbers of hosts provides the required scalability to build a private or public cloud environment using VDS (excellent
use case).
Clusters
A cluster is a collection of ESXi hosts and associated virtual machines with shared resources. Customers can have as many clusters in
their deployment as are required. With one VDS spanning across 350 hosts, customers have the flexibility of deploying multiple
clusters with a different number of hosts in each cluster. For simple illustration purposes, two clusters with two hosts each are
considered in this example deployment. One cluster can have a maximum of 32 hosts.
VMware vCenter Server
VMware vCenter Server centrally manages a vSphere environment. Customers can manage VDS through this centralized
management tool, which can be deployed on a virtual machine or a physical host. The vCenter Server system is not shown in the
diagrams, but customers should assume that it is present in this example deployment. It is used only to provision and manage VDS
configuration. When provisioned, hosts and virtual machine networks operate independently of vCenter Server. All components
required for network switching reside on ESXi hosts. Even if the vCenter server system fails, the hosts and virtual machines will still
be able to communicate.
Network Infrastructure
Physical network switches in the access and aggregation layer provide connectivity between ESXi hosts and to the external world.
These network infrastructure components support standard layer 2 protocols providing secure and reliable connectivity. Along with
the preceding four components of the physical infrastructure in this example deployment, some of the virtual infrastructure traffic
types are also considered during the design. The following section describes the different traffic types in the example deployment.

Virtual Infrastructure Traffic Types


In this example deployment, there are standard infrastructure traffic types, including iSCSI, vMotion, FT, management and virtual
machine. Customers might have other traffic types in their environment, based on their choice of storage infrastructure (FC, NFS,
FCoE). Figure 1 shows the different traffic types along with associated port groups on an ESXi host. It also shows the mapping of the
network adapters to the different port groups.



Important Virtual and Physical Switch Parameters
Before going into the different design options in the example deployment, lets take a look at the virtual and physical network switch
parameters that should be considered in all of the design options. There are some key parameters that vSphere and network
administrators must take into account when designing VMware virtual networking. Because the configuration of virtual networking
goes hand in hand with physical network configuration, this section will cover both the virtual and physical switch parameters.
VDS Parameters
VDS simplifies the challenges of the configuration process by providing one single pane of glass to perform virtual network
management tasks. As opposed to configuring a vSphere Standard Switch (VSS) on each individual host, administrators can configure and
manage one single VDS. All centrally configured network policies on VDS get pushed down to the host automatically when the host
is added to the distributed switch. In this section, an overview of key VDS parameters is provided.
Host Uplink Connections (vmnics) and dvuplink Parameters
VDS has new abstraction, called dvuplink, for the physical Ethernet network adaptors (vmnics) on each host. It is defined during the
creation of the VDS and can be considered as a template for individual vmnics on each host. All the properties including network
adaptor-teaming, load balancing and failover policies on VDS and dvportgroups are configured on dvuplinks. These dvuplink
properties are automatically applied to vmnics on individual hosts when a host is added to the VDS and when each vmnic on the
host is mapped to a dvuplink.
This dvuplink abstraction therefore provides the advantage of consistently applying teaming and failover configurations to all the
hosts' physical Ethernet network adaptors (vmnics).
Figure 2 shows two ESXi hosts with four Ethernet network adaptors each. When these hosts are added to the VDS, with four
dvuplinks configured on a dvuplink port group, administrators must assign the network adaptors (vmnics) of the hosts to dvuplinks.
To illustrate the mapping of the dvuplinks to vmnics, Figure 2 shows one type of mapping, where the ESXi host's vmnic0 is mapped to
dvuplink1, vmnic1 to dvuplink2, and so on. Customers can choose a different mapping if required, where vmnic0 can be mapped to a

different dvuplink instead of dvuplink1. VMware recommends having consistent mapping across different hosts because it reduces
complexity in the environment.
Figure 2. dvuplink-to-vmnic Mapping



As a best practice, customers should also try to deploy hosts with the same number of physical Ethernet network adaptors with
similar port speeds. Also, because the number of dvuplinks on VDS depends on the maximum number of physical Ethernet network
adaptors on a host, administrators should take that into account during dvuplink port group configuration. Customers always have
an option to modify this dvuplink configuration based on the new hardware capabilities.
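The mapping convention described above can be sketched in a few lines of Python; this is only an illustration of keeping the vmnic-to-dvuplink association consistent across hosts, and the host names and adaptor count are assumptions.

# Hedged sketch: build the same vmnic-to-dvuplink mapping for every host added to
# the VDS, mirroring the "vmnic0 -> dvuplink1, vmnic1 -> dvuplink2" scheme above.

def map_vmnics_to_dvuplinks(hosts, num_dvuplinks):
    mapping = {}
    for host in hosts:
        mapping[host] = {f"vmnic{i}": f"dvuplink{i + 1}" for i in range(num_dvuplinks)}
    return mapping

print(map_vmnics_to_dvuplinks(["esxi-01", "esxi-02"], num_dvuplinks=4)["esxi-01"])
# {'vmnic0': 'dvuplink1', 'vmnic1': 'dvuplink2', 'vmnic2': 'dvuplink3', 'vmnic3': 'dvuplink4'}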
Traffic Types and dvportgroup Parameters
Similar to port groups on standard switches, dvportgroups define how the connection is made through the VDS to the network. The
VLAN ID, traffic shaping, port security, teaming and load balancing parameters are configured on these dvportgroups. The virtual
ports (dvports) connected to a dvportgroup share the same properties configured on a dvportgroup. When customers want a group
of virtual machines to share the security and teaming policies, they must make sure that the virtual machines are part of one
dvportgroup. Customers can choose to define different dvportgroups based on the different traffic types they have in their
environment or based on the different tenants or applications they support in the environment. If desired, multiple dvportgroups
can share the same VLAN ID.
In this example deployment, the dvportgroup classification is based on the traffic types running in the virtual infrastructure. After
administrators understand the different traffic types in the virtual infrastructure and identify specific security, reliability and
performance requirements for individual traffic types, the next step is to create unique dvportgroups associated with each traffic type.
As was previously mentioned, the dvportgroup configuration defined at VDS level is automatically pushed down to every host that is
added to the VDS. For example, in Figure 2, the two dvportgroups, PG-A (yellow) and PG-B (green), defined at the distributed switch
level are each available on each of the ESXi hosts that are part of that VDS.
dvportgroup Specific Configuration
After customers decide on the number of unique dvportgroups they want to create in their environment, they can start configuring
them. The configuration options/parameters are similar to those available with port groups on vSphere standard switches. There are
some additional options available on VDS dvportgroups that are related to teaming setup and are not available on vSphere standard
switches. Customers can configure the following key parameters for each dvportgroup.

Number of virtual ports (dvports)


Port binding (static, dynamic, ephemeral)
VLAN trunking/private VLANs
Teaming and load balancing along with active and standby link
Bidirectional traffic-shaping parameters

Port security


As part of the teaming algorithm support, VDS provides a unique approach to load balancing traffic across the teamed network
adaptors. This approach is called load-based teaming (LBT), which distributes the traffic across the network adaptors based on the
percentage utilization of traffic on those adaptors. The LBT algorithm works in both the ingress and egress directions of network adaptor
traffic, as opposed to the hashing algorithms, which work only in the egress direction (traffic flowing out of the network adaptor). Also, LBT
prevents the worst-case scenario that might happen with hashing algorithms, where all traffic hashes to one network adaptor of the
team while other network adaptors are not used to carry any traffic. To improve the utilization of all the links/network adaptors,
VMware recommends the use of this advanced feature, LBT, of VDS. The LBT approach is recommended over EtherChannel on
physical switches and route-based IP hash configuration on the virtual switch.
Port security policies at port group level enable customer protection from certain activity that might compromise security. For
example, a hacker might impersonate a virtual machine and gain unauthorized access by spoofing the virtual machine's MAC
address. VMware recommends setting "MAC Address Changes" and "Forged Transmits" to "Reject" to help protect against
attacks launched by a rogue guest operating system. Customers should set "Promiscuous Mode" to "Reject" unless they want to
monitor the traffic for network troubleshooting or intrusion detection purposes.

NIOC

Network I/O control (NIOC) is the traffic management capability available on VDS. The NIOC concept revolves around resource pools
that are similar in many ways to the ones existing for CPU and memory. vSphere and network administrators now can allocate I/O
shares to different traffic types similarly to allocating CPU and memory resources to a virtual machine. The share parameter specifies
the relative importance of a traffic type over other traffic and provides a guaranteed minimum when the other traffic competes for a
particular network adaptor. The shares are specified in abstract units numbered 1 to 100. Customers can provision shares to
different traffic types based on the amount of resources each traffic type requires.

This capability of provisioning I/O resources is very useful in situations where there are multiple traffic types competing for
resources. For example, in a deployment where vMotion and virtual machine traffic types are flowing through one network adaptor,
it is possible that vMotion activity might impact the virtual machine traffic performance. In this situation, shares configured in NIOC
provide the required isolation to the vMotion and virtual machine traffic type and prevent one flow (traffic type) from dominating
the other flow. NIOC configuration provides one more parameter that customers can utilize if they want to put any limits on a
particular traffic type. This parameter is called the limit. The limit configuration specifies the absolute maximum bandwidth for a
traffic type on a host. The configuration of the limit parameter is specified in Mbps. NIOC limits and shares parameters work only on
the outbound traffic, i.e., traffic that is flowing out of the ESXi host.

VMware recommends that customers utilize this traffic management feature whenever they have multiple traffic types flowing
through one network adaptor, a situation that is more prominent with 10 Gigabit Ethernet (GbE) network deployments but can
happen in 1GbE network deployments as well. The common use case for using NIOC in 1GbE network adaptor deployments is when
the traffic from different workloads or different customer virtual machines is carried over the same network adaptor. As multiple-
workload traffic flows through a network adaptor, it becomes important to provide I/O resources based on the needs of the
workload. With the release of vSphere 5, customers now can make use of the new user-defined network resource pools capability
and can allocate I/O resources to the different workloads or different customer virtual machines, depending on their needs. This
user-defined network resource pools feature provides the granular control in allocating I/O resources and meeting the service-level
agreement (SLA) requirements for the virtualized tier 1 workloads.

Bidirectional Traffic Shaping

Besides NIOC, there is another traffic-shaping feature that is available in the vSphere platform. It can be configured on a
dvportgroup or dvport level. Customers can shape both inbound and outbound traffic using three parameters: average bandwidth,
peak bandwidth and burst size. Customers who want more granular traffic-shaping controls to manage their traffic types can take
advantage of this capability of VDS along with the NIOC feature. It is recommended that network administrators in your organization
be involved while configuring these granular traffic parameters. These controls make sense only when there are oversubscription
scenarios caused by the oversubscribed physical switch infrastructure or virtual infrastructurethat are causing network
performance issues. So it is very important to understand the physical and virtual network environment before making any
bidirectional traffic-shaping configurations.

Physical Network Switch Parameters



The configurations of the VDS and the physical network switch should go hand in hand to provide resilient, secure and scalable
connectivity to the virtual infrastructure. The following are some key switch configuration parameters that the customer should pay
attention to.

VLAN

If VLANs are used to provide logical isolation between different traffic types, it is important to make sure that those VLANs are
carried over to the physical switch infrastructure. To do so, enable virtual switch tagging (VST) on the virtual switch, and trunk all
VLANs to the physical switch ports. For security reasons, it is recommended that customers not use the VLAN ID 1 (default) for any
VMware infrastructure traffic.

Spanning Tree Protocol

Spanning Tree Protocol (STP) is not supported on virtual switches, so no configuration is required on VDS.
But it is important to enable this protocol on the physical switches. STP makes sure that there are no loops in the network.
As a best practice, customers should configure the following:

1. Use PortFast on ESXi host-facing physical switch ports. With this setting, network convergence on these switch ports will take
place quickly after the failure because the port will enter the STP forwarding state immediately, bypassing the listening and learning
states.

2. Use the PortFast Bridge Protocol Data Unit (BPDU) guard feature to enforce the STP boundary. This configuration protects against
any invalid device connection on the ESXi host-facing access switch ports. As was previously mentioned, VDS doesn't support STP, so
it doesn't send any BPDU frames to the switch port. However, if any BPDU is seen on these ESXi host-facing access switch ports, the
BPDU guard feature puts that particular switch port in the error-disabled state. The switch port is completely shut down, which
prevents it from affecting the spanning tree topology.

The recommendation of enabling PortFast and the BPDU guard feature on the switch ports is valid only when customers connect
non-switching/bridging devices to these ports (e.g., ESXi hosts). The switching/bridging devices can be hardware-based physical boxes
or servers running a software-based switching/bridging function. Customers should make sure that there is no switching/bridging
function enabled on the ESXi hosts that are connected to the physical switch ports.

However, in the scenario where the ESXi host has a guest virtual machine that is configured to perform a bridging function, the
virtual machine will generate BPDU frames and send them out to the VDS, which then forwards the BPDU frames through the
network adaptor to the physical switch port. When the switch port configured with BPDU guard receives the BPDU frame, the switch
will disable the port and the virtual machine will lose connectivity. To avoid this network failure scenario when running the software
bridging function on an ESXi host, customers should disable the PortFast and BPDU guard configuration on the physical switch port
and run STP.

If customers are concerned about attacks that can generate BPDU frames, they should make use of VMware vShield App, which can
block the frames and protect the virtual infrastructure from such layer 2 attacks. Refer to VMware vShield product documentation
for more details on how to secure your vSphere virtual infrastructure: http://www.vmware.com/products/vshield/overview.html.


Link Aggregation Setup

Link aggregation is used to increase throughput and improve resiliency by combining multiple network connections. There are
various proprietary solutions on the market along with vendor-independent IEEE 802.3ad (LACP) standard-based implementation. All
solutions establish a logical channel between the two endpoints, using multiple physical links. In the vSphere virtual infrastructure,
the two ends of the logical channel are the VDS and physical switch. These two switches must be configured with link aggregation
parameters before the logical channel is established. Currently, VDS supports static link aggregation configuration and does not
provide support for dynamic LACP. When customers want to enable link aggregation on a physical switch, they should configure
static link aggregation on the physical switch and select IP hash as network adaptor teaming on the VDS.

When establishing the logical channel with multiple physical links, customers should make sure that the Ethernet network adaptor
connections from the host are terminated on a single physical switch. However, if customers have deployed clustered physical
switch technology, the Ethernet network adaptor connections can be terminated on two different physical switches. The clustered
physical switch technology is referred to by different names by networking vendors. For example, Cisco calls their switch clustering
solution Virtual Switching System (Nexus 6K and 7K switches use vPC); Brocade calls theirs Virtual Cluster Switching. Refer to the networking
vendor guidelines and configuration details when deploying switch clustering technology.


Link-State Tracking

Link-state tracking is a feature available on Cisco switches to manage the link state of downstream ports (ports connected to servers)
based on the status of upstream ports (ports connected to aggregation/core switches). When there is any failure on the upstream
links connected to aggregation or core switches, the associated downstream link status goes down. The server connected on the
downstream link is then able to detect the failure and reroute the traffic on other working links. This feature therefore provides the
protection from network failures due to the failed upstream ports in non-mesh topologies.

Unfortunately, this feature is not available on all vendors' switches, and even if it is available, it might not be referred to as link-state
tracking. Customers should talk to the switch vendors to find out whether a similar feature is supported on their switches.

Figure 3 shows the resilient mesh topology on the left and a simple loop-free topology on the right. VMware highly recommends
deploying the mesh topology shown on the left, which provides a highly reliable redundant design and doesn't need a link-state
tracking feature. Customers who don't have high-end networking expertise and are also limited in the number of switch ports might
prefer the deployment shown on the right. In this deployment, customers don't have to run STP because there are no loops in the
network design. The downside of this simple design is seen when there is a failure in the link between the access and aggregation
switches. In that failure scenario, the server will continue to send traffic on the same network adaptor even when the access layer
switch is dropping the traffic at the upstream interface. To avoid this black holing of server traffic, customers can enable link-state
tracking on the virtual and physical switches and indicate any failure between access and aggregation switch layers to the server
through link-state information.



VDS has the default network failover detection configuration set to "link status only." Customers should keep this configuration if they
are enabling the link-state tracking feature on physical switches. If link-state tracking capability is not available on physical switches,
and there are no redundant paths available in the design, customers can make use of the beacon probing feature available on VDS.
The beacon probing function is a software solution available on virtual switches for detecting link failures upstream from the access
layer physical switch to the aggregation/core switches. Beacon probing is most useful with three or more uplinks in a team.

Maximum Transmission Unit



Make sure that the maximum transmission unit (MTU) configuration matches across the virtual and physical network switch
infrastructure.

Rack Server in Example Deployment

After looking at the major components in the example deployment and key virtual and physical switch parameters, let's take a look
at the different types of servers that customers can have in their environment. Customers can deploy an ESXi host on either a rack
server or a blade server. This section discusses a deployment in which the ESXi host is running on a rack server.

Two types of rack server configuration will be described in the following section:
Rack server with eight 1GbE network adaptors
Rack server with two 10GbE network adaptors

The various VDS design approaches will be discussed for each of the two configurations.

Rack Server with Eight 1GbE Network Adaptors

In a rack server deployment with eight 1GbE network adaptors per host, customers can either use the traditional static design
approach of allocating network adaptors to each traffic type or make use of advanced features of VDS such as NIOC and LBT. The
NIOC and LBT features help provide a dynamic design that efficiently utilizes I/O resources. In this section, both the traditional and
new design approaches are described, along with their pros and cons.

Design Option 1 - Static Configuration

This design option follows the traditional approach of statically allocating network resources to the different virtual infrastructure
traffic types. As shown in Figure 4, each host has eight Ethernet network adaptors. Four are connected to one of the first access
layer switches; the other four are connected to the second access layer switch, to avoid a single point of failure. Let's look in detail at
how VDS parameters are configured.




dvuplink Configuration

To support the maximum of eight 1GbE network adaptors per host, the dvuplink port group is configured with eight dvuplinks
(dvuplink1 through dvuplink8). On the hosts, dvuplink1 is associated with vmnic0, dvuplink2 is associated with vmnic1, and so on. It is a
recommended practice to change the names of the dvuplinks to something meaningful and easy to track. For example, dvuplink1,
which gets associated with a vmnic on the motherboard, can be renamed LOM-uplink1; dvuplink2, which gets associated with a
vmnic on an expansion card, can be renamed Expansion-uplink1.

If the hosts have some Ethernet network adaptors as LAN on motherboard (LOM) and some on expansion cards, for a better
resiliency story, VMware recommends selecting one network adaptor from LOM and one from an expansion card when configuring
network adaptor teaming. To configure this teaming on a VDS, administrators must pay attention to the dvuplink and vmnic
association along with dvportgroup configuration where network adaptor teaming is enabled. In the network adaptor-teaming
configuration on a dvportgroup, administrators must choose the various dvuplinks that are part of a team. If the dvuplinks are
named appropriately according to the host vmnic association, administrators can select LOM-uplink1 and Expansion-uplink1
when configuring the teaming option for a dvportgroup.

dvportgroup Configuration

As described in Table 2, there are five different port groups that are configured for the five different traffic types. Customers can
create up to 5,000 unique port groups per VDS. In this example deployment, the decision on creating different port groups is based
on the number of traffic types.

According to Table 2, dvportgroup PG-A is created for the management traffic type. There are other dvportgroups defined for the
other traffic types. The following are the key configurations of dvportgroup PG-A:

Teaming option: Explicit failover order provides a deterministic way of directing traffic to a particular uplink. By selecting
dvuplink1 as an active uplink and dvuplink2 as a standby uplink, management traffic will be carried over dvuplink1 unless
there is a failure on dvuplink1. All other dvuplinks are configured as unused. Configuring the failback option to "No" is also
recommended, to avoid the flapping of traffic between two network adaptors. The failback option determines how a
physical adaptor is returned to active duty after recovering from a failure. If failback is set to "No", a failed adaptor is left
inactive, even after recovery, until another currently active adaptor fails and requires a replacement.

VMware recommends isolating all traffic types from each other by defining a separate VLAN for each dvportgroup.

There are several other parameters that are part of the dvportgroup configuration. Customers can choose to configure
these parameters based on their environment needs. For example, customers can configure PVLAN to provide isolation
when there are limited VLANs available in the environment.

As you follow the dvportgroup configuration in Table 2, you can see that each traffic type is carried over a specific dvuplink, with
the exception of the virtual machine traffic type. The virtual machine traffic type uses two active links, dvuplink7 and dvuplink8, and
these links are utilized through the LBT algorithm. As was previously mentioned, the LBT algorithm is much more efficient than the
standard hashing algorithm in utilizing link bandwidth.

Physical Switch Configuration


The external physical switch, to which the rack server's network adaptors are connected, is configured as a trunk
with all the appropriate VLANs enabled. As described in the Physical Network Switch Parameters section, the following switch
configurations are performed based on the VDS setup described in Table 2.

Enable STP on the trunk ports facing the ESXi hosts, along with the PortFast mode and BPDU guard feature.
The teaming configuration on VDS is static, so no link aggregation is configured on the physical switches.
Because of the mesh topology deployment, as shown in Figure 4, the link-state tracking feature is not required on the
physical switches.

In this design approach, resiliency to the infrastructure traffic is achieved through active/standby uplinks, and security is
accomplished by providing separate physical paths for the different traffic types. However, with this design, the I/O resources are
underutilized because the dvuplink2 and dvuplink6 standby links are not used to send or receive traffic. Also, there is no flexibility to
allocate more bandwidth to a traffic type when it needs it.

There is another variation to the static design approach that addresses the need of some customers to provide higher bandwidth to
the storage and vMotion traffic types. In the static design that was previously described, iSCSI and vMotion traffic is limited to 1Gb. If
a customer wants to support higher bandwidth for iSCSI, they can make use of the iSCSI multipathing solution. Also, with the release
of vSphere 5, vMotion traffic can be carried over multiple Ethernet network adaptors through the support of multi-network adaptor
vMotion, thereby providing higher bandwidth to the vMotion process.

For more details on how to set up iSCSI multipathing, refer to the VMware vSphere Storage guide:
https://www.vmware.com/support/pubs/vsphere-esxi-vcenter-server-pubs.html.

The configuration of multi-network adaptor vMotion is quite similar to the iSCSI multipath setup, where administrators must create
two separate VMkernel interfaces and bind each one to a separate dvportgroup. This configuration with two separate dvportgroups
provides the connectivity to two different Ethernet network adaptors or dvuplinks.

Table 3. Static Design Configuration with iSCSI Multipathing and Multi-Network Adaptor vMotion



As shown in Table 3, there are two
entries each for the vMotion and iSCSI traffic types. Also shown is a list of the additional dvportgroup configurations required to
support the multi-network adaptor vMotion and iSCSI multipathing processes. For multi-network adaptor vMotion, dvportgroups
PG-B1 and PG-B2 are listed, configured with dvuplink 3 and dvuplink4 respectively as active links. And for iSCSI multipathing,
dvportgroups PG-D1 and PG-D2 are connected to dvuplink5 and dvuplink6 respectively as active links. Load balancing across the
multiple dvuplinks is performed by the multipathing logic in the iSCSI process and by the ESX platform in the vMotion process.
Configuring the teaming policies for these dvportgroups is not required.

FT, management and virtual machine traffic-type dvportgroup configuration and physical switch configuration for this design remain
the same as those described in Design Option 1 of the previous section.

This static design approach improves on the first design by using advanced capabilities such as iSCSI multipathing and multi network
adaptor vMotion. But at the same time, this option has the same challenges related to underutilized resources and inflexibility in
allocating additional resources on the fly to different traffic types.

Design Option 2 - Dynamic Configuration with NIOC and LBT

After looking at the traditional design approach with static uplink configurations, lets take a look at the VMware recommended
design option that takes advantage of the advanced VDS features such as NIOC and LBT. In this design, the connectivity to the
physical network infrastructure remains the same as that described in the static design option. However, instead of allocating
specific dvuplinks to individual traffic types, the ESXi platform utilizes those dvuplinks dynamically. To illustrate this dynamic design,
each virtual infrastructure traffic type's bandwidth utilization is estimated. In a real deployment, customers should first monitor the
virtual infrastructure traffic over a period of time, to gauge the bandwidth utilization, and then come up with bandwidth numbers
for each traffic type. The following are some bandwidth numbers estimated by traffic type for the scenario:

Management traffic (<1Gb)
vMotion (1Gb)
FT (1Gb)
iSCSI (1Gb)
Virtual machine (2Gb)

Based on this bandwidth information, administrators can provision appropriate I/O resources to each traffic type by using the NIOC
feature of VDS. Let's take a look at the VDS parameter configurations for this design, as well as the NIOC setup. The dvuplink port
group configuration remains the same, with eight dvuplinks created for the eight 1GbE network adaptors. The dvportgroup
configuration is described in the following section.

dvportgroup Configuration

In this design, all dvuplinks are active and there are no standby and unused uplinks, as shown in Table 4. All dvuplinks are therefore
available for use by the teaming algorithm. The following are the key parameter configurations of dvportgroup PG-A:

Teaming option: LBT is selected as the teaming algorithm. With LBT configuration, the management traffic initially will be
scheduled based on the virtual port ID hash. Depending on the hash output, management traffic is sent out over one of the
dvuplinks. Other traffic types in the virtual infrastructure can also be scheduled on the same dvuplink initially. However,
when the utilization of the dvuplink goes beyond the 75 percent threshold, the LBT algorithm will be invoked and some of
the traffic will be moved to other underutilized dvuplinks. It is possible that management traffic will be moved to other
dvuplinks when such an LBT event occurs.

The failback option means going from using a standby link to using an active uplink after the active uplink comes back into
operation after a failure. This failback option works when there are active and standby dvuplink configurations. In this
design, there are no standby dvuplinks. So when an active uplink fails, the traffic flowing on that dvuplink is moved to
another working dvuplink. If the failed dvuplink comes back, the LBT algorithm will schedule new traffic on that dvuplink.
This option is left as the default.

VMware recommends isolating all traffic types from each other by defining a separate VLAN for each dvportgroup.

There are several other parameters that are part of the dvportgroup configuration. Customers can choose to configure
these parameters based on their environment needs. For example, they can configure PVLAN to provide isolation when
there are limited VLANs available in the environment.

As you follow the dvportgroup configuration in Table 4, you can see that each traffic type has all dvuplinks active and that these
links are utilized through the LBT algorithm. Let's now look at the NIOC configuration described in the last two columns of Table 4.

Table 4. Dynamic Design Configuration with NIOC and LBT



The NIOC configuration in this design helps provide the appropriate I/O resources to the different traffic types (through shares).
Based on the previously estimated bandwidth numbers per traffic type, the shares parameter is configured in the NIOC shares
column in Table 4. The shares values specify the relative importance of specific traffic types, and NIOC ensures that during
contention scenarios on the dvuplinks, each traffic type gets the allocated bandwidth. For example, a shares configuration of 10 for
vMotion, iSCSI and FT allocates equal bandwidth to these traffic types. Virtual machines get the highest bandwidth with 20 shares
and management gets lower bandwidth with 5 shares.

To illustrate how share values translate to bandwidth numbers, let's take the example of a 1Gb-capacity dvuplink carrying all five traffic
types. This is a worst-case scenario where all traffic types are mapped to one dvuplink.

This will never happen when customers enable the LBT feature, because LBT will balance the traffic based on the utilization of
uplinks. This example shows how much bandwidth each traffic type will be allowed on one dvuplink during a contention or
oversubscription scenario and when LBT is not enabled.

Total shares: management (5) + vMotion (10) + FT (10) + iSCSI (10) + virtual machine (20) = 55
1Gb = 1000Mbps
o Management: 5 shares; (5/55) x 1000 = 90.91Mbps
o vMotion: 10 shares; (10/55) x 1000 = 181.82Mbps
o FT: 10 shares; (10/55) x 1000 = 181.82Mbps
o iSCSI: 10 shares; (10/55) x 1000 = 181.82Mbps
o Virtual machine: 20 shares; (20/55) x 1000 = 363.64Mbps

Note: Given a workload requirement for a port group specified in Mbps, the same arithmetic can be worked backward to identify the required share value.

To calculate the bandwidth numbers during contention, you should first calculate the percentage of bandwidth for a traffic type by
dividing its share value by the total available share number (55). In the second step, the total bandwidth of the dvuplink (1Gb) is
multiplied by the percentage of bandwidth calculated in the first step. For example, 5 shares allocated to management
traffic translate to 90.91Mbps of bandwidth for the management process on a fully utilized 1Gb network adaptor. In this example, a
custom share configuration is discussed, but a customer can make use of predefined high (100), normal (50) and low (25) shares
when assigning them to different traffic types.
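The same arithmetic can be captured in a short Python sketch that reproduces the worst-case split above; the share values mirror the example and the 1Gb (1000Mbps) uplink speed is the assumption used throughout this section.

# Hedged sketch: convert NIOC share values into worst-case per-traffic-type
# bandwidth on a single fully utilized uplink, reproducing the arithmetic above.

def nioc_bandwidth_split(shares, uplink_mbps=1000):
    total = sum(shares.values())
    return {traffic: round(uplink_mbps * s / total, 2) for traffic, s in shares.items()}

shares = {"management": 5, "vMotion": 10, "FT": 10, "iSCSI": 10, "virtual machine": 20}
for traffic, mbps in nioc_bandwidth_split(shares).items():
    print(f"{traffic}: {mbps} Mbps")
# management: 90.91, vMotion/FT/iSCSI: 181.82 each, virtual machine: 363.64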

The vSphere platform takes these configured share values and applies them per uplink. The schedulers running at each uplink are
responsible for making sure that the bandwidth resources are allocated according to the shares. In the case of an eight 1GbE
network adaptor deployment, there are eight schedulers running. Depending on the number of traffic types scheduled on a
particular uplink, the scheduler will divide the bandwidth among the traffic types, based on the share numbers. For example, if only
FT (10 shares) and management (5 shares) traffic are flowing through dvuplink 5, FT traffic will get double the bandwidth of
management traffic, based on the shares value. Also, when there is no management traffic flowing, all bandwidth can be utilized by
the FT process. This flexibility in allocating I/O resources is the key benefit of the NIOC feature.

The NIOC limits parameter of Table 4 is not configured in this design. The limits value specifies an absolute maximum limit on egress
traffic for a traffic type. Limits are specified in Mbps. This configuration provides a hard limit on any traffic, even if I/O resources are
available to use. Using limits configuration is not recommended unless you really want to control the traffic, even though additional
resources are available.

There is no change in physical switch configuration in this design approach, even with the choice of the new LBT algorithm. The LBT
teaming algorithm doesn't require any special configuration on physical switches. Refer to the physical switch settings described in
Design Option 1.



This design does not provide higher than 1Gb bandwidth to the vMotion and iSCSI traffic types as is the case with static design using
multi-network adaptor vMotion and iSCSI multipathing. The LBT algorithm cannot split the infrastructure traffic across multiple
dvuplink ports and utilize all the links. So even if vMotion dvportgroup PG-B has all eight 1GbE network adaptors as active uplinks,
vMotion traffic will be carried over only one of the eight uplinks. The main advantage of this design is evident in the scenarios where
the vMotion process is not using the uplink bandwidth, and other traffic types are in need of the additional resources. In these
situations, NIOC makes sure that the unused bandwidth is allocated to the other traffic types that need it.

This dynamic design option is the recommended approach because it takes advantage of the advanced VDS features and utilizes I/O
resources efficiently. This option also provides active-active resiliency where no uplinks are in standby mode. In this design
approach, customers allow the vSphere platform to make the optimal decisions on scheduling traffic across multiple uplinks.

Some customers who have restrictions in the physical infrastructure in terms of bandwidth capacity across different paths and
limited availability of the layer 2 domain might not be able to take advantage of this dynamic design option. When deploying this
design option, it is important to consider all the different traffic paths that a traffic type can take and to make sure that the physical
switch infrastructure can support the specific characteristics required for each traffic type. VMware recommends that vSphere and
network administrators work together to understand the impact of the vSphere platform's traffic scheduling feature over the
physical network infrastructure before deploying this design option.

Every customer environment is different, and the requirements for the traffic types are also different. Depending on the need of the
environment, a customer can modify these design options to fit their specific requirements. For example, customers can choose to
use a combination of static and dynamic design options when they need higher bandwidth for iSCSI and vMotion activities. In this
hybrid design, four uplinks can be statically allocated to iSCSI and vMotion traffic types while the remaining four uplinks are used
dynamically for the remaining traffic types (it may also be that the IP storage infrastructure uses separate physical switches). Table 5
shows the traffic types and associated port group configurations for the hybrid design. As shown in the table, management, FT and
virtual machine traffic will be distributed on dvuplink1 to dvuplink4 through the vSphere platform's traffic scheduling features, LBT
and NIOC. The remaining four dvuplinks are statically assigned to vMotion and iSCSI traffic types.

Rack Server with Two 10GbE Network Adaptors



The two-10GbE network adaptor deployment model is becoming very common because of the benefits it provides through I/O
consolidation. The key benefits include better utilization of I/O resources, simplified management and reduced CAPEX and OPEX.
Although this deployment provides these benefits, there are some challenges when it comes to the traffic management aspects.
Especially in highly consolidated virtualized environments where more traffic types are carried over fewer 10GbE network adaptors,
it becomes critical to prioritize traffic types that are important and provide the required SLA guarantees.

The NIOC feature available on the VDS helps in this traffic management activity. In the following sections, you will see how to utilize
this feature in the different designs.

As shown in Figure 5, rack servers with two 10GbE network adaptors are connected to the two access layer switches to avoid any
single point of failure. Similar to the rack server with eight 1GbE network adaptors, the different VDS and physical switch parameter
configurations are taken into account with this design. On the physical switch side, the new 10Gb switches might have support for
FCoE, which enables convergence of SAN and LAN traffic. This document covers only the standard 10Gb deployments that support IP
storage traffic (iSCSI/NFS) and not FCoE.

In this section, two design options are described; one is a traditional approach and the other one is a VMware recommended
approach.



Figure 5. Rack Server with Two 10GbE Network Adaptors


Design Option 1 - Static Configuration

The static configuration approach for rack server deployment with 10GbE network adaptors is similar to the one described in
Design Option 1 of rack server deployment with eight 1GbE adaptors. There are a few differences in the configuration where the
number of dvuplinks is changed from eight to two, and the dvportgroup parameters are different. Let's take a look at the
configuration details on the VDS front.

dvuplink Configuration

To support the maximum of two Ethernet network adaptors per host, the dvuplink port group is configured with two dvuplinks
(dvuplink1, dvuplink2). On the hosts, dvuplink1 is associated with vmnic0 and dvuplink2 is associated with vmnic1.

dvportgroup Configuration

As described in Table 6, there are five different dvportgroups that are configured for the five different traffic types. For example,
dvportgroup PG-A is created for the management traffic type. The following are the other key configurations of dvportgroup PG-A:

Teaming option: An explicit failover order provides a deterministic way of directing traffic to a particular uplink. By selecting
dvuplink1 as an active uplink and dvuplink2 as a standby uplink, management traffic will be carried over dvuplink1 unless
there is a failure with it. Configuring the failback option to No is also recommended, to avoid the flapping of traffic
between two network adaptors. The failback option determines how a physical adaptor is returned to active duty after
recovering from a failure. If failback is set to No, a failed adaptor is left inactive, even after recovery, until another
currently active adaptor fails, requiring its replacement.

VMware recommends isolating all traffic types from each other by defining a separate VLAN for each dvportgroup.

There are various other parameters that are part of the dvportgroup configuration. Customers can choose to configure
these parameters based on their environment needs.

Table 6 provides the configuration details for all the dvportgroups. According to the configuration, dvuplink1 carries management,
iSCSI and virtual machine traffic; dvuplink2 handles vMotion, FT and virtual machine traffic. As you can see, the virtual machine
traffic type makes use of two uplinks, and these uplinks are utilized through the LBT algorithm.
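
The behaviour of the explicit failover order with failback set to No can be modelled in a few lines of Python. The mapping below mirrors the Table 6 layout described above and is an illustration only, not a vSphere API:

# Active/standby uplink order per traffic type, as in Table 6 (illustrative).
failover_order = {
    "management": ["dvuplink1", "dvuplink2"],
    "iscsi":      ["dvuplink1", "dvuplink2"],
    "vmotion":    ["dvuplink2", "dvuplink1"],
    "ft":         ["dvuplink2", "dvuplink1"],
}

def uplink_in_use(traffic, failed, current=None):
    """Return the uplink carrying a traffic type. With failback = No, the
    traffic stays on its current healthy uplink even after the preferred
    (active) uplink recovers; it only moves when the current uplink fails."""
    if current and current not in failed:
        return current
    for uplink in failover_order[traffic]:
        if uplink not in failed:
            return uplink
    return None  # all uplinks are down

print(uplink_in_use("management", failed=set()))                       # dvuplink1
print(uplink_in_use("management", failed={"dvuplink1"}))               # dvuplink2 (failover)
print(uplink_in_use("management", failed=set(), current="dvuplink2"))  # dvuplink2 (no failback)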

With this deterministic teaming policy, customers can decide to map different traffic types to the available uplink ports, depending
on environment needs. For example, if iSCSI traffic needs higher bandwidth and other traffic types have relatively low bandwidth
requirements, customers can decide to keep only iSCSI traffic on dvuplink1 and move all other traffic to dvuplink2. When deciding
on these traffic paths, customers should understand the physical network connectivity and the paths' bandwidth capacities.

Physical Switch Configuration

The external physical switch, which the rack server's network adaptors are connected to, has trunk configuration with all the
appropriate VLANs enabled. As described in the physical network switch parameters sections, the following switch configurations
are performed based on the VDS setup described in Table 6.

Enable STP on the trunk ports facing ESXi hosts, along with the PortFast mode and BPDU guard feature.
The teaming configuration on VDS is static and therefore no link aggregation is configured on the physical switches.
Because of the mesh topology deployment shown in Figure 5, the link state-tracking feature is not required on the physical
switches.



Table 6. Static Design Configuration

This static design option provides flexibility in the traffic path configuration, but it cannot protect against one traffic type's
dominating others. For example, there is a possibility that a network-intensive vMotion process might take away most of the
network bandwidth and impact virtual machine traffic. Bidirectional traffic-shaping parameters at the port group and port levels can
provide some help in managing different traffic rates. However, using this approach for traffic management requires customers to
limit the traffic on the respective dvportgroups. Limiting traffic to a certain level through this method puts a hard limit on the traffic
types, even when bandwidth is available to utilize. This underutilization of I/O resources because of hard limits is overcome
through the NIOC feature, which provides flexible traffic management based on the shares parameters. Design Option 2, described
in the following section, is based on the NIOC feature.

Design Option 2 - Dynamic Configuration with NIOC and LBT

This dynamic design option is the VMware-recommended approach that takes advantage of the NIOC and LBT features of the VDS.

Connectivity to the physical network infrastructure remains the same as that described in Design Option 1. However, instead of
allocating specific dvuplinks to individual traffic types, the ESXi platform utilizes those dvuplinks dynamically. To illustrate this
dynamic design, each virtual infrastructure traffic type's bandwidth utilization is estimated. In a real deployment, customers should
first monitor the virtual infrastructure traffic over a period of time to gauge the bandwidth utilization, and then come up with
bandwidth numbers.

The following are some bandwidth numbers estimated by traffic type:
Management traffic (less than 1Gbps)
vMotion (2Gbps)
FT (1Gbps)
iSCSI (2Gbps)
Virtual machine (2Gbps)

These bandwidth estimates are different from the one considered with rack server deployment with eight 1GbE network adaptors.
Let's take a look at the VDS parameter configurations for this design. The dvuplink port group configuration remains the same, with
two dvuplinks created for the two 10GbE network adaptors. The dvportgroup configuration is as follows.

dvportgroup Configuration

In this design, all dvuplinks are active and there are no standby and unused uplinks, as shown in Table 7. All dvuplinks are therefore
available for use by the teaming algorithm. The following are the key configurations of dvportgroup PG-A:

Teaming option: LBT is selected as the teaming algorithm. With LBT configuration, management traffic initially will be
scheduled based on the virtual port ID hash. Based on the hash output, management traffic will be sent out over one of the
dvuplinks. Other traffic types in the virtual infrastructure can also be scheduled on the same dvuplink with LBT
configuration. Subsequently, if the utilization of the uplink goes beyond the 75 percent threshold, the LBT algorithm will be
invoked and some of the traffic will be moved to other underutilized dvuplinks. It is possible that management traffic will
get moved to other dvuplinks when such an event occurs.

There are no standby dvuplinks in this configuration, so the failback setting is not applicable for this design approach. The
default setting for this failback option is Yes.

VMware recommends isolating all traffic types from each other by defining a separate VLAN for each dvportgroup.
There are several other parameters that are part of the dvportgroup configuration. Customers can choose to configure
these parameters based on their environment needs.

As you follow the dvportgroups' configuration in Table 7, you can see that each traffic type has all the dvuplinks as active and these
uplinks are utilized through the LBT algorithm. Let's take a look at the NIOC configuration.

The NIOC configuration in this design not only helps provide the appropriate I/O resources to the different traffic types but also
provides SLA guarantees by preventing one traffic type from dominating others. Based on the bandwidth assumptions made for
different traffic types, the shares parameters are configured in the NIOC shares column in Table 7. To illustrate how share values
translate to bandwidth numbers in this deployment, let's take an example of a 10Gb capacity dvuplink carrying all five traffic types.
This is a worst-case scenario in which all traffic types are mapped to one dvuplink. This will never happen when customers enable
the LBT feature, because LBT will move the traffic type based on the uplink utilization.

The following example shows how much bandwidth each traffic type will be allowed on one dvuplink during a contention or
oversubscription scenario and when LBT is not enabled:

Total shares: management (5) + vMotion (20) + FT (10) + iSCSI (20) + virtual machine (20) = 75
10Gb = 10000Mbps
o Management: 5 shares; (5/75) x 10Gb = 667Mbps
o vMotion: 20 shares; (20/75) x 10Gb = 2.67Gbps
o FT: 10 shares; (10/75) x 10Gb = 1.33Gbps
o iSCSI: 20 shares; (20/75) x 10Gb = 2.67Gbps
o Virtual machine: 20 shares; (20/75) x 10Gb = 2.67Gbps


For each traffic type, first the percentage of bandwidth is calculated by dividing the share value by the total available share number
(75), and then the total bandwidth of the dvuplink (10Gb) is used to calculate the bandwidth share for the traffic type. For example,
20 shares allocated to vMotion traffic translate to 2.67Gbps of bandwidth to the vMotion process on a fully utilized 10GbE network
adaptor.

In this 10GbE deployment, customers can provide bigger pipes to individual traffic types without the use of trunking or multipathing
technologies. This was not the case with an eight-1GbE deployment.

There is no change in physical switch configuration in this design approach, so refer to the physical switch settings described in
Design Option 1 in the previous section.
Table 7. Dynamic Design Configuration
This design option utilizes the advanced VDS features and provides customers with a dynamic and flexible design approach. In this
design, I/O resources are utilized effectively and SLAs are met based on the shares allocation.





Blade Server in Example Deployment

Blade servers are server platforms that provide higher server consolidation per rack unit as well as lower power and cooling costs.
Blade chassis that host the blade servers have proprietary architectures and each vendor has its own way of managing resources in
the blade chassis. It is difficult in this document to look at all of the various blade chassis available on the market and to discuss their
deployments. In this section, we will focus on some generic parameters that customers should consider when deploying VDS in a
blade chassis environment. From a networking point of view, all blade chassis provide the following two options:

Integrated switches: With this option, the blade chassis enables built-in switches to control traffic flow between the blade
servers within the chassis and the external network.
Pass-through technology: This is an alternative method of network connectivity that enables the individual blade servers to
communicate directly with the external network.

In this document, the integrated switch option is described as where the blade chassis has a built-in Ethernet switch. This Ethernet
switch acts as an access layer switch, as shown in Figure 6. This section discusses a deployment in which the ESXi host is running on a
blade server. The following two types of blade server configuration will be described in the next section:

Blade server with two 10GbE network adaptors
Blade server with hardware-assisted multiple logical network adaptors


For each of these two configurations, various VDS design approaches will be discussed.

Blade Server with Two 10GbE Network Adaptors

This deployment is quite similar to that of a rack server with two 10GbE network adaptors in which each ESXi host is provided with
two 10GbE network adaptors. As shown in Figure 6, an ESXi host running on a blade server in the blade chassis is also provided with
two 10GbE network adaptors.

Figure 6. Blade Server with Two 10GbE Network Adaptors



In this section, two design options are described. One is a traditional static approach and the other one is a VMware recommended
dynamic configuration with NIOC and LBT features enabled. These two approaches are exactly the same as the deployment
described in the Rack Server with Two 10GbE Network Adaptors section. Only blade chassis-specific design decisions will be
discussed as part of this section. For all other VDS and switch-related configurations, refer to the Rack Server with Two 10GbE
Network Adaptors section of this document.
Design Option 1 - Static Configuration

The configuration of this design approach is exactly the same as that described in the Design Option 1 section under Rack Server
with Two 10GbE Network Adaptors. Refer to Table 6 for dvportgroup configuration details. Let's take a look at the blade server-
specific parameters that require attention during the design. Network and hardware reliability considerations should be
incorporated during the blade server design as well. In these blade server designs, customers must focus on the following two areas:

High availability of blade switches in the blade chassis
Connectivity of blade server network adaptors to internal blade switches

High availability of blade switches can be achieved by having two Ethernet switching modules in the blade chassis. And the
connectivity of two network adaptors on the blade server should be such that one network adaptor is connected to the first
Ethernet switch module, and the other network adaptor is hooked to the second switch module in the blade chassis.

Another aspect that requires attention in the blade server deployment is the network bandwidth availability across the midplane of
the blade chassis and between the blade switches and aggregation layer. If there is an oversubscription scenario in the deployment,
customers must think about utilizing the traffic shaping and prioritization (802.1p tagging) features available in the vSphere platform.
The prioritization feature enables customers to tag the important traffic coming out of the vSphere platform. These high-priority
tagged packets are then treated according to priority by the external switch infrastructure. During congestion scenarios, the switch
will drop lower-priority packets first and avoid dropping the important, high-priority packets.

This static design option provides customers with the flexibility to choose different network adaptors for different traffic types.
However, when allocating traffic across only two 10GbE network adaptors, administrators ultimately will schedule
multiple traffic types on a single adaptor. As multiple traffic types flow through one adaptor, the chances of one traffic type
dominating others increase. To avoid the performance impact of the noisy neighbors (dominating traffic type), customers must
utilize the traffic management tools provided in the vSphere platform. One of the traffic management features is NIOC, and that
feature is utilized in Design Option 2, which is described in the following section.


Design Option 2 - Dynamic Configuration with NIOC and LBT

This dynamic configuration approach is exactly the same as that described in the Design Option 2 section under Rack Server with
Two 10GbE Network Adaptors. Refer to Table 7 for the dvportgroup configuration details and NIOC settings. The physical switch
related configuration in the blade chassis deployment is the same as that described in the rack server deployment. For the blade
center-specific recommendation on reliability and traffic management, refer to the previous section.

VMware recommends this design option, which utilizes the advanced VDS features and provides customers with a dynamic and
flexible design approach. With this design, I/O resources are utilized effectively and SLAs are met based on the shares allocation.

Blade Server with Hardware-Assisted Logical Network Adaptors (HP Flex-10- or Cisco UCS-like Deployment)

Some of the new blade chassis support traffic management capabilities that enable customers to carve up I/O resources. This is
achieved by providing logical network adaptors for the ESXi hosts. Instead of two 10GbE network adaptors, the ESXi host now sees
multiple physical network adaptors that operate at different configurable speeds. As shown in Figure 7, each ESXi host is provided
with eight Ethernet network adaptors that are carved out of two 10GbE network adaptors.

Figure 7. Multiple Logical Network Adaptors



This deployment is quite similar to that of the rack server with eight 1GbE network adaptors. However, instead of 1GbE network
adaptors, the capacity of each network adaptor is configured at the blade chassis level. In the blade chassis, customers can carve out
different capacity network adaptors based on the need of each traffic type. For example, if iSCSI traffic needs 2.5Gb of bandwidth, a
logical network adaptor with that amount of I/O resources can be created on the blade chassis and provided for the blade server.


As for the configuration of the VDS and blade chassis switch infrastructure, the configuration described in Design Option 1 under
Rack Server with Eight 1GbE Network Adaptors is more relevant for this deployment. The static configuration option described in
that design can be applied as is in this blade server environment. Refer to Table 2 for the dvportgroup configuration details and
switch configurations described in that section for physical switch configuration details.

The question now is whether NIOC capability adds any value in this specific blade server deployment. NIOC is a traffic management
feature that helps in scenarios where multiple traffic types flow through one uplink or network adaptor. If in this particular
deployment only one traffic type is assigned to a specific Ethernet network adaptor, the NIOC feature will not add any value.
However, if multiple traffic types are scheduled over one network adaptor, customers can make use of NIOC to assign appropriate
shares to different traffic types. This NIOC configuration will ensure that bandwidth resources are allocated to traffic types and that
SLAs are met.


As an example, let's consider a scenario in which vMotion and iSCSI traffic is carried over one 3Gb logical uplink. To protect the iSCSI
traffic from network-intensive vMotion traffic, administrators can configure NIOC and allocate shares to each traffic type. If the two
traffic types are equally important, administrators can configure shares with equal values (10 each). With this configuration, when
there is a contention scenario, NIOC will make sure that the iSCSI process gets half of the 3Gb uplink bandwidth, avoiding any
impact from the vMotion process.

VMware recommends that the network and server administrators work closely together when deploying the traffic management
features of the VDS and blade chassis. To achieve the best end-to-end quality of service (QoS) result, a considerable amount of
coordination is required during the configuration of the traffic management features.

Operational Best Practices

After a customer successfully designs the virtual network infrastructure, the next challenges are how to deploy the design and how
to keep the network operational. VMware provides various tools, APIs, and procedures to help customers effectively deploy and
manage their network infrastructure. The following are some key tools available in the vSphere platform:

VMware vSphere Command-Line Interface (vSphere CLI)
VMware vSphere API
Virtual network monitoring and troubleshooting
NetFlow
Port mirroring

In the following section, we will briefly discuss how vSphere and network administrators can utilize these tools to manage their
virtual network. Refer to the vSphere documentation for more details on the tools.

VMware vSphere Command-Line Interface

vSphere administrators have several ways to access vSphere components through vSphere interface options, including VMware
vSphere Client, vSphere Web Client, and vSphere Command-Line Interface. The vSphere CLI command set enables administrators to
perform configuration tasks by using a vSphere vCLI package installed on supported platforms or by using VMware vSphere
Management Assistant (vMA).
Refer to the Getting Started with vSphere CLI document for more details on the commands:
http://www.vmware.com/support/developer/vcli.

The entire networking configuration can be performed through vSphere vCLI, helping administrators automate the deployment
process.

VMware vSphere API

The networking setup in the virtualized datacenter involves configuration of virtual and physical switches. VMware has provided
APIs that enable network switch vendors to get information about the virtual infrastructure, which helps them to automate the
configuration of the physical switches and the overall process.

For example, vCenter can trigger an event after the vMotion process of a virtual machine is performed. After receiving this event
trigger and related information, the network vendors can reconfigure the physical switch port policies such that when the virtual
machine moves to another host, the VLAN/access control list (ACL) configurations are migrated along with the virtual machine.
Multiple networking vendors have provided this automation between physical and virtual infrastructure configurations through
integration with vSphere APIs.

Customers should check with their networking vendors to learn whether such an automation tool exists that will bridge the gap
between physical and virtual networking and simplify the operational challenges.


Virtual Network Monitoring and Troubleshooting

Monitoring and troubleshooting network traffic in a virtual environment require similar tools to those available in the physical
switch environment. With the release of vSphere 5, VMware gives network administrators the ability to monitor and troubleshoot
the virtual infrastructure through features such as NetFlow and port mirroring.

NetFlow capability on a distributed switch along with a NetFlow collector tool helps monitor application flows and measures flow
performance over time. It also helps in capacity planning and ensuring that I/O resources are utilized properly by different
applications, based on their needs.

Port mirroring capability on a distributed switch is a valuable tool that helps network administrators debug network issues in a
virtual infrastructure. Granular control over monitoring ingress, egress or all traffic of a port helps administrators fine-tune what
traffic is sent for analysis.

vCenter Server on a Virtual Machine

As mentioned earlier, vCenter Server is only used to provision and manage VDS configurations. Customers can choose to deploy it on
a virtual machine or a physical host, depending on their management resource design requirements. In case of vCenter Server failure
scenarios, the VDS will continue to provide network connectivity, but no VDS configuration changes can be performed.

By deploying vCenter Server on a virtual machine, customers can take advantage of vSphere platform features such as vSphere High
Availability (HA) and VMware Fault Tolerance (FT) to provide higher resiliency to the management plane.

In such deployments, customers must pay more attention to the network configurations. This is because if the networking for a
virtual machine hosting vCenter Server is misconfigured, the network connectivity of vCenter Server is lost. This misconfiguration
must be fixed. However, customers need vCenter Server to fix the network configuration because only vCenter Server can configure
a VDS. As a work-around to this situation, customers must connect to the host directly where the vCenter Server virtual machine is
running through vSphere Client. Then they must reconnect the virtual machine hosting vCenter Server to a VSS that is also
connected to the management network of hosts. After the virtual machine running vCenter Server is reconnected to the network, it
can manage and configure VDS.

Refer to the community article Virtual Machine Hosting a vCenter Server Best Practices for guidance regarding the deployment of
vCenter on a virtual machine:

http://communities.vmware.com/servlet/JiveServlet/previewBody/14089-102-1-16292/VM hostVCBestPracitices. html.

Conclusion

A VMware vSphere distributed switch provides customers with the right measure of features, capabilities and operational simplicity
for deploying a virtual network infrastructure. As customers move on to build private or public clouds, VDS provides the scalability
numbers for such deployments. Advanced capabilities such as NIOC and LBT are key for achieving better utilization of I/O resources
and for providing better SLAs for virtualized business-critical applications and multitenant deployments. Support for standard
networking visibility and monitoring features such as port mirroring and NetFlow helps administrators manage and troubleshoot a
virtual infrastructure through familiar tools. VDS also is an extensible platform that enables integration with other networking
vendor products through open vSphere APIs.

12. VMware Network I/O Control: Architecture, Performance and Best Practices

The Network I/O Control (NetIOC) feature available in VMware vSphere 4.1 (vSphere) addresses the challenge of sharing physical
network bandwidth among competing traffic types by introducing a software approach to partitioning that bandwidth among the
different types of network traffic flows. It does so by providing appropriate quality of service (QoS) policies enforcing traffic
isolation, predictability and prioritization, therefore helping IT organizations overcome the contention resulting from consolidation.
The experiments conducted in VMware performance labs using industry-standard workloads show that NetIOC:

Maintains NFS and/or iSCSI storage performance in the presence of other network traffic such as vMotion and bursty
virtual machines.
Provides network service level guarantees for critical virtual machines.
Ensures adequate bandwidth for VMware Fault Tolerance (VMware FT) logging.
Ensures predictable vMotion performance and duration.
Facilitates any situation where a minimum or weighted level of service is required for a particular traffic type independent
of other traffic types.

The sections that follow discuss:

Use cases and application of NetIOC with 10GbE in contrast to traditional 1GbE deployments
The NetIOC technology and architecture used within the vNetwork Distributed Switch (vDS)
How to configure NetIOC from the vSphere Client
Examples of NetIOC usage to illustrate possible deployment scenarios
Results from actual performance tests using NetIOC to illustrate how NetIOC can protect and prioritize traffic in the face of
network contention and oversubscription
Best practices for deployment


Moving from 1GbE to 10GbE

Virtualized datacenters are characterized by newer and complex types of network traffic flows such as vMotion and VMware FT
logging traffic. In today's virtualized datacenters where 10GbE connectivity is still not commonplace, networking is typically based on
large numbers of 1GbE physical connections that are used to isolate different types of traffic flows and to provide sufficient
bandwidth.
Table 1. Typical Deployment and Provisioning of 1GbE NICs with vSphere 4.0


Provisioning a large number of GbE network adapters to accommodate peak bandwidth requirements of these different types of
traffic flows has a number of shortcomings:

Limited bandwidth: Flows from an individual source (virtual machine, vMotion interface, and so on) are limited and bound
to the bandwidth of a single 1GbE interface even if more bandwidth is available within a team
Excessive complexity: Use of large numbers of 1GbE adapters per server leads to excessive complexity in cabling and
management, with an increased likelihood of misconfiguration
Higher capital costs: Large numbers of 1GbE adapters require more physical switch ports, which in turn leads to higher
capital costs including additional switches and rack space

Lower utilization: Static bandwidth allocation to accommodate peak bandwidth for different traffic flows means poor
average network bandwidth utilization

10GbE provides ample bandwidth for all the traffic flows to coexist and share the same physical 10GbE link. Flows that were limited
to the bandwidth of a single 1GbE link are now able to use as much as 10GbE. While the use of a 10GbE solution greatly simplifies
the networking infrastructure and addresses all the shortcomings listed above, there are a few challenges that still need to be
addressed to maximize the value of a 10GbE solution. One means of optimizing the 10GbE network bandwidth is to prioritize the
network traffic by traffic flows. This ensures that latency-sensitive and critical traffic flows can access the bandwidth they need.
NetIOC enables the convergence of diverse workloads on a single networking pipe. It provides sufficient controls to the vSphere
administrator in the form of limits and shares parameters to enable and ensure predictable network performance when multiple
traffic types contend for the same physical network resources.
NetIOC Architecture
Prerequisites for NetIOC
NetIOC is only supported with the vNetwork Distributed Switch (vDS). With vSphere 4.1, a single vDS can span up to 350 ESX/ESXi
hosts (500 as of vSphere 5.5), providing a simplified and more powerful management environment versus the per-host switch model
using the vNetwork Standard Switch (vSS). The vDS also provides a superset of features and capabilities over that of the vSS, such as
network vMotion, bi-directional traffic shaping and private VLANs.
Configuring and managing a vDS involves use of distributed port groups (DV Port Groups) and distributed virtual uplinks (dvUplinks).
DV Port Groups are port groups associated with a vDS similar to port groups available with vSS. dvUplinks provide a level of
abstraction for the physical NICs (vmnics) on each vSphere host.
NetIOC Feature Set
NetIOC provides users with the following features:

Isolation: ensure traffic isolation so that a given flow will never be allowed to dominate over others, thus preventing drops
and undesired jitter.
Shares: allow flexible networking capacity partitioning to help users deal with overcommitment when flows compete
aggressively for the same resources
Limits: enforce traffic bandwidth limit on the overall vDS set of dvUplinks
Load-Based Teaming: efficiently use a vDS set of dvUplinks for networking capacity


NetIOC Traffic Classes

The NetIOC concept revolves around resource pools that are similar in many ways to the ones already existing for CPU and Memory.
NetIOC classifies traffic into six predefined resource pools as follows:
vMotion
iSCSI
FT logging
Management
NFS
Virtual machine traffic

Figure 1. NetIOC Architecture

Shares

A user can specify the relative importance of a given resource-pool flow using shares that are enforced at the dvUplink level. The
underlying dvUplink bandwidth is then divided among resource-pool flows based on their relative shares in a work-conserving way.
This means that unused capacity will be redistributed to other contending flows and won't go to waste. As shown in Figure 1, the
network flow scheduler is the entity responsible for enforcing shares and therefore is in charge of the overall arbitration under
overcommitment. Each resource-pool flow has its own dedicated software queue inside the scheduler so that packets from a given
resource pool won't be dropped due to high utilization by other flows.
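
The work-conserving behaviour described above can be approximated with a simple allocation loop. This is a simplified model for intuition only, not the actual vSphere scheduler:

def allocate(demand_mbps, shares, link_mbps):
    """Work-conserving proportional-share allocation on one dvUplink:
    a pool that needs less than its proportional slice keeps only what it
    uses; the leftover capacity is redistributed to the remaining pools."""
    alloc, remaining, capacity = {}, dict(demand_mbps), link_mbps
    while remaining and capacity > 0:
        total_shares = sum(shares[p] for p in remaining)
        satisfied = {p: d for p, d in remaining.items()
                     if d <= shares[p] / total_shares * capacity}
        if not satisfied:  # every remaining pool is constrained: split by shares
            for p in remaining:
                alloc[p] = shares[p] / total_shares * capacity
            return alloc
        for p, d in satisfied.items():
            alloc[p] = d
            capacity -= d
            del remaining[p]
    return alloc

# vMotion is idle, so NFS and virtual machine traffic absorb its unused share.
print(allocate({"nfs": 6000, "vm": 6000, "vmotion": 0},
               {"nfs": 50, "vm": 50, "vmotion": 100}, link_mbps=10000))
# {'vmotion': 0, 'nfs': 5000.0, 'vm': 5000.0}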
Limits
A user can specify an absolute shaping limit for a given resource-pool flow using a bandwidth capacity limiter. As opposed to shares
that are enforced at the dvUplink level, limits are enforced on the overall vDS set of dvUplinks, which means that a flow of a given
resource pool will never exceed a given limit for a vDS out of a given vSphere host.
Load-Based Teaming (LBT)
vSphere 4.1 introduced a load-based teaming (LBT) policy that ensures vDS dvUplink capacity is optimized. LBT avoids
the situation seen with other teaming policies, in which some of the dvUplinks in a DV Port Group's team were idle while others were
completely saturated simply because the teaming policy used is statically determined (IP hashing). LBT reshuffles port binding
dynamically, based on load and dvUplink usage, to make efficient use of the available bandwidth. LBT only moves ports to
dvUplinks configured for the corresponding DV Port Group's team. Note that LBT does not use shares or limits to make its judgment
while rebinding ports from one dvUplink to another. LBT is not the default teaming policy in a DV Port Group, so it is up to the user to
configure it as the active policy.
LBT will only move a flow when the mean send or receive utilization on an uplink exceeds 75 percent of capacity over a 30-second
period. LBT will not move flows more often than every 30 seconds.
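
As a rough illustration of the LBT trigger described above, the following Python sketch moves one port binding when a dvUplink's 30-second mean utilization crosses the 75 percent threshold. The data structures and the choice of which port to move are simplifications, not the actual algorithm:

LBT_THRESHOLD = 0.75   # 75 percent mean utilization over a 30-second window

def lbt_rebalance(team, capacity_mbps):
    """team maps dvUplink name -> {port: mean Mbps over the last 30 seconds}.
    If an uplink is above the threshold, rebind one port to the least-loaded
    uplink in the same team (LBT never moves ports out of the team)."""
    load = {u: sum(ports.values()) for u, ports in team.items()}
    for uplink, ports in team.items():
        if load[uplink] / capacity_mbps > LBT_THRESHOLD and len(ports) > 1:
            target = min(load, key=load.get)
            if target == uplink:
                continue
            port = min(ports, key=ports.get)      # simplistic choice of flow to move
            team[target][port] = team[uplink].pop(port)
            return f"moved {port} from {uplink} to {target}"
    return "no rebalancing needed"

team = {"dvuplink1": {"vmotion": 7000, "vm-web01": 1500},
        "dvuplink2": {"management": 100}}
print(lbt_rebalance(team, capacity_mbps=10000))   # moved vm-web01 from dvuplink1 to dvuplink2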
Configuring NetIOC
NetIOC is configured through the vSphere Client in the Resource Allocation tab of the vDS, from within the Home > Inventory >
Networking panel.
NetIOC is enabled by clicking Properties... on the right side of the panel and then checking Enable network I/O control on this
vDS in the pop-up box.
The Limits and Shares for each traffic type are configured by right-clicking on the traffic type (for example, Virtual Machine Traffic) and
selecting Edit Settings... This will bring up a Network Resource Pool Setting dialog box in which you can select the Limits and
Shares values for that traffic type.


NetIOC Usage

Unlike the limits that are specified in absolute units of Mbps, shares are used to specify the relative importance of the flows. Shares
are specified in abstract units with a value ranging from 1 to 100. In this section, we provide an example that describes the usage of
shares.
Figure 6. NetIOC shares usage example



Figure 6 highlights the following characteristics of the shares:
In absence of any other traffic, a particular traffic flow gets 100 percent of the bandwidth available, even if it was
configured with 25 shares
During the periods of contention, the bandwidth is divided among the traffic flows based on their relative shares

NetIOC Performance

In this section, we describe in detail the test-bed configuration, the workloads used to generate the network traffic flows, and the
test results.

Test Configuration
In our test configuration, we used an ESX cluster that comprised two Dell PowerEdge R610 servers running the
GA release of ESX 4.1. Each of the servers was configured with dual-socket, quad-core 2.27 GHz Intel Xeon L5520 processors, 96 GB
of RAM, and a 10 GbE Intel Oplin NIC. The following figure depicts the hardware configuration used in our tests. The complete
hardware details are provided in Appendix A.
Figure 7. Physical Hardware Setup Used in the Tests



In our test configuration, we used a single vDS that spanned both vSphere hosts. We configured the vDS with a single dvUplink
(dvUplink1). The 10GbE physical NIC port on each of two vSphere hosts was mapped to dvUplink1. We configured the vDS with four
DV Port Groups as follows:

dvPortGroup-FT for FT logging traffic


dvPortGroup-NFS for NFS traffic
dvPortGroup-VM for virtual machine traffic
dvPortGroup-vMotion for vMotion traffic

Using four distinct DV Port Groups enabled us to easily track the network bandwidth usage of the different traffic flows. As shown in
Figure 8, on both vSphere hosts, the virtual network adapters (vNICs) of all the virtual machines used for virtual machine traffic, and
the VMkernel interfaces (vmknics) used for vMotion, NFS, and VMware FT logging were configured to use the same 10GbE physical
network adapter through the vDS interface.
Figure 8. vDS Configuration Used in the Tests

Workloads Used for Performance Testing


To simulate realistic high network I/O load scenarios, we used the industry-standard workloads SPECweb2005 and SPECjbb2005, as
they are representative of what most customers would run in their environments.
SPECweb2005 workload: SPECweb2005 is an industry-standard web server workload that is comprised of three component
workloads. The support workload emulates a vendor support site that provides downloads such as driver updates and
documentation over HTTP. It is a highly intensive networking workload. The performance score of the workload is measured in
terms of the number of simultaneous user sessions that meet the quality of service requirements specified by the benchmark.
SPECjbb2005 workload: SPECjbb2005 is an industry-standard server-side Java benchmark. It is a highly memory-intensive workload
because of Java's usage of the heap and associated garbage collection. Due to these characteristics, when a virtual machine running
a SPECjbb2005 workload is subject to vMotion, one could expect to generate heavy vMotion network traffic. This is because during
vMotion the entire memory state of the virtual machine is transferred from the source ESX server to a destination ESX server
through a high-speed network. During the process of migration, if the memory state of the virtual machine is actively changing,
vMotion will need multiple iterations to transfer the active memory state that results in an increase in duration of vMotion and the
associated network traffic.
IOmeter workload: IOmeter was used to generate NFS traffic.
NetIOC Performance Test Scenarios
The impact of (or lack of) network resource management controls is evident only when the aggregate bandwidth requirements
of the competing traffic flows exceed the available interface bandwidth. The impact is more apparent when one of the competing
traffic flows is latency sensitive. Accordingly, we designed three different test scenarios with a mix of critical and noncritical traffic
flows, with the aggregate bandwidth requirements of all the traffic flows under consideration exceeding the capacity of the network
interface.
To evaluate and compare the performance and scalability of the virtualized environment with and without NetIOC controls, we used
the following scenarios:
Virtual machine and vMotion traffic flows contending on a vmnic
NFS, VMware FT, virtual machine, and vMotion traffic flows contending on a vmnic
Multiple vMotion traffic flows initiated from different vSphere hosts converging onto the same destination vSphere host

The goal was to determine if NetIOC provides good controls in achieving the QoS requirements in SPECweb2005 testing
environments that otherwise would not have been met in absence of NetIOC.
Test Scenario 1: Using Two Traffic Flows - Virtual Machine Traffic and vMotion Traffic
We chose latency-sensitive SPECweb2005 traffic and vMotion traffic flows in our first set of tests. The goal was to evaluate the
performance of a SPECweb2005 workload in a virtualized environment with and without NetIOC when latency-sensitive
SPECweb2005 traffic and vMotion traffic contended for the same physical network resources. As shown in Figure 9, our test-bed
was configured such that both the traffic flows used the same 10GbE physical network adapter. This was done by mapping the
virtual network adapters of the virtual machines (used for SPECweb2005 traffic) and the VMkernel interface (used for vMotion
traffic) to the same 10GbE physical network adapter. The complete experimental setup details for these tests are provided in
Appendix B.














Figure 9. Setup for the Test Scenario 1


At first, we measured the bandwidth requirements of the SPECweb2005 virtual machine traffic and vMotion traffic flows in isolation.
The bandwidth usage of the virtual machine traffic while running 17,000 SPECweb2005 user sessions was a little more than 7Gbps
during the steady-state interval of the benchmark. The peak network bandwidth usage of the vMotion traffic flow used in our tests
was measured to be more than 8Gbps. Thus, if both traffic flows used the same physical resources, the aggregate bandwidth
requirements would certainly exceed the 10GbE interface capacity. In the test scenario, during the steady-state period of the
SPECweb2005 benchmark, we initiated vMotion traffic flow, which resulted in both the vMotion traffic and the virtual machine
traffic flows contending on the same physical 10GbE link.
Figure 10 shows the performance of the SPECweb2005 workload in a virtualized environment without NetIOC. The graph plots the
number of SPECweb2005 user sessions that meet the QoS requirements (Time Good) at a given time. In this graph, the first dip
corresponds to the start of the steady-state interval of the SPECweb2005 benchmark when the statistics are cleared. The second dip
corresponds to the loss of QoS due to vMotion traffic competing for the same physical network resources.
Figure 10. SPECweb2005 Performance without NetIOC



We note that when we repeated the same test scenario several times, the loss of performance shown in the graph varied, possibly
due to the nondeterministic nature of vMotion traffic. Nevertheless, these tests clearly demonstrate that lack of any network
resource management controls results both in loss of performance and predictability that is required to guarantee SLAs required by
critical traffic flows.

Figure 11 shows the performance of a SPECweb2005 workload in a virtualized environment with NetIOC controls in place. We
configured the virtual machine traffic with twice the number of shares than those configured for vMotion traffic. In other words, we
ensured the virtual machine traffic had twice the priority over vMotion traffic when both the traffic flows competed for the same
physical network resources. Our tests revealed that although the duration of the vMotion was doubled due to the controls enforced
by NetIOC, as shown in Figure 11, the SPECweb2005 performance was unperturbed due to vMotion traffic.
Figure 11. SPECweb2005 Performance with NetIOC



Test Scenario 2: Using Four Traffic Flows - NFS Traffic, Virtual Machine Traffic, VMware FT Traffic and vMotion Traffic
In this test scenario, we chose a very realistic customer deployment scenario that featured fault-tolerant Web servers.
A recent VMware customer survey found that Web servers rank among the most popular applications used in conjunction with the
VMware FT feature. This is no coincidence, because fault-tolerant Web servers provide some compelling features that are not
available with typical Web server-farm deployment scenarios using load balancers that redirect user requests when a Web server
goes down. Such load-balancer-based solutions may not be the most customer-friendly for Web sites that provide very large
downloads, such as driver updates and documentation. As an example, consider a failure of a Web server while a user is
downloading a large user manual. In a load-balancer-based Web-farm deployment scenario, this will result in the user request
failing (or timing out), and the user would need to resubmit the request. On the other hand, in a VMware FT-enabled Web server
environment, the user will not experience such a failure, due to the presence of a secondary hypervisor that has full information on
pending I/O operations from the failed primary virtual machine, and commits all the pending I/O. Refer to VMware vSphere 4 Fault
Tolerance: Architecture and Performance for more information on VMware FT.
As shown in Figure 12, our test-bed was configured such that all the traffic flows used in the test mix contended for the same
network resources. The complete experimental setup details for these tests are provided in Appendix B.
Figure 12. Setup for the Test Scenario 2

Our test-bed configuration featured four virtual machines that included:

Two VMware FT-enabled Web server virtual machines serving SPECweb2005 benchmark requests (that generated virtual
machine traffic and VMware FT logging traffic)
One virtual machine (VM3) accessing an NFS store (that generated NFS traffic)
One virtual machine (VM4) running a SPECjbb2005 workload (used to generate vMotion traffic)

At first we measured the network bandwidth usage of all the four traffic flows in isolation. Table 2 describes the network bandwidth
usage.
Table 2. Network Bandwidth Usage of the Four Traffic Flows used in the Test Environment


The goal was to evaluate the latencies of critical traffic flows including VMware FT and NFS traffic in a virtualized environment with
and without NetIOC controls when four traffic flows contended for the same physical network resources. The test scenario had three
phases:
Phase 1: The SPECweb2005 workload in the two VMware FT-enabled virtual machines was in the steady state.
Phase 2: The NFS workload in VM3 became active. SPECweb2005 workload in the other two virtual machines continued to be active.
Phase 3: The VM4 running the SPECjbb2005 workload was subject to vMotion while the NFS and SPECweb2005 workloads remained
active in the other virtual machines.
The following figures depict the performance of different traffic flows in absence of NetIOC.
Let us first consider the performance of the VMware FT-enabled Web server virtual machines. The graph plots the number of
SPECweb2005 user sessions that meet the QoS requirements (Time Good) at a given time. In this graph, the first dip corresponds
to the start of the steady-state interval of the SPECweb2005 benchmark when the statistics are cleared. The second dip corresponds
to the loss of QoS due to multiple traffic flows competing for the same physical network resources. The number of SPECweb2005
user sessions that meet the QoS requirements dropped by about 67 percent during the period of contention. We note that the
SPECweb2005 performance degradation in the VMware FT environment was much more severe in the absence of NetIOC than what
we observed in the first test scenario. This is because in a VMware FT environment, the primary and secondary virtual machines run
in vLockstep, and so the network link between the primary and secondary ESX hosts plays a critical role in performance. During the
periods of heavy contention on the network link, the primary virtual machine will make little or no forward progress.
Figure 13. SPECweb2005 Performance in a VMware FT Environment without NetIOC


Figure 14. NFS Access Latency without NetIOC



Similarly, we noticed a significant jump in the NFS store access latencies. As shown in Figure 14, the maximum I/O latency reported
by the IOmeter increased from a mere 162 ms to 2166 ms (a factor of 13).



Figure 15. Network Bandwidth Usage of Traffic Flows in Different Phases without NetIOC



A detailed explanation of the bandwidth usage in each phase follows:
Phase 1: In this phase, the VMware FT-enabled VM1 and VM2 were active and the SPECweb2005 benchmark was in a steady-state
interval. The aggregate network bandwidth usage of the virtual machine traffic flow and the VMware FT logging traffic flows was less
than 4Gbps.
Phase 2: At the beginning of this phase, VM3 became active and added NFS traffic flow to the test mix. This resulted in three traffic
flows competing for the network resources. Even so there was no difference in the QoS, as the aggregate bandwidth usage was still
less than 5Gbps.
Phase 3: An addition of vMotion traffic flow to the test mix resulted in the aggregate bandwidth requirements of the four traffic
flows exceeding the capacity of the physical 10GbE link. Lack of any control mechanism to manage access to the 10GbE bandwidth
resulted in vSphere sharing the bandwidth among all the traffic flows. Critical traffic flows including VMware FT and NFS traffic flows
got the same treatment as the vMotion traffic flow, which resulted in a significant drop in performance.
The performance requirements of the different traffic flows must be considered to put network I/O resource controls in place. In
general, the bandwidth requirement of the VMware FT logging traffic is expected to be much smaller than the requirements of the
other traffic flows. However, given its impact on performance, we configured VMware FT logging traffic with the highest priority
over other traffic flows. We also ensured NFS traffic and virtual machine traffic flows had higher priority over vMotion traffic. Figure
16 shows shares assigned to the different traffic flows.

Figure 16. Share Allocation to Different Traffic Flows with NetIOC



Figure 17 shows the network bandwidth usage of the different traffic flows in different phases. As shown in the figure, thanks to the
network I/O resource controls, vSphere was able to enforce priority among the traffic flows, and so the bandwidth usage of the
critical traffic flows remained unperturbed during the period of contention.
Figure 17. Network Bandwidth Usage of Traffic Flows in Different Phases with NetIOC



The following figures show the performance of SPECweb2005 and NFS workloads in a VMware FT-enabled virtualized environment
with NetIOC in place. As shown in the figures, vSphere was able to ensure service level guarantees to both the workloads in all the
phases.
Figure 18. SPECweb2005 Performance in FT Environment with NetIOC

Figure 19. NFS Access Latency with NetIOC



The maximum I/O latency reported by the IOmeter remained unchanged at 162 ms in all the phases, and the SPECweb2005
performance remained unaffected by the network bandwidth usage spike caused by the vMotion traffic flow.
Test Scenario 3: Using Multiple vMotion Traffic Flows
In this final test scenario, we will show how NetIOC can be used in combination with Traffic Shaper to provide a comprehensive
network convergence solution in a virtualized datacenter environment.
While NetIOC enables you to limit vMotion traffic initiated from a vSphere host, it fails to prevent performance loss when multiple
vMotion traffic flows initiated on different vSphere hosts converge onto a single vSphere host and possibly overwhelm the latter.
We will show how a solution based on NetIOC and Traffic Shaper can prevent such an unlikely event.
In vSphere 4.0, support for traffic shaping was introduced, providing some rudimentary controls on network bandwidth usage. For
instance, it only provided bandwidth usage controls at the port level, and did not enforce prioritization among traffic flows. These
controls were provided for both egress and ingress traffic. In vSphere deployment, the egress and ingress traffic are with respect to a
vDS (or vSS). The traffic going into a vDS is ingress/input, and traffic leaving a vDS is egress/output. So, from the perspective of a
vNIC port (or vmknic port), the network traffic from the physical network (or vmnic) will ingress into the vDS and egress from vDS to
vNIC. Similarly, the traffic flow from vNIC will ingress into the vDS and egress to the physical network (or vmnic). In other words, the
ingress and egress need to be interpreted as follows:
Ingress traffic: traffic from a vNIC (or vmknic) to the vDS
Egress traffic: traffic from the vDS to the vNIC (or vmknic)
In this final test scenario, we added a third vSphere host to the same cluster that we used in our previous tests. As shown in Figure
20, the cluster used for this test comprised three vSphere hosts.

We initiated vMotion traffic (peak network bandwidth usage of 9Gbps) from vSphere Host 2, and vMotion traffic (peak network
bandwidth usage close to 1Gbps) from vSphere Host 3. Both of these traffic flows converged onto the same destination vSphere
host (Host 1). Below, we describe the results of the three test configurations.
Without NetIOC
As a point of reference, we first disabled NetIOC in our test configuration. Our tests indicated that, without any controls, the receive
link on Host 1 was fully saturated due to multiple vMotion traffic flows whose aggregate network bandwidth usage exceeded the
link capacity.
With NetIOC
As shown in Figure 21, we used NetIOC to enforce limits on vMotion traffic.
Figure 21. NetIOC Settings to Enforce Limits on vMotion Traffic Flow


Figure 22 shows the Rx network bandwidth usage on Host 1 (with NetIOC controls in place) as multiple vMotion traffic flows
converge on it.
Figure 22. Rx Network Bandwidth Usage on Host 1 with Multiple vMotions (with NetIOC On)


A detailed explanation of the bandwidth usage in each phase follows:
Phase 1: In this phase, vMotion from Host 3 to Host 1 was active. Due to the 1GbE link capacity on Host 3, the bandwidth usage of
the vMotion traffic flow was limited to 1Gbps.


Phase 2: At the beginning of this phase, vMotion from Host 2 to Host 1 became active, resulting in two active vMotion traffic flows
converging onto the same destination vSphere host. Thanks to the NetIOC controls, the vMotion traffic flow from Host 2 was
limited to 3Gbps. The aggregate network bandwidth usage of both the active vMotion flows was close to 4Gbps.
NOTE: If there had been more concurrent vMotions (even if such an event is very unlikely), NetIOC would have failed to prevent these
vMotions from saturating the receive link on Host 1.
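
The arithmetic behind this limitation is straightforward; the sketch below uses the peak numbers from the test description and models a per-host NetIOC limit (it is a simplification, since NetIOC limits are enforced per source host, not at the receiving link):

def rx_at_destination(source_peaks_gbps, per_host_limit_gbps=None, link_gbps=10):
    """Aggregate receive bandwidth at the destination host. A NetIOC limit
    caps each sender individually, so enough concurrent senders can still
    saturate the destination's receive link."""
    sent = [min(p, per_host_limit_gbps) if per_host_limit_gbps else p
            for p in source_peaks_gbps]
    return min(sum(sent), link_gbps)

peaks = [9, 1]   # vMotion peaks from Host 2 (10GbE) and Host 3 (1GbE)
print(rx_at_destination(peaks))                                # 10 - receive link saturated (no controls)
print(rx_at_destination(peaks, per_host_limit_gbps=3))         # 4  - per-host NetIOC limit of 3Gbps
print(rx_at_destination([9, 9, 9, 9], per_host_limit_gbps=3))  # 10 - enough senders still saturate the link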
With NetIOC and Traffic Shaper
With NetIOC controls in place, we also used Traffic Shaper to enforce limits on the egress traffic. NetIOC controls obviate the need
for traffic-shaping policies on ingress traffic. Accordingly, as shown in Figure 23, we used Traffic Shaper to enforce policies only on
egress traffic. Also note that each of the DV Port Groups can have its own traffic-shaping policy. In our example, we configured the
dvPortGroup-vMotion with the traffic-shaping policies shown in Figure 23.


Figure 24 shows the Rx network bandwidth usage on Host 1 (with both NetIOC and Traffic Shaper controls in place) as multiple
vMotion traffic flows converge on it.


A detailed explanation of the bandwidth usage in each phase follows:
Phase 1: In this phase, vMotion from Host 3 to Host 1 was active. Due to the 1GbE link capacity on Host 3, the bandwidth usage of
the vMotion traffic flow was limited to 1Gbps.
Phase 2: At the beginning of this phase, vMotion from Host 2 to Host 1 became active, resulting in two active vMotion traffic flows
converging onto the same destination vSphere host. With both NetIOC and Traffic Shaper controls in place, the aggregate bandwidth
usage on the receiver never exceeded 3Gbps.
These tests confirm that NetIOC in combination with Traffic Shaper can be a viable solution that provides effective controls on both
receive and transmit traffic flows in a virtualized datacenter environment.
NetIOC Best Practices
NetIOC is a very powerful feature that will make your vSphere deployment even more suitable for your I/O-consolidated datacenter.
However, follow these best practices to optimize the usage of this feature:
Best practice 1: When using bandwidth allocation, use shares instead of limits, as shares offer greater flexibility to redistribute unused
capacity (see the sketch following these best practices). Partitioning the available network bandwidth among different types of network
traffic flows using limits has shortcomings. For instance, allocating 2Gbps of bandwidth via a limit on the virtual machine resource pool
caps all virtual machine traffic at 2Gbps even if the team is not saturated. In other words, limits cap the bandwidth usage of a traffic
flow even when there is network bandwidth available.
Best practice 2: If you are concerned about physical switch and/or physical network capacity, consider imposing limits on a given
resource pool. For instance, you might want to put a limit on vMotion traffic flow to help in situations where multiple vMotion traffic
flows initiated on different ESX hosts at the same time could possibly oversubscribe the physical network. By limiting the vMotion
traffic bandwidth usage at the ESX host level, we can prevent the possibility of jeopardizing performance for other flows going
through the same points of contention.
Best practice 3: Fault tolerance is a latency-sensitive traffic flow, so when using custom shares, set the corresponding resource-pool
shares to a reasonably high relative value. If you are using the predefined default shares value for VMware FT, leaving it set to high is
recommended.
Best practice 4: We recommend that you use LBT as your vDS teaming policy while using NetIOC in order to maximize the
networking capacity utilization.
NOTE: As LBT moves flows among uplinks it may occasionally cause reordering of packets at the receiver.
Best practice 5: Use the DV Port Group and Traffic Shaper features offered by the vDS to maximum effect when configuring the vDS.
Configure each of the traffic flow types with a dedicated DV Port Group. Use DV Port Groups as a means to apply configuration
policies to different traffic flow types, and more important, to provide additional Rx bandwidth controls through the use of Traffic
Shaper. For instance, you might want to enable Traffic Shaper for the egress traffic on the DV Port Group used for vMotion. This can
help in situations when multiple vMotions initiated on different vSphere hosts converge to the same destination vSphere server.
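Relating to best practice 1, the minimal Python sketch below (illustrative only, with hypothetical numbers; it is not from the original
paper) contrasts a hard limit with share-based allocation on a 10Gbps uplink: a limit caps a flow even when the link is idle, while
shares only take effect under contention.

    # Illustrative only: hard limit vs. shares on a 10Gbps uplink (hypothetical numbers).
    LINK_GBPS = 10.0

    def with_limit(demand_gbps, limit_gbps=2.0):
        """A limit caps the flow even when the rest of the uplink is idle."""
        return min(demand_gbps, limit_gbps)

    def with_shares(demands, shares):
        """Shares matter only under contention; otherwise every flow gets its demand.
        (Simplified: a real allocator would also cap each flow at its demand and
        redistribute the remainder.)"""
        if sum(demands.values()) <= LINK_GBPS:
            return dict(demands)
        total_shares = sum(shares.values())
        return {flow: LINK_GBPS * shares[flow] / total_shares for flow in demands}

    print(with_limit(demand_gbps=6.0))              # -> 2.0 even on an otherwise idle link
    print(with_shares({"vm": 8.0, "vmotion": 6.0},
                      {"vm": 100, "vmotion": 50}))  # vm ~6.7Gbps, vmotion ~3.3Gbps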
Conclusions
Consolidating the legacy GbE networks in a virtualized datacenter environment with 10GbE offers many benefits: ease of
management, lower capital costs and better utilization of network resources. However, during the peak periods of contention, the
lack of control mechanisms to share the network I/O resources among the traffic flows can result in significant performance drop of
critical traffic flows. Such performance loss is unpredictable and uncontrollable if the access to the network I/O resources is
unmanaged. NetIOC provides a mechanism to manage the access to the network I/O resources when multiple traffic flows compete.
The experiments conducted in VMware performance labs using industry standard workloads show that:

Lack of NetIOC can result in unpredictable loss in performance of critical traffic flows during periods of contention.
NetIOC can effectively provide service level guarantees to the critical traffic flows. Our test results showed that NetIOC
eliminated a performance drop of as much as 67 percent observed in an unmanaged scenario.
NetIOC in combination with Traffic Shaper provides a comprehensive network convergence solution enabling features that
are not available with any of the hardware solutions on the market today.

13. Storage I/O Control Technical Overview and Considerations for Deployment

What's new in vSphere 5.0
vSphere Storage I/O Control now supports NFS
Set storage quality of service priorities per virtual machine for better access to storage resources for high-priority applications
Storage I/O Control (SIOC) provides storage I/O performance isolation for virtual machines, thus enabling VMware vSphere
(vSphere) administrators to comfortably run important workloads in a highly consolidated virtualized storage environment. It
protects all virtual machines from undue negative performance impact due to misbehaving I/O-heavy virtual machines, often known
as the noisy neighbor problem.
Furthermore, the service level of critical virtual machines can be protected by SIOC by giving them preferential I/O resource
allocation during periods of congestion. SIOC achieves these benefits by extending the constructs of shares and limits, used
extensively for CPU and memory, to manage the allocation of storage I/O resources.
SIOC improves upon the previous host-level I/O scheduler by detecting and responding to congestion occurring at the array, and
enforcing share-based allocation of I/O resources across all virtual machines and hosts accessing a datastore.
With SIOC, vSphere administrators can mitigate the performance loss of critical workloads due to high congestion and storage
latency during peak load periods. The use of SIOC will produce better and more predictable performance behavior for workloads
during periods of congestion. Benefits of leveraging SIOC:

Provides performance protection by enforcing proportional fairness of access to shared storage
Detects and manages bottlenecks at the array
Maximizes your storage investments by enabling higher levels of virtual-machine consolidation across your shared datastores

The purpose of this paper is to explain the basic mechanics of how SIOC, a new feature in vSphere 4.1, works and to discuss
considerations for deploying it in your VMware virtualized environments.

The Challenge of Shared Resources


Controlling the dynamic allocation of resources in distributed systems has been a long-standing challenge. Virtualized environments
introduce further challenges because of the inherent sharing of physical resources by many virtual machines. VMware has provided
ways to manage shared physical resources, such as CPU and memory, and to prioritize their use among all the virtual machines in
the environment. CPU and memory controls have worked well since memory and CPU resources are shared only at a local-host level,
for virtual machines residing within a single ESX server.
The task of regulating shared resources that span multiple ESX hosts, such as shared datastores, presents new challenges, because
these resources are accessed in a distributed manner by multiple ESX hosts. Previous disk shares did not address this challenge,
as the shares and limits were enforced only at a single ESX host level, and were only enforced in response to host-side HBA
bottlenecks, which occur rarely. This approach had the problem of potentially allowing lower-priority virtual machines greater access
to storage resources based on their placement across different ESX hosts, as well as neglecting to provide benefits in the case that
the datastore is congested but the host-side queue is not. An ideal I/O resource-management solution should provide the allocation
of I/O resources independent of the placement of virtual machines and with consideration of the priorities of all virtual machines
accessing the shared datastore. It should also be able to detect and control all instances of congestion happening at the shared
resource.
The Storage I/O Control Solution
SIOC solves the problem of managing shared storage resources across ESX hosts. It provides a fine-grained storage-control
mechanism by dynamically managing the size of, and access to, ESX host I/O queues based on assigned shares. SIOC enhances the
disk-shares capabilities of previous releases of VMware ESX Server by enforcing these disk shares not only at the local-host level but
also at the per-datastore level. Additionally, for the first time, vSphere with SIOC provides storage-device latency monitoring and
control, with which SIOC can throttle back storage workloads according to their priority in order to maintain total storage-device
latency below a certain threshold.

How Storage I/O Control Works


SIOC monitors the latency of I/Os to datastores at each ESX host sharing that device. When the average normalized datastore
latency exceeds a set threshold (30ms by default), the datastore is considered to be congested, and SIOC kicks in to distribute the
available storage resources to virtual machines in proportion to their shares. This is to ensure that low-priority workloads do not
monopolize or reduce I/O bandwidth for high-priority workloads. SIOC accomplishes this by throttling back the storage access of the
low-priority virtual machines by reducing the number of I/O queue slots available to them. Depending on the mix of virtual machines
running on each ESX server and the relative I/O shares they have, SIOC may need to reduce the number of device queue slots that
are available on a given ESX server.
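A highly simplified Python sketch of that behavior follows (illustrative only; the queue-depth bounds and the way SIOC estimates how
many array queue slots can be used are internal to ESX and are stand-ins here):

    # Simplified illustration of datastore-wide throttling. Not the actual ESX
    # algorithm: max_depth and usable_array_slots are stand-ins for internal SIOC logic.
    CONGESTION_THRESHOLD_MS = 30   # default congestion threshold
    MIN_QUEUE_DEPTH = 4            # lowest device-queue depth SIOC will throttle to

    def device_queue_depths(latency_ms, usable_array_slots, host_shares, max_depth=64):
        """host_shares: {host: sum of disk shares of the VMs on that host using the
        datastore}. Returns a per-host device-queue depth."""
        if latency_ms <= CONGESTION_THRESHOLD_MS:
            return {h: max_depth for h in host_shares}          # SIOC not engaged
        total = sum(host_shares.values())
        return {h: max(MIN_QUEUE_DEPTH,
                       min(max_depth, round(usable_array_slots * s / total)))
                for h, s in host_shares.items()}

    print(device_queue_depths(40, usable_array_slots=30,
                              host_shares={"esx-01": 2000, "esx-02": 500}))
    # -> {'esx-01': 24, 'esx-02': 6}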
Host-Level Versus Datastore-Level Disk Schedulers
It is important to understand the way queuing works in the VMware virtualized storage stack to have a clear understanding of how
SIOC functions. SIOC leverages the existing host device queue to control I/O prioritization. Prior to vSphere 4.1, the ESX server device
queues were static and virtual-machine storage access was controlled within the context of the storage traffic on a single ESX server
host. With vSphere 4.1, SIOC provides datastore-wide disk scheduling that responds to congestion at the array, not just on the host-
side HBA. This provides an ability to monitor and dynamically modify the size of the device queues of each ESX server based on
storage traffic and the priorities of all the virtual machines accessing the shared datastore.
An example of a local host-level disk scheduler is as follows:
Figure 1 shows the local scheduler influencing ESX host-level prioritization as two virtual machines are running on the same ESX
server with a single virtual disk on each.
Figure 1. I/O Shares for Two Virtual Machines on a Single ESX Server (Host-Level Disk Scheduler)


In the case in which I/O shares for the virtual disks (VMDKs) of each of those virtual machines are set to different values, it is the
local scheduler that prioritizes the I/O traffic, and only when the local HBA becomes congested.
This described host-level capability has existed for several years in ESX Server prior to vSphere 4.1. It is this local-host level disk
scheduler that also enforces the limits set for a given virtual-machine disk. If a limit is set for a given VMDK, the I/O will be controlled
by the local disk scheduler so as to not exceed the defined amount of I/O per second.
vSphere 4.1 has added two key capabilities: (1) the enforcement of I/O prioritization across all ESX servers that share a common
datastore, and (2) detection of array-side bottlenecks. These are accomplished by way of a datastore-wide distributed disk scheduler
that uses I/O shares per virtual machine to determine whether device queues need to be throttled back on a given ESX server to
allow a higher-priority workload to get better performance. The datastore-wide disk scheduler totals up the disk shares for all the
VMDKs that a virtual machine has on the given datastore. The scheduler then calculates what percentage of the shares the virtual
machine has compared to the total number of shares of all the virtual machines running on the datastore. This percentage of shares
is displayed in the list of details shown in the view of virtual machines tab for each datastore, as seen in Figure 2.

Figure 2. Datastore View of Disk Share Allocation Among Virtual Machines


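The percentage shown in that datastore view is a simple ratio; the sketch below (hypothetical share values, for illustration only) sums
each virtual machine's VMDK shares on the datastore and divides by the datastore-wide total:

    # Illustrative: per-VM percentage of datastore-wide disk shares, as displayed in
    # the Virtual Machines tab of a datastore. Share values are hypothetical.
    vmdk_shares = {
        "vm-a": [1000, 500],   # two VMDKs on this datastore
        "vm-b": [1000],
        "vm-c": [500],
    }

    per_vm = {vm: sum(shares) for vm, shares in vmdk_shares.items()}
    total = sum(per_vm.values())

    for vm, shares in per_vm.items():
        print(f"{vm}: {shares} shares -> {100 * shares / total:.0f}% of datastore shares")
    # vm-a: 50%, vm-b: 33%, vm-c: 17%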
As described before, SIOC engages only after a certain device-level latency is detected on the datastore. Once engaged, it begins to
assign fewer I/O queue slots to virtual machines with lower shares and more I/O queue slots to virtual machines with higher shares.
It throttles back the I/O for the lower-priority virtual machines, those with fewer shares, in exchange for the higher-priority virtual
machines getting more access to issue I/O traffic. However, it is important to understand that the maximum number of I/O queue
slots that can be used by the virtual machines on a given host cannot exceed the maximum device-queue depth for the device queue
of that ESX host. The ESX maximum queue depth varies by HBA model. The queue-depth maximum value is typically in the range of 32 to
128. The lowest that SIOC can reduce the device queue depth to is 4. Figure 3a shows that, without SIOC, a virtual machine with a
lower number of shares, VM C, may get a larger percentage of the available storage-array device-queue slots and thus greater
storage array performance, while a virtual machine with higher I/O shares, VM A, gets fewer than its fair share and reduced
storage array performance. However, with SIOC engaged on that datastore, as in Figure 3b, the result will be that the lower-priority
virtual machine that is by itself on a separate host will be assigned a reduced number of I/O queue slots. That will result in fewer
storage array queue slots being used and a reduction in average device latency. The reduction in average device latency provides VM
A and VM B higher storage performance, as now the same number of I/Os that they previously were issuing complete faster due to
the reduced latency for each of those I/Os.
For instance, assume that VM A was using 18 I/O slots as shown in Figure 3a. Without SIOC, the storage array latency could be
unbounded and the I/O workloads being performed by the lower priority VM C could cause a high storage device latency of, say,
40ms. In this example, VM A would have 18 I/Os @ 40ms worth of storage performance. Once enabled, SIOC controls the latency at
the configured congestion threshold, say 30ms. SIOC determines the number of storage array queue slots that can be used while still
maintaining an average device latency below the SIOC congestion threshold. Although SIOC does not directly manage the storage
array queue, it is able to indirectly control the storage array device queue by managing the ESX device queues that feed into it. As
shown in Figure 3b, SIOC has determined that 30 host-side storage queue slots can be used while still maintaining the desired
average device latency. SIOC then distributes those storage array queue slots to the various virtual machine workloads according to
their priorities. The net effect in this example is that VM C is throttled back to use only its correct relative share of the storage array.
VM A, entitled to 60 percent of the queue slots (1500/2500 = 60 percent), is still able to issue the same 18 I/Os but at a reduced
30ms latency. SIOC provides VM A greater storage performance by controlling VM C and ensuring that it uses only its appropriate
allocation of total storage resources and performance. By throttling the ESX device-queue depths in proportion to the priorities of
the virtual machines that are using them, SIOC is able to control storage congestion at the storage array and distribute storage array
performance appropriately.
Figure 3. SIOC Device-Queue Management with Prioritized Disk Shares


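The arithmetic of this example is easy to verify with the figures quoted in the text (the split of the remaining 1,000 shares between
VM B and VM C is not stated, so only VM A's entitlement is computed):

    # Reproduce the VM A entitlement arithmetic from the example above.
    total_shares = 2500   # all disk shares on the datastore
    vm_a_shares  = 1500   # VM A's shares
    usable_slots = 30     # host-side slots SIOC allows while keeping latency <= 30ms

    vm_a_fraction = vm_a_shares / total_shares            # 0.6 -> 60 percent
    vm_a_slots    = round(usable_slots * vm_a_fraction)   # 18 queue slots

    print(f"VM A: {vm_a_fraction:.0%} of {usable_slots} slots = {vm_a_slots} slots")
    # VM A keeps issuing its 18 outstanding I/Os, now at ~30ms rather than ~40ms each.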
SIOC provides isolation and prioritized distribution of storage resources even when vSphere administrators have not manually set
individual disk-share priorities on each VMDK per virtual machine. SIOC protects virtual machines that are running on more highly
consolidated ESX servers. In Figures 4a and 4b, all virtual machine disks have default (1000 shares), or equal, disk shares. Without
SIOC, VM A and VM B are penalized and not provided equal access to storage resources simply because they are running together on
the same ESX server and sharing the same ESX device queue, whereas VM C, running on a less heavily consolidated ESX host, is given
unfair preference to storage resources. Even administrators who do not wish to individually set VMDK disk shares can benefit from
this feature. SIOC provides these vSphere administrators the ability to enable storage isolation for all virtual machines accessing a
datastore by simply checking a single check box at the datastore level. This new storage management capability offered by SIOC
allows vSphere administrators the ability to run higher consolidated virtual environments by preventing imbalances of storage
resource allocation during times of storage contention.
Figure 4. SIOC Device-Queue Management with Equal Disk Shares


In these examples, SIOC is able to fully manage the storage array queue by throttling the ESX host device queues. This is possible
because all the workloads impacting the storage array queue are coming from the ESX hosts and are under SIOC's control. However,

SIOC is able to provide storage workload isolation/prioritization even in scenarios in which external workloads, not under SIOC's
control, are competing with those that it controls. In this scenario, SIOC will first automatically detect this situation, and then will
increase the number of device-queue slots it makes available to the virtual machine workloads so that they can compete more fairly
for total storage resources against external workloads. Using this approach, SIOC is able to maintain a balance between workload
isolation/prioritization and storage I/O throughput even when it cannot directly control or influence the external workload. This
behavior continues as long as the external workload persists and SIOC resumes normal operation once it stops detecting the
external workload.
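That reaction can be summarized in a short sketch (illustrative only; how SIOC actually detects an external workload is internal to
ESX and is reduced to a flag here):

    # Sketch of the described reaction to an external (non-ESX) workload on the
    # same array. Detection itself is internal to ESX and is just a boolean here.
    def pick_queue_depth(external_workload_detected, throttled_depth, full_depth):
        if external_workload_detected:
            # Open the device queues back up so SIOC-managed virtual machines can
            # compete fairly with the external workload for the shared array.
            return full_depth
        # External workload gone: resume normal share-based throttling.
        return throttled_depth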
Enabling Storage I/O Control
Since SIOC is an attribute of a datastore, it is set under the properties of a specific datastore. By default SIOC is not enabled on the
datastore. The default latency value at which SIOC kicks in is 30ms, but this value can be modified by selecting the Advanced option
where SIOC is enabled in the vCenter interface, as shown in Figure 5.

Figure 5. Datastore Properties SIOC Enablement and Congestion Threshold Setting


SIOC can be used on any FC, iSCSI, or locally attached block storage device that is supported with vSphere 4.1. Review the vSphere
4.1 Hardware Compatibility List (http://www.vmware.com/go/hcl) for the entire list of supported storage devices. SIOC is supported
with FC and iSCSI storage devices that have automated tiered storage capabilities. However, when using SIOC with automated tiered
storage, the SIOC Congestion Threshold must be set appropriately to make sure the storage device's automated tiered storage
capabilities are not impacted by SIOC.
At this time, SIOC is not supported with NFS storage devices or with Raw Device Mapping (RDM) virtual disks. SIOC is also not
supported with datastores that have multiple extents or are being managed by multiple vCenter Management Servers.
For complete step-by-step instructions on how to enable SIOC, or change the default latency threshold for a datastore or other
limitations, consult the documentation or see Managing Storage I/O Resources (Chapter 4) in the vSphere 4.1 Resource
Management Guide (http://www.vmware.com/pdf/vsphere4/r41/vsp_41_resource_mgmt.pdf)
Consideration for Deploying Storage I/O Control
Configuring Disk Shares
Disk shares specify the relative priority a virtual machine has on a given storage resource. When you assign disk shares to a virtual
disk/virtual machine, you specify the priority for that virtual machine's access to storage resources relative to other powered-on
virtual machines. Disk shares in vSphere 4.1 can be leveraged at both a local, per-ESX-host level, and now at a datastore level when
SIOC is enabled and actively prioritizing storage resources. Disk shares are set by selecting Edit Settings for a virtual machine and
are set on each VMDK, as seen in Figure 6. When SIOC is not enabled, disk shares and the relative priority they specify are enforced
only at a local ESX host level, and then only when local HBAs are saturated. Virtual machines running on the same ESX hosts will be
prioritized relative to other virtual machines on the same host but not relative to virtual machines running on other ESX hosts. When

SIOC is enabled and actively throttling the ESX hosts to control storage latencies, disk shares and relative priorities are enforced
across all the ESX servers that access the SIOC-controlled datastore. So a virtual machine running on one ESX host will have access to
storage resources based on the number of disk shares the virtual machine has compared to the total number of disk shares in use on
the datastore by all virtual machines across all ESX hosts in the shared storage environment. If a virtual machine does not fully use its
allocation of I/O access, the extra I/O slots are redistributed proportionally to the other virtual machines that are actively issuing I/O
requests on the datastore.
Figure 6. Virtual Machine Properties Disk Shares and IOP Limits


As part of vSphere 4.1, I/O per second (IOPS) limits on a per-VMDK level can be set to further manage and prioritize virtual machine
workloads. Limits (expressed in terms of IOPS) are implemented at the local-disk scheduler level and are always enforced regardless
of whether or not SIOC is enabled.
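To illustrate the difference in behavior (hypothetical values; this is not the actual scheduler implementation), the sketch below shows
an IOPS limit always capping a VMDK, while share entitlements redistribute unused I/O slots to virtual machines that are actively
issuing I/O:

    # Illustrative only: a per-VMDK IOPS limit is always enforced, whereas unused
    # share entitlement is redistributed to VMs that still have I/O to issue.
    def effective_iops(demand_iops, iops_limit=None):
        """An IOPS limit caps a VMDK regardless of congestion or SIOC state."""
        return demand_iops if iops_limit is None else min(demand_iops, iops_limit)

    def redistribute_slots(total_slots, shares, demand_slots):
        """Give each VM min(entitlement, demand), then hand leftover slots, by shares,
        to VMs that still want more. A single redistribution pass is shown for brevity."""
        total_shares = sum(shares.values())
        alloc = {vm: min(demand_slots[vm], total_slots * shares[vm] // total_shares)
                 for vm in shares}
        leftover = total_slots - sum(alloc.values())
        hungry = {vm: s for vm, s in shares.items() if alloc[vm] < demand_slots[vm]}
        for vm, s in hungry.items():
            alloc[vm] += leftover * s // sum(hungry.values())
        return alloc

    print(effective_iops(demand_iops=5000, iops_limit=2000))   # -> 2000, always capped
    print(redistribute_slots(30, {"vm-a": 1500, "vm-b": 500, "vm-c": 500},
                             {"vm-a": 25, "vm-b": 3, "vm-c": 10}))
    # -> {'vm-a': 20, 'vm-b': 3, 'vm-c': 6}: vm-b's unused entitlement goes mostly to vm-a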
Configuring the Storage I/O Control Congestion Latency Value
SIOC is designed to only engage and enforce storage I/O shares when the storage resource becomes contended. This is very similar
to CPU scheduling, in that it is only enforced when the resource is contended. To determine when a storage device is contended,
SIOC uses a congestion-threshold latency value that vSphere administrators can specify. The default congestion-threshold latency,
30ms, in vSphere 4.1, is a conservative value that should work well for most users. The SIOC congestion-threshold value is
configurable, so vSphere administrators have the opportunity to maximize the benefits of SIOC suited to their own virtual
environment and storage-management preferences. This section discusses the considerations and recommendations for changing
this key parameter.
The SIOC threshold represents a balance between (1) isolation and prioritized access to the storage resource at lower latencies, and
(2) higher throughput. When the SIOC congestion threshold is set low, SIOC can begin prioritizing storage access earlier and throttle
storage workloads more aggressively in order to maintain a datastore-wide latency below the congestion latency threshold. The
more aggressive throttling needed to maintain a lower latency might reduce the overall storage throughput. When the congestion
threshold is set higher, SIOC will not engage and begin prioritizing resources among virtual machines until the higher latency is
reached. When using a higher SIOC congestion latency, SIOC does not need to throttle storage workloads as much in order to
maintain the storage latency below the higher congestion threshold. This may allow for higher overall storage throughput.
The default congestion threshold has been set to minimize the impact of throttling on storage throughput while still providing
reasonably low storage latency and isolation for high-priority virtual machines. In most cases it is not necessary to modify the
storage congestion threshold from its default value. However, a user may decide to modify the value depending on the type and
speed of their storage device, the characteristics of the workloads in their virtual environment, and their storage-management
preference between workload isolation/prioritization and workload throughput. Because various storage devices have different
latency characteristics, users may need to modify the congestion threshold depending on their storage type. See Table 1 to
determine the recommended range of values for your storage-device type.


Table 1. SIOC Congestion Threshold Recommendations


The congestion threshold may also need to be adjusted when using automated tiered storage devices. These are systems that
contain two or more types of storage media and automatically and transparently migrate data between the storage types in order to
optimize I/O performance. These systems typically try to keep the most frequently accessed or hot data on faster storage such as
SSD, and less frequently accessed or cold data on slower media such as SAS or FC disks. This means that the type of storage media
backing a particular LUN can change over time.
For full LUN auto-tiering storage devices, in which the entire LUN is migrated between different storage tiers, use the recommended
value or range for the slowest tier of storage in the device. For example, in a full LUN auto-tiering storage device that contains SSD
and Fibre Channel disks, use the congestion threshold value that is recommended for Fibre Channel.
With sub-LUN or block-level auto-tiering storage, in which individual storage blocks inside a LUN are migrated between storage tiers,
combine the recommended congestion threshold values/ranges for each storage type in the auto-tiering storage devices. For
example, in a sub-LUN / block-level auto-tiering storage device that contains an SSD storage tier and a Fibre Channel storage tier,
use an SIOC congestion threshold value in the range of 10-30ms. The exact SIOC congestion-threshold value to use is based on your
individual storage-device characteristics and your preference of isolation (using a smaller SIOC congestion-threshold value) or
throughput
(using a larger SIOC congestion-threshold value). For example, in the SSD-FC scenario, the more SSD storage you have in the array,
the more your storage device characteristics will match that of the SSD storage type and thus the closer your threshold should be to
the SSD recommended value of 10ms, the low end of the combined SSD-FC range. Customers can use the midpoint of the range as a
conservative congestion threshold value that provides a balance between the preference for isolation and the preference for
throughput. In the SSD-FC example in which there was a range of 10-30ms, the conservative congestion threshold value would be
20ms.
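That combining rule reduces to a tiny helper (only the SSD 10ms and FC 30ms values mentioned above are used here; consult Table 1
and your storage vendor for other tiers):

    # Combine per-tier congestion-threshold recommendations for a sub-LUN /
    # block-level auto-tiering device, following the rule described above.
    def combined_threshold_range(tier_thresholds_ms):
        low, high = min(tier_thresholds_ms), max(tier_thresholds_ms)
        midpoint = (low + high) / 2   # conservative balance of isolation vs. throughput
        return low, high, midpoint

    low, high, mid = combined_threshold_range([10, 30])   # SSD (10ms) + FC (30ms) tiers
    print(f"range {low:.0f}-{high:.0f}ms, conservative midpoint {mid:.0f}ms")
    # -> range 10-30ms, conservative midpoint 20ms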
When modifying the SIOC congestion threshold, keep in mind that the SIOC latency is a normalized latency metric calculated and
normalized for I/O size and aggregate number of IOPS across all the storage workloads accessing the datastore. SIOC uses a
normalized latency to take into consideration that not all storage workloads are the same. Some storage workloads may issue larger
I/O operations that would naturally result in longer device latencies to service these larger I/O requests. Normalizing the storage-
workload latencies allows SIOC to compare and prioritize workloads more accurately by bringing them all into a common
measurement. Because the SIOC value is normalized, the actual observed latency as seen from the guest OS inside the virtual
machine or from an individual ESX host may be different than the calculated SIOC-normalized latency per datastore.
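As a purely illustrative sketch of the normalization idea (this is not the actual SIOC formula, which is internal to ESX), one can weight
each host's observed latency by its IOPS and discount the latency attributable to large I/O sizes:

    # Purely illustrative: an IOPS-weighted, I/O-size-discounted latency aggregate.
    # NOT the real SIOC formula; it only shows why raw latencies are not comparable.
    REFERENCE_IO_KB = 8   # hypothetical reference I/O size

    def normalized_latency(samples):
        """samples: list of (iops, avg_latency_ms, avg_io_size_kb), one per host."""
        weighted = sum(iops * lat * (REFERENCE_IO_KB / max(size, REFERENCE_IO_KB))
                       for iops, lat, size in samples)
        total_iops = sum(iops for iops, _, _ in samples)
        return weighted / total_iops

    # A host issuing large 64KB I/Os contributes less "congestion" per millisecond of
    # latency than a host issuing small 8KB I/Os at the same rate:
    print(round(normalized_latency([(1000, 40, 64), (1000, 20, 8)]), 1))   # -> 12.5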
Monitoring Storage I/O Control Effects
SIOC includes new metrics inside vCenter to allow users to observe SIOC's actions and latency measurements. There are two new
SIOC metrics in vCenter: SIOC normalized latency and SIOC Aggregated IOPS. The SIOC normalized latency is the value that SIOC
calculates per datastore and uses when comparing with the SIOC congestion latency threshold to determine what actions to take, if
any. SIOC calculates these metrics every four seconds and they are refreshed in the vCenter display every 20 seconds. These metrics
can be viewed on the datastore performance screen inside vCenter, as seen in Figure 7. Additionally, vCenter reports the
device-queue depths for each ESX host. The ESX hosts' device-queue depth metrics can be reviewed to determine what actions SIOC is
taking on individual ESX hosts and their device queues in order to maintain a datastore-wide SIOC latency on the datastore under
the set congestion threshold.
Figure 7. vCenter Datastore Performance and SIOC Metrics


SIOC detects the moment when external workloads, not under SIOC's control, may be impacting the virtual environment's storage
resources. When SIOC detects an external workload, it will trigger a "Non-VI workload detected" informational alert in vCenter. In
most cases, this alert is purely informational and requires no action on the part of the vSphere administrator. However, the alert
may be an indicator of an incorrectly configured SIOC environment. vSphere administrators should verify that they are running a
supported SIOC configuration and that all datastores that utilize the same disk spindles have SIOC enabled with identical SIOC
congestion-threshold values. The alert might also be triggered by some backup products and other administrative workloads that
bypass the ESX host and directly access the datastore in order to accomplish their tasks. SIOC is supported in these configurations
and the alert can be safely ignored for these products. Refer to VMware KB article 1020651 for more details on the "Non-VI
workload detected" alert.
Benefits of using Storage I/O Control
SIOC enables improved I/O resource management for a multitude of conditions and provides peace of mind when running business-
critical I/O intensive applications in a shared VMware virtualization environment.
Provides performance protection
A common concern in any shared resource environment is that one consumer may get far more than its fair share of that
resource and adversely impact the performance of the other users that share the resource. SIOC provides the ability, at the
datastore level, to support multiple-tenant environments that share a datastore, by enabling service-level protections during periods
of congestion. SIOC prevents a single virtual machine from monopolizing the I/O throughput of a datastore even when the virtual
machines have default (equal value) I/O shares set.
Detects and manages bottlenecks at the array only when congestion exists
SIOC detects a bottleneck at the datastore level, and manages I/O queue slot distribution across the ESX servers that share a
datastore. SIOC expands the I/O resource control beyond the bounds of a single ESX server to work across all ESX servers that share
a datastore.
When SIOC is enabled on a datastore and no congestion exists at the device level, it will not be engaged in managing I/O resources
and will have no effect on I/O latency or throughput. In an optimized and well-configured environment, SIOC may only engage at

certain peak periods during the day. During these times of congestion and in the presence of external or non-SIOC-controlled
workloads, SIOC strikes a balance between aggregate throughput and enforcement of virtual machine I/O shares.
SIOC helps vSphere administrators understand when more I/O throughput (device capacity) is needed. If SIOC is engaged for
significant periods of time during the day, it raises the question of whether a change in the storage configuration is needed. In this
case, an administrator might consider either adding more I/O capacity or using VMware Storage vMotion to migrate I/O intensive
virtual machines to an alternate datastore.
Enables higher levels of consolidation with less storage expense
SIOC enables vSphere administrators to maximize their storage investments by running more virtual machines on their existing
storage infrastructure with confidence that periodic peak periods of high I/O activity will be controlled. Without SIOC, administrators
will often overprovision their storage to avoid latency issues that pop up during peak periods of storage activity. With SIOC, the
administrators can now comfortably run more virtual machines on a single datastore with confidence that the storage I/O will be
controlled and managed at the device level.
Leveraging SIOC can reduce storage costs because the cost of overprovisioning a storage environment, to the point that no
contention occurs, could be prohibitively expensive. Alternately, the cost of storage may drop dramatically by leveraging SIOC to
manage the I/O queue slot allocations to ensure proportional fairness and prioritization of virtual machines based on their I/O
shares.
Conclusion
SIOC offers I/O prioritization to virtual machines accessing shared storage resources. It allows vSphere administrators to give high-
priority virtual machines better, lower-latency storage performance compared with lower-priority virtual machines. It monitors
datastore latency and engages when a preset congestion threshold has been exceeded. SIOC gives
vSphere administrators a new means to manage their VMware virtualized environments by allowing quality of service to be
expressed for storage workloads. As such, SIOC is a big step forward in the journey toward automated, policy-based management of
shared storage resources.
SIOC provides the means to better control a consolidated shared-storage resource by providing datastore-wide I/O prioritization,
helping to manage traffic on a shared and congested datastore. With the introduction of SIOC in vSphere 4.1, vSphere administrators
now have a new tool available to help them increase consolidation density with peace of mind,
knowing that during periods of peak I/O activity, prioritization and proportional fairness will be enforced across all
the virtual machines accessing that shared resource.
