
2009 VMware Inc.

All rights reserved


ESX Performance Troubleshooting
VMware Technical Support
Broomfield, Colorado
Confidential
What is slow performance?
What does slow performance mean?
Application responds slowly - latency
Application takes longer to do a job - throughput
Interpretation varies wildly
Slower than expectation
Throughput is low
Latency is high
Throughput, latency fine but uses excessive resources (efficiency)
What are high latency, low throughput, and excessive
resource usage?
These are subjective and relative
Both are related to time
Bandwidth, Throughput, Goodput, Latency
Bandwidth vs. Throughput
Higher Bandwidth does not guarantee Throughput.
Low Bandwidth is a bottleneck for higher Throughput
Throughput vs. Goodput
Higher Throughput does not mean higher Goodput
Low Throughput is indicative of lower Goodput
Efficiency = Goodput/Bandwidth
Throughput vs. Latency
Low Latency does not guarantee higher Throughput and vice versa
Throughput or Latency alone can dominate performance
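The efficiency relation above can be worked through with a little arithmetic. This is a minimal sketch; the function name and the sample figures are illustrative assumptions, not measurements.

```python
def efficiency(goodput: float, bandwidth: float) -> float:
    """Efficiency = Goodput / Bandwidth, as defined above (same units for both)."""
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    return goodput / bandwidth


# Hypothetical link: 1000 Mbit/s of bandwidth, 800 Mbit/s actually carried
# (throughput), of which 200 Mbit/s is retransmissions and protocol overhead.
bandwidth = 1000.0
throughput = 800.0
goodput = throughput - 200.0

print(f"throughput/bandwidth = {throughput / bandwidth:.2f}")
print(f"efficiency           = {efficiency(goodput, bandwidth):.2f}")
```

The gap between the two printed ratios is the point of the slide: throughput can look healthy while the useful share of the bandwidth is much lower.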
Bandwidth, Throughput, Goodput, Latency
[Figure: diagram relating Latency, Bandwidth, Goodput, and Throughput]
How to measure performance?
Higher throughput does not necessarily mean higher performance - Goodput could be low
Throughput is easy to measure, but Goodput is not

How do we measure performance?
Performance is actually never measured
We could only quantify different metrics that affect performance.
These metrics describe the state of: CPU, memory, disk and
network
Performance Metrics
CPU
Throughput: MIPS (%used), Goodput: useful instructions
Latency: Instruction Latency (cache latency, cache miss)
Memory
Throughput: MB/Sec, Goodput: useful data
Latency: nanosecs
Storage
Throughput: MB/Sec, IOPS, Goodput: useful data
Latency: Seek time
Networking
Throughput: MB/Sec, IO/Sec, Goodput: useful traffic
Latency: microseconds
Hardware and Performance
CPU
Processor Architecture: Intel XEON, AMD Opteron
Processor cache L1, L2, L3, TLB
Hyperthreading
NUMA
Hardware and Performance
Processor Architecture
Clock speeds of one architecture are not comparable with those of another
P-III outperforms P4 on a clock by clock basis
Opteron outperforms P4 on a clock by clock basis
Higher clock speeds are not always beneficial
Bigger cache or better architecture may outperform higher clock speeds
Processor memory communication is often the performance
bottleneck
Processor wastes 100s of instruction cycles while waiting on memory
access
Caching alleviates this issue

Hardware and Performance
Processor Cache
Cache reduces memory access latency
Bigger cache increases cache hit probability
Why not build bigger cache ?
Expensive
Cache access latency increases with cache size
Cache is built in stages (L1, L2, L3) with varying cache access latency
ESX benefits from larger cache sizes
L3 cache seems to boost performance of networking workloads
Hardware and Performance
TLB - Translation Lookaside Buffer
Every running process needs virtual address (VA) to physical
address (PA) translation
Historically this translation was done entirely from tables held in memory
Since memory access is significantly slower and a process needs this translation on every memory access, the TLB was introduced
TLB is hardware circuitry that caches VA to PA mappings
When a VA is not available in the TLB, a TLB miss occurs and the mapping has to be fetched from the page tables in memory and loaded into the TLB (load latency)
Performance of application depends on effective use of TLB
TLB is flushed during context switch
Hardware and performance
Hyperthreading
Introduced with Pentium 4 and Xeon processors
Allows simultaneous execution of two threads on a single processor
HT maintains separate architectural states for the same processor
but shares underlying processor resources like execution unit,
cache etc
HT strives to improve throughput by taking advantage of processor
stalls on the logical processor
HT performance could be worse than UniProcessor (non-HT) performance if the threads have a higher cache hit rate (more than 50%)
Hardware and Performance
Multicores
Cores have their own L1 Cache
L2 Cache is shared between processors
Cache coherency is relatively faster compared to SMP systems
Performance scaling is same as SMP systems
Hardware and performance
NUMA
Memory contention increases as the number of processors increases
NUMA alleviates memory contention by localizing memory per
processor
Hardware and Performance - Memory
Node Interleaving
Opteron processors support two types of memory access: NUMA mode and Node Interleaving mode
Node interleaving mode alternates memory pages between
processor nodes so that the memory latencies are made uniform.
This can offer performance improvements to systems that are not
NUMA aware
Single core Opteron systems have only one core per NUMA node.
An SMP VM on ESX running on a single core Opteron system will have to access memory across the NUMA boundary, so SMP VMs may benefit from Node Interleaving
On dual core Opteron systems a single NUMA node will have two
cores. So NUMA mode could be turned on.
Hardware and Performance - I/O Devices
I/O Devices
PCI-E, PCI-X, PCI
PCI at 66 MHz - 533 MB/s
PCI-X at 133 MHz - 1066 MB/s
PCI-X at 266 MHz - 2133 MB/s
PCI-E bandwidth depends on the number of lanes: each lane adds 250 MB/s, so x16 lanes give 4 GB/s.
PCI bus saturation - dual port, quad port devices
In PCI protocol the bus bandwidth is shared by all the devices in the bus.
Only one device could communicate at a time.
PCI-E allows parallel full duplex transmission with the use of Lanes
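The bus figures above follow from simple arithmetic. The sketch below is illustrative only: the 64-bit bus width and the nominal clocks of 66.67, 133.33 and 266.67 MHz are assumptions (the slide gives only rounded clock rates), and the function names are hypothetical.

```python
def parallel_bus_mb_per_s(width_bits: int, clock_mhz: float) -> float:
    """Peak rate of a shared parallel bus: bytes per transfer x million transfers/s."""
    return (width_bits / 8) * clock_mhz


def pcie_mb_per_s(lanes: int, per_lane_mb_per_s: float = 250.0) -> float:
    """Peak rate in one direction; 250 MB/s per lane is the figure quoted above."""
    return lanes * per_lane_mb_per_s


# Assumption: 64-bit wide PCI / PCI-X bus at nominal clock rates.
for name, clock_mhz in [("PCI 66", 66.67), ("PCI-X 133", 133.33), ("PCI-X 266", 266.67)]:
    print(f"{name}: {int(parallel_bus_mb_per_s(64, clock_mhz))} MB/s (shared by all devices)")

print(f"PCI-E x16: {int(pcie_mb_per_s(16))} MB/s")
```

Because the parallel bus figure is shared by every device on the bus, a dual or quad port adapter can saturate it well before the individual ports reach line rate.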
Hardware and Performance - I/O Devices
SCSI
Ultra3/Ultra160 SCSI - 160 MB/s
Ultra320 SCSI - 320 MB/s
SAS 3 Gbps - 300 MB/s duplex
FC
Speed constrained by Medium, Laser wavelength
Link Speeds: 1G FC - 200 MB/s, 2G - 400 MB/s, 4G - 800 MB/s, 8G - 1600 MB/s
ESX Architecture - Performance Perspective
ESX Architecture Performance Perspective
CPU Virtualization - Virtual Machine Monitor
ESX doesn't trap and emulate every instruction; the x86 architecture does not allow this
System calls and Faults are trapped by the monitor
Guest code runs in one of three contexts
Direct execution
Monitor code (fault handling)
Binary Translation (BT - non virtualizable instructions)
BT behaves much like JIT
Previously translated code fragments are stored in the translation cache and reused - saves translation overhead
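The reuse of translated fragments can be pictured with a toy cache. This is only a sketch of the caching idea: the instruction names, the rewrite rule, and keying the cache on block contents are all illustrative assumptions and say nothing about how the monitor is actually implemented.

```python
from typing import Callable, Dict, Tuple

Block = Tuple[str, ...]

# Hypothetical set of instructions that cannot be executed directly.
NON_VIRTUALIZABLE = {"cli", "sti", "popf"}


def toy_translate(block: Block) -> Block:
    """Rewrite non-virtualizable instructions into calls to (imaginary) monitor helpers."""
    return tuple("call_monitor_" + insn if insn in NON_VIRTUALIZABLE else insn
                 for insn in block)


class TranslationCache:
    """Translate a block once, then reuse the stored result on later executions."""

    def __init__(self, translate: Callable[[Block], Block]) -> None:
        self._translate = translate
        self._cache: Dict[Block, Block] = {}
        self.hits = 0
        self.misses = 0

    def lookup(self, block: Block) -> Block:
        translated = self._cache.get(block)
        if translated is None:
            self.misses += 1          # pay the translation overhead once
            translated = self._translate(block)
            self._cache[block] = translated
        else:
            self.hits += 1            # reuse: no translation overhead
        return translated


cache = TranslationCache(toy_translate)
hot_block = ("mov", "cli", "add", "sti")
for _ in range(1000):                 # a loop in the guest re-executes the same block
    cache.lookup(hot_block)
print(f"misses={cache.misses} hits={cache.hits}")
```

Code that loops over the same fragments amortizes the translation cost; code that constantly touches new fragments does not, which is one reason monitor overhead varies by workload.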
ESX Architecture Performance Implications
Virtual Machine Monitor Performance implications
Programs that don't fault or invoke system calls run at near native speeds, e.g. gzip
Micro-benchmarks that do nothing but invoke system calls will incur
nothing but monitor overhead
Translation overhead varies with different Privileged instructions.
Translation cache tries to offset some of the overhead.
Applications will have varying amount of monitor overhead
depending on their call stack profile.
Call stack profile of an application can vary depending on its
workload, errors and other factors.
It is hard to generalize monitor overheads for any workload. Monitor overheads measured for an application are strictly applicable only to identical test conditions.
ESX Architecture Performance Perspective
Memory virtualization
Modern OSes set up page tables for each running process. x86
paging hardware (TLB) caches VA - PA mappings
Page table shadowing - additional level of indirection
VMM maintains PA - MA mappings in a shadow table
Allows the guest to use x86 paging hardware with the shadow table
MMU updates
VMM write protects guest page tables (trace)
When the guest updates a page table, the monitor kicks in (page fault) and keeps the shadow page table consistent with the physical page table
Hidden page faults
Trace faults are hidden to the guest OS - monitor overhead.
Hidden page faults are similar to TLB misses on native environments
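The extra level of indirection can be sketched with plain dictionaries. This is a simplified model under stated assumptions: page tables are flat maps, addresses are page-aligned integers, and the class and method names are hypothetical.

```python
from typing import Dict


class ShadowedAddressSpace:
    """Toy model: guest table VA->PA, monitor table PA->MA, shadow table VA->MA."""

    def __init__(self, guest_pt: Dict[int, int], pa_to_ma: Dict[int, int]) -> None:
        self.guest_pt = dict(guest_pt)      # maintained by the guest OS
        self.pa_to_ma = dict(pa_to_ma)      # maintained by the monitor
        self.hidden_faults = 0
        # The shadow table is what the paging hardware would actually walk.
        self.shadow = {va: self.pa_to_ma[pa] for va, pa in self.guest_pt.items()}

    def guest_write_pte(self, va: int, pa: int) -> None:
        """Guest updates its page table; the trace fires and the monitor resyncs."""
        self.hidden_faults += 1             # invisible to the guest: monitor overhead
        self.guest_pt[va] = pa
        self.shadow[va] = self.pa_to_ma[pa]

    def translate(self, va: int) -> int:
        """Hardware translation uses only the shadow table."""
        return self.shadow[va]


space = ShadowedAddressSpace({0x1000: 0x5000}, {0x5000: 0x9000, 0x6000: 0xA000})
space.guest_write_pte(0x2000, 0x6000)       # e.g. the guest maps a new page
print(hex(space.translate(0x2000)), "hidden faults:", space.hidden_faults)
```

Every guest page table update costs one hidden fault in this model, which is why workloads with frequent MMU updates see more monitor overhead.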
ESX Architecture Performance Perspective
Page table shadowing
ESX Architecture Performance Implications
Context Switches
On Native hardware TLB is flushed during a context switch. Newly
switched process will incur TLB miss on first memory access.
VMM caches Page Table Entries (PTE) during context switches
(caching MMU). We try to keep the Shadow PTE consistent with the
Physical PTE
If there are lots of processes running in the guest, and they context
switch frequently, VMM may run out of PT caching.
Workload=terminalservices increases this cache size (vmx).
Process creation
Every new process created requires new PT mapping. MMU
updates are frequent
Shell scripts that spawn commands can cause MMU overhead
ESX Architecture Performance Perspective
I/O Path
ESX Architecture Performance Perspective
I/O Virtualization
I/O devices are non virtualizable and therefore they are emulated in
the guest OS
VMkernel handles Storage and Networking devices directly as they
are performance critical in server environments. CDROM, floppy
devices are handled by the service console.
I/O is interrupt driven and therefore incurs monitor overhead. All I/O
goes through VMkernel and involves a context switch from VMM to
VMKernel
Latency of networking device is lower and therefore delay due to
context switches can hamper throughput
VMkernel fields I/O interrupts and delivers them to the correct VM. From ESX 2.1, VMkernel delivers the interrupts to the idle processor.
ESX Architecture Performance Perspective
Virtual Networking
Virtual NICs
Queue buffer could overflow
- if the pkt tx/rx rate is high
- VM is not scheduled frequently
VMs are scheduled when they have packets for delivery
Idle VMs still receive broadcast frames. Wastes CPU resources.
Guest speed/duplex settings are irrelevant.
Virtual Switches don't learn MAC addresses
VMs register MAC address, virtual switch knows the location of the MAC
VMnics
Listens for the MAC addresses that are registered by the VMs.
Layer 2 Broadcast frames are passed above
ESX Architecture Performance Perspective
NIC Teaming
Teaming only provides outbound load balancing
NICs with different capabilities could be teamed
Least common Capability in the bond is used
Out-MAC mode scales with number of VMs/virtual NICs. Traffic from
a single virtual NIC is never load balanced.
Out-IP scales with the number of Unique TCP/IP sessions.
Incoming traffic can come on the same NIC. Link aggregation on the
physical switches provides inbound load balancing.
Packet reflections can cause performance hits in the guest OS. No
empirical data available.
We fail back when the link comes alive again.
Performance could be affected if the link flip-flops.
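The scaling behaviour of the two outbound modes can be illustrated with a toy uplink selector. The hash functions below (modulo of the source MAC, modulo of the XOR of the two IP addresses) are assumptions chosen for illustration; the slide does not state which hash ESX actually uses.

```python
import ipaddress


def out_mac_uplink(src_mac: str, n_uplinks: int) -> int:
    """Out-MAC style: the uplink depends only on the virtual NIC's MAC address."""
    return int(src_mac.replace(":", ""), 16) % n_uplinks


def out_ip_uplink(src_ip: str, dst_ip: str, n_uplinks: int) -> int:
    """Out-IP style: the uplink depends on the source/destination address pair."""
    src = int(ipaddress.ip_address(src_ip))
    dst = int(ipaddress.ip_address(dst_ip))
    return (src ^ dst) % n_uplinks


vnic_mac = "00:50:56:00:00:01"              # hypothetical virtual NIC
for dst in ("10.0.0.2", "10.0.0.3", "10.0.0.4"):
    print(f"to {dst}: out-mac -> uplink {out_mac_uplink(vnic_mac, 2)}, "
          f"out-ip -> uplink {out_ip_uplink('10.0.0.1', dst, 2)}")
```

In this model one virtual NIC always lands on the same uplink in MAC mode regardless of destination, while in IP mode different sessions from the same virtual NIC can spread across uplinks.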
ESX Architecture Performance Perspective
vmxnet optimizations
vmxnet handles clusters of packets at once - reduces context switches and interrupts
Clustering kicks in only when the packet receive/transmit rate is high.
vmxnet shares a memory area with VMkernel - reduces copying overhead
vmxnet can take advantage of TCP checksum and Segmentation
offloading (TSO)
NIC Morphing allows loading the vmxnet driver for a vlance virtual device. Probes a new register with the vlance device.
Performance of a NIC Morphed vlance device is the same as the performance of a vmxnet virtual device.
ESX Architecture Performance Perspective
SCSI performance
Queue depth determines the SCSI throughput. When the queue is
full, SCSI I/Os are blocked limiting effective throughput.
Stages of Queuing
Buslogic/LSILogic -> VMkernel Queue -> VMkernel Driver Queue depth -> Device Firmware Queue -> Queue depth of the LUN
Sched.numrequestOutstanding - number of outstanding I/O commands per VM - see KB 1269
Buslogic driver in Windows limits the queue depth size to 1 - see KB 1890
Registry settings available for maximizing queue depth for LSILogic
adapter (Maximum Number of Concurrent I/Os)
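How queue depth bounds throughput can be estimated with Little's law (outstanding I/Os = IOPS x latency). The sketch below treats the smallest queue in the chain as the effective limit, which is a simplifying assumption, and all depths and latencies are hypothetical.

```python
from typing import Dict


def effective_queue_depth(stage_depths: Dict[str, int]) -> int:
    """Simplification: the shallowest queue in the chain limits outstanding I/Os."""
    return min(stage_depths.values())


def max_iops(queue_depth: int, latency_ms: float) -> float:
    """Little's law: IOPS = outstanding I/Os / per-I/O latency."""
    return queue_depth / (latency_ms / 1000.0)


# Hypothetical depths for each stage listed above.
stages = {
    "guest adapter": 1,         # e.g. a driver that limits the queue depth to 1
    "per-VM outstanding": 16,
    "vmkernel driver": 32,
    "lun": 32,
}
depth = effective_queue_depth(stages)
print(f"effective depth {depth}: at most {max_iops(depth, 5.0):.0f} IOPS at 5 ms")

stages["guest adapter"] = 32    # after raising the guest driver's queue depth
depth = effective_queue_depth(stages)
print(f"effective depth {depth}: at most {max_iops(depth, 5.0):.0f} IOPS at 5 ms")
```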
ESX Architecture Performance Perspective
VMFS
Uses larger block sizes (1MB default)
Larger block size reduces metadata size - metadata is completely cached in memory
Near native speeds are possible, because metadata overhead is removed
Fewer I/O operations. Improves read-ahead cache hits for sequential reads
Spanning
Data is filled to the other LUN sequentially after overflow. There is no
striping.
Does not offer performance improvements.
Distributed Access
Multiple ESX hosts can access the VMFS volume, only one ESX host
updates the meta-data
ESX Architecture Performance Perspective
VMFS
Volume Locking
Metadata updates are performed through locking mechanism
SCSI reservation is used to lock the volume
Do not confuse this locking with the file level locks implemented in the
VMFS volume for different access modes
SCSI reservation
SCSI reservation blocks all I/O operations until the lock is released by the
owner
SCSI reservation is held usually for a very short time and released as
soon as the update is performed
SCSI reservation conflict happens when SCSI reservation is attempted on
a volume that is already locked. This usually happens when multiple ESX
hosts contend for metadata updates
ESX Architecture Performance Perspective
VMFS
Contention for metadata updates
Redo log updates from multiple ESX hosts
Template deployment with redo log activity
Anything that changes/modifies file permission on every ESX host
VMFS 3.0 uses new volume locking mechanism that significantly
reduces the number of SCSI reservations used

ESX Architecture Performance Perspective
Service Console
Service console can share Interrupt resources with VMkernel.
Shared interrupt lines reduce performance of I/O devices - KB 1290
MKS is handled in the service console in ESX 2.x and its performance is determined by the resources available in the COS
The default Min CPU allocated is 8% and may not be sufficient if there are lots of VMs running
Memory recommendations for service console do not account for memory that will be used by the agents
Scalability of VMs is limited by COS in ESX 2.x. ESX 3.x avoids this problem with userworlds for VMkernel.
Understanding ESX Resource Management & Over-Commitment
ESX Resource Management
Scheduling
Only one VCPU runs on a CPU at any time
Scheduler tries to run the VM on the same CPU as much as possible
Scheduler can move VMs to other processors when it has to meet the CPU demands of the VM
Co-scheduling
SMP VMs are co-scheduled, i.e. all the VCPUs run on their own
PCPUs/LCPUs simultaneously
Co-scheduling facilitates synchronization/communication between
processors, like in the case of spinlock wait between CPUs
Scheduler can run a VCPU without the other for a short period of time (1.5
ms)
Guest could halt the co-scheduled CPU if it is not using it, but Windows doesn't seem to halt the CPU - wastes CPU cycles
ESX Resource Management
NUMA Scheduling
Scheduler tries to schedule the world within the same NUMA node
so that cross NUMA migrations are fewer
If a VM's memory pages are split between NUMA nodes, the memory scheduler slowly migrates all the VM's pages to the local node. Over time the system becomes completely NUMA balanced.
On NUMA architecture, CPU utilization per NUMA node gives better
idea of CPU contention
While factoring %ready, factor the CPU contention within the same
NUMA node.
ESX Resource Management
Hyperthreading
Hyperthreading support was added in ESX 2.1, recommended
Hyperthreading increases the scheduler's flexibility, especially in the case of running SMP VMs with UP VMs
A VM scheduled on a LCPU is charged only half the package
seconds
Scheduler tries to avoid scheduling a SMP VM onto the logical
CPUS of the same package
A high priority VM may be scheduled to a package with one of its LCPUs halted - this prevents other running worlds from using the same package
ESX Resource Management
HTSharing
Controls hyperthreading behavior with individual VMs.
htsharing=any
Virtual CPUs could be scheduled on any LCPUs. Most flexible option for the
scheduler.
htsharing=none
Excludes sharing of LCPUs with other VMs. VM with this option gets a full package
or never gets scheduled.
Essentially this excludes the VM from using logical CPUs (useful for the security
paranoid). Use this option if an application in the VM is known to perform poorly with
HT.
htsharing=internal
Applies to SMP VMs only. This is same as none, but allows sharing the same
package for the VCPUs of the same VM. Best of both worlds for SMP VMs.
For UP VMs this translates to none
ESX Resource Management
HT Quarantining
ESX uses P4 Performance counters to constantly evaluate HT
performance of running worlds
If a VM appears to interact badly with HT, the VM is automatically
placed into a quarantining mode (i.e. htsharing is set to none)
If the bad events disappear, the VM is automatically pulled back
from quarantining mode
Quarantining is completely transparent
ESX Resource Management
CPU affinity
Defines a subset of LCPUs/PCPUs that a world could run on
Useful to
partition server between departments
troubleshoot system reliability issues
For manually setting NUMA affinity in ESX 1.5.x
applications that benefit from cache affinity
Caveats
Worlds that don't have affinity can run on any CPU, so they have a better chance of getting scheduled
Affinity reduces the scheduler's capability to maintain fairness - min CPU guarantees may not be possible under some circumstances
NUMA optimizations (page migrations) are excluded for VMs that have CPU affinity
(can enforce manual memory affinity)
SMP VMs should not be pinned to LCPUs
Disallows vMotion operations
ESX Resource Management
Proportional Shares
Shares are used only when there is a resource contention
Unused shares (shares of a halting/idling VM) are partitioned across
active VMs.
In ESX 2.x shares operate on a flat namespace
Changing shares of one world affects the effective CPU cycles
received by other running worlds.
If VMs use a different share scale then shares for other worlds
should be changed to the same scale
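The effect of shares under contention can be worked through numerically. This is a minimal sketch of a flat proportional-share split; the VM names, share values and capacity are hypothetical, and the real scheduler has more inputs (min/max, affinity) than this model.

```python
from typing import Dict, Set


def cpu_entitlement(shares: Dict[str, int], active: Set[str],
                    capacity_mhz: float) -> Dict[str, float]:
    """Split capacity among active VMs in proportion to their shares.

    Shares of idle VMs are left out of the total, so their portion is
    effectively redistributed across the active VMs.
    """
    total_active = sum(shares[vm] for vm in active)
    if total_active == 0:
        return {vm: 0.0 for vm in shares}
    return {
        vm: capacity_mhz * shares[vm] / total_active if vm in active else 0.0
        for vm in shares
    }


shares = {"vm1": 1000, "vm2": 2000, "vm3": 1000}
print(cpu_entitlement(shares, {"vm1", "vm2", "vm3"}, 4000.0))  # everyone busy
print(cpu_entitlement(shares, {"vm1", "vm2"}, 4000.0))         # vm3 idle

# Raising one VM's shares lowers what every other contending VM receives.
shares["vm1"] = 5000
print(cpu_entitlement(shares, {"vm1", "vm2", "vm3"}, 4000.0))
```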
ESX Resource Management
Minimum CPU
Guarantees CPU resources when the VM requests them
Unused resources are not wasted, and are given to other worlds that require them.
Setting min CPU to 100% (200% in case of SMP) ensures that the
VM is not bound by the CPU resource limits
Using min CPU is favored over using CPU affinity or proportional
shares
Admission control verifies if Min CPUs could be guaranteed when
the VM is powered on or VMotioned
ESX Resource Management
Demystifying Ready time
Powered on VM could be either running, halted or in a ready state
Ready time signifies the time spent by a VM on the run queue waiting to be
scheduled
Ready time accrues when more than one world wants to run at the same
time on the same CPU
PCPU, VCPU over-commitment with CPU intensive workloads
Scheduler constraints - CPU affinity settings
Higher ready time increases response times or job completion time
Total accrued ready time is not useful
VM could have accrued ready time during their runtime without incurring performance
loss (for example during boot)
%ready = ready time accrual rate
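Since the total accrued ready time is not useful on its own, the accrual rate has to be computed from two samples. A minimal sketch, assuming the accumulated ready time is available in milliseconds at two points in time; the function name and sample values are hypothetical.

```python
def percent_ready(ready_ms_start: float, ready_ms_end: float, interval_ms: float) -> float:
    """%ready = ready time accrued during the interval / length of the interval."""
    if interval_ms <= 0:
        raise ValueError("interval must be positive")
    return 100.0 * (ready_ms_end - ready_ms_start) / interval_ms


# Hypothetical VM: 12.0 s of ready time accrued since boot, 12.5 s five seconds later.
print(f"%ready = {percent_ready(12_000, 12_500, 5_000):.1f}")

# The large total says nothing by itself: if it no longer grows, %ready is zero.
print(f"%ready = {percent_ready(12_500, 12_500, 5_000):.1f}")
```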
ESX Resource Management
Demystifying Ready time
There are no good/bad values for %ready.
Depends on the priority of the VMs - latency sensitive applications may
require less or no ready time
Ready time could be reduced by increasing the priority of the VM
Allocate more shares, set minCPU, remove CPU affinity
ESX Resource Management
Unexplained Ready time
If the VM accrues ready time while there are enough CPU resources
then it is called Unexplained Ready time
There is some belief in the field that such a thing actually exists - hard to prove or disprove
Very hard to determine if CPU resources are available when ready
time accrues
CPU utilization is not a good indicator of CPU contention
Burstiness is very hard to determine
NUMA boundaries - all VMs may contend within the same NUMA node
Misunderstanding of how scheduler works

ESX Resource Management
Resource Management in ESX 3.0
Resource Pools
Extends hierarchy. Shares operate within the resource pool domain.
MHz
Resource allocations are absolute, based on clock cycles. % based allocation could vary with processor speeds.
Clusters
Aggregates resources from multiple ESX hosts
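Why a percentage is ambiguous while MHz is not can be seen with two hosts of different speeds. The clock speeds below are hypothetical and the helper name is an assumption.

```python
def percent_to_mhz(percent: float, core_mhz: float) -> float:
    """Convert a percentage-of-one-CPU allocation into absolute clock cycles (MHz)."""
    return core_mhz * percent / 100.0


# The same 50% allocation means different amounts of CPU on different hosts.
for core_mhz in (2000.0, 3000.0):
    print(f"50% of a {core_mhz:.0f} MHz CPU = {percent_to_mhz(50, core_mhz):.0f} MHz")
```

An allocation expressed directly in MHz stays the same when a VM moves between hosts in a cluster, which a percentage cannot guarantee.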

Resource Over-Commitment
CPU Over-Commitment
Scheduling
Too many things to do!
Symptoms: high %ready
Judicious use of SMP
CPU utilization
Too much to do!
Symptoms: 100% CPU
Things to watch
- Misbehaving applications inside the guest
- Do not rely on Guest CPU utilization - halting issues, timer interrupts
- Some applications/services seem to impact guest halting behavior. No longer tied
to SMP HALs.
Resource Over-Commitment
CPU Over-Commitment
Higher CPU utilization does not necessarily mean lower performance.
Application progress is not affected by higher CPU utilization
However if higher CPU utilization is due to monitor overheads then it may
impact performance by increasing latency
When there is no headroom (100% CPU), performance degrades
100% CPU utilization and %ready are almost identical - both delay application progress
CPU Over-Commitment could lead to other performance problems
Dropped network packets
Poor I/O throughput
Higher latency, poor response time
Resource Over-Commitment
Memory Over-Commitment
Guest Swapping - Warning
Guest page faults while swapping.
Performance is affected by both guest swapping and due to monitor overhead
handling page faults.
Additional disk I/O
Ballooning - Serious
VMkernel Swapping - Critical
COS Swapping - Critical
VMX process could stall and affect the progress of the VM
VMX could be the victim when the kernel kills a random process
COS requires additional CPU cycles, for handling frequent page faults and disk I/O
Memory shares determine the rate of ballooning/swapping
Resource Over-Commitment
Memory Over-Commitment
Ballooning
Ballooning/swapping stalls processor, increases delay
Windows VMs touch all allocated memory pages during boot. Memory
pages touched by the guest could be reclaimed only by ballooning
Linux guest touches memory pages on demand. Ballooning kicks in only
when the guest is under complete memory pressure
Ballooning could be avoided by using min=max
/proc/vmware/sched/mem
- size <> sizetgt - indicates memory pressure
- mctl > mctlgt - ballooning out (giving away pages)
- mctl < mctlgt - ballooning in (taking in pages)
Memory shares affect ballooning rate
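The three comparisons above can be captured in a small helper. This sketch assumes the four values have already been read by hand from /proc/vmware/sched/mem; it does not parse the file, because the layout of that file is not given here, and the function name is hypothetical.

```python
from typing import List


def memory_state(size: int, sizetgt: int, mctl: int, mctlgt: int) -> List[str]:
    """Apply the rules above to one VM's size/sizetgt and mctl/mctlgt values."""
    notes = []
    if size != sizetgt:
        notes.append("memory pressure")
    if mctl > mctlgt:
        notes.append("ballooning out (giving away pages)")
    elif mctl < mctlgt:
        notes.append("ballooning in (taking in pages)")
    return notes


# Hypothetical values copied from the proc node for two VMs.
print(memory_state(size=1024, sizetgt=1024, mctl=0, mctlgt=0))
print(memory_state(size=1024, sizetgt=768, mctl=100, mctlgt=256))
```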
Resource Over-Commitment
Memory Over-Commitment
VMKernel Swapping
Processor stalls due to VMkernel swapping are more expensive than
ballooning (due to disk I/O)
Do not confuse this with
- Swap reservation: Swap is always reserved for the worst case scenario; if min <> max, reservation = max - min
- Total swapped pages: Only current swap I/O affects performance
/proc/vmware/sched/mem-verbose
- swpd - total pages swapped
- swapin, swapout - swap I/O activity
SCSI I/O delays during VMKernel I/O swapping could result in system
reliability issues
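The distinction between reserved swap, total swapped pages and current swap I/O can be made concrete. A minimal sketch with hypothetical numbers; the counter values are assumed to be cumulative samples taken by hand from the proc node, and the function names are not from any VMware tool.

```python
def swap_reservation_mb(min_mb: int, max_mb: int) -> int:
    """Worst-case swap reserved for a VM: reservation = max - min."""
    return max(max_mb - min_mb, 0)


def swap_io_rate(prev_count: int, curr_count: int, interval_s: float) -> float:
    """Swap activity per second between two samples of a cumulative counter."""
    return (curr_count - prev_count) / interval_s


# Reservation exists even if the VM never swaps.
print("reserved:", swap_reservation_mb(min_mb=256, max_mb=1024), "MB")
print("reserved with min=max:", swap_reservation_mb(min_mb=1024, max_mb=1024), "MB")

# A large swpd with flat swapin/swapout counters means no current swap I/O.
print("swapin rate:", swap_io_rate(prev_count=5000, curr_count=5000, interval_s=10.0))
print("swapin rate:", swap_io_rate(prev_count=5000, curr_count=6200, interval_s=10.0))
```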
Resource Over-Commitment
I/O bottlenecks
PCI Bus saturation
Target device saturation
Easy to saturate storage arrays if the topology is not designed correctly for load
distribution
Packet drops
Effective throughput reduces
Retransmissions can cause congestion
Window size scales down in the case of TCP
Latency affects throughput
TCP is very sensitive to Latency and packet drops
Broadcast traffic
Multicast and broadcast traffic sent to all VMs.
Keep an eye on Pkts/sec and IOPS and not just bandwidth consumption
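The sensitivity of TCP to latency follows from the window: a sender can have at most one window of data in flight per round trip, so throughput is bounded by window / RTT. The window size and round-trip times below are hypothetical.

```python
def window_limited_mbps(window_bytes: int, rtt_ms: float) -> float:
    """Upper bound on TCP throughput in Mbit/s: one window per round trip."""
    if rtt_ms <= 0:
        raise ValueError("rtt must be positive")
    return (window_bytes * 8) / (rtt_ms / 1000.0) / 1e6


window = 64 * 1024  # a 64 KB window, chosen for illustration
for rtt_ms in (0.5, 1.0, 5.0, 10.0):
    print(f"RTT {rtt_ms:>4} ms -> at most {window_limited_mbps(window, rtt_ms):8.1f} Mbit/s")
```

Added scheduling delay or retransmissions raise the effective round-trip time, and a window that scales down after packet drops lowers the numerator, so both reduce the bound regardless of link bandwidth.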
ESX Performance - Application Performance Issues
ESX Performance Application Issues
Before we begin
From the VM's perspective, a running application is just an x86 workload.
Any application performance tuning that makes the application run more efficiently will help
Application performance can vary between versions
New version could be more or less efficient
Tuning recommendations could change
Application behavior could change based on its configuration
Application performance tuning requires intimate knowledge on how the
application behaves
Nobody at VMware specializes in application performance tuning
Vendors should optimize their software with the thought that the hardware resources
could be shared by other Operating Systems.
TAP program
- SpringSource (unit of VMware) Provides developer support for API scripting
ESX Performance Application issues
Citrix
Roughly 50-60% monitor overhead - takes 50-60% more CPU cycles than on the native machine
The maximum number of users limit is hit when the CPU is maxed out - roughly 50% of the users that would be seen in a native environment in an apples to apples comparison.
Citrix Logon delays
This could happen even on native machines when roaming profiles are configured.
Refer Citrix and MS KB articles
Monitor overhead can introduce logon delays
Workarounds
Disable com ports, workload=terminalservices, disable unused apps, scale
horizontally
ESX 3.0 improves Citrix performance - roughly 70-80% of native performance
ESX Performance Application issues
Database performance
Scales well with vSMP - recommended
Exceptions: Pervasive SQL - not optimized for SMP
Two key parameters for database workloads
Response time
- Transaction logs
CPU utilization
Understanding SQL performance is complex. Most enterprise
databases run some sort of query optimizer that changes the SQL
Engine parameters dynamically
Performance will vary with run time. Typically benchmarking is done after
priming the database
Memory resource is key. SQL performance can vary a lot depending
on the available memory.
ESX Performance Application Issues
Lotus Domino Server
One of the better performing workloads. 80-90% of direct_exec
CPU and I/O intensive
Scalability issues - not a good idea to run all Domino servers on the same ESX server.
ESX Performance Application Issues
16-bit applications
16-bit applications on Windows NT/2000 and above run in a sandboxed Virtual Machine
16-bit apps depend on segmentation - possible monitor overhead.
Some 16-bit apps seem to spin idle loop instead of halting the CPU
Consumes excessive CPU cycles
No performance studies done yet
No compelling application
ESX Performance Application Issues
Netperf throughput
Max Throughput is bound by a variety of parameters
Available Bandwidth, TCP window size, available CPU cycles
VM incurs additional CPU overhead for I/O
CPU utilization for networking varies with
Socket buffer size, MTU - affects the number of I/O operations performed
Driver - vmxnet consumes fewer CPU cycles
Offloading features - depending on the driver settings and NIC capabilities
For most applications, throughput is not the bottleneck
Measuring throughput and improving it may not always resolve the
underlying performance issue
ESX Performance Application Issues
Netperf Latency
Latency plays an important role for many applications
Latency can increase
When there are too many VMs to schedule
VM is CPU bound
Packets are dropped and then re-transmitted
ESX Performance Application Issues
Compiler Workloads
MMU intensive: Lots of new processes created, context switched,
and destroyed.
SMP VM may hurt performance
Many compiler workloads are not optimized for SMP
Process threads could ping-pong between the vCPUs
Workarounds:
Disable NPTL
Try UP (don't forget to change the HAL)
Workload=terminalservices might help
ESX Performance Forensics
ESX Performance Forensics
Troubleshooting Methodology
Understand the problem.
Pay attention to all the symptoms
Pay less attention to subjective metrics.
Know the mechanics of the application
Find how the application works
What resources it uses, and how it interacts with the rest of the system
Identify the key bottleneck
Look for clues in the data and see if that could be related to the symptoms
Eliminate CPU, Disk I/O, Networking I/O, Memory bottlenecks by running
tests
Running the right test is critical.
ESX Performance Forensics
Isolating memory bottlenecks
Ballooning
Swapping
Guest MMU overheads
ESX Performance Forensics
Isolating Networking Bottlenecks
Speed/Duplex settings
Link state flapping
NIC Saturation /Load balancing
Packet drops
Rx/Tx Queue Overflow
ESX Performance Forensics
Isolating Disk I/O bottlenecks
Queue depth
Path thrashing
LUN thrashing
ESX Performance Forensics
Isolating CPU bottlenecks
CPU utilization
CPU scheduling contention
Guest CPU usage
Monitor Overhead
ESX Performance Forensics
Isolating Monitor overhead
Procedures for release builds
Collect performance snapshots
Monitor Components
ESX Performance Forensics
Collecting Performance Snapshots
Duration
Delay
Proc nodes
Running esxtop on performance snapshots
ESX Performance Forensics
Collecting Benchmarking numbers
Client side benchmarks
Running benchmarks inside the guest
ESX Performance Troubleshooting - Summary
ESX Performance Troubleshooting - Summary
Key points
Address real performance issues. Lots of time could be spent on spinning
wheels on theoretical benchmarking studies
Real performance issues could be easily described by the end user who
uses the application
There is no magical configuration parameter that will solve all performance
problems
ESX performance problems are resolved by
Re-architecting the deployment
Tuning application
Applying workarounds to circumvent bad workloads
Moving to a newer version that addresses a known problem
Understanding Architecture is the key
Understanding both ESX and application architecture is essential to resolve
performance problems
Questions?
Reference links
http://www.vmware.com/files/pdf/perf-vsphere-memory_management.pdf
http://www.vmware.com/resources/techresources/10041
http://www.vmware.com/resources/techresources/10054
http://www.vmware.com/resources/techresources/10066
http://www.vmware.com/files/pdf/perf-vsphere-cpu_scheduler.pdf
http://www.vmware.com/pdf/RVI_performance.pdf
http://www.vmware.com/pdf/Perf_ESX_Intel-EPT-eval.pdf
http://www.vmware.com/files/pdf/perf-vsphere-fault_tolerance.pdf