Agenda
Introduction to performance monitoring in VI
Common customer/partner questions (use cases)
Tips and Tricks
Every counter includes a description and unit of measure.
Latest Information
http://vmware.com/developer
VMware Developer Center Blog
Interval Period   Interval Length
5 minutes*        1 day*
30 minutes        1 week
2 hours           1 month
1 day             1 year*
Aggregation can also be turned off beyond a particular historical interval. Below is an example of a customer configuration:
Interval    Interval Period   Interval Length   Level   Aggregate
Per day     5 minutes*        1 day*            4       Yes
Per week    30 minutes        1 week            3       Yes
Per month   2 hours           1 month           2       No
Per year    1 day             1 year*           1       No
esxtop has better interval granularity. I will use it all the time.
esxtop puts a very high load on the server. It should be used for interactive troubleshooting at best. The API counters are optimized for retrieval and aggregation and provide all the data that is necessary to debug problems.
CPU capacity
Some caveats on ready time
Used time ~ ready time: may signal contention. However, the host might not be overcommitted all the time, due to workload variability.
In this example, we have periods of activity and idle periods: the CPU isn't overcommitted all the time.
[Figure: screenshot from VI Client, annotated "Ready time ~ used time" and "Used time"]
High Ready Time
High MLMTD: there is a limit on this VM
High ready time is not always caused by overcommitment
Reservation set on VM
4x2GHz host, 2 vCPU VM, limit set to 1GHz (VM can consume 1GHz)
Without limit, max is 2GHz. With limit, max is 1GHz (50% of 2GHz)
CPU all busy: %USED: 50%; %MLMTD & %RDY = 150% [total is 200%, or 2 CPUs]
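The arithmetic in this example can be sketched as follows. This is an illustration only, assuming every vCPU is busy for the whole sample and that percentages are expressed relative to one 2GHz physical CPU, as on the slide; the class and method names are hypothetical.

```java
class CpuLimitExample {
    // Percentage of one physical CPU the VM may consume under its limit.
    static double usedPercent(double limitMhz, double coreMhz) {
        return 100.0 * limitMhz / coreMhz;
    }

    // With all vCPUs busy, whatever is not %USED shows up as %MLMTD plus %RDY.
    static double limitedPlusReadyPercent(int vcpus, double usedPercent) {
        return 100.0 * vcpus - usedPercent;
    }

    public static void main(String[] args) {
        double used = usedPercent(1000.0, 2000.0);           // 1GHz limit on 2GHz CPUs
        double throttled = limitedPlusReadyPercent(2, used); // 2 vCPU VM
        System.out.println("%USED = " + used + ", %MLMTD + %RDY = " + throttled);
    }
}
```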
Possible solutions
VMotion the VM or use DRS to optimize resources
Change share allocations to de-prioritize less important VMs
Check CPU limit settings
More CPUs may be the solution
Memory capacity
How do we identify host memory contention?
Host-level swapping (e.g., robbing VM A to satisfy VM B).
Active memory for all VMs > physical memory on host
This could mean possible memory over-commitment
What do I do?
Check swapin (cumulative), swapout (cumulative) and swapused (instantaneous) for the host. Ballooning (vmmemctl) is also useful.
If swapin and swapout are increasing, it means that there is possible memory over-commitment.
Another possibility: sum up active memory for each VM. See if it exceeds host physical memory.
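The two checks above can be sketched in a few lines. This is a minimal illustration that assumes the counter samples have already been retrieved and are all in the same unit (KB here); the class, method names, and sample values are hypothetical, not part of the VI API.

```java
class MemoryContentionCheck {
    // A cumulative counter (swapin, swapout) that grows across the window
    // indicates swap activity during that window.
    static boolean isIncreasing(long[] cumulativeSamples) {
        int n = cumulativeSamples.length;
        return n >= 2 && cumulativeSamples[n - 1] > cumulativeSamples[0];
    }

    // Sum active memory over all VMs and compare with host physical memory.
    static boolean activeExceedsPhysical(long[] vmActiveKb, long hostPhysicalKb) {
        long total = 0;
        for (long active : vmActiveKb) {
            total += active;
        }
        return total > hostPhysicalKb;
    }

    public static void main(String[] args) {
        long[] swapin = {0, 0, 1200, 5400};                // made-up samples
        long[] vmActive = {3_000_000, 2_500_000, 4_000_000};
        long hostPhysical = 8_000_000;                     // made-up host size
        System.out.println("swap activity: " + isIncreasing(swapin));
        System.out.println("active > physical: " + activeExceedsPhysical(vmActive, hostPhysical));
    }
}
```

Either signal on its own only suggests possible over-commitment; the slides treat both as hints to investigate further.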
Memory capacity
Increased swap activity may be a sign of over-commitment
[Figure: chart annotated "Increase in swap activity", "No swap activity", "Swap in", "Swap usage", "Ballooning active"]
What do I do?
Check requests per sampling interval and bytes transferred/received per sampling interval
For disks, check latencies
Compare with specs for the network or disk subsystems
Effective Link Bandwidth = ~80% of Real Bandwidth
Effective (2Gbps) = 200 MBps
Effective (4Gbps) = 400 MBps
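The rule of thumb above works out as follows. This sketch assumes decimal units (1 Gbps = 1000 Mbps) and 8 bits per byte; the 80% factor is the approximation from the slide, and the class and method names are hypothetical.

```java
class EffectiveBandwidth {
    // Approximate usable fraction of the nominal link rate (from the slide).
    static final double EFFICIENCY = 0.80;

    // Convert a nominal link rate in Gbps to effective throughput in MBps.
    static double effectiveMBps(double linkGbps) {
        double rawMBps = linkGbps * 1000.0 / 8.0; // Gbps -> Mbps -> MBps
        return rawMBps * EFFICIENCY;
    }

    public static void main(String[] args) {
        System.out.println("2Gbps link: ~" + Math.round(effectiveMBps(2.0)) + " MBps");
        System.out.println("4Gbps link: ~" + Math.round(effectiveMBps(4.0)) + " MBps");
    }
}
```

Measured throughput that sits close to these figures suggests the link itself, rather than the host, is the limit.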
Disk throughput
Counters of interest
If you are looking at real-time statistics...
CPU: usage (% or MHz), used time, ready time, wait time
Memory: consumed, active, swapused, swapin, swapout, vmmemctl
Disk: diskReadLatency, diskWriteLatency, commands, commandsAborted, bytes transferred/received, disk bus resets
Counters of interest
Counter Name            Description
cpu.usage.average       CPU usage (%)
cpu.used.summation      Used time (ms)
cpu.ready.summation     Ready to run, no resources available (ms)
cpu.wait.summation      Blocked waiting (e.g., for I/O) (ms)
mem.consumed.average    Machine pages taken by VM
mem.active.average      Working set of VM
mem.swapused.average    Instantaneous swapped memory for VM
mem.swapin.average      Cumulative swapped-in memory for VM
mem.swapout.average     Cumulative swapped-out memory for VM
mem.vmmemctl.average    Ballooned memory for VM
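Because the summation counters report milliseconds accumulated over a sample, it often helps to express them as a percentage of the sampling interval. A minimal sketch, assuming the interval length in seconds is known (20 seconds is assumed here for real-time statistics); the class and method names are hypothetical.

```java
class ReadyTimePercent {
    // Convert a summation value in ms (e.g. cpu.ready.summation) into a
    // percentage of the sampling interval.
    static double percentOfInterval(double summationMs, double intervalSeconds) {
        return 100.0 * summationMs / (intervalSeconds * 1000.0);
    }

    public static void main(String[] args) {
        // Assumption: 20-second real-time sample; 2000 ms of ready time in it.
        System.out.println("ready: " + percentOfInterval(2000.0, 20.0) + "%");
    }
}
```

The same conversion applies to cpu.used.summation and cpu.wait.summation, which makes used time and ready time directly comparable.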
Counters of interest
Counter Name                      Description
disk.commands.summation           Disk commands issued
disk.usage.average                Disk bandwidth consumed
disk.commandsAborted.summation    Disk commands aborted
disk.busResets.summation          SCSI bus resets
disk.deviceLatency.average        Latency at the device
disk.kernelLatency.average        Latency within the vmkernel
net.usage.average                 Network bandwidth consumed
net.packetsRx.summation           Packets received in sample interval
net.packetsTx.summation           Packets transmitted in sample interval
For 12 data points (1 hour of 5-minute stats): 5.4MB
Things add up, don't they?
At 5.4MB, serialization cost becomes significant
Bottom line:
Serialization is important: pick metrics wisely
As the DB grows, its latency becomes significant
64 hosts, 1233 powered-on VMs, real-time stats, VIPerl toolkit used
Querying through VC can be ~ querying through hosts (inventory monitoring is easier with VC, though; consider views)
Different client implementations may yield different results (# threads?)
    // One queryPerf call per query spec: 1000 separate calls
    PerfQuerySpec[] pqsArray = new PerfQuerySpec[1];  // one-element array
    for (int i = 0; i < 1000; i++) {
        PerfQuerySpec pqs = new PerfQuerySpec();
        pqsArray[0] = pqs;
        PerfEntityMetricBase[] pemb = service.queryPerf(perfManager, pqsArray);
    }
    // All query specs collected first, then sent in a single queryPerf call
    PerfQuerySpec[] pqsArray = new PerfQuerySpec[1000];
    for (int i = 0; i < 1000; i++) {
        PerfQuerySpec pqs = new PerfQuerySpec();
        pqsArray[i] = pqs;
    }
    PerfEntityMetricBase[] pemb = service.queryPerf(perfManager, pqsArray);
Remember:
Collect only what you will use
Use everything that you collect