
AP02

NFS & iSCSI: Performance Characterization and Best Practices in ESX 3.5
Priti Mishra
MTS, VMware

Bing Tsai
Sr. R&D Manager, VMware

Housekeeping
Please turn off your mobile phones, BlackBerries, and laptops
Your feedback is valued: please fill in the session evaluation form (specific to this session) and hand it to the room monitor or to the materials pickup area at registration
Each delegate who returns their completed event evaluation form to the materials pickup area will be eligible for a free evaluation copy of VMware's ESX 3i
Please leave the room between sessions, even if your next session is in the same room, as you will need to be rescanned

Topics
General Performance Data and Comparison
Improvements in ESX 3.5 over ESX 3.0.x

Performance Best Practices
Troubleshooting Techniques


Basic methodology
Tools
Case studies

Key performance improvements since ESX 3.0.x (1 of 3)


NFS
Accurate CPU accounting further improves load balancing among multiple VMs
Optimized buffer and heap sizes
Improvements in TSO support
TSO (TCP segmentation offload) improves large writes

H/W iSCSI (with QLogic 405x HBA)


Improvements in PAE (large memory) support
Results in better multi-VM performance in large systems

Minimized NUMA performance overhead


This overhead exists in physical systems as well

Improved CPU cost per I/O

Key performance improvements since ESX 3.0.x (2 of 3)


S/W iSCSI (S/W-based initiator in ESX)
Improvements in CPU cost per I/O
Accurate CPU accounting further improves load balancing among multiple VMs
Increased maximum transfer size
Minimizes iSCSI protocol processing cost
Reduces network overhead for large I/Os

Ability to handle more concurrent I/Os


Improved multi-VM performance

Key performance improvements since ESX 3.0.x (3 of 3)


S/W iSCSI (continued)
Improvements in PAE (large memory) support
CPU efficiency much improved for systems with >4GB memory

Minimized NUMA performance overhead

Performance Experiment Setup (1 of 3)


Workload: Iometer
Standard set based on
Request size
1k, 4k, 8k, 16k, 32k, 64k, 72k, 128k, 256k, 512k

Access mode
50% read / 50% write

Access pattern
100% sequential

1 worker, 16 Outstanding I/Os

Cached runs
100MB data disks to minimize array/server disk activity
All I/Os served from server/array cache
Gives an upper bound on performance
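For reference, the standard set above reduces to a simple parameter matrix; below is a minimal Python sketch of it (an illustrative encoding, not Iometer's own configuration format):

    # Standard test matrix used in these experiments (illustrative encoding).
    REQUEST_SIZES_KB = [1, 4, 8, 16, 32, 64, 72, 128, 256, 512]

    test_matrix = [
        {
            "size_kb": size,
            "read_pct": 50,         # 50% read / 50% write
            "sequential_pct": 100,  # 100% sequential access
            "workers": 1,
            "outstanding_ios": 16,
        }
        for size in REQUEST_SIZES_KB
    ]

    for spec in test_matrix:
        print(spec)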

Performance Experiment Setup (2 of 3)


VM information
Windows 2003 Enterprise Edition
1 VCPU, 256 MB memory
No file system used in VM (Iometer sees the disk as a physical drive)
No caching done in VM

Virtual disks located on RDM device configured in physical mode


Note: VMFS-formatted volumes are used in some tests where noted

Performance Experiment Setup (3 of 3)


ESX Server
4 sockets, 8 x 2.4GHz cores
32GB DRAM
2 x Gigabit NICs
One for vmkernel networking: used for NFS and software iSCSI protocols
One for general VM connectivity

Networking Configuration
Dedicated VLANs for data traffic isolated from general networking

How to read performance comparison charts


Throughput
Higher is better: a positive % difference means higher throughput

Latency
Lower is better: a negative % difference means lower response time

CPU cost
Lower is better: a negative % difference means reduced CPU cost
Why does this metric matter?

CPU Costs
Why is CPU cost data useful?
Determines how much I/O traffic the system CPUs can handle
How many I/O-intensive VMs can be consolidated in a host

How to compute CPU cost


Measure total physical CPU usage in ESX
esxtop counter: Physical Cpu(_Total)

Normalize to per I/O or per MBps


Example:
MHz/MBps = (physical CPU usage % / 100) x (# of physical CPUs) x (CPU MHz rating) / (throughput in MBps)
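As a worked illustration of this formula, here is a minimal Python sketch (the function name and inputs are illustrative; the usage figure comes from the esxtop counter above):

    def cpu_cost_mhz_per_mbps(cpu_used_pct, num_pcpus, cpu_mhz, throughput_mbps):
        # (usage fraction) x (total CPU capacity in MHz) / (observed throughput)
        return (cpu_used_pct / 100.0) * num_pcpus * cpu_mhz / throughput_mbps

    # Example: 25% total physical CPU usage on 8 x 2400MHz cores at 100 MBps
    # -> (0.25 * 8 * 2400) / 100 = 48 MHz per MBps
    print(cpu_cost_mhz_per_mbps(25.0, 8, 2400, 100.0))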

Performance Data
First set: relative to baselines in ESX 3.0.x
Second set: comparison of storage options using Fibre Channel data as the baseline
Last: VMFS vs. RDM physical

Software iSCSI Throughput Comparison to 3.0.x:


higher is better
[Chart: Sequential 50% Write Throughput Comparison -- % difference vs. I/O size (bytes)]

Software iSCSI Latency Comparison to 3.0.x:


lower is better
[Chart: Sequential 50% Write Latency Comparison -- % difference vs. I/O size (bytes)]

Software iSCSI CPU Cost Comparison to 3.0.x:


lower is better
[Chart: Sequential 50% Write CPU Efficiency Comparison -- % difference vs. I/O size, 1k to 512k]

Software iSCSI Performance Summary


Lower CPU costs
Can lead to higher throughput for small I/O sizes when the CPU is pegged
CPU cost per I/O is also greatly improved for larger block sizes

Latency is lower
Especially for smaller data sizes
Read operations benefit most

Throughput levels
Dependent on workload
Mixed read-write patterns show the most gain
Read I/Os show gains for small data sizes

Hardware iSCSI Throughput Comparison to 3.0.x:


higher is better
[Chart: Sequential 50% Write Throughput Comparison -- % difference vs. I/O size (bytes)]

Hardware iSCSI Latency Comparison to 3.0.x:


lower is better
[Chart: Sequential 50% Write Latency Comparison -- % difference vs. I/O size (bytes)]

Hardware iSCSI CPU Cost Comparison to 3.0.x :


lower is better
[Chart: Sequential 50% Write CPU Efficiency Comparison -- % difference vs. I/O size, 1k to 512k]

Hardware iSCSI Performance Summary


Lower CPU costs
Results in higher throughput levels for small I/O sizes
CPU cost per I/O is especially improved for larger data sizes

Latency is better
Smaller data sizes show the most gain
Mixed read-write and read I/Os benefit more

Throughput levels
Dependent on workload
Mixed read-write patterns show the most gain for all block sizes
Pure read and write I/Os show gains for small block sizes

NFS Performance Summary


Performance is also significantly improved in ESX 3.5
Data not shown here in the interest of time

Protocol Comparison
Which storage option to choose?
IP Storage vs. Fibre Channel

How to read the charts?


All data is presented as a ratio to the corresponding 2Gb FC (Fibre Channel) data
If the ratio is 1, the FC and IP protocol values are identical; if < 1, the FC value is larger

Comparison with FC: Throughput


if < 1, the FC data value is larger

[Chart: Comparison with FC -- Throughput -- Sequential 50% Write; ratio to 2Gb FC vs. I/O size, 1k to 512k; series: H/W iSCSI, S/W iSCSI, NFS]

Comparison with FC: Latency


lower is better

[Chart: Comparison with FC -- Response Time -- Sequential 50% Write; ratio to 2Gb FC vs. I/O size, 1k to 512k; series: H/W iSCSI, S/W iSCSI, NFS]

VMFS vs. RDM


Which one has better performance?
Data shown as a ratio to RDM physical

VMFS vs. RDM-physical: Throughput


higher is better
[Chart: Sequential 50% Write Throughput Comparison -- % difference vs. I/O size (bytes)]

VMFS vs. RDM-physical: Latency


lower is better
[Chart: Sequential 50% Write Latency Comparison -- % difference vs. I/O size (bytes)]

VMFS vs. RDM-physical: CPU Cost


lower is better
[Chart: Sequential 50% Write CPU Cost Comparison -- % difference vs. I/O size, 1k to 512k]

Topics
General Performance Data and Comparison
Improvements in ESX 3.5 over ESX 3.0.x

Performance Best Practices
Troubleshooting Techniques


Basic methodology
Tools
Case studies

Pre-Deployment Best Practices: Overview

Understand the performance capability of your:
Storage server/array
Networking hardware and configurations
ESX host platform

Know your workloads
Establish performance baselines

Pre-Deployment Best Practices (1 of 4)


Storage server/array: a complex system by itself
Total spindle count
Number of spindles allocated for use
RAID level and stripe size
Storage processor specifications
Read/write cache sizes and caching policy settings
Read-Ahead, Write-Behind, etc.

Useful sources of information:


Vendor documentation: manuals, best practice guides, white papers, etc.
Third-party benchmarking reports
NFS-specific tuning information: SPEC SFS disclosures at http://www.spec.org

Pre-Deployment Best Practices (2 of 4)


Networking
Routing topology and path configurations: # of links in between, etc.
Switch type, speed, and capacity
NIC brand/model, speed, and features
H/W iSCSI HBAs

ESX host platform
CPU: revision, speed, and core count
Architecture basics
SMP or NUMA?
Disabling NUMA is not recommended
Bus speed, I/O subsystems, etc.

Memory configuration and size


Note: NUMA nodes may not have equal amounts of memory

Pre-Deployment Best Practices (3 of 4)


Workload characteristics
What are the smallest, largest, and most common I/O sizes?
What is the read %? The write %?
Is the access pattern sequential? Random? Mixed?
Is response time more important, or aggregate throughput?
Is response time variance an issue or not?
Important: know the peak resource usage, not just the average

Pre-Deployment Best Practices (4 of 4)


Establish performance baselines by running standardized benchmarks
What's the upper-bound IOps for small I/Os?
What's the upper-bound MBps?
What's the average/worst-case response time?
What's the CPU cost of doing I/O?
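A minimal Python sketch of turning raw benchmark output into those baseline numbers (the input format, per-I/O latencies plus an I/O size and run duration, is an assumption):

    import random

    def summarize_baseline(latencies_ms, bytes_per_io, duration_s):
        # Upper-bound I/O rate and bandwidth observed, plus average and
        # worst-case response time.
        iops = len(latencies_ms) / duration_s
        return {
            "iops": iops,
            "mbps": iops * bytes_per_io / (1024 * 1024),
            "avg_latency_ms": sum(latencies_ms) / len(latencies_ms),
            "worst_latency_ms": max(latencies_ms),
        }

    # Hypothetical run: 50,000 4KB I/Os completed in 10 seconds.
    latencies = [random.uniform(0.5, 3.0) for _ in range(50000)]
    print(summarize_baseline(latencies, 4096, 10.0))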

Additional Considerations (1 of 3)
NFS parameters
# of NFS mount points
Multiple VMs using multiple mount points may give higher aggregate throughput at slightly higher CPU cost
Export options on the NFS server affect performance

iSCSI protocol parameters
Header digest processing: slight impact on performance
Data digest processing: turning it off may result in:
Improved CPU utilization
Slightly lower latencies
Minor throughput improvement
Actual outcome is highly dependent on workload

Additional Considerations (2 of 3)
NUMA specific
If only one VM is doing heavy I/O, it may be beneficial to pin the VM and its memory to node 0
If CPU usage is not a concern, no pinning is necessary
On each VM reboot, ESX Server will place it on the next adjacent NUMA node
Minor performance implications for certain workloads
To avoid this movement, affinitize the VM using the VI Client (a configuration sketch follows)
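A hedged sketch of such pinning via per-VM configuration options: sched.cpu.affinity and sched.mem.affinity are the usual .vmx option names, but verify them against the documentation for your ESX version, and the CPU list below assumes node 0 holds physical CPUs 0-3:

    sched.cpu.affinity = "0,1,2,3"
    sched.mem.affinity = "0"

The VI Client's CPU and memory affinity settings are the supported way to achieve the same result.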

SMP VMs
For I/O workloads within an SMP VM that migrate frequently between VCPUs:
Pin the guest thread/process to a specific VCPU
Some versions of Linux have a kHz timer rate and may incur high overhead

Additional Considerations (3 of 3)
CPU headroom
Software-initiated iSCSI and NFS protocols can consume a significant amount of CPU under certain I/O patterns
Small I/O workloads require a large amount of CPU; ensure that CPU saturation does not restrict the I/O rate (a sizing sketch follows)
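As a rough headroom check, a back-of-the-envelope Python sketch using the MHz-per-I/O cost metric from earlier (all numbers here are hypothetical):

    def max_sustainable_iops(num_pcpus, cpu_mhz, mhz_per_io, headroom_pct=20.0):
        # Reserve some CPU headroom for the VMs' own work and for bursts.
        usable_mhz = num_pcpus * cpu_mhz * (1.0 - headroom_pct / 100.0)
        return usable_mhz / mhz_per_io

    # Hypothetical: 8 x 2400MHz cores, 1.5 MHz per small I/O, 20% headroom
    # -> (8 * 2400 * 0.8) / 1.5 = 10240 I/Os per second
    print(max_sustainable_iops(8, 2400, 1.5))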

Networking
Avoid link over-subscription
Ensure all networking parameters, down to the basic gigabit connection, are consistent across the full network path
Use VLANs or zoning intelligently to minimize traffic interference

General Troubleshooting Tips (1 of 3)


Identify:
Components in the whole I/O path
Possible issues at each layer in the path
Check all hardware and software configuration parameters, in particular:
Disk configurations and cache management policies on the storage server/array
Network settings and routing topology
Design experiments to isolate problems, such as:
Cached runs
Use a small file or logical device, or a physical host configured with RAM disks
Minimizes physical disk effects
Indicates the upper-bound throughput and I/O rate achievable

General Troubleshooting Tips (2 of 3)


Run tests with a single outstanding I/O
Easier analysis of packet traces
Throughput entirely dependent on I/O response times
Micro-benchmark each layer in the I/O path
Compare to non-virtualized, native performance results
Collect data:
Guest OS data: but don't trust the CPU%
esxtop data
Storage server/array data: cache hit ratio, storage processor busy %, etc.
Packet tracing with tools like tcpdump, Ethereal, Wireshark, etc.

General Troubleshooting Tips (3 of 3)


Analyze performance data
Do any stats, e.g., throughput or latency, change drastically over time? (a sketch of this check follows)
Check esxtop data for anomalies, e.g., CPU spikes or excessive queueing
Server/array stats
Compare array stats with ESX stats
Is the cache hit ratio reasonable? Is the storage processor overloaded?
Network trace analysis
Inspect packet traces to see whether NFS and iSCSI requests are processed in a timely manner
Do the I/O sizes issued by the guest match the transfer sizes over the wire?
Are block addresses aligned to appropriate boundaries?
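A minimal Python sketch of the "does anything change drastically over time?" check, run against latency or throughput samples exported from esxtop batch mode or a benchmark log (the input format is an assumption):

    def flag_drastic_changes(samples, window=10, factor=2.0):
        # Flag any sample that exceeds `factor` times the average of the
        # preceding `window` samples.
        flagged = []
        for i in range(window, len(samples)):
            baseline = sum(samples[i - window:i]) / window
            if baseline > 0 and samples[i] > factor * baseline:
                flagged.append((i, samples[i], baseline))
        return flagged

    # Hypothetical latency samples in ms; the 25ms spike gets flagged.
    latencies_ms = [1.1, 1.0, 1.2, 1.1, 1.0, 1.3, 1.1, 1.2, 1.0, 1.1, 25.0, 1.2]
    print(flag_drastic_changes(latencies_ms))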

Isolating Performance Problems: Case Study #1 (1 of 3)


Symptoms
Throughput can reach Gigabit wire speed doing 128KB sequential reads from a 20GB LUN on an iSCSI array with a 2GB cache
Throughput degrades for data sizes beyond 128KB

From esxtop data:
CPU utilization is also lower for I/O sizes larger than 128KB
CPU cost per I/O is in the expected range for all I/O sizes

Isolating Performance Problems: Case Study #1 (2 of 3)


From esxtop or benchmark output
I/O response times in the 10 to 20ms range for the problematic IOs
Indicates constant physical disk activities required to serve the reads

From network packet traces


No retransmissions or packet loss observed, indicating no networking issue
Packet timestamps indicate the array takes 10ms to 20ms to respond to a read request, with no delay in the ESX host

From cached run results


No throughput degradation above 128KB! Problem exists only for file sizes exceeding cache capacity
Array appears to have cache-management issues with large sequential reads

Isolating Performance Problems: Case Study #1 (3 of 3)


From native tests to the same array:
Same problem observed
From the administration GUI of the array:
Read-ahead policy set to highly aggressive
Is the policy appropriate for the workload?
Solution: understand the performance characteristics of the array
Experiment with different read-ahead policies
Try turning off read-ahead entirely to get the baseline behavior

Isolating Performance Problems: Case Study #2 (1 of 4)


Symptoms
1KB random write throughput much lower (< 10%) than sequential writes to a 4GB vmdk file located on an NFS server
Even after extensive warm-up period

But very little difference in performance between random and sequential reads
From the NFS server spec: 3GB read/write cache
Most data should be in cache after warming up

Isolating Performance Problems: Case Study #2 (2 of 4)


From esxtop and application/benchmark data:
CPU utilization is lower, but CPU cost per I/O is mostly the same regardless of randomness
Not likely a client side (i.e., ESX host) issue

Random write latency is in the 20ms range; sequential write < 1ms
From NFS server stats: cache hit % is much lower for random writes, even after warm-up

Isolating Performance Problems: Case Study #2 (3 of 4)


From cached runs to a 100MB vmdk:
Random write latency almost matches sequential write
Again, suggests that issue is not in ESX host

From native tests: random and sequential write performance is almost the same
From network packet traces:
Server responds to random writes in 10 to 20ms, to sequential writes in < 1ms
Offsets in NFS WRITE requests are not aligned to a power-of-2 boundary
Packet traces from native runs show correct alignment

Isolating Performance Problems: Case Study #2 (4 of 4)


Question: why are sequential writes not affected?
NFS server file system idiosyncrasies:
Manages cache memory at 4KB granularity
Old blocks are not updated in place; writes go to new blocks
Each < 4KB write incurs a read from the old block
Aggressive read-ahead masks the read latency associated with sequential writes

Solution
Use a disk alignment tool in the guest OS to align the disk partition (a sketch of the alignment check follows)
Alternatively, use an unformatted partition inside the guest OS
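A minimal Python sketch of the alignment check itself, applied to write offsets captured from a packet trace (the offsets below are hypothetical):

    def is_aligned(offset, boundary=4096):
        # True if the offset falls on the given power-of-2 boundary
        # (4KB here, matching the server's cache granularity).
        return offset % boundary == 0

    # A 512-byte-shifted offset forces a read of the old 4KB block
    # before the write can complete.
    for offset in [0, 4096, 4608, 8192, 12800]:
        print(offset, "aligned" if is_aligned(offset) else "MISALIGNED")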

Summary and Takeaways


IP-based storage performance in ESX is constantly being improved; key enhancements in ESX 3.5:
Overall storage subsystem
Networking
Resource scheduling and management
Optimized NUMA, multi-core, and large memory support
IP-based network storage technologies are maturing
Price/performance can be excellent
Deployment and troubleshooting can be challenging
Knowledge is key: server/array, networking, host, etc.
Stay tuned for further updates from VMware

Questions?

NFS & iSCSI


Performance Characterization and Best Practices in ESX 3.5

Priti Mishra & Bing Tsai VMware
