
KVM PERFORMANCE IMPROVEMENTS AND OPTIMIZATIONS

Mark Wagner, Principal Software Engineer, Red Hat. August 14, 2011


1

Overview
Discuss a range of topics about KVM performance

How to improve the out-of-the-box experience
But crammed into 30 minutes
Use libvirt where possible

Note that not all features are available in all releases

Before we dive in...


[Chart: Guest NFS Write Performance - are we sure? Throughput (MBytes/second) for RHEL6-default; the arrow shows how much improvement is possible. Is this really a 10Gbit line?]

By default the rtl8139 device is chosen

Agenda
Low hanging fruit
Memory
Networking
Block I/O basics
NUMA and affinity settings
CPU settings
Wrap up

Recent Performance Improvements

Performance enhancements in every component


Component     Feature
CPU/Kernel    NUMA; ticketed spinlocks; Completely Fair Scheduler; extensive use of Read-Copy-Update (RCU); scales up to 64 vCPUs per guest
Memory        Large memory optimizations: Transparent Huge Pages, ideal for hardware-based virtualization
Networking    vhost-net, a kernel-based virtio backend with better throughput and latency; SR-IOV for ~native performance
Block         AIO, MSI, scatter-gather

Remember this ?
[Chart: Guest NFS Write Performance. Throughput (MBytes/second) for RHEL6-default, showing the impact of not specifying the OS at guest creation.]

Be Specific !

virt-manager will:

Make sure the guest will function
Optimize where it can


The more info you provide, the more tailoring will happen
Specify the OS details

Specify OS + flavor

Specifying Linux will get you:

The virtio driver
If the kernel is recent enough, the vhost_net drivers
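For example, the same OS hint can be given on the command line (a minimal sketch; the guest name, disk path, sizes, and ISO location are placeholders):

    virt-install --name rhel6guest --ram 4096 --vcpus 4 \
        --disk path=/var/lib/libvirt/images/rhel6guest.img,size=20 \
        --os-type linux --os-variant rhel6 \
        --network bridge=br0,model=virtio \
        --cdrom /tmp/rhel6.iso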

I Like This Much Better


[Chart: Guest NFS Write Performance, impact of specifying the OS type at creation. Throughput (MBytes/second) for Default, vhost, and virtio: a 12.5x improvement over the default.]

Memory Tuning - Huge Pages

2MB pages vs. the standard 4KB Linux page

Virtual-to-physical page map is 512 times smaller
TLB can map more physical memory, resulting in fewer misses
Traditional huge pages are always pinned
We now have Transparent Huge Pages
Most databases support huge pages
Benefits not only the host but the guests too

Try them in a guest too !
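A quick way to experiment (the page count is just an example; THP needs no manual setup when enabled):

    # Check huge page and THP usage
    grep -i huge /proc/meminfo
    # Reserve 1024 static 2MB huge pages on the host or in the guest
    echo 1024 > /proc/sys/vm/nr_hugepages
    # Make it persistent via /etc/sysctl.conf:  vm.nr_hugepages = 1024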

10

Transparent Huge Pages


[Chart: SPECjbb workload, 24-CPU / 24-vCPU Westmere EP, 24GB. Transactions per minute, No-THP vs. THP, for guest and bare metal: THP yields roughly 25-30% gains for both.]

11

Network Tuning Tips


Separate networks for different functions

Use arp_filter to prevent ARP flux
  echo 1 > /proc/sys/net/ipv4/conf/all/arp_filter
  Use /etc/sysctl.conf to make it permanent
Packet size - MTU

Need to make sure it is set across all components


Don't need HW to bridge intra-box communications

VM traffic never hits the HW on the same box
Can really kick up the MTU as needed
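For illustration (the bridge and interface names are placeholders), make arp_filter persistent and raise the MTU on a guest-only bridge:

    # /etc/sysctl.conf
    net.ipv4.conf.all.arp_filter = 1
    # Jumbo frames on a bridge that never leaves the box, and inside the guest
    ip link set dev br1 mtu 9000
    ip link set dev eth0 mtu 9000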
12

KVM Network Architecture - VirtIO


Virtual machine sees a paravirtualized network device (VirtIO)
VirtIO drivers included in the Linux kernel
VirtIO drivers available for Windows
Network stack implemented in userspace

13

KVM Network Architecture


Virtio

Context switch host kernel <-> userspace

14

Latency comparison
[Chart: Network latency - virtio vs. host, guest receive (lower is better). Latency (usecs) vs. message size (bytes): roughly a 4x gap in latency between virtio and the host.]

15

KVM Network Architecture - vhost_net

Moves the QEMU network stack from userspace into the kernel
Improved performance
Lower latency
Reduced context switching
One less copy
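A quick sanity check that a guest is really using vhost_net (the guest name is a placeholder):

    lsmod | grep vhost_net                            # module loaded on the host?
    ps -ef | grep vhost-                              # one vhost kernel thread per virtio NIC of a running guest
    virsh dumpxml rhel6guest | grep -A3 "interface"   # the interface model should be virtio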

16

Latency comparison
[Chart: Network latency - vhost_net, guest receive (lower is better). Latency (usecs) vs. message size (bytes) for virtio, vhost_net, and host: vhost_net latency is much closer to bare metal.]

17

Host CPU Consumption - virtio vs. vhost_net


[Chart: Host CPU consumption, virtio vs. vhost, 8 guests, TCP receive. % total host CPU (lower is better), broken down into %usr, %soft, %guest, and %sys, by message size (bytes). Two columns form one data set; the major difference is usr time.]

18

vhost_net Efficiency
[Chart: 8-guest scale-out RX, vhost vs. virtio, netperf TCP_STREAM. Mbit per % host CPU (bigger is better) vs. message size (bytes): vhost moves more Mbit per % CPU than virtio.]

19

KVM Architecture - Device Assignment vs. SR-IOV


Device Assignment

SR-IOV

20

KVM Network Architecture - PCI Device Assignment


Physical NIC is passed directly to guest

Device is not available to anything else on the host


Guest sees real physical device

Needs physical device driver


Requires hardware support

Intel VT-D or AMD IOMMU


Lose hardware independence
1:1 mapping of NIC to guest
BTW, this also works on some I/O controllers
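A rough sketch with virsh (the PCI address and guest are placeholders, and the exact steps and spelling vary by libvirt version):

    virsh nodedev-list | grep pci            # find the NIC, e.g. pci_0000_04_00_0
    virsh nodedev-dettach pci_0000_04_00_0   # detach it from the host driver
    # then add a <hostdev> entry for 0000:04:00.0 to the guest definition and start it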

21

KVM Network Architecture - SR-IOV


Single Root I/O Virtualization
A new class of PCI devices that present multiple virtual devices, which appear as regular PCI devices
Guest sees a real physical device

Needs physical (virtual) device driver


Requires hardware support
Actual device can still be shared
Low overhead, high throughput
No live migration (well, it's difficult)
Lose hardware independence
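One example of creating virtual functions (assuming an Intel 82599/ixgbe adapter; the parameter name and count are driver-specific):

    modprobe ixgbe max_vfs=7              # PF driver creates 7 VFs per port
    lspci | grep -i "virtual function"    # VFs appear as assignable PCI devices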
22

Latency comparison
[Chart: Network latency by guest interface method, guest receive (lower is better). Latency (usecs) vs. message size (bytes) for virtio, vhost_net, SR-IOV, and host: SR-IOV latency is close to bare metal.]

23

KVM w/ SR-IOV - Intel Niantic 10Gb - Postgres DB

[Chart: DVD Store Version 2 results, throughput in orders/min (total OPM). One Red Hat KVM bridged guest reaches 76% of bare metal; one Red Hat KVM SR-IOV guest reaches 93% of bare metal, compared to one database instance on bare metal.]

24

I/O Tuning - Hardware


Know your Storage

SAS or SATA?
Fibre Channel, Ethernet, or SSD?
Bandwidth limits


Multiple HBAs

device-mapper-multipath
Provides multipathing capabilities and LUN persistence


How to test

Low-level I/O tools: dd, iozone, dt, etc.

25

I/O Tuning - Understanding I/O Elevators

Deadline

Two queues per device, one for reads and one for writes
I/Os dispatched based on time spent in queue

CFQ

Per-process queues
Each process queue gets a fixed time slice (based on process priority)

Noop

FIFO
Simple I/O merging
Lowest CPU cost

Can be set at boot time

Grub command line: elevator=deadline/cfq/noop

Or dynamically per device

echo deadline > /sys/class/block/sda/queue/scheduler
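To see which elevator a device is currently using (sda is just an example):

    cat /sys/class/block/sda/queue/scheduler   # the active elevator is shown in brackets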


26

Virtualization Tuning - I/O Elevators - OLTP


[Chart: Performance impact of I/O elevators on an OLTP workload, host running the deadline scheduler. Transactions per minute for noop, CFQ, and deadline at 1, 2, and 4 guests.]

27

Virtualization Tuning - Caching


Cache=none

I/O from the guest is not cached


Cache=writethrough

I/O from the guest is cached and written through on the host
Potential scaling problems with this option with multiple guests (host CPU is used to maintain the cache)
Cache=writeback - Not supported
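The cache mode is set per disk, for instance at install time (paths are placeholders) or in the guest XML shown later:

    virt-install ... --disk path=/var/lib/libvirt/images/guest.img,cache=none
    # or in the libvirt XML:  <driver name='qemu' type='raw' cache='none'/>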

28

Effect of I/O Cache settings on Guest performance


[Chart: OLTP-like workload on Fusion-io storage. Transactions per minute for cache=writethrough vs. cache=none at 1 and 4 guests.]

29

I/O Tuning - Filesystems

Configure read ahead

Databases (parameters to configure read ahead)
Block devices (getra, setra); see the example below
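For block devices this is done with blockdev (device and value are examples):

    blockdev --getra /dev/sdb         # current read ahead, in 512-byte sectors
    blockdev --setra 8192 /dev/sdb    # raise read ahead to 4MB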


Asynchronous I/O

Eliminates synchronous I/O stalls
Critical for I/O-intensive applications

30

AIO - Native vs. Threaded (default)


[Chart: Impact of AIO selection on an OLTP workload, cache=none (threaded is the default). Transactions per minute for AIO threaded vs. AIO native at 10 and 20 (x100) users.]

Configurable per device (only via the XML configuration file)
Libvirt XML: <driver name='qemu' type='raw' cache='none' io='native'/>

31

Remember Network Device Assignment ?


Device Assignment

It works for block too!
Device specific
Similar benefits
And drawbacks...

32

Block Device Passthrough - SAS Workload


[Chart: SAS mixed analytics workload, bare metal vs. KVM. Intel Westmere EP 12-core, 24GB memory, LSI 16 SAS. Time to complete (secs) for SAS system and SAS total: KVM with PCI passthrough takes about 6% longer than bare metal, while KVM VirtIO takes about 25% longer.]

33

NUMA (Non Uniform Memory Access)


Multi-socket, multi-core architecture

NUMA is needed for scaling
Keep memory latencies low
Linux is completely NUMA aware
Additional performance gains by enforcing NUMA placement
Still some out-of-the-box work is needed

How to enforce NUMA placement
numactl, CPU and memory pinning (see the sketch below)
One way to test whether you get a gain is to mistune it
Libvirt now supports some NUMA placement
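A sketch of enforcing placement by hand with numactl (the node numbers and guest command are placeholders; libvirt can express the same thing):

    # pin both the vCPUs and the memory of a hand-started guest to node 1
    numactl --cpunodebind=1 --membind=1 qemu-kvm <guest options>
    # mistune it on purpose (CPUs on node 0, memory on node 1) to see the cost
    numactl --cpunodebind=0 --membind=1 qemu-kvm <guest options>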

34

Memory Tuning - NUMA


# numactl --hardware
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5
node 0 size: 8189 MB
node 0 free: 7220 MB
node 1 cpus: 6 7 8 9 10 11
node 1 size: 8192 MB
...
node 7 cpus: 42 43 44 45 46 47
node 7 size: 8192 MB
node 7 free: 7816 MB
node distances:
node   0   1   2   3   4   5   6   7
  0:  10  16  16  22  16  22  16  22
  1:  16  10  22  16  16  22  22  16
  2:  16  22  10  16  16  16  16  16
  3:  22  16  16  10  16  16  22  22
  4:  16  16  16  16  10  16  16  22
  5:  22  22  16  16  16  10  22  16
  6:  16  22  16  22  16  22  10  16
  7:  22  16  16  22  22  16  16  10

Internode memory distance, from the SLIT table

Note variation in internode distances

35

Virtualization Tuning - Using NUMA


[Chart: Impact of NUMA in multi-guest OLTP ("location, location, location"). Transactions per second for guests 1-4, comparing 4 guests (24 vCPU, 56G) without NUMA placement against 4 guests (24 vCPU, 56G) with NUMA placement.]

36

Specifying Processor Details


Mixed results with CPU type and topology
The Red Hat team is still exploring some topology performance quirks

Both model and topology


Experiment and see what works best in your case

37

CPU Pinning - Affinity

Virt-manager allows CPU selection based on NUMA topology

True NUMA support in libvirt


virsh pinning allows finer-grained control

1:1 pinning
Good gains with pinning
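For example, 1:1 pinning of a 4-vCPU guest onto one socket with virsh (guest name and host core numbers are placeholders):

    virsh vcpupin rhel6guest 0 6
    virsh vcpupin rhel6guest 1 7
    virsh vcpupin rhel6guest 2 8
    virsh vcpupin rhel6guest 3 9
    virsh vcpuinfo rhel6guest      # verify the pinning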

38

Performance monitoring tools


Monitoring tools

top, vmstat, ps, iostat, netstat, sar, perf


Kernel tools

/proc, sysctl, AltSysrq


Networking

ethtool, ifconfig
Profiling

oprofile, strace, ltrace, systemtap, perf
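A couple of starting points (interval and duration are arbitrary):

    sar -n DEV 5                                   # per-interface network rates every 5 seconds
    perf record -a -g -- sleep 30 && perf report   # 30-second system-wide profile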

39

Wrap up
KVM can be tuned effectively

Understand what is going on under the covers
Turn off stuff you don't need
Be specific when you create your guest
Look at using NUMA or affinity
Choose appropriate elevators (deadline vs. CFQ)
Choose your cache wisely

40

For More Information


KVM Wiki

http://www.linux-kvm.org/page/Main_Page
irc, email lists, etc.: http://www.linux-kvm.org/page/Lists%2C_IRC

libvirt Wiki

http://libvirt.org/

New, revamped edition of the Virtualization Guide

http://docs.redhat.com/docs/enUS/Red_Hat_Enterprise_Linux/index.html
Should be available soon!

41
