
Filesystem Performance Characterization

For Red Hat Enterprise Linux + KVM


May 2011
Barry Marson
Principal Performance Engineer

D. John Shakshober (Shak)


Director Red Hat Performance Engineering
bmarson@redhat.com
dshaks@redhat.com

Overview

RHEL Scalability History

RHEL filesystems

EXT2/3/4

XFS

GFS2

RHEL6 vs RHEL5 x86_64, Fibre Channel, LSI SAS

RHEL6 KVM enhancements

aio=native, versus user threads

Passthrough and SR-IOV, LSI and Network IO

CPUs

Memory

File Systems

Scalability

RHEL6 Scaling Improvements

Tickless kernel (2.6.17) - Reduced power consumption

Split LRU (2.6.28) - Efficient reclaim (large systems)

Ticket spinlocks (2.6.25 / 2.6.28) - Scalable / predictable locking

Per-bdi flush (2.6.31) - Scalable flushing of dirty blocks

C groups (2.6.18 / 2.6.29) - Better hardware utilization

Transparent Huge pages (2.6.31) - Automatically use huge pages

RHEL Scalability History

RHEL x86_64 version    CPUs    Memory
2.1                       8      64GB
3                        16     128GB
4                        32     256GB
5                       255       1TB
6                      4096      64TB

RHEL 6 Differences w/NUMA


AMD MC 2-node / Intel EX 4/8-node
Split LRU (2.6.28) / NUMA

CFS NUMA scheduling

Efficient reclaim

Better hardware utilization

Ticket spinlocks (2.6.25 / 2.6.28)

RHEL6 NUMA Application Performance


2 socket AMD MC, 4/8 socket Intel EX x86_64


[Chart: application performance with non-NUMA (memory interleaved) vs NUMA (default) placement, plus the %NUMA gain, for Oracle OLTP (k) on 4-node Intel EX, Sybase OLTP (k) on 2-node AMD, and SAS jobs/ksec on 8-node Intel EX; NUMA gains range from roughly 1.13x to 1.54x.]

Per device/file/LUN page flush daemon

Each file system or block device has its own flush daemon
Allows different flushing thresholds and resources for each daemon/device/file system
Prevents some devices from never getting flushed because a shared daemon consumed all resources
Replaces pdflush, where a single pool of threads flushed all devices
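As a quick sanity check (a minimal sketch; thread and device names vary per system), the per-BDI flush threads show up on RHEL 6 as kernel threads named flush-<major>:<minor>, one per block device:

# ps -ef | grep '[f]lush-'                        # e.g. [flush-8:0], [flush-8:16], one per device
# grep -i -e dirty -e writeback /proc/meminfo     # the dirty/writeback pages these threads write out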

Per file system flush daemon


[Diagram: read()/write() in user space copies data into pagecache pages; the per-file-system flush daemon writes dirty pagecache buffers down to the file system in the kernel.]

Technology Innovation
RHEL 6 File Systems
ext4

Scales to 16TB; default in Red Hat Enterprise Linux 6

XFS

Support for extremely large file sizes and high-end arrays

GFS2 (replaces GFS 1)

Supports 2 to 16 nodes

BTRFS

New file system - included as Technology Preview.
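For reference, creating the ext4, XFS and GFS2 file systems above looks roughly like the following (device paths, cluster name and journal count are illustrative assumptions):

# mkfs.ext4 /dev/vg_data/lv_ext4                                          # ext4, the RHEL 6 default
# mkfs.xfs /dev/vg_data/lv_xfs                                            # XFS for very large files / high-end arrays
# mkfs.gfs2 -p lock_dlm -t mycluster:gfs2vol -j 4 /dev/vg_data/lv_gfs2    # GFS2, one journal per cluster node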

Understanding IOzone Results

GeoMean per category is statistically meaningful
Understand the HW setup
  Disk, RAID, HBA, PCI
Lay out file systems
  LVM or MD devices
  Partitions w/ fdisk
Baseline raw IO w/ dd / dt
EXT3 perf w/ IOzone
  In-cache file sizes which fit; goal -> 90% of memory BW
  Out-of-cache file sizes more than 2x memory size
  O_DIRECT -> 95% of raw
  Global File System (GFS) goal -> 90-95% of local EXT3

Use raw command
  fdisk /dev/sdX
  raw /dev/raw/rawX /dev/sdX1
  dd if=/dev/raw/rawX bs=64k

Mount file system
  mkfs -t ext3 /dev/sdX1
  mount -t ext3 /dev/sdX1 /perf1

IOzone commands
  iozone -a -f /perf1/t1 (in-cache)
  iozone -a -I -f /perf1/t1 (w/ direct IO)
  iozone -s <2x memory> -f /perf1/t1 (out-of-cache)
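Strung together on a RHEL 5/6 host, the baseline-then-IOzone flow looks roughly like this (device names, mount point and the out-of-cache size are placeholders; dd with iflag=direct can stand in for the legacy raw binding):

# dd if=/dev/sdX1 of=/dev/null bs=64k count=163840 iflag=direct   # ~10GB O_DIRECT sequential read baseline of the LUN
# mkfs -t ext3 /dev/sdX1
# mount -t ext3 /dev/sdX1 /perf1
# iozone -a -f /perf1/t1                  # in-cache, automatic mode
# iozone -a -I -f /perf1/t1               # direct I/O
# iozone -s 96g -i 0 -i 1 -f /perf1/t1    # out-of-cache: ~2x RAM on a 48GB host; -i limits to write/read tests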

RHEL6.0 ext4 vs RHEL5.5 ext3


RHEL IOzone, Dell 6800 w/ LSI
[Chart: geometric mean MB/sec (1k-1m record sizes, 1g-4g files) for RHEL5.5 ext3 vs RHEL6.0 ext4, plus the %R6 vs R5.5 ratio, across In-Cache, Direct IO and Out-of-Cache runs; RHEL6.0 ext4 is ahead in all three categories.]

RHEL6.1 file system comparisons


[Charts: IOzone MB/sec, RHEL6.0 (2.6.32-71) vs RHEL6.1 (2.6.32-125), AMD host, deadline elevator, barrier=0, for gfs2, xfs, ext4 and ext3 - one chart for in-cache runs and one for out-of-cache runs.]

RHEL6.1 file system comparisons


[Charts: the same IOzone comparison (RHEL6.0 2.6.32-71 vs RHEL6.1 2.6.32-125, AMD host, deadline elevator, barrier=0) for in-cache mmap and direct I/O runs across gfs2, xfs, ext4 and ext3.]

SAS Mixed Analytics Workload


SAS created multi-user benchmarking scenarios to simulate the
workload of a typical Foundation SAS customer.
The goal of these scenarios is to evaluate the multi-user
performance of SAS on various platforms.
Various-sized mixed analytics workloads were created to simulate many users exercising the CPU, RAM and I/O resources that SAS programs heavily use during typical program execution.
Specific results available from RH / SAS
SAS Grid Manager at Bank of America
SAS Grid Manager at another U.S. national bank
SAS on RHEL Reference Architecture Papers (NEC, RH/KVM)


RHEL 6 Application Performance w/ SAS


SAS 9.2 mixed analytics 8 core workload
2 socket - 8 CPU x 48GB
LVM striping - 4-way multipath I/O - 2-FC Adaptors - enterprise storage

Time in secs (lower is better)
[Chart: SAS TOTAL system time and SAS TOTAL user time for ext3, ext4 and xfs.]

RHEL6 vs RHEL5.5 SAS


SAS multi-stream workload, bare metal (RHEL5.5 vs RHEL6.0)
Intel Nehalem EP 8-core / 48GB / 2 FC
[Chart: TOTAL SAS time and SAS system time (lower is better) for ext3, ext4, xfs and gfs2, each on RHEL 5 and RHEL 6*.]

*Transparent Huge Pages turned off - tuned

File System Tuning

Separate swap and busy partitions etc.

EXT3/4 separate talks (Ric Wheeler)

Tune2fs or mount options

data=ordered - only metadata journaled

data=journal - both metadata and data journaled

data=writeback - use with care!

Set block size < 4k using mkfs -b XX

Adjust read-ahead (default 256 sectors = 128 KB)

Current value - blockdev --getra /dev/sdX

Change value - blockdev --setra N /dev/sdX

Can be used for LVM volumes

GFS2 enhanced performance in RHEL 5.6 and 6.X
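A sketch of the ext3/4 journaling and read-ahead knobs above, with placeholder device names (the journaling mode and read-ahead value depend on the workload and storage):

# mkfs -t ext4 -b 2048 /dev/sdX1                      # block size smaller than the 4k default
# tune2fs -o journal_data_writeback /dev/sdX1         # change the default data journaling mode
# mount -t ext4 -o data=writeback /dev/sdX1 /perf1    # use with care; or data=ordered / data=journal
# blockdev --getra /dev/sdX                           # current read-ahead in 512-byte sectors (256 = 128 KB)
# blockdev --setra 4096 /dev/sdX                      # raise read-ahead to 2 MB; also usable on LVM devices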

RHEL6 tuned-adm profiles


# tuned-adm list
Some available profiles:

default - CFQ elevator (cgroup), I/O barriers on, ondemand power savings, upstream VM, 4 msec quantum
latency-performance - elevator=deadline, power=performance
throughput-performance - latency-performance + 10 msec quantum, readahead 4x, VM dirty_ratio=40
enterprise-storage - throughput-performance + I/O barriers off

Example:
# tuned-adm profile enterprise-storage

Recommend enterprise-storage w/ KVM
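A minimal sketch of applying and spot-checking the recommended profile (sdX is a placeholder device):

# tuned-adm profile enterprise-storage
# cat /sys/block/sdX/queue/scheduler     # deadline should now be the active elevator, shown in [brackets]
# cat /proc/sys/vm/dirty_ratio           # raised to 40 by the throughput/enterprise profiles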

RHEL w/ High End HP AIM7 results w/ tuned


HP DL980 64-core/256GB/30 FC/480 luns

AIM7 results w/ tuned


Large App and Database Performance

Memory Performance

Huge Pages
  2MB huge pages
  Set value in /etc/sysctl.conf (vm.nr_hugepages)

NUMA
  Localized memory access for certain workloads improves performance

Swap
  Set value of vm.swappiness (default 60)

CPU Performance

Multiple cores
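For the sysctl-based settings above, for example (values are illustrative, not recommendations):

# echo "vm.nr_hugepages = 2048" >> /etc/sysctl.conf    # reserve 2048 x 2MB = 4GB of static huge pages
# echo "vm.swappiness = 10" >> /etc/sysctl.conf        # swap mapped memory less aggressively
# sysctl -p                                            # load the new settings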


Asynchronous I/O to File Systems

Allows application to continue processing while I/O is in progress

Eliminates the synchronous I/O stall (waiting for completion)

Critical for I/O-intensive server applications

In Red Hat Enterprise Linux since 2002
  Support for RAW devices only

With Red Hat Enterprise Linux 4, significant improvement:
  Support for Ext3, NFS, GFS file system access
  Supports Direct I/O (e.g. database applications)

Makes benchmark results more appropriate for real-world comparisons

[Diagram: with synchronous I/O, the application issues an I/O request to the device driver and stalls until the I/O request completes; with asynchronous I/O, the application continues processing after issuing the request and handles the completion later, with no stall.]
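In the KVM context of this deck, the kernel AIO path is selected per virtual disk; a minimal qemu-kvm sketch (the image path is a placeholder), with cache=none so guest I/O bypasses the host pagecache:

# qemu-kvm ... -drive file=/var/lib/libvirt/images/guest.img,if=virtio,cache=none,aio=native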

Disk IO tuning - RHEL4/5/6

RHEL4/5 - 4 tunable I/O schedulers

CFQ - elevator=cfq. Completely Fair Queuing; default, balanced, fair for multiple LUNs, adaptors, SMP servers
NOOP - elevator=noop. No operation in kernel; simple, low CPU overhead, leaves optimization to ramdisk, RAID controller, etc.
Deadline - elevator=deadline. Optimized for run-time-like behavior, low latency per I/O; balances issues with large I/O to LUNs/controllers (NOTE: current best for FC5)
Anticipatory - elevator=as. Inserts delays to help the stack aggregate I/O; best on systems with limited physical I/O (SATA)

RHEL4 - set at boot time on the kernel command line
RHEL5 - change on the fly:
  echo deadline > /sys/block/<sdx>/queue/scheduler
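To inspect and switch the elevator at run time (sdX is a placeholder; adding elevator=deadline to the kernel command line makes it the boot-time default for all devices):

# cat /sys/block/sdX/queue/scheduler          # the active scheduler is shown in [brackets]
# echo deadline > /sys/block/sdX/queue/scheduler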

Disk IO tuning - RHEL4/5/6


Comparison of CFQ vs Deadline (time, lower = better)
Oracle DSS workload (with different thread counts)
[Chart: elapsed time and % difference for CFQ vs Deadline at 16 and 32 threads.]

RHEL5.x Oracle 10G OLTP (tpm)


1 FusionIO (SSD) vs. 2x 400 MB/sec FC w/ MPIO
RHEL5/6 Oracle OLTP - Intel EX 64-cpu, 128 GB, 2 FC, 1 dual FusionIO
[Chart: OLTP throughput (tpm) and % difference for 2x FibreChannel 4Gb vs FusionIO-duo at 10U, 20U, 40U, 60U, 80U and 100U user loads.]

Red Hat Enterprise Virtualization


SPECvirt_sc2010: RHEL 5/6 and KVM Post Industry-Leading Results on IBM x3650 M3 w/ Xeon X5680

1 SPECvirt Tile/core
Key Enablers: SR-IOV, Huge Pages, NUMA node binding
[Diagram: SPECvirt_sc2010 setup - client hardware drives the system under test (SUT), i.e. the virtualization layer and hardware; blue = disk I/O, green = network I/O.]

http://www.spec.org/virt_sc2010/results/

RHEL Guest w/ all Virtualization

"SPECvirt_sc2010 Benchmark Results " May 2011

ALL SPECvirt_sc2010 results published to date use RHEL as the guest / VM operating system!

RHEL 6 shows 29% better SPECvirt performance than RHEL 5.5 (KVM) on the same hardware!


SPECvirt_sc2010 Published Results May 2011


SPECvirt_sc2010 2-Socket Results (x86_64 servers), 5/2011
[Chart: SPECvirt_sc2010 score and tiles/core for RHEL 5.5 (KVM) / IBM x3650 M3 / 12 cores; VMware ESX 4.1 / HP DL380 G7 / 12 cores; RHEL 6.0 (KVM) / IBM HS22V / 12 cores; RHEL 5.5 (KVM) / IBM x3690 X5 / 16 cores; RHEL 6 (KVM) / IBM x3690 X5 / 16 cores; VMware ESX 4.1 / HP BL620c G7 / 20 cores; and RHEL 6.1 (KVM) / HP BL620c G7 / 20 cores. Published scores shown: 1169, 1221, 1367, 1369, 1763, 1811 and 1820.]

SPECvirt_sc2010 Published Results May 2011


SPECvirt_sc2010 2-4 Socket Results (x86_64 servers), 4/2011
[Chart: SPECvirt_sc2010 score and tiles/core for VMware ESX 4.1 / Bull SAS / 32 cores; VMware ESX 4.1 / IBM x3850 X5 / 32 cores; VMware ESX 4.1 / HP DL580 G7 / 40 cores; RHEL 6 (KVM) / IBM x3850 X5 / 64 cores; and RHEL 6 (KVM) / IBM x3850 X5 / 80 cores. Scores shown: 2721, 2742, 3723, 5466 and 7067.]

Virtualization
Memory Enhancements
Transparent hugepages

Efficiently manage large memory allocations as one unit

Extended Page Table (EPT) age bits


Allow host to make smarter swap choice when under
pressure.

Kernel Same-page Merging (KSM)


Consolidate duplicate pages.
Particularly efficient for Windows guests.
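To check whether these features are active on a RHEL 6 host (note: RHEL 6 exposes THP under redhat_transparent_hugepage, while upstream kernels use transparent_hugepage):

# cat /sys/kernel/mm/redhat_transparent_hugepage/enabled   # current THP policy
# grep AnonHugePages /proc/meminfo                         # anonymous memory currently backed by 2MB pages
# cat /sys/kernel/mm/ksm/run                               # 1 when the KSM daemon is scanning
# cat /sys/kernel/mm/ksm/pages_sharing                     # pages currently deduplicated by KSM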

Virtualization:
RHEL6 2.6.32 SAS Intel EP (8core/48GB)
SAS multi-stream workload in KVM guest (RHEL5.5 vs RHEL6.0)
Intel Nehalem 8-core, 48GB, 2 FC
Guest (8x44GB virtIO, nocache)
[Chart: TOTAL SAS time and SAS system time (shorter is better) for host-guest file system combinations 5.5-5.5, 6.0-5.5 and 6.0-6.0 on ext3, ext4, xfs and gfs2.]

Virtio (bridge) vs PCI assignment (vt-d/SR-IOV)


Intel 82599 10GbE

[Diagram: in the virtio (bridge) path, the guest's virtio-net driver hands packets to the virtio-net host driver in QEMU, which forwards them through a tap device and bridge in the host kernel/hypervisor to the physical NIC; in the PCI-assignment path, each guest VM is given its own SR-IOV virtual function (VF NIC #1, VF NIC #2) of the physical NIC, bypassing QEMU and the bridge.]
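Schematically, the two attachment modes in this diagram correspond to qemu-kvm options like the following (the MAC address and PCI function are placeholders; libvirt normally generates these):

qemu-kvm ... -netdev tap,id=net0 -device virtio-net-pci,netdev=net0,mac=52:54:00:12:34:56   # virtio-net via host bridge
qemu-kvm ... -device pci-assign,host=0000:07:10.0                                           # assign an SR-IOV VF directly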

Virtualization:
RHEL6 2.6.32 SAS Intel EP (12cpu/24GB)
RHEL6.1 SAS Mixed Analytics Workload - Bare-Metal/KVM
Intel Westmere EP 12-core, 24 GB Mem, LSI 16 SAS drives
[Chart: time in seconds (lower is better) - SAS system time and SAS total time for KVM VirtIO, KVM PCI pass-through and bare metal; the %virt ratios show KVM VirtIO at roughly 0.79 and PCI pass-through at roughly 0.94 of bare-metal performance.]

RHEL6 KVM w/ SRIOV Intel Niantic 10Gb

Postgres DB: 93% of bare metal, 20% faster than bridged
(Throughput in orders/min (OPM); bigger is better)

[Chart: DVD Store Version 2 results, total OPM - Red Hat KVM bridged guest: 69,984; Red Hat KVM SR-IOV guest: 86,469; one database instance on bare metal: 92,680.]

Summary

RHEL filesystems

EXT4 4-12% improvement over EXT3

XFS 15-25% faster than EXT4 for large sequential IO

GFS2 - +50% improvement, approx equal to EXT3

RHEL6 KVM enhancements

aio=native as storage option


PCI-passthrough and SR-IOV approach 95% of bare metal on the most difficult workloads (SAS / TPC)
Transparent hugepages w/ guest automatic in 6.X

Pagecache Tuning (RHEL)
Filesystem/pagecache allocation
[Diagram: pagecache page lifecycle - pages accessed while pagecache is under its limit move to the ACTIVE list; aging moves pages to the INACTIVE list (new -> old); when pagecache is over its limit, reclaim moves pages from INACTIVE to the FREE list.]

swappiness

Not needed as much in RHEL6


Controls how aggressively the system reclaims mapped memory:

Anonymous memory - swapping

Mapped file pages - written if dirty, then freed

System V shared memory - swapping

Decreasing: more aggressive reclaiming of unmapped pagecache memory

Increasing: more aggressive swapping of mapped memory

/proc/sys/vm/swappiness
Database server with /proc/sys/vm/swappiness set to 60 (default)

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free  buff    cache   si   so    bi    bo   in    cs us sy id wa
 5  1 643644  26788  3544 32341788  880  120  4044  7496 1302 20846 25 34 25 16

Database server with /proc/sys/vm/swappiness set to 10

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free  buff    cache   si   so    bi    bo   in    cs us sy id wa
 8  3      0  24228  6724 32280696    0    0 23888 63776 1286 20020 24 38 13 26

zone_reclaim_mode

Controls NUMA specific memory allocation policy

When set and node memory is exhausted:

Reclaim memory from the local node rather than allocating from the next node

Slower allocation, higher NUMA hit ratio

When clear and node memory is exhausted:

Allocate from all nodes before reclaiming memory

Faster allocation, higher NUMA miss ratio

Default is set at boot time based on NUMA factor
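To inspect or override the boot-time default (changing it is a per-workload tuning decision, not a general recommendation):

# cat /proc/sys/vm/zone_reclaim_mode
# echo 1 > /proc/sys/vm/zone_reclaim_mode   # prefer reclaiming from the local node
# numastat                                  # numa_hit / numa_miss counters show the effect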

/proc/sys/vm/min_free_kbytes
Directly controls the page reclaim watermarks in KB
Defaults are higher when THP is enabled
# echo 1024 > /proc/sys/vm/min_free_kbytes
Node 0 DMA free:4420kB min:8kB low:8kB high:12kB
Node 0 DMA32 free:14456kB min:1012kB low:1264kB high:1516kB

# echo 2048 > /proc/sys/vm/min_free_kbytes
Node 0 DMA free:4420kB min:20kB low:24kB high:28kB
Node 0 DMA32 free:14456kB min:2024kB low:2528kB high:3036kB

Memory reclaim Watermarks - min_free_kbytes


Free List - All of RAM
  Do nothing

Pages High - kswapd sleeps above High
  kswapd reclaims memory

Pages Low - kswapd wakes up at Low
  kswapd reclaims memory

Pages Min - all memory allocators reclaim at Min
  user processes/kswapd reclaim memory

/proc/sys/vm/dirty_background_ratio
/proc/sys/vm/dirty_background_bytes
Controls when dirty pagecache memory starts getting written asynchronously

Default is 10%

Lower - flushing starts earlier; less dirty pagecache and smaller I/O streams

Higher - flushing starts later; more dirty pagecache and larger I/O streams

dirty_background_bytes overrides when you want < 1%

/proc/sys/vm/dirty_ratio
/proc/sys/vm/dirty_bytes

Absolute upper percentage limit of dirty pagecache memory

Default is 20%

Lower means a cleaner pagecache and smaller I/O streams

Higher means a dirtier pagecache and larger I/O streams

dirty_bytes overrides when you want < 1%
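For example, to start background flushing earlier and cap dirty pagecache lower (illustrative values):

# echo 5 > /proc/sys/vm/dirty_background_ratio   # wake background flushing at 5% of RAM dirty
# echo 15 > /proc/sys/vm/dirty_ratio             # force synchronous writes at 15% dirty
# grep -i dirty /proc/meminfo                    # watch the dirty page count under load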

dirty_ratio and dirty_background_ratio


pagecache
[Diagram: as the fraction of pagecache RAM that is dirty rises from 0% to 100% - nothing happens below dirty_background_ratio (10% of RAM dirty); at dirty_background_ratio flushd is woken and writes dirty buffers in the background; at dirty_ratio (20% of RAM dirty) processes doing write() start synchronous writes as well.]

(Hint) Flushing the pagecache


# sync
# echo 1 > /proc/sys/vm/drop_caches
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b  swpd    free   buff   cache   si   so   bi   bo    in   cs us sy  id wa
 0  0   224   57184 107808 3350196    0    0    0   56  1136  212  0  0  83 17
 0  0   224   57184 107808 3350196    0    0    0    0  1039  198  0  0 100  0
 0  0   224   57184 107808 3350196    0    0    0    0  1021  188  0  0 100  0
 0  0   224   57184 107808 3350196    0    0    0    0  1035  204  0  0 100  0
 0  0   224   57248 107808 3350196    0    0    0    0  1008  164  0  0 100  0
 3  0   224 2128160    176 1438636    0    0    0    0  1030  197  0 15  85  0
 0  0   224 3610656    204   34408    0    0   28   36  1027  177  0 32  67  2
 0  0   224 3610656    204   34408    0    0    0    0  1026  180  0  0 100  0
 0  0   224 3610720    212   34400    0    0    8    0  1010  183  0  0  99  1

(Hint) Flushing the slabcache


# echo 2 > /proc/sys/vm/drop_caches

Before:
[tmp]# cat /proc/meminfo
MemTotal:       3907444 kB
MemFree:        3104576 kB
Slab:            415420 kB
Hugepagesize:      2048 kB

After:
[tmp]# cat /proc/meminfo
MemTotal:       3907444 kB
MemFree:        3301788 kB
Slab:            218208 kB
Hugepagesize:      2048 kB
