
Filesystem Performance Characterization

For Red Hat Enterprise Linux + KVM


May 2011
Barry Marson
Principal Performance Engineer

D. John Shakshober (Shak)


Director Red Hat Performance Engineering
bmarson@redhat.com
dshaks@redhat.com

Overview

RHEL Scalability History

RHEL filesystems

EXT2/3/4

XFS

GFS2

RHEL6 vs RHEL5 x86_64, Fibre Channel, LSI SAS

RHEL6 KVM enhancements

aio=native, versus user threads

Passthrough and SR-IOV, LSI and Network IO

CPUs

Memory

File Systems

Scalability

RHEL6 Scaling Improvements

Tickless kernel (2.6.17) - Reduced power consumption

Split LRU (2.6.28) - Efficient reclaim (large systems)

Ticket spinlocks (2.6.25 / 2.6.28) - Scalable / predictable locking

Per-bdi flush (2.6.31) - Scalable flushing of dirty blocks

C groups (2.6.18 / 2.6.29) - Better hardware utilization

Transparent Huge pages (2.6.31) - Automatically use huge pages

RHEL Scalability History

RHEL x86_64 version    CPUs    Memory
2.1                       8      64GB
3                        16     128GB
4                        32     256GB
5                       255       1TB
6                      4096      64TB

RHEL 6 Differences w/NUMA


AMD MC 2-node / Intel EX 4/8-node
Split LRU (2.6.28) / NUMA

CFS NUMA scheduling

Efficient reclaim

Better hardware utilization

Ticket spinlocks (2.6.25 / 2.6.28)

RHEL6 NUMA Application Performance


2 socket AMD MC, 4/8 socket Intel EX x86_64


[Chart: application performance with non-NUMA (memory interleaved) vs NUMA (default) placement, plus the %NUMA gain, for Oracle OLTP (k) on 4-node Intel EX, Sybase OLTP (k) on 2-node AMD, and SAS jobs/ksec on 8-node Intel EX; NUMA gains range from roughly 1.13x to 1.54x.]

Per device/file/LUN page flush daemon

Each file system or block device has its own flush daemon
Allows different flushing thresholds and resources for each daemon/device/file system
Prevents some devices from never getting flushed because a shared daemon consumed all resources
Replaces pdflush, where a single pool of threads flushed all devices
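As a quick sanity check (a minimal sketch; thread and device names vary per system), the per-BDI flush threads show up on RHEL 6 as kernel threads named flush-<major>:<minor>, one per block device:

# ps -ef | grep '[f]lush-'                        # e.g. [flush-8:0], [flush-8:16], one per device
# grep -i -e dirty -e writeback /proc/meminfo     # the dirty/writeback pages these threads write out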

Per file system flush daemon


[Diagram: read()/write() in user space copies data into pagecache pages; the per-file-system flush daemon writes dirty pagecache buffers down to the file system in the kernel.]

Technology Innovation
RHEL 6 File Systems
ext4

Scales to 16TB; default in Red Hat Enterprise Linux 6

XFS

Support for extremely large file sizes and high-end arrays

GFS2 (replaces GFS 1)

Supports 2 to 16 nodes

BTRFS

New file system - included as Technology Preview.
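For reference, creating the ext4, XFS and GFS2 file systems above looks roughly like the following (device paths, cluster name and journal count are illustrative assumptions):

# mkfs.ext4 /dev/vg_data/lv_ext4                                          # ext4, the RHEL 6 default
# mkfs.xfs /dev/vg_data/lv_xfs                                            # XFS for very large files / high-end arrays
# mkfs.gfs2 -p lock_dlm -t mycluster:gfs2vol -j 4 /dev/vg_data/lv_gfs2    # GFS2, one journal per cluster node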

Understanding IOzone Results

GeoMean per category is statistically meaningful
Understand the HW setup
  Disk, RAID, HBA, PCI
Lay out file systems
  LVM or MD devices
  Partitions w/ fdisk
Baseline raw IO w/ dd / dt
EXT3 perf w/ IOzone
  In-cache file sizes which fit; goal -> 90% of memory BW
  Out-of-cache file sizes more than 2x memory size
  O_DIRECT -> 95% of raw
  Global File System (GFS) goal -> 90-95% of local EXT3

Use raw command
  fdisk /dev/sdX
  raw /dev/raw/rawX /dev/sdX1
  dd if=/dev/raw/rawX bs=64k

Mount file system
  mkfs -t ext3 /dev/sdX1
  mount -t ext3 /dev/sdX1 /perf1

IOzone commands
  iozone -a -f /perf1/t1 (in-cache)
  iozone -a -I -f /perf1/t1 (w/ direct IO)
  iozone -s <2x memory> -f /perf1/t1 (out-of-cache)
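Strung together on a RHEL 5/6 host, the baseline-then-IOzone flow looks roughly like this (device names, mount point and the out-of-cache size are placeholders; dd with iflag=direct can stand in for the legacy raw binding):

# dd if=/dev/sdX1 of=/dev/null bs=64k count=163840 iflag=direct   # ~10GB O_DIRECT sequential read baseline of the LUN
# mkfs -t ext3 /dev/sdX1
# mount -t ext3 /dev/sdX1 /perf1
# iozone -a -f /perf1/t1                  # in-cache, automatic mode
# iozone -a -I -f /perf1/t1               # direct I/O
# iozone -s 96g -i 0 -i 1 -f /perf1/t1    # out-of-cache: ~2x RAM on a 48GB host; -i limits to write/read tests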

RHEL6.0 ext4 vs RHEL5.5 ext3


RHEL IOzone, Dell 6800 w/ LSI
[Chart: geometric mean MB/sec (1k-1m record sizes, 1g-4g files) for RHEL5.5 ext3 vs RHEL6.0 ext4, plus the %R6 vs R5.5 ratio, across In-Cache, Direct IO and Out-of-Cache runs; RHEL6.0 ext4 is ahead in all three categories.]

RHEL6.1 file system comparisons


[Charts: IOzone MB/sec, RHEL6.0 (2.6.32-71) vs RHEL6.1 (2.6.32-125), AMD host, deadline elevator, barrier=0, for gfs2, xfs, ext4 and ext3 - one chart for in-cache runs and one for out-of-cache runs.]

RHEL6.1 file system comparisons


[Charts: the same IOzone comparison (RHEL6.0 2.6.32-71 vs RHEL6.1 2.6.32-125, AMD host, deadline elevator, barrier=0) for in-cache mmap and direct I/O runs across gfs2, xfs, ext4 and ext3.]

SAS Mixed Analytics Workload


SAS created multi-user benchmarking scenarios to simulate the
workload of a typical Foundation SAS customer.
The goal of these scenarios is to evaluate the multi-user
performance of SAS on various platforms.
Various-sized mixed analytics workloads were created to simulate many users exercising the CPU, RAM and I/O resources that SAS programs heavily use during typical program execution.
Specific results available from RH / SAS
SAS Grid Manager at Bank of America
SAS Grid Manager at another U.S. national bank
SAS on RHEL Reference Architecture Papers (NEC, RH/KVM)


RHEL 6 Application Performance w/ SAS


SAS 9.2 mixed analytics 8 core workload
2 socket - 8 CPU x 48GB
LVM striping - 4-way multipath I/O - 2-FC Adaptors - enterprise storage

Time in secs (lower is better)
[Chart: SAS TOTAL system time and SAS TOTAL user time for ext3, ext4 and xfs.]

RHEL6 vs RHEL5.5 SAS


SAS multi-stream workload, bare metal (RHEL5.5 vs RHEL6.0)
Intel Nehalem EP 8-core / 48GB / 2 FC
[Chart: TOTAL SAS time and SAS system time (lower is better) for ext3, ext4, xfs and gfs2, each on RHEL 5 and RHEL 6*.]

*Transparent Huge Pages turned off - tuned

File System Tuning

Separate swap and busy partitions etc.

EXT3/4 separate talks (Ric Wheeler)

Tune2fs or mount options

data=ordered - only metadata journaled

data=journal - both metadata and data journaled

data=writeback - use with care!

Set block size < 4k using mkfs -b XX

Adjust read-ahead (default 256 sectors = 128 KB)

Current value - blockdev --getra /dev/sdX

Change value - blockdev --setra N /dev/sdX

Can be used for LVM volumes

GFS2 enhanced performance in RHEL 5.6 and 6.X
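A sketch of the ext3/4 journaling and read-ahead knobs above, with placeholder device names (the journaling mode and read-ahead value depend on the workload and storage):

# mkfs -t ext4 -b 2048 /dev/sdX1                      # block size smaller than the 4k default
# tune2fs -o journal_data_writeback /dev/sdX1         # change the default data journaling mode
# mount -t ext4 -o data=writeback /dev/sdX1 /perf1    # use with care; or data=ordered / data=journal
# blockdev --getra /dev/sdX                           # current read-ahead in 512-byte sectors (256 = 128 KB)
# blockdev --setra 4096 /dev/sdX                      # raise read-ahead to 2 MB; also usable on LVM devices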

RHEL6 tuned-adm profiles


# tuned-adm list
Some available profiles:

default - CFQ elevator (cgroup), I/O barriers on, ondemand power savings, upstream VM, 4 msec quantum
latency-performance - elevator=deadline, power=performance
throughput-performance - latency-performance + 10 msec quantum, readahead 4x, VM dirty_ratio=40
enterprise-storage - throughput-performance + I/O barriers off

Example:
# tuned-adm profile enterprise-storage

Recommend enterprise-storage w/ KVM
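A minimal sketch of applying and spot-checking the recommended profile (sdX is a placeholder device):

# tuned-adm profile enterprise-storage
# cat /sys/block/sdX/queue/scheduler     # deadline should now be the active elevator, shown in [brackets]
# cat /proc/sys/vm/dirty_ratio           # raised to 40 by the throughput/enterprise profiles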

RHEL w/ High End HP AIM7 results w/ tuned


HP DL980 64-core/256GB/30 FC/480 luns

AIM7 results w/ tuned


Large App and Database Performance

Memory Performance

Huge Pages
  2MB huge pages
  Set value in /etc/sysctl.conf (vm.nr_hugepages)

NUMA
  Localized memory access for certain workloads improves performance

Swap
  Set value of vm.swappiness (default 60)

CPU Performance

Multiple cores
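For the sysctl-based settings above, for example (values are illustrative, not recommendations):

# echo "vm.nr_hugepages = 2048" >> /etc/sysctl.conf    # reserve 2048 x 2MB = 4GB of static huge pages
# echo "vm.swappiness = 10" >> /etc/sysctl.conf        # swap mapped memory less aggressively
# sysctl -p                                            # load the new settings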


Asynchronous I/O to File Systems

Allows application to continue processing while I/O is in progress

Eliminates the synchronous I/O stall (waiting for completion)

Critical for I/O-intensive server applications

In Red Hat Enterprise Linux since 2002
  Support for RAW devices only

With Red Hat Enterprise Linux 4, significant improvement:
  Support for Ext3, NFS, GFS file system access
  Supports Direct I/O (e.g. database applications)

Makes benchmark results more appropriate for real-world comparisons

[Diagram: with synchronous I/O, the application issues an I/O request to the device driver and stalls until the I/O request completes; with asynchronous I/O, the application continues processing after issuing the request and handles the completion later, with no stall.]
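In the KVM context of this deck, the kernel AIO path is selected per virtual disk; a minimal qemu-kvm sketch (the image path is a placeholder), with cache=none so guest I/O bypasses the host pagecache:

# qemu-kvm ... -drive file=/var/lib/libvirt/images/guest.img,if=virtio,cache=none,aio=native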

Disk IO tuning - RHEL4/5/6

RHEL4/5 - 4 tunable I/O schedulers

CFQ - elevator=cfq. Completely Fair Queuing; default, balanced, fair for multiple LUNs, adaptors, SMP servers
NOOP - elevator=noop. No operation in kernel; simple, low CPU overhead, leaves optimization to ramdisk, RAID controller, etc.
Deadline - elevator=deadline. Optimized for run-time-like behavior, low latency per I/O; balances issues with large I/O to LUNs/controllers (NOTE: current best for FC5)
Anticipatory - elevator=as. Inserts delays to help the stack aggregate I/O; best on systems with limited physical I/O (SATA)

RHEL4 - set at boot time on the kernel command line
RHEL5 - change on the fly:
  echo deadline > /sys/block/<sdx>/queue/scheduler
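To inspect and switch the elevator at run time (sdX is a placeholder; adding elevator=deadline to the kernel command line makes it the boot-time default for all devices):

# cat /sys/block/sdX/queue/scheduler          # the active scheduler is shown in [brackets]
# echo deadline > /sys/block/sdX/queue/scheduler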

Disk IO tuning - RHEL4/5/6


Comparison of CFQ vs Deadline (time, lower = better)
Oracle DSS workload (with different thread counts)
[Chart: elapsed time and % difference for CFQ vs Deadline at 16 and 32 threads.]

RHEL5.x Oracle 10G OLTP (tpm)


1 FusionIO (SSD) vs. 2x 400 MB/sec FC w/ MPIO
RHEL5/6 Oracle OLTP - Intel EX 64-cpu, 128 GB, 2 FC, 1 dual FusionIO
[Chart: OLTP throughput (tpm) and % difference for 2x FibreChannel 4Gb vs FusionIO-duo at 10U, 20U, 40U, 60U, 80U and 100U user loads.]

Red Hat Enterprise Virtualization


SPECvirt_sc2010: RHEL 5/6 and KVM Post Industry-Leading Results on IBM x3650 M3 w/ Xeon X5680

1 SPECvirt Tile/core
Key Enablers: SR-IOV, Huge Pages, NUMA node binding
[Diagram: SPECvirt_sc2010 setup - client hardware drives the system under test (SUT), i.e. the virtualization layer and hardware; blue = disk I/O, green = network I/O.]

http://www.spec.org/virt_sc2010/results/

RHEL Guest w/ all Virtualization

"SPECvirt_sc2010 Benchmark Results " May 2011

ALL SPECvirt_sc2010 results published to date use RHEL as the guest / VM operating system!

RHEL 6 shows 29% better SPECvirt performance than RHEL 5.5 (KVM) on the same hardware!


SPECvirt_sc2010 Published Results May 2011


SPECvirt_sc2010 2-Socket Results (x86_64 servers), 5/2011
[Chart: SPECvirt_sc2010 score and tiles/core for RHEL 5.5 (KVM) / IBM x3650 M3 / 12 cores; VMware ESX 4.1 / HP DL380 G7 / 12 cores; RHEL 6.0 (KVM) / IBM HS22V / 12 cores; RHEL 5.5 (KVM) / IBM x3690 X5 / 16 cores; RHEL 6 (KVM) / IBM x3690 X5 / 16 cores; VMware ESX 4.1 / HP BL620c G7 / 20 cores; and RHEL 6.1 (KVM) / HP BL620c G7 / 20 cores. Published scores shown: 1169, 1221, 1367, 1369, 1763, 1811 and 1820.]

SPECvirt_sc2010 Published Results May 2011


SPECvirt_sc2010 2-4 Socket Results (x86_64 servers), 4/2011
[Chart: SPECvirt_sc2010 score and tiles/core for VMware ESX 4.1 / Bull SAS / 32 cores; VMware ESX 4.1 / IBM x3850 X5 / 32 cores; VMware ESX 4.1 / HP DL580 G7 / 40 cores; RHEL 6 (KVM) / IBM x3850 X5 / 64 cores; and RHEL 6 (KVM) / IBM x3850 X5 / 80 cores. Scores shown: 2721, 2742, 3723, 5466 and 7067.]

Virtualization
Memory Enhancements
Transparent hugepages

Efficiently manage large memory allocations as one unit

Extended Page Table (EPT) age bits


Allow host to make smarter swap choice when under
pressure.

Kernel Same-page Merging (KSM)


Consolidate duplicate pages.
Particularly efficient for Windows guests.
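To check whether these features are active on a RHEL 6 host (note: RHEL 6 exposes THP under redhat_transparent_hugepage, while upstream kernels use transparent_hugepage):

# cat /sys/kernel/mm/redhat_transparent_hugepage/enabled   # current THP policy
# grep AnonHugePages /proc/meminfo                         # anonymous memory currently backed by 2MB pages
# cat /sys/kernel/mm/ksm/run                               # 1 when the KSM daemon is scanning
# cat /sys/kernel/mm/ksm/pages_sharing                     # pages currently deduplicated by KSM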

Virtualization:
RHEL6 2.6.32 SAS Intel EP (8core/48GB)
SAS multi-stream workload in KVM guest (RHEL5.5 vs RHEL6.0)
Intel Nehalem 8-core, 48GB, 2 FC
Guest (8x44GB virtIO, nocache)
[Chart: TOTAL SAS time and SAS system time (shorter is better) for host-guest file system combinations 5.5-5.5, 6.0-5.5 and 6.0-6.0 on ext3, ext4, xfs and gfs2.]

Virtio (bridge) vs PCI assignment (vt-d/SR-IOV)


Intel 82599 10GbE

[Diagram: in the virtio (bridge) path, the guest's virtio-net driver hands packets to the virtio-net host driver in QEMU, which forwards them through a tap device and bridge in the host kernel/hypervisor to the physical NIC; in the PCI-assignment path, each guest VM is given its own SR-IOV virtual function (VF NIC #1, VF NIC #2) of the physical NIC, bypassing QEMU and the bridge.]
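Schematically, the two attachment modes in this diagram correspond to qemu-kvm options like the following (the MAC address and PCI function are placeholders; libvirt normally generates these):

qemu-kvm ... -netdev tap,id=net0 -device virtio-net-pci,netdev=net0,mac=52:54:00:12:34:56   # virtio-net via host bridge
qemu-kvm ... -device pci-assign,host=0000:07:10.0                                           # assign an SR-IOV VF directly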

Virtualization:
RHEL6 2.6.32 SAS Intel EP (12cpu/24GB)
RHEL6.1 SAS Mixed Analytics Workload - Bare-Metal/KVM
Intel Westmere EP 12-core, 24 GB Mem, LSI 16 SAS drives
[Chart: time in seconds (lower is better) - SAS system time and SAS total time for KVM VirtIO, KVM PCI pass-through and bare metal; the %virt ratios show KVM VirtIO at roughly 0.79 and PCI pass-through at roughly 0.94 of bare-metal performance.]

RHEL6 KVM w/ SRIOV Intel Niantic 10Gb

Postgres DB: 93% of bare metal, 20% faster than bridged
(Throughput in orders/min (OPM); bigger is better)

[Chart: DVD Store Version 2 results, total OPM - Red Hat KVM bridged guest: 69,984; Red Hat KVM SR-IOV guest: 86,469; one database instance on bare metal: 92,680.]

Summary

RHEL filesystems

EXT4 4-12% improvement over EXT3

XFS 15-25% faster than EXT4 for large sequential IO

GFS2 - +50% improvement, approx equal to EXT3

RHEL6 KVM enhancements

aio=native as storage option


PCI-passthrough and SR-IOV approach 95% of bare metal on the most difficult workloads (SAS / TPC)
Transparent hugepages w/ guest automatic in 6.X

Pagecache Tuning (RHEL)
Filesystem/pagecache allocation
[Diagram: pagecache page lifecycle - pages accessed while pagecache is under its limit move to the ACTIVE list; aging moves pages to the INACTIVE list (new -> old); when pagecache is over its limit, reclaim moves pages from INACTIVE to the FREE list.]

swappiness

Not needed as much in RHEL6


Controls how aggressively the system reclaims mapped memory:

Anonymous memory - swapping

Mapped file pages - written if dirty, then freed

System V shared memory - swapping

Decreasing: more aggressive reclaiming of unmapped pagecache memory

Increasing: more aggressive swapping of mapped memory

/proc/sys/vm/swappiness
Database server with /proc/sys/vm/swappiness set to 60 (default)

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free  buff    cache   si   so    bi    bo   in    cs us sy id wa
 5  1 643644  26788  3544 32341788  880  120  4044  7496 1302 20846 25 34 25 16

Database server with /proc/sys/vm/swappiness set to 10

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free  buff    cache   si   so    bi    bo   in    cs us sy id wa
 8  3      0  24228  6724 32280696    0    0 23888 63776 1286 20020 24 38 13 26

zone_reclaim_mode

Controls NUMA specific memory allocation policy

When set and node memory is exhausted:

Reclaim memory from the local node rather than allocating from the next node

Slower allocation, higher NUMA hit ratio

When clear and node memory is exhausted:

Allocate from all nodes before reclaiming memory

Faster allocation, higher NUMA miss ratio

Default is set at boot time based on NUMA factor
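To inspect or override the boot-time default (changing it is a per-workload tuning decision, not a general recommendation):

# cat /proc/sys/vm/zone_reclaim_mode
# echo 1 > /proc/sys/vm/zone_reclaim_mode   # prefer reclaiming from the local node
# numastat                                  # numa_hit / numa_miss counters show the effect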

/proc/sys/vm/min_free_kbytes
Directly controls the page reclaim watermarks in KB
Defaults are higher when THP is enabled
# echo 1024 > /proc/sys/vm/min_free_kbytes
Node 0 DMA free:4420kB min:8kB low:8kB high:12kB
Node 0 DMA32 free:14456kB min:1012kB low:1264kB high:1516kB

# echo 2048 > /proc/sys/vm/min_free_kbytes
Node 0 DMA free:4420kB min:20kB low:24kB high:28kB
Node 0 DMA32 free:14456kB min:2024kB low:2528kB high:3036kB

Memory reclaim Watermarks - min_free_kbytes


Free List - All of RAM
  Do nothing

Pages High - kswapd sleeps above High
  kswapd reclaims memory

Pages Low - kswapd wakes up at Low
  kswapd reclaims memory

Pages Min - all memory allocators reclaim at Min
  user processes/kswapd reclaim memory

/proc/sys/vm/dirty_background_ratio
/proc/sys/vm/dirty_background_bytes
Controls when dirty pagecache memory starts getting written asynchronously

Default is 10%

Lower - flushing starts earlier; less dirty pagecache and smaller I/O streams

Higher - flushing starts later; more dirty pagecache and larger I/O streams

dirty_background_bytes overrides when you want < 1%

/proc/sys/vm/dirty_ratio
/proc/sys/vm/dirty_bytes

Absolute upper percentage limit of dirty pagecache memory

Default is 20%

Lower means a cleaner pagecache and smaller I/O streams

Higher means a dirtier pagecache and larger I/O streams

dirty_bytes overrides when you want < 1%
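For example, to start background flushing earlier and cap dirty pagecache lower (illustrative values):

# echo 5 > /proc/sys/vm/dirty_background_ratio   # wake background flushing at 5% of RAM dirty
# echo 15 > /proc/sys/vm/dirty_ratio             # force synchronous writes at 15% dirty
# grep -i dirty /proc/meminfo                    # watch the dirty page count under load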

dirty_ratio and dirty_background_ratio


pagecache
[Diagram: as the fraction of pagecache RAM that is dirty rises from 0% to 100% - nothing happens below dirty_background_ratio (10% of RAM dirty); at dirty_background_ratio flushd is woken and writes dirty buffers in the background; at dirty_ratio (20% of RAM dirty) processes doing write() start synchronous writes as well.]

(Hint) Flushing the pagecache


# sync
# echo 1 > /proc/sys/vm/drop_caches
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b  swpd    free   buff   cache   si   so   bi   bo    in   cs us sy  id wa
 0  0   224   57184 107808 3350196    0    0    0   56  1136  212  0  0  83 17
 0  0   224   57184 107808 3350196    0    0    0    0  1039  198  0  0 100  0
 0  0   224   57184 107808 3350196    0    0    0    0  1021  188  0  0 100  0
 0  0   224   57184 107808 3350196    0    0    0    0  1035  204  0  0 100  0
 0  0   224   57248 107808 3350196    0    0    0    0  1008  164  0  0 100  0
 3  0   224 2128160    176 1438636    0    0    0    0  1030  197  0 15  85  0
 0  0   224 3610656    204   34408    0    0   28   36  1027  177  0 32  67  2
 0  0   224 3610656    204   34408    0    0    0    0  1026  180  0  0 100  0
 0  0   224 3610720    212   34400    0    0    8    0  1010  183  0  0  99  1

(Hint) Flushing the slabcache


# echo 2 > /proc/sys/vm/drop_caches

Before:
[tmp]# cat /proc/meminfo
MemTotal:       3907444 kB
MemFree:        3104576 kB
Slab:            415420 kB
Hugepagesize:      2048 kB

After:
[tmp]# cat /proc/meminfo
MemTotal:       3907444 kB
MemFree:        3301788 kB
Slab:            218208 kB
Hugepagesize:      2048 kB
