David Quenzler
IBM Systems and Technology Group ISV Enablement
June 2012
Table of contents
Abstract
Introduction
External storage subsystem - XIV
External SAN switches
    Bottleneck monitoring
    Fabric parameters
    Basic port configuration
    Advanced port configuration
Abstract
This white paper discusses an end-to-end approach for Linux I/O tuning in a typical data center
environment consisting of external storage subsystems, storage area network (SAN) switches,
IBM System x Intel servers, Fibre Channel host bus adapters (HBAs) and 64-bit Red Hat
Enterprise Linux.
Anyone with an interest in I/O tuning is welcome to read this white paper.
Introduction
Linux I/O tuning is complex. In a typical environment, I/O makes several transitions from the client
application out to disk and vice versa. There are many pieces to the puzzle.
We will examine the following topics in detail: the external storage subsystem (XIV), the external SAN switches, the server and its Fibre Channel HBAs, and the Linux operating system.
You should follow an end-to-end tuning methodology in order to minimize the risk of poor tuning.
Recommendations in this white paper are based on the following environment under test: an IBM XIV Storage System, SAN switches running Fabric OS, an IBM System x server with QLogic Fibre Channel HBAs, and 64-bit Red Hat Enterprise Linux.
An architecture comprising IBM hardware and Red Hat Linux provides a solid framework for maximizing
I/O performance.
External storage subsystem - XIV
Familiarize yourself with the XIV command-line interface (XCLI) as documented in the IBM XIV
Storage System User Manual.
Ensure that you connect the XIV system to your environment in the FC fully redundant
configuration as documented in the XIV Storage System: Host Attachment and Interoperability
guide from IBM Redbooks.
Although you can define up to 12 paths per host, a maximum of six paths per host provides sufficient
redundancy and performance.
Useful XCLI commands:
# module_list -t all
# module_list -x
# fc_port_list
The XIV storage subsystem contains six FC data modules (4 to 9), each with 8 GB memory. The FC rate
is 4 Gbps and the data partition size is 1 MB.
Check the XIV HBA queue depth setting: The higher the host HBA queue depth, the more
parallel I/O goes to the XIV system, but each XIV port can sustain only up to 1400 concurrent
I/Os to the same target or logical unit (LUN). Therefore, the number of connections
multiplied by the host HBA queue depth should not exceed that value. The number of
connections should take the multipath configuration into account.
Note: The XIV queue limit is 1400 per XIV FC host port and 256 per LUN per worldwide port
name (WWPN) per port.
Twenty-four multipath connections to the XIV system would dictate that host queue depth be set
to 58. (24*58=1392)
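The same calculation can be done quickly in the shell, using the figures from the example above:
# echo $((1400 / 24))
58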
Check the operating system (OS) disk queue depth (see below)
Make use of the XIV host attachment kit for RHEL
Useful commands:
# xiv_devlist
External SAN switches
Two types of bottleneck can affect the fabric:
Latency bottleneck
Congestion bottleneck
Latency bottlenecks occur when frames are sent faster than they can be received. This can be due to
buffer credit starvation or slow drain devices in the fabric.
Congestion bottlenecks occur when the required throughput exceeds the physical data rate for the
connection.
Most SAN switch web interfaces can be used to monitor basic performance metrics, such as
throughput utilization, aggregate throughput, and percentage of utilization.
The Fabric OS command-line interface (CLI) can also be used to create frame monitors. These monitors
analyze the first 64 bytes of each frame and can detect various types of protocols that can be monitored.
Some performance features, such as frame monitor configuration (fmconfig), require a license.
Some useful commands:
switch:admin>perfhelp
switch:admin>perfmonitorshow
switch:admin>perfaddeemonitor
switch:admin>fmconfig
Bottleneck monitoring
Enable bottleneck monitoring on SAN switches by using the following command:
switch:admin> bottleneckmon --enable -alert
Useful commands
switch:admin> bottleneckmon --status
switch:admin> bottleneckmon --show -interval 5 -span 300
switch:admin> switchstatusshow
switch:admin> switchshow
switch:admin> configshow
switch:admin> configshow -pattern "fabric"
switch:admin> diagshow
switch:admin> porterrshow
Fabric parameters
Fabric parameters are described in the following table. Default values are in brackets []:
Fabric parameter        Description
BBCredit                Buffer-to-buffer credit
E_D_TOV                 Error detect timeout value
R_A_TOV                 Resource allocation timeout value
dataFieldSize           Largest possible data field size
                        Set this mode only if N_Port discovery causes attached devices to fail [0]
                        [0]
fabric.ididmode         [0]
On the IBM System x server, the Advanced Settings Utility (asu64) can be used to inspect and change UEFI and IMM settings, for example, setting the operating mode to Performance. Useful commands:
# asu64 show
# asu64 show --help
# asu64 set IMM.LanOverUsb Disabled --kcs
# asu64 set uEFI.OperatingMode Performance
The QLogic scli utility reports HBA configuration details. Useful command:
# scli -c
WWPNs can also be determined from the Linux command line or by using a small script, for example:
#!/bin/sh
# Report the PCI location of each Fibre Channel HBA and the WWPN of each FC port
###
hba_location=$(lspci | grep HBA | awk '{print $1}')
echo "HBA PCI location(s): $hba_location"
cat /sys/class/fc_host/host*/port_name
HBA parameters as reported by the scli command appear in the following table:
Parameter               Default value
Connection Options
Data Rate               Auto
                        Disabled
                        Disabled
                        Disabled
                        Yes
                        Yes
Execution Throttle      16
Frame Size              2048
Hard Loop ID            30
                        128
Operation Mode          Disabled
                        30 seconds
Use the lspci command to show which type(s) of Fibre Channel adapters exist in the system. For
example:
# lspci | grep HBA
The Red Hat enterprise-storage tuned profile uses the deadline scheduler. The deadline scheduler can
be enabled by adding the elevator=deadline parameter to the kernel command line in grub.conf.
Useful commands:
# cat /proc/cmdline
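The scheduler that is currently active can also be checked per device in sysfs (the device name is illustrative):
# cat /sys/block/sdb/queue/scheduler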
Page size
The default page size for Red Hat Linux is 4096 bytes.
# getconf PAGESIZE
Disable QLogic failover. If the output of the following command shows the -k driver (not the -fo driver), then
failover is disabled.
# modinfo qla2xxx | grep -w ^version
version: <some_version>-k
Useful commands:
# modinfo -p qla2xxx
The qla_os.c file in the Linux kernel source contains information on many of the qla2xxx module
parameters. Some parameters as listed by modinfo -p do not exist in the Linux source code. Others
are not explicitly defined but may be initialized by the adapter firmware.
Parameter                    Description                                          Default value
ql2xallocfwdump              Allocate memory for a firmware dump during           1 - allocate memory
                             HBA initialization
ql2xasynctmfenable           Issue TM IOCBs asynchronously via IOCB mechanism
ql2xdbwr                                                                          1 - CAMRAM doorbell (faster)
ql2xdontresethba             Reset behavior                                       0 - reset on failure
ql2xenabledif                T10-CRC-DIF                                          1 - DIF support
ql2xenablehba_err_chk        T10-CRC-DIF error isolation by HBA                   0 - disabled
ql2xetsenable
ql2xextended_error_logging                                                        0 - no logging
ql2xfdmienable               FDMI registrations                                   0 - no FDMI
ql2xfwloadbin
ql2xgffidenable
ql2xiidmaenable              iIDMA setting                                        1 - perform iIDMA
ql2xloginretrycount
ql2xlogintimeout                                                                  20
ql2xmaxqdepth                                                                     32
ql2xmaxqueues                MQ                                                   1 - single queue
ql2xmultique_tag             CPU affinity                                         not defined; 0 - no affinity
ql2xplogiabsentdevice        PLOGI                                                not defined; 0 - no PLOGI
ql2xqfulrampup               Wait before ramping up the queue depth after         120 seconds
                             a queue full condition is detected
ql2xqfulltracking                                                                 1 - perform tracking
ql2xshiftctondsd             Control shifting of command type processing
                             based on total number of SG elements
ql2xtargetreset              Target reset                                         1 - use hw defaults
qlport_down_retry            Maximum number of command retries to a port          not defined
                             in PORT-DOWN state
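Module parameters are typically set in a file under /etc/modprobe.d/ and take effect after the driver is reloaded (or after the initramfs is rebuilt, for boot time). A minimal sketch; the file name and value are illustrative:
# cat /etc/modprobe.d/qla2xxx.conf
options qla2xxx ql2xmaxqdepth=64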
To send down large-size requests (greater than 512 KB on 4 KB page size systems), increase the max_sectors_kb parameter described below.
SCSI device parameters appear in the following table. Values that can be changed are shown as (rw):
Parameter                 Description                                      Value
hw_sector_size (ro)                                                        512
max_hw_sectors_kb (ro)                                                     32767
max_sectors_kb (rw)                                                        512
nomerges (rw)
nr_requests (rw)                                                           128
read_ahead_kb (rw)                                                         8192
rq_affinity (rw)          Always complete a request on the same CPU
                          that queued it.
scheduler (rw)                                                             deadline
Using max_sectors_kb:
By default, Linux devices are configured for a maximum 512 KB I/O size. When using a larger file system
block size, increase the max_sectors_kb parameter. Max_sectors_kb must be less than or equal to
max_hw_sectors_kb.
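For example, to allow 1 MB requests on a device (the device name and value are illustrative):
# echo 1024 > /sys/block/sdb/queue/max_sectors_kb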
The default queue_depth is 32 and represents the total number of transfers that can be queued to a
device. You can check the queue depth by examining /sys/block/<device>/device/queue_depth.
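For drivers that support it, the queue depth can also be changed at run time through the same sysfs file (the device name and value are illustrative):
# echo 64 > /sys/block/sdb/device/queue_depth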
Note: XFS writes are not guaranteed to be committed unless the program issues an fsync() call
afterwards.
If necessary, you can increase the amount of space allowed for inodes using the
mkfs.xfs -i maxpct= option. The default percentage of space allowed for inodes varies by file
system size. For example, a file system between 1 TB and 50 TB in size will allocate 5% of the
total space for inodes.
Normally, the XFS file system directory block size is the same as the file system block size.
Choose a larger value for the mkfs.xfs -n size= option, if there are many millions of directory
entries.
The metadata log can be placed on another device, for example, a solid-state drive (SSD), to reduce disk
seeks.
Specify the stripe unit and width for hardware RAID devices; an example appears after the option summary below.
-b block_size_options
    size=<int> -- size in bytes (default 4096, minimum 512)
-d data_section_options
    agcount=<int> -- number of allocation groups. More allocation groups imply that more
    parallelism can be achieved when allocating blocks and inodes.
    agsize, name, file, size, sunit, su, swidth, sw
-i inode_options
    size, log, perblock, maxpct, align, attr
-l log_section_options
    internal, logdev, size, version, sunit, su, lazy-count
-n naming_options
    size, log, version
-r realtime_section_options
    rtdev, extsize, size
-s sector_size
    log, size
-N
    Dry run.
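A minimal mkfs.xfs sketch that combines an external log device with RAID stripe geometry; the device names, log size, and the 1 MB x 6 stripe geometry (matching the XIV 1 MB partition size across six data modules) are illustrative assumptions:
# mkfs.xfs -d su=1m,sw=6 -l logdev=/dev/sdx1,size=128m /dev/mapper/mpatha
Adding the -N option first previews the resulting layout without creating the file system.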
nobarrier  Disables write barriers; appropriate only when the storage protects its write cache (for
example, with battery backup).
noatime  Disables file access-time updates on reads.
inode64  XFS is allowed to create inodes at any location in the file system. Starting from kernel 2.6.35,
XFS file systems can be mounted either with or without the inode64 option.
logbsize  Larger values can improve performance. Smaller values should be used with fsync-heavy
workloads.
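A minimal mount sketch using these options; the device and mount point are illustrative:
# mount -o noatime,inode64,logbsize=256k /dev/mapper/mpatha /mnt/data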
The Red Hat 6.2 Release Notes mention that XFS has been improved in order to better handle metadata
intensive workloads. The default mount options have been updated to use delayed logging.
Useful commands:
# tuned-adm help
# tuned-adm list
# tuned-adm active
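The enterprise-storage profile is activated with the profile subcommand:
# tuned-adm profile enterprise-storage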
The enterprise-storage profile contains the following files. When comparing the enterprise-storage profile
with the throughput-performance profile, some files are identical:
# cd /etc/tune-profiles
# ls enterprise-storage/
ktune.sh
ktune.sysconfig
sysctl.ktune
tuned.conf
      2 throughput-performance/sysctl.s390x.ktune
08073 2 enterprise-storage/sysctl.ktune
15419 2 enterprise-storage/ktune.sysconfig
15419 2 throughput-performance/ktune.sysconfig
15570 1 enterprise-storage/ktune.sh
43756 1 enterprise-storage/tuned.conf
43756 1 throughput-performance/tuned.conf
47739 2 throughput-performance/sysctl.ktune
57787 1 throughput-performance/ktune.sh
ktune.sh
The enterprise-storage ktune.sh is the same as the throughput-performance ktune.sh but adds
functionality for disabling or enabling I/O barriers. The enterprise-storage profile is preferred when using
XIV storage. Important functions include:
ktune.sysconfig
ktune.sysconfig is identical for both throughput-performance and enterprise-storage profiles:
# grep -h ^[A-Za-z] enterprise-storage/ktune.sysconfig \
    throughput-performance/ktune.sysconfig | sort | uniq -c
2 ELEVATOR="deadline"
2 ELEVATOR_TUNE_DEVS="/sys/block/{sd,cciss,dm-}*/queue/scheduler"
2 SYSCTL_POST="/etc/sysctl.conf"
2 USE_KTUNE_D="yes"
Listing 6: Sorting the ktune.sysconfig file
sysctl.ktune
sysctl.ktune is functionally identical for both throughput-performance and enterprise-storage profiles:
# grep -h ^[A-Za-z] enterprise-storage/sysctl.ktune \
    throughput-performance/sysctl.ktune | sort | uniq -c
2 kernel.sched_min_granularity_ns = 10000000
2 kernel.sched_wakeup_granularity_ns = 15000000
2 vm.dirty_ratio = 40
Listing 7: Sorting the sysctl.ktune file
tuned.conf
tuned.conf is identical for both throughput-performance and enterprise-storage profiles:
# grep -h ^[A-Za-z] enterprise-storage/tuned.conf \
    throughput-performance/tuned.conf | sort | uniq -c
12 enabled=False
Listing 8: Sorting the tuned.conf file
Linux multipath
Keep it simple: configure just enough paths for redundancy and performance.
The features='1 queue_if_no_path' setting causes I/O to be queued indefinitely when no path is available.
To bound this behavior, set 'no_path_retry N' and then remove the features='1 queue_if_no_path' option or set 'features 0'.
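A minimal multipath.conf defaults sketch of this approach (the retry count is illustrative):
defaults {
    features "0"
    no_path_retry 12
}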
The multipath.conf default values appear in the following table:
Parameter               Default value
polling_interval
udev_dir                /dev
multipath_dir           /lib/multipath
find_multipaths         no
verbosity
path_selector           round-robin 0
path_grouping_policy    failover
getuid_callout
prio                    const
features                queue_if_no_path
path_checker            directio
failback                manual
rr_min_io               1000
rr_weight               uniform
no_path_retry
user_friendly_names     no
queue_without_daemon    yes
flush_on_last_del       no
max_fds
checker_timer           /sys/block/sdX/device/timeout
fast_io_fail_tmo        determined by the OS
dev_loss_tmo            determined by the OS
mode
uid
gid
The default load balancing policy (path_selector) is round-robin 0. Other choices are queue-length 0 and
service-time 0.
Consider using the XIV Linux host attachment kit to create the multipath configuration file.
# cat /etc/multipath.conf
devices {
device {
vendor "IBM"
product "2810XIV"
path_selector "round-robin 0"
path_grouping_policy multibus
rr_min_io 15
path_checker tur
failback 15
no_path_retry 5
#polling_interval 3
}
}
defaults {
...
user_friendly_names yes
...
}
Listing 9: A sample multipath.conf file
Sample scripts
You can use the following script to query various settings related to I/O tuning:
#!/bin/sh
# Query scheduler, hugepages, and readahead settings for fibre channel scsi devices
# (assumes the FC LUNs appear as sd* block devices)
###
grep -i huge /proc/meminfo
for q in /sys/block/sd*/queue; do
    echo "$q: scheduler=$(cat $q/scheduler) read_ahead_kb=$(cat $q/read_ahead_kb)"
done
Summary
This white paper presented an end-to-end approach for Linux I/O tuning in a typical data center
environment consisting of external storage subsystems, storage area network (SAN) switches, IBM
System x Intel servers, Fibre Channel HBAs and 64-bit Red Hat Enterprise Linux.
Visit the links in the Resources section for more information on topics presented in this white paper.
Resources
The following websites provide useful references to supplement the information contained in this paper:
XIV Redbooks
ibm.com/redbooks/abstracts/sg247659.html
ibm.com/redbooks/abstracts/sg247904.html
Note: IBM Redbooks are not official IBM product documentation.
XIV Infocenter
http://publib.boulder.ibm.com/infocenter/ibmxiv/r2
XIV Host Attachment Kit for RHEL can be downloaded from Fix Central
ibm.com/support/fixcentral
Qlogic
http://driverdownloads.qlogic.com
ftp://ftp.qlogic.com/outgoing/linux/firmware/rpms
Linux
Documentation/kernel-parameters.txt
Documentation/block/queue-sysfs.txt
Documentation/filesystems/xfs.txt
drivers/scsi/qla2xxx
http://xfs.org/index.php/XFS_FAQ
This information is presented here to communicate IBM's current investment and development activities as a
good faith effort to help with our customers' future planning.
Performance is based on measurements and projections using standard IBM benchmarks in a controlled
environment. The actual throughput or performance that any user will experience will vary depending
upon considerations such as the amount of multiprogramming in the user's job stream, the I/O
configuration, the storage configuration, and the workload processed. Therefore, no assurance can be
given that an individual user will achieve throughput or performance improvements equivalent to the
ratios stated here.
Photographs shown are of engineering prototypes. Changes may be incorporated in production models.
Any references in this information to non-IBM websites are provided for convenience only and do not in
any manner serve as an endorsement of those websites. The materials at those websites are not part of
the materials for this IBM product and use of those websites is at your own risk.