
Activity of the ZFS ARC


Disk I/O is still a common source of performance issues, despite modern cloud environments, modern file systems and huge amounts of
main memory serving as file system cache. Understanding how well that cache is working is a key task while investigating disk I/O
issues. In this post, I'll show the activity of the ZFS file system Adaptive Replacement Cache (ARC).
There are often more statistics available than you realize (or than have been documented), which is certainly true of the ARC. Apart
from showing these statistics, I'll also show how to extend observability using dynamic tracing (DTrace). These tracing techniques are
also applicable to any kernel subsystem. This is an advanced topic, where I'll sometimes dip into kernel code.

Architecture
For background on the ZFS ARC, see the paper "ARC: A Self-Tuning, Low Overhead Replacement Cache" by Nimrod Megiddo and
Dharmendra S. Modha. In a nutshell, the ARC achieves a high cache hit rate by using multiple cache algorithms at the same time: most
recently used (MRU) and most frequently used (MFU). Main memory is balanced between these algorithms based on their performance,
which is known by maintaining extra metadata (in main memory) to see how each algorithm would perform if it ruled all of memory.
Such extra metadata is held on ghost lists.
The ZFS ARC has some changes beyond this design, as described in the block comment at the top of uts/common/fs/zfs/arc.c. These
changes include the ability to lock pages, vary the size of the cache, and to cache buffers of different sizes.

Lists
+----------------------------------------------------------+
| ZFS ARC                                                   |
|                                                           |
|   +---------------+----------------------------------+    |
|   |      MRU      |            MRU ghost             |    |
|   +---------------+---------------+------------------+    |
|   |              MFU              |     MFU ghost    |    |
|   +-------------------------------+------------------+    |
|                                                           |
|   <--- available main memory --->                         |
|                                                           |
+----------------------------------------------------------+

The MRU + MFU lists refer to the data cached in main memory; the MRU ghost + MFU ghost lists consist of the metadata only, used to
track algorithm performance.
This is a simplification to convey the basic principle. The current version of the ZFS ARC splits the lists above into separate data and
metadata lists, and also has a list for anonymous buffers and one for L2ARC-only buffers (which I added when I developed the L2ARC).
The actual lists are these, from arc.c:
typedef struct arc_state {
        list_t  arcs_list[ARC_BUFC_NUMTYPES];   /* list of evictable buffers */
        uint64_t arcs_lsize[ARC_BUFC_NUMTYPES]; /* amount of evictable data */
        uint64_t arcs_size;                     /* total amount of data in this state */
        kmutex_t arcs_mtx;
} arc_state_t;

/* The 6 states: */
static arc_state_t ARC_anon;
static arc_state_t ARC_mru;
static arc_state_t ARC_mru_ghost;
static arc_state_t ARC_mfu;
static arc_state_t ARC_mfu_ghost;
static arc_state_t ARC_l2c_only;

These lists exhibit MRU- and MFU-like behavior, but aren't strictly MRU/MFU. This can be understood from a lifecycle of an ARC
buffer: on the first access, it is created and moved to the head of the MRU list. On the second access, it is moved to the head of the MFU
list. On the third access, it moves back to the start of the MFU list. (Other lifecycles are possible, this is just one example.) So, the most
recently accessed buffer may be at the start of the MFU list, not the MRU list. And, the most frequently accessed buffer may not be at the
very start of the MFU list.

Locks
Data exists in the cache as buffers, where the primary structures are the arc_buf_hdr_t (header struct, defined in arc.c) and arc_buf_t
(buffer struct, defined in arc.h). Access to these is protected by a hash table based on the 128-bit ZFS data virtual address (DVA). The
hash table has 256 buffer chains (BUF_LOCKS, which may vary based on your ZFS version), each protected by a padded lock (to avoid
false sharing). From arc.c:
#define HT_LOCK_PAD     64

struct ht_lock {
        kmutex_t        ht_lock;
#ifdef _KERNEL
        unsigned char   pad[(HT_LOCK_PAD - sizeof (kmutex_t))];
#endif
};

#define BUF_LOCKS 256
typedef struct buf_hash_table {
        uint64_t ht_mask;
        arc_buf_hdr_t **ht_table;
        struct ht_lock ht_locks[BUF_LOCKS];
} buf_hash_table_t;

These are optimized for performance since ARC buffers can be accessed, modified and moved between lists frequently.
For more details on ARC lists and locks, see the block comments in arc.c, and the overview by Joerg Moellenkamp.

Sizing
The ARC grows to fill available memory on the system, on the principle that if there is free memory, use it. It shouldn't do this at the
expense of applications, i.e., it shouldn't push out application memory (at least, not in any large and sustained way). It keeps its size in
check via:
allocation: once the ARC size has grown to its expected maximum, it will begin evicting buffers during new allocations. There is
also some logic in arc_evict() to recycle a buffer of equal size, an optimization to avoid an evict-free-alloc path for the same size.
reclaim thread: this is arc_reclaim_thread(), which wakes up every second (or sooner if signaled by the arc_reclaim_thr_cv
condition variable) and attempts to reduce the size of the ARC to the target size. It calls arc_kmem_reap_now() to clean up
the kmem caches, and arc_adjust() to resize the ARC lists. If arc_shrink() is called by arc_kmem_reap_now(), the target ARC size
is reduced by arc_shrink_shift (or needfree), which means shrinking the ARC by 3%. If you plot the ARC size, you sometimes see
these arc_shrink() steps appearing as teeth on a saw: a sharp drop followed by a gradual increase (a way to watch for this is sketched
at the end of this section).
This is a brief summary, and includes keywords so you can find the right places in the source to start reading. I should note that the ARC
did have sizing issues in the past, where it did seem to push out application memory; those have since been fixed. (One issue was that it
didn't account for its own footprint accurately, missing a source of metadata in its size calculation, which meant the ARC was reaping
later than it should have.)
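One way to watch for those steps (an informal sketch; the arcstats kstats are covered in the next section, and the paths shown are for Solaris-based systems):

$ # watch the current ARC size and its target size (c), once per second
$ kstat -p zfs:0:arcstats:size zfs:0:arcstats:c 1

A sudden drop in c of a few percent suggests an arc_shrink() step; plotting both values over a longer interval will show the saw-tooth if it is occurring.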

Statistics
On Solaris-based systems, ARC statistics are available from kstat (kernel statistics), the same resource used by tools such as vmstat(1M)
and iostat(1M). kstats are global (entire system, not individual zones) and accessible from non-root users. On the down side, they usually
are not documented and are not considered a stable interface.
On FreeBSD, the same kstats for the ARC are available via sysctl (kstat.zfs.misc.arcstats).
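For example, the hit and miss counters can be read on FreeBSD with (a small example; the OIDs mirror the arcstats names shown later in this post):

$ sysctl kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses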

ARC Hit/Miss Rate


ARC hit or miss rate can be determined from the kstats zfs::arcstats:hits and zfs::arcstats:misses. To watch a rate over time, they can be
processed using a little awk (example for Solaris-based systems):
# cat -n archits.sh
     1  #!/usr/bin/sh
     2
     3  interval=${1:-5}        # 5 secs by default
     4
     5  kstat -p zfs:0:arcstats:hits zfs:0:arcstats:misses $interval | awk '
     6      BEGIN {
     7          printf "%12s %12s %9s\n", "HITS", "MISSES", "HITRATE"
     8      }
     9      /hits/ {
    10          hits = $2 - hitslast
    11          hitslast = $2
    12      }
    13      /misses/ {
    14          misses = $2 - misslast
    15          misslast = $2
    16          rate = 0
    17          total = hits + misses
    18          if (total)
    19              rate = (hits * 100) / total
    20          printf "%12d %12d %8.2f%%\n", hits, misses, rate
    21      }
    22  '

This program could be shorter; I've spent some extra lines to write it more clearly. You could also write this in Perl (see my
Sun::Solaris::Kstat examples), or C via libkstat.
$ ./archits.sh 1
        HITS       MISSES   HITRATE
651329528960    370490565    99.94%
       22600           11    99.95%
       17984            6    99.97%
        8978            8    99.91%
       87041           28    99.97%
       89861           10    99.99%
[...]

The first line is the summary since boot, then interval summaries. These counters are system wide. The hit rate on this system is
impressive (99.94% since boot), although hit rates can be misleading. I'm usually studying the MISSES column, as a linear measure of
pain.

arcstat.pl
Neelakanth Nadgir wrote arcstat.pl (Solaris), which prints various statistics including reads, misses and the size of the ARC. Mike Harsch
delevoped arcstat.pl further, including L2ARC statistics.
$ ./arcstat.pl 1
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
04:45:47     0     0      0     0    0     0    0     0    0    14G   14G
04:45:49   15K    10      0    10    0     0    0     1    0    14G   14G
04:45:50   23K    81      0    81    0     0    0     1    0    14G   14G
04:45:51   65K    25      0    25    0     0    0     4    0    14G   14G
04:45:52   30K    11      0    11    0     0    0     3    0    14G   14G
[...]

Instead of hit rates, this tool uses miss rates.


In Neel's version the first line is the summary since boot; this isn't the case in Mike's current L2ARC version: an extra snap_stats() for an
early L2ARC check means that by the time the statistics loop is reached, the first iteration compares "now with now" instead of "now
with boot".
Jason Hellenthal has created a FreeBSD version.

All statistics
All the kstats from the arcstats group (which feed the tools seen above) can be listed using:
$ kstat -pn arcstats
zfs:0:arcstats:c                        15730138449
zfs:0:arcstats:c_max                    50447089664
zfs:0:arcstats:c_min                    6305886208
zfs:0:arcstats:class                    misc
zfs:0:arcstats:crtime                   95.921230719
zfs:0:arcstats:data_size                13565817856
zfs:0:arcstats:deleted                  388469245
zfs:0:arcstats:demand_data_hits         611277816567
zfs:0:arcstats:demand_data_misses       258220641
zfs:0:arcstats:demand_metadata_hits     40050025212
zfs:0:arcstats:demand_metadata_misses   88523590
zfs:0:arcstats:evict_skip               5669994
zfs:0:arcstats:hash_chain_max           20
zfs:0:arcstats:hash_chains              248783
zfs:0:arcstats:hash_collisions          2106095400
zfs:0:arcstats:hash_elements            971654
zfs:0:arcstats:hash_elements_max        5677254
zfs:0:arcstats:hdr_size                 188240232
zfs:0:arcstats:hits                     651328694708
[...l2arc statistics truncated...]
zfs:0:arcstats:memory_throttle_count    0
zfs:0:arcstats:mfu_ghost_hits           55377634
zfs:0:arcstats:mfu_hits                 649347616033
zfs:0:arcstats:misses                   370489546
zfs:0:arcstats:mru_ghost_hits           127477329
zfs:0:arcstats:mru_hits                 1980639328
zfs:0:arcstats:mutex_miss               11530337
zfs:0:arcstats:other_size               1967741376
zfs:0:arcstats:p                        14713329404
zfs:0:arcstats:prefetch_data_hits       21342
zfs:0:arcstats:prefetch_data_misses     20782630
zfs:0:arcstats:prefetch_metadata_hits   831587
zfs:0:arcstats:prefetch_metadata_misses 2962685
zfs:0:arcstats:recycle_miss             27036925
zfs:0:arcstats:size                     15721799464
zfs:0:arcstats:snaptime                 29379870.8764106

More of the activity related statistics will be discussed in the next sections.

Demand/Prefetch
Hits and misses can be broken down into four components, such that:
hits = demand_data_hits + demand_metadata_hits + prefetch_data_hits + prefetch_metadata_hits
And similarly for misses. Prefetch and demand refer to how the ARC request was initiated; data and metadata refer to the type of data
requested.
Prefetch is the ZFS read-ahead feature, which predicts and pre-caches blocks for streaming (sequential) workloads. All the prefetch
statistics refer to ARC requests that originated from the ZFS prefetch algorithm, which runs before the ARC and without knowing
whether the data is already cached there. So, a prefetch hit means that ZFS initiated a prefetch which was then found in the ARC, and a
prefetch miss means that the prefetch request was not in the ARC, and so initiated a disk I/O request (normal behavior). Demand
is the opposite of prefetch: direct requests to the ARC, not predicted requests.
Another way to understand prefetch statistics is to follow the code. In dbuf.c, see the ARC_PREFETCH flag set in dbuf_prefetch(), which
is then checked in arc.c via the ARCSTAT_CONDSTAT macro to determine which kstat to increment.
You can also add these up in other ways; e.g.:
streaming ratio = prefetch_* / (hits + misses)
At least, this identifies the proportion of the workload that ZFS has treated as streaming. It can be turned into a kstat tool (awk/Perl/C), as
with hits/misses earlier, to show both the summary since boot and interval summaries (current activity).
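Here is a rough sketch of such a tool, in the same style as archits.sh (the kstat names are those listed under All statistics above, and the first line printed is the since-boot summary):

#!/usr/bin/sh
# streamratio.sh: ratio of ARC requests initiated by ZFS prefetch (a sketch)

interval=${1:-5}        # 5 secs by default

kstat -p zfs:0:arcstats:hits zfs:0:arcstats:misses \
    zfs:0:arcstats:prefetch_data_hits zfs:0:arcstats:prefetch_data_misses \
    zfs:0:arcstats:prefetch_metadata_hits zfs:0:arcstats:prefetch_metadata_misses \
    $interval | awk '
    BEGIN {
        printf "%12s %12s %9s\n", "PREFETCH", "TOTAL", "STREAM%"
    }
    {
        # per-interval delta for each statistic ($1 is the kstat name)
        delta[$1] = $2 - last[$1]
        last[$1] = $2
    }
    /prefetch_metadata_misses/ {
        # last statistic of each interval: print a summary
        total = delta["zfs:0:arcstats:hits"] + delta["zfs:0:arcstats:misses"]
        prefetch = delta["zfs:0:arcstats:prefetch_data_hits"]
        prefetch += delta["zfs:0:arcstats:prefetch_data_misses"]
        prefetch += delta["zfs:0:arcstats:prefetch_metadata_hits"]
        prefetch += delta["zfs:0:arcstats:prefetch_metadata_misses"]
        ratio = 0
        if (total)
            ratio = (prefetch * 100) / total
        printf "%12d %12d %8.2f%%\n", prefetch, total, ratio
    }'

This relies on kstat printing the statistics in the order given, with prefetch_metadata_misses last each interval (the same sort of assumption archits.sh makes about hits printing before misses).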

Data/Metadata
Metadata describes the ZFS dataset (file system or volume) and the objects within it. The data is the contents of those objects, including
file, directory and volume blocks.
metadata ratio = *_metadata_* / (hits + misses)
This may be useful when considering the effect of choosing a small recsize setting (which increases the proportion of metadata), or the
effect of setting primarycache to metadata only.
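As a quick since-boot check, the ratio can be computed from a single kstat snapshot; a small sketch (it assumes the only arcstats matching _metadata_ are the four hit/miss counters used above, which holds for the listing shown earlier):

$ kstat -pn arcstats | awk '
    $1 ~ /_metadata_/                 { meta += $2 }
    $1 ~ /:hits$/ || $1 ~ /:misses$/  { total += $2 }
    END { printf "metadata ratio: %.2f%%\n", meta * 100 / total }'

Using the counters from that listing, this works out to roughly 6% for that system.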

Others
Some other activity-related kstats worth mentioning for the ARC:
mru_hits, mru_ghost_hits, mfu_hits, mfu_ghost_hits, p: Comparing the mru_hits and mfu_hits statistics with misses can
determine the performance of each ARC list type (this isn't comparing the performance of the MRU/MFU algorithms alone, since these
lists aren't strictly MRU/MFU, as mentioned in Architecture). By adding _hits + _ghost_hits for each type, and then comparing the ratio
of each type over time, you can also identify whether the workload changes in terms of ARC MRU/MFU. And you can see how
quickly the ARC adapts to the workload by watching the p statistic (ARC parameter) change; a kstat sketch follows this list.
hash_chain_max, hash_collisions: These show how well the DVA hash table is hashing. hash_chain_max is the longest chain length
seen, when DVAs hash to the same table entry, and is usually less than 10. If it were much higher, performance may
degrade as the hash locks are held longer while the chains are walked, assuming the max is representative and not an anomaly caused by
some short event. This could be double-checked by studying the hash_collisions rate. If an issue is found, the number of hash table
entries (BUF_LOCKS) could be increased in arc.c and ZFS recompiled (this isn't a regular tunable), although I wouldn't expect this
to need tuning for a while.
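For example, adaptation can be watched directly with kstat (a rough sketch; the hit counters are cumulative since boot):

$ kstat -p zfs:0:arcstats:p zfs:0:arcstats:c zfs:0:arcstats:mru_hits \
    zfs:0:arcstats:mfu_hits zfs:0:arcstats:mru_ghost_hits \
    zfs:0:arcstats:mfu_ghost_hits 10

Roughly speaking, p is the target size for the MRU portion of the cache (out of the total target c), so watching p rise or fall shows the ARC shifting its balance between MRU and MFU.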
Other kstats in the arcstats group describe sizes of the ARC, and the L2ARC.

arc_summary.pl
Another Perl Sun::Solaris::Kstat-based ARC tool worth mentioning is Ben Rockwood's arc_summary.pl, which prints a neat summary of
the hit/miss rate and many of the other counters. Jason Hellenthal also ported the tool to FreeBSD.
$ ./arc_summary.pl
System Memory:
         Physical RAM:  49134 MB
         Free Memory :  1925 MB
         LotsFree:      767 MB

ZFS Tunables (/etc/system):
         set zfs:zil_disable=1
         set zfs:zfs_prefetch_disable=1
         set zfs:zfs_nocacheflush=1

ARC Size:
         Current Size:             15172 MB (arcsize)
         Target Size (Adaptive):   15256 MB (c)
         Min Size (Hard Limit):    6013 MB (zfs_arc_min)
         Max Size (Hard Limit):    48110 MB (zfs_arc_max)

ARC Size Breakdown:
         Most Recently Used Cache Size:          77%    11865 MB (p)
         Most Frequently Used Cache Size:        22%    3391 MB (c-p)

ARC Efficency:
         Cache Access Total:             654018720316
         Cache Hit Ratio:      99%       653646329407   [Defined State for buffer]
         Cache Miss Ratio:      0%       372390909      [Undefined State for Buffer]
         REAL Hit Ratio:       99%       653645890054   [MRU/MFU Hits Only]

         Data Demand   Efficiency:    99%
         Data Prefetch Efficiency:     0%

        CACHE HITS BY CACHE LIST:
          Anon:                       --%        Counter Rolled.
          Most Recently Used:          0%        1989696958 (mru)        [ Return Customer ]
          Most Frequently Used:       99%        651656193096 (mfu)      [ Frequent Customer ]
          Most Recently Used Ghost:    0%        128471495 (mru_ghost)   [ Return Customer Evicted, Now Back ]
          Most Frequently Used Ghost:  0%        55618357 (mfu_ghost)    [ Frequent Customer Evicted, Now Back ]
        CACHE HITS BY DATA TYPE:
          Demand Data:                93%        613371468593
          Prefetch Data:               0%        21342
          Demand Metadata:             6%        40274007879
          Prefetch Metadata:           0%        831593
        CACHE MISSES BY DATA TYPE:
          Demand Data:                69%        259735783
          Prefetch Data:               5%        20782630
          Demand Metadata:            23%        88909678
          Prefetch Metadata:           0%        2962818
---------------------------------------------

Percentages and raw counters are provided, and the four breakdowns of hit/miss statistics (which I documented above; Ben's
been bugging me to document the arcstats for a while).

Tracing
Apart from statistics, the activity of the ARC can also be observed by tracing function points and probes in the kernel. While statistics are
always enabled and collected, tracing is enabled when needed, and costs much higher overhead. This overhead is relative to the frequency
of the traced events, which for the ARC can be very frequent (hundreds of thousands of events per second). I usually only trace the ARC
for short periods (seconds or minutes) to gather debug data.
There isn't a stable DTrace provider for the ARC (and there probably never will be: other areas make much more sense), but there are
sdt-provider probes in the ARC code:
# dtrace -ln 'sdt:zfs::arc-*'
   ID   PROVIDER            MODULE                          FUNCTION NAME
19307        sdt               zfs                   arc_read_nolock arc-miss
19310        sdt               zfs                   arc_evict_ghost arc-delete
19311        sdt               zfs                         arc_evict arc-evict
19312        sdt               zfs                   arc_read_nolock arc-hit
19313        sdt               zfs                   arc_buf_add_ref arc-hit

If these didn't exist, you could use the fbt provider. I'd begin by inspecting the functions listed in the FUNCTION column.
Note that neither of these providers (sdt or fbt) is available from within Solaris zones; these must be traced from the global zone. They
are also both considered unstable interfaces, meaning the one-liners and scripts that follow may not work on future versions of the ARC
without maintenance to match the code changes.

ARC accesses by application


Checking which applications are (directly) using the ARC:
# dtrace -n 'sdt:zfs::arc-hit,sdt:zfs::arc-miss { @[execname] = count(); }'
dtrace: description 'sdt:zfs::arc-hit,sdt:zfs::arc-miss ' matched 3 probes
^C

  sendmail                                                          1
  qmgr                                                              3
[...]
  nscd                                                             81
  httpd                                                           243
  imapd                                                          1417
  python2.6                                                      2572
  awstats.pl                                                     4285
  php                                                            6934
  mysqld                                                       105901

This frequency counts the execname during ARC access. mysqld was the heaviest user, with 105,901 accesses while tracing.
The kernel will show up as sched, for activities including ZFS transaction group flushes (TXG flush).

ARC accesses by kernel call path


For more details on why the ARC is being accessed, the kernel calling stack can be frequency counted:
# dtrace -n 'sdt:zfs::arc-hit,sdt:zfs::arc-miss {
@[execname, probefunc, stack()] = count(); }'
dtrace: description 'sdt:zfs::arc-hit,sdt:zfs::arc-miss ' matched 3 probes
^C
[...]
  sched                                               arc_buf_add_ref
              zfs`dbuf_hold_impl+0xea
              zfs`dbuf_hold+0x2e
              zfs`dmu_buf_hold+0x75
              zfs`zap_lockdir+0x67
              zfs`zap_update+0x5b
              zfs`uidacct+0xc4
              zfs`zfs_space_delta_cb+0x112
              zfs`dmu_objset_do_userquota_callbacks+0x151
              zfs`dsl_pool_sync+0xfe
              zfs`spa_sync+0x32b                        <- spa sync
              zfs`txg_sync_thread+0x265
              unix`thread_start+0x8
               26
[...]
  python2.6                                           arc_buf_add_ref
              zfs`dbuf_hold_impl+0xea
              zfs`dbuf_hold+0x2e
              zfs`dmu_buf_hold+0x75
              zfs`zap_get_leaf_byblk+0x56
              zfs`zap_deref_leaf+0x78
              zfs`fzap_cursor_retrieve+0xa7
              zfs`zap_cursor_retrieve+0x152
              zfs`zfs_readdir+0x2b8
              genunix`fop_readdir+0xab                  <- read directory
              genunix`getdents64+0xbc
              unix`_sys_sysenter_post_swapgs+0x149
             2130
[...]
  mysqld                                              arc_buf_add_ref
              zfs`dbuf_hold_impl+0xea
              zfs`dbuf_hold+0x2e
              zfs`dmu_buf_hold_array_by_dnode+0x1a7
              zfs`dmu_buf_hold_array+0x71
              zfs`dmu_read_uio+0x4d
              zfs`zfs_read+0x19a
              genunix`fop_read+0x6b                     <- read
              genunix`read+0x2b8
              genunix`read32+0x22
              unix`_sys_sysenter_post_swapgs+0x149
           101955

The output was many pages long; Ive truncated to include a few different stacks, and added annotations.

ARC misses by user-land call path


Here's another view of ARC access call paths, this time for misses only, and the user-land stack trace that led to the miss. I've filtered on
mysqld processes only:
# dtrace -n 'sdt:zfs::arc-miss /execname == "mysqld"/ {
@[execname, probefunc, ustack()] = count(); }'
dtrace: description 'sdt:zfs::arc-miss ' matched 1 probe
^C
[...]
  mysqld                                              arc_read_nolock
              libc.so.1`__read+0x15
              mysqld`my_read+0x43
              mysqld`_Z7openfrmP3THDPKcS2_jjjP8st_table+0x95
              mysqld`_ZL17open_unireg_entryP3THDP8st_tablePKcS4_S4_P10TABLE_LISTP1...
              mysqld`_Z10open_tableP3THDP10TABLE_LISTP11st_mem_rootPbj+0x6d7
              mysqld`_Z11open_tablesP3THDPP10TABLE_LISTPjj+0x1b0
              mysqld`_Z30open_normal_and_derived_tablesP3THDP10TABLE_LISTj+0x1b
              mysqld`_Z14get_all_tablesP3THDP10TABLE_LISTP4Item+0x73b
              mysqld`_Z24get_schema_tables_resultP4JOIN23enum_schema_table_state+0x18
              mysqld`_ZN4JOIN4execEv+0x59e
              mysqld`_Z12mysql_selectP3THDPPP4ItemP10TABLE_LISTjR4ListIS1_ES2_jP8s...
              mysqld`_Z13handle_selectP3THDP6st_lexP13select_resultm+0x102
              mysqld`_Z21mysql_execute_commandP3THD+0x51c6
              mysqld`_Z11mysql_parseP3THDPKcjPS2_+0x1be
              mysqld`_Z16dispatch_command19enum_server_commandP3THDPcj+0x895
              mysqld`handle_one_connection+0x318
              libc.so.1`_thrp_setup+0x7e
              libc.so.1`_lwp_start
              124

The kernel stack trace could be included as well, showing the complete call path from user-land to a kernel event.
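For example, a small variation on the previous one-liner aggregates on both (expect long output, since each kernel stack is paired with the user-land stack that led to it):

# dtrace -n 'sdt:zfs::arc-miss /execname == "mysqld"/ {
    @[stack(), ustack()] = count(); }'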

ARC access sizes


Digging a bit deeper: the sdt probes used previously were declared as:
DTRACE_PROBE1(arc__hit, arc_buf_hdr_t *, hdr);

Which means arg0 is an arc_buf_hdr_t. It's declared in arc.c, and contains various members including:
struct arc_buf_hdr {
[...]
        arc_buf_t               *b_buf;
        uint32_t                b_flags;
[...]
        arc_buf_contents_t      b_type;
        uint64_t                b_size;
        uint64_t                b_spa;
[...]
        clock_t                 b_arc_access;
[...]

Let's pick out the size, and trace ARC accesses by buffer size:
# dtrace -n 'sdt:zfs::arc-hit,sdt:zfs::arc-miss {
@["bytes"] = quantize(((arc_buf_hdr_t *)arg0)->b_size); }'
dtrace: description 'sdt:zfs::arc-hit,sdt:zfs::arc-miss ' matched 3 probes
^C
  bytes
           value  ------------- Distribution ------------- count
               0 |                                         0
               1 |                                         82
               2 |                                         1
               4 |                                         0
               8 |                                         1
              16 |                                         1
              32 |                                         1
              64 |                                         1
             128 |                                         0
             256 |                                         0
             512 |@                                        1526
            1024 |                                         605
            2048 |                                         780
            4096 |                                         913
            8192 |                                         1094
           16384 |@@                                       4386
           32768 |                                         618
           65536 |@@                                       4196
          131072 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@         60811
          262144 |                                         0

Here I used a power-of-2 quantization, which showed that most of the buffers were in the 128 Kbyte range. (Which is also the default
recsize for the datasets on this system.) Smaller buffers will exist due to cases including files and directories that are smaller than 128k.
Other members of arc_buf_hdr_t can be retrieved and inspected in similar ways.
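For example, b_type could be used to split ARC accesses into data vs metadata (a sketch: it assumes ARC_BUFC_DATA is enum value 0, the same assumption the arcevict.d script later in this post makes about the arc_evict() type argument):

# dtrace -n 'sdt:zfs::arc-hit,sdt:zfs::arc-miss {
    @[((arc_buf_hdr_t *)arg0)->b_type == 0 ? "data" : "metadata"] = count(); }'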

ARC buffer age


Here's a neat use of the b_arc_access member, which tracks the time that the buffer was last accessed in terms of clock ticks. This time
the fbt provider is used, to trace arc_access() before and after it updates b_arc_access:
# cat -n arcaccess.d
     1  #!/usr/sbin/dtrace -s
     2
     3  #pragma D option quiet
     4
     5  dtrace:::BEGIN
     6  {
     7          printf("lbolt rate is %d Hertz.\n", `hz);
     8          printf("Tracing lbolts between ARC accesses...");
     9  }
    10
    11  fbt::arc_access:entry
    12  {
    13          self->ab = args[0];
    14          self->lbolt = args[0]->b_arc_access;
    15  }
    16
    17  fbt::arc_access:return
    18  /self->lbolt/
    19  {
    20          @ = quantize(self->ab->b_arc_access - self->lbolt);
    21          self->ab = 0;
    22          self->lbolt = 0;
    23  }

Running for 10 seconds:


# ./arcaccess.d -n 'tick-10s { exit(0); }'
lbolt rate is 100 Hertz.
Tracing lbolts between ARC accesses...
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  729988
               1 |                                         3805         10 ms
               2 |                                         3038
               4 |                                         2028
               8 |                                         1428
              16 |                                         1398
              32 |                                         1618
              64 |                                         2883         1 second
             128 |                                         738
             256 |                                         681
             512 |                                         338
            1024 |                                         569
            2048 |                                         166
            4096 |                                         607          1 minute
            8192 |                                         632
           16384 |                                         808
           32768 |                                         373
           65536 |                                         110
          131072 |                                         142
          262144 |                                         39           1 hour
          524288 |                                         5
         1048576 |                                         97
         2097152 |                                         10
         4194304 |                                         44
         8388608 |                                         617          1 day
        16777216 |                                         1
        33554432 |                                         0

This is interesting data. It shows that most buffers were accessed less than one clock tick apart (10 ms), with 729,988 accesses in the 0 to
1 tick range. The oldest buffer accessed was in the 16777216+ range, which (converting lbolts at 100 Hertz into time) means it had been
at least 46 hours since its last access. The above output has been annotated to show where times fall in the lbolt ranges (e.g., 1 second
falls in the 64-127 lbolt range).
This gives us insight into the age of the oldest buffers in the ARC (at least, in terms of access time, not birth), and into its churn rate.
This particular ARC is 25 Gbytes, and has been running with a 99.94% hit rate as shown earlier, which may be less surprising now that
we know it is so large that it can contain buffers accessed 40+ hours apart.

ARC hash lock


To get a handle on ARC hash lock contention (instead of using more heavyweight tools like lockstat(1M)), you can try tracing the time
for arc_buf_add_ref(), since it grabs the buffer hash lock:
# dtrace -n 'arc_buf_add_ref:entry { self->s = timestamp; }
arc_buf_add_ref:return /self->s/ {
@["ns"] = quantize(timestamp - self->s); self->s = 0; }'
dtrace: description 'arc_buf_add_ref:entry ' matched 2 probes
^C
  ns
           value  ------------- Distribution ------------- count
             256 |                                         0
             512 |@                                        2123
            1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@          44784
            2048 |@@@@@                                    7556
            4096 |@@                                       2267
            8192 |                                         385
           16384 |                                         4
           32768 |                                         0
           65536 |                                         1
          131072 |                                         0

Most of the times were in the 1-2 us range, with only a single occurrence passing 65 us.


ARC reap
Here's a simple script to provide insight into the ARC reclaim thread, an asynchronous task that keeps the size of the ARC in check.
# cat -n arcreap.d
     1  #!/usr/sbin/dtrace -s
     2
     3  fbt::arc_kmem_reap_now:entry,
     4  fbt::arc_adjust:entry
     5  {
     6          self->start[probefunc] = timestamp;
     7  }
     8
     9  fbt::arc_shrink:entry
    10  {
    11          trace("called");
    12  }
    13
    14  fbt::arc_kmem_reap_now:return,
    15  fbt::arc_adjust:return
    16  /self->start[probefunc]/
    17  {
    18          printf("%Y %d ms", walltimestamp,
    19              (timestamp - self->start[probefunc]) / 1000000);
    20          self->start[probefunc] = 0;
    21  }

Different functions are traced: arc_kmem_reap_now(), to see the time taken to reap the ARC kmem caches; arc_adjust(), for resizing the
ARC lists; and arc_shrink(), to know when the ARC size has been stepped down (this isn't timed, since any real work will be done by
arc_adjust()).
# ./arcreap.d
dtrace: script './arcreap.d' matched 5 probes
CPU     ID                    FUNCTION:NAME
  0  64929                 arc_shrink:entry   called
  0  62414                arc_adjust:return   2012 Jan  9 23:10:01 18 ms
  9  62420         arc_kmem_reap_now:return   2012 Jan  9 23:10:03 1511 ms
  0  62414                arc_adjust:return   2012 Jan  9 23:10:24 0 ms
  6  62414                arc_adjust:return   2012 Jan  9 23:10:49 0 ms

This isn't the only way the ARC keeps its size sane; it will also evict/recycle buffers during allocation, as mentioned in the Architecture
section. This reclaim thread is the more aggressive method, so if you have occasional odd ARC behavior it may be handy to check if it is
related to reclaims.

Evicts by list and type


Tracing the function that does eviction, with details:
# cat -n arcevict.d
     1  #!/usr/sbin/dtrace -s
     2
     3  #pragma D option quiet
     4
     5  dtrace:::BEGIN
     6  {
     7          trace("Tracing ARC evicts...\n");
     8  }
     9
    10  fbt::arc_evict:entry
    11  {
    12          printf("%Y %-10a %-10s %-10s %d bytes\n", walltimestamp, args[0],
    13              arg4 == 0 ? "data" : "metadata",
    14              arg3 == 0 ? "evict" : "recycle", arg2);
    15  }

Sample output:
# ./arcevict.d
Tracing ARC evicts...
2012 Jan  8 08:13:03 zfs`ARC_mru   data       evict      812181411 bytes
2012 Jan  8 08:13:03 zfs`ARC_mfu   data       evict      5961212 bytes
2012 Jan  8 08:13:03 zfs`ARC_mfu   data       recycle    131072 bytes
2012 Jan  8 08:13:04 zfs`ARC_mfu   data       recycle    131072 bytes
2012 Jan  8 08:13:07 zfs`ARC_mfu   data       recycle    131072 bytes
2012 Jan  8 08:13:07 zfs`ARC_mfu   data       recycle    131072 bytes
2012 Jan  8 08:13:08 zfs`ARC_mfu   metadata   recycle    16384 bytes
2012 Jan  8 08:13:08 zfs`ARC_mfu   data       recycle    131072 bytes
[...]

The output begins by catching an 800 Mbyte evict from the ARC MRU data list, followed by a 6 Mbyte evict from the MFU data list.
After that, buffers were evicted due to the recycle code path, which recycles buffers when the ARC is getting full instead of allocating
new ones.


To understand (and maintain) the arg mappings above, see the invocations of arc_evict() in arc.c. E.g., from arc_adjust():
        if (adjustment > 0 && arc_mru->arcs_lsize[ARC_BUFC_DATA] > 0) {
                delta = MIN(arc_mru->arcs_lsize[ARC_BUFC_DATA], adjustment);
                (void) arc_evict(arc_mru, NULL, delta, FALSE, ARC_BUFC_DATA);
                adjustment -= delta;
        }

This is the first arc_evict() in arc_adjust(), which is why the ARC MRU data list is hit up first.

And more
The previous tracing examples show the sort of additional information that can be obtained using static tracing (the sdt provider) and
dynamic tracing (the fbt provider). With dynamic tracing, a lot more can be seen as needed. Every function that makes up the ARC can be
traced, along with its arguments.
One detail that is actually difficult to trace is the file names during ARC accesses, since vnode pointers are not passed down to the ARC
layer. It is possible, and has been done before (I don't have an example on hand though). You could more easily cache them from upper
layers (e.g., VFS; see the sample chapter from the DTrace book).

Conclusion
In this post, I examined ZFS ARC activity in detail, starting with statistics provided by kstat and then tracing provided by DTrace. Apart
from calculating hit and miss rates, I discussed other statistics including prefetch and metadata ratios. I then used tracing to observe
information from the ARC including who is using the ARC and why, ARC buffer sizes, the age of the ARC buffers, lock contention
timings and eviction details. More can be traced as needed: ZFS with DTrace provides great performance and observability.
I've spent much time on kernel internals, but I haven't really blogged about the deeper areas. I'm trying to change that, at least
occasionally, starting with this post on ARC activity. I hope it is useful.
Thanks to the original ZFS team, especially Mark Maybee, for writing the ARC and explaining details to me, and to Bryan Cantrill for
kstat-ifying the ARC statistics and creating DTrace.
Posted on January 9, 2012 at 4:50 pm by Brendan Gregg.
In: Kernel. Tagged with: ARC, dtrace, performance, ZFS.

3 Responses

1. Richard Elling, on January 10, 2012 at 10:34 am:

Spent a lot of time last summer looking at evictions and their impact on the system. These do not scale well with the size of
memory in the system. For example, in the arcevict.d data above, evicting 800 MBytes is not a big deal, but evicting 8 GBytes is a
big deal. Look for some illumos putbacks in this area RSN :-)

2. mic, on January 11, 2012 at 2:46 am:

Thanks Brendon, Great Blog.

3. Kyle Hailey, on January 19, 2012 at 11:01 pm:

Awesome to see all this information on ARC analysis. Till now I've found information on these stats pretty sparse. Thanks!
- Kyle

