Nitin Agrawal
Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau
“For better or for worse, benchmarks shape a field”
David Patterson
Inputs to file-system benchmarking
[Figure: the storage stack: Application, FS logical organization, File System, Disk layout, Storage device]
• Input: Benchmark workload
– Postmark, FileBench, Fstress, Bonnie, IOZone, TPC-C, etc.
• Input: In-memory state
– Cold cache / warm cache
• Input: File-system image
– Anything goes!
FS images in past: use what is convenient
• Typical desktop file system w/ no description (SOSP ’05)
• 5-deep tree, 5 subdirs, 10 8KB files in each (FAST ’04)
• Randomly generated files of several MB (FAST ’08)
Problem scope
Characteristics of file-system images have a strong impact on performance.
We need to incorporate representative file-system images in benchmarking & design.
Requirements for creating FS images
• Access to data on file systems and disk layout
– Properties of file-system metadata [Satyanarayan81,
Mullender84, Irlam93, Sienknecht94, Douceur99, Agrawal07]
– Disk fragmentation [Smith97]
– More such studies in future?
• A technique to create file-system images that is
– Representative: given a set of input distributions
– Controllable: supply additional user constraints
– Reproducible: control & report internal parameters
– Easy to use: for widespread adoption and consensus
Introducing Impressions
• Powerful statistical framework to generate
file-system images
– Takes properties of file-system attributes as input
– Works out underlying statistical details of the image
– Mounted on a disk partition for real benchmarking
– Satisfies the four design goals
• Applying Impressions gives useful insights
– What is the impact on performance and storage size?
– How does an application behave on a real FS image?
Outline
• Introduction
• Generating realistic file-system images
• Applying Impressions: Desktop search
• Conclusion
Overview of Impressions
[Figure: Impressions takes distributions of file-system properties as input and produces a file-system image]
Properties of file-system metadata
“Five-year study of file-system metadata” [FAST07]
(Agrawal, Bolosky, Douceur, Lorch)
Used as exemplar for metadata properties in Impressions
Features of Impressions
• Modes of operation for different usages
– Basic mode: choose default settings for parameters
– Advanced mode: several individually tunable knobs
• Thorough statistical machinery ensures accuracy
– Uses parameterized curve fits
– Allows arbitrary user constraints
– Built-in statistical tests for goodness-of-fit
• Generates namespace, metadata, file content, and
disk fragmentation using above techniques
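The built-in goodness-of-fit check mentioned above can be sketched with a two-sample Kolmogorov-Smirnov statistic. This is an illustrative Python sketch, not Impressions’ actual code; the lognormal parameters and sample sizes are made-up assumptions.

```python
import bisect
import random

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum vertical
    gap between the two empirical CDFs (small value => good fit)."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in sorted(set(a) | set(b)):
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

# Compare a generated sample against a desired distribution
# (both drawn from the same assumed lognormal here).
random.seed(42)
desired = [random.lognormvariate(9.5, 2.0) for _ in range(5000)]
generated = [random.lognormvariate(9.5, 2.0) for _ in range(5000)]
print("KS statistic: %.4f" % ks_statistic(desired, generated))
```

A small statistic means the generated image’s distribution is statistically close to the desired one; a threshold test on this value is one way to automate the accuracy check.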
Creating valid metadata
• Creating file-system namespace
– Uses Generative Model proposed earlier [FAST 07]
– Explains the process of directory tree creation
– Accurately regenerates distributions of directory size and directory depth
Creating namespace
[Figures: directories by namespace depth (fraction of directories, bin size 1) and cumulative % of directories by subdirectory count; Desired (D) and Generated (G) curves match closely]
• Directory tree built by a Monte Carlo run
• Probability of selecting directory i as parent ∝ Count(subdirs of i) + 2
• Incorporates both dirs-by-depth and dirs-by-subdir-count distributions
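The generative model above can be sketched as a Monte Carlo run in Python. This is an illustrative sketch, not the paper’s implementation; the “+2” weighting follows the parent-selection rule on the slide, and the directory count is arbitrary.

```python
import random

def generate_namespace(num_dirs, seed=0):
    """Monte Carlo directory-tree generation: each new directory picks
    its parent with probability proportional to
    (number of subdirectories of the parent) + 2."""
    rng = random.Random(seed)
    parent = [None]   # index 0 is the root
    subdirs = [0]     # current subdirectory count per directory
    for _ in range(1, num_dirs):
        weights = [c + 2 for c in subdirs]
        p = rng.choices(range(len(subdirs)), weights=weights)[0]
        parent.append(p)
        subdirs[p] += 1
        subdirs.append(0)
    return parent, subdirs

def depth(i, parent):
    """Namespace depth of directory i (root has depth 0)."""
    d = 0
    while parent[i] is not None:
        i = parent[i]
        d += 1
    return d

parent, subdirs = generate_namespace(1000)
print("max depth:", max(depth(i, parent) for i in range(len(parent))))
```

The “+2” bias gives every directory, even a childless one, a nonzero chance of gaining subdirectories, which is what lets the model reproduce both the depth and the subdirectory-count distributions.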
Creating valid metadata
• Creating file-system namespace
• Creating files: stepwise process
– File size, file extension, file depth, parent directory
– Uses statistical models & analytical approximations
Example: creating realistic file sizes
[Figure: contribution to used space: fraction of bytes vs. file size (bytes, log scale), with lognormal and hybrid curve fits to the generated data]
File size model:
• Lognormal body, Pareto tail
• Captures the bimodal curve of bytes by file size
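The hybrid model above might be sketched as follows. This is illustrative Python, not the paper’s fitted model: the lognormal and Pareto parameters, the 0.1% tail probability, and the 256MB tail scale are all assumptions.

```python
import random

def sample_file_size(rng, p_tail=0.001, mu=8.5, sigma=2.2,
                     tail_scale=256 * 2**20, alpha=1.1):
    """Hybrid file-size model: a lognormal body for the many small
    files, plus a Pareto tail for the rare huge files that dominate
    total bytes. All parameter values are illustrative."""
    if rng.random() < p_tail:
        return int(rng.paretovariate(alpha) * tail_scale)  # heavy tail
    return int(rng.lognormvariate(mu, sigma))              # body

rng = random.Random(7)
sizes = [sample_file_size(rng) for _ in range(100_000)]
tail_bytes = sum(s for s in sizes if s >= 256 * 2**20)
print("fraction of bytes in files >= 256MB: %.2f" % (tail_bytes / sum(sizes)))
```

The point of the two-part mixture is exactly the bimodality on the slide: almost all files come from the lognormal body, yet a tiny Pareto tail can account for a large share of the bytes.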
Creating files
[Figure: top extensions by count: fraction of files for Desired vs. Generated images, broken down by extension (cpp, dll, exe, gif, h, htm, jpg, null, txt, others)]
• Top 20 extensions account for 50% of files and bytes
Creating files
[Figures: files by namespace depth (fraction of files, bin size 1) and mean bytes per file by namespace depth (log scale); Desired (D) and Generated (G) curves match closely]
File depth model:
• Poisson
• Multiplicative model along with bytes by depth
Creating files
[Figure: files by namespace depth with special directories; Desired (D) and Generated (G), fraction of files vs. namespace depth (bin size 1)]
Parent directory model:
• Inverse polynomial
• Satisfies distribution of directories with file count
Resolving arbitrary constraints
[Figure: fraction of files vs. file size (bytes, log scale): Original (O) vs. Constrained (C) distributions; the constrained curve is contrived to match the target sum]
Constraint: given a count of files and a size distribution, ensure the sum of file sizes matches a desired total file-system size.
The result is accurate both for the sum and for the distribution.
Resolving arbitrary constraints
• Arbitrarily specified on file system parameters
• Variant of NP-complete “Subset Sum Problem”
– Approximation algorithm based solution (in paper)
– Oversampling to get additional sample values
– Local improvement to iteratively converge to the
desired sum by identifying best-fit in current sample
• While satisfying the constraints, the constrained distribution also retains its original characteristics
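The oversampling-plus-local-improvement idea can be sketched as below. This is a simplified Python illustration of the approach, not the paper’s approximation algorithm; the distribution parameters, target size, and iteration counts are arbitrary assumptions.

```python
import random

def constrain_sum(sample_fn, count, target, oversample=10, iters=5000, seed=1):
    """Oversample candidate values from the input distribution, then
    locally improve: swap a chosen value for an unchosen one whenever
    the swap moves the subset sum closer to the target."""
    rng = random.Random(seed)
    pool = [sample_fn(rng) for _ in range(count * oversample)]
    chosen, spare = pool[:count], pool[count:]
    total = sum(chosen)
    for _ in range(iters):
        i = rng.randrange(len(chosen))
        j = rng.randrange(len(spare))
        new_total = total - chosen[i] + spare[j]
        if abs(new_total - target) < abs(total - target):
            chosen[i], spare[j] = spare[j], chosen[i]
            total = new_total
    return chosen

target = 50_000_000  # desired total file-system size in bytes (assumed)
sizes = constrain_sum(lambda r: int(r.lognormvariate(8.5, 2.2)), 1000, target)
print("relative error: %.4f" % (abs(sum(sizes) - target) / target))
```

Because every kept value still comes from the original sample pool, the constrained set retains the input distribution’s shape even as its sum converges to the target, which is the property the slide claims.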
Interpolation and extrapolation
• Why don’t we just use available data values?
– Limited to empirical data in input dataset
– “What-if” analysis beyond available dataset
– More efficient to maintain compact curve fits and use interpolation/extrapolation than to store all the data
• Technique: Piecewise interpolation
Interpolation technique & accuracy
[Figures: piecewise interpolation of fraction of bytes vs. file size for 10/50/100 GB file systems, with an inset of the segment-19 value vs. file-system size; the interpolated 75 GB and extrapolated 125 GB curves closely track the real distributions]
• Each distribution is broken down into segments
• Data points within a segment are used for a curve fit
• Segment interpolations are combined to produce the new curve
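The per-segment technique above can be sketched as follows. This is toy Python with made-up bin values; the real system fits a curve per segment, whereas this sketch interpolates each segment linearly, and the 10/50/100 GB data points are assumptions.

```python
def interpolate_bins(known, target):
    """Piecewise interpolation: each segment (bin) of the distribution
    is interpolated independently across file-system sizes; targets
    outside the observed range are linearly extrapolated from the
    nearest pair of observed sizes."""
    sizes = sorted(known)
    if target <= sizes[0]:
        lo, hi = sizes[0], sizes[1]
    elif target >= sizes[-1]:
        lo, hi = sizes[-2], sizes[-1]
    else:
        lo = max(s for s in sizes if s <= target)
        hi = min(s for s in sizes if s >= target)
        if lo == hi:
            return list(known[lo])
    t = (target - lo) / (hi - lo)
    return [a + t * (b - a) for a, b in zip(known[lo], known[hi])]

# Fraction-of-bytes per file-size bin, observed at 10/50/100 GB (toy numbers).
known = {10: [0.50, 0.30, 0.20], 50: [0.40, 0.35, 0.25], 100: [0.30, 0.40, 0.30]}
print(interpolate_bins(known, 75))    # interpolation between 50 and 100 GB
print(interpolate_bins(known, 125))   # extrapolation beyond 100 GB
```

Treating each bin independently is what makes “what-if” analysis cheap: a 75 GB or 125 GB image needs only the compact per-segment fits, not the full empirical dataset.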
File content
• Files having natural language content
– Word-popularity model (heavy tailed)
– Word-length frequency model (for the long tail)
• Content for other files (mp3, gif, mpeg, etc.)
– Impressions generates valid header/footer
– Uses third-party libraries and software
Disk layout and fragmentation
• Simplistic technique
– Layout Score for degree of fragmentation [Smith97]
– Pairs of file create/delete operations until the desired layout score is achieved
• In future, more nuanced ways are possible
– Out-of-order file writes, writes with long delays
– Access to file-system-specific interfaces
• FIBMAP in Linux, XFS_IOC_GETBMAP for XFS
– Perhaps a tool complementary to Impressions
[Figure: File 1 has 1 non-contiguous block out of 8, Layout Score = 7/8; File 2 has all blocks contiguous, Layout Score = 1 (6/6)]
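One reading of the Layout Score, consistent with the 7/8 example above, counts each block after a file’s first as contiguous or not with its predecessor on disk. The exact definition is an assumption here; this Python sketch is illustrative, not Smith97’s implementation.

```python
def layout_score(blocks):
    """Layout Score for one file, given its on-disk block numbers in
    logical order: the fraction of blocks that are contiguous with
    the preceding block (the first block, having no predecessor, is
    counted as contiguous)."""
    if not blocks:
        return 1.0
    contiguous = 1  # first block counted as contiguous by convention
    for prev, cur in zip(blocks, blocks[1:]):
        contiguous += (cur == prev + 1)
    return contiguous / len(blocks)

# File 1: one non-contiguous block out of 8 -> 7/8.
print(layout_score([10, 11, 12, 13, 20, 21, 22, 23]))  # → 0.875
# File 2: all 6 blocks contiguous -> 1.0.
print(layout_score([30, 31, 32, 33, 34, 35]))          # → 1.0
```

With a metric like this, the create/delete loop on the slide simply repeats until the aggregate score across files drops to the desired level of fragmentation.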
Applying Impressions
• Case study: desktop search
– Google Desktop for Linux (GDL) and Beagle
– Metrics of interest:
• Size of index, time to build initial search index
– Identifying application bugs and policies
• GDL doesn’t index content beyond 10-deep
• Computing realistic rules of thumb
– Overhead of metadata replication?
Impact of file content
[Figure: index size / FS size (log scale) for Beagle and GDL, with three content types: text (1 word), text (model), and binary]
• File content has a significant effect: around a 300% increase in index size for both GDL & Beagle
• Understanding design: GDL’s index is smaller than Beagle’s for text files, larger for binary files
Impact of metadata and content
[Figure: relative index size for Beagle under different indexing schemes, each with default, text, image, and binary file content]
Developer aid: understanding the impact of different file-system content & different indexing schemes
Impressions download (coming soon)
http://www.cs.wisc.edu/adsl/Software/Impressions