Nitin Agrawal
Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau
“For better or for worse, benchmarks shape a field”
David Patterson
Inputs to file-system benchmarking
[Figure: the storage stack: Application, FS logical organization, File System, Disk layout, Storage device]
• Input: Benchmark workload
– Postmark, FileBench, Fstress, Bonnie, IOZone, TPC-C, etc.
• Input: In-memory state
– Cold cache / warm cache
• Input: File-system image
– Anything goes!
FS images in past: use what is convenient
• Typical desktop file system w/ no description (SOSP ’05)
• 5-deep tree, 5 subdirs, 10 8KB files in each (FAST ’04)
• Randomly generated files of several MB (FAST ’08)
Problem scope
Characteristics of file-system images have a strong impact on performance.
We need to incorporate representative file-system images in benchmarking & design.
Requirements for creating FS images
• Access to data on file systems and disk layout
– Properties of file-system metadata [Satyanarayan81,
Mullender84, Irlam93, Sienknecht94, Douceur99, Agrawal07]
– Disk fragmentation [Smith97]
– More such studies in future?
• A technique to create file-system images that is
– Representative: given a set of input distributions
– Controllable: supply additional user constraints
– Reproducible: control & report internal parameters
– Easy to use: for widespread adoption and consensus
Introducing Impressions
• Powerful statistical framework to generate
file-system images
– Takes properties of file-system attributes as input
– Works out underlying statistical details of the image
– Mounted on a disk partition for real benchmarking
– Satisfies the four design goals
• Applying Impressions gives useful insights
– What is the impact on performance and storage size?
– How does an application behave on a real FS image?
Outline
• Introduction
• Generating realistic file-system images
• Applying Impressions: Desktop search
• Conclusion
Overview of Impressions
[Figure: Impressions takes distributions of file-system properties as input and produces a file-system image]
Properties of file-system metadata
“Five-year study of file-system metadata” [FAST07]
(Agrawal, Bolosky, Douceur, Lorch)
Used as exemplar for metadata properties in Impressions
Features of Impressions
• Modes of operation for different usages
– Basic mode: choose default settings for parameters
– Advanced mode: several individually tunable knobs
• Thorough statistical machinery ensures accuracy
– Uses parameterized curve fits
– Allows arbitrary user constraints
– Built-in statistical tests for goodness-of-fit
• Generates namespace, metadata, file content, and
disk fragmentation using above techniques
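The built-in goodness-of-fit check mentioned above can be sketched with a two-sample Kolmogorov-Smirnov statistic. This is an illustrative Python sketch, not Impressions’ actual code; the lognormal parameters and sample sizes are made-up assumptions.

```python
import bisect
import random

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum vertical
    gap between the two empirical CDFs (small value => good fit)."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in sorted(set(a) | set(b)):
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

# Compare a generated sample against a desired distribution
# (both drawn from the same assumed lognormal here).
random.seed(42)
desired = [random.lognormvariate(9.5, 2.0) for _ in range(5000)]
generated = [random.lognormvariate(9.5, 2.0) for _ in range(5000)]
print("KS statistic: %.4f" % ks_statistic(desired, generated))
```

A small statistic means the generated image’s distribution is statistically close to the desired one; a threshold test on this value is one way to automate the accuracy check.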
Creating valid metadata
• Creating file-system namespace
– Uses Generative Model proposed earlier [FAST 07]
– Explains the process of directory tree creation
– Accurately regenerates distributions of directory size and directory depth
Creating namespace
[Figures: directories by namespace depth (fraction of directories, bin size 1) and cumulative % of directories by subdirectory count; Desired (D) and Generated (G) curves match closely]
• Directory tree built by a Monte Carlo run
• Probability of selecting directory i as parent ∝ Count(subdirs of i) + 2
• Incorporates both dirs-by-depth and dirs-by-subdir-count distributions
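The generative model above can be sketched as a Monte Carlo run in Python. This is an illustrative sketch, not the paper’s implementation; the “+2” weighting follows the parent-selection rule on the slide, and the directory count is arbitrary.

```python
import random

def generate_namespace(num_dirs, seed=0):
    """Monte Carlo directory-tree generation: each new directory picks
    its parent with probability proportional to
    (number of subdirectories of the parent) + 2."""
    rng = random.Random(seed)
    parent = [None]   # index 0 is the root
    subdirs = [0]     # current subdirectory count per directory
    for _ in range(1, num_dirs):
        weights = [c + 2 for c in subdirs]
        p = rng.choices(range(len(subdirs)), weights=weights)[0]
        parent.append(p)
        subdirs[p] += 1
        subdirs.append(0)
    return parent, subdirs

def depth(i, parent):
    """Namespace depth of directory i (root has depth 0)."""
    d = 0
    while parent[i] is not None:
        i = parent[i]
        d += 1
    return d

parent, subdirs = generate_namespace(1000)
print("max depth:", max(depth(i, parent) for i in range(len(parent))))
```

The “+2” bias gives every directory, even a childless one, a nonzero chance of gaining subdirectories, which is what lets the model reproduce both the depth and the subdirectory-count distributions.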
Creating valid metadata
• Creating file-system namespace
• Creating files: stepwise process
– File size, file extension, file depth, parent directory
– Uses statistical models & analytical approximations
Example: creating realistic file sizes
[Figure: contribution to used space: fraction of bytes vs. file size (bytes, log scale), with lognormal and hybrid curve fits to the generated data]
File size model:
• Lognormal body, Pareto tail
• Captures the bimodal curve of bytes by file size
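The hybrid model above might be sketched as follows. This is illustrative Python, not the paper’s fitted model: the lognormal and Pareto parameters, the 0.1% tail probability, and the 256MB tail scale are all assumptions.

```python
import random

def sample_file_size(rng, p_tail=0.001, mu=8.5, sigma=2.2,
                     tail_scale=256 * 2**20, alpha=1.1):
    """Hybrid file-size model: a lognormal body for the many small
    files, plus a Pareto tail for the rare huge files that dominate
    total bytes. All parameter values are illustrative."""
    if rng.random() < p_tail:
        return int(rng.paretovariate(alpha) * tail_scale)  # heavy tail
    return int(rng.lognormvariate(mu, sigma))              # body

rng = random.Random(7)
sizes = [sample_file_size(rng) for _ in range(100_000)]
tail_bytes = sum(s for s in sizes if s >= 256 * 2**20)
print("fraction of bytes in files >= 256MB: %.2f" % (tail_bytes / sum(sizes)))
```

The point of the two-part mixture is exactly the bimodality on the slide: almost all files come from the lognormal body, yet a tiny Pareto tail can account for a large share of the bytes.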
Creating files
[Figure: top extensions by count: fraction of files for Desired vs. Generated images, broken down by extension (cpp, dll, exe, gif, h, htm, jpg, null, txt, others)]
• Top 20 extensions account for 50% of files and bytes
Creating files
[Figures: files by namespace depth (fraction of files, bin size 1) and mean bytes per file by namespace depth (log scale); Desired (D) and Generated (G) curves match closely]
File depth model:
• Poisson
• Multiplicative model along with bytes by depth
Creating files
[Figure: files by namespace depth with special directories; Desired (D) and Generated (G), fraction of files vs. namespace depth (bin size 1)]
Parent directory model:
• Inverse polynomial
• Satisfies distribution of directories with file count
Resolving arbitrary constraints
[Figure: fraction of files vs. file size (bytes, log scale): Original (O) vs. Constrained (C) distributions; the constrained curve is contrived to match the target sum]
Constraint: given a count of files and a size distribution, ensure the sum of file sizes matches a desired total file-system size.
The result is accurate both for the sum and for the distribution.
Resolving arbitrary constraints
• Arbitrarily specified on file system parameters
• Variant of NP-complete “Subset Sum Problem”
– Approximation algorithm based solution (in paper)
– Oversampling to get additional sample values
– Local improvement to iteratively converge to the
desired sum by identifying best-fit in current sample
• While satisfying the constraints, the constrained distribution also retains its original characteristics
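The oversampling-plus-local-improvement idea can be sketched as below. This is a simplified Python illustration of the approach, not the paper’s approximation algorithm; the distribution parameters, target size, and iteration counts are arbitrary assumptions.

```python
import random

def constrain_sum(sample_fn, count, target, oversample=10, iters=5000, seed=1):
    """Oversample candidate values from the input distribution, then
    locally improve: swap a chosen value for an unchosen one whenever
    the swap moves the subset sum closer to the target."""
    rng = random.Random(seed)
    pool = [sample_fn(rng) for _ in range(count * oversample)]
    chosen, spare = pool[:count], pool[count:]
    total = sum(chosen)
    for _ in range(iters):
        i = rng.randrange(len(chosen))
        j = rng.randrange(len(spare))
        new_total = total - chosen[i] + spare[j]
        if abs(new_total - target) < abs(total - target):
            chosen[i], spare[j] = spare[j], chosen[i]
            total = new_total
    return chosen

target = 50_000_000  # desired total file-system size in bytes (assumed)
sizes = constrain_sum(lambda r: int(r.lognormvariate(8.5, 2.2)), 1000, target)
print("relative error: %.4f" % (abs(sum(sizes) - target) / target))
```

Because every kept value still comes from the original sample pool, the constrained set retains the input distribution’s shape even as its sum converges to the target, which is the property the slide claims.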
Interpolation and extrapolation
• Why don’t we just use available data values?
– Limited to empirical data in input dataset
– “What-if” analysis beyond available dataset
– More efficient to maintain compact curve fits and use interpolation/extrapolation than to store all the data
• Technique: Piecewise interpolation
Interpolation technique & accuracy
[Figures: piecewise interpolation of fraction of bytes vs. file size for 10/50/100 GB file systems, with an inset of the segment-19 value vs. file-system size; the interpolated 75 GB and extrapolated 125 GB curves closely track the real distributions]
• Each distribution is broken down into segments
• Data points within a segment are used for a curve fit
• Segment interpolations are combined to produce the new curve
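The per-segment technique above can be sketched as follows. This is toy Python with made-up bin values; the real system fits a curve per segment, whereas this sketch interpolates each segment linearly, and the 10/50/100 GB data points are assumptions.

```python
def interpolate_bins(known, target):
    """Piecewise interpolation: each segment (bin) of the distribution
    is interpolated independently across file-system sizes; targets
    outside the observed range are linearly extrapolated from the
    nearest pair of observed sizes."""
    sizes = sorted(known)
    if target <= sizes[0]:
        lo, hi = sizes[0], sizes[1]
    elif target >= sizes[-1]:
        lo, hi = sizes[-2], sizes[-1]
    else:
        lo = max(s for s in sizes if s <= target)
        hi = min(s for s in sizes if s >= target)
        if lo == hi:
            return list(known[lo])
    t = (target - lo) / (hi - lo)
    return [a + t * (b - a) for a, b in zip(known[lo], known[hi])]

# Fraction-of-bytes per file-size bin, observed at 10/50/100 GB (toy numbers).
known = {10: [0.50, 0.30, 0.20], 50: [0.40, 0.35, 0.25], 100: [0.30, 0.40, 0.30]}
print(interpolate_bins(known, 75))    # interpolation between 50 and 100 GB
print(interpolate_bins(known, 125))   # extrapolation beyond 100 GB
```

Treating each bin independently is what makes “what-if” analysis cheap: a 75 GB or 125 GB image needs only the compact per-segment fits, not the full empirical dataset.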
File content
• Files having natural language content
– Word-popularity model (heavy tailed)
– Word-length frequency model (for the long tail)
• Content for other files (mp3, gif, mpeg, etc.)
– Impressions generates valid header/footer
– Uses third-party libraries and software
Disk layout and fragmentation
• Simplistic technique
– Layout Score for degree of fragmentation [Smith97]
– Pairs of file create/delete operations until the desired layout score is achieved
• In future, more nuanced ways are possible
– Out-of-order file writes, writes with long delays
– Access to file-system-specific interfaces
• FIBMAP in Linux, XFS_IOC_GETBMAP for XFS
– Perhaps a tool complementary to Impressions
[Figure: File 1 has 1 non-contiguous block out of 8, Layout Score = 7/8; File 2 has all blocks contiguous, Layout Score = 1 (6/6)]
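One reading of the Layout Score, consistent with the 7/8 example above, counts each block after a file’s first as contiguous or not with its predecessor on disk. The exact definition is an assumption here; this Python sketch is illustrative, not Smith97’s implementation.

```python
def layout_score(blocks):
    """Layout Score for one file, given its on-disk block numbers in
    logical order: the fraction of blocks that are contiguous with
    the preceding block (the first block, having no predecessor, is
    counted as contiguous)."""
    if not blocks:
        return 1.0
    contiguous = 1  # first block counted as contiguous by convention
    for prev, cur in zip(blocks, blocks[1:]):
        contiguous += (cur == prev + 1)
    return contiguous / len(blocks)

# File 1: one non-contiguous block out of 8 -> 7/8.
print(layout_score([10, 11, 12, 13, 20, 21, 22, 23]))  # → 0.875
# File 2: all 6 blocks contiguous -> 1.0.
print(layout_score([30, 31, 32, 33, 34, 35]))          # → 1.0
```

With a metric like this, the create/delete loop on the slide simply repeats until the aggregate score across files drops to the desired level of fragmentation.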
Applying Impressions
• Case study: desktop search
– Google Desktop for Linux (GDL) and Beagle
– Metrics of interest:
• Size of index, time to build initial search index
– Identifying application bugs and policies
• GDL doesn’t index content beyond 10-deep
• Computing realistic rules of thumb
– Overhead of metadata replication?
Impact of file content
[Figure: index size / FS size (log scale) for Beagle and GDL, with three content types: text (1 word), text (model), and binary]
• File content has a significant effect: around a 300% increase in index size for both GDL & Beagle
• Understanding design: GDL’s index is smaller than Beagle’s for text files, larger for binary files
Impact of metadata and content
[Figure: relative index size for Beagle under different indexing schemes, each with default, text, image, and binary file content]
Developer aid: understanding the impact of different file-system content & different indexing schemes
Impressions download (coming soon)
http://www.cs.wisc.edu/adsl/Software/Impressions