

C. Ungureanu, B. Atkin, A. Aranya, et al.

Slides: Joe Buck, CMPS 229, Spring 2010

April 27, 2010


Tuesday, April 27, 2010


✤ What is HydraFS?

✤ Why is it necessary?


HydraFS is a file system on top of HYDRAstor, a scalable, distributed content-addressable store (CAS).

Applications don't write to a CAS interface; they write to a file system interface. An adapter
layer is needed, hence HydraFS.
A CAS uses a put/get model.

✤ What is HYDRAstor?

✤ Immutable data

✤ High Latency

✤ Jitter

✤ Put / Get API


Inconsistent use of capitalization in acronyms

Jitter, in this case, means distance between writes to storage
Mention chunking
Hydra Diagram

[Figure 1: HYDRAstor Architecture. An access node runs HydraFS (File Server and Commit Server) on top of the HYDRAstor Block Access Library. Hydra consists of four storage nodes that form a single-system content-addressable store.]

Paper excerpt: "... HydraFS acts as a front end for the Hydra distributed, content-addressable block store (Figure 1). In this section, we present the characteristics of Hydra and describe the key challenges faced when using it for applications, such as HydraFS, that require high throughput."

[Diagram: six blocks, each labeled 4 KB]

CAS - continued

[Diagram: six blocks, each labeled 4 KB]

The chunker uses a heuristic based on the content of the data, together with hard limits, to
cut the data into variable-sized chunks.
CAS - continued

[Diagram: blocks labeled 4 KB, 4 KB, 4 KB and 2 KB; chunk cas1: 10 KB]

Objects in the CAS have IDs that are pointed to by metadata.

cas1 is 10 KB in size.
CAS - continued

[Diagram: blocks labeled 1 KB and 4 KB; chunks cas1: 10 KB and cas2: 9 KB]

cas2 is 9 KB in size.
A little more on CAS addresses

✤ Same data doesn’t mean the same address

✤ Impossible to calculate prior to write

✤ Foreground processing writes shallow trees

✤ Root cannot be updated until all child nodes are set


Differing retention levels can produce different CAS addresses.

Collisions can be detected but are unlikely.
Writes are done asynchronously, blocking on the root node commit.
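The put/get behaviour described in these notes can be illustrated with a toy in-memory store. This is only a sketch under stated assumptions: the class `ToyCAS`, the `retention` parameter and the use of SHA-256 are hypothetical and are not taken from Hydra's actual interface. It shows two points from the slide: identical data written twice is stored once, and the same data written with a different retention level gets a different address, so the writer only learns the address from the store.

```python
import hashlib


class ToyCAS:
    """In-memory stand-in for a content-addressable block store (hypothetical)."""

    def __init__(self):
        self._blocks = {}

    def put(self, data: bytes, retention: str = "default") -> str:
        # Assumption: the address mixes the content with the retention level,
        # so the same bytes can end up at different addresses.
        digest = hashlib.sha256(retention.encode() + b"\0" + data).hexdigest()
        self._blocks.setdefault(digest, data)  # duplicate writes are stored once
        return digest

    def get(self, address: str) -> bytes:
        return self._blocks[address]

    def __len__(self) -> int:
        return len(self._blocks)


store = ToyCAS()
a1 = store.put(b"hello world")
a2 = store.put(b"hello world")                    # duplicate of the first write
a3 = store.put(b"hello world", retention="long")  # same bytes, other retention
```

Here `a1` and `a2` are equal, `a3` differs, and the store holds two blocks.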
Issues for a CAS FS

✤ Updates are more expensive

✤ Metadata cache misses cause significant performance issues

✤ The combination of high latency and high throughput means lots of buffering




Updates must touch all metadata that points to the affected data.

Buffering allows for optimal write ordering, and the read cache is important as well.
Design Decisions

✤ Decouple data and metadata processing

✤ Fixed size caches with admission control

✤ Second-order cache for metadata



The previous 3 issues lead to 3 design decisions:

1) Decoupling is done via a log, which allows batching of metadata updates.
2) Fixed-size caches with admission control prevent swapping and other resource over-allocation.
3) The second-order cache removes operations from reads via cache hits and improves the metadata cache hit rate.
Issues - continued

✤ Immutable Blocks

✤ FS can only reference blocks already written

✤ Forms DAGs

✤ Height of DAGs needs to be minimized



The entire tree must be updated if a block contained in it is updated, which makes updates quite expensive.
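Why a single block update ripples up to the root can be shown with a small sketch. The assumptions are mine, not the paper's: a binary tree, SHA-256 content addresses, and the hypothetical helpers `put` and `write_tree`. A parent stores the addresses of its children, so it can only be written after them, and changing one leaf forces a new block for every ancestor up to the root.

```python
import hashlib


def put(store: dict, payload: bytes) -> str:
    """Store an immutable block and return its (hypothetical) content address."""
    addr = hashlib.sha256(payload).hexdigest()
    store[addr] = payload
    return addr


def write_tree(store: dict, leaves: list) -> str:
    """Write leaves first, then parents bottom-up; return the root address."""
    level = [put(store, leaf) for leaf in leaves]
    while len(level) > 1:
        # A parent holds its children's addresses, so it can only be
        # written once every child already exists in the store.
        level = [
            put(store, "|".join(level[i:i + 2]).encode())
            for i in range(0, len(level), 2)
        ]
    return level[0]


store = {}
leaves = [b"a", b"b", b"c", b"d"]
root_v1 = write_tree(store, leaves)
blocks_v1 = len(store)                # 4 leaves + 2 parents + 1 root

leaves[2] = b"C"                      # update a single leaf
root_v2 = write_tree(store, leaves)
new_blocks = len(store) - blocks_v1   # new leaf, new parent, new root
```

Unchanged leaves and the untouched parent are deduplicated, so only the path from the modified leaf to the root is written again. A shallower tree means fewer blocks on that path.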
Issues - continued

✤ High latency

✤ Instead of ms to tens of ms latency, Hydra has hundreds of ms to 1 s latency

✤ Stream hints

✤ Delay writes to batch streams together

✤ High degree of parallelism needed to mask high latencies



For an I/O operation Hydra must:

Scan the entire block to compute its content address, compress/decompress, determine the block
location, fragment/defragment using ECCs, and route to/from nodes.
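The need for a high degree of parallelism can be made concrete with a back-of-the-envelope calculation. The sketch below applies Little's law (operations in flight = arrival rate × latency), which is my framing rather than something the slides state. The 100 MB/s target is a hypothetical number and `required_outstanding` is an illustrative helper. The 64 KB block size matches the benchmark notes later in the deck, and the two latencies are in the ranges named on the slide.

```python
import math


def required_outstanding(throughput_mb_s: float, block_kb: float, latency_ms: float) -> int:
    """Blocks that must be in flight to sustain the target throughput."""
    blocks_per_second = throughput_mb_s * 1024 / block_kb
    return math.ceil(blocks_per_second * latency_ms / 1000)


# Hypothetical target of 100 MB/s with 64 KB blocks.
disk_like = required_outstanding(100, 64, 10)     # 10 ms latency
hydra_like = required_outstanding(100, 64, 500)   # 500 ms latency
```

With these inputs, 16 blocks in flight are enough at 10 ms, while 800 are needed at 500 ms.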
Issues - continued

✤ Variable sized blocks

✤ Avoids the “shifting window” problem

✤ Use a balanced tree structure



This is the “chunking” referred to in the paper.

There is a min/max size for chunks.
The balanced tree helps minimize DAG height.
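A minimal sketch of content-defined chunking with hard minimum and maximum sizes follows. It is not the chunker used by HydraFS. The window size, the limits, the boundary mask and the use of `zlib.adler32` as the window hash are all assumptions chosen for brevity. Because boundaries depend on the bytes themselves, inserting one byte at the front of the data leaves most later chunks unchanged, which is how variable-sized blocks avoid the shifting window problem.

```python
import random
import zlib

WINDOW = 16       # bytes examined when deciding on a boundary (assumed)
MIN_SIZE = 64     # hard lower limit on chunk size (assumed)
MAX_SIZE = 1024   # hard upper limit on chunk size (assumed)
MASK = 0x3F       # cut when the low 6 bits of the window hash are zero


def chunk(data: bytes) -> list:
    chunks, start = [], 0
    for i in range(len(data)):
        size = i - start + 1
        if size < MIN_SIZE:
            continue  # never cut before the minimum size
        window_hash = zlib.adler32(data[i + 1 - WINDOW:i + 1])
        if (window_hash & MASK) == 0 or size >= MAX_SIZE:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


rng = random.Random(0)
data = bytes(rng.getrandbits(8) for _ in range(20000))
original = chunk(data)
shifted = chunk(b"X" + data)   # one byte inserted at the front
shared = len(set(original) & set(shifted))
```

Comparing `shared` with `len(original)` shows how many chunks survive the one-byte shift. With fixed-size blocks, every block after the insertion point would change.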
FS design

✤ High Throughput

✤ Minimize the number of dependent I/O operations

✤ Availability guarantees no worse than standard Unix FS

✤ Efficiently support both local and remote access



Close-to-open consistency (an fsync acknowledgment means the data is persisted)

Remote access could be NFS or CIFS
File System Layout

[Figure 2: HydraFS persistent layout. Labels in the order they appear: Super Blocks, Operations File, Imap Handle, Imap B-Tree, Imap Segmented Array, Directory Inode, Regular File Inode, Data Blocks, Inode B-Tree (one under each inode), Directory Blocks, File Contents. Sample directory entries: Filename1 321 R, Filename2 365 R, Filename3 442 D.]

Inode map similar to Log-Structured file system
Files dedup across file systems
HydraFS Software Stack

✤ Uses FUSE

✤ Split into file server and commit server

✤ Simplifies metadata locking

✤ Amortizes the cost of metadata updates via batching

✤ Each server has its own caching strategy



The file server manages the interface to the client, records file modifications in a transaction log
stored in Hydra, and keeps an in-memory cache of recent file modifications.
The commit server reads the transaction log, updates FS metadata, and generates new FS versions.
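The division of labour between the two servers can be sketched as follows. This is a simplified model under my own assumptions: the log is a Python list rather than a structure stored in Hydra, metadata is a flat path-to-address dictionary, and the class and method names are hypothetical. The point it illustrates is that the file server only appends to the log, while the commit server folds many log records into the metadata in one batch and produces one new file system version per batch.

```python
from dataclasses import dataclass, field


@dataclass
class FileServer:
    log: list = field(default_factory=list)     # stands in for the log kept in Hydra
    recent: dict = field(default_factory=dict)  # in-memory cache of recent changes

    def write(self, path: str, address: str) -> None:
        self.log.append((path, address))
        self.recent[path] = address


@dataclass
class CommitServer:
    metadata: dict = field(default_factory=dict)
    version: int = 0
    applied: int = 0   # number of log records already folded into metadata

    def commit(self, log: list) -> int:
        batch = log[self.applied:]
        if batch:
            for path, address in batch:
                self.metadata[path] = address   # later records win
            self.applied = len(log)
            self.version += 1                   # one new FS version per batch
        return self.version


fs, cs = FileServer(), CommitServer()
fs.write("/a", "cas1")
fs.write("/b", "cas2")
fs.write("/a", "cas3")          # second update to the same file
v_first = cs.commit(fs.log)     # three records, one metadata update pass
v_again = cs.commit(fs.log)     # nothing new in the log, no new version
```

Three file modifications cost a single metadata pass here, which is the amortization the slide refers to.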
Writing Data

✤ Data stored in inode specific buffer

✤ Chunked, marked dirty and written to Hydra

✤ After write confirmation, block freed and entered in uncommitted block table

✤ Needed until metadata is flushed to storage

✤ Designed for append writing, in-place updates are expensive



Chunks have a maximum size; when it is reached, a chunk is created.

Writes are cached in memory until Hydra confirms them (this allows reads to be answered in the
meantime and covers failures in Hydra).
Data is not visible in Hydra until a new FS version is created.
Metadata Cleaning

✤ Dirty data kept until the commit server applies changes

✤ New versions of file systems are created periodically

✤ Metadata in separate structures, tagged by time

✤ Always clean (in Hydra), can be dropped from cache at any time

✤ Cleaning allows file servers to drop changes in the new FS version



A new FS version allows a file server to clean its dirty metadata proactively.

Admission Control

✤ Events assume worst-case memory usage

✤ If insufficient resources are available, the event blocks

✤ Limits the number of active events

✤ Memory usage is tuned to the amount of physical memory



Not all memory used is freed when an action completes; the cache is one example. It can be
flushed if the system finds it needs to reclaim memory.
Not swapping is key to keeping latencies low and performance up.
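The admission rule on this slide can be sketched with a condition variable. The class name `AdmissionController`, the byte budget and the `timeout` parameter are hypothetical, and the sketch ignores the cache-flushing path mentioned in the notes. Each event reserves its worst-case memory up front and waits if the reservation would exceed the fixed budget.

```python
import threading


class AdmissionController:
    def __init__(self, budget_bytes: int):
        self._free = budget_bytes            # fixed budget, sized to physical memory
        self._cond = threading.Condition()

    def admit(self, worst_case_bytes: int, timeout=None) -> bool:
        """Reserve the worst-case amount, blocking while memory is short."""
        with self._cond:
            ok = self._cond.wait_for(
                lambda: self._free >= worst_case_bytes, timeout
            )
            if ok:
                self._free -= worst_case_bytes
            return ok

    def release(self, worst_case_bytes: int) -> None:
        """Return a reservation and wake up any blocked events."""
        with self._cond:
            self._free += worst_case_bytes
            self._cond.notify_all()


ac = AdmissionController(100)
first = ac.admit(60)                 # fits in the budget
second = ac.admit(60, timeout=0)     # would exceed the budget, not admitted
ac.release(60)
third = ac.admit(60, timeout=0)      # fits again after the release
```

Without a timeout, `admit` blocks until another event calls `release`, which bounds the number of active events.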
Read Processing

✤ Aggressive read-ahead

✤ Multiple fetches to get metadata

✤ Weighted caching to favor metadata over data

✤ Fast range map

✤ Metadata read-ahead

✤ Primes FRM, cache



Read-ahead goes into an in-memory LRU cache; the default size is 20 MB.

HydraFS caches both metadata and data. It uses large leaf nodes and high-fan-out parent nodes.
The fast range map (FRM) is a look-aside buffer that translates a file offset to a content address.
FRM and BtreeReadAhead add 36% performance for a small memory/CPU overhead.
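A look-aside structure of this kind can be sketched with a sorted list and binary search. The class `FastRangeMap`, its methods and the assumption of non-overlapping ranges are mine; the notes do not describe the actual data structure. A hit returns the content address and the offset within that block, and a miss means the reader has to walk the inode B-tree instead.

```python
import bisect


class FastRangeMap:
    """Look-aside map from file offset ranges to content addresses (sketch)."""

    def __init__(self):
        self._starts = []   # sorted range start offsets
        self._ranges = []   # parallel list of (start, length, address)

    def insert(self, start: int, length: int, address: str) -> None:
        # Assumes ranges do not overlap.
        i = bisect.bisect_left(self._starts, start)
        self._starts.insert(i, start)
        self._ranges.insert(i, (start, length, address))

    def lookup(self, offset: int):
        i = bisect.bisect_right(self._starts, offset) - 1
        if i >= 0:
            start, length, address = self._ranges[i]
            if offset < start + length:
                return address, offset - start
        return None   # miss: fall back to the inode B-tree


frm = FastRangeMap()
frm.insert(4096, 10240, "cas2")   # inserted out of order on purpose
frm.insert(0, 4096, "cas1")
hit = frm.lookup(5000)
edge = frm.lookup(4095)
miss = frm.lookup(20000)
```

Metadata read-ahead would populate such a map ahead of the reader, so sequential reads mostly hit it.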

✤ File deletion removes the entry from the current FS

✤ Data remains until there are no pointers to it



The data will remain in storage until all FS versions that reference it are garbage collected.
A block may be pointed to by other files as well.
The FS only marks roots for deletion, Hydra handles reference counting and storage

[Figure 5: Comparison of raw device and file system throughput. Series: raw block device, file system. y-axis: Normalized Throughput (0.0 to 1.0). Groups: Read (iSCSI), Read (Hydra), Write (iSCSI), Write (Hydra).]

Sequential throughput
iSCSI is 6 disks per node with software RAID 5 (likely the cause of the write hit iSCSI takes)
Block size is 64 KB for iSCSI and Hydra
HydraFS reaches 82% of raw device throughput on reads, 88% on writes
Metadata Intensive

✤ Postmark

✤ Generates files, then issues transactions.

✤ File size: 512 B - 16 KB

              Create           Delete
           Alone    Tx      Alone    Tx
ext3       1,851    68      1,787    68     136
HydraFS       61    28        676    28      57

Table 1: Postmark comparing HydraFS with ext3 on similar hardware

This is a worst case for HydraFS.

File systems had to be created on the fly because of the limit on outstanding metadata updates.
Fewer operations to [...] costs

Paper excerpt: "... limited by the number of inodes HydraFS creates without going through the metadata update in the commit server. [...] does not accumulate a large number of uncommitted blocks that increase the turnaround times for the commit server processing, increasing unpredictably the latency of user operations. In contrast, ext3 has no such limitations and all metadata updates are written to the journal."

Write Performance vs Dedup

[Figure 6: Hydra and HydraFS write throughput with varying duplicate ratio. x-axis: Duplicate Ratio (%), 0 to 80. y-axis: Throughput (MB/s), 0 to 300.]

HydraFS is within 12% of Hydra throughout

Paper excerpt: "... as expected for duplicate data as the number of I/Os to disk is correspondingly reduced. Second, for all cases, the HydraFS throughput is within 12% of the Hydra throughput. Therefore, we conclude that HydraFS meets the desired goal of maintaining high throughput."

Write Behind

[Figure 7: Write completion order. x-axis: Time (s), 0 to 20. y-axis: Offset (GB).]

Helps with buffering. No I/O in the write “critical path”.
A lot of jitter around 6 seconds; the biggest gap is 1.5 GB.
Hydra Latency

[Figure 8: Write event lifetimes. x-axis: Time (ms), 0 to 70. y-axis ticks from 0.4 to 1.]

90th percentile at 10 ms
Point: even though Hydra is jittery and has high latency, HydraFS still works (it smooths things out)
Future Work

✤ Allow multiple nodes to manage same FS

✤ Makes failover transparent and automatic

✤ Exposing snapshots to users

✤ Incorporating SSD storage to lower latencies, make HydraFS usable as primary storage



Thank you

✤ Questions?

✤ Comments?

✤ email:

✤ Paper:



Sample Operations

✤ Block Write

✤ Block Read

✤ Searchable Block Write

✤ Searchable Block Read



Writes trade blocks for CAS addresses; reads invert that.

Labels can group data for retention or deletion. Garbage collection reaps all the data that
isn’t part of a tree anchored by a retention block.
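The retention rule in the note above amounts to a reachability sweep, sketched below. The block names, the dictionary layout (address mapped to child addresses) and the helper functions are hypothetical, and the real store uses reference counting rather than this simple traversal. A block survives as long as some retained root still reaches it.

```python
def reachable(blocks: dict, roots: set) -> set:
    """Return the addresses reachable from the retained roots."""
    live, stack = set(), list(roots)
    while stack:
        addr = stack.pop()
        if addr in live or addr not in blocks:
            continue
        live.add(addr)
        stack.extend(blocks[addr])   # children referenced by this block
    return live


def collect(blocks: dict, roots: set) -> dict:
    """Keep only the blocks anchored by a retained root."""
    live = reachable(blocks, roots)
    return {addr: kids for addr, kids in blocks.items() if addr in live}


# Two file system versions that share a directory block.
blocks = {
    "fs_v1": ["dirA", "file1"],
    "fs_v2": ["dirA", "file2"],
    "dirA": [],
    "file1": [],
    "file2": [],
}
both = collect(blocks, {"fs_v1", "fs_v2"})
after_drop = collect(blocks, {"fs_v2"})   # fs_v1 is marked for deletion
```

Dropping the `fs_v1` root removes `fs_v1` and `file1`, while `dirA` stays because `fs_v2` still points to it.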