CS530: Advanced Operating Systems (Fall 2004)

Serverless Network File Systems
(T. Anderson, M. Dahlin, J. Neefe, D. Patterson, D. Roselli, and R. Wang, TOCS ’96)
Background
ƒ RAID (Redundant Arrays of Inexpensive Disks)
• Provides high throughput, data integrity, and availability
• Small writes are expensive (see the sketch below).
• All disks are attached to a single machine.
• Special-purpose hardware is required to compute parity.
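
A minimal sketch (not from the slides or the paper) of why RAID small writes are
expensive: updating a single block under RAID-5-style parity takes four disk I/Os.
The disk layout and function names below are illustrative assumptions.

    def xor_blocks(a: bytes, b: bytes) -> bytes:
        """Byte-wise XOR of two equally sized blocks."""
        return bytes(x ^ y for x, y in zip(a, b))

    def small_write(disks, stripe, data_disk, parity_disk, new_data):
        """Read-modify-write of one block plus its parity (4 disk I/Os)."""
        old_data = disks[data_disk][stripe]        # I/O 1: read old data
        old_parity = disks[parity_disk][stripe]    # I/O 2: read old parity
        # new parity = old parity XOR old data XOR new data
        new_parity = xor_blocks(xor_blocks(old_parity, old_data), new_data)
        disks[data_disk][stripe] = new_data        # I/O 3: write new data
        disks[parity_disk][stripe] = new_parity    # I/O 4: write new parity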

Background
ƒ LFS (Log-structured File System)
• Particularly effective for writing small files.
– Buffers writes in memory and then commits them to disk in
large, contiguous, fixed-size groups called log segments.
• Inode map (Imap) to locate inodes.
– Stored in memory and periodically checkpointed to disk.
• Simple failure recovery
– Checkpointing + Roll forward
• Free disk management through log cleaner
– Coalesces old, partially empty segments into a smaller number
of full segments.
– Cleaning overhead can sometimes be large.
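
A toy sketch of the LFS write path described above, with hypothetical names
(Log, SEGMENT_SIZE, imap): dirty blocks are buffered in memory and committed to
disk as one large contiguous segment, and the imap tracks where each file's
inode currently lives in the log.

    SEGMENT_SIZE = 512 * 1024          # assumed segment size for this sketch

    class Log:
        def __init__(self):
            self.disk = []             # the on-disk log: a list of committed segments
            self.buffer = []           # (inum, block_no, data) tuples awaiting commit
            self.imap = {}             # inum -> log address of the file's latest inode

        def write(self, inum, block_no, data):
            self.buffer.append((inum, block_no, data))
            if sum(len(d) for _, _, d in self.buffer) >= SEGMENT_SIZE:
                self.flush()

        def flush(self):
            """Commit buffered blocks (and their inodes) as one sequential segment."""
            if not self.buffer:
                return
            seg_addr = len(self.disk)
            self.disk.append(list(self.buffer))   # one large, contiguous write
            for inum, _, _ in self.buffer:
                self.imap[inum] = seg_addr        # inode rewritten into this segment
            self.buffer.clear()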

Zebra (1)
ƒ Zebra
• A network file system that uses multiple file servers in
tandem.
• The goal is to provide greater throughput and
availability than can be achieved with a single server.
• Zebra combines two techniques:
– File striping (RAID): higher bandwidth, fault tolerance
– LFS: eliminates the small-write problem common to RAID and file
striping.
• Log-based network file striping.

• J. Hartman and J. Ousterhout, “The Zebra Striped Network File System,” SOSP, 1993.

Zebra (2)
ƒ Per-file Striping
• Small files are difficult to handle efficiently.
– If striped: no performance benefit due to network and disk
latency.
– If not striped: storage overhead due to parity, unbalanced disk
utilization and server loading.
• If an existing file is modified, its parity must be updated.
– Requires two reads and two writes (read the old data and old
parity, write the new data and new parity).
– The two writes must be performed atomically.

Zebra (3)
ƒ Per-client Striping in Zebra
• Parity fragment computation is local.
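
A simplified sketch of the idea: because each client stripes its own append-only
log, the parity fragment of a stripe can be computed locally by XOR-ing the data
fragments the client is about to write, with no reads of old data or old parity
from the storage servers. The store_fragment call is an assumed interface; the
fragment size matches the 512 KB mentioned later for the Zebra prototype.

    FRAGMENT_SIZE = 512 * 1024   # fragment size used in the Zebra prototype

    def xor_fragments(fragments):
        """XOR equally sized data fragments into one parity fragment."""
        parity = bytearray(FRAGMENT_SIZE)
        for frag in fragments:
            for i, byte in enumerate(frag):
                parity[i] ^= byte
        return bytes(parity)

    def write_stripe(log_fragments, storage_servers):
        """Write N data fragments plus one locally computed parity fragment."""
        parity = xor_fragments(log_fragments)
        for server, frag in zip(storage_servers, log_fragments + [parity]):
            server.store_fragment(frag)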

Zebra (4)
ƒ Zebra Architecture
• The file manager and the stripe cleaner can run on any
machine.
• A storage server may also be a client.

Zebra (5)
ƒ Deltas
• Deltas identify changes to the blocks in a file, and are
used to communicate changes between the clients, the
file manager, and the stripe cleaner.
– A client puts a delta into its log when it writes a file block.
– The file manager subsequently reads the delta to update the
metadata.
• Information in deltas:
– File identifier
– File version
– Block number
– Old block pointer (the block’s log address before the write)
– New block pointer (the log address of the newly written copy)
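
The delta fields above as a plain record, for concreteness. This is only a
sketch; the field names (and the use of log addresses for the old/new block
pointers) follow the list above rather than Zebra's exact on-disk format.

    from dataclasses import dataclass

    @dataclass
    class Delta:
        file_id: int          # which file was changed
        file_version: int     # detects stale or out-of-order deltas
        block_number: int     # which block within the file
        old_block_addr: int   # log address the block occupied before the write
        new_block_addr: int   # log address of the newly written copy of the block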

Zebra (6)
ƒ Clients
• Machines where application programs execute.
• Reading a file
– Contacts the file manager
– Determines which stripe fragments store the desired data
– Retrieves the data from the storage servers
• Writing a file
– Appends the new data to its log by creating new stripes to hold
the data
– Computes the parity of the stripes
– Writes the stripes to the storage servers
– A client puts a delta into its log when it writes a file block, and
the file manager subsequently reads the delta to update the
metadata for that block.
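
A sketch of the client write path just described, using the helpers from the
earlier sketches (xor_fragments, Delta) plus hypothetical client_log,
stripe_group, and file_manager objects; it is illustrative, not Zebra's API.

    def commit_log(client_log, stripe_group, file_manager):
        """Flush the client's log as one stripe: data fragments, parity, deltas."""
        fragments = client_log.cut_into_fragments()   # data fragments of one stripe
        parity = xor_fragments(fragments)             # computed locally, no reads
        for server, frag in zip(stripe_group, fragments + [parity]):
            server.store_fragment(frag)               # ship fragments to the servers
        for delta in client_log.deltas():             # one Delta per written block
            file_manager.apply(delta)                 # manager updates its metadata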

Zebra (7)
ƒ Storage Servers
• Storage servers provide five operations:
– Store a fragment
– Append to an existing fragment (if previous write did not fill
the fragment)
– Retrieve a fragment
– Delete a fragment
– Identify fragments
• All fragments are the same size, which should be
chosen large enough to make network and disk
transfers efficient.
– 512KB in the Zebra prototype.
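
The five operations listed above, written out as an abstract interface. This is
a sketch for illustration; the method names and signatures are assumptions, not
the prototype's actual RPC interface.

    from abc import ABC, abstractmethod

    class StorageServer(ABC):
        FRAGMENT_SIZE = 512 * 1024    # fragment size used in the Zebra prototype

        @abstractmethod
        def store_fragment(self, data: bytes) -> int:
            """Store a new fragment; return its identifier."""

        @abstractmethod
        def append_fragment(self, frag_id: int, data: bytes) -> None:
            """Append to a fragment that the previous write did not fill."""

        @abstractmethod
        def retrieve_fragment(self, frag_id: int, offset: int, length: int) -> bytes:
            """Return part of a stored fragment."""

        @abstractmethod
        def delete_fragment(self, frag_id: int) -> None:
            """Free the space occupied by a fragment (used by the stripe cleaner)."""

        @abstractmethod
        def identify_fragments(self) -> list[int]:
            """List the fragments stored on this server (used during recovery)."""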

Zebra (8)
ƒ File Manager
• Stores and manages file metadata.
– For each file, there is one file in the file manager’s file system,
which contains the file’s metadata.
• Handles name lookup.
• Maintains the consistency of client file caches.
– Caches are flushed or disabled when files are opened.

ƒ Stripe Cleaner
• Runs as a user-level process.
• Similar to LFS segment cleaner.

Zebra (9)
ƒ Limitations
• File manager may be a performance bottleneck.
– A single file manager tracks where clients store data blocks
and handles cache consistency operations.
– Clients must contact the file manager on each open and close.
• File manager remains a single point of failure.
– Even though the file manager’s metadata is stored on the storage servers.
• Zebra relies on a single cleaner to create empty
segments.
• Zebra stripes each segment to all of the system’s
storage servers.
– Limits scalability.

Cooperative Caching
ƒ Cooperative Caching
• Use remote client memory to avoid disk accesses.
– Much of client memory is not used.
– Manage client memory as a global resource.
• Greedy forwarding
– Client memory → server → forward the request to a client that is
caching the data → the client sends the data.
– May cause unnecessary data duplication.
• Centrally coordinated caching
– The clients’ local hit rates may be reduced.
• N-chance forwarding (see the sketch below)
– Dynamically adjusts the fraction of each client’s cache devoted to
locally vs. globally cached blocks.
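
A simplified sketch of N-chance forwarding (the block, peer, and cache objects
and the value of N are assumptions for illustration): when a client evicts the
last cached copy of a block (a "singlet"), it forwards the block to a randomly
chosen peer instead of dropping it, recirculating it up to N times.

    import random

    N = 2   # illustrative recirculation limit

    def evict(block, peers):
        if not block.is_singlet:               # other cached copies exist elsewhere
            return                             # safe to simply drop this copy
        if block.recirculation_count >= N:     # the singlet has used up its chances
            return                             # drop it; the copy on disk remains
        block.recirculation_count += 1
        target = random.choice(peers)          # push the singlet into a peer's cache
        target.cache.insert(block)             # the peer evicts something else to make room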

• M. Dahlin, T. Anderson, D. Patterson, and R. Wang, “Cooperative Caching: Using
Remote Client Memory to Improve File System Performance,” OSDI, 1994.

xFS (1)
ƒ Motivation
• The opportunity provided by fast switched LANs
– ATM or Myrinet
– The aggregate bandwidth scales with the number of machines
on the network.
– LANs can be used as an I/O backplane
• The expanding user demands on file systems
– Multimedia
– Process migration
– Parallel processing, etc.
• The fundamental limitations of central server systems
– Performance – cost? partitioning? client caching?
– Availability: a single point of failure – replication?

xFS (2)
ƒ xFS Characteristics
• A serverless network file system that works over a
cluster of cooperative workstations.
• xFS dynamically distributes control processing across
the system.
– Metadata managers
• xFS distributes its data storage across storage server
disks by implementing a software RAID using log-based
network striping.
– But it dynamically clusters disks into stripe groups.
• xFS eliminates central server caching by taking
advantage of cooperative caching to harvest portions of
client memory as a large, global file cache.

xFS (3)
ƒ xFS Installations

xFS Data Structures (1)
ƒ The Manager Map
• Maps file’s index number → manager.
– Some of the index number’s bits are used as an index into the
manager map.
• Globally replicated to all of the managers and all of the
clients.
• xFS can change the mapping from index number to
manager by changing the manager map.
– The map acts as a coarse-grained load balancing mechanism to
split the work of overloaded managers.
– Easy to reconfigure when a manager enters or leaves the system.
• The manager of a file controls two sets of information:
cache consistency state and disk location metadata.
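
A small sketch of the lookup, under assumed constants (MANAGER_MAP_BITS, a
four-manager round-robin map): a few bits of the file's index number select an
entry in a globally replicated table that names the responsible manager.

    MANAGER_MAP_BITS = 8                     # hypothetical map size: 256 entries
    MANAGER_MAP = [i % 4 for i in range(1 << MANAGER_MAP_BITS)]   # 4 managers

    def manager_of(index_number: int) -> int:
        """Return the id of the manager responsible for this index number."""
        entry = index_number & ((1 << MANAGER_MAP_BITS) - 1)
        return MANAGER_MAP[entry]

Reassigning entries of MANAGER_MAP (and re-replicating it) is how the load of an
overloaded manager can be split, or a departed manager's files reassigned.
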
xFS Data Structures (2)
ƒ Imap
• Maps file’s index number → disk log address of file’s
index node.
• xFS distributes the imap among managers according to
the manager map.
– Each manager caches its portion of the imap in memory,
storing it on disk in a special file called the ifile.
– Managers handle the imap entries and cache consistency state
of the same files.
• xFS index node
– Contains the log addresses of the file’s data blocks, as well as
those of its indirect and double-indirect blocks, etc.
– Maps file offset → disk log address of the data block.
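
Putting the pieces together, a sketch of how a data block's log address is found
from an index number and a block offset (the manager, imap, and inode containers
are illustrative; manager_of comes from the previous sketch).

    def locate_block(index_number: int, block_no: int, managers):
        mgr = managers[manager_of(index_number)]   # manager map: index number -> manager
        inode_addr = mgr.imap[index_number]        # imap: index number -> inode's log address
        inode = mgr.read_from_log(inode_addr)      # index node (cached at the manager)
        return inode.block_addrs[block_no]         # inode: block offset -> block's log address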

xFS Data Structures (3)
ƒ File Directories and Index Numbers
• File directory
– Maps file’s name → file’s index number.
– xFS stores directories in regular files.
• Index number
– Key used to locate metadata for a file.
• First writer policy
– When a client creates a file, xFS chooses an index number that
assigns the file’s management to the manager co-located with
that client.
– Significantly improves locality; the number of network hops
needed to satisfy client requests is reduced by over 40%
compared to a centralized manager.
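
A sketch of how such an index number might be chosen (the counter and the bit
layout are assumptions; MANAGER_MAP and MANAGER_MAP_BITS come from the earlier
manager-map sketch): the low bits are picked so that the manager map resolves
them to the manager co-located with the creating client.

    next_local_seq = {}   # per-manager counter for generating fresh index numbers

    def create_index_number(local_manager_id: int) -> int:
        seq = next_local_seq.get(local_manager_id, 0)
        next_local_seq[local_manager_id] = seq + 1
        # any map entry that resolves to the co-located manager will do
        entry = MANAGER_MAP.index(local_manager_id)
        return (seq << MANAGER_MAP_BITS) | entry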

xFS Data Structures (4)
ƒ The Stripe Group Map
• Maps disk log address → list of storage servers.
• Each stripe group includes a separate subset of the
system’s storage servers.
– Each active storage server belongs to exactly one current
stripe group.
– Clients write each segment across a stripe group rather than
across all of the system’s storage servers.
• xFS globally replicates the stripe group map.
• The stripe group map is reconfigured when a storage
server enters or leaves the system.
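
A sketch of what such a map might look like and how it is consulted; the
group-id encoding in the log address and the server names are assumptions for
illustration only.

    STRIPE_GROUP_MAP = {
        0: ["ss0", "ss1", "ss2", "ss3"],   # group 0: one subset of the storage servers
        1: ["ss4", "ss5", "ss6", "ss7"],   # group 1: a disjoint subset
    }

    def servers_for(log_address: int) -> list[str]:
        """Return the storage servers holding the segment at this log address."""
        group_id = log_address >> 48       # assumed: group id kept in the high bits
        return STRIPE_GROUP_MAP[group_id]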

xFS Data Structures (5)
ƒ Why Stripe Groups?
• Otherwise, clients would stripe each segment over all
of the disks in the system.
– Clients would have to send small fragments to each of the many
storage servers, or buffer enormous amounts of data.
• Stripe groups match the aggregate bandwidth of the
groups’ disks to the network bandwidth of a client.
– Use both resources efficiently.
• Stripe groups make cleaning more efficient by limiting
segment size.
• Stripe groups greatly improve availability.
– The system can survive multiple server failures if they happen
to strike different groups.

xFS Operations (1)
ƒ Reads
• Index nodes are cached at managers but not at clients.
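
A simplified sketch of the read path this slide's figure illustrates: the local
cache is tried first, then the block's manager, which either forwards the
request to another client caching the block (cooperative caching) or falls back
to the imap/index-node lookup and the storage servers. The manager fields and
helper calls are assumptions; manager_of and locate_block come from earlier
sketches.

    def read_block(client, index_number, block_no, managers):
        data = client.cache.get((index_number, block_no))
        if data is not None:
            return data                                         # local cache hit
        mgr = managers[manager_of(index_number)]                # consult the manager map
        cachers = mgr.cachers.get((index_number, block_no), set())
        if cachers:                                             # another client caches the block
            return next(iter(cachers)).send_block(index_number, block_no)
        addr = locate_block(index_number, block_no, managers)   # imap + index node
        return mgr.storage_read(addr)                           # last resort: storage servers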

xFS Operations (2)
ƒ Writes
• Clients buffer writes in their local memory until
committed to a stripe group of storage servers.
– Since xFS is log-based, every write changes the disk address of
the modified block.
• After a client commits a segment to a storage server,
the client notifies the modified blocks’ managers.
• The managers update their index nodes and imaps.
– The managers also periodically log these changes to stable
storage.
• The client’s log includes a delta that allows
reconstruction of the manager’s data structures in the
event of a client or manager crash.
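
A sketch of the manager-side update, under assumed field names (inode_cache,
dirty_inodes, metadata_log) and the Delta record from the Zebra sketch: each
delta repoints the file's index node at the block's new log address, and the
change is logged so it can be recovered after a crash.

    def apply_delta(manager, delta):
        # repoint the file's index node at the block's new location in the log
        inode = manager.inode_cache[delta.file_id]
        inode.block_addrs[delta.block_number] = delta.new_block_addr
        # the modified index node (and its imap entry) will be rewritten to the
        # log later; the delta itself is logged for crash recovery
        manager.dirty_inodes.add(delta.file_id)
        manager.metadata_log.append(delta)   # periodically forced to stable storage
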
xFS Cache Consistency (1)
ƒ Cache Consistency
• xFS manages consistency on a per-block rather than
per-file basis.
• A directory-based invalidate cache coherence protocol.
– A client must obtain a read token in order to read a file block.
– A client must obtain a write token in order to overwrite or
modify a block.
– The managers maintain a list of the clients currently caching each block.
– In response to client token requests, the managers send
invalidate messages or forward requests to other clients.
• Each block is in one of three states (see the sketch below):
– Invalid: no valid copy exists.
– Owned: only one copy exists.
– Read-only: multiple copies exist.
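
A sketch of how a block's manager might service token requests in such a
directory-based invalidate protocol. The cachers/state dictionaries and the
client calls (invalidate, forward_block) are assumptions for illustration.

    def acquire_write_token(manager, block, writer):
        holders = manager.cachers.get(block, set())
        for client in holders - {writer}:
            client.invalidate(block)              # revoke all other cached copies
        manager.cachers[block] = {writer}
        manager.state[block] = "Owned"            # exactly one (writable) copy exists

    def acquire_read_token(manager, block, reader):
        if manager.state.get(block) == "Owned":
            owner = next(iter(manager.cachers[block]))
            owner.forward_block(block, reader)    # current owner supplies the data
        manager.cachers.setdefault(block, set()).add(reader)
        manager.state[block] = "Read-only"        # multiple read copies may now exist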

xFS Cache Consistency (2)

[State diagram: a block moves among the Invalid, Owned, and Read-only states;
local reads and writes acquire copies, while reads and writes by other nodes
downgrade or invalidate them.]

xFS Experiences (1)
ƒ Cache Coherence
• It is much more complicated than it looks.
– A lot of transient states.
– 3 formal states → 22 implementation states.
• Ad hoc test-and-retry leaves unknown errors in the system
permanently.
• No one is sure about the correctness.
• Software portability is poor.
• A formal method is needed for verifying cache coherence.
– Used Teapot: a tool for writing memory coherence protocols.

xFS Experiences (2)
ƒ Threads in a Server
• It is a nice concept, but it incurs too much concurrency.
– Too many data races
– The most difficult thing to understand in the world.
– Difficult to debug.
• Solution: iterative server
– Difficult to design but simple to debug.
– Less error-prone and efficient.
ƒ RPC
• Not suitable for multi-party communication.
• Need for gather/scatter RPC to and from multiple servers.

Discussion
ƒ What’s Good
• New workloads and new technologies necessitate innovative
ideas.
• A well-engineered approach that takes advantage of several
research ideas (and makes them work!)
– Scalable cache consistency: DASH, Alewife
– Disk striping: RAID, Zebra
– Log-structured file system: Sprite LFS, BSD LFS
– Cooperative caching
ƒ Limitation
• It is only appropriate in a restricted environment
– Among machines that communicate over a fast network and
that trust each other.

Storage Architecture
                              Protocols
  Interconnection             Block                             File
  -----------------------------------------------------------------------------------
  IP-based Networks           NBD (Network Block Device),       NAS (Network Attached Storage:
                              iSCSI (SCSI over TCP/IP)          NFS, CIFS)
  Non-IP Networks             ATA / SCSI (Direct),              xFS / DAFS (Direct Access File
                              SAN (Storage Area Network:        System: NFS over VIA)
                              Fibre Channel)

SAN + NAS

HP DirectNFS

IBM StorageTank
