Jonghyun Lee*§ Xiaosong Ma†§ Robert Ross* Rajeev Thakur* Marianne Winslett‡
*Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL 60439, U.S.A.
†Department of Computer Science, North Carolina State University, Raleigh, NC 27695, U.S.A.
‡Department of Computer Science, University of Illinois, Urbana, IL 61801, U.S.A.
§Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, U.S.A.
{jlee, rross, thakur}@mcs.anl.gov, ma@csc.ncsu.edu, winslett@cs.uiuc.edu
used as a means of data transfer for RFS. Kangaroo [18] is a wide-area data movement system, designed to provide data transfer services with high availability and reliability. Both RFS and Kangaroo optimize remote output operations by staging the data locally and transferring them in the background. However, our work differs from Kangaroo in that we view the remote I/O problem from the MPI-IO perspective and thus address collective I/O, noncontiguous I/O, and MPI-IO consistency issues in the remote I/O domain, while Kangaroo focuses on more basic and lower-level remote I/O solutions. Also, Kangaroo relies on disk staging only, while RFS performs hybrid staging that uses both disks and available memory through ABT. Kangaroo adopts a chainable architecture, consisting of servers that receive data and disk-stage them locally and movers that read the staged data and send them to another server. This is especially useful if links between the two end points are slow or down. Like GridFTP, Kangaroo can be used to transfer data for RFS.

2.2 Active Buffering with Threads

Active buffering [14] reduces apparent I/O cost by aggressively caching output data using a hierarchy of buffers allocated from the available memory of the processors participating in a run and writing the cached data in the background after computation resumes. Traditional buffering aggregates small or noncontiguous writes into long, sequential writes to speed them, but active buffering tries instead to completely hide the cost of writing. Active buffering has no hard buffer space requirement; it buffers the data whenever possible with whatever memory is available. This scheme is particularly attractive for applications with periodic writes, because in-core simulations do not normally reread their output in the same run. Thus, once output data are buffered in memory, computation can resume before the data actually reach the file system. Also, computation phases are often long enough to hide the cost of writing all the buffered output to disk. Unlike asynchronous writes provided by the file system, active buffering is transparent to users and allows user code to safely rewrite the output buffers right after a write call. Active buffering can also help when asynchronous I/O is not available.

Active buffering originally used dedicated processors for buffering and background I/O [12]. Later, active buffering with threads [14] was proposed for I/O architectures that do not use dedicated I/O processors, such as ROMIO. In ABT, data are still buffered using available memory, but the background I/O is performed by a thread spawned on each processor. Local I/O performance obtained from the ABT-enabled ROMIO shows that even without dedicated I/O processors, active buffering efficiently hides the cost of periodic output, with only a small slowdown from concurrent computation and I/O [14].

3 Design

As mentioned earlier, RFS exploits the intermediate ADIO layer in ROMIO for portability. ADIO defines a set of basic I/O interfaces that are used to implement more complex, higher-level I/O interfaces such as MPI-IO. For each supported file system, ADIO requires a separate implementation (called a module) of its I/O interfaces. A generic implementation is also provided for a subset of ADIO functions. When both implementations exist, either the generic function is called first, and it may in turn call the file system-specific function, or the file system-specific function is directly called.

RFS has two components, a client-side RFS ADIO module and a server-side request handler. On the client where the application is running, remote I/O routines are placed in a new ADIO module also called RFS. When called, RFS functions communicate with the request handler located at the remote server to carry out the requested I/O operation. On the server where the remote file system resides, the request handler is implemented on top of ADIO. When it receives I/O requests from the client, the server calls the appropriate ADIO function for the local file system at the server. Figure 1 illustrates this architecture.

More detail on the design and implementation of RFS is presented below.

3.1 RFS ADIO Module

The goal of the RFS project is to provide a simple and flexible implementation of remote file access that minimizes data transfer, hides data transfer costs through asynchronous operation, and supports the MPI-IO consistency semantics. To this end, we implemented the following basic RFS ADIO functions:¹

• RFS_Open, RFS_Close - open and close a remote file.

• RFS_WriteContig, RFS_ReadContig - write and read a contiguous portion of an open file.

• RFS_WriteNoncontig, RFS_ReadNoncontig - write and read a noncontiguous portion of an open file.

These RFS functions take the same arguments as do the corresponding ADIO functions for other file system modules. One requirement for RFS_Open is that the file name contain the host name where the remote file system resides

¹The prefix RFS_ denotes an RFS-specific function. Generic ADIO functions start with the prefix ADIO_.
Figure 1. RFS architecture. The bold arrows show the data flow for an ABT-enabled RFS operation that writes the data to the Unix file system on the remote server.
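The ADIO dispatch convention described in Section 3 (either the generic implementation is called first and may call the file system-specific function, or the specific function is called directly) can be sketched with a function-pointer table. This is a minimal illustration, not ROMIO's actual ADIO definitions: the names ad_module, generic_write_noncontig, and the in-memory "file system" are ours.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch of ADIO-style dispatch: each file system module
 * fills in a table of I/O operations; a slot left NULL falls back to a
 * generic implementation built on the module's contiguous I/O. */
struct ad_module {
    const char *name;
    /* file system-specific contiguous write; always provided */
    size_t (*write_contig)(char *fs, size_t off, const char *buf, size_t len);
    /* optional specific noncontiguous write; NULL means "use generic" */
    size_t (*write_noncontig)(char *fs, const size_t *offs, const char *bufs[],
                              const size_t *lens, int n);
};

/* Generic noncontiguous write: loop over the pieces, calling the module's
 * own contiguous write for each one. */
static size_t generic_write_noncontig(struct ad_module *m, char *fs,
                                      const size_t *offs, const char *bufs[],
                                      const size_t *lens, int n) {
    size_t total = 0;
    for (int i = 0; i < n; i++)
        total += m->write_contig(fs, offs[i], bufs[i], lens[i]);
    return total;
}

/* Dispatch rule: call the specific function if the module provides one,
 * otherwise fall back to the generic implementation. */
static size_t ad_write_noncontig(struct ad_module *m, char *fs,
                                 const size_t *offs, const char *bufs[],
                                 const size_t *lens, int n) {
    if (m->write_noncontig)
        return m->write_noncontig(fs, offs, bufs, lens, n);
    return generic_write_noncontig(m, fs, offs, bufs, lens, n);
}

/* A toy "file system": an in-memory byte array. */
static size_t ufs_write_contig(char *fs, size_t off, const char *buf, size_t len) {
    memcpy(fs + off, buf, len);
    return len;
}
```

A module that supplies write_noncontig is dispatched to directly; one that leaves the slot NULL transparently gets the generic loop over its contiguous write.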
and the host port number where the server request handler listens. For example, if we need to access the file system on elephant.cs.uiuc.edu through port 12345, we use the prefix rfs:elephant.cs.uiuc.edu.12345: before the file name.²

The RFS implementation of the remaining functions required for an ADIO module can be divided into three categories. First, some ADIO functions have a generic implementation that calls ADIO_WriteContig, ADIO_WriteNoncontig, ADIO_ReadContig, or ADIO_ReadNoncontig. With the RFS implementation of those functions, the ADIO functions that have a generic implementation can still be used without any changes. For example, ADIO_WriteColl, an ADIO function for collective writes, can use the RFS implementation of ADIO_WriteContig or ADIO_WriteNoncontig for all data transfer. Second, like the seek operation in ordinary file systems, some ADIO function calls have no discernible effect until a subsequent call is made. In order to reduce network traffic, these ADIO function calls can be deferred and piggybacked onto later messages. For such functions, RFS provides a simple client-side implementation that checks for errors and returns control immediately to the application. For example, when ADIO_Set_view is called at the client by MPI_File_set_view to determine how data will be stored in a file, the client implementation first checks for errors and returns. Then RFS waits until the next read or write operation on that file and sends the view information to the server together with the I/O operation. The appropriate implementation of ADIO_Set_view is chosen by the server based on its local file system. The user can also choose to defer RFS_Open until the first read or write operation on the file by passing a hint, and can defer RFS_Close until all the buffered write operations are completed. Third, the ADIO functions that cannot be implemented in the previous two ways have their own implementation in RFS (e.g., ADIO_Delete to delete files with a given file name).

Providing specialized noncontiguous I/O support is key in local I/O but is even more important in the remote I/O domain because latencies are higher. For noncontiguous file access, ROMIO uses data sieving [20] to avoid noncontiguous small I/O requests when support for noncontiguous I/O is not available from the ADIO implementation. For noncontiguous reads, ROMIO first reads the entire extent of the requested data and then selects the appropriate pieces of data. For writes, ROMIO reads the whole extent into a buffer, updates the buffer with pieces of output data, and writes the whole buffer again. This approach makes sense in the local I/O environment, where the cost of moving additional data is relatively low. However, in the network-constrained environment of remote I/O, reducing the amount of data to be moved is just as important as reducing the number of operations.

RFS's specialized implementation can significantly reduce the amount of data transferred in both the read and write cases. This is especially useful in the write case because, for a noncontiguous write, we would otherwise be required to read a large region from across the network, modify it, and write it back. The RFS server can use data sieving locally at the server to optimize local data access.

For noncontiguous writes, RFS packs the data to be written using MPI_Pack and sends the packed data as the MPI_PACKED datatype to the server, to reduce the amount of data transferred. Similarly, for noncontiguous reads, data are first read into contiguous buffer space on the server, sent back to the client, and unpacked by the client using the user-specified datatype. In both cases, the datatype that describes how the data should be stored in memory (called the buffer

²ROMIO's file naming convention is to use the prefix <file system name>: to specify the file system to be used.
RFS_handle RFS_Make_connection(char *host, int port);
int RFS_Writen(RFS_handle handle, char *buf, int count);
int RFS_Readn(RFS_handle handle, char *buf, int count);
int RFS_Close_connection(RFS_handle handle);

Figure 2. The RFS communication interface functions.
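These four primitives only need to provide a reliable, connection-oriented byte stream, so any such transport can back them. Below is a minimal sketch over TCP/IP (the protocol the current implementation uses); the RFS_handle layout and the exact-count read/write loops are our own illustration, not the RFS source.

```c
#include <assert.h>
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Illustrative handle: just a connected stream socket. */
typedef struct { int fd; } RFS_handle;

/* Connect to host:port over TCP; returns a handle with fd = -1 on error. */
RFS_handle RFS_Make_connection(const char *host, int port) {
    RFS_handle h = { -1 };
    char portstr[16];
    struct addrinfo hints, *res = NULL;
    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;
    snprintf(portstr, sizeof portstr, "%d", port);
    if (getaddrinfo(host, portstr, &hints, &res) != 0 || !res)
        return h;
    int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (fd >= 0 && connect(fd, res->ai_addr, res->ai_addrlen) == 0)
        h.fd = fd;
    else if (fd >= 0)
        close(fd);
    freeaddrinfo(res);
    return h;
}

/* Write exactly count bytes, looping over short writes. */
int RFS_Writen(RFS_handle h, const char *buf, int count) {
    int done = 0;
    while (done < count) {
        ssize_t n = write(h.fd, buf + done, (size_t)(count - done));
        if (n <= 0) return -1;
        done += (int)n;
    }
    return done;
}

/* Read exactly count bytes, looping over short reads. */
int RFS_Readn(RFS_handle h, char *buf, int count) {
    int done = 0;
    while (done < count) {
        ssize_t n = read(h.fd, buf + done, (size_t)(count - done));
        if (n <= 0) return -1;
        done += (int)n;
    }
    return done;
}

int RFS_Close_connection(RFS_handle h) { return close(h.fd); }
```

Because the handle is just a stream descriptor, Writen/Readn work over any connected byte stream, which is what lets another protocol implementation be dropped in behind the same interface.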
datatype) need not be transferred between the client and server. For example, it is not important for noncontiguous write operations whether or not the data are in packed form, as long as they have the correct number of bytes to write. However, file view information must be sent to the remote server, to describe how data should be stored on disks on the server. The file view information contains a datatype (called the filetype), which can be a recursively defined derived datatype. To portably pass a complex recursive datatype to the remote server, we use MPI_Type_get_envelope and MPI_Type_get_contents in MPI-2 for datatype decoding. Using these two functions, we perform a pre-order traversal of the given recursive datatype and pack the results into a buffer string, which is sent to the remote server. The server reads the buffer string and recursively recreates the original derived datatype. The file view information is sent once whenever there is an update (e.g., MPI_File_set_view is called). Ching et al. [7] took a similar approach to pass datatypes between client and server in the Parallel Virtual File System (PVFS) [5] on a single site, but we believe that this is its first use in remote I/O.

As briefly mentioned in Section 2, RFS requires no specific communication protocol for data transfer. Instead, RFS defines four primitive communication interface functions (Figure 2) that implement simple connection-oriented streams, and allows users to choose a communication protocol for which an implementation of the four interface functions is available. For example, users can pick GridFTP for its secure data transfer or can use a hybrid protocol of TCP and UDP, such as Reliable Blast UDP [10], for better transfer rates. The current implementation uses TCP/IP.

3.2 Integration with the ABT Module

ABT is implemented as an ADIO module that can be enabled in conjunction with any file system-specific ADIO module (Figure 1). For example, when ABT is enabled with the module for a particular file system, read and write requests to ADIO are intercepted by the ABT module, which buffers the data along with the description of the ADIO call and then returns control to the application.³ In parallel, the background thread performs I/O on the buffered data using the appropriate ADIO file system module functions. At the user's request, ABT can intercept and defer file close operations until the buffered writes for that file are completed. Thanks to the stackable ABT module, the integration of ABT and RFS required few code changes.

To optimize RFS performance with ABT, we augmented ABT with two temporary local disk staging schemes. First, when there is not enough memory to buffer the data for an I/O operation, ABT does not wait until a buffer is released, because that may be very slow with RFS. Instead, ABT immediately writes the data into a local cache file created in the fastest file system available on the client (foreground staging). The description of the I/O operation is still buffered in memory, along with the size and the offset of the data in the cache file. For each data buffer that is ready to write out, the background I/O thread first checks the location of the data. If the data are on disk, the thread allocates enough memory for the staged data and reads the data from the cache file. Once the data are in memory, the requested I/O operation is performed.

Second, to reduce the visible I/O cost even more, during each computation phase, we write some of the memory-buffered data to the cache file in the background, to procure enough memory space for the next set of output requests (background staging). For that purpose, it is helpful to know how much time we have before the next output request is issued and how much data will be written during the next set of output requests. RFS can obtain such information by observing the application, or users can provide it as hints. This scheme is especially well suited for typical simulation codes that write the same amount of data at regular intervals. If the amount of data or the interval between two consecutive output phases changes over time, we can still use the average values as estimates. However, we want to avoid unnecessary background staging, because staged data will have to be read back into memory before being sent to the server, thus possibly increasing the overall execution time.

We use a simple performance model to determine how much data to stage in the background before the next output phase begins. Suppose that the maximum available buffer size is ABS_max bytes, the currently available buffer size is ABS_cur bytes, the remaining time till the next output is t seconds, and the expected size of the next output is N bytes. If ABS_cur is smaller than min(ABS_max, N), which is the amount of buffer space needed, we perform

³Instead of allocating a monolithic buffer space at the beginning of an application run, ABT uses a set of small buffers, dynamically allocated as needed. Each buffer has the same size, preset according to file system performance characteristics. If the data to be buffered are larger than this buffer size, they are further divided and stored in multiple buffers.
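The staging decision just set up reduces to a small calculation: stage enough bytes x to the local cache file that staging plus concurrent remote writes free at least min(ABS_max, N) bytes of buffer before the next output phase. The sketch below is our own illustration (the function and parameter names are not from the RFS source), and it assumes the local cache bandwidth B_l exceeds the remote write bandwidth B_r, as in the paper's setting.

```c
#include <assert.h>

/* Hypothetical helper implementing the staging estimate of Section 3.2:
 * given the model parameters, return how many bytes x to write to the
 * local cache file before the next output phase (0 if no staging is
 * needed). Assumes b_local > b_remote so the solved coefficient is
 * positive. */
static double bytes_to_stage(double abs_max, double abs_cur,
                             double t, double next_output,
                             double b_local, double b_remote) {
    /* min(ABS_max, N): the buffer space the next output phase needs */
    double need = abs_max < next_output ? abs_max : next_output;
    if (abs_cur >= need)
        return 0.0;  /* enough buffer space already */
    /* Solve ABS_cur + x + B_r*(t - x/B_l) >= need for x:
     * staging x bytes takes x/B_l seconds; for the remaining
     * t - x/B_l seconds, remote writes free B_r*(t - x/B_l) bytes. */
    double x = (need - abs_cur - b_remote * t) / (1.0 - b_remote / b_local);
    if (x < 0.0)
        x = 0.0;     /* remote writes alone free enough space in time t */
    return x;
}
```

For example, with ABS_max = N = 100, ABS_cur = 20, t = 10 s, B_l = 10 and B_r = 2 bytes/s, the helper stages 75 bytes: staging takes 7.5 s, the remaining 2.5 s of remote writes free 5 bytes, and 20 + 75 + 5 = 100.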
Figure 3. Sample execution timeline of an application run at the client, when using RFS with ABT.
Both foreground and background staging are being performed. The foreground staging is visible to
the application, while the background staging is hidden by the concurrent computation.
background staging. If we write x bytes to the cache file before the next output phase begins, it will take an estimated x/B_l seconds, where B_l is the write bandwidth of the chosen file system. Also, for t - x/B_l seconds, remote writes will transfer B_r(t - x/B_l) bytes of data, where B_r is the remote write bandwidth that also considers the write time at the server and the response time. Therefore, with this scheme, the available buffer size t seconds from now will be ABS_cur + x + B_r(t - x/B_l), and this value should be equal to or greater than min(ABS_max, N). Solving this equation for x, we get

    x >= (min(ABS_max, N) - ABS_cur - B_r * t) / (1 - B_r/B_l).

This is a rough estimate; for example, it assumes that all the data to be transferred are buffered in memory and does not consider the read cost for the data buffered in the local cache file. The current implementation spreads the staging of x bytes over the current computation phase (i.e., staging a fraction of x bytes before writing each buffer remotely), adjusting x if B_r changes over time. Figure 3 depicts a sample execution timeline of an application using RFS with ABT at the client, with these optimizations in place.

3.3 RFS Request Handler on the Server

An RFS request handler on a server is an MPI code that can run on multiple processors and whose functionality is relatively simple compared with that of the client RFS module. The server receives the I/O requests from client processes, performs them locally, and transfers back to the client an error code and the data requested (for reads). Currently clients are mapped to server processes in a round-robin manner. If a single server process needs to handle multiple clients, it spawns a thread for each client that it has to serve, and handles the I/O requests concurrently. Although concurrent I/O operations on a common file could be requested on a server process, it need not be concerned about the file consistency semantics, as MPI by default requires the user to be responsible for consistency when having multiple writers on a common file.

4 Experimental Results

Our experiments used Chiba City, a Linux cluster at Argonne National Laboratory, for the client-side platform. Chiba has 256 compute nodes, each with two 500 MHz Pentium III processors, 512 MB of RAM, and 9 GB of local disk space. All the compute nodes are connected via switched fast ethernet. For the servers, we used two Linux PCs at different locations. Elephant is at the University of Illinois, with a 1.4 GHz Pentium 4 processor and 512 MB of RAM. Tallis is at Argonne, with a 1 GHz Pentium III processor and 256 MB of RAM. The MPICH2 implementation of MPI is used on all the platforms.

Table 1 shows the network and disk bandwidth measured between Chiba and each server with concurrent senders/writers. Both network and disk throughput are fairly stable and do not vary much as the number of senders/writers increases. As in a typical remote I/O setup, the network between Chiba and Elephant is slower than Elephant's local disk. However, Tallis has a very slow disk, even slower than the network connection to Chiba, thus simulating an environment with a high-performance backbone network.

We first measured the apparent I/O throughput with different file system configurations. The goal was to show how
Table 1. Aggregate network and disk bandwidth with each server. The numbers in parentheses show
the 95% confidence interval.
cessors for test purposes, the number of RFS writers should be carefully selected, considering the aggregate network and disk performance with concurrent senders/writers, because too many writers can hurt performance. ROMIO allows users to control the number of writers (called aggregators).

Since a shared file system on Chiba was not available at the time of these experiments, we simulated a shared file system by having one aggregator gather and reorganize all the data and write to its local disk. Many clusters use NFS-mounted shared file systems, whose performance is often much lower than that of our simulated shared file system.
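The simulated shared file system described in the footnote above (one aggregator gathers all the clients' output pieces, reorganizes them into file order, and writes them to its local disk in one stream) can be sketched as follows; the piece layout and function names are our own illustration, not the experiment's actual code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* One piece of output gathered from one client: a file offset plus its
 * bytes. */
struct piece { size_t off; const char *buf; size_t len; };

static int by_offset(const void *a, const void *b) {
    const struct piece *pa = a, *pb = b;
    return (pa->off > pb->off) - (pa->off < pb->off);
}

/* Aggregator: sort the gathered pieces by file offset and copy them into
 * one contiguous image, which would then be written to the local disk as
 * a single sequential stream. */
static void aggregate(struct piece *pieces, int n, char *image) {
    qsort(pieces, (size_t)n, sizeof *pieces, by_offset);
    for (int i = 0; i < n; i++)
        memcpy(image + pieces[i].off, pieces[i].buf, pieces[i].len);
}
```

Reorganizing before writing is what turns the clients' scattered pieces into one long sequential local write, which is why the simulated shared file system can outperform a typical NFS mount.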
(a) Between Chiba and Elephant (b) Between Chiba and Tallis
Figure 4. Aggregate application-visible write bandwidth with different file system configurations.
executed the Jacobi code with five iterations, writing up to 2.5 GB of data remotely; the numbers in the graph were averaged over five or more runs. The error bars show the 95% confidence interval.

The PEAK I/O throughput increases as the number of processors and the amount of data grow, reaching 160.2 MB/s with 16 processors, although it does not scale up well. Since this configuration does not involve disk operations, the performance is limited by the message passing performance on Chiba done via the fast ethernet. The LOCAL I/O throughput is up to 9.8 MB/s and does not scale up because we used only one writer.

Between Chiba and Elephant, the network is the main performance-limiting factor for remote I/O. Thus, as shown in Figure 4(a), the RFS throughput reaches 10.1 MB/s with 16 processors, about 86% of the network bandwidth between the two platforms. The gap between the RFS and network throughput occurs because RFS writes also include the response time for disk writes at the server and the time to transfer the error code back to the client. Our tests with RFS reads yielded similar results. As the number of writers increases, the aggregate RFS throughput also increases slightly, because the data reorganization throughput increases, too. In this setup, RFS performance is comparable to or even higher than the local I/O performance.

With ABT in place, however, the visible write throughput increases significantly because ABT efficiently hides the remote I/O cost. In RFS+ABT-large-long, where we have enough buffer space and long computation phases, the visible I/O throughput reaches 146.7 MB/s with 16 processors, about 92% of the theoretical peak, a factor of 14.5 improvement over the RFS performance and a factor of 17.7 improvement over the local I/O performance. The gap between the peak and the RFS performance is due mainly to the cost of copying data to the active buffers and the slowdown caused by concurrent foreground buffering and background remote I/O.

When the computation phase is not long enough to hide an entire remote output operation (RFS+ABT-large-short), the visible I/O throughput is still comparable to the throughput obtained with long computation phases. In our experiments, the difference in throughput does not exceed 4% of the long computation phase throughput, proving that background staging with our performance model can procure enough buffer space for the next snapshot.

When the total buffer space is smaller than the size of a snapshot (RFS+ABT-small-long), RFS has to perform foreground staging, whose cost is completely visible. For our tests, we used fsync to immediately flush the staged data to disk, because we wished to see the effect of local staging of larger data: if the amount of data to be staged is small, as in our experiments, the staged data can fit in the file cache, producing a very small local staging cost. Even with fsync, RFS with ABT can still improve the remote write performance significantly, reaching 103.1 MB/s with 16 processors, an improvement of a factor of 10.2 over RFS alone and a factor of 12.4 over local write performance. Without fsync, we obtained performance very close to that with 32 MB of buffer.

Figure 4(b) shows the I/O bandwidth obtained between Chiba and Tallis. Here, the very slow disk on Tallis is the performance bottleneck, limiting the RFS performance to less than 2 MB/s. We obtained up to 1.7 MB/s of RFS write throughput, roughly 87.1% of the disk bandwidth on Tallis. Read tests produced similar results. Nevertheless, the performance of RFS with ABT between Chiba and Tallis is close to the performance between Chiba and Elephant, making the performance improvement even more dramatic. For example, the aggregate visible I/O throughput with RFS+ABT-large-long reaches 145.0 MB/s with
Table 2. The amount of data staged in the foreground and background at the client in RFS+ABT-large-short. The numbers in parentheses are the percentage of staged data out of the total output data.

No. Procs                          4                 8                 12                16
Chiba to   foreground   0.0 MB (0.0%)     0.0 MB (0.0%)     0.8 MB (0.05%)    1.6 MB (0.08%)
Elephant   background   168.0 MB (32.8%)  322.4 MB (31.5%)  571.2 MB (37.2%)  798.4 MB (39.0%)
Chiba to   foreground   12.0 MB (2.3%)    26.0 MB (2.5%)    46.4 MB (3.0%)    71.0 MB (3.5%)
Tallis     background   317.0 MB (61.9%)  614.0 MB (60.0%)  961.6 MB (62.6%)  1320.0 MB (64.5%)
16 processors, about 86.8 times higher than the RFS write throughput and about 8.3 times higher than the local write throughput. With a 16 MB buffer, 103.0 MB/s throughput was achieved with 16 processors and fsync, a factor of 61.6 improvement over RFS writes and a factor of 12.4 improvement over local writes. The reason we could still obtain excellent visible I/O performance with this slow remote file system is that the client buffers data, and thus, with the help from background staging, the buffering cost does not vary much with different servers.

We cannot easily compare the performance of RFS directly with that of RIO. RIO was a one-time development effort, so today RIO depends on a legacy communication library, making it impractical to run RIO in our current environment. Also, the experiments presented in the original RIO paper [9] were conducted in a simulated wide-area environment, where the RIO authors partitioned a parallel platform into two parts and performed TCP/IP communication between them, instead of using a real wide-area environment as we have. Moreover, the RIO authors measured the sustained remote I/O bandwidth with a parallel file system at the server for blocking and nonblocking I/O (equivalent to our I/O operations without and with ABT), while we measured the visible I/O bandwidth with a sequential Unix file system at the server.

However, we can still speculate on the difference in remote I/O performance with RFS and with RIO. According to the RIO authors, RIO can achieve blocking remote I/O performance close to the peak TCP/IP performance with large messages. Our experiments show that remote I/O without ABT can achieve almost 90% of the peak TCP/IP bandwidth even with a sequential file system at the other end. With smaller messages, however, RIO's blocking I/O performance dropped significantly, because of the communication overhead with RIO's dedicated forwarders. Since all remote I/O traffic with RIO goes through the forwarders, a single I/O operation between a client and a server process involves four more messages than with RFS: two outbound and incoming messages between the client process and the client-side forwarder and two between the server process and the server-side forwarder. These can cause significant overhead for an application with many small writes that uses RIO. RFS, on the other hand, does not use intermediate forwarders and lets clients directly communicate with servers, effectively reducing the message traffic compared to RIO. For this reason, we expect RFS to be more efficient than RIO in many situations.

To test the performance model with the background staging, we measured the amount of data staged both in the foreground and the background at the client in RFS+ABT-large-short (Table 2). The numbers were averaged over five or more runs; the numbers in parentheses are the percentage of staged data out of the total data in four snapshots. If our performance model accurately predicts the amount of data that should be staged, then there should be no foreground staging, because the total buffer size is the same as the size of a snapshot. The numbers obtained confirm this claim. In both setups, less than 4% of the output data are staged in the foreground.

Also, an accurate model should minimize the amount of data staged in the background; otherwise, unnecessary staging will make the overall transfer longer. However, it is difficult to measure the exact amount of unnecessarily staged data, because the amount of data transferred during each computation phase can vary as a result of network fluctuation and slowdown from multithreading. In RFS+ABT-large-short, we roughly estimated the length of each computation phase to be long enough to transfer over the network about 70-75% of a snapshot for the Chiba-Elephant setup and 45-50% of a snapshot for the Chiba-Tallis setup. Thus, in theory, 25-30% of a snapshot for the Chiba-Elephant setup and 50-55% of a snapshot for the Chiba-Tallis setup should be staged in the background, to minimize the visible write cost for the next snapshot. When background staging is in place, however, smaller amounts of data than estimated above may be transferred during a computation phase, because background staging takes time. Also, since the unit of staging is an entire buffer, often we cannot stage the exact amount of data calculated by the model. Thus, the amount of data staged in the background for each snapshot should be larger than the portion of a snapshot that cannot be transferred during a computation phase with RFS alone. Our performance numbers show

Among the five snapshots in each run, the first cannot be staged in the foreground, and the last cannot be staged in the background in this configuration.
Table 3. The computation slowdown caused by concurrent remote I/O activities.

           RFS+ABT-large-long                        5.67%
Tallis     RFS+ABT-large-short   0.90%   1.17%   0.45%   0.24%
that 31-39% of the output for the Chiba-Elephant setup are not cached locally. GASS [4] has facilities for prefetch-
and 6 0 6 5 % of the output for the Chiba-Tallis setup were ing and caching of remote files for reads. GASS transfers
staged in the background, slightly more than the estimated only entire files, however, an approach that can cause ex-
numbers above. Based on these arguments and our perfor- cessive data transfer for partial file access (e.g., visualizing
mance numbers, we conclude that the amount of unneces- only a portion of a snapshot). We instead aim to provide
sarily staged data by RFS is minimal. finer-grained prefetching and caching that use byte ranges
Finally, we measured how much these background re- to specify the data regions to be prefetched and cached.
mote U 0 activities slow the concurrent execution of the Our extensions will comply with the MPI-IO consistency
Jacobi code through their computation and inter-processor semantics.
communication. Table 3 shows the average slowdown of
the computation phases with various configurations when the 32 MB buffer was used. The measured slowdown was less than 7% in all cases, which is dwarfed by the performance gain from hiding the remote I/O cost.

5 Discussion

5.1 Optimizing Remote Read Performance

This work focuses on optimizing remote write performance through ABT for write-intensive scientific applications. Reads are typically not a big concern for such applications, because they often read a small amount of initial data and do not reread their output snapshots during their execution. A restart operation after a system or application failure may read large amounts of checkpointed output, but restarts rarely occur. The current implementation of ABT requires one to flush the buffered data to the destination file system before reading a file for which ABT has buffered write operations.

However, applications such as remote visualization tools may read large remote datasets. The traditional approaches to hiding read latency are to prefetch the data to be read and to cache the data for repetitive reads, and we are adding such extensions to ABT, using local disks as a cache. More specifically, we are providing a flexible prefetching interface through the use of hints, so that background threads can start prefetching a remote file right after an open call on that file. When a read operation is issued on a file for which prefetching was requested, RFS checks the prefetching status, reads the already-prefetched portion locally, and reads the portion not yet prefetched remotely. Cached files can be read similarly, performing remote reads for the portions that are not yet cached locally.

5.2 Handling Failures

Through ABT and hints, the current RFS implementation allows the user to defer file open, write, and close calls to reduce network traffic and response time. The original call to such a function returns immediately with a success value, and if an error occurs during a deferred operation, the user is notified only after the error occurs. Thus, the user needs to be aware of the possibility of delayed error messages as the cost of this performance improvement. If timely error notification is important, the user should avoid these options.

A failed remote open notification will be received when the following I/O operation on the specified remote file fails. A failed write notification can be delayed until a sync or close operation is called on the file. The MPI-IO standard says that MPI_File_sync causes all previous writes to the file by the calling process to be transferred to the storage device (MPI_File_close has the same effect), so delaying write error notification until a sync or close operation does not violate the standard. If the user chooses to defer a file close as well, and a write error occurs after the original close operation returns, then the error notification can be delayed until MPI_Finalize.

Some RFS operations that are deferred by default, such as setting the file view and file seeks, can be checked for errors at the client, and the user can be notified about such errors, if any, before the operations are executed at the server. Thus, they do not need separate failure handling.

6 Conclusions

We have presented an effective solution for direct remote I/O for applications using MPI-IO. The RFS remote file
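The partial-prefetch read path described in Section 5.1 amounts to a range-splitting step: given the byte ranges a background thread has already staged locally, a read request is divided into locally served and remotely fetched portions. The following Python sketch is illustrative only (RFS itself is implemented in C inside ROMIO's ADIO layer); the function name and the interval representation are our own assumptions, not RFS interfaces.

```python
def split_read(offset, length, prefetched):
    """Split a read request into locally served and remote portions.

    `prefetched` is a sorted, non-overlapping list of (start, end)
    byte ranges already staged on local disk. Returns two lists of
    (start, end) ranges: those to read from the local cache and
    those that must still be fetched from the remote server.
    """
    local, remote = [], []
    pos, end = offset, offset + length
    for p_start, p_end in prefetched:
        if p_end <= pos or p_start >= end:
            continue                      # no overlap with the request
        if p_start > pos:
            remote.append((pos, p_start))  # gap before this staged range
        local.append((max(pos, p_start), min(end, p_end)))
        pos = min(end, p_end)
        if pos >= end:
            break
    if pos < end:
        remote.append((pos, end))          # tail not yet prefetched
    return local, remote
```

For example, a 100-byte read against staged ranges (10, 50) and (70, 90) yields `local=[(10, 50), (70, 90)]` and `remote=[(0, 10), (50, 70), (90, 100)]`; the remote portions shrink as the background prefetch thread makes progress.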
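The deferred-error semantics of Section 5.2 can be modeled in a few lines: a write is buffered and reports success immediately, and any backend failure surfaces only when the file is synced or closed. This is a toy Python model of the behavior, not RFS code; the class and method names are hypothetical, and only the visible semantics (immediate success, delayed notification at sync/close) follow the text.

```python
class DeferredFile:
    """Toy model of deferred RFS writes: write() buffers data and
    returns success right away; a backend failure is reported only at
    sync() or close(), mirroring the MPI-IO rule that MPI_File_sync
    and MPI_File_close force buffered writes to the storage device."""

    def __init__(self, backend_write):
        self._backend_write = backend_write  # callable; may raise OSError
        self._pending = []                   # deferred (buffered) writes
        self._deferred_error = None

    def write(self, data):
        # Deferred operation: staged locally, "success" returned now.
        self._pending.append(data)
        return True

    def sync(self):
        # Drain buffered writes; remember the first failure, if any.
        while self._pending:
            chunk = self._pending.pop(0)
            try:
                self._backend_write(chunk)
            except OSError as exc:
                if self._deferred_error is None:
                    self._deferred_error = exc
        if self._deferred_error is not None:
            exc, self._deferred_error = self._deferred_error, None
            raise exc  # delayed notification, permitted by the standard

    def close(self):
        self.sync()    # close has the same flushing effect as sync
```

With a failing backend, `write()` still returns success and the `OSError` is raised only from `sync()`; this is exactly the trade-off the section describes, and why a user who needs timely error notification should not defer these calls.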
access component has a simple and flexible I/O architecture that supports efficient contiguous and noncontiguous remote accesses. Coupling this with ABT provides aggressive buffering for output data and low-overhead overlapping of computation and I/O. Our local data staging augmentation to ABT further enhances ABT's ability to hide true I/O latencies. Our experimental results show that the write performance of RFS without ABT is close to the throughput of the slowest component in the path to the remote file system. However, RFS with ABT can significantly reduce the visible remote I/O cost, with throughput up to 92% of the theoretical peak (determined by the local interconnect throughput) when sufficient buffer space is available. With short computation phases, RFS still reduces visible I/O cost by performing a small amount of background staging to free up sufficient buffer space for the next I/O operation. The computation slowdown caused by concurrent remote I/O activities is under 7% in our experiments and is dwarfed by the improvements in turnaround time.

As discussed in the previous section, we are currently enhancing ABT for RFS reads by introducing prefetching and caching. Future work includes experiments with alternative communication protocols and parallel server platforms.

Acknowledgments

This work was supported in part by the Mathematical, Information, and Computational Sciences Division subprogram of the Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy, under Contract W-31-109-ENG-38. This research was also supported through faculty start-up funds from North Carolina State University and a joint faculty appointment from Oak Ridge National Laboratory.

References

[1] NCSA HDF home page. http://hdf.ncsa.uiuc.edu.
[2] NFS: Network File System protocol specification. RFC 1094.
[3] B. Allcock, J. Bester, J. Bresnahan, A. Chervenak, I. Foster, C. Kesselman, S. Meder, V. Nefedova, D. Quesnel, and S. Tuecke. Data management and transfer in high performance computational grid environments. Parallel Computing Journal, 28(5):749-771, 2002.
[4] J. Bester, I. Foster, C. Kesselman, J. Tedesco, and S. Tuecke. GASS: A data movement and access service for wide area computing systems. In Proceedings of the Workshop on Input/Output in Parallel and Distributed Systems, 1999.
[5] P. Carns, W. Ligon III, R. Ross, and R. Thakur. PVFS: A parallel file system for Linux clusters. In Proceedings of the Annual Linux Showcase and Conference, 2000.
[6] A. Chervenak, I. Foster, C. Kesselman, S. Salisbury, and S. Tuecke. The Data Grid: Towards an architecture for the distributed management and analysis of large scientific datasets. Journal of Network and Computer Applications, 23:187-200, 2001.
[7] A. Ching, A. Choudhary, W.-K. Liao, R. Ross, and W. Gropp. Efficient structured data access in parallel file systems. In Proceedings of the International Conference on Cluster Computing, 2003.
[8] I. Foster and C. Kesselman. Globus: A metacomputing infrastructure toolkit. International Journal of Supercomputer Applications, 11(2):115-128, 1997.
[9] I. Foster, D. Kohr, Jr., R. Krishnaiyer, and J. Mogill. Remote I/O: Fast access to distant storage. In Proceedings of the Workshop on Input/Output in Parallel and Distributed Systems, 1997.
[10] E. He, J. Leigh, O. Yu, and T. DeFanti. Reliable Blast UDP: Predictable high performance bulk data transfer. In Proceedings of the International Conference on Cluster Computing, 2002.
[11] D. Kotz. Disk-directed I/O for MIMD multiprocessors. In Proceedings of the Symposium on Operating Systems Design and Implementation, 1994.
[12] J. Lee, X. Ma, M. Winslett, and S. Yu. Active buffering plus compressed migration: An integrated solution to parallel simulations' data transport needs. In Proceedings of the International Conference on Supercomputing, 2002.
[13] J. Li, W.-K. Liao, R. Ross, R. Thakur, W. Gropp, R. Latham, A. Siegel, B. Gallagher, and M. Zingale. Parallel netCDF: A high-performance scientific I/O interface. In Proceedings of SC2003, 2003.
[14] X. Ma, M. Winslett, J. Lee, and S. Yu. Improving MPI-IO output performance with active buffering plus threads. In Proceedings of the International Parallel and Distributed Processing Symposium, 2003.
[15] Message Passing Interface Forum. MPI-2: Extensions to the Message-Passing Interface. 1997.
[16] J. Morris, M. Satyanarayanan, M. Conner, J. Howard, D. Rosenthal, and F. Smith. Andrew: A distributed personal computing environment. Communications of the ACM, 29(3):184-201, 1986.
[17] F. Schmuck and R. Haskin. GPFS: A shared-disk file system for large computing clusters. In Proceedings of the Conference on File and Storage Technologies, 2002.
[18] D. Thain, J. Basney, S.-C. Son, and M. Livny. The Kangaroo approach to data movement on the Grid. In Proceedings of the Symposium on High Performance Distributed Computing, 2001.
[19] R. Thakur, W. Gropp, and E. Lusk. An abstract-device interface for implementing portable parallel-I/O interfaces. In Proceedings of the Symposium on the Frontiers of Massively Parallel Computation, 1996.
[20] R. Thakur, W. Gropp, and E. Lusk. Data sieving and collective I/O in ROMIO. In Proceedings of the Symposium on the Frontiers of Massively Parallel Computation, 1999.
[21] J. Weissman. Smart file objects: A remote file access paradigm. In Proceedings of the Workshop on Input/Output in Parallel and Distributed Systems, 1999.