High Availability in DHTs: Erasure Coding vs. Replication
Rodrigo Rodrigues and Barbara Liskov
MIT
Figure 2: Distinction between sessions and lifetimes.
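The distinction in Figure 2 can be made concrete in code: a session is one contiguous join-to-leave interval, while a membership lifetime spans absences shorter than the membership timeout τ. The following sketch uses a hypothetical join/leave trace (the event times are illustrative, not values from the paper's traces), and assumes events strictly alternate join/leave starting with a join:

```python
# Hypothetical event trace for one node: (time in hours, "join" | "leave").
events = [(0, "join"), (10, "leave"), (12, "join"), (40, "leave"),
          (90, "join"), (100, "leave")]

def session_times(events):
    """Each session is one contiguous join..leave interval."""
    joins = [t for t, e in events if e == "join"]
    leaves = [t for t, e in events if e == "leave"]
    return [lv - jn for jn, lv in zip(joins, leaves)]

def membership_lifetimes(events, tau):
    """With membership timeout tau, an absence shorter than tau does not
    end the node's membership; only a longer absence does."""
    joins = [t for t, e in events if e == "join"]
    leaves = [t for t, e in events if e == "leave"]
    lifetimes = []
    start = joins[0]
    for i, lv in enumerate(leaves):
        # Gap until the next join (infinite after the final leave).
        gap = (joins[i + 1] - lv) if i + 1 < len(joins) else float("inf")
        if gap > tau:  # absence exceeds tau: this membership lifetime ends
            lifetimes.append(lv - start)
            if i + 1 < len(joins):
                start = joins[i + 1]
    return lifetimes
```

For this trace, `session_times(events)` yields `[10, 28, 10]`, while `membership_lifetimes(events, tau=5)` yields `[40, 10]`: the two-hour absence at t = 10 is shorter than τ and does not end the membership, illustrating why lifetimes grow as τ increases.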
Figure 4: Membership dynamics as a function of the membership timeout (τ).

Figure 5: Average node availability as a function of the membership timeout (τ).
Figure 6: Required replication factor for four nines of per-object availability, as a function of the membership timeout (τ).

Figure 7: Required coding redundancy factor for four nines of per-object availability, as a function of the membership timeout (τ), as determined by Equation 2, and considering the extra copy required to restore lost fragments.
4.1 Needed Redundancy and Bandwidth
Finally, we measure the bandwidth gains of using erasure coding vs. replication in the three deployments.

First, we compute the needed redundancy for the two redundancy schemes as a function of the membership timeout (τ). To do this we used the availability values of Figure 5 in Equations (1) and (2), and plotted the corresponding redundancy factors, assuming a target average per-object availability of four nines.

The results for replication are shown in Figure 6. This shows that Overnet requires the most redundancy, as expected, reaching a replication factor of 20. In the other two deployments the replication factors are much lower, on the order of a few units. Note that the replication values in this figure are rounded up to the next integer, since we cannot have a fraction of a copy.

Figure 7 shows the redundancy requirements (i.e., the expansion factor) for the availability values of Figure 5 (using Equation 2 with m = 7 and four nines of target availability). The redundancy values shown in Figure 7 include the extra copy of the object required to create new fragments as nodes leave the system (as explained in Section 3.3).

As shown in Figure 7, Overnet still requires more redundancy than the other two deployments, but for Overnet coding leads to the most substantial storage savings (for a fixed amount of unique data stored in the system), since it can reduce the redundancy factors by more than half.

Finally, we compare the bandwidth usage of the two schemes. For this we use the basic equation for the cost of redundancy maintenance (Equation 3) and, for membership lifetimes, apply the values implied by the leave rates from Figure 4 (recall that the average membership lifetime is the inverse of the average join or leave rate). We also assume a fixed number of servers (10,000) and a fixed amount of unique data stored in the system (10 TB). We used the replication factors from Figure 6, and for coding the redundancy factors from Figure 7.

Figure 8 shows the average bandwidth used for the three different traces and for different values of τ. An interesting effect can be observed in the Farsite trace, where the bandwidth has two "steps" (around τ = 14 and τ = 64 hours). These correspond to the people who turn off their machines at night and during the weekends, respectively. Setting τ to be greater than each of these downtime periods will prevent this downtime from generating a membership change and the corresponding data movement.

Figure 9 shows the equivalent of Figure 8 for the case when coding is used instead of replication. The average bandwidth values are now lower due to the smaller redundancy used with coding, especially in the Overnet deployment, where we achieve the most substantial redundancy savings.

Figure 8: Maintenance bandwidth – Replication. Average bandwidth required for redundancy maintenance as a function of the membership timeout (τ). This assumes that 10,000 nodes are cooperatively storing 10 TB of unique data, and replication is used for data redundancy.

Figure 9: Maintenance bandwidth – Erasure coding. Average bandwidth required for redundancy maintenance as a function of the membership timeout (τ). This assumes that 10,000 nodes are cooperatively storing 10 TB of unique data, and coding is used for data redundancy.

5 Discussion and Conclusion

Several conclusions can be drawn from Figures 8 and 9.

For the Overnet trace, coding is a win since server availability is low (we are on the left-hand side of Figure 1), but unfortunately the maintenance bandwidth for a scalable and highly available storage system with Overnet-like membership dynamics can be unsustainable for home users (around 100 kbps on average for a modest per-node contribution of a few gigabytes). Therefore, cooperative storage systems should target more stable environments like Farsite or PlanetLab.

For the PlanetLab trace, coding is not a win, since server availability is extremely high (corresponding to the right-hand side of Figure 1).

So the most interesting deployment for using erasure codes is Farsite, where an intermediate server availability of 80–90% already presents visible redundancy savings.

However, the redundancy savings from using coding instead of full replication come at a price.

The main point against the use of coding is that it introduces complexity into the system. Not only is there complexity associated with the encoding and decoding of the blocks, but the entire system design becomes more complex (e.g., the task of redundancy maintenance becomes more complicated, as explained in Section 3).

As a general principle, we believe that complexity in system design should be avoided unless proven strictly necessary. Therefore, system designers should question whether the added complexity is worth the benefits, which may be limited depending on the deployment.
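The redundancy and maintenance-bandwidth computations of Section 4.1 can be sketched in code. Equations (1)-(3) are not reproduced in this excerpt, so this sketch assumes standard forms: with replication an object is available if any of its k replicas is up; with coding it is available if at least m of its n fragments are up (binomial tail); and the average per-node maintenance bandwidth is the per-node stored data (including redundancy) re-sent once per average membership lifetime. Function names and the example numbers are illustrative:

```python
import math
from math import comb

def replication_factor(a, target=0.9999):
    """Smallest integer k with 1 - (1 - a)^k >= target, where a is the
    average node availability (assumed standard form of Equation 1)."""
    return math.ceil(math.log(1 - target) / math.log(1 - a))

def coding_redundancy(a, m=7, target=0.9999):
    """Smallest n such that at least m of n independently available
    fragments (each up with probability a) survive with probability
    >= target; returns the redundancy (expansion) factor n/m."""
    n = m
    while sum(comb(n, i) * a**i * (1 - a)**(n - i)
              for i in range(m, n + 1)) < target:
        n += 1
    return n / m

def maintenance_bw_kbps(unique_bytes, redundancy, nodes, lifetime_secs):
    """Average per-node bandwidth: each node stores
    unique_bytes * redundancy / nodes bytes, and on average that amount
    must be re-created once per membership lifetime."""
    bytes_per_node = unique_bytes * redundancy / nodes
    return bytes_per_node * 8 / lifetime_secs / 1000
```

For example, `replication_factor(0.5)` returns 14 (availability of 50% needs 14 replicas for four nines), and `maintenance_bw_kbps(10e12, 2, 10_000, 1e6)` evaluates the 10 TB / 10,000-node scenario for a given redundancy factor and lifetime. Note the ceiling in `replication_factor`, matching the rounding to whole copies described above.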
Another point against the use of erasure codes is the download latency in an environment like the Internet, where the inter-node latency is very heterogeneous. When using replication, the data object can be downloaded from the replica that is closest to the client, whereas with coding the download latency is bounded by the distance to the mth closest replica. This problem was illustrated with simulation results in a previous paper [7].

The task of downloading only a particular subset of the object (a sub-block) is also complicated by coding, since the entire object must be reconstructed. With full replicas, sub-blocks can be downloaded trivially.

A similar observation is that erasure coding is not adequate for a system design where operations are done at the server side, like keyword searching.

A final point is that in our analysis we considered only immutable data. This assumption is particularly important for our distinction between session times and membership lifetimes, because we are assuming that when an unreachable node rejoins the system, its state is still valid. This would not be true if it contained mutable state that had been modified in the meantime. The impact of mutability on the redundancy choices is unclear, since we have to consider how a node determines whether its state is accurate, and what it does if it isn't. A study of redundancy techniques in the presence of mutability is an area for future work.
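The latency argument above can be illustrated with order statistics: with replication the client reads from the closest replica (the minimum of the candidate latencies), while with coding the download is bounded by the m-th closest fragment holder. A minimal sketch with hypothetical round-trip times (the RTT values are illustrative, not measured data):

```python
def replication_latency(rtts):
    """Full replication: the client downloads from the closest replica."""
    return min(rtts)

def coding_latency(rtts, m):
    """Erasure coding: the client needs m distinct fragments, so the
    download is bounded by the m-th closest fragment holder."""
    return sorted(rtts)[m - 1]

# Hypothetical RTTs (ms) to the nodes holding the data.
rtts = [120, 15, 260, 48, 90, 180, 33, 210, 75, 140]
best_replica = replication_latency(rtts)   # 15 ms
coding_bound = coding_latency(rtts, m=7)   # 140 ms (7th smallest RTT)
```

In a heterogeneous network the gap between the minimum and the m-th order statistic can be large, which is why coding hurts download latency even when enough fragment holders are reachable.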
’01.
[13] S. Reed and G. Solomon. Polynomial codes over certain finite
Acknowledgements fields. J. SIAM, 8(2):300–304, June 1960.
We thank Jeremy Stribling, the Farsite team at MSR, and the [14] S. Rhea, D. Geels, T. Roscoe, and J. Kubiatowicz. Handling
Total Recall team at UCSD for supplying the data collected churn in a DHT. In Proc. USENIX’04.
in their studies. We thank Mike Walfish, Emil Sit, and the [15] A. Rowstron and P. Druschel. Storage management and caching
anonymous reviewers for their helpful comments. in PAST, a large-scale, persistent peer-to-peer storage utility. In
Proc. SOSP’01.
[16] J. Stribling. Planetlab - all pairs pings. http://pdos.lcs.mit.edu/
References ˜strib/pl app.
[1] D. Andersen. Improving End-to-End Availability Using Overlay [17] H. Weatherspoon and J. Kubiatowicz. Erasure coding vs. repli-
Networks. PhD thesis, MIT, 2005. cation: A quantitative comparison. In Proc. IPTPS ’02.