Fault-Tolerant Broadcasting and Gossiping in Communication Networks
Andrzej Pelc*
Département d'Informatique, Université du Québec à Hull, Hull, Québec J8X 3X7, Canada

Broadcasting and gossiping are fundamental tasks in network communication. In broadcasting, or one-to-all communication, information originally held in one node of the network (called the source) must be transmitted to all other nodes. In gossiping, or all-to-all communication, every node holds a message which has to be transmitted to all other nodes. As communication networks grow in size, they become increasingly vulnerable to component failures. Thus, capabilities for fault-tolerant broadcasting and gossiping gain importance. The present paper is a survey of the fast-growing area of research investigating these capabilities. We focus on the two most important efficiency measures of broadcasting and gossiping algorithms: running time and number of elementary transmissions required by the communication process. We emphasize the unifying thread in most results from the research in fault-tolerant communication: the trade-offs between efficiency of communication schemes and their fault-tolerance. © 1996 John Wiley & Sons, Inc.

* E-mail: pelc@uqah.uquebec.ca
NETWORKS, Vol. 28 (1996) 143-156
© 1996 John Wiley & Sons, Inc. CCC 0028-3045/96/030143-14

1. INTRODUCTION

Broadcasting and gossiping are fundamental tasks in network communication. They both aim at disseminating information among nodes. In broadcasting, also called one-to-all communication, information originally held in one node of the network (called the source) has to be transmitted to all other nodes. In gossiping, or all-to-all communication, every node holds a message (value) which must be transmitted to all other nodes. These types of network communication often occur in distributed computing, e.g., in global processor synchronization and updating distributed databases. Moreover, such communication tasks are implicit in many parallel computation problems, where data and results are distributed among processors. This happens, e.g., in matrix multiplication, parallel solving of linear systems, parallel computing of the discrete Fourier transform, or parallel sorting, cf. [8, 31, 53].

The two most important measures of performance of broadcasting and gossiping algorithms are the number of elementary transmissions (calls) and the number of rounds (time) required. Another concern in the design of communication schemes is the demand that they impose on the underlying network. Since dense networks are difficult and costly to implement, it is important to consider efficient broadcasting and gossiping algorithms that work for networks as sparse as possible. Excellent accounts of the literature on broadcasting and gossiping focusing on the above-mentioned problems can be found in surveys [33, 50, 51].

As communication networks grow in size, they become increasingly vulnerable to component failures. Some links and/or nodes of the network may fail. It becomes important to design communication algorithms in such a way that the desired communication task be accomplished efficiently in spite of these faults, usually without knowing their location ahead of time. As such, much attention has been devoted recently to fault-tolerant broadcasting and gossiping. The present paper, which is an extended version of [62], surveys this rapidly growing area of research. It has been necessary to make choices in the large body of literature concerning fault-tolerant communication, leaving out vast and important subdomains related to the main focus of this paper. We do not cover the issue of network reliability, as it is not immediately
concerned with the efficiency of fault-tolerant communication but rather with necessary conditions for its feasibility. Another issue closely related to fault-tolerant broadcasting and gossiping is the Byzantine Agreement. This material has been covered in the survey [5] which also discussed the important problem of multiprocessor system diagnosis. We do not consider problems of the Byzantine Agreement in this paper, although we discuss some aspects of broadcasting and gossiping in the presence of Byzantine faults, pointing out differences of our approach. Finally, we focus attention only on the two above-mentioned communication tasks, leaving out, e.g., the important and largely studied issue of fault-tolerant, point-to-point routing where problems and techniques are different from those encountered for broadcasting and gossiping.

We focus on the two most important and widely studied efficiency measures of broadcasting and gossiping algorithms: running time and number of elementary transmissions used in the communication process. We also discuss the issue of sparsity of networks supporting efficient fault-tolerant communication schemes; schemes that work for sparse networks are more widely applicable than are those requiring, e.g., a completely connected network. In this survey, we emphasize the unifying thread found in most results from research in fault-tolerant communication: the trade-offs between efficiency of communication schemes and their fault-tolerance.

The paper is organized as follows: In Section 2, we present a variety of communication and fault models discussed in the literature. One dividing line is between bounded and probabilistic fault models, as both the goals and algorithm design techniques differ in each case. Accordingly, in Section 3, we survey results in the bounded fault model, while in Section 4, random fault distribution is considered. Section 5 is devoted to the discussion of possible directions of future research.

2. MODELS AND TERMINOLOGY

The domain of fault-tolerant broadcasting and gossiping is a rather broad area of research. Different authors have adopted varying approaches, both to network communication and to fault modeling. Thus, many combinations of assumptions concerning the communication process and the possible faults can be found, yielding a large number of communication models and fault-tolerance solutions. The design of algorithms and results concerning their efficiency and robustness heavily depend on the underlying model; hence, it is very important to specify the assumptions in a detailed and rigorous way.

The only assumption common to all papers reviewed in the present survey is the consideration of point-to-point communication networks modeled as undirected graphs whose vertices are sites of the network (e.g., processors, computers) and whose edges are communication links used to transmit messages from site to site. A number of other features must be specified in order to describe the model completely. Below, we point out the choices to be made and fix the appropriate terminology.

2.1. Communication Mode

The communication primitive is a call taking place between two adjacent sites (or nodes) of the network and usually lasting a unit of time. The communication mode specifies which calls can be executed simultaneously during one unit of time and what messages can be transmitted in one call. In the shouting mode, also called n-port or link-bound, a node can communicate with all its neighbors during a single unit of time, while in the whispering mode, also called 1-port or processor-bound, a node can communicate with at most one neighbor. Further, the communication mode can be full-duplex, when during one call messages between communicating nodes can travel in both directions through the (bidirectional) link joining them, or half-duplex, when every node can only send or only receive information in a given call. The shouting mode models, e.g., radio communication, in which a signal can be simultaneously transmitted to all receivers within the range of the broadcasting station. The whispering mode is suitable to model wire-based communication, such as occurs in traditional telephone networks. Likewise, the full-duplex mode is appropriate for telephone conversations, while the half-duplex mode is used in sending telegrams or letters. Several combinations and variations of these modes have been considered, e.g., simultaneous sending to all neighbors but sequential receiving or the concurrent sending to one neighbor and receiving from one (possibly different) neighbor.

2.2. Size of Packets

Another important characteristic of the communication process, significantly influencing the performance of algorithms, is the size of message packets. A packet is the amount of information that can be sent by a node to its neighbor in one call. This parameter may dramatically change the process of gossiping, as large packets allow single transmission of already accumulated information. In case of broadcasting, packet size often does not change the algorithm design, as only one message is to be disseminated throughout the network. However, even in this case, it may play a significant role, especially if large control messages concerning already detected faults need to be circulated during the algorithm execution. The size of packets can vary from unit, when each packet can contain the value of only one node, to unbounded, when packets are large enough to contain all values of nodes in the network. Large bandwidth availability, such as in
optical communication networks, makes the assumption of large, essentially unbounded packets increasingly plausible.

2.3. Fault Classification

Several assumptions are made concerning aspects of faults that can occur in the communication process. The first concern is to specify which components are fault-prone: only links, only nodes, or both links and nodes. Furthermore, the nature of faults must be described. The two most commonly studied types are crash and Byzantine faults. If the fault is a crash, the faulty node does not send or receive messages or the faulty link does not transmit messages. Faulty components do not alter transmitted messages; they fail to transmit messages at all. Such faults are relatively benign. Although some information may be lost, at least the information that is received can be trusted. Byzantine faults, on the other hand, are a worst-case fault scenario: Faulty components can behave arbitrarily (even maliciously) as transmitters, by stopping, rerouting, or altering transmitted messages in a way most detrimental to the communication process. Byzantine failures that exhibit all these kinds of damaging behavior rarely occur in practice. They may be caused by a hostile agent whose aim is to destroy the communication process. On the other hand, the concept of Byzantine faults plays an important role in our study, representing a worst-case assumption. Communication algorithms that work correctly in the presence of Byzantine faults can be used safely under any fault scenario.

Another important characteristic of faults, which must be specified, is their duration. Faults can be either permanent (i.e., the status faulty/fault-free of a component does not change during the algorithm execution) or transient (i.e., the status of a component may change in each unit of time). Permanent faults usually correspond to hardware component failures, while transient link faults correspond to transmission failures due, e.g., to temporary magnetic interference. Knowledge as to which types of faults are likely to occur is important in communication algorithm design. Repeating attempts to transmit the same message along the same link is useless in case of permanent faults but may be essential if faults are of a transient nature.

2.4. Fault Distribution

One of the crucial assumptions made about faults concerns their distribution. Clearly, some limitations on the number of possibly faulty components must be imposed; otherwise, no communication is possible. Two commonly used fault models are the bounded model and the probabilistic model. In the bounded model, an upper bound k is imposed on the number of faulty components; their worst-case location is assumed. In the probabilistic model, faults are assumed to occur randomly and independently of each other, with specified probability. The choice between these two assumptions regarding fault occurrence influences the definition of the goal of broadcasting and gossiping in the presence of faults.

In the bounded fault model, we usually seek what is termed k-tolerant broadcasting and gossiping. In case of broadcasting, the source is assumed fault-free (otherwise, no broadcasting is possible); its message must reach all fault-free nodes provided that no more than k components (links or nodes or both, depending on the particular scenario) are faulty. In the case of gossiping, all fault-free nodes must receive messages from all fault-free nodes, provided that no more than k components are faulty. It should be stressed that no agreement concerning messages from faulty nodes is required among the fault-free nodes; this is where fault-tolerant gossiping differs from Byzantine Agreement in the case of Byzantine node failures.

In the probabilistic fault model, the communication goal cannot be achieved with certainty, since, with some small probability, all components may fail and preclude any message transmission. As a result, almost safe broadcasting and gossiping is sought. In broadcasting, all fault-free nodes must receive the source message with probability at least $1 - 1/n$ (the source being fault free); in gossiping, all fault-free nodes must receive messages from all fault-free nodes with a specified probability. Designing efficient communication algorithms whose reliability increases for networks of larger size is difficult for the following reason: In small networks, it is relatively easy to achieve fault-tolerant communication using massive redundancy; due to the small scale of the network, resources used by such brute-force communication procedures (either time or number of messages) will not be excessive. In larger networks, however, the trade-off between reliability and efficiency becomes an important issue; for such networks, highly reliable and efficient algorithms are sought.

2.5. Flexibility of Algorithms

In a fault-free environment, broadcasting and gossiping algorithms have a simple form. All calls to be carried out in each time unit can be specified in advance, before the algorithm execution. In the presence of faults, a distinction must be made regarding this point, which will significantly affect the efficiency and robustness of fault-tolerant communication. Broadcasting and gossiping algorithms can be either nonadaptive, also called oblivious, where all calls must be scheduled in advance, or adaptive, where every node can schedule its next call based on information currently available to it. In adaptive algorithms, a node becomes aware of whether a call it attempted was successful (i.e., if the called node and the connecting link are fault-free); hence, different calls can be executed depending on the success or failure of previous ones. Even in this second case, however, nodes can
only take advantage of locally available information; we do not assume the existence of a central monitor supervising the overall communication process. Adaptive algorithms require more local control and memory at each node but are usually more efficient for the same fault-tolerance level.

2.6. Models in the Literature

All the above characteristics of the communication and fault models must be specified if results concerning fault-tolerant broadcasting and gossiping are to be meaningful. Many authors make some of the assumptions tacitly, and the complete description of the model can only be inferred from the algorithm description or from arguments concerning efficiency and robustness of the communication schemes. The interest of the research community in different models varies; the popularity of some of the proposed models has changed over time. An example is the choice to be made between the bounded and probabilistic models; interest in nonadaptive vs. adaptive communication algorithms has also varied with time. The first papers in the area of fault-tolerant broadcasting and gossiping favored the bounded fault model and nonadaptive algorithms. This is probably due to the fact that, at first, broadcasting and gossiping were viewed mostly as combinatorial, graph theoretic problems; communication schemes have a simple combinatorial formulation under the above assumptions. As the domain of research has evolved, several researchers pointed out that the random fault assumption corresponds much better to failure occurrence in real networks; as a result, the probabilistic fault model gained prominence. In addition, as local memory and computation capabilities of processors have increased, the implementation of adaptive algorithms has become more realistic. At present, there is a good balance of research conducted in each of the above models; we can see technological advances as well as intrinsic interest in combinatorics influencing the choice of particular communication scenarios.

2.7. Notation

We use the following notation through the rest of the paper: We let n denote the number of nodes in the network. Unless otherwise specified, the network is a complete graph. An m-hypercube is the graph on $2^m$ nodes labeled by binary sequences of length m in which adjacent nodes have labels differing in exactly one place. We denote by log the logarithm with base 2 and by ln the natural logarithm.

3. THE BOUNDED FAULT MODEL

In this section, we assume that there are at most k faulty components in the network and their distribution is worst-case. We denote by T the minimum time of k-tolerant broadcasting or gossiping and by C the minimum number of calls used in such a communication process.

3.1. Permanent Link Faults

We first assume that nodes are fault free and there are at most k permanently faulty links in the network. As such, faults are of crash type. We assume the communication mode is whispering.

The first results on the number of calls in this model were proved in [7], where both the full-duplex and half-duplex variations were considered. In case of broadcasting, the exact value of C was obtained in both cases. For the full-duplex mode,

$C = \lceil \frac{k+2}{2}(n - 1) \rceil$, if $k \le n - 2$, and

$C = \lceil \frac{k+1}{2} n \rceil$, otherwise.

For the half-duplex mode, $C = (k + 1)(n - 1)$.

In case of gossiping, the exact value of C was obtained for the half-duplex mode: $C = (k + 2)n - 2$. However, for the full-duplex mode, only bounds on C have been established:

$\ldots \le C \le \lfloor (k + \ldots)(n - 1) \rfloor$, if $k \le n - 2$, and

$\ldots \le C \le \lfloor (k + \ldots)(n - 1) \rfloor$, if $k \ge n - 2$.

It was conjectured that the exact value of C is close to the upper bound; more precisely, that $C = [(k + 3)/2]n - const$. This conjecture was later disproved in [48] under the transient fault assumption (cf. Subsection 3.4).

The following results concern the minimum time T of nonadaptive, k-tolerant broadcasting in the full-duplex mode:

In [56], the lower bound $T \ge \lceil \log n \rceil + k$, for $n - 1 > k \ge 0$, was proved. Moreover, the exact value of T was established for k = 1, 2:

$T = \lceil \log n \rceil + 1$, if $k = 1$, $n > 2$,

$T = \lceil \log n \rceil + 2$, if $k = 2$, $n > 4$, $n \ne 2^r - 1$,

$T = \lceil \log n \rceil + 3$, if $k = 2$, $n > 6$, $n = 2^r - 1$.

In [56], the k-tolerant broadcast function $B_k(n)$ was defined and studied for the first time. This is the minimum number of links in a network supporting k-tolerant broadcasting in minimum time T. In [56], the following bounds on $B_1(n)$ were established:

[…]

Similar upper bounds on $B_2(n)$ were obtained for some values of n. The above upper bounds on $B_1(n)$ and $B_2(n)$ were later improved in [15]. Sparser networks supporting 1-tolerant and 2-tolerant broadcasting in minimum time were constructed, giving the following estimates:

$B_1(n) \le \frac{n}{2}\lceil \log n \rceil - \frac{n}{2} + 6$, for even n, and

[…]

for $n \equiv 0$ or $1 \pmod 4$, and

[…]

for $n \equiv 2$ or $3 \pmod 4$.

In [54], it was shown that

[…]

The above results concerning the minimum values for the time T of k-tolerant broadcasting and for the k-tolerant broadcast function $B_k(n)$ were extended in [42]. For all $k \le \lfloor \log n \rfloor$, the minimum time T of k-tolerant broadcasting was proved to be $\lceil \log n \rceil + k$ or $\lceil \log(n - 1) \rceil + k + 1$, depending on k. As for the function $B_k(n)$, it was proved in [42] to be $O(n(k + \log n))$, whenever $k \le \lfloor \log n \rfloor$. Moreover, under the additional assumption that $n = 2^m$, the exact value $B_k(n) = (m + k)n/2$ was established. Further bounds on the function $B_k(n)$ and its values for specific k and n can be found in [1]. In particular, the authors construct networks supporting k-tolerant broadcasting in minimum time, with a number of links differing from $B_k(n)$ by at most a constant factor, for $k < \lfloor \log n \rfloor$ if n is even and for $k \le \lfloor 2^{\lceil \log n \rceil} - n + 1 \rfloor$ if n is odd. It was also proved that

$B_1(2^m - 6) = \frac{(m - 1)(2^m - 6)}{2}$, when m is even.

For general values of k, two upper bounds on time T had been obtained earlier in [64]. First, it was proved that $T \le 2k + \lceil \log n \rceil + 4$. This upper bound was also obtained independently in [57]. For large k, e.g., linear in n, it can be advantageous to decrease the coefficient at k at the expense of increasing the coefficient at $\lceil \log n \rceil$. In these situations, the second upper bound from [64] is useful:

$T \le (1 + \frac{1}{c})k + 2c\lceil \log n \rceil + d$,

for any constant $c \ge 1$, $k \le [(n - 1)/(8c + 1)] - c$, and a constant d depending on c but not on n or k.

The exact values of T for arbitrarily large k were first obtained in [37], for n being a power of 2. If $n = 2^m$ and $k \le n - 2$, k-tolerant broadcasting schemes were constructed that perform in the optimal possible time:

$T = \log n + k$, if $k \le n - \log n - 1$, and

$T = \log n + k + 1$, if $n - \log n \le k \le n - 2$.

Moreover, the networks supporting these schemes have the minimum possible number of links $(m + k)n/2$, thereby proving that $B_k(n) = (m + k)n/2$, for $n = 2^m$ and $k \le n - 2$.

In the same paper, tighter upper bounds on T for other values of n and k were also established. Let $m = \lfloor \log n \rfloor$ and $n = 2^m + j$, where $0 < j \le 2^{m-2}$. It was proved that

$T \le m + k + 1 + \lfloor \frac{\ldots}{m - \lceil \log j \rceil - 1} \rfloor$, if $k \le 2^m - m - 1$, and

$T \le m + k + 2 + \lfloor \frac{k - 1}{m - \lceil \log j \rceil - 1} \rfloor$, if $2^m - m - 1 < k \le 2^m - 2$.

The tightness of these bounds depends on k and j. For small values of j, when n is close to $2^{\lfloor \log n \rfloor}$, and small values of k, close to log n, the upper bound does not exceed $\log n + k$ very much and thus it is fairly tight.
On the other hand, for large j, close to $2^{m-2}$, the bound becomes close to $\log n + 2k$, which is then similar to the bound $T \le 2k + \lceil \log n \rceil + 4$ from [64] and [57] and is far from the lower bound when k is large. The exact values of broadcasting time for arbitrary n and k remain unknown.

In [59], k-tolerant broadcasting and gossiping was considered in the shouting mode for product graphs $G = G_1 \times G_2$. The authors derived bounds on the minimum time and the minimum number of calls for k-tolerant broadcasting and gossiping in G depending on these values for $G_1$ and $G_2$.

A different approach to broadcasting in the presence of faults was adopted in [12]. In the usual scenario of k-fault-tolerance, even an adaptive algorithm does not assume any a priori knowledge of fault location; this information must be acquired by nodes during the communication process, which is usually costly in terms of time and calls. In [12], fault-tolerant broadcasting in the m-hypercube was considered, assuming that the location of faults is a priori known to the source. Packets were assumed of size $O(m)$ and a variation of the whispering mode was considered, in which each node could send information to two neighbors in a unit of time. Finally, it was assumed that the number of faulty links does not exceed $\lceil m/2 \rceil - 1$. Under these assumptions, it was shown that a nonadaptive broadcasting algorithm for the m-hypercube existed working in optimal time m and using the optimal number of $2^m - 1$ calls.

In [40], the linear communication model was used. In this model, the time to transmit a message of length L is $\beta + L\tau$, where $\beta$ is the startup time and $L\tau$ is the propagation time. Four communication modes were considered: whispering and shouting in both the half-duplex and full-duplex variations. Also, both transient and permanent link failures were investigated. Under these assumptions, the authors used Rabin's Information Dispersal Algorithm (IDA) to construct fast k-tolerant broadcasting algorithms for the hypercube. Given a file F, IDA encodes it into n smaller files $F_1, \ldots, F_n$ in such a way that F can be recovered from any subset of $n - k$ files $F_i$. Broadcasting proposed in [40] was performed so that one failure could only affect one file $F_i$ and, consequently, the algorithm was k-tolerant. For each of the considered communication modes, the execution time of this algorithm was shown to be smaller than the time of broadcasting using the straightforward $(k + 1)$-replication approach.

In [30], randomized broadcasting in the whispering mode was considered. Informed nodes decide randomly to which neighbor they transmit the message at each time unit. (This should not be confused with deterministic broadcasting in the presence of randomly distributed faults, to be considered in Section 4.) If k < cn, for a constant c < 4, random broadcasting in the presence of at most k faulty links was proved to be completed in time $O(\log n)$, with probability converging to 1.

3.2. Permanent Node Faults

We now assume that links are fault-free and there are at most k permanently faulty nodes. Faults are of the crash type; the communication mode is full-duplex whispering. We first consider the minimum time T of k-tolerant adaptive broadcasting. The first results in this model were obtained in [29]. It was proved that if there are at most k nonadjacent faulty nodes in some chordal rings then adaptive broadcasting can be performed in time $\lceil \log n \rceil + k$. (Chordal rings are rings $[x_0, \ldots, x_{n-1}]$ with additional links $[x_i, x_{(i+t) \bmod n}]$, where $t \in S$, for a fixed $S \subset \{1, \ldots, n - 1\}$.) Adaptive broadcasting in complete networks has been recently investigated in [45]. Two variations of the above model were considered: In the first, called the wake-up model, only nodes that have already obtained the source information (i.e., are awake) can attempt placing calls. In this model, the exact value of minimum broadcasting time $T = k + \lceil \log(n - k) \rceil$ was obtained. The second variation does not impose any restriction on nodes placing calls. Uninformed nodes can also place calls; this enables the algorithm to perform preprocessing and avoid time-consuming calls from the source to faulty nodes. In this general model, the authors constructed a k-tolerant broadcasting algorithm working in time $O(\log^2 n)$, whenever $k \le \alpha n$, for a constant $\alpha < 1$. It remains open if adaptive broadcasting can be done in time $O(\log n)$.

In [43], nonadaptive broadcasting in the half-duplex whispering mode was considered, under the same assumption as above: $k \le \alpha n$, for a constant $\alpha < 1$. The authors established upper and lower bounds on T and constructed a k-tolerant broadcasting algorithm requiring time at most 1.73 times greater than optimal.

In [52], a communication mode in between whispering and shouting was considered: Every node can transmit to at most k other nodes in a unit of time. The authors proposed nonadaptive k-tolerant broadcasting schemes working for the complete n-node network in time $\lceil \log_{k+1} n \rceil + 3$. For k = 1 and k = 2, these schemes work in time $\lceil \log_{k+1} n \rceil + 2$, and if no faults are present, they work in optimal time $\lceil \log_{k+1} n \rceil$.

In [2], the goal considered was that of minimizing the number of calls in fault-tolerant gossiping. It was assumed that nodes fail in a Byzantine manner, but they have diagnostic capabilities: A fault-free node can correctly diagnose tested nodes, while faulty testers are unpredictable. The half-duplex whispering mode was adopted, and the size of packets was assumed unbounded. The authors presented an adaptive gossiping algorithm using $3n \log n + O(n)$ calls and working correctly whenever the fault-free part of the network is connected.
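
As a rough illustration of two of the time values quoted in this subsection, the following Python sketch evaluates the wake-up model time $k + \lceil \log(n - k) \rceil$ reported from [45] and simulates the fault-free case of the mode from [52], where each informed node can inform up to k new nodes per time unit. The function names are hypothetical and chosen only for this example; the simulation is not the scheme of [52], only a count of how fast the informed set can grow when no faults are present.

```python
def ceil_log(n: int, base: int = 2) -> int:
    """Smallest integer t >= 0 with base**t >= n (integer arithmetic only)."""
    t, power = 0, 1
    while power < n:
        power *= base
        t += 1
    return t


def wakeup_broadcast_time(n: int, k: int) -> int:
    """Evaluate T = k + ceil(log2(n - k)), the wake-up model value quoted from [45]."""
    if not 0 <= k < n:
        raise ValueError("need 0 <= k < n")
    return k + ceil_log(n - k, 2)


def fault_free_rounds(n: int, k: int) -> int:
    """Count rounds until n nodes are informed when, with no faults,
    every informed node informs up to k new nodes per round."""
    informed, rounds = 1, 0
    while informed < n:
        # Each informed node adds at most k newly informed nodes.
        informed = min(n, informed * (k + 1))
        rounds += 1
    return rounds


if __name__ == "__main__":
    for n, k in [(16, 0), (16, 3), (100, 10)]:
        print(n, k, wakeup_broadcast_time(n, k))
    for n, k in [(27, 2), (28, 2), (1000, 3)]:
        print(n, k, fault_free_rounds(n, k), ceil_log(n, k + 1))
```

In the second loop the simulated round count and $\lceil \log_{k+1} n \rceil$ coincide, which is the sense in which the fault-free time of this mode is optimal.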

3.3. Permanent Link and Node Faults

We now assume that both links and nodes can fail and that the total number of such permanent crash faults is at most k. In [65], nonadaptive broadcasting in the m-hypercube was considered under the assumption that nodes can simultaneously send and receive messages and that the number k of faults is less than m. A broadcasting algorithm was proposed, requiring time at most m + 1 if a node can simultaneously transmit to all neighbors and time at most 2m if transmitting a message is possible only to one neighbor at a time.

In [13], a communication mode in between whispering and shouting was considered. The authors assumed that each node can communicate with at most t neighbors in a unit of time. Assuming that k < m, they proposed a scheme for k-tolerant broadcasting in the m-hypercube requiring a time of $m + \lceil (k + 1)/t \rceil$. Some bounds on the broadcasting time in the presence of a larger number of faults that do not disconnect the hypercube were also shown.

In [49], k-tolerant broadcasting in jumping networks was considered. These are n-node chordal rings in which nodes u and v are adjacent iff $u - v \equiv 2^r \pmod n$, for some integer r. In the full-duplex whispering mode, the upper bound $T \le \lceil \log n \rceil + k + 2$ on k-tolerant broadcasting time was established under the assumption $k \le \lceil \log n \rceil - 2$.

Nonadaptive k-tolerant broadcasting in product networks $G = G_1 \times \cdots \times G_n$ was considered in [3]. The authors proposed broadcasting algorithms for such networks using the construction of n independent spanning trees rooted at the source $s = (s_1, \ldots, s_n)$ (the paths from the source to any node in distinct trees are mutually internally node-disjoint). These algorithms can tolerate up to $\lfloor (n - 1)/2 \rfloor$ Byzantine faults or up to n - 1 crash faults. The execution time t of these algorithms was derived in terms of broadcasting time in the factor networks $G_i$, both in the whispering and in the shouting mode. For the whispering mode, $t = 2\sum_{i=1}^{n-1} \beta_i + \beta_n + 1$, and for the shouting mode, $t = 1 + \sum_{i=1}^{n} \alpha_i$, where $\beta_i$ and $\alpha_i$ denote optimal fault-free broadcasting time from $s_i$ in $G_i$, in whispering and shouting mode, respectively.

In [32], the previously described linear communication model was used and k-tolerant broadcasting and gossiping algorithms were proposed for the hypercube in both shouting and whispering modes. It was proved that their running time is asymptotically optimal, i.e., it converges to the lower bound as the length L of the messages increases.

A different approach to broadcasting in the presence of faults, similar to that in [12], was adopted in [63] and [41]. In both papers, the location of faults was assumed […] algorithm was described that performs broadcasting in the presence of up to m - 2 arbitrary node or link failures, using the optimal time m and the optimal number of calls $2^m - 1$. This should be compared to the previously mentioned result of [12], obtained under different assumptions. In [12], the number of faults was smaller and only links were fault-prone, but the location of faults was known only to the source and the communication mode was more restrictive, i.e., a node could communicate with only two neighbors in a unit of time.

A similar result for the star graph was obtained in [41]. The n-star graph $S_n$ is a graph whose nodes are labeled by all permutations of integers 1, ..., n and the node $u = (u_1, \ldots, u_n)$ is adjacent to all nodes u[i], for i = 2, ..., n, where u[i] results from u by a transposition of $u_1$ and $u_i$. $S_n$ has n! nodes and the diameter $D(S_n) = \lfloor 3(n - 1)/2 \rfloor$. The authors determined the maximum number r(n) of link or node faults such that the fault-free part of $S_n$ still has diameter $D(S_n)$. Assuming at most r(n) link or node faults known to all nodes, they proposed a broadcasting algorithm for $S_n$ working in optimal time $D(S_n)$ and using the optimal number of calls n! - 1. In the traditional fault-tolerant setting, i.e., in the presence of k unknown faults, they proposed k-tolerant broadcasting using the optimal number of $(k + 1)(n! - 1)$ calls.

A variation of the broadcasting problem, called linear broadcasting, was first considered in [10, 22]. See also Problem 13 by Greenberg (the Report Dispersal Problem) in [34]. The source has an unrestricted number of identical tokens. In any unit of time, each node that holds tokens can send at most one token to at most one other node. The goal is for all nodes to be visited by at least one token. The difference from broadcasting is that tokens may not be "multiplied" for free at any node but have to travel to each node from the source. In [27], a fault-tolerant version of linear broadcasting was considered. Each token has a predetermined route, indicating in which order nodes should be visited. A faulty node or link destroys all tokens passing through it. A linear broadcasting scheme consisting of token routes is k-tolerant if every fault-free node is visited by at least one token, whenever the total number of faults does not exceed k. The performance measure of a k-tolerant scheme, adopted in [27], was the number of tokens used by the scheme. For positive integers k and n, let $P_k(n)$ denote the minimum number of tokens for which there exists a k-tolerant linear broadcasting scheme in a (n + 1)-node complete network. The authors established lower bounds on $P_k(n)$ and showed k-tolerant schemes using few tokens. It was proved, in particular, that the minimum number of tokens sufficient to perform 2-tolerant linear broadcasting is $\Theta(\log \log n)$ and that it is 2 for 1-tolerant linear broadcasting.
to be known to all nodes. The adopted mode of communi- 3.4. Transient Faults
cation was shouting. Under these assumptions, broadcast- We now turn attention to transient faults. Let us first
ing in the m-hypercube was considered in [ 631. An algo- note that if a broadcasting or gossiping algorithm works
correctly assuming at most k permanent link and/or node faults then it also works correctly assuming at most k transient link and/or node faults during the entire communication process. Hence, all upper bounds on time and the number of calls reported above for permanent faults still hold in the present scenario. In this subsection, we discuss results that use the assumption that faults are transient, i.e., the same fault cannot prevent transmission in two distinct time units, unlike in the permanent fault case.

All results concern only link faults. We assume that nodes are fault free and links are subject to transient failures, i.e., there are at most k faulty calls during the communication process. Faults are of crash type, i.e., no message passes during a faulty call. Packets are unbounded, the communication mode is whispering, and algorithms are nonadaptive.

The first results concerning the minimum number C of calls were proved in [48]. As mentioned in Subsection 3.1, it was conjectured in [7] that $C = [(k + 3)/2]\,n - \mathrm{const}$ under the assumption of at most k permanent link faults. This conjecture was disproved in [48] under the stronger assumption of at most k transient link faults. The following upper bound was shown:

$$C \le \frac{nk}{2} + O(k\sqrt{n} + n \log n).$$

In [48], a class of improved upper bounds for almost all k was also obtained, e.g.:

$$C \le \tfrac{1}{2}(k + 8)(n - 1) + 2k + 16.$$

The exact value of the minimum number of calls in k-tolerant gossiping in this model is still unknown.

A more general measure of communication cost has been considered in [39]. Every link (i, j) of the complete network on n nodes is assigned a positive cost c(i, j). The total cost of a broadcasting or gossiping algorithm is obtained by adding c(i, j) whenever a message travels through the link (i, j). [For example, the total cost is equal to the number of calls in the half-duplex mode, when all costs c(i, j) are equal to 1.] It was proved in [39] that, for given costs of links, the problem of determining the minimum cost of performing k-tolerant gossiping among n nodes is NP-hard. The authors proposed a k-tolerant gossiping algorithm with cost at most twice the optimal. Moreover, their gossiping scheme requires time at most $(k + 2)n - 1$ larger than optimal. In the case of broadcasting, a k-tolerant algorithm was constructed with cost k + 1 times larger than the cost of a minimum spanning tree. It was proved that this cost is optimal among all k-tolerant broadcasting algorithms.

The first result on the time of gossiping in the full-duplex mode with transient link faults was proved in [48]. A k-tolerant algorithm was proposed with running time in $O(\log n + k)$. This order of magnitude is clearly optimal. This result was subsequently strengthened in [37] where the following upper bounds on T were proved:

$T \le m + k$, if $n = 2^m$, which is tight, and

$T \le m + 3k + 1$, if $n = 2^{m-1} + j < 2^m$.

Moreover, the gossiping algorithm constructed in [37], requiring a minimum time $T = m + k$, when $n = 2^m$, has an additional feature: It uses the minimum number $(m + k)n/2$ of calls. The scheme also improves on the number of calls, when compared to [48] for many values of k and j. The exact value of T for this model, for arbitrary n and k, remains unknown.

For broadcasting in the half-duplex variation of this model, the following lower bounds were proved in [38]. For k = 1 and arbitrary n, $T = \lceil \log n \rceil + 2$ and $C \ge 2n - 2$. On the other hand, a 1-tolerant broadcasting scheme was shown for which both the time and the number of calls achieve the above optimal values. For arbitrary $n = 2^m > 2k$, it was proved that the minimum time of k-tolerant broadcasting is $T = m + 2k$. Moreover, it was shown that k-tolerant broadcasting in time $m + 2k$, for k > 2"-', can be achieved in the symmetric directed m-hypercube. This network has the minimum number of links among all networks supporting k-tolerant broadcasting in the optimum time $m + 2k$.

We now return to the full-duplex mode. The following variation on the assumption of at most k faulty calls was recently considered in [44]. Since the number of call failures is likely to increase with execution time, instead of imposing a fixed upper bound on this number, it seems reasonable to assume that the possible number of failures is proportional to the time elapsed since the beginning of the algorithm execution. To make broadcasting possible in the whispering mode, the proportion coefficient must be smaller than 1; otherwise, no message leaves the source in the worst case. Fix $\alpha < 1$. Assume that, for any $t > 0$, at most $\alpha t$ calls may fail in the first t time units of the algorithm execution. The following results regarding broadcasting time in this linearly bounded transient fault model were obtained in [44].

If the network is an n-node chain, then

$T \in O(n)$, if $\alpha < \tfrac{1}{2}$, and

TE a( (A)") , if $\alpha \ge \tfrac{1}{2}$.

If the network is an m-hypercube then, for any $\alpha < 1$,

T = L L ( r n - l)]+rn.
1 -a
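The fault budget of the linearly bounded model can be made concrete with a small check. The sketch below is illustrative only: the function name `respects_linear_bound`, the parameter name `alpha` for the proportion coefficient, and the representation of an execution as a list of per-time-unit failure counts are assumptions made here, not constructions from [44]. It tests whether, for every t, the number of calls that failed within the first t time units is at most the proportion coefficient times t.

```python
from fractions import Fraction
from itertools import accumulate


def respects_linear_bound(failures_per_round, alpha):
    """Check the linearly bounded fault constraint (illustrative helper).

    failures_per_round[i] is a hypothetical count of failed calls in time
    unit i + 1; alpha is the proportion coefficient. The constraint holds
    if, for every t >= 1, the total number of failures in the first t time
    units is at most alpha * t.
    """
    alpha = Fraction(alpha)  # exact arithmetic, avoids float rounding
    totals = accumulate(failures_per_round)
    return all(total <= alpha * t for t, total in enumerate(totals, start=1))


# With coefficient 1, one call may fail in every time unit, so in the
# whispering mode every call placed by the source can be lost.
print(respects_linear_bound([1, 1, 1, 1], 1))               # True
# With coefficient 1/2 the same pattern is illegal in the first time unit.
print(respects_linear_bound([1, 1, 1, 1], Fraction(1, 2)))  # False
# Failing every second call stays within the budget for coefficient 1/2.
print(respects_linear_bound([0, 1, 0, 1], Fraction(1, 2)))  # True
```

The first call shows why the coefficient must be smaller than 1 in the whispering mode: a failure pattern that blocks one call per time unit is admissible exactly when the coefficient is at least 1.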
If the network is the complete n-node graph, then, for any $\alpha < 1$,

$$T \in \frac{1}{1 - \alpha} \log n + O(\log \log n).$$

A similar assumption that the number of faulty transmissions may increase in time was addressed in [35]. The authors considered the shouting communication mode; therefore, it was possible to allow a larger number of failures per time unit. They assumed that at most k calls may fail in each time unit for a constant k smaller than the edge-connectivity of the network. (For larger values of k, the network could be disconnected.) In [35], the issue of broadcasting time in the m-hypercube was studied for this model, where it was proved that $T \in m + o(m)$. This model was further investigated in [21] for general networks. Let d be the diameter of the network and let k (smaller than the edge-connectivity) be the maximum number of faulty calls per time unit. The following results were proved in [21]:

For fixed d, $T \in O(k^{d/2 - 1})$, for arbitrary networks, and there exist networks for which $T \in \Theta(k^{d/2 - 1})$.

For fixed k, T E O(d'+ I ), for arbitrary networks, and there exist networks for which T E 0(d'* I ).

For multidimensional tori, $T \in O(d)$. As an open problem, the authors asked whether T is linear in the diameter, for all vertex transitive graphs.

A different way of defining fault-tolerant broadcasting was proposed in [58]. Instead of demanding that a broadcasting scheme tolerate at most k link faults in the network, for a given parameter k, the requirement of fault-tolerance was allowed to vary from node to node. Fix a network N and a broadcasting source s. Given a node $u \ne s$, let $k_u$ be the minimum number of links whose deletion disconnects u from s. A broadcasting algorithm is reliable for the node u, if u gets the source message whenever less than $k_u$ links are subject to transient crash faults. A broadcasting algorithm is faithful if it is reliable for every node. In [58], the author investigated the trade-off between the minimum number of calls used by a faithful broadcasting algorithm and the maximum amount of local memory needed in a node. The latter is called the space complexity of the algorithm. It was proved that every faithful broadcasting algorithm for an arbitrary network uses at least $\sum_{u \ne s} k_u$ calls. Moreover, for any network, there exists a faithful broadcasting algorithm of linear space complexity, using the above optimal number of calls. On the other hand, arbitrarily large networks were constructed for which all faithful broadcasting algorithms using no local memory for computations at nodes use a number of calls exponential in the size of the network. Finally, the author characterized networks that support faithful broadcasting using the optimal number of calls and no local memory for computations.

4. THE PROBABILISTIC FAULT MODEL

In this section, we assume that links fail with probability $0 \le p < 1$, nodes fail with probability $0 \le q < 1$, and all faults are independent. The values p = 0 (q = 0) correspond to the assumption that links (nodes) are fault-free. Unless explicitly stated, algorithms are deterministic and probabilistic considerations relate only to their correctness and/or efficiency.

In most of the papers using the probabilistic fault model, results concerning the execution time and the number of calls used by broadcasting and gossiping algorithms are of asymptotic nature, i.e., only their orders of magnitude (up to a multiplicative constant) are minimized. Using this approach, results concerning broadcasting and gossiping are usually equivalent when packets are assumed of unbounded size. Indeed, a reverse of a broadcasting algorithm can be used to gather all values in one node; then, the total information can be broadcast from this node to all other nodes. In this way, both the time and number of calls are at most doubled.

In the probabilistic model, there are two natural variations for defining the performance of adaptive algorithms in terms of time and number of calls. Both these values are random variables, as they depend on the location of faults which are random. Thus, we may ask about the worst-case or the expected time and number of calls used by the algorithm. It will be seen below that this distinction is sometimes significant. It does not occur in the case of nonadaptive algorithms, as all transmissions in such algorithms are scheduled in advance and do not depend on the random occurrence of faults.

4.1. Permanent Link and Node Faults

We first assume that all failures are permanent and of crash type and that the communication mode is full-duplex whispering. First, consider the fault-free nodes scenario (q = 0). In the following papers, p < 1 was assumed to be constant. In [9], a nonadaptive, almost safe broadcasting algorithm was proposed. It used nonconstructive expanders and worked in time $O(\log n)$. In [25], broadcasting and gossiping were studied under the assumption of unbounded packets. A simple treelike construction was applied to guarantee almost safe adaptive broadcasting and gossiping using an expected time $O(\log n)$ and an expected number of calls $O(n)$. Nonadaptive broadcasting and gossiping algorithms were proposed, working in time $O(\log^2 n)$ and using $O(n \log n)$ calls. The order of the number of calls was proved to be optimal for nonadaptive algorithms (cf. also [9]). Later, the construction from [25] was extended in [17] to decrease nonadaptive broadcasting time to $O(\log n)$, as well, under more general assumptions. Also, in [17], the result from [25] concerning adaptive broadcasting was strengthened. An adaptive, almost safe broadcasting algorithm was
given whose worst-case time was $O(\log n)$ and worst-case number of calls was $O(n)$, both orders of magnitude being optimal. In [16], almost safe broadcasting was considered for the m-hypercube. The authors constructed a nonadaptive broadcasting algorithm requiring time $O(m)$, whenever $p \le 0.09$.

In [66], almost safe, adaptive broadcasting was considered, assuming that link failure probability p is larger than constant. Let $p^* = 1 - p$ denote the probability that a link is fault free, and let $\omega(n)$ be any function diverging to infinity. For $p^* = [\omega(n) \log n]/n$, the authors proposed an almost safe broadcasting algorithm working in time $\log n + o(\log n)$, with probability converging to 1. Under the slightly stronger assumption $p^* = (K \ln^2 n)/n$, where $K > 12/(\ln 2)$, they showed an almost safe broadcasting algorithm working in optimal time $\lceil \log n \rceil$.

These results were subsequently extended in [46]. The author showed almost safe broadcasting in time $\log n + d \log \log n$, for d < 1, whenever $p^* = (c \ln n)/n$, where c > 18.4. Moreover, almost safe broadcasting in time $\lceil \log n \rceil$ was shown for $p^* = (c \ln n \cdot \log \log n)/n$, where c > 16. In [36], the latter result was strengthened: Almost safe broadcasting in time $\lceil \log n \rceil$ was shown for $p^* = (c \ln n)/n$, for some positive constant c.

In [47], the results from [46] were extended in another way. On the one hand, almost safe broadcasting in time $\lceil \log n \rceil + 1$ was shown for $p^* = (c \ln n \cdot \log \log \log n)/n$, where c > 16. On the other hand, the author showed almost safe broadcasting in time $\log n + d \log \log \log n$, for d < 1, whenever $p^* = (c \ln n)/n$, where c > 18.4, thus improving required time for this probability value.

Randomized broadcasting in the above fault model was considered in [30]. For $p^* = [\log n + \omega(n)]/n$, where $\omega(n)$ is any function diverging to infinity, randomized broadcasting was proved to be almost safe and work in time $O(\log^2 n)$, while for $p^* \ge [(1 + \epsilon) \log n]/n$, its running time is $O(\log n)$.

In the following papers, both links and nodes were assumed fault-prone. In [17], nonadaptive broadcasting was considered for arbitrary constant fault probability values p, q < 1. The authors proposed an almost safe broadcasting algorithm working in time $O(\log n)$, using a treelike construction. This eliminated the need of nonconstructive expanders used in [9], simultaneously weakening the assumption regarding node faults (in [9], nodes were assumed fault-free). In [17], an adaptive, almost safe broadcasting algorithm was also given having worst-case running time $O(\log n)$ and expected number of calls $O(n)$. It remains open whether both the running time $O(\log n)$ and the number of calls $O(n)$ can be guaranteed in the worst case. As previously mentioned, the authors gave a positive answer to this question assuming that nodes are fault-free.

A variation of broadcasting, called waking up, was considered in [19]. In the beginning, only the source is awake and has to wake up all fault-free nodes by sending the wake-up message. The difference from classical broadcasting is that only nodes that are awake (i.e., already have received the message) can place calls; a dormant (uninformed) node cannot call to seek the source message, as in broadcasting. This restriction has a significant impact on the minimum number of calls of an almost safe, waking-up algorithm, in the case when links and nodes are fault prone. It was proved in [19] that every such algorithm (even adaptive) must use an expected number of $\Omega(n \log n)$ calls. (This can be contrasted with the above-mentioned result from [17], where classical broadcasting was performed with an expected linear number of calls.) Moreover, the authors constructed an almost safe wake up algorithm working in expected time $O(\log n)$, and using an expected number of $O(n \log n)$ calls, both orders of magnitude being optimal. An additional feature of the above scheme was that it worked in anonymous (complete) networks in which nodes do not know their labels and execute identical algorithms.

In [20], nonadaptive broadcasting in the hypercube was considered. It is well known that, in this case, almost safe broadcasting is impossible for large fault probability values, as the hypercube can be then disconnected with constant positive probability. For small probabilities of faults, satisfying the condition $(1 - p)(1 - q) \ge 0.99$, an almost safe broadcasting algorithm working in time $O(m)$ for the m-hypercube was constructed in [20].

As observed previously, for packets of unbounded size, the above results can be immediately extended to gossiping. The situation changes significantly if we consider smaller packets. For unit-size packets, e.g., gossiping requires linear time even without faults. In [28], the relations between the size of packets and the time of fault-tolerant broadcasting were investigated, for arbitrary fault probability values p, q < 1. For packets of size b(n), the authors constructed a nonadaptive almost safe gossiping algorithm working in time $O([n/b(n)] + \log n)$, easily seen to be optimal. For the unbounded packet size, this yields gossiping time $O(\log n)$, which also follows from [17], while for unit-size packets, this yields linear gossiping time. The algorithm in [28] used explicitly constructible expanders.

It should be mentioned that in the above gossiping algorithm, nodes do not know, a priori, whose value is currently transmitted; thus, it is necessary to attach node labels to values during transmissions. Since labels must use at least $\log n$ bits, a packet of size b(n), which, by definition, must contain values of b(n) nodes, must, in fact, contain $b(n) \log n$ bits, even for one-bit values. For example, unit-size packets must contain $O(\log n)$ bits. This should be compared to the situation in [18] and [11], where labels did not have to be attached and packets could have only a constant number of bits.

In [14], a nonadaptive broadcasting algorithm tolerating at most $m - 1$ faults in the m-hypercube was proposed.
Its running time is (2" - 1 )m and the number of calls is m2" - 2" + 1. In addition, the author performed simulations showing that for fault probability p s f all nodes become informed after an average time of less than 5m.

Byzantine link failures under the assumption that nodes are fault free were considered in [6]. Link failure probability was a constant $p < \tfrac{1}{2}$; otherwise, no reliable communication can be achieved in a Byzantine environment. The communication mode was full-duplex whispering. Using nonconstructive expanders, the authors proposed a nonadaptive almost safe broadcasting algorithm working in time $O(\log n)$.

In [4], nonadaptive gossiping with unbounded packets was considered for the m-hypercube. The authors assumed that either links or nodes fail in a Byzantine way. For fault-free links (respectively, nodes) and nodes (respectively, links) failing with probability $p \le c/m$, with a small constant c, an almost safe gossiping algorithm requiring time 2m was presented. For fault-free nodes and links failing with probability $p \le c$, with a small constant c, almost safe gossiping in time $2m^2$ was shown. It remains open whether almost safe gossiping with unbounded packets, or even almost safe broadcasting, can be performed in the m-hypercube in time $O(m)$ when links fail in a Byzantine manner with a small constant probability.

Gossiping with unit-size packets in the presence of Byzantine faults was studied in [11]. The authors considered strongly nonadaptive algorithms in which not only transmission scheduling is done in advance, but it is also predetermined which node's value is to be sent in a given transmission. This enables the algorithm to skip the labels of nodes during transmissions, similar to the approach in [18] and unlike that in [28]. Thus, if node values have a constant number of bits, packets contain a constant number of bits, as well. Two communication modes were considered. In mode 1, sending was performed by shouting, i.e., a node could send a packet simultaneously to all neighbors, but receiving was sequential, i.e., a node could receive only one packet at a time. Mode 2 was classical whispering. These two communication modes were combined with two assumptions regarding faults: (N) fault-free links and nodes failing with constant probability $0 < q < \tfrac{1}{2}$ and (L) fault-free nodes and links failing with constant probability $0 < p < \tfrac{1}{2}$. All four models yielded by combinations of these assumptions were considered, labeled as L1, L2, N1, and N2. For each of these models, almost safe gossiping algorithms were constructed whose running time T and the number of calls C were shown to be of the smallest possible order of magnitude. In case of models L1 and L2, $T \in \Theta(n \log n)$, $C \in O(n^2 \log n)$ and the algorithm achieving this performance worked for the sparsest possible networks. In case of models N1 and N2, $T \in \Theta(n)$ and $C \in \Theta(n^2)$, if the underlying network is complete. For these models, gossiping algorithms working in sparser networks were also constructed and their performance was proved to be of the smallest possible order of magnitude among gossiping schemes working in these networks. For example, in case of the model N1, an almost safe gossiping algorithm was constructed, working in time $O(n\omega(n))$ and using $O(n^2\omega(n))$ calls, for any function $\omega(n) \to \infty$. This algorithm used only $O(n\omega(n))$ links for communication. It was proved, on the other hand, that almost safe gossiping in time $O(n)$ or using $O(n^2)$ calls is impossible in any network with $o(n^2)$ links.

The results from [11] should be compared to those from [28]. Consider the model L2 from [11]: whispering with fault-free nodes and faulty links. Suppose that node values are only 1 or 0. The lower bound $T \in \Omega(n \log n)$ on gossiping time, proved in [11], remains true for crash faults as well. This lower bound is valid for strongly nonadaptive algorithms described above and for one-bit packets. On the other hand, the algorithm from [28] works for link and node crash faults. It is not strongly nonadaptive in the above sense and works in time $O(n)$ for unit-size packets. However, as previously remarked, in [28], labels have to be attached to nodes; thus, unit-size packets must, in fact, contain a logarithmic number of bits. The following question remains open: Suppose that links and nodes of the complete network are subject to permanent crash faults with probabilities p < 1 and q < 1, respectively, and that all values of nodes have a constant number of bits. Does there exist an almost safe gossiping algorithm using packets containing a constant number of bits and working in linear time? A positive answer to this question for transient link faults and fault-free nodes was given in [18].

In [23], the linear broadcasting problem, discussed in Section 3, was considered under the name of token dispersal. The authors assumed that links and/or nodes of the (complete) network fail independently with constant probabilities and that an attempt to send a token to a faulty node or via a faulty link does not succeed (i.e., the token remains at the sending node). Every fault-free node has to be visited. The performance measure adopted in [23] was the running time of the scheme. If only nodes or only links can fail (p = 0 or q = 0), then an almost safe, token dispersal algorithm was shown with running time O( 6). If both links and nodes can fail (p, q > 0), then almost safe token dispersal was proposed with running time O ( G ). In both cases, algorithms were nonadaptive and the order of magnitude was proved to be optimal. Other fault-tolerant aspects of the token dispersal problem were considered in [24].

4.2. Transient Faults

We now assume that individual calls fail with constant probability $0 \le p < 1$ and all failures, including those
of calls placed along the same link, are independent. Node failures occur with probability $0 \le q < 1$; they are permanent and independent. Faults are of crash type and packets are of unbounded size. The communication mode is full-duplex whispering.

In [60], adaptive broadcasting for the complete network was considered in this model, assuming arbitrary p, q < 1. An almost safe broadcasting algorithm requiring expected time $O(\log n)$ and worst-case time $O(n \log n)$ was constructed. An improvement of this result follows from [17], where almost safe, nonadaptive broadcasting working in time $O(\log n)$ was given. The result from [17] holds for transient faults, as well, and thus yields worst-case time $O(\log n)$.

In the following papers, nodes were assumed fault-free, i.e., q = 0. Under this assumption, communication in n-node bounded degree networks of diameter D(n) was investigated in [26]. It was shown that almost safe, nonadaptive broadcasting and gossiping can be performed in such networks in time $O(D(n))$ using $O(n \log n)$ calls. Adaptive broadcasting and gossiping algorithms also were designed, working in worst-case time $O(D(n))$ and using an expected linear number of calls. All these orders of magnitude are optimal.

A variation of this model for unit-size packets was considered in [18]. Under this scenario, gossiping must take time at least linear in n, as every node must read the value of every other node. In [18], an almost safe nonadaptive gossiping algorithm working in time $O(n)$ was constructed for the large class of networks having spanning trees of bounded maximum degree (including, e.g., all Hamiltonian graphs). It is worth noting that the gossiping scheme in [18] was constructed in such a way that each node knows in advance the order in which it is going to get values of other nodes. Thus, it was not necessary to append node labels to their values during transmissions and, consequently, unit-size packets really meant packets containing a constant number of bits, in the case when values were of constant size. This should be compared to the scenario in [28], described in Subsection 4.1.

The relationships between almost safe broadcasting time and the number of links in the network were studied in [61] for the shouting communication mode and crash transmission faults. It was shown that the minimum time of almost safe broadcasting in networks with e(n) links is $T \in \Theta(n \log n / e(n))$.

Byzantine transmission faults were considered in [55]. In this case, the assumption $p < \tfrac{1}{2}$ is necessary; otherwise, no reliable communication can be guaranteed. The problem considered in [55] is that of the minimum time required for almost safe nonadaptive broadcasting in the n-node chain. It follows from [26] that this time is $O(n)$, if faults are of crash type. On the other hand, for Byzantine faults, a simple algorithm working in time $O(n \log n)$ can be constructed. Diks and Pelc (see Problem 29 in [34]) asked if there exists an almost safe broadcasting algorithm working in time $O(n)$ for Byzantine faults. The main result of [55] is the construction and analysis of such an algorithm.

5. FUTURE RESEARCH

This survey demonstrates that although an important body of research exists concerning fault-tolerant broadcasting and gossiping, our understanding of the relations between efficiency and fault tolerance of communication algorithms for these tasks is still fragmentary and incomplete. At least three groups of open problems can be specified on the basis of already obtained research results. Research directed toward their solution is likely to deepen the understanding of the domain and increase the potential applicability of theoretical results in practice.

The first group of problems concerns tightening the gaps between upper and lower bounds on the minimum time and/or the minimum number of calls in fault-tolerant broadcasting and gossiping. In some cases, these gaps are fairly small, and the remaining open problems are mostly of combinatorial interest. In other situations, however, even the exact order of magnitude of minimum time or minimum number of calls in fault-tolerant communication has not yet been established. Future developments may have a significant impact on the efficiency of actually implemented communication schemes.

The second direction for future research concerns investigating new communication and fault models arising as a consequence of emerging technologies. Although, as we have seen, a large number of hypotheses have already been considered, their possible combinations yielding a plethora of potential models, important challenges are the study of models that faithfully describe existing patterns of communication and the exploration of features likely to characterize networks built in the future.

Finally, it can be seen that most of the research done to date in the surveyed area concerns complete networks and hypercubes. Good topological properties of these networks, such as symmetry and high connectivity, enable them to support efficient and robust communication algorithms. However, especially in the case of complete networks, they are increasingly difficult and costly to build as the number of nodes grows. Hence, there is a need to explore fault-tolerant capabilities of other networks, in particular, sparser ones. Here, again, the types of networks actually used in practice provide important and challenging research targets.

This research was supported in part by NSERC Grant OGP 0008136.

REFERENCES

[1] R. Ahlswede, L. Gargano, H. S. Haroutunian, and L. H. Khachatrian, Fault-tolerant minimum broadcast networks. Networks 27 (1996) 293-307.
[2] A. Bagchi and S. L. Hakimi, Information dissemination in distributed systems with faulty units. IEEE Trans. Comp. 43 (1994) 698-710.
[3] F. Bao and Y. Igarashi, Reliable broadcasting in product networks with Byzantine faults. Proceedings of the 26th Annual International Symposium on Fault-Tolerant Computing (1996) 262-271.
[4] F. Bao, Y. Igarashi, and K. Katano, Broadcasting in hypercubes with randomly distributed Byzantine faults. Proceedings WDAG'95, LNCS 972, 215-229.
[5] M. Barborak, M. Malek, and A. Dahbura, The consensus problem in fault-tolerant computing. ACM Comput. Surv. 25 (1993) 171-220.
[6] P. Berman, K. Diks, and A. Pelc, Reliable broadcasting in logarithmic time with Byzantine link failures. J. Alg., to appear.
[7] K. A. Berman and M. Hawrylycz, Telephone problems with failures. SIAM J. Alg. Disc. Methods 7 (1986) 13-17.
[8] D. P. Bertsekas and J. N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, Englewood Cliffs, NJ (1989).
[9] D. Bienstock, Broadcasting with random faults. Disc. Appl. Math. 20 (1988) 1-7.
[10] S. Bitan and S. Zaks, Optimal linear broadcast. J. Alg. 14 (1993) 288-315.
[11] D. Blough and A. Pelc, Optimal communication in networks with randomly distributed Byzantine faults. Networks 23 (1993) 691-701.
[12] J. Bruck, On optimal broadcasting in faulty hypercubes. Disc. Appl. Math. 53 (1994) 3-13.
[20] B. S. Chlebus, K. Diks, and A. Pelc, Reliable broadcasting in hypercubes with random link and node failures. Combin. Prob. Comput., to appear.
[21] B. S. Chlebus, K. Diks, and A. Pelc, Broadcasting in synchronous networks with dynamic faults. Networks 27 (1996) 309-318.
[22] C. T. Chou and I. S. Gopal, Linear broadcast routing. J. Alg. 10 (1989) 490-517.
[23] K. Diks, A. Malinowski, and A. Pelc, Reliable token dispersal with random faults. Par. Proc. Lett. 4 (1994) 417-427.
[24] K. Diks, A. Malinowski, and A. Pelc, Token transfer in a faulty network. Theor. Inf. Appl. 29 (1995) 383-400.
[25] K. Diks and A. Pelc, Reliable gossip schemes with random link failures. Proceedings of the 28th Annual Allerton Conference on Communication, Control and Computing (Oct. 1990) 978-987.
[26] K. Diks and A. Pelc, Almost safe gossiping in bounded degree networks. SIAM J. Disc. Math. 5 (1992) 338-344.
[27] K. Diks and A. Pelc, Fault-tolerant linear broadcasting. Proceedings of the First Canada-France Conference on Parallel and Distributed Computing, Theory and Practice, Montreal, Canada, LNCS 805 (May 1994) 207-217.
[28] K. Diks and A. Pelc, Efficient gossiping by packets in networks with random faults. SIAM J. Disc. Math. 9 (1996) 7-18.
[29] A. Farley, Reliable minimum-time broadcast networks. Proceedings of the 18th SE Conference on Combinatorics, Graph Theory and Computing. Congress. Numer. 59 (1987) 37-48.
[30] U. Feige, D. Peleg, P. Raghavan, and E. Upfal, Randomized broadcast in networks. Random Struct. Alg. 1 (1990) 447-460.
G.Fox, M.Johnsson, G. Lyzenga, S . Otto, J. Salmon,
S. Carlsson, Y.Igarashi, K. Kanai, A. Lingas, K. Miura, and D. Walker, Solving Problems on Concurrent Proces-
and 0. Peterson, Information disseminating schemes for sors. Prentice-Hall, Englewood Cliffs, NJ, I (1988).
fault-tolerance in hypercubes. IEICE Trans. Fund. E75
(1992)255-260. P. Fraigniaud. Asymptotically optimal broadcasting and
gossiping in faulty hypercube multicomputers. IEEE
S . C. Chau. Fast and robust broadcasting in faulty hyper-
Trans. Comp. 41 (1992) 1410-1419.
cubes. Manuscript.
P. Fraigniaud and E. Lazard, Methods and problems of
S. C. Chau and A. L. Liestman, Constructing fault-toler-
communication in usual networks. Disc. Appl. Math. 53
ant minimal broadcast networks, J. Comb. In5 Sys. Sci. (1994)79-133.
10 (1986)1-18.
P. Fraigniaud, A. L. Liestman, and D. Sotteau, Eds.,
B. S. Chlebus, K. Diks, and A. Pelc, Optimal broadcast-
Open problems. Par. Proc. Lett. 3 (1993)507-524.
ing in faulty hypercubes. Digest of Papers, FTCS’21
(1991)266-273. P. Fraigniaud and C. Peyrat, Broadcasting in a hyper-
cube when some calls fail. Inf. Proc. Lett. 39 (1991)
B. S. Chlebus, K. Diks, and A. Pelc, Sparse networks
115-1 19.
supporting efficient reliable broadcasting. Proc.
ICALP’93, LNCS 700,388-397. A. Frieze and M. Molloy, Broadcasting in random
B. S . Chlebus. K. Diks, and A. Pelc, Fast gossiping with graphs. Disc. Appl. Math. 54 (1994)77-80.
short unreliable messages. Disc. Appl. Math. 53 (1994) L. Gargano, Tighter bounds on fault-tolerant broadcast-
15-24. ing and gossiping. Networks 22 (1992)469-486.
B. S. Chlebus. K. Diks, and A. Pelc, Waking up an L. Gargano, A. L. Liestman, J. Peters, and D. Richards,
anonymous faulty network from a single source. Pro- Reliable broadcasting. Discr. Appl. Math. 53 (1994)
ceedings of the 27th Annual Hawaii International Con- 135- 148.
ference on System Sciences 2 (Jan. 1994) 187-193. L 391 L. Gargano and A. A. Rescigno, Communication com-
156 PELC

plexity of fault-tolerant information diffusion. Proceed- Parallel Processing and Medium-Scale Multiprocessors
ings of the Fifrh IEEE Symposium on Parallel and Dis- (A. Wouk, Ed.). SIAM (1989) 108-156.
tributed Computing ( 1 993 ) . L. H. Khachatrian and H. S. Harutounian, On optimal
[40] L. Gargano, A. A. Rescigno, and U. Vaccaro, Fault- broadcast graphs. Proceedings of the Fourth Interna-
tolerant hypercube broadcasting via information dis- tional Colloquium on Coding Theory, Armenia ( 1990)
persal. Networks 23 ( 1993) 271 -282. 69-77.
[ 411 L. Gargano. A. A. Rescigno, and U. Vaccaro, Minimum L. KuEera, Broadcasting through a noisy one-dimen-
time broadcast in faulty star networks. Manuscript. sional network. Technical Report MPI-1-93- 106, Max-
[ 421 L. Gargano and U. Vaccaro, Minimum time networks
Planck-Institut fur Infonnatik ( 1993).
tolerating a logarithmic number of faults. SIAM J . Disc. A. L. Liestman, Fault-tolerant broadcast graphs. Net-
Math. 5 (1992) 178-198. works 15 (1985) 159-171.
[ 431 L. Gpieniec and A. Pelc, Broadcasting with a bounded G. Maddaluno, Algorithms for the construction of fault-
fraction of faulty nodes. Technical Report RR 95/01-1, tolerant networks (in Italian). Thesis, Universita di Sa-
Universitt du Qu6bec i Hull ( 1995 ) . lerno, 1987.
[ 441 L. Gpieniec and A. Pelc, Broadcasting with linearly S. Moran, Message complexity versus space complexity
bounded transmission faults. Technical Report RR 951 in fault tolerant broadcast protocols. Networks 19 ( 1989)
04-7, Universitt du Quebec i Hull ( 1995). 505 -5 19.
S. Ohring and D. H. Hohndel, Optimal fault tolerant
[ 4 5 ] L. Gpieniec and A. Pelc, Adaptive broadcasting with
faulty nodes. Par. Comput.. to appear. communication algorithms on product networks using
spanning trees. Proceedings of the 6th IEEE Symposium
[46] A. V. Gerbessiotis, Broadcasting in random graphs. on Parallel and Distributed Processing ( 1994) 188-
Discr. Appl. Math. 53 (1994) 149-170. 195.
[ 471 A. V. Gerbessiotis, Close-to-optimal and near-optimal A. Pelc. Broadcasting in complete networks with faulty
broadcasting in random graphs. Disc. Appl. Math. 63 nodes using unreliable calls. I n f . Proc. Lett. 40 ( 1991 )
(1995) 129-150. 169-174.
[ 481 R. W. Haddad, S. Roy, and A. A. Schaffer. On gossiping A. Pelc, Broadcasting time in sparse networks with faulty
with faulty telephone lines. SIAM J. Alg. Disc. Methods transmissions. Par. Proc. Lett. 2 (1992) 355-361.
8 (1987) 439-445. A. Pelc, Fast fault-tolerant broadcasting and gossiping.
[ 491 Y.Han, Y.Igarashi, K. Kanai, and K. Miura, Broadcast- Proceedings of the 2nd Colloquium on Structural Infor-
ing in faulty binary jumping networks. J. Par. Dist. Com- mation and Communication Complexity, SIROCC0'95,
put. 23 (1994) 462-467. Greece (June 1995) 159-172.
[ 501 S. M. Hedetniemi, S. T. Hedetniemi. and A. L. Liestman, D. Peleg. A note on optimal time broadcast in faulty
A survey of gossiping and broadcasting in communica- hypercubes. J. Par. Dist. Comp. 26 (1995) 132-135.
tion networks. Networks 18 (1988) 319-349. D. Peleg and A. A. Schaffer, Time bounds on fault-
[ 511 J. HromkoviE, R. Klasing, B. Monien, and R. Peine, tolerant broadcasting. Neiworks 19 ( 1989) 803-822.
Dissemination of information in interconnection net- P. Ramanathan and K. G. Shin, Reliable broadcast in
works (broadcasting and gossiping). CombinarorialNer- hypercube multicomputers. IEEE Trans. Comp. 37
work Theory. (F. Hsu and D.-Z. Du, Eds.). Science (1988) 1654-1657.
Press & A M S , to appear. E. R. Scheinerman and J. C. Wierman, Optimal and
[ 521 Y. Igarashi, K. Kanai, K. Miura, and S.Osawa, Optimal near-optimal broadcast in random graphs. Disc. Appl.
schemes for disseminating information and their fault- Math. 25 (1989) 289-297.
tolerance. IEICE Trans. Inf: Syst. E75 ( 1992) 22-29.
[ 531 S. L. Johnsson and C. T. Ho, Matrix multiplication on Received November 22, 1995
Boolean cubes using generic communication primitives. Accepted March 27, 1996
